I want IPLD Strings to be valid UTF-8 only. This document provides counterarguments to @warpfork's document arguing for IPLD Strings as sequences of 8-bit bytes. Where I take parts from @warpfork's document without changes, I prefix them with [verbatim].
FFI interfaces are concerned with interchange and at speed, the same choices are often correct for both.
I don't agree. FFI interfaces are concerned with making things work on the lowest common denominator, which today means C types. Speed comes with that implicitly.
If you want lossless interchange for arbitrary 8-bit sequences, then use the Bytes kind.
[verbatim] Simpler definitions are preferable.
For example, the definition "strings are always valid UTF-8" is significantly simpler than "strings are a sequence of 8-bit bytes", because modern programming languages have good support for UTF-8. As a developer you don't even have to think about the details; you can just use strings as you are used to.
The difference between Bytes and Strings is also easy to explain. If Strings were a sequence of 8-bit bytes, they would be the same thing as Bytes, except that Strings should be valid UTF-8 but wouldn't have to be.
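The distinction above can be illustrated with a small sketch, assuming Strings are required to be valid UTF-8: arbitrary 8-bit data belongs in the Bytes kind, and only data that decodes cleanly qualifies as a String.

```python
# Sketch of the Bytes/Strings distinction under the "valid UTF-8 only" rule.

raw = b"\xff\xfe"              # arbitrary 8-bit data: fine as the Bytes kind
try:
    raw.decode("utf-8")        # ...but it is not a valid String
except UnicodeDecodeError:
    print("not valid UTF-8")

text = "héllo"                 # a String: round-trips losslessly as UTF-8
assert text.encode("utf-8").decode("utf-8") == text
```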
I fully agree with that paragraph. Now imagine if all those things, like pathing, didn't need any special support for arbitrary bytes, but just worked with valid UTF-8 Strings.
[verbatim] Validating string encoding on every serialization boundary has the potential to produce noticeable computational overhead.
Some definitions avoid this entirely -- for example, the definition of "strings are always valid UTF-8" is generally "free" in modern programming languages. Other definitions are not so lucky -- for example, validating that a string contains only UTF-8 sequences requires a scan which examines the entire string; inspecting it for normalization requires even more operations.
[verbatim] If we make specification choices which are not conducive to performance, we should have excellent reasons for doing so. We should also be prepared to have clear definitions and understandings of what ecosystemic behavior will emerge if there are partial implementations of IPLD specs which eschew the implementing of rules which have performance costs; it's almost guaranteed that in a large and growing ecosystem, some implementations will do this, so it's important to plan and design accordingly.
It is useful if strings can contain only valid UTF-8, because then the most common cases just work, while special cases are still possible.
Examples of special cases users may wish to use as map keys include:
- [verbatim] strings (in non-UTF-8 encodings),
- [verbatim] binary keys,
- [verbatim] or even (for example) custom "big int" encodings.
In those cases it would be up to the application layer to make sure that those keys have a lossless representation as valid UTF-8, e.g. as UTF8-C8 or Base64.
It also means that we have guaranteed renderability of map keys.
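The application-layer encoding mentioned above can be sketched as follows. The helper names are illustrative, not part of any IPLD API; they use Base64 to give arbitrary byte keys a lossless, valid-UTF-8 (in fact plain-ASCII) representation.

```python
import base64

# Hypothetical application-layer helpers: map arbitrary bytes to a valid
# UTF-8 string key and back, losslessly.

def encode_key(raw: bytes) -> str:
    # Base64 output is plain ASCII, hence always valid UTF-8
    return base64.urlsafe_b64encode(raw).decode("ascii")

def decode_key(key: str) -> bytes:
    return base64.urlsafe_b64decode(key.encode("ascii"))

binary_key = b"\x00\xff\x10"
key = encode_key(binary_key)
assert decode_key(key) == binary_key      # lossless round trip
assert key.isascii()                      # renderable map key
```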
In this section I just respond to every bullet point.
- Using any domain restriction on strings results in reduced interoperability, which is against our fundamental goals.
It increases interoperability, as things just work with all string implementations that also require valid UTF-8, which is the case in most modern programming languages.
- Mandating UTF-8 strictly is neither necessary nor sufficient for canonicalization goals (further specification, such as NFC canonicalization, would be necessary -- and has also previously been rejected as being clearly not the job of IPLD due to running wildly contrary to goals of simplicity and friendliness to partial implementation).
True
- Verifying any domain restriction on strings implies performance costs: verifying a string encoding is necessarily at least an O(n) cost (i.e., linear in the length of the string). Such a cost is extremely nontrivial in the context of applications handling the exchange of large volumes of data.
No. If you guarantee that a string is valid UTF-8, you don't need to check it every time you encounter it. With sequences of 8-bit bytes, by contrast, you would need to check whether the data is valid UTF-8 every single time you have that requirement.
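The "validate once at the boundary" argument can be sketched as follows: the O(n) scan happens a single time at deserialization, and afterwards the value lives as a native string type that carries the validity guarantee for free.

```python
# Sketch: the UTF-8 check happens once, at the serialization boundary.

def deserialize_string(wire_bytes: bytes) -> str:
    # one O(n) scan here; raises if the data is not valid UTF-8
    return wire_bytes.decode("utf-8")

s = deserialize_string(b"valid utf-8 \xe2\x9c\x93")
# every later use of `s` needs no re-validation
assert s.endswith("✓")
```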
- It is likely that in practice many implementations would ignore these requirements in exchange for speed, regardless of what we specify. In acknowledgement of this, we would simply rather avoid specifying contracts we know the ecosystem in practice will not keep.
This is not applicable given my comment above.
- Though UTF-8 has become common, it would be indisputably wrong to say it is the only character encoding in use today. See Objection 1.
True. For those uncommon cases you can use the Bytes kind, which is a sequence of 8-bit bytes.
- Even if UTF-8 were the only character encoding in use in the world, we expected IPLD to be used together with other systems that do not enforce strict checks (see Objection 3), and it is not desirable for IPLD systems to become lossy in comparison to those systems when used together. See Objection 1.
For interoperability with such systems, you can always use the Bytes kind.
- Restriction of strings to UTF-8 reduces the viability of our maps. See Rationale: Container, as well as the other rejected alternatives regarding how we deal with maps.
See my version of "Rationale: Container".
- Restricting strings to UTF-8 does not automagically solve rendering concerns; visually similar characters are still possible; visually identical but byte-wise distinct characters are possible (unless normalization forms are applied); unrenderable characters are still possible; grapheme combinators are still a whole ball of wax; etc.
This depends on what the goal of the rendering concerns is. Printing would work without errors on anything that supports UTF-8 properly.
- Restricting strings to UTF-8 simply is not necessary.
- Sorting can (and should) be defined on the byte-wise representation of strings, even when they are UTF-8.
- Equality can (and should) be defined on the byte-wise representation of strings, even when they are UTF-8.
- This pattern holds for essentially all questions one can ask of strings and how to compare them.
It is necessary for better usability and interoperability with modern programming languages.
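Note that the byte-wise sorting and equality points above remain fully compatible with a UTF-8-only definition: UTF-8 is designed so that byte-wise order equals code-point order. A small check:

```python
# Byte-wise order of UTF-8 encodings matches code-point order of the strings.

words = ["éclair", "zebra", "apple"]
by_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
by_code_points = sorted(words)  # Python compares str by code point
assert by_bytes == by_code_points == ["apple", "zebra", "éclair"]
```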
- To define maps as keyed by bytes provides no advantage over defining maps as keyed by strings which are defined as sequences of 8-bit bytes.
Some popular dynamic languages have problems natively supporting map keys that are not valid UTF-8. This alone can be a reason not to have them. With strings defined as sequences of 8-bit bytes, we have exactly that problem.
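One concrete instance of this problem (using JSON, the codec most dynamic languages reach for, as an illustration): raw byte keys cannot cross the boundary at all without an application-level encoding.

```python
import json

# JSON map keys must be strings; a raw bytes key is rejected outright.
try:
    json.dumps({b"\xff": 1})
except TypeError:
    print("bytes keys are not serializable as JSON map keys")
```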
- Defining map keys as bytes instead of strings does nothing to address any of the other reasons strings need to have a consistent and inclusive domain:
- Strings as applied in PathSegment would still need to be defined as sequences of 8-bit bytes -- they would need to be able to encode all map keys.
Exactly. When Strings are valid UTF-8, you won't need support for sequences of 8-bit bytes.
This is what happens when Strings are defined as sequences of 8-bit bytes; it's just implicit instead of explicit. Strings would then be Bytes under a different name, but without any additional guarantees.