warpfork/ipld-strings.md

Strings in IPLD

"Strings" are a familiar concept in how programs handle digital data.
However, for all their familiarity, they are also subtle.
Solid systems require clear specifications.
In this document, we will describe Strings in IPLD, where they appear, and what specifications we make about their domain.
Prerequisites

This document will refer heavily to the "domain" of strings: the set of values that a string is permitted to take.
In general, you will find that the specification is designed to be averse to variations in the size of that domain across the places where strings appear.
Appearances of Strings

Logical appearances of Strings

Strings appear in three logical places:

in scalar values in the Data Model;
in map keys in the Data Model;
and in Path Segments that describe how we traverse some data.

Note that whether these are distinct is a matter of perspective.
Through most of this document, we will intentionally conflate them, because
it is not desirable for them to be distinct.
Implementational appearances of Strings

In implementing IPLD libraries, strings appear in many places:

Strings as specified in the Data Model

Strings as scalar values
Strings as map keys


Strings as applied in PathSegments
Strings as implemented in an IPLD library
Strings as handled by Codecs
Strings where they appear in Schemas

Strings as type names
Strings as field names in structs
Strings in representation details (such as discriminants, rename directives, etc)


Specifying Strings in IPLD

For an IPLD library to have Complete support for the IPLD Data Model, Strings MUST support the full range of 8-bit bytes.  Strings SHOULD be encouraged to be UTF-8, but this MUST NOT result in the inability of a library to handle non-UTF-8 byte sequences where Strings are handled.
Libraries MAY choose to support only some domains of strings, such as only allowing Unicode characters, or only allowing UTF-8, or only allowing UTF-8 with NFC normalization.  However, such libraries are considered "Incomplete" / "Limited Domain" libraries.
Similarly, a Codec MAY limit its support to only some domains of strings, but this is known as an Incomplete Codec (more specifically, "incomplete(stringmangling)" -- see Codecs and Completeness).
Strings are sequences of bytes.

Strings are understood to be composed of sequences of 8-bit bytes.
The full range of 8-bit bytes must be possible to handle using IPLD libraries.
See the Design Rationale section for reasons behind this specification.
Character encodings are recommended by IPLD (but not enforced)

We recommend that users of IPLD choose UTF-8 as their string encoding,
and we further recommend that UTF-8 strings be canonicalized using NFC canonicalization.
However, IPLD libraries will not enforce this, and any kind of bytes may be used as strings.
IPLD libraries will typically presume UTF-8 encoding in any situations where it is relevant,
but at the same time must still support operating on data that may contain non-UTF-8 sequences.
Since applications and libraries typically presume UTF-8 encodings,
any applications that create IPLD data are strongly advised to produce UTF-8 encoded strings.
If they do not do so (or intentionally treat raw non-textual bytes as strings),
conformant IPLD libraries will handle their data losslessly (per the sequence of 8-bit bytes rule),
but the experience for users of these applications will be greatly degraded
since many parts of the IPLD ecosystem may render non-UTF-8 strings unappealingly.
8-bit clean handling is a superset of Unicode

It is important to note that not all byte sequences are valid in all character encodings.
One interesting sequence that exemplifies this is "\xc3\x21".
This sequence is not valid UTF-8.
Accordingly it can't be represented by \u-style escaping -- only by hex or other such general escaping sequences.
While we've written both bytes as hex-escaped here for clarity, the second byte should come across as a regular exclamation point:
this is the case no matter how one looks at it, because unicode (or at least certainly UTF-8) is supposed to be self-synchronizing.
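
A quick way to see this for yourself is the following Python sketch. It relies only on the standard `bytes.decode` method; the variable names are ours and purely illustrative.

```python
seq = b"\xc3\x21"

# 0xC3 opens a two-byte UTF-8 sequence, but 0x21 is not a continuation byte,
# so strict decoding fails.
try:
    seq.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

# The second byte is still a plain ASCII exclamation point.
second = seq[1:]

# Lenient decoding substitutes U+FFFD for the bad byte and resynchronizes at "!".
lenient = seq.decode("utf-8", errors="replace")
```

Note that the lenient decode is lossy: the original 0xC3 byte cannot be recovered from U+FFFD, which is why an 8-bit clean library keeps `seq` itself rather than a decoded form.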
Null bytes are part of 8-bit clean

"0b00000000" is a valid member of the set of 8-bit bytes, and no more special than any other 8-bit byte.
C-family languages take note: you'd better be using functions that take explicit lengths rather than relying on null-terminators
(e.g. the "mem*" family; note that the "strn*" functions accept a length but still stop at the first null byte) when handling IPLD data.
(But then, you should be using those anyway, at all times in all cases, for a million other reasons anyway... right?)
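
To make the hazard concrete, here is a small sketch in Python. The name `c_view` is hypothetical and merely emulates what a null-terminated reader would see.

```python
key = b"abc\x00def"

# A length-carrying representation sees all seven bytes, zero byte included.
length = len(key)

# A reader that stops at the first zero byte silently truncates the string.
c_view = key.split(b"\x00", 1)[0]
```

Two distinct IPLD strings such as `b"abc\x00def"` and `b"abc\x00xyz"` would collapse to the same `c_view`, which is exactly the kind of loss the 8-bit clean rule forbids.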
The recommended character encoding is UTF-8, NFC normalized.

UTF-8 is the de facto winner of the character encoding popularity wars.
We see no reason to fight this overwhelmingly clear tide.
NFC normalization is one of the well-standardized normalization forms for Unicode text.
It is carefully shepherded to be stable and backwards compatible across unicode versions,
and widely recognized as a standard.
Of the well-defined Unicode normalization forms, NFC is typically the most compact,
the most commonly occurring in the wild, and generally the most suitable to our needs.
You can read more about unicode normalization in the specs here:
https://www.unicode.org/reports/tr15/#Norm_Forms
(In particular, there are some excellent tables and example figures in the linked chapter.)
The W3C Character Model for the World Wide Web 1.0: Normalization
and other W3C Specifications such as XML 1.0 5th Edition recommend using
NFC normalization for all content (this being discussed in the unicode documentation above),
further adding weight to the observation that NFC is an uncontroversial and reasonable choice.
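
As a small illustration of what NFC does (a sketch using Python's standard `unicodedata` module; the variable names are ours), consider the letter "é", which can be written either as one precomposed codepoint or as "e" followed by a combining accent:

```python
import unicodedata

decomposed = "e\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # the single codepoint U+00E9

# The two forms render identically but are different byte sequences in UTF-8.
nfc_bytes = composed.encode("utf-8")
nfd_bytes = unicodedata.normalize("NFD", composed).encode("utf-8")
```

Here the NFC form occupies two bytes and the NFD form three, which is one small instance of NFC tending to be the more compact choice.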
Libraries which do not support 8-bit clean strings are not IPLD spec compliant

Though we have a recommended character encoding (think: "SHOULD", in RFC 2119 parlance),
the rule of "8-bit clean handling is required" entirely supersedes character encoding recommendations.
Fully correct and spec-compliant IPLD libraries MUST handle strings
faithfully and losslessly even if they contain non-normalized, or even non-UTF-8, content.
It is not correct for an IPLD library to apply encoding limitations
nor apply normalization mutations to a string unless it is done at the user's explicit direction.
Codecs which do not support 8-bit clean strings are lossy

Any codec that cannot encode every value from the full
range of values described by the IPLD Data Model is lossy;
any codec that cannot produce every value from the full
range of values described by the IPLD Data Model during decoding is limited.
Since the IPLD Data Model defines strings as 8-bit clean,
any codec which cannot encode all such strings uniquely is therefore lossy,
and any codec which cannot produce the full range of 8-bit clean strings
during decoding is limited.
Escaping mechanisms are part of how many codecs support wide ranges of strings.
Such escaping mechanisms must be defined for all bytes, not just those which have unicode codepoints.
There's nothing wrong with lossy and/or limited codecs -- however,
it is important that documentation of those codecs should identify them as such.
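
To show what "defined for all bytes" can look like, here is a minimal sketch of a hex-style escaping scheme in Python. The `escape` and `unescape` functions and the `\xNN` syntax are our own illustrative choices, not the escaping rules of any particular IPLD codec.

```python
import string

HEX = set(string.hexdigits)

def escape(s: bytes) -> str:
    """Render a byte string as printable ASCII; every byte value has a defined form."""
    out = []
    for b in s:
        if 0x20 <= b < 0x7F and b != 0x5C:  # printable ASCII, except backslash
            out.append(chr(b))
        else:
            out.append(f"\\x{b:02x}")  # hex escape works for any byte, not just codepoints
    return "".join(out)

def unescape(t: str) -> bytes:
    """Invert escape(); raises ValueError on malformed input."""
    out = bytearray()
    i = 0
    while i < len(t):
        ch = t[i]
        if ch == "\\":
            digits = t[i + 2:i + 4]
            if t[i + 1:i + 2] != "x" or len(digits) != 2 or not set(digits) <= HEX:
                raise ValueError(f"malformed escape at offset {i}")
            out.append(int(digits, 16))
            i += 4
        elif 0x20 <= ord(ch) < 0x7F:
            out.append(ord(ch))
            i += 1
        else:
            raise ValueError(f"unescaped character at offset {i}")
    return bytes(out)
```

Because the fallback branch is keyed on the byte value rather than on a Unicode codepoint, a sequence such as `b"\xc3!"` survives a round trip unchanged, which a \u-style scheme alone could not guarantee.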
Strings in some areas of Schemas have additional constraints

Where strings appear in IPLD Schemas, they sometimes carry additional constraints.
For example, type names in IPLD Schemas cannot contain spaces;
neither can field names.
These restrictions are not interesting for the purpose of understanding strings in the IPLD Data Model at large, however.
Where these properties do intersect with the understanding of strings in the IPLD Data Model,
the greater strictness in the strings which will match to the Schema is clearly identified, and the process is one which can raise errors.
For example, while field names are stricter than IPLD Data Model strings,
this domain shrinkage is handled during the does-the-data-match-the-schema matching process,
which is by definition and intention a domain-shrinking process, and can already emit rejections for many other reasons.
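
A minimal sketch of that idea in Python follows. `SchemaMismatch` and `match_field_name` are hypothetical names, and the only constraint modelled is the no-spaces rule mentioned above; real Schema matching involves considerably more.

```python
class SchemaMismatch(Exception):
    """Raised when data does not match the schema (hypothetical error type)."""

def match_field_name(key: bytes) -> bytes:
    """Accept a Data Model map key as a struct field name, or reject it."""
    if b" " in key:
        # The Data Model itself is happy with this key; only the Schema is stricter.
        raise SchemaMismatch(f"field names cannot contain spaces: {key!r}")
    return key
```

The key point is where the error arises: `b"first name"` is a perfectly good Data Model string, and is only rejected at the moment it is matched against a Schema.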
Appendix: Library Recommendations

IPLD libraries should determine one string implementation type, and use it for all appearances of strings in the core interfaces of the library.
Doing so will result in the simplest and most understandable interfaces, and will be likely to result in the fewest "cast"-like operations being foisted onto users.
In particular, we recommend that the type used for scalar string values and for map keys be the same type;
since the definitions of the IPLD Data Model state that these have the same domains, it is only reasonable for them to use the same types.
If the language and standard library environment an IPLD library is built within has a "native" string type that includes any kind of Unicode strictness, we recommend that the IPLD library have a Node type with an AsString method which returns either that "native" string type or an error when the data contains bytes not allowed by that string type.  The library should also include another method that accesses the full byte sequence directly (eschewing the "native" string type, since doing so is unavoidable).
See also: Appendix: Relationship to FFI.
If implementing an IPLD library in a new language, looking at how that language handles FFI
is likely to provide some good detailed information about how that language can handle strings.
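
As a sketch of the recommendation above, assuming Python (whose native `str` type is Unicode-strict), a string node might look like the following. The class and its method names are hypothetical, with `as_string` playing the role of the AsString method described earlier.

```python
class Node:
    """Hypothetical string node that stores its content as raw bytes."""

    def __init__(self, raw: bytes):
        self._raw = bytes(raw)

    def as_string(self) -> str:
        """Return the native str, or raise ValueError if the bytes are not valid UTF-8.

        UnicodeDecodeError is a subclass of ValueError, so it propagates as one.
        """
        return self._raw.decode("utf-8")

    def as_bytes(self) -> bytes:
        """Return the full byte sequence; this never fails and never alters the data."""
        return self._raw
```

Callers who only care about text use `as_string` and handle the error; callers who must be lossless (codecs, hashing, map key comparison) use `as_bytes`.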
Appendix: Relationship to FFI

In general, taking a cue from how a language handles strings in FFI ("Foreign Function Interfaces")
is very likely to be the right answer for how to handle Strings for IPLD.
Since IPLD is concerned with interchange at speed, and FFI interfaces are likewise concerned with interchange at speed,
the same choices are often correct for both.
For example:

In Rust, std::ffi::OsString is suitable.  On Unix-like platforms, it contains any byte sequence faithfully.
In Golang, string is suitable.  It contains any byte sequence faithfully.
In C, a pair of (char *, length) is suitable (e.g. the mem* family of functions operates with this convention).  It contains any byte sequence faithfully.
In Python, byte strings are suitable.  (See [cffi.unpack](https://cffi.readthedocs.io/en/latest/ref.html#ffi-string-ffi-unpack) for an example of functions that produce these.)  They contain any byte sequence faithfully.

Design Rationale

Rationale: Lossless Interchange

IPLD is fundamentally a set of data interchange specifications.
It is deeply important that IPLD programs be able to exchange data without loss.
Any situation where we limit the domain of strings requires a clear specification of what we do when processing data that contains elements outside that domain.
Ideally, the IPLD specifications should be aligned by design with domains that are the most inclusive we are likely to encounter.
What this means is subject to some degree of interpretation!
(In essence, we're looking for Schelling points in how systems handle strings.)
Perhaps the best way to explore this is by asking "If we only support X, what will we say to requests for support of Y?",
and then weighing if the question seems concerning or not.
For example: if we say that we support only "UTF-8 sequences", do we worry about requests to support the wider range of "8-bit sequences"?
Yes -- quite a bit; we have users and projects regularly requesting support for this;
and some of our guiding star usecases (for example, the aim to support the description of filesystems -- wherein filenames are simply bytes, in the understanding of unix-y OSes) hinge directly on this.
Or even simply "UTF-16" sequences?  Or "JIS X 0208" encodings?
Both of these questions are already answered if we support "8-bit sequences" -- but are not answered if we support only "UTF-8".
Now take another example: if we say that we support "8-bit sequences", do we worry about requests to support "9-bit sequences"?
Perhaps this concern might be not so sharp.
Based on this series of questions, we might reasonably conclude that 8-bit sequences are a reasonable definition for lossless interchange in practice;
other definitions would be much less likely to support the kinds of interchange we expect to be able to accommodate.
Rationale: Simplicity

Simpler definitions are preferable.
For example, the definition of "strings are a sequence of 8-bit bytes" is significantly simpler than "strings are UTF-8",
because defining UTF-8 requires additional tables of various cases.
"Strings are UTF-8" would also be a significantly simpler definition than "strings are UTF-8 NFC-normalized",
because defining NFC-normalization requires yet more tables of yet more cases.
Rationale: Consistency

It's highly preferable that strings in the various places they appear --
namely, in the data model as values, in the data model as map keys, and in path segments as descriptions of how we navigate data --
should use exactly the same definition of "string".
Clearly, a PathSegment must describe at least the same domain of data as a map key;
otherwise, our pathing would simply be impossible or unspecified over some data, which would be an unacceptable failure.
For a map key and a string value to have different domains of data would be odd; there's little imaginable value to this.
(Imagine on the other hand the frustration of someone who makes a data structure where values in one map are intended to correspond to keys in another map, and then discovers this is impossible to do safely!)
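
The scenario in that parenthetical can be sketched as follows, using plain Python dicts keyed by `bytes` as a stand-in for Data Model maps. The `traverse` helper and the sample data are hypothetical.

```python
def traverse(node, path):
    """Follow a sequence of path segments through nested maps."""
    for segment in path:
        node = node[segment]  # a segment is looked up exactly as a map key
    return node

# A string *value* in one map...
aliases = {b"current": b"\xc3!"}
# ...is used as a *key* in another, and as a path segment to reach it.
records = {b"\xc3!": {b"size": 42}}

found = traverse(records, [aliases[b"current"], b"size"])
```

This only works because values, keys, and path segments all share one domain; if any one of the three were restricted to UTF-8, the lookup for `b"\xc3!"` would have no valid spelling.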
Rationale: Performance

Validating string encoding on every serialization boundary has the potential to produce noticeable computational overhead.
Some definitions avoid this entirely -- for example, the definition of "strings are sequences of 8-bit bytes" is generally "free" on any contemporary computation platform on planet Earth.
Other definitions are not so lucky -- for example, validating that a string contains only UTF-8 sequences requires a scan which examines the entire string; inspecting it for normalization requires even more operations.
If we make specification choices which are not conducive to performance, we should have excellent reasons for doing so.
We should also be prepared to have clear definitions and understandings of what ecosystemic behavior will emerge if there are partial implementations of IPLD specs which eschew the implementing of rules which have performance costs;
it's almost guaranteed that in a large and growing ecosystem, some implementations will do this, so it's important to plan and design accordingly.
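
The difference in cost between these definitions can be sketched as three acceptance checks. This is illustrative Python only; the function names are ours, and a production library would validate without building an intermediate `str`.

```python
import unicodedata

def accept_bytes(s: bytes) -> bool:
    """'Strings are sequences of 8-bit bytes': nothing to inspect."""
    return True

def accept_utf8(s: bytes) -> bool:
    """'Strings are UTF-8': every byte must be examined."""
    try:
        s.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

def accept_utf8_nfc(s: bytes) -> bool:
    """'Strings are UTF-8, NFC-normalized': decode, then normalize and compare."""
    try:
        text = s.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return unicodedata.normalize("NFC", text) == text
```

Note that an invalid byte at the very end of a long string, as in `b"a" * 1000 + b"\xff"`, cannot be detected without first examining everything before it, so the check cannot be short-circuited for well-formed prefixes.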
Rationale: Expedience yet Non-militancy

It is useful for IPLD to have a concept of strings, and carry with it an implication that a string is meant to be rendered as human-readable text.
Simultaneously, many systems that exist in the world produce data which are regarded as strings, but may do so in various encodings.
It is useful for IPLD to suggest some default interpretation of strings and their encodings (e.g., UTF-8), and suggest that libraries support handling strings accordingly.
It is not ergonomic nor useful for IPLD to insist that all strings are any specific encoding;
it is not viable for IPLD to refuse to handle data that considers itself a string but does not have our recommended default interpretation of encoding;
and while it might be technically possible for IPLD to insist that all such data be treated as bytes (which are presumed non-renderable),
it remains not ergonomic nor useful to do so: to insist this would drive an unnecessarily militant wedge between the intentions of our users and the actual practice of using our specs and libraries.
Rationale: Containers

It is useful if strings can contain arbitrary byte sequences,
because that means users can use their choice of data as map keys, and without problems.
Examples of things users may wish to use as map keys include:

strings (in non-UTF-8 encodings),
binary keys,
or even (for example) custom "big int" encodings.

This is a double-edged sword: it also does mean we've explicitly chosen to give up guaranteed renderability of map keys.
However, see the rejected alternative for maps keyed by bytes,
the rejected alternative for multiple map kinds,
and the appendix clarifying IPLD's non-relationship to rendering.
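
To illustrate the "big int" case from the list above, here is one possible convention, sketched in Python. The fixed-width big-endian encoding and the function names are our assumptions; IPLD does not prescribe any particular encoding for such keys.

```python
def int_to_key(n: int, width: int = 16) -> bytes:
    """Encode a non-negative integer as fixed-width big-endian bytes (16 bytes = 128 bits)."""
    return n.to_bytes(width, "big")

def key_to_int(key: bytes) -> int:
    """Invert int_to_key()."""
    return int.from_bytes(key, "big")

# A map keyed by 128-bit numbers; most such keys are not valid UTF-8.
balances = {int_to_key(2**127 + 5): b"alice", int_to_key(7): b"bob"}
```

Fixed-width big-endian is chosen here so that byte-wise ordering of the keys agrees with numeric ordering; a map restricted to UTF-8 keys could not hold these keys without an additional text encoding layer.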
Appendix: Rejected Alternatives

Rejected: Strings as UTF8-only


Using any domain restriction on strings results in reduced interoperability, which is against our fundamental goals.
Mandating UTF-8 strictly is neither necessary nor sufficient for canonicalization goals (further specification, such as NFC canonicalization, would be necessary -- and has also previously been rejected as being clearly not the job of IPLD due to running wildly contrary to goals of simplicity and friendliness to partial implementation).
Verifying any domain restriction on strings implies performance costs: verifying a string encoding is necessarily at least an O(n) cost (i.e., linear in the length of the string).  Such a cost is extremely nontrivial in the context of applications handling the exchange of large volumes of data.

It is likely that in practice many implementations would ignore these requirements in exchange for speed, regardless of what we specify.  In acknowledgement of this, we would simply rather avoid specifying contracts we know the ecosystem in practice will not keep.


Though UTF-8 has become common, it would be indisputably wrong to say it is the only character encoding in use today.  See Objection 1.
Even if UTF-8 were the only character encoding in use in the world, we expect IPLD to be used together with other systems that do not enforce strict checks (see Objection 3), and it is not desirable for IPLD systems to become lossy in comparison to those systems when used together.  See Objection 1.
Restriction of strings to UTF-8 reduces the viability of our maps.  See Rationale: Containers, as well as the other rejected alternatives regarding how we deal with maps.
Restricting strings to UTF-8 does not automagically solve rendering concerns; visually similar characters are still possible; visually identical but byte-wise distinct characters are possible (unless normalization forms are applied); unrenderable characters are still possible; grapheme combinators are still a whole ball of wax; etc.
Restricting strings to UTF-8 simply is not necessary.

Sorting can (and should) be defined on the byte-wise representation of strings, even when they are UTF-8.
Equality can (and should) be defined on the byte-wise representation of strings, even when they are UTF-8.
This pattern holds for essentially all questions one can ask of strings and how to compare them.
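
A short sketch of byte-wise equality and sorting, in Python (the sample keys are ours):

```python
composed = "\u00e9".encode("utf-8")     # "é" as one codepoint: b"\xc3\xa9"
decomposed = "e\u0301".encode("utf-8")  # "e" + combining accent: b"e\xcc\x81"

# Byte-wise equality: these render alike but are different strings.
same = composed == decomposed

# Byte-wise sorting: well defined for every input, including non-UTF-8 keys.
keys = [b"b", composed, b"a", decomposed, b"\xff"]
ordered = sorted(keys)
```

Neither operation needs to know or care whether its inputs are valid UTF-8, so both remain total and cheap over the full 8-bit clean domain.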


Rejected: Maps as keyed by Bytes


To define maps as keyed by bytes, versus defining maps as keyed by strings which are defined as sequences of 8-bit bytes, is to use two different sets of words to describe the same outcome.

It's a pedagogical distinction, in other words.  They differ only in how we talk about them, and what our terminology suggests to the casual reader.

Given this, we should prefer the least-daunting terminology, and prefer emphasizing the suggested uses first -- which means we should prefer to talk about strings first, and then subsequently, only when more detail is required, discuss the detail that strings can be any byte sequence.
Talking about map keys primarily as strings encourages users to work primarily in strings (which we prefer, and would like to encourage).


There is no advantage to describing map keys as bytes vs defining map keys as strings which are defined as sequences of 8-bit bytes.


Defining map keys as bytes instead of strings does nothing to address any of the other reasons strings need to have a consistent and inclusive domain:

Strings as applied in PathSegment would still need to be defined as sequences of 8-bit bytes -- they would need to be able to encode all map keys.
Strings as scalar values in the Data Model would still face all of the same concerns around lossless interchange.


Several of the codecs that inspire IPLD the most, as well as the practical usage in the wild of some of the codecs most represented in IPLD usage to date, use strings as map keys.

It would be unnecessarily odd to redefine our concept of the data model to say things are bytes instead of strings, and be deeply confusing to those looking at these codecs and the data already in circulation using string indicators in map keys.
By contrast, defining map keys as strings, while elucidating the definition of strings to include 8-bit sequences of bytes, does not provoke this confusing friction.


Rejected: Multiple map kinds

One design choice which would support non-string map keys (and thus impact the string situation overall) would be to add new kinds to the Data Model Kinds enumeration:
for example, map_with_string_keys instead of just map, and then also add map_with_bytes_keys and map_with_int_keys.

This would be possible, but is simply not the path IPLD has taken.

It would be necessary to distinguish those different kinds of maps in IPLD codecs, and these features are not present (and this distinction would not be clearly possible even if the feature were present) in many of the codecs we took as key inspiration when defining the IPLD Data Model.


We would have great difficulty mapping data with mixed key kinds onto many codecs that we would like to consider near-totally-bijective to the IPLD Data Model.

Defining the Data Model in a way that creates such conflicts is not desirable.


Adding more Kinds to the enumeration that's at the heart of "What is the Data Model?" is simply not parsimonious, and has not been received as pleasing in any discussion we've had.
It is worth noting that even map_with_int_keys might not solve all issues where end-users want to use numeric keys in maps: namely, "big" integers would still require additional consideration.

The IPLD specification for numbers requires that integers up to 2^53 be supported in order to be minimally IPLD compliant.
Suppose someone wants to use a 128-bit number as a map key: they will necessarily need to figure out how to encode that as a string or byte sequence anyway.


Defining multiple kinds of maps with variations in key kinds does nothing to address any of the other reasons strings need to have a consistent and inclusive domain:

Strings as applied in PathSegment would still need to be defined as sequences of 8-bit bytes -- they would need to be able to encode all map keys (including those that are bytes).
Strings as scalar values in the Data Model would still face all of the same concerns around lossless interchange.


Rejected: Maps with mixed key kinds


This would be possible, but is simply not the path IPLD has taken.

Maps with multiple key kinds are not distinguishably expressible in many of the codecs we took as key inspiration when defining the IPLD Data Model.


We would have great difficulty mapping data with mixed key kinds onto many codecs that we would like to consider near-totally-bijective to the IPLD Data Model.

Defining the Data Model in a way that creates such conflicts is not desirable.


Mixed key kinds can be difficult to implement, or imply significant performance penalties, in some programming environments.
Defining the behavior of iteration in the presence of mixed key kinds can be difficult to implement, or imply significant performance penalties, in some programming environments.
Defining maps to allow mixed key kinds does nothing to address any of the other reasons strings need to have a consistent and inclusive domain:

Strings as applied in PathSegment would still need to be defined as sequences of 8-bit bytes -- they would need to be able to encode all map keys (including those that are bytes).
Strings as scalar values in the Data Model would still face all of the same concerns around lossless interchange.


Rejected: Remove strings from the Data Model entirely

One design choice we could make is to say "Strings don't exist at all; always just use Bytes".

This would be possible, but is simply not the path IPLD has taken.
Humans like having a concept of strings.  The usability of IPLD would be greatly imperiled if we made the crass, swimming-against-the-current choice to say "IPLD doesn't support strings".
Strings and bytes have some connotational differences (even if they don't have denotational differences).

Generally speaking: "strings are suggesting themselves to be renderable as text" and "bytes are suggesting themselves to not be intended to be rendered as text".

Mind: this doesn't imply all strings are renderable as text without escaping, nor does it imply bytes are never renderable as text.  But: the connotations remain useful in practice.


Connotations (such as about concepts of rendering) can involve compromises.  Handling without loss cannot.


We would have great difficulty mapping many codecs onto the IPLD Data Model if we refused to disambiguate strings and bytes.

For example: CBOR already has a distinction between strings and bytes.
It is generally preferable to have even limited and incomplete codecs which cannot support the full domain of 8-bit sequences in strings document themselves as supporting e.g. "this codec does not support bytes, and strings are restricted to {some subset}": this description leaning on a common understanding of strings is easy to comprehend even to a reader not immersed deeply in the details of specifications.


Rejected: Strings are required to have encoding prefixes

One design choice would be to disallow strings entirely except those that include an encoding prefix
(like a BOM -- although a BOM would arguably be insufficient, since it is considered optional to unicode, and is not sufficient to identify non-Unicode documents).

First of all, no.  This is truly beyond the pale.  No.

However, just for the sake of true completeness, let's entertain and destroy this idea exhaustively, even though it is, truly, exhausting to bother to do so.


We would have impossible barriers to matching the IPLD Data Model to many existing codecs if we claimed that all strings must have encoding prefixes.

No codec this author has ever seen has a string encoding prefix in every single string in the document.


Spending serial space repeating a string encoding marker on every string in every document would expand the size of documents unreasonably.
We can simply reference empirical history: generally, formats which have attempted this angle of approach have consistently fallen out of favor.

The most obvious example of this kind of encoding marker -- BOMs -- is not often seen in the wild!  (And often causes loud cursing on the rare occasion of an encounter!)


Appendix: Rendering Concerns

The IPLD specifications do not concern themselves with the rendering of strings.
We recommend the rendering of strings be based on the Unicode specifications,
and we recommend that when in doubt, strings should be presumed to be UTF-8 until proven otherwise.
However, since IPLD only ever discusses string encodings in terms of "SHOULD",
it follows that we cannot offer any stronger statements about rendering of strings.
IPLD library implementations may choose to provide helpful functions for e.g.
counting graphemes, counting codepoints, and other forms of presentational reasoning about strings.
However, no such features are required for a library to be considered a full IPLD implementation.
IPLD libraries may include helper functions and features which help "escape" strings which contain
unprintable and/or non-unicode sequences.
However, no such features are required for a library to be considered a full IPLD implementation.
Appendix: Other Reading


https://boyter.org/posts/unicode-support-what-does-that-actually-mean/

Worthy reference for a quick overview of visualization vs byte issues, folding issues, potential distinguishability issues, etc.


https://hsivonen.fi/string-length/

Everything you never wanted to ask about the complexity of understanding string length when Unicode gets involved -- Extended Grapheme Clusters and more.