warpfork/the-string-situation.md

The String Situation


// This document is INCOMPLETE.  you have been warned.

This exploration report is a brief round-up of all the places the definition of
"string" -- a seemingly simple and common concept!  but not trivial, by any means --
becomes critically important to systemic comprehensibility and correctness.

foreword: nothing is "simply"

Dear Reader, please know that your humble author has tried to write this document several times, in several forms, over several months.
Many of those drafts have languished and gone unfinished (or at best, been shared in smaller circles while still containing massive "todo"s).
Controlling the scope of this topic is surprisingly difficult: it touches, and is touched by, many things.
If there's one thing your humble author has learned from this, it is the following:
One cannot "simply" anything, in regard to this topic.
Kindly go into reading the remainder of this document with this context in mind.

foreword: this is an exploration report

This document is framed as an "exploration report" -- not a "spec".
This means that it's a single-author document -- it may contain opinions and recommendations,
and they have not necessarily been widely ratified.
It's also a point-in-time snapshot of thinking --
the thinking may change over time, but this document will remain a historical reference more so than a living document.
Any statements in spec documents supersede these.

What is the "String Situation"?


Strings are a part of the IPLD Data Model.  Therefore what they can contain must be specified.
Strings are a de facto part of many codecs.

Not all codecs will directly match the IPLD Data Model definition of strings.
(Such differences are inevitable, because IPLD interfaces with many codecs, not just codecs that have been defined by the IPLD team.)
In cases where they differ, the behavior around this must be specified.


Strings appear as more than just terminal node scalars in the IPLD Data Model: they also appear as part of the structure of maps, a recursive kind.

Recursive kinds in the Data Model require more specification than scalar kinds: it's necessary to define how iteration is expected to work, and so on.


Strings are deeply related to pathing.

Anything related to pathing is also directly implied to be related to...

traversals...
Selectors...
linking...
how we debug things...
any wire protocol that uses paths (of which there are many)...
etcetera!


Pathing is itself a topic with many important high-level rules that we need to maintain the integrity and consistency of:

we require that Paths must be universally understood by all IPLD library implementations
we require that there is no such thing as unpathable data structures -- any part of any data structure should be reachable by path from a parent position.
we want Paths to be human readable
we want Paths to be human writable without undue effort or fragility
we want Paths to be usable in debugging
etcetera!


"The String Situation" refers to the mixture of all of these concerns --
because they all turn out to be intertwined.
Questions that we have to answer in this realm include:

What are "strings", really?

What character encodings do we have?
When are character encodings normalized?
What do we do when encountering non-normalized content?  (Error, warn, fix?  How, in detail?  Does it vary per situation?  Is it configurable?  Etc.)
Are any of these choices going to pose correctness or interoperability problems?


Are there any places where we have "strings" that have more rules than other "strings"?

If so, are we identifying that sufficiently clearly in code, specs, and tests?


Do answers to any of these questions vary per codec?

If so, can we orient around a common set of variations and identify terse vocabulary for them, so that we can use it consistently in docs and specs?


Do answers to any of these questions create hurdles which make IPLD libraries drastically harder to implement?

... in terms of correct compliance to spec?
... in terms of how much non-standard-library code will need to be written to produce the library?
... in terms of how much user-facing burden the library interface will generate?
... in terms of performance costs, either in memory or in execution time?
Do the answers to these questions vary per programming language and environment?


Summary of results

This is an overview of the aggregate proposals in this report,
which have been chosen and defined such that they are all mutually reconcilable:

String encoding is recommended; 8-bit clean handling is required.

Null bytes are (unfortunately) part of 8-bit clean.
The recommended character encoding is UTF-8, NFC normalized.
Libraries which do not support 8-bit clean strings are not IPLD spec compliant.
Codecs which do not support 8-bit clean strings are lossy.


Map keys are strings.

Map keys are only strings (not integers, or anything else).

Advanced Layouts are free to do as they like internally, as long as they can also present this interface externally.
Custom complex Codecs are free to do as they like internally, as long as they can also present this interface externally.
Schemas are free to use any type as a key, as long as its representation is a string.


Map keys can be any strings, including empty.
Map keys can be any strings, including those with slashes.


Paths are defined to be a list of PathSegment.

Encoding these lists into a single string requires defining an encoding and escaping mechanism.


PathSegments are defined to be string-coercibles.

Strings are (trivially) string-coercibles.
Integers are string-coercibles.
This is single-directional: It is not possible to later disambiguate whether a string-coercible was originally an integer or a string.


Some of these proposals are more flexible than others:
some of them can't change at all, or the whole proposal set shatters into irreconcilability;
others are chosen more for convenience and clarity, but could be modified (carefully).
Read the full paragraphs of details below to find out which is which (and why).

String encoding is recommended; 8-bit clean handling is required

Strings are understood to be composed of sequences of 8-bit bytes.
We recommend that users of IPLD choose UTF-8 as their string encoding,
and we further recommend that UTF-8 strings be canonicalized using NFC canonicalization.
However, IPLD libraries will not enforce this, and any kind of bytes may be used as strings.
IPLD libraries will typically presume UTF-8 encoding in any situation where they need to render strings (e.g. in error messages),
and may choose to use escaping for
Applications in the IPLD ecosystem should also typically presume UTF-8 encoding in any situation where they need to render strings.
Since applications and libraries typically presume UTF-8 encodings,
any applications that create IPLD data are strongly advised to produce UTF-8 encoded strings.
If they do not do so (or intentionally treat raw non-textual bytes as strings),
conformant IPLD libraries will handle their data losslessly (per the sequence of 8-bit bytes rule),
but the experience for users of these applications will be greatly degraded
since many parts of the IPLD ecosystem may render non-UTF-8 strings unappealingly.
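
As a rough illustration of the rendering behaviour described above, the following Python sketch keeps a string as raw bytes and only interprets it as UTF-8 at display time, escaping whatever does not decode. The function name `render_for_display` is hypothetical, and the use of backslash escapes is an assumption: this report does not prescribe an escaping format.

```python
def render_for_display(raw: bytes) -> str:
    """Render string bytes for humans, presuming UTF-8.

    Bytes that are not valid UTF-8 are shown as backslash escapes instead
    of raising, so the stored value never has to be altered.
    """
    return raw.decode("utf-8", errors="backslashreplace")


well_formed = "café".encode("utf-8")  # valid UTF-8
arbitrary = b"caf\xff\xfe"            # not valid UTF-8

print(render_for_display(well_formed))  # café
print(render_for_display(arbitrary))    # caf\xff\xfe
```

The stored bytes stay untouched in both cases; only the human-facing rendering differs, which is where the "degraded experience" for non-UTF-8 data shows up.
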

Null bytes are part of 8-bit clean

"0b00000000" is a valid member of the set of 8-bit bytes, and no more special than any other 8-bit byte.
C-family languages take note: null-terminated string handling will not work for IPLD data.
The "strn*" family of functions bounds how far it reads, but still stops at the first null byte;
you need functions that carry explicit lengths and treat null as ordinary data (e.g. "memcpy" and "memcmp").
(But then, you should be carrying explicit lengths anyway, at all times in all cases, for a million other reasons... right?)

This choice seems the cleanest, most consistent, and least complicated route to me.
It also saves us from needing to define what happens when (not if) an IPLD library is handed "string" data that contains a null byte:
as long as the answer is "that's still just a string", everything is parsimonious.
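
To make the hazard concrete, here is a small Python sketch contrasting length-carrying handling with null-terminated handling. `c_style_length` is a hypothetical stand-in for what a null-terminated routine would see; it is not part of any IPLD library.

```python
def c_style_length(buf: bytes) -> int:
    """Hypothetical stand-in for strlen(): count bytes up to the first null."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end


key = b"abc\x00def"
other = b"abc\x00xyz"

# Length-carrying handling sees the whole string.
assert len(key) == 7

# Null-terminated handling silently truncates it...
assert c_style_length(key) == 3

# ...which makes two distinct strings indistinguishable.
assert key != other
assert key[:c_style_length(key)] == other[:c_style_length(other)]
```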

The recommended character encoding is UTF-8, NFC normalized.

UTF-8 is the de facto winner of the character encoding popularity wars.
We see no reason to fight this overwhelmingly clear tide.
NFC is one of the well-standardized Unicode normalization forms.
It is carefully shepherded to be stable and backwards compatible across Unicode versions,
and widely recognized as a standard.
Of the well-defined Unicode normalization forms, NFC is typically the most compact,
the most commonly occurring in the wild, and generally the most suitable to our needs.
You can read more about Unicode normalization in the specs here:
https://www.unicode.org/reports/tr15/#Norm_Forms
(In particular, there are some excellent tables and example figures in the linked chapter.)
The W3C "Character Model for the World Wide Web 1.0: Normalization"
and other W3C specifications such as XML 1.0 (Fifth Edition) recommend using
NFC normalization for all content (this is discussed in the Unicode documentation above),
further adding weight to the observation that NFC is an uncontroversial and reasonable choice.
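
A short Python sketch of why normalization matters for byte-level identity: the same visible character can be spelled with different code point sequences, and therefore different UTF-8 bytes, until both are normalized. It uses only the standard library `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

decomposed = "e\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT
composed = "\u00e9"     # U+00E9, the precomposed form

# Different code point sequences, therefore different bytes.
assert decomposed != composed
assert decomposed.encode("utf-8") == b"e\xcc\x81"  # 3 bytes
assert composed.encode("utf-8") == b"\xc3\xa9"     # 2 bytes

# NFC maps both spellings onto the composed form.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFC", composed) == composed
```

In this example the NFC form is also the shorter of the two encodings, in line with the compactness observation above.
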

Libraries which do not support 8-bit clean strings are not IPLD spec compliant

Though we have a recommended character encoding (think: "SHOULD", in RFC 2119 parlance),
the rule of "8-bit clean handling is required" entirely supersedes character encoding recommendations.
Fully correct and spec-compliant IPLD libraries MUST handle strings
faithfully and losslessly even if they contain non-normalized, or even non-UTF-8, content.
It is not correct for an IPLD library to apply encoding limitations
or normalization mutations to a string unless it is done at the user's direction.
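
One way to picture "only at the user's direction" is an entry point where normalization is strictly opt-in. The sketch below is an assumption about how a library might shape this; `accept_string` and its `normalize_nfc` flag are hypothetical names, not an existing IPLD API.

```python
import unicodedata


def accept_string(raw: bytes, normalize_nfc: bool = False) -> bytes:
    """Hypothetical library entry point for string values.

    By default the bytes pass through untouched. NFC normalization happens
    only when the caller asks for it, and fails loudly if the input is not
    UTF-8 rather than guessing.
    """
    if not normalize_nfc:
        return raw
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on non-UTF-8 input
    return unicodedata.normalize("NFC", text).encode("utf-8")


# Default: non-normalized and non-UTF-8 content are both preserved exactly.
assert accept_string(b"e\xcc\x81") == b"e\xcc\x81"
assert accept_string(b"\xff\x00") == b"\xff\x00"

# Opt-in: the caller explicitly requests normalization.
assert accept_string(b"e\xcc\x81", normalize_nfc=True) == b"\xc3\xa9"
```

Raising on non-UTF-8 input in the opt-in path is a design choice made for this sketch; a library could equally return the bytes unchanged, as long as it never mutates silently.
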

Codecs which do not support 8-bit clean strings are lossy

Any codec that cannot encode every value in the full
range of values described by the IPLD Data Model is lossy;
any codec that cannot produce every value in the full
range of values described by the IPLD Data Model during decoding is limited.
Since the IPLD Data Model defines strings as 8-bit clean,
any codec which cannot encode all such strings uniquely is therefore lossy,
and any codec which cannot produce the full range of 8-bit clean strings
during decoding is limited.
There's nothing wrong with lossy and/or limited codecs -- however,
it is important that documentation of those codecs should identify them as such.
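
The "cannot encode all such strings uniquely" condition can be demonstrated with a toy example. The sketch below assumes a hypothetical codec step, `lossy_encode`, that insists on valid UTF-8 and substitutes U+FFFD for anything else; it stands in for no particular real codec.

```python
def lossy_encode(raw: bytes) -> bytes:
    """Hypothetical codec step that insists on valid UTF-8 output.

    Undecodable bytes are replaced with U+FFFD (the replacement character).
    """
    return raw.decode("utf-8", errors="replace").encode("utf-8")


a = b"key\xff"
b = b"key\xfe"

# The original bytes do not survive the round trip...
assert lossy_encode(a) != a

# ...and two distinct strings collapse into the same encoded form.
assert a != b
assert lossy_encode(a) == lossy_encode(b)

# Well-formed UTF-8 is unaffected, which is why the loss is easy to miss.
assert lossy_encode("café".encode("utf-8")) == "café".encode("utf-8")
```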

Map keys are Strings

Map keys -- the parameter used to look up values in a map -- must be strings.
(If Schemas are involved, the rule becomes slightly more conditioned,
but retains the same flavor: in Schemas, the key type for any map
must be representable as strings.)

Map keys are only strings (not integers, or anything else).

This is perhaps easiest to justify by explaining the alternatives
and what would be required to make them workable:

Map keys can be any strings, including empty.

// TODO expand

Map keys can be any strings, including those with slashes.

// TODO expand

Paths are defined to be a list of PathSegment

// TODO expand

Encoding paths into a single string requires defining an encoding and escaping mechanism

// TODO expand
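
Pending that expansion, a small Python sketch can at least show why an escaping mechanism is unavoidable once map keys may contain slashes or be empty. The escaping scheme used here ("~0" for "~", "~1" for "/", and a leading separator per segment) is borrowed from JSON Pointer (RFC 6901) purely for illustration. It is an assumption, not a mechanism this report defines, and all function names are hypothetical.

```python
from typing import List

SEPARATOR = b"/"


def naive_join(segments: List[bytes]) -> bytes:
    """Join segments with no escaping at all."""
    return SEPARATOR.join(segments)


def escape_segment(segment: bytes) -> bytes:
    # Escape the escape character first, then the separator.
    return segment.replace(b"~", b"~0").replace(b"/", b"~1")


def unescape_segment(encoded: bytes) -> bytes:
    # Undo the two replacements in the opposite order.
    return encoded.replace(b"~1", b"/").replace(b"~0", b"~")


def encode_path(segments: List[bytes]) -> bytes:
    # A leading separator per segment keeps [] and [b""] distinct.
    return b"".join(SEPARATOR + escape_segment(s) for s in segments)


def decode_path(encoded: bytes) -> List[bytes]:
    if encoded == b"":
        return []
    if not encoded.startswith(SEPARATOR):
        raise ValueError("encoded path must start with the separator")
    return [unescape_segment(part) for part in encoded[1:].split(SEPARATOR)]


# Without escaping, two different paths collapse into one string.
assert naive_join([b"a/b", b"c"]) == naive_join([b"a", b"b", b"c"])

# With escaping, every list of segments round-trips.
for path in ([], [b""], [b"a/b", b"c"], [b"a", b"b", b"c"], [b"~1", b"/"]):
    assert decode_path(encode_path(path)) == path
```
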

PathSegments are defined to be string-coercibles.


Strings are (trivially) string-coercibles.
Integers are string-coercibles.
This is single-directional: It is not possible to later disambiguate whether a string-coercible was originally an integer or a string.
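
The one-way nature of this coercion can be sketched in a few lines of Python. `to_path_segment` is a hypothetical helper, and rendering integers in base-10 ASCII is an assumption made for the example rather than something stated above.

```python
from typing import Union


def to_path_segment(value: Union[bytes, int]) -> bytes:
    """Hypothetical coercion of a string or an integer into a PathSegment."""
    if isinstance(value, bytes):
        return value  # strings are trivially string-coercible
    if isinstance(value, int) and not isinstance(value, bool):
        # Assumption: integers coerce to their base-10 ASCII form.
        return str(value).encode("ascii")
    raise TypeError(f"not string-coercible: {type(value).__name__}")


from_int = to_path_segment(7)
from_string = to_path_segment(b"7")

# Both inputs yield the same segment, so the origin cannot be recovered.
assert from_int == from_string == b"7"
```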