lz-string is a very popular library that compresses UTF-16 strings using a variation of the LZ78 algorithm, in which literals are only encoded once (and referenced as dictionary indexes afterwards).
Despite its impressive 10 million downloads per week at the time of this writing, there is no official documentation of the wire format implemented by this library, and the horrible code quality makes it hard to work the format out from the source. Furthermore, the code contains many bugs / gotchas, some of which are enumerated below. This is probably why many of the numerous ports of the library blindly copy its code, doing little more than adapting syntax.
This gist provides a description of the format, as well as a simplified (but unoptimized) reference implementation in Python.
The bytes of a compressed blob are interpreted as a bitstream in big-endian order (meaning, MSB-first). This bitstream contains fixed-size integers stored in little-endian (LSB-first) order; they are "fixed-size" in the sense that their sizes are known to the code rather than signalled in the bitstream, but different integers have different sizes (see below).
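For concreteness, a bit reader for this scheme could look like the following minimal sketch (the helper name and signature are made up for illustration, not taken from the reference implementation):

```python
def read_int(data: bytes, bitpos: int, nbits: int) -> tuple[int, int]:
    """Read a fixed-size integer of `nbits` bits starting at bit `bitpos`.

    Bits are consumed MSB-first within each byte, but the first bit read
    becomes the *least* significant bit of the resulting integer.
    Returns the integer and the new bit position.
    """
    value = 0
    for i in range(nbits):
        byte = data[bitpos >> 3]
        bit = (byte >> (7 - (bitpos & 7))) & 1  # MSB-first within the byte
        value |= bit << i                       # integer is built LSB-first
        bitpos += 1
    return value, bitpos
```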
The bitstream consists of a sequence of packets, each of which begins with an opcode integer determining its type and payload:
- Opcodes 0 and 1 are followed by a literal integer of 8 or 16 bits respectively. They are emitted when the encoder finds a UTF-16 code unit it hasn't seen before.
- Opcode 2 marks the end of the stream, which is then zero-padded as needed to complete the last byte.
- Opcodes 3 and above refer to an entry in the dictionary, whose index is given by `opcode - 3`. They do not carry any payload.
The size of each opcode is as high as needed so that all valid opcodes
at that point can be represented. In other words, an opcode is encoded
in `ceil(log2(N))` bits, where `N` is the number of opcodes that are valid at that point: the 3 fixed opcodes, plus one per assigned dictionary entry, plus one for the "special case" index described below.
To decouple the wire format from the compression logic, we can say that `N` starts at 3 and grows by 2 with every literal packet and by 1 with every dictionary-index packet.
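As a sketch of how the wire-format layer can track this width on its own (the function name and signature are illustrative, not the reference implementation's API):

```python
import math

def opcode_width(dict_entries: int, any_packet_seen: bool) -> int:
    """Bits used to encode the next opcode, per the rule described above."""
    # Valid opcodes: 0 (8-bit literal), 1 (16-bit literal), 2 (EOF),
    # one per assigned dictionary entry, plus the "special case" index
    # once there is a previously emitted chunk.
    valid_opcodes = 3 + dict_entries + (1 if any_packet_seen else 0)
    return math.ceil(math.log2(valid_opcodes))
```

For example, the very first opcode is read with `ceil(log2(3)) == 2` bits, and after the first literal packet (one dictionary entry, special case now available) the next opcode takes `ceil(log2(5)) == 3` bits.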
Because it was written in JS (and more than a decade ago, when TextEncoder
wasn't a thing), this scheme compresses UTF-16 strings rather than binary data.
More specifically, it compresses a list of UTF-16 code units (i.e. 16-bit
ints) into packets, which are then formatted into a bitstream / byte buffer
as indicated above.
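For instance, obtaining that list of code units in Python could be done with a throwaway helper like this (purely illustrative):

```python
def to_code_units(s: str) -> list[int]:
    """Turn a Python string into the list of UTF-16 code units lz-string sees."""
    data = s.encode('utf-16-le')
    return [int.from_bytes(data[i:i + 2], 'little') for i in range(0, len(data), 2)]

# Astral characters become surrogate pairs, i.e. two code units:
# to_code_units("a🍕") == [0x61, 0xD83C, 0xDF55]
```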
The decoder starts with an empty dictionary. When a packet (that isn't the EOF marker) is received, a chunk of code units is emitted to the output. This chunk depends on the packet:
- If the packet is a literal, then the chunk consists of a single code unit (the literal). In this case, the next available index in the dictionary is assigned to this chunk.
- If the packet is a dictionary index, then the chunk consists of the dictionary entry given by that index.
  - Special case: if the index is equal to the dictionary size (i.e. the first unassigned index), then the chunk is equal to the last emitted chunk plus the first code unit of that same chunk. (Such an index is only valid after the first packet, when there is a "last emitted chunk".)
Before emitting the chunk to the output, the decoder assigns the next available index in the dictionary to the last emitted chunk plus the first code unit of the chunk it's about to emit. (This is not done for the first chunk, when there is no "last emitted chunk".)
Thus, a literal packet (except the first one) results in 2 new entries assigned to the dictionary: first, one containing just the literal, then another containing the previously emitted chunk plus the literal. A dictionary index packet (which can never be the first one) results in only 1 new entry in the dictionary.
The "special case" above allows the compressor to produce a bitstream that never causes duplicate entries to be added to the dictionary.
The following bugs are of special relevance when interoperating with the library and its ports:
- The library (though I believe this is fixed in newer versions) has methods to compress directly to Base64. However, if these are used, it generates invalid Base64 in some cases because the last quadruplet of Base64 characters has only 1 character, which makes no sense (those 6 bits don't form a complete byte, nor do they contribute to completing a previous byte).
  - This is because the author decided to combine the Base64 generation with the compression itself by having `compress()` take an arbitrary output word width (which is 6 in the case of Base64), zero-padding only until the last word.

  Base64 decoders will generally either fail, or discard the last 6 bits, removing part of the EOF marker from the bitstream. It can be fixed by appending an `A` to the Base64 (before the padding, if present) to make the Base64 decoder happy; see the sketch after this list.
- The library has methods to compress to a `Uint8Array`, but these (due to historical legacy) zero-pad the output to a 2-byte boundary rather than a 1-byte one.
- The library has methods to compress to "an encoded URI component", which suggests a URL-safe value requiring no percent-encoding. This is not true: rather than using URL-safe Base64 (which uses the characters `-` and `_`), this library uses a custom alphabet (with `+` and `-`) that still includes `+`, causing a need for percent-encoding in query strings.
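If you need to consume Base64 produced by an affected version, a repair along the lines described above could look like this (a sketch; the function name is made up):

```python
import base64

def fix_lzstring_base64(s: str) -> str:
    """Make Base64 from old lz-string versions decodable by strict decoders."""
    s = s.rstrip('=')
    if len(s) % 4 == 1:
        # the dangling character holds 6 bits that can't form a byte on their
        # own; an extra 'A' (six zero bits) completes them into a zero-padded byte
        s += 'A'
    return s + '=' * (-len(s) % 4)   # restore standard '=' padding

# raw_bitstream = base64.b64decode(fix_lzstring_base64(blob))
```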
The provided reference implementation separates the wire format from the compression logic itself. You likely want to call the functions under "Main API" at the end. It should be equivalent to the JS library except for the following, which mainly concerns zero-padding:
- Unlike the JS implementation, this one verifies that the bitstream contains no junk after the EOF marker (only padding). For the original behavior of accepting any input even if it has trailing junk, remove the padding assert in `decode_packets`.
- Functions only produce legal Base64, which can differ from the output of (old versions of) the JS library by one or two extra trailing `A` characters.
- Functions that produce a byte buffer pad only to a 1-byte boundary.
Despite these differences, AFAIK it should be interoperable with all versions of the JS library (in both directions), though I have yet to verify that using the official test suite.