lz-string is a very popular library that compresses UTF-16 strings using a variation of the LZ78 algorithm, in which literals are only encoded once (and referenced as dictionary indexes afterwards).
Despite its impressive 10 million downloads per week at the time of this writing, there is no official documentation of the wire format implemented by this library, and the horrible code quality makes it hard to work the format out from the source. Furthermore, the code contains many bugs / gotchas, some of which are enumerated below. This is probably why many of the numerous ports of the library blindly copy its code, doing little more than adapting syntax.
This gist provides a description of the format, as well as a simplified (but unoptimized) reference implementation in Python.
The bytes of a compressed blob are interpreted as a bitstream in big-endian order (meaning, MSB-first). This bitstream contains fixed-size integers stored in little-endian (LSB-first) order; they are "fixed-size" in the sense that their sizes are known to the code rather than signalled in the bitstream, but different integers have different sizes (see below).
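For concreteness, a bit reader for this scheme could look like the following minimal sketch (the helper name and signature are made up for illustration, not taken from the reference implementation):

```python
def read_int(data: bytes, bitpos: int, nbits: int) -> tuple[int, int]:
    """Read a fixed-size integer of `nbits` bits starting at bit `bitpos`.

    Bits are consumed MSB-first within each byte, but the first bit read
    becomes the *least* significant bit of the resulting integer.
    Returns the integer and the new bit position.
    """
    value = 0
    for i in range(nbits):
        byte = data[bitpos >> 3]
        bit = (byte >> (7 - (bitpos & 7))) & 1  # MSB-first within the byte
        value |= bit << i                       # integer is built LSB-first
        bitpos += 1
    return value, bitpos
```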
The bitstream consists of a sequence of packets, each of which begins with an opcode integer determining its type and payload:
- Opcodes 0 and 1 are followed by a literal integer of 8 or 16 bits respectively. They are emitted when the encoder finds a UTF-16 code unit it hasn't seen before.
- Opcode 2 marks the end of the stream, which is then zero-padded as needed to complete the last byte.
- Opcodes 3 and above refer to an entry in the dictionary, whose index is given by `opcode - 3`. They do not carry any payload.
The size of each opcode is as high as needed so that all valid opcodes
at that point can be represented. In other words, an opcode is encoded
in `ceil(log2(N))` bits, where `N` is the number of opcodes that are valid at that point: the 3 fixed opcodes, plus one per assigned dictionary entry, plus one for the "special case" index described below.
To decouple the wire format from the compression logic, we can say that `N` starts at 3 and grows by 2 with every literal packet and by 1 with every dictionary-index packet.
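As a sketch of how the wire-format layer can track this width on its own (the function name and signature are illustrative, not the reference implementation's API):

```python
import math

def opcode_width(dict_entries: int, any_packet_seen: bool) -> int:
    """Bits used to encode the next opcode, per the rule described above."""
    # Valid opcodes: 0 (8-bit literal), 1 (16-bit literal), 2 (EOF),
    # one per assigned dictionary entry, plus the "special case" index
    # once there is a previously emitted chunk.
    valid_opcodes = 3 + dict_entries + (1 if any_packet_seen else 0)
    return math.ceil(math.log2(valid_opcodes))
```

For example, the very first opcode is read with `ceil(log2(3)) == 2` bits, and after the first literal packet (one dictionary entry, special case now available) the next opcode takes `ceil(log2(5)) == 3` bits.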
Because it was written in JS (and more than a decade ago, when TextEncoder
wasn't a thing), this scheme compresses UTF-16 strings rather than binary data.
More specifically, it compresses a list of UTF-16 code units (i.e. 16-bit
ints) into packets, which are then formatted into a bitstream / byte buffer
as indicated above.
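For instance, obtaining that list of code units in Python could be done with a throwaway helper like this (purely illustrative):

```python
def to_code_units(s: str) -> list[int]:
    """Turn a Python string into the list of UTF-16 code units lz-string sees."""
    data = s.encode('utf-16-le')
    return [int.from_bytes(data[i:i + 2], 'little') for i in range(0, len(data), 2)]

# Astral characters become surrogate pairs, i.e. two code units:
# to_code_units("a🍕") == [0x61, 0xD83C, 0xDF55]
```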
The decoder starts with an empty dictionary. When a packet (that isn't the EOF marker) is received, a chunk of code units is emitted to the output. This chunk depends on the packet:
- If the packet is a literal, then the chunk consists of a single code unit (the literal). In this case, the next available index in the dictionary is assigned to this chunk.
- If the packet is a dictionary index, then the chunk consists of the dictionary entry given by that index.
  - Special case: if the index is equal to the dictionary size (i.e. the first unassigned index), then the chunk is equal to the last emitted chunk plus the first code unit of that same chunk. (Such an index is only valid after the first packet, when there is a "last emitted chunk".)
Before emitting the chunk to the output, the decoder assigns the next available index in the dictionary to the last emitted chunk plus the first code unit of the chunk it's about to emit. (This is not done for the first chunk, when there is no "last emitted chunk".)
Thus, a literal packet (except the first one) results in 2 new entries assigned to the dictionary: first, one containing just the literal, then another containing the previously emitted chunk plus the literal. A dictionary index packet (which can never be the first one) results in only 1 new entry in the dictionary.
The "special case" above allows the compressor to produce a bitstream that never causes duplicate entries to be added to the dictionary.
The following bugs are of special relevance when interoperating with the library and its ports:
- The library (though I believe this is fixed in newer versions) has methods to compress directly to Base64. However, if these are used, it generates invalid Base64 in some cases because the last quadruplet of Base64 characters has only 1 character, which makes no sense (those 6 bits don't form a complete byte, nor do they contribute to completing a previous byte).
  - This is because the author decided to combine the Base64 generation with the compression itself by having `compress()` take an arbitrary output word width (which is 6 in the case of Base64), zero-padding only until the last word.

  Base64 decoders will generally either fail, or discard the last 6 bits, removing part of the EOF marker from the bitstream. It can be fixed by appending an `A` to the Base64 (before the padding, if present) to make the Base64 decoder happy; see the sketch after this list.
- The library has methods to compress to a `Uint8Array`, but these (due to historical legacy) zero-pad the output to a 2-byte boundary rather than a 1-byte one.
- The library has methods to compress to "an encoded URI component", which suggests a URL-safe value requiring no percent-encoding. This is not true: rather than using URL-safe Base64 (which uses the characters `-` and `_`), this library uses a custom alphabet (with `+` and `-`) that still includes `+`, causing a need for percent-encoding in query strings.
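If you need to consume Base64 produced by an affected version, a repair along the lines described above could look like this (a sketch; the function name is made up):

```python
import base64

def fix_lzstring_base64(s: str) -> str:
    """Make Base64 from old lz-string versions decodable by strict decoders."""
    s = s.rstrip('=')
    if len(s) % 4 == 1:
        # the dangling character holds 6 bits that can't form a byte on their
        # own; an extra 'A' (six zero bits) completes them into a zero-padded byte
        s += 'A'
    return s + '=' * (-len(s) % 4)   # restore standard '=' padding

# raw_bitstream = base64.b64decode(fix_lzstring_base64(blob))
```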
The provided reference implementation separates the wire format from the compression logic itself. You likely want to call the functions under "Main API" at the end. It should be equivalent to the JS library except for the following, which mainly concerns zero-padding:
- Unlike the JS implementation, this one verifies that the bitstream contains no junk after the EOF marker (only padding). For the original behavior of accepting any input even if it has trailing junk, remove the padding assert in `decode_packets`.
- Functions only produce legal Base64, which can differ from the output of (old versions of) the JS library by one or two extra trailing `A` characters.
- Functions that produce a byte buffer pad only to a 1-byte boundary.
Despite these differences, AFAIK it should be interoperable with all versions of the JS library (in both directions), though I have yet to verify that using the official test suite.