Here's where I understand the state of the art to be:
- In this INRIA tech report, Shapiro, Preguiça, Baquero and Zawirski (SPBZ) prove, amongst other things, that a sufficient condition for CRDTs to achieve eventual consistency on networks which may reorder and duplicate packets (which I'll call flaky networks, henceforth) is that
- the underlying datatype forms a semilattice,
- messages are full states,
- incoming messages are combined with the node's current state using the least-upper-bound operation in the semilattice.
- It's possible to relax condition 2 and still achieve eventual consistency over flaky networks by fragmenting the state into independent parts and transmitting updates to each part separately. For instance, in the G-Set CRDT (an add-only bitset) one can transmit only the index of the element to be added.
- In these slides from a talk at Dagstuhl, Almeida, Baquero and Cunha (ABC) consider various operators for building CRDTs out of smaller objects. These composition operators are mostly familiar to programmers: sets, maps, functions, products, lexicographic products, multisets. Perhaps less familar are linear sums (roughly disjoint union, with all elements of the left semilattice less than the elements of the right semilattice) and antichains (sets of maximal elements of some poset). They also introduce a special class of semilattice endomorphisms called inflations.
- In this paper, Almeida and Baquero (AB) describe a CRDT that functions as a distributed counter, and generalise the construction to other semilattices with extra monoid structure which plays nicely with the semilattice structure.
A word about me: I'm a former category theorist/universal algebraist, with no particular deep knowledge of order or lattice theory. Most of what I'm going to say proceeds from very general algebraic considerations, and may not actually be all that useful. With that said...
- In this gist, I prove that conditions 1 and 3 are necessary for eventual consistency over flaky networks: all such CRDTs must be based on semilattices.
- A consequence of this view is that the set M of messages can be viewed as a set of generators for the underlying semilattice S.
- This gives us a lower bound on the size of M. Since M is a generating set for S (see here for the equivalent notion for groups), S must be a quotient of P(M) by the homomorphic embedding theorem. So if |M| = n, then S can have at most 2^n elements. Hence |M| is at least log_2(|S|).
- In addition to the composition operations considered in (ABC), I suggested the following three operations in a Twitter conversation with Carlos Baquero and Sam Elliot:
- Coproducts (like Haskell's sum types). It turns out that coproducts of semilattices are the same thing as products. D'oh!
- Tensor products, which allow us to do currying. The tensor product A⊗B is defined such that Hom(A⊗B,C) is isomorphic to Hom(A, Hom(B,C)) - in many well-known categories tensor products are just ordinary products. A detailed construction of tensor products for semilattices is given in this paper.
- Quotients; think "this semilattice S but with a set equal to b, and all the consequences of that decision shaken out".
- As mentioned above, every semilattice is a quotient of powersets. We know we can work with powersets, so if we can work with quotients then we can build every possible flake-resistant CRDT. The problem of algorithmically determining whether two elements of a quotient algebra are equal is known as the word problem for that type of structure, and it turns out that the word problem for semilattices is fairly easy to solve. I'm going to copy-and-paste from Jeff Egger's description of the algorithm (from a Facebook discussion); by "free semilattice" he means "powerset":
First off, finitely free semilattices are obviously decidable. In fact, if you choose a linear ordering on the (finite) set of generators, you get a normal form: list in order those generators which appear in the word and associate to the left or right, as you prefer, and that's it.
Now let's say we add a relation s=t. That equality is equivalent to the pair of inequalities s<t and t<s, which are in turn equivalent to the pair of equalities t=st and s=st. SO, we get a nf for the quotient lattice as follows: given a word w (reduced as above), if you see all the generators which appear in s, throw in all the extra generators which appear in t, and re-reduce; if not, but you see all the generators which appear in t, throw in all the extra generators appearing in s and re-reduce.
This algorithm works because if L is a semilattice, every x in L is the least upper bound of { y in L | y <= x }. If G is a generating set for L, this can be simplified to x = lub { g in G | g <= x }.
Does this help? Well, it answers the question about universality on the last slide of (ABC). Is it an intuitive way of building data-structures? Probably not.
ABC came up with the name "inflation" independently, but there was already a name for these: a function f such that f(x) >= x is called "inflationary" or "progressive" (see http://en.wikipedia.org/wiki/Bourbaki-Witt_theorem). We use "inflationary" in our papers.