This program reads an export reference graph (i.e. a graph representing the runtime dependencies of a set of derivations) created by Nix and groups them in a way that is likely to match the grouping for other derivation sets with overlapping dependencies.
This is used to determine which derivations to include in which layers of a container image.
- a graph of Nix runtime dependencies, generated via exportReferenceGraph
- a file containing absolute popularity values of packages in the Nix package set (in the form of a direct reference count)
- a maximum number of layers to allocate for the image (the "layer budget")
It works by first creating a (directed) dependency tree:
img (root node)
│
├───> A ─────┐
│ v
├───> B ───> E
│ ^
├───> C ─────┘
│ │
│ v
└───> D ───> F
│
└────> G
Each node (i.e. package) is then visited to determine how important it is to separate this node into its own layer, specifically:
-
Is the node within a certain threshold percentile of absolute popularity within all of nixpkgs? (e.g.
glibc
,openssl
) -
Is the node's runtime closure above a threshold size? (e.g. 100MB)
In either case, a bit is flipped for this node representing each condition and an edge to it is inserted directly from the image root, if it does not already exist.
For the rest of the example we assume 'G' is above the threshold size and 'E' is popular.
This tree is then transformed into a dominator tree:
img
│
├───> A
├───> B
├───> C
├───> E
├───> D ───> F
└───> G
Specifically this means that the paths to A, B, C, E, G, and D always pass through the root (i.e. are dominated by it), whilst F is dominated by D (all paths go through it).
The top-level subtrees are considered as the initially selected layers.
If the list of layers fits within the layer budget, it is returned.
Otherwise layers are merged together in this order:
- layers whose root meets neither condition above
- layers whose root is popular
- layers whose root is big
- layers whose root meets both conditions
Threshold values for the partitioning conditions mentioned above have not yet been determined, but we will make a good first guess based on gut feeling and proceed to measure their impact on cache hits/misses.
Using the logic described above as well as the example presented in the introduction, this program would create the following layer groupings (assuming no additional partitioning):
Layer budget: 1 Layers: { A, B, C, D, E, F, G }
Layer budget: 2 Layers: { G }, { A, B, C, D, E, F }
Layer budget: 3 Layers: { G }, { E }, { A, B, C, D, F }
Layer budget: 4 Layers: { G }, { E }, { D, F }, { A, B, C }
...
Layer budget: 10 Layers: { E }, { D, F }, { A }, { B }, { C }