// This can grow a Robin Hood linear probing hash table at near word-at-a-time memcpy speeds. If you're confused why I use 'keys'
// to describe the hash values, it's because my favorite perspective on Robin Hood (which I learned from Paul Khuong)
// is that it's just a sorted gap array which is MSB bucketed and insertion sorted per chain:
// https://pvk.ca/Blog/2019/09/29/a-couple-of-probabilistic-worst-case-bounds-for-robin-hood-linear-probing/
// The more widely known "max displacement" picture of Robin Hood hashing also has strengths since the max displacement
// can be stored very compactly. You can see a micro-optimized example of that here for small tables where the max displacement
// can fit in 4 bits: Sub-nanosecond Searches Using Vector Instructions, https://www.youtube.com/watch?v=paxIkKBzqBU
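
// The gist doesn't include the surrounding declarations, so here is an assumed
// preamble and a plausible Table layout, reconstructed from the field accesses
// in grow() below; treat the exact types and names as my guesses, not the original.
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

#define EMPTY (~0ull) // all-ones empty sentinel; see the comments inside grow() below

typedef struct {
    u64 *start; // lowest address of the reserved range; grow() asserts it never grows past this
    u64 *keys;  // base of the current table; the home slot for hash k is keys[k >> shift]
    u64 *end;   // one past the overflow area that probe chains can spill into
    u64 shift;  // MSB hashing: the table proper has 2^(64 - shift) slots
} Table;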
void grow(Table *table) {
    u64 exp = 64 - table->shift;
    // We grow the table downward in place by a factor of 2 (not counting the overflow area at table->end).
    u64 *keys = table->keys - (1ull << exp);
    assert(keys >= table->start);
    // We rely on "MSB hashing" where the home slot index corresponding to a hash value k is k >> shift. Decrementing
    // the shift amount exposes a new LSB in the slot index; the upper bits of the new index are just the old index
    // shifted up by one. The net result is that items in the table spread out but stay sorted.
    u64 shift = --table->shift;
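    // Worked example (hypothetical numbers): with 4 slots, shift = 62, a hash whose top
    // bits are 0b10 has home slot 0b10 = 2. After --shift = 61 the same hash has home
    // slot 0b10x = 4 or 5: the old index shifted up by one, plus the newly exposed bit.
    // Every index roughly doubles (give or take the new LSB), so relative order is kept.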
    // The scatter-based rehashing assumes the lower part of the table is pre-cleared to EMPTY. A proper implementation
    // should probably do blocked clearing in the grow function itself (outside of the branchless inner loop) in
    // blocks sized as a fraction of the L1 cache.
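    // A minimal (non-blocked) way to satisfy that precondition, shown as a sketch
    // rather than as what the gist actually does; with EMPTY = ~0 every byte is 0xFF:
    // memset(keys, 0xFF, (1ull << exp) * sizeof(u64)); // clear the new lower half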
    for (u64 *src = table->keys, *dest = keys; src != table->end; src++, dest++) {
        u64 k = *src;
        *src = EMPTY;
        // The conditional is unnecessary if EMPTY = 0, but EMPTY = ~0 is useful as a search sentinel for Robin Hood.
        // An alternative benefit of using EMPTY = 0 is that you can rely on the kernel's virtual memory manager to
        // pre-zero pages, which synergizes with reserving a worst-case virtual memory range and growing in place.
        // That eats more memory bandwidth (vs blocked clearing, which only operates on soon-to-be-hot cache lines),
        // but a lot of it will be done asynchronously, so the pre-zeroing bandwidth is spread out over time. And
        // if you're already getting pre-zeroed pages directly from the kernel (as opposed to recycling memory in
        // user space) you might as well take advantage of it. Still, you can memset an L1 cache line (64 bytes)
        // per cycle with AVX2 instructions nowadays, so a nonzero EMPTY sentinel is nearly free if you do
        // blocked clearing; simplifying the search loop is worth it.
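        // To make the sentinel concrete, here is a lookup sketch (my addition, not in
        // the gist): because the table is sorted and EMPTY = ~0 compares greater than
        // any real key, a single comparison both walks the chain and detects empty
        // slots, with the overflow area bounding the scan:
        //   u64 *p = &table->keys[k >> table->shift];
        //   while (*p < k) p++;     // an EMPTY (~0) gap stops the scan for free
        //   bool found = (*p == k); // requires that no real key equals ~0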
        u64 i = k != EMPTY ? k >> shift : 0; // Branchless
        u64 *p = &keys[i];
        // With Robin Hood linear probing we already have the items in the right order and we just need to leave
        // gaps in the right places. Two cases: either append to the current chain or jump the gap to the home slot.
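        // Concrete example (hypothetical slot numbers): if dest currently points at
        // slot 6 and k's home slot is 9, we jump forward to 9 and slots 6..8 stay
        // EMPTY (a gap); if k's home slot is 4, k was displaced by the current chain
        // and is appended at slot 6 (displacement 2).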
        dest = p >= dest ? p : dest; // Branchless
        *dest = k;
    }
    table->keys = keys;
    // By the way, the out-of-place version of this algorithm should be ideal for multi-threaded rehashing. Contiguous
    // source ranges map to contiguous destination ranges, so you can use the "split evenly and sync across the boundary"
    // strategy to assign independent source/destination ranges to workers. In a concurrent hash table this would typically
    // be done via a helper scheme where threads that want to insert while a rehash is ongoing will help finish the job
    // by grabbing some outstanding ranges to work on. This can be made wait-free if you use fetch-and-add to grab work
    // (see the sketch after this function); instead of suffering the usual amortization spike on one thread, you're
    // spreading it out across multiple threads, so no inserter thread can be starved as could be the case in a merely
    // lock-free algorithm. Unfortunately, while rehashing a Robin Hood table looks to be ideally suited for
    // multi-threading, the same cannot be said for sorted inserts. With classical linear probing you can do an insert
    // with a compare-and-swap into the first empty slot at the end of a probe chain. Robin Hood inserts require you to
    // shift slots in a chain to make room. It's doable but it's inherently harder and more costly (I think you'd need
    // to shift from the end of the chain and leave temporary duplicates with a flag bit). But the good news is that this
    // style of multi-threaded contiguous range rehashing should be adaptable to classical linear probing with MSB
    // hashing. You don't have the same strong monotonicity invariant as with Robin Hood linear probing, but you do have
    // monotonicity across gap-separated chains (i.e. items in different chains are ordered correctly even if items
    // within the same chain may be out of order), which should be enough for the range partitioning.
}
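
// What follows is my sketch (not from the gist) of the wait-free work grabbing
// described in the comment above, for the out-of-place variant. All names here
// (RehashShard, REHASH_BLOCK, rehash_range) are hypothetical. Workers claim
// contiguous source blocks with fetch-and-add; since contiguous source ranges map
// to contiguous destination ranges, each block can be rehashed independently apart
// from syncing with the neighboring worker across the block boundary.
#include <stdatomic.h>

typedef struct {
    u64 *src;                  // old table; workers read and clear it
    u64 *dst;                  // new table, pre-cleared to EMPTY
    atomic_uint_fast64_t next; // next unclaimed block index
    u64 num_blocks;            // total number of blocks to rehash
} RehashShard;

enum { REHASH_BLOCK = 4096 }; // slots per claimed block (a tuning assumption)

void rehash_worker(RehashShard *shard) {
    for (;;) {
        u64 b = atomic_fetch_add_explicit(&shard->next, 1, memory_order_relaxed);
        if (b >= shard->num_blocks)
            return; // all blocks claimed; fetch-and-add means no thread is starved
        u64 lo = b * REHASH_BLOCK, hi = lo + REHASH_BLOCK;
        // Rehash slots [lo, hi) from shard->src into shard->dst using the same
        // scatter loop as grow() above, then resolve the boundary at hi with the
        // next block's worker for chains that spill past it:
        // rehash_range(shard, lo, hi); // hypothetical helper, left unimplemented
        (void)lo; (void)hi;
    }
}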