Murmur32 hash is implemented in uniform_hash in hash.cc. It takes a string and a seed and return uint64_t.
There are two hash modes specified by --hash option.
Strings mode (hashstring in parse_primitives.cc) use parsed uint64_t value plus seed for integer string(matched by [0-9]+) and murmur32 for other type.
uint64_t hash(string s, uint64_t seed) =
if s match [0-9]+
return uint64_t(s)+seed
else:
return murmur32(s, seed)
All mode (hashall in parse_primitives.cc) use murmur32.
hash(str, seed) = murmur32(str, seed)
const uint64_t constant = 11650396; // defined in constant.h
uint64_t constant_hash = 11650396 % (1<<b);
uint64_t feature_hash = hash(feature_name, 0)
uint64_t feature_index = feature_hash % (1<<b)
where 0 is seed, b is num of bits used.
uint64_t namespace_hash = hash(namespace, 0)
uint64_t feature_hash = hash(feature_name, namespace_hash)
uint64_t feature_index = feature_hash % (1<<b)
from interactions.h
given hashes(indices) h1, h2 of two features
const uint32_t FNV_prime = 16777619; //defined in constant.h
uint64_t h = (h1*FNV_prime) ^ h2 // ^ is xor
uint64_t i = h % (1<<b)
Whether h1, h2 are feature hashes(before mode) or indices (after mod) will give the same result.
sample:
1 | 123456789 has_car |gender 男 |interest 篮球 |age 42
index calculation(strings mode, b=18):
constant -> 11650396 % 1<<18 = 116060
123456789 -> 123456789 % 1<<18 = 249109
hashstring(hash_car, 0) = 2086702313
has_car -> 2086702313 % (1<<18) = 36073
hashstring(gender, 0) = 2678092942
hashstring(男, 2678092942) = 430992541
gender^男 -> 430992541 % 1<<18 = 27805
hashstring(interest, 0) = 570245443
hashstring(篮球, 570245443) = 2287792271
interest^篮球 -> 2287792271 % 1<<18 = 61583
hashstring(age, 0) -> 717653329
hashstring(42, 717653329) -> 717653329+42 = 717653371
age^42 -> 717653371 % 1<<18 = 165243
gender^男*interest^篮球 -> ((27805 * 16777619) ^ 61583) % 1<<18 = 134056
gender^男*gender^男 -> ((27805*16777619)^27805) % 1<<18 = 169914
interest^篮球*interest^篮球 -> ((61583*16777619)^61583) % 1<<18 = 147858