java - Seemingly easy FNV1 hashing implementation results in a lot of collisions -


i'm playing hash tables , using corpus of ~350,000 english words i'd try evenly distribute. thus, try fit them array of length 810,049 (the closest prime larger 2 times input size) , baffled see straightforward fnv1 implementation this:

public int gethash(string s, int mod) {         final biginteger mod = new biginteger(integer.tostring(mod));         final biginteger fnv_offset_basis = new biginteger("14695981039346656037");         final biginteger fnv_prime = new biginteger("1099511628211");          biginteger hash = new biginteger(fnv_offset_basis.tostring());          (int = 0; < s.length(); i++) {             int charvalue = s.charat(i);              hash = hash.multiply(fnv_prime).mod(mod);             hash = hash.xor(biginteger.valueof((int) charvalue & 0xffff)).mod(mod);         }          return hash.mod(mod).intvalue();     } 

results in 64,000 collisions a lot, 20% of input basically. what's wrong implementation? approach somehow flawed?

edit: add that, i've tried , implemented other hashing algorithms sdbm , djb2 , perform just same, equally poorly. have these ~65k collisions on corpus. when changed corpus 350,000 integers represented strings, bit of variance starts occur (like 1 algorithms has 20,000 collisions , other has 40,000) still number of collision astoundingly high. why?

edit2: i've tested , java's built-in .hashcode() results in equally many collisions , if ridiculously naive, hash being product of multiplicating charcodes of characters modulo 810,049, performs half worse notorious algorithms (60k collisions vs. 90k naive approach).

since mod parameter your hash function presume range want hash normalized, i.e. specific use case expecting 810,049. assume because:

  1. the algorithm calls calculations done modulo 2n n number of bits in desired hash.
  2. given offset basis , fnv prime constants within module, , equal parameters 64-bit hash, value of mod should fixed @ 264.
  3. since not, assume desired final output range.

in other words, given fixed offset basis , fnv prime, there no reason pass in mod parameter -- dictated other 2 fnv parameters. if above correct implementation wrong. should doing calculations mod 264 , applying final remainder operation 810,049.

also (but may not important), algorithm calls xoring lower 8 bits ascii character, whereas xoring 16 bits. not sure make difference since ascii high-order byte 0 anyway , behave if xoring 8 bits.


Popular posts from this blog

php - How should I create my API for mobile applications (Needs Authentication) -

python 3.x - PyQt5 - Signal : pyqtSignal no method connect -

5 Reasons to Blog Anonymously (and 5 Reasons Not To)