Compile Time hashing 2.0

So I'm sitting here, all proud of my success with compile-time hashing, and I'm thinking: does it actually work? Short answer: not so much.

The good news is I fixed it and made it quite useful.

My first test of my algorithm-writing skillz was to check for collisions. I started by creating a 512-megabyte bit array (enough space to mark every value in the 32-bit range, see where I'm going) and proceeded to hash random strings to see what the collision rate would be, measured as the number of collisions divided by the total number of tests. My first couple of tries had something like a 90-95% collision rate. Not good. So I worked and poked and eventually came up with the following:
VALUE + (VALUE * (VALUE << 3) ^ (NEXT))

This gave great results on short, highly random strings. I searched around for big hash tests to try it against and found the following sites:


There's a lot of good info on all those sites. Bob Jenkins's algorithm is old and seems to be a standard (I've seen it around recently in some commercial projects). Paul Hsieh's is a nice all-around fast one. Peter Kankowski's is really fast and seems to do a really good job, and, luckily, he had the full source of his test application available.

I was in for a surprise when I got the test application set up. The first test seemed to go OK; the second one didn't. I eventually hit a point where I had more collisions than there was data. Odd. I checked the source code, and it turns out Peter was adding up all the data that was skipped when a collision occurred. While interesting, I felt it clouded the results a bit; I only wanted to know when a collision occurred. So I fixed that up and also threw in a test for true collisions (when two keys hash to the exact same value, not just to the same table element). This was the part that surprised me the most. While most of the hashing algorithms had over 100 table collisions per test, they all had 0 true collisions. I had hundreds. This is a huge problem.

Back to the drawing board. I started from the best algorithm I had, adding shifts and multiplies and seeding with random values. Eventually, to my surprise, I did manage to come up with an algorithm that beat most of the others. Then someone suggested trying a bigger test, like the bible. So I grabbed that and voila... tons of collisions for me, very few for the others. I threw in an 80,000-word dictionary as well. I worked some more and now have a hash with 0 true collisions against the bible and a huge dictionary. One note about the bible: my const hash has a hard limit of 64-character strings. This is a compiler limit; I can go up to maybe 66 characters before the compile stops working. Because of this, I split all the verses in the bible into 64-character parts to ensure that all the algorithms were compared evenly.

I was wondering why my algorithm took so much work to get even close. It seems to mainly be due to the limitations of working at compile time. All the good algorithms seed the hash state from the current character and then use the hash state to modify itself a few times. I can't do that: I can only reference the hash state once per step, so I have to rely on the character itself to affect the overall hash state. I managed to modify my macros to add some randomness based on the length of the string, which helped everything come together. Another interesting thing I found is that all the numbers I used are very specific; if I change any of the constants in the core algorithm, it loses strength.

Well, enough about the details; how well does it stack up? Here are the results, using the same method as Peter Kankowski.

Legend:
Number by filename: number of strings hashed.
First number: relative speed; the fastest hash appears as 1.00.
[ ]: number of collisions within a hash table sized to the next power of two above the number of items in the file.
( ): number of times items hashed to the exact same value (true collisions).
Green: one of the top three in this test.
Red: has items that hashed to the same value.

| Hash | bible.txt (82932) | dictionary.txt (80368) | dic_common_words.txt (557) | dic_longwords.txt (10145) | dic_numbers.txt (33842) | dic_postfix.txt (575) | dic_prefix.txt (575) | dic_win32.txt (1991) | Average |
|---|---|---|---|---|---|---|---|---|---|
| Bernstein | 1.27 [12355] (1) | 1.05 [12517] (0) | 1.10 [73] (0) | 1.39 [1504] (0) | 1.07 [2310] (0) | 1.29 [89] (0) | 1.25 [77] (0) | 1.13 [485] (0) | 1.19 [3676] (0.125) |
| Kernighan&Ritchie | 1.28 [12480] (0) | 1.03 [12339] (2) | 1.07 [59] (0) | 1.41 [1460] (0) | 1.18 [3418] (0) | 1.27 [87] (0) | 1.26 [91] (0) | 1.14 [506] (0) | 1.21 [3805] (0.250) |
| x17 | 1.15 [12749] (201) | 1.00 [12793] (310) | 1.02 [69] (0) | 1.25 [1471] (0) | 2.22 [4187] (0) | 1.14 [78] (0) | 1.14 [88] (0) | 1.06 [483] (0) | 1.25 [3989] (63.875) |
| x65599 | 1.14 [12492] (0) | 1.02 [12487] (0) | 1.00 [62] (0) | 1.21 [1504] (0) | 1.35 [5370] (0) | 1.11 [89] (0) | 1.08 [79] (0) | 1.01 [452] (0) | 1.12 [4066] (0.000) |
| FNV-1a | 1.25 [12469] (1) | 1.08 [12412] (0) | 1.08 [81] (0) | 1.38 [1531] (0) | 1.00 [3523] (0) | 1.21 [81] (0) | 1.20 [67] (0) | 1.12 [506] (0) | 1.17 [3833] (0.125) |
| Weinberger | 1.79 [13755] (34) | 1.25 [20220] (692) | 1.24 [73] (0) | 2.19 [1530] (0) | 6.07 [14667] (0) | 1.90 [114] (1) | 1.88 [112] (0) | 1.48 [491] (0) | 2.23 [6370] (90.875) |
| Paul Hsieh | 1.00 [12434] (1) | 1.04 [12522] (45) | 1.03 [81] (0) | 1.00 [1514] (0) | 1.12 [5933] (2880) | 1.00 [73] (0) | 1.00 [82] (0) | 1.00 [488] (0) | 1.02 [4140] (365.750) |
| One At Time | 1.45 [12332] (1) | 1.15 [12496] (3) | 1.17 [58] (0) | 1.70 [1524] (0) | 1.12 [3451] (48) | 1.46 [70] (0) | 1.44 [74] (0) | 1.28 [493] (0) | 1.35 [3812] (6.500) |
| Arash Partow | 1.39 [12802] (0) | 1.11 [12646] (3) | 1.14 [60] (0) | 1.60 [1514] (0) | 1.26 [2907] (0) | 1.50 [91] (0) | 1.50 [102] (0) | 1.20 [503] (0) | 1.34 [3828] (0.375) |
| Const Hash (mine) | 1.46 [12412] (0) | 1.18 [12348] (0) | 1.18 [62] (0) | 1.72 [1513] (0) | 1.11 [3458] (0) | 1.41 [66] (0) | 1.42 [88] (0) | 1.22 [500] (0) | 1.34 [3805] (0.000) |

That's a lot of info to digest so here's a summary:

I didn't come up with the fastest algorithm, unfortunately, but the only one other than mine with 0 true collisions is x65599. If you want the best runtime speed, go with Paul Hsieh's algorithm, although it really doesn't handle lots of consecutive strings well (a0000-a9999). If you want a good spread of values, Bernstein's seems the best. If you plan on referencing a lot of strings from your code base and you don't want to pay the runtime price of hashing them, that's where my algorithm wins. I think I came up with an OK algorithm: it's about 30% slower than the best on average, but it's tied for second-best spread, and it is one of only two with zero true collisions, which is always a plus.

Full Hash Comparison Code

I've modified Peter's test program to process all text files in the local directory and to split all strings to ensure a maximum 64-character length. It also outputs nice HTML results now.

The full comparison contains a separate header for my hashing algorithm. Unfortunately, it needs a set of macros to achieve the compile-time evaluation. It's released under the Beerware license, so it's completely free to use.

Also check for an updated version of the original article, which will include a new Perl script with the latest changes.

Comments   

 
# st4lk3r 2008-06-26 17:31
Hi!
Where can I find your final source code?
 
 
# danger 2008-07-01 17:41
Yeah, the download link contains back-slashes when it should use forward-slashes.
 
 
# st4lk3r 2008-06-26 17:43
Oh sorry XD I found it... but I have a question. It's not totally compile time, or am I wrong? Because when I try to write code like:

switch(id)
{
case HASH("MyWonderfulID"):
break;
}

the compiler gives me an error about the non-const value of the HASH macro.
 
 
# danger 2008-07-01 17:53
Yes, this is weird. The version of consthashmacro.h included in the source archive definitely doesn't seem to be compile-time. When I use it in a simple test program, it causes about 200 bytes of code bloat per call to CONSTHASH().

I tried regenerating the file with:
Code:GenConstHash.pl -l 64 -r CONSTHASH -t consthashmacro2.h
and the resulting file contained a slightly different definition of the CONSTHASH_FUNCTION() macro.

Yours: #define CONSTHASH_FUNCTION(next, value) ((value
 
 
# danger 2008-07-01 17:54
Yours: #define CONSTHASH_FUNCTION(next, value) ((value
 
 
# danger 2008-07-01 17:55
Okay, I give up; it's eating my comments :-) Suffice to say, the two macros are slightly different.

Other differences: the header I generated still caused a compile error when used in a case statement (see st4lk3r's comment above). However, it did successfully reduce CONSTHASH() calls to a compile-time constant, adding just a few bytes of code bloat per hash. Unfortunately, running the new version through your test suite generated a very large number of collisions.

Was the version of the header in your source archive generated using a different version of GenConstHash.pl? Do you have any ideas why these problems might be occurring?

(My tests were performed with GCC 3.4.2 under CygWin, for what it's worth).

One last thing: thanks a ton; this is an excellent article and an extremely useful piece of code, even if it does have a few kinks remaining. I look forward to future revisions!
 
 
# csavoie 2008-07-08 18:54
:confused:

It's definitely possible that the versions are different. Though I should say that with the latest changes I was only testing in Visual Studio 2008, so it's entirely possible that certain macro configurations might not work.

I'll check them as soon as I get a chance, heading into alpha now so I don't have much mental energy.

Thanks for all the responses.
 
 
# danger 2008-07-09 00:14
Yeah, crunch time definitely takes priority :-)

Quickly, though -- I just ran the same tests with Visual C++ 2008 Express Edition, and ended up with even worse results than I had with GCC: now, neither the downloaded header nor the freshly-generated header generate fully compile-time hashes. Are you using the full Professional version of MSVC, perhaps?
 
 
# Which one is your best hash function? Tommyknocker 2009-02-11 02:27
Hello Chris,
I am confused by your different hash functions. When I download "Full Hash Comparison Code" and look into your consthashmacro.h, I see the following code:
#define CONSTHASH_FUNCTION(next, value) ((value
 
 
# Tommyknocker 2009-02-11 02:29
Hello Chris,
I am confused by your different hash functions. When I download "Full Hash Comparison Code" and look into your consthashmacro.h, I see the following code:
#define CONSTHASH_FUNCTION(next, value) ((value
 
 
# Tommyknocker 2009-02-11 02:30
 
 
# Tommyknocker 2009-02-11 02:33
OK, what's wrong with this comment input thing??? It just cuts my comments!

Another try: look at the Hash functions from your test "Full Hash Comparison Code" and from your GenConstHash.pl script. They are different, not only slightly different. So which one is the better one, which gave you the best results?

Thanks in advance for your answer :-)
 
 
# RE: Compile Time hashing 2.0 borislav 2010-12-10 08:23
You need compiler optimization turned on for it to generate a compile-time constant.

For CONSTHASH("deba"), gcc -O3 generates:
movl $2084699690, %eax

Without optimization it generates 2700 lines of computations.
 