The good news is I fixed it and made it quite useful.
My first test, to see how good my algorithm-writing skillz are, was to test for collisions. I started by creating a 512 megabyte bit array (enough space for one bit per value in the entire 32-bit range; see where I'm going?) and proceeded to process random strings to see what the collision rate would be. I just used the number of collisions divided by the total number of tests. My first couple of tries at the algorithm had something like a 90-95% collision rate. Not good. So I worked and poked and eventually came up with the following:
VALUE + (VALUE * (VALUE << 3) ^ (NEXT))
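If you want to run the same kind of test yourself, here's a minimal runtime sketch of the idea: the mixing step above folded over each character, with every resulting hash value marked off in the bit array. The names `mix` and `hash_string`, the zero seed, and the random string generation are all assumptions for the sketch; the real hash is built from compile-time macros.

```cpp
#include <cstdint>
#include <cstdio>
#include <random>
#include <string>
#include <vector>

// The mixing step above, as a plain function: VALUE is the running hash
// state and NEXT is the next character of the string.
static uint32_t mix(uint32_t value, uint32_t next)
{
    return value + (value * (value << 3) ^ next);
}

static uint32_t hash_string(const std::string& s)
{
    uint32_t h = 0;                      // assumed seed for this sketch
    for (unsigned char c : s)
        h = mix(h, c);
    return h;
}

int main()
{
    // 512 MB bit array: one bit for every possible 32-bit hash value.
    std::vector<uint8_t> seen((1ull << 32) / 8, 0);

    std::mt19937 rng(12345);
    std::uniform_int_distribution<int> length(4, 16);
    std::uniform_int_distribution<int> letter('a', 'z');

    const int tests = 1000000;
    int collisions = 0;
    for (int i = 0; i < tests; ++i)
    {
        std::string s;
        for (int n = length(rng); n > 0; --n)
            s += static_cast<char>(letter(rng));

        const uint32_t h = hash_string(s);
        uint8_t& byte = seen[h >> 3];
        const uint8_t bit = 1u << (h & 7);
        if (byte & bit)
            ++collisions;                // this hash value was seen before
        else
            byte |= bit;
    }

    std::printf("collision rate: %.2f%%\n", 100.0 * collisions / tests);
}
```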
That mixing step gave a great result on small, super random strings. I searched around for big hash tests to try it against and found a few good sites with lots of useful info. Bob Jenkins's algorithm is old and seems to be a standard (I've seen it around recently in some commercial projects). Paul Hsieh's is a nice all-around fast one. Peter Kankowski's is really fast and seems to do a really good job, and, luckily, he had the full source to his test application available.
I was in for a surprise when I got the test application set up. The first test seemed to go okay; the second one didn't. I eventually hit a point where I had more collisions than there was data. Odd. I checked out the source code, and it turns out Peter was adding up all the data that was skipped when a collision occurred. While interesting, I felt that clouded the results a bit; I only wanted to know when a collision occurred. So I fixed that up and also threw in a check for true collisions (when two different keys produce the exact same hash value, not just when they land in the same table element). This was the part that surprised me the most. While most of the hashing algorithms had over 100 table collisions per test, they all had 0 true collisions. I had hundreds. This is a huge problem.
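In code, the distinction I care about looks something like this (an illustrative sketch, not the actual test program; the names are made up for the example):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Count how often two keys land in the same table slot versus how often
// two *different* keys produce the exact same 32-bit hash value.
struct CollisionStats
{
    size_t table_collisions = 0;  // same slot, possibly different hash
    size_t true_collisions  = 0;  // identical hash for different keys
};

CollisionStats count_collisions(const std::vector<std::string>& keys,
                                uint32_t (*hash)(const std::string&),
                                size_t table_size)
{
    CollisionStats stats;
    std::vector<bool> slot_used(table_size, false);
    std::unordered_map<uint32_t, std::string> first_key_with_hash;

    for (const std::string& key : keys)
    {
        const uint32_t h = hash(key);

        // Table collision: the slot is already occupied. (The real test uses
        // the next power of two above the item count as the table size.)
        const size_t slot = h % table_size;
        if (slot_used[slot])
            ++stats.table_collisions;
        slot_used[slot] = true;

        // True collision: a different key already produced this exact hash.
        auto it = first_key_with_hash.find(h);
        if (it == first_key_with_hash.end())
            first_key_with_hash.emplace(h, key);
        else if (it->second != key)
            ++stats.true_collisions;
    }
    return stats;
}
```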
Back to the drawing board. I started working on the best algorithm I had come up with, adding shifts and multiplies and seeding with random values. Eventually, to my surprise, I did manage to come up with an algorithm that beat most of the others. Then someone mentioned trying a bigger test, like the Bible. So I grabbed that and voila... tons of collisions for me, very few for the others. I threw in an 80,000-word dictionary as well. I worked some more and now have a hash with 0 true collisions against the Bible and a huge dictionary. One note about the Bible: my const hash has a hard limit of 64-character strings. This is a compiler limit; I can go up to maybe 66 characters before the compile stops working. Because of this I split all the verses in the Bible into 64-character parts (see the sketch below) to ensure that every algorithm was tested on the same data.
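The splitting itself is simple, something along these lines (a minimal sketch; `load_chunked` is an illustrative name, not code from the test program):

```cpp
#include <fstream>
#include <string>
#include <vector>

// Split every line of a text file into chunks of at most `max_len` characters
// so all the hashes in the comparison see strings of the same bounded length.
std::vector<std::string> load_chunked(const std::string& path,
                                      size_t max_len = 64)
{
    std::vector<std::string> out;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line))
    {
        for (size_t pos = 0; pos < line.size(); pos += max_len)
            out.push_back(line.substr(pos, max_len));
    }
    return out;
}
```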
I was wondering why my algorithm took so much work to get even close, and it seems to come mainly from the limitations of getting it to work at compile time. All the good algorithms seed the hash state from the current character and then use the hash state to modify itself a few times. I can't do that; I can only reference the hash state once, so I have to rely on the character itself to affect the overall hash state. I managed to modify my macros to add some randomness based on the length of the string, which helped everything come together. Another interesting thing I found out is that the numbers I used are very specific: if I change any of the constants in the core algorithm it loses strength.
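To see why that constraint hurts, compare Bob Jenkins's one-at-a-time hash, which stirs the hash state several times per character, with the kind of single-expression fold a macro-expanded constant is stuck with. The second function below is only an illustrative runtime equivalent built from the mixing step shown earlier, not the actual macro header:

```cpp
#include <cstdint>

// Bob Jenkins's one-at-a-time hash: the state is updated three times per
// character, plus a final avalanche. Easy at run time, but not something a
// single compile-time macro step can do.
uint32_t one_at_a_time(const char* s)
{
    uint32_t h = 0;
    for (; *s; ++s)
    {
        h += static_cast<uint8_t>(*s);
        h += h << 10;
        h ^= h >> 6;
    }
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;
    return h;
}

// What a macro-expanded compile-time hash is limited to: each character is
// folded in through one expression, so the character itself has to carry most
// of the mixing work. (Illustrative only; the length seed stands in for the
// length-based randomness mentioned above.)
uint32_t fold_hash(const char* s, uint32_t len)
{
    uint32_t h = len;
    for (; *s; ++s)
        h = h + (h * (h << 3) ^ static_cast<uint8_t>(*s));
    return h;
}
```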
Well, enough about the details; how well does it stack up? Here are the results, using the same method as Peter Kankowski.
Legend:
Number after the filename: number of strings hashed.
First number: relative speed; the fastest hash appears as 1.00.
[]: number of collisions within a hash table whose size is the next power of two above the number of items in the file.
(): number of times items hashed to the exact same value (true collisions).
Green: one of the top three in this test.
Red: has items that hashed to the exact same value.
| Algorithm | bible.txt 82932 | dictionary.txt 80368 | dic_common_words.txt 557 | dic_longwords.txt 10145 | dic_numbers.txt 33842 | dic_postfix.txt 575 | dic_prefix.txt 575 | dic_win32.txt 1991 | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bernstein | 1.27 [12355] (1) | 1.05 [12517] (0) | 1.10 [73] (0) | 1.39 [1504] (0) | 1.07 [2310] (0) | 1.29 [89] (0) | 1.25 [77] (0) | 1.13 [485] (0) | 1.19 [3676] (0.125) |
| Kernighan&Ritchie | 1.28 [12480] (0) | 1.03 [12339] (2) | 1.07 [59] (0) | 1.41 [1460] (0) | 1.18 [3418] (0) | 1.27 [87] (0) | 1.26 [91] (0) | 1.14 [506] (0) | 1.21 [3805] (0.250) |
| x17 | 1.15 [12749] (201) | 1.00 [12793] (310) | 1.02 [69] (0) | 1.25 [1471] (0) | 2.22 [4187] (0) | 1.14 [78] (0) | 1.14 [88] (0) | 1.06 [483] (0) | 1.25 [3989] (63.875) |
| x65599 | 1.14 [12492] (0) | 1.02 [12487] (0) | 1.00 [62] (0) | 1.21 [1504] (0) | 1.35 [5370] (0) | 1.11 [89] (0) | 1.08 [79] (0) | 1.01 [452] (0) | 1.12 [4066] (0.000) |
| FNV-1a | 1.25 [12469] (1) | 1.08 [12412] (0) | 1.08 [81] (0) | 1.38 [1531] (0) | 1.00 [3523] (0) | 1.21 [81] (0) | 1.20 [67] (0) | 1.12 [506] (0) | 1.17 [3833] (0.125) |
| Weinberger | 1.79 [13755] (34) | 1.25 [20220] (692) | 1.24 [73] (0) | 2.19 [1530] (0) | 6.07 [14667] (0) | 1.90 [114] (1) | 1.88 [112] (0) | 1.48 [491] (0) | 2.23 [6370] (90.875) |
| Paul Hsieh | 1.00 [12434] (1) | 1.04 [12522] (45) | 1.03 [81] (0) | 1.00 [1514] (0) | 1.12 [5933] (2880) | 1.00 [73] (0) | 1.00 [82] (0) | 1.00 [488] (0) | 1.02 [4140] (365.750) |
| One At Time | 1.45 [12332] (1) | 1.15 [12496] (3) | 1.17 [58] (0) | 1.70 [1524] (0) | 1.12 [3451] (48) | 1.46 [70] (0) | 1.44 [74] (0) | 1.28 [493] (0) | 1.35 [3812] (6.500) |
| Arash Partow | 1.39 [12802] (0) | 1.11 [12646] (3) | 1.14 [60] (0) | 1.60 [1514] (0) | 1.26 [2907] (0) | 1.50 [91] (0) | 1.50 [102] (0) | 1.20 [503] (0) | 1.34 [3828] (0.375) |
| Const Hash (mine) | 1.46 [12412] (0) | 1.18 [12348] (0) | 1.18 [62] (0) | 1.72 [1513] (0) | 1.11 [3458] (0) | 1.41 [66] (0) | 1.42 [88] (0) | 1.22 [500] (0) | 1.34 [3805] (0.000) |
That's a lot of info to digest, so here's a summary:
I didn't come up with the fastest algorithm, unfortunately, but the only one other than mine with 0 true collisions is x65599. If you want the best runtime speed, go with Paul Hsieh's algorithm, although it really doesn't handle lots of consecutive strings well (a0000-a9999). If you want a good spread of values, Bernstein's seems the best. If you plan on referencing a lot of strings from your code base and don't want to pay the price of hashing them at runtime, that's where my algorithm is the winner. I think I came up with an okay algorithm: it's about 30% slower than the best on average, but it's tied for second-best spread, and it's one of only two with zero true collisions, which is always a plus.
Full Hash Comparison Code
I've modified Peter's test program to process all the text files in the local directory (along the lines of the sketch below) and to split every string to a maximum length of 64 characters. It also outputs nice HTML results now.
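Enumerating the input files is nothing fancy, something like this (an illustrative sketch only, not the modified test program itself):

```cpp
#include <filesystem>
#include <iostream>

// List every .txt file in the working directory so each one becomes a
// column in the comparison table.
int main()
{
    namespace fs = std::filesystem;
    for (const fs::directory_entry& entry : fs::directory_iterator("."))
    {
        if (entry.is_regular_file() && entry.path().extension() == ".txt")
            std::cout << entry.path().filename().string() << "\n";
    }
}
```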
The full comparison contains a separate header for my hashing algorithm. Unfortunately it needs a set of macros to achieve the compile-time evaluation. It's released under the Beerware license, so it's completely free to use.
Also check for an updated version of the original article, which will include a new Perl script with the latest changes.