Reviewing the MSDN article on hash tables
published: Fri, 20-Feb-2004 | updated: Thu, 27-Oct-2005
There's a blog I've subscribed to that talks about technical writing (Scott on Writing). It's written by a writer/trainer/consultant in ASP.NET called Scott Mitchell. I can't remember where I originally got the link, but it's quite interesting, at least to a fellow writer.
He had a post the other day where he was drawing attention to a series of articles he was writing for MSDN on Data Structures (the first in the series is here). Wow, my area of expertise! Time to check it out.
All in all, it's a good series and well written too, if a little rambling. But, hey, I'm prone to that as well. It explains the basics very well. Nevertheless, there were a few issues that leapt out at me.
The first issue that I noticed was in the article that included hash tables, one of my favorite data structures. In it, Scott talks about the basics of hash tables before introducing the Hashtable class from System.Collections. In the section on collision resolution, he talks about linear probing and the problems it engenders, and then dismisses quadratic probing in a single paragraph.
Now, I'm not quibbling about the dismissal of quadratic probing (it's not very good, all told), but I am about his definition of it. With linear probing, if you are adding an item, you hash the item's key to H, say, and check the slot in the hash table at H. If occupied, you then probe subsequent slots at H+1, H+2, H+3, etc, until you find an empty slot. (Of course, these additions are done modulo the hash table size, M.) He then defines quadratic probing as probing H+1, H-1, H+4, H-4, H+9, H-9, etc, instead. I've never seen this definition, so I checked through my algorithms books.
- Cormen, Thomas H., Charles E. Leiserson, and Ronald L. Rivest, Introduction to Algorithms, MIT Press, Cambridge, MA, 1994, uses the sequence (H + an + bn^2) % M, for some constants a and b and for probe number n. This is nothing like the sequence Scott quoted, yet the book is listed in his references. (The relationship between a, b, and M that ensures you probe every slot in the table is quite subtle.)
- Weiss, Mark Allen, Data Structures and Algorithm Analysis, 2nd Edition, Benjamin/Cummings, Redwood City, CA, 1995, uses (H + n^2) % M. This is the version I've always understood quadratic probing to mean.
- Wood, Derick, Data Structures, Algorithms, and Performance, Addison Wesley, Reading, MA, 1993, uses (H + n^2) % M.
- Wirth, Niklaus, Algorithms + Data Structures = Programs, Prentice Hall, Englewood Cliffs, NJ, 1976, uses (H + n^2) % M.
- Robert Sedgewick has not written about quadratic probing, preferring to move from linear probing directly to double hashing when discussing collision resolution in his books.
And finally...
- Knuth, Donald E., The Art of Computer Programming, Volume 3: Sorting and Searching, 2nd Edition, Addison Wesley, Reading, MA, 1998, does not really cover quadratic probing at all, except in Exercise 20 in section 6.4. The probing sequence used in the exercise is H, H-1, H-3, H-6, H-10, etc, or (H - n(n+1)/2) % M, which is a special case of Cormen et al. with a = b = -1/2. The exercise is to show that this sequence probes every slot if M is a power of 2 (the sketch below checks exactly that). Since in practice we hardly ever create hash tables with a non-prime number of slots, this just elicits a "So what?" from me.
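Out of curiosity, here's a quick check of the exercise's claim; a minimal sketch in Python, where the table size and home slot are arbitrary values chosen purely for illustration.
    # With M a power of two, the sequence H - n(n+1)/2 (mod M) visits every slot.
    M = 16                                   # power of two, not a prime
    H = 7                                    # arbitrary home slot
    slots = {(H - n * (n + 1) // 2) % M for n in range(M)}
    print(sorted(slots))                     # [0, 1, 2, ..., 15] -- all M slots
    assert len(slots) == M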
And why is quadratic probing so bad? Using the generally accepted definition in a hash table with a prime number of slots, it's not guaranteed that all slots will be visited; in fact, only about half of them ((M + 1)/2, to be precise) can ever be reached. Also, if two keys hash to the same value H, say, their probe sequences will be identical as well, leading to a condition called secondary clustering. The sketch below shows both effects.
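Here's a minimal sketch of textbook quadratic probing, (H + n^2) % M; the prime table size and home slot are arbitrary values picked for illustration.
    M = 11                                   # prime table size
    def quad_probes(H):
        # textbook quadratic probing: H, H+1, H+4, H+9, ... (mod M)
        return [(H + n * n) % M for n in range(M)]
    print(sorted(set(quad_probes(3))))       # [1, 3, 4, 6, 7, 8] -- only 6 of the 11 slots
    # The chain depends only on H, so any two keys whose hashes land on the
    # same home slot collide on every probe: secondary clustering.
    print(quad_probes(3))                    # [3, 4, 7, 1, 8, 6, 6, 8, 1, 7, 4]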
And then we come to the most appalling bit of proofreading. He talks about a different kind of collision resolution generally called double hashing, but he uses the term rehashing instead. Except that sometimes it's written in the article as rehasing (5 times) instead of rehashing (4 times).
The next mistake is in the description of the hash function that the Hashtable class uses for probing: it has an off-by-one error. In fact, the sequence starts off with H % M (H being the result of calling GetHash() and M being the size of the hash table, a prime), followed by:
(H + n * (1 + (((H >> 5) + 1) % (M - 1)))) % M
for each probe n. (In fact, the expression E = 1 + (((H >> 5) + 1) % (M - 1)) is calculated once at the beginning of the probe loop, so that the sequence can be quickly calculated as H % M, (H + E) % M, (H + 2E) % M, etc.)
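To make the shape of that sequence concrete, here's a minimal sketch of the probe arithmetic just described; the values of H and M are made up for illustration, and this is only the arithmetic, not the actual Hashtable source.
    def dotnet_probes(H, M):
        # the sequence described above: H % M, (H + E) % M, (H + 2E) % M, ...
        E = 1 + (((H >> 5) + 1) % (M - 1))   # computed once per probe loop
        return [(H + n * E) % M for n in range(M)]
    # E is derived from H alone, so any two keys whose GetHash() results are
    # equal walk exactly the same chain of slots.
    print(dotnet_probes(987654, 11))         # [8, 3, 9, 4, 10, 5, 0, 6, 1, 7, 2]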
The penultimate mistake is his assertion that this rehashing algorithm the .NET developers used is somehow better than linear or quadratic probing, in that it "provides better collision avoidance." This is patently false, since the effect of secondary clustering is still present. Why? Because if two keys hash to the same value, H, then their probe chains will be exactly the same (this is the definition of secondary clustering).
The 'rigorous' double hashing algorithm (if I may call it that) requires the use of two independent hash functions, the premise being that if two keys hash equal with one hash function they won't with the other. So if a key hashes to A and B with the two functions, then the probe sequence is A % M, (A + B) % M, (A + 2B) % M, etc. For another key that also hashes to A with the first hash function but hashes to C with the second, the probe sequence will be completely different, removing the secondary clustering effect. If you like, the .NET developers decided to use double hashing with two hash functions where the first does some real hashing, but the second is merely a mathematical mapping of the first. The two hash functions are not independent at all, which is why they produce the same secondary clustering as quadratic probing.
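For contrast, here's a minimal sketch of the probe arithmetic for the 'rigorous' scheme; the hash values A = 4, B = 3, and C = 7 are made up purely to show how an independent second hash pulls the chains apart.
    M = 11                                   # prime table size
    def probe_chain(primary, step):
        # double hashing: primary, primary + step, primary + 2*step, ... (mod M)
        return [(primary + n * step) % M for n in range(M)]
    # Two keys colliding on the first hash (A = 4 for both) but not on the
    # second (B = 3 versus C = 7) follow completely different chains:
    print(probe_chain(4, 3))                 # [4, 7, 10, 2, 5, 8, 0, 3, 6, 9, 1]
    print(probe_chain(4, 7))                 # [4, 0, 7, 3, 10, 6, 2, 9, 5, 1, 8]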
This, incidentally, negates one of the final points that Scott writes about: the effect of the load factor. He states that "the expected number of probes needed when a collision occurs is at most 1 / (1 - F), where F is the load factor". As Knuth correctly points out, this is the expected number of probes for an unsuccessful search using a double hashing algorithm, provided that the two hash functions are independent. Since we've seen that the .NET version does not use independent hash functions, this is false; in fact, Knuth gives the correct expression as:
(1 / (1 - F)) - F - ln(1 - F)
So for a load factor of 0.72 (Microsoft's magic number), this evaluates to 4.1 probes per unsuccessful search (compared with the 3.5 Scott quoted, which should more accurately be 3.6). The expected number of probes for a successful search is 1.9, whereas if the hash functions were independent, it would be 1.8.
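For the record, here's the arithmetic behind those figures; the unsuccessful-search expressions are the ones quoted above, and the successful-search values assume Knuth's standard formulas for the secondary-clustering and independent cases.
    from math import log
    F = 0.72                                          # Microsoft's magic load factor
    print(round(1 / (1 - F), 1))                      # 3.6 -- unsuccessful search, independent hashes
    print(round(1 / (1 - F) - F - log(1 - F), 1))     # 4.1 -- unsuccessful search, .NET-style hashes
    print(round(1 + log(1 / (1 - F)) - F / 2, 1))     # 1.9 -- successful search, .NET-style hashes
    print(round(log(1 / (1 - F)) / F, 1))             # 1.8 -- successful search, independent hashes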