I stumbled across a survey article on hashing. Although it was written in 1996 and some of its measurements look trivial by today's standards, it is still well worth reading today.

Link: http://www.drdobbs.com/database/hashing-rehashed/184409859

Full text:

Hashing algorithms occupy a unique place in the hearts of programmers. Discovered early on in computer science, they are among the most approachable algorithms and certainly the most enjoyable to tinker with. Moreover, the endless search for the Holy Grail, a perfect hash function for a given data set, still consumes considerable ink in each year's batch of computer-science journals. For developers who program for a living, however, recondite refinements to unusual hashing functions are of little use. In general, when you need a hash table, you want a quick, convenient hashing function. And unless you know where to get one, you will almost certainly fall prey to the fallacy that you can write and optimize one quickly for your particular application.

  In this article, I will discuss a good, general-purpose hashing algorithm and then analyze some of the standard wisdom on how hashing can be made more efficient. Finally, I'll examine the surprising effect of hardware advances on hashing--in particular, the widening gap between CPU performance and memory-access latency.

Basics

  Hash tables are just like most tables (the old-fashioned word for "arrays"). A table becomes a hash table when nonsequential access to a given element is achieved through the use of a hash function. A hash function is generally handed a data element, which it converts to an integer that is used as an index into the table. For example, if you had a hash table into which you wanted to store dates, you might take the Julian date (a positive integer assigned to every date since January 1, 4713 bc) and divide it by the size of the hash table. The remainder, termed the "hash key," would be your index into the hash table.
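
  As a minimal illustration of that idea (the table size of 101 below is an arbitrary choice for the example, not a value from the article):

#define TABLE_SIZE 101   /* illustrative table size */

/* Reduce a Julian date to a hash key: the remainder of dividing
 * the date by the table size becomes the index into the table. */
unsigned int date_hash(unsigned long julian_date)
{
    return (unsigned int)(julian_date % TABLE_SIZE);
}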

  In a closed hash table, the hash key is the index of the element into which you store the date and the data associated with that date. If the slot in the table is occupied by another date (this is termed a "collision"), you can either generate another hash key (called "nonlinear rehashing") or step through the table until you find an available slot (linear rehashing).
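
  A minimal sketch of linear rehashing in a closed table might look like the following; the Entry layout (an empty key marking a free slot) and the hash-function parameter are illustrative assumptions, not code from the article:

#include <stddef.h>

/* Hypothetical closed-table entry: an empty key marks a free slot. */
typedef struct { char key[32]; int value; } Entry;

/* Insert with linear rehashing: on a collision, step through the
 * table (wrapping around) until a free slot turns up. */
int closed_insert(Entry table[], size_t size, const Entry *item,
                  unsigned int (*hash)(const char *))
{
    size_t start = hash(item->key) % size;
    size_t probes;

    for (probes = 0; probes < size; ++probes)
    {
        size_t slot = (start + probes) % size;   /* linear step */
        if (table[slot].key[0] == '\0')          /* free slot found */
        {
            table[slot] = *item;
            return 1;
        }
    }
    return 0;                                    /* table is full */
}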

  In an open hash table (by far the most common), each element in the table is the head of a linked list of data items. In this model, a collision is handled by adding the colliding element to the linked list that begins at the slot in the table that both keys reference. This approach has the advantage that the table does not implicitly limit the number of elements it can hold, whereas the closed table cannot accept more data items than it has elements.
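
  For comparison, here is a minimal sketch of insertion into an open table; the Node type and the explicit copy of the word are illustrative choices, not the article's code:

#include <stdlib.h>
#include <string.h>

/* Hypothetical chain node for the open (chained) scheme. */
typedef struct Node {
    char        *word;
    unsigned     count;
    struct Node *next;
} Node;

/* Insert a word into an open table: a colliding word is simply pushed
 * onto the list hanging off the slot that both keys reference. */
int open_insert(Node *table[], size_t size, const char *word,
                unsigned int (*hash)(const char *))
{
    size_t  slot = hash(word) % size;
    Node   *n;

    for (n = table[slot]; n != NULL; n = n->next)
    {
        if (strcmp(n->word, word) == 0)
        {
            n->count++;                    /* word already present */
            return 1;
        }
    }

    n = malloc(sizeof *n);
    if (n == NULL)
        return 0;
    n->word = malloc(strlen(word) + 1);
    if (n->word == NULL)
    {
        free(n);
        return 0;
    }
    strcpy(n->word, word);
    n->count = 1;
    n->next  = table[slot];                /* push onto the head of the chain */
    table[slot] = n;
    return 1;
}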

Hash Functions 

  Hashing is useful anywhere a wide range of possible values needs to be stored in a small amount of memory and be retrieved with simple, near-random access. As such, hash tables occur frequently in database engines and especially in language processors such as compilers and assemblers, where they elegantly serve as symbol tables. In such applications, the table is a principal data structure. Because all accesses to the table must be done through the hash function, the function must fulfill two requirements: It must be fast and it must do a good job distributing keys throughout the table. The latter requirement minimizes collisions and prevents data items with similar values from hashing to just one part of the table.

  In most programs, the hash function occupies an insignificant amount of the execution time. In fact, it is difficult to write a useful program in which the hash function occupies as much as 1 percent of the execution time. If your profiler tells you otherwise, you need to change hash functions. I present here two excellent, general-purpose hash functions. HashPJW() (see Example 1) is based on work by Peter J. Weinberger of AT&T Bell Labs, and is the better known of the two. ElfHash() (Example 2) is a variant of HashPJW() that is used in UNIX object files that use the ELF format. The functions accept a pointer to a string and return an integer. The final hash key is the modulo generated by dividing the returned integer by the size of the table. The portable version of Weinberger's algorithm in Example 1 is attributable to Allen Holub. Unfortunately, Holub's version contains an error that was picked up in subsequent texts (including the first printing of Practical Algorithms for Programmers, by John Rex and myself). The bug does not prevent the code from working; it simply gives suboptimal hashes. The version in Example 1 is correct. Example 2 is taken from the "System V Application Binary Interface" specification. This algorithm is occasionally misprinted (see the Programming in Standard C manual for UnixWare, for instance) in a form that will give very bad results. The version presented here is correct.

Example 1: HashPJW(). An adaptation of Peter Weinberger's generic hashing algorithm.

/*--- HashPJW ---------------------------------------------------
 * An adaptation of Peter Weinberger's (PJW) generic hashing
 * algorithm based on Allen Holub's version. Accepts a pointer
 * to a datum to be hashed and returns an unsigned integer.
 *-------------------------------------------------------------*/
#include <limits.h>

#define BITS_IN_int     ( sizeof(int) * CHAR_BIT )
#define THREE_QUARTERS  ((int) ((BITS_IN_int * 3) / 4))
#define ONE_EIGHTH      ((int) (BITS_IN_int / 8))
#define HIGH_BITS       ( ~((unsigned int)(~0) >> ONE_EIGHTH ))

unsigned int HashPJW ( const char * datum )
{
    unsigned int hash_value, i;

    for ( hash_value = 0; *datum; ++datum )
    {
        hash_value = ( hash_value << ONE_EIGHTH ) + *datum;
        if (( i = hash_value & HIGH_BITS ) != 0 )
            hash_value =
                ( hash_value ^ ( i >> THREE_QUARTERS )) & ~HIGH_BITS;
    }
    return ( hash_value );
}

  Example 2: ElfHash(). The published hash algorithm used in the UNIX ELF format for object files.

/*--- ElfHash ---------------------------------------------------
 * The published hash algorithm used in the UNIX ELF format
 * for object files. Accepts a pointer to a string to be hashed
 * and returns an unsigned long.
 *-------------------------------------------------------------*/
unsigned long ElfHash ( const unsigned char *name )
{
    unsigned long h = 0, g;

    while ( *name )
    {
        h = ( h << 4 ) + *name++;
        if ( g = h & 0xF0000000 )
            h ^= g >> 24;
        h &= ~g;
    }
    return h;
}

  

  Both functions attain their speed by performing only additions and bit manipulations. No multiplication or division is performed, and the input data is dereferenced just once per byte.
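
  As a quick usage sketch (the table size below is an arbitrary prime chosen for illustration), turning either function's return value into a table index takes a single modulo:

#define WORD_TABLE_SIZE 30241u   /* illustrative prime table size */

unsigned int index_for(const char *word)
{
    return HashPJW(word) % WORD_TABLE_SIZE;   /* final hash key */
}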

  Because a program spends so little time hashing, you should not tinker with an algorithm to optimize its performance. You should focus instead on the quality of its hashes. To do this, you will need a small program that will hash a data set and print statistics on the quality of the hash. The wordlist.exe program (a self-extracting source-code file) is available electronically (see code link at the top of the page) and will test your functions. The file also contains some useful test data sets. The program reads a text file and stores every unique word and a count of its occurrences into a hash table. A variety of command-line switches (all documented) allow you to specify the size of the hash table, print table statistics, and even print all the unique words and their counts.
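
  The statistics such a program reports boil down to a simple walk over the table. A rough sketch, reusing the hypothetical Node type from the open-table example above, might look like this:

#include <stdio.h>

/* Report occupancy plus average and maximum chain length
 * for an open table built from Node chains. */
void chain_stats(Node *table[], size_t size)
{
    size_t i, used = 0, nodes = 0, longest = 0;

    for (i = 0; i < size; ++i)
    {
        size_t len = 0;
        Node  *n;

        for (n = table[i]; n != NULL; n = n->next)
            ++len;
        if (len > 0)
            ++used;
        if (len > longest)
            longest = len;
        nodes += len;
    }
    printf("occupancy %.1f%%, average chain %.2f, longest chain %lu\n",
           100.0 * (double)used / (double)size,
           used ? (double)nodes / (double)used : 0.0,
           (unsigned long)longest);
}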

  This test program shows that the two algorithms mentioned previously do a superb job distributing hash values throughout the table. In 16-bit environments, HashPJW() is marginally better at hashing text while ElfHash() is significantly better at hashing numbers. In 32-bit environments, the algorithms are identical.

Performance Concerns

  While so speedy an algorithm does not lend itself to further optimization, its hashes should be given attention if they are of poor quality. However, this attention should be viewed in context. Even hash-table-intensive programs (compilers do not figure in this group) will rarely be improved by more than 4 or 5 percent by migration from a so-so implementation to a perfect hash algorithm. So if going from poor to best under the most favorable circumstances generates only a 5 percent improvement, common sense suggests that tuning hash functions should be one of the very last optimizations you perform. This is doubly true because hash tuning requires considerable empirical testing: You do not want to improve performance on one data set while degrading it significantly on another.

  Again, unless you have exceptional needs, use one of the aforementioned functions. Several aspects of your hash table can be tuned easily to improve performance, however.

  The first of these is to make sure the table has a prime number of slots. As stated previously, the final hash index is the modulo of the hash value divided by the table size. The widest dispersal of modulo results will occur when the hash value and the table size are relatively prime (no common factors), which is most often assured when the table size is prime. For example, hashing the Bible, which consists of 42,829 unique words, into an open hash table with 30,241 elements (a prime number) shows that 76.6 percent of the slots were used and that the average chain was 1.85 words in length (with a maximum of 6). The same file run into a hash table of 30,240 elements (evenly divisible by integers 2 through 9) fills only 60.7 percent of the slots and the average chain is 2.33 words long (maximum: 10). This is a rather substantial improvement for adding only a single element to the table to make its size prime.
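
  If you want the table size chosen for you, a small helper that rounds a requested size up to the next prime is enough; the one below is illustrative, not taken from the article:

#include <stddef.h>

/* Trial-division primality test; plenty fast for one-off table sizes. */
static int is_prime(size_t n)
{
    size_t d;

    if (n < 2)
        return 0;
    if (n % 2 == 0)
        return n == 2;
    for (d = 3; d * d <= n; d += 2)
        if (n % d == 0)
            return 0;
    return 1;
}

/* Round a requested table size up to the next prime,
 * e.g. next_prime(30240) == 30241. */
size_t next_prime(size_t n)
{
    while (!is_prime(n))
        ++n;
    return n;
}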

  A second easy improvement is increasing the size of the hash table. This optimization is one of the classic memory-for-speed trade-offs and is effective as long as the data set has more elements than your hash table. (I'm still talking about open hash tables here. Closed hash tables will be discussed shortly.) Going back to the example of the Bible, when it was hashed (with HashPJW()) into a table of 7561 elements, 99.7 percent of the slots were occupied and the average chain was 5.68 words long; at 15,121 elements, occupancy was 93.9 percent and the average chain was 3.01 words. The values for a table of 30,241 elements were shown previously. In terms of performance for this example, quadrupling the table size (from 7561 to 30,241 elements) cut the average chain length to roughly a third. Consistent with the (overall) small role played by hash operations, the program's timed performance only improved by 1.7 percent. However, after all other optimizations are performed, this difference could be meaningful.

  Finally, be careful with the construction of the linked lists emanating from the table. Significant savings can be attained by not allocating memory each time a data element is added to the table, but by using a pool of nodes allocated as a single block. Likewise, the lists should be ordered, either by ascending value or by recency of access. Compiler symbol tables, for example, often place the most recently accessed node at the head of the list, on the theory that references to a variable will tend to be grouped together. This view is especially validated in languages such as C and Pascal, which have local variables.
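
  A rough sketch of both ideas, again using the hypothetical Node type from the open-table example, might look like this:

#include <string.h>

/* A pool of nodes carved out of one block, so insertions do not call
 * the allocator every time. The caller fills in block/capacity up front. */
typedef struct {
    Node   *block;
    size_t  used;
    size_t  capacity;
} NodePool;

Node *pool_alloc(NodePool *pool)
{
    return (pool->used < pool->capacity) ? &pool->block[pool->used++] : NULL;
}

/* Lookup with "recency of access" ordering: a found node is unlinked
 * and re-linked at the head of its chain. */
Node *find_move_to_front(Node *table[], size_t size, const char *word,
                         unsigned int (*hash)(const char *))
{
    size_t  slot = hash(word) % size;
    Node   *prev = NULL, *n;

    for (n = table[slot]; n != NULL; prev = n, n = n->next)
    {
        if (strcmp(n->word, word) == 0)
        {
            if (prev != NULL)              /* move the hit to the head */
            {
                prev->next  = n->next;
                n->next     = table[slot];
                table[slot] = n;
            }
            return n;
        }
    }
    return NULL;                           /* not in the table */
}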

Closed Hash Tables

As mentioned previously, the two primary methods of resolving hashing collisions in closed tables are linear and nonlinear rehashing. The idea behind nonlinear rehashing is that it helps disperse colliding values throughout the table.

Closed hash tables are particularly effective when the maximum size of the incoming data set is known and a table with many more elements can be allocated. It has been proven that once a closed table becomes more than 50 percent full, performance deteriorates significantly (see Data Structures and Program Design in C, by Robert Kruse, Bruce Leung, and Clovis Tondo, Prentice Hall, 1991). Closed tables are commonly used for rapid prototyping (they're easy to code quickly) and where fast access (no linked lists) is a priority and memory is easily available.

Traditional computer science has found that, given equal load (the ratio of occupied slots to table size), tables tend to perform better by using nonlinear rehashing than by stepping through the table. This view has generally been accepted without much debate because the math behind it is fairly compelling.

However, hardware advances (of all things!) are beginning to change the math in favor of linear rehashing. Specifically, the reasons are memory latency and CPU cache construction.

Memory Latency

  High-end CPUs today commonly operate at speeds of 250 MHz and above. At 250 MHz, each clock cycle takes 4 nanoseconds. By comparison, the fastest widely available dynamic RAM runs at 60 nanoseconds. In effect, this means that a single memory fetch is approximately 15 times slower than a CPU instruction. From this large disparity, you can see that your programs should avoid memory fetches as much as possible.

  Most CPUs today have data caches on board, and the more your programs use this memory, rather than general DRAM, the better they will perform. As with all caches, their benefit accrues only if the data you need to access is already in the cache (when it is, you have a cache "hit"). CPU caches tend to hold chunks of data brought in over previous operations, and these chunks tend to include data past the byte or bytes used for the one instruction. As a result, cache hits will increase if you access adjoining bytes in subsequent instructions. Hence, you will have far more cache hits if you use linear rehashing and simply query the adjoining table element than if you use nonlinear rehashing.

  How significant is this difference? The program memtime.c (available electronically, see the top of the page) shows you the effects of memory latency. It allocates a programmer-defined block and performs as many reads as there are bytes. In one pass, it does the reads randomly; in the second, it does the reads sequentially. Table 1 presents the results for 10 million reads into a 10-MB table.

Table 1: The results for 10 million reads into a 10-MB table.

Processor/Memory                   Sequential    Random
100-MHz Pentium/60-ns memory       1 second      7 seconds
66-MHz Cyrix 486/70-ns memory      2 seconds     9 seconds
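
  memtime.c itself is not reproduced here, but a rough sketch of such a measurement might look like the following; the 10-MB block, the use of clock(), and the doubled-up rand() calls (to widen the index range) are all illustrative choices:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BLOCK_SIZE (10UL * 1024UL * 1024UL)   /* 10 MB */

int main(void)
{
    unsigned char *block = malloc(BLOCK_SIZE);
    volatile unsigned char sink = 0;          /* keeps reads from being optimized away */
    clock_t t0, t1, t2;
    size_t  i;

    if (block == NULL)
        return 1;
    memset(block, 0, BLOCK_SIZE);             /* touch every page once */

    t0 = clock();
    for (i = 0; i < BLOCK_SIZE; ++i)          /* sequential pass */
        sink ^= block[i];
    t1 = clock();
    for (i = 0; i < BLOCK_SIZE; ++i)          /* random pass */
        sink ^= block[(((size_t)rand() << 15) ^ (size_t)rand()) % BLOCK_SIZE];
    t2 = clock();

    printf("sequential: %.2f s   random: %.2f s\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    free(block);
    return 0;
}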

  Predictably, the ratio becomes even more unfavorable when the program runs on operating systems with paged memory, such as UNIX. When the previous program was run on a 486/66-MHz PC with 70-ns memory running UnixWare, the sequential reads took 3 seconds and the random reads took 35 seconds.

  The point is clear: Memory accesses that occur together should be located together.

  You could argue that saving a handful of seconds across 10 million memory accesses is hardly worth the effort. In many cases this will be true. Nonetheless, coding for this optimization may often be a simple matter of choosing between equal efforts (as in the case of rehashing approaches). Moreover, every indication suggests that CPU speed will continue to increase, while DRAM access speeds have remained stable for several years. As time passes, the benefits of considering memory latency will increase.

  In the specific case of rehashing, the linear approach has an additional benefit: A new hash key does not have to be computed. So, for closed hash tables, keep them sparsely populated and use linear rehashing.

  Memory latency is bound to become more significant in other areas of programming. Bear this in mind as you code.
