I stumbled across a survey article on hashing. Although it was written in 1996 and some of its measurements look trivial by today's standards, it is still well worth reading today.

Link: http://www.drdobbs.com/database/hashing-rehashed/184409859

Full text:

Hashing algorithms occupy a unique place in the hearts of programmers. Discovered early on in computer science, they are among the most approachable algorithms and certainly the most enjoyable to tinker with. Moreover, the endless search for the Holy Grail, a perfect hash function for a given data set, still consumes considerable ink in each year's batch of computer-science journals. For developers who program for a living, however, recondite refinements to unusual hashing functions are of little use. In general, when you need a hash table, you want a quick, convenient hashing function. And unless you know where to get one, you will almost certainly fall prey to the fallacy that you can write and optimize one quickly for your particular application.

  In this article, I will discuss a good, general-purpose hashing algorithm and then analyze some of the standard wisdom on how hashing can be made more efficient. Finally, I'll examine the surprising effect of hardware advances on hashing--in particular, the widening gap between CPU performance and memory-access latency.

Basics

  Hash tables are just like most tables (the old-fashioned word for "arrays"). A table becomes a hash table when nonsequential access to a given element is achieved through the use of a hash function. A hash function is generally handed a data element, which it converts to an integer that is used as an index into the table. For example, if you had a hash table into which you wanted to store dates, you might take the Julian date (a positive integer assigned to every date since January 1, 4713 bc) and divide it by the size of the hash table. The remainder, termed the "hash key," would be your index into the hash table.
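
  As a minimal illustration of that idea (the table size of 101 below is an arbitrary choice for the example, not a value from the article):

#define TABLE_SIZE 101   /* illustrative table size */

/* Reduce a Julian date to a hash key: the remainder of dividing
 * the date by the table size becomes the index into the table. */
unsigned int date_hash(unsigned long julian_date)
{
    return (unsigned int)(julian_date % TABLE_SIZE);
}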

  In a closed hash table, the hash key is the index of the element into which you store the date and the data associated with that date. If the slot in the table is occupied by another date (this is termed a "collision"), you can either generate another hash key (called "nonlinear rehashing") or step through the table until you find an available slot (linear rehashing).
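
  A minimal sketch of linear rehashing in a closed table might look like the following; the Entry layout (an empty key marking a free slot) and the hash-function parameter are illustrative assumptions, not code from the article:

#include <stddef.h>

/* Hypothetical closed-table entry: an empty key marks a free slot. */
typedef struct { char key[32]; int value; } Entry;

/* Insert with linear rehashing: on a collision, step through the
 * table (wrapping around) until a free slot turns up. */
int closed_insert(Entry table[], size_t size, const Entry *item,
                  unsigned int (*hash)(const char *))
{
    size_t start = hash(item->key) % size;
    size_t probes;

    for (probes = 0; probes < size; ++probes)
    {
        size_t slot = (start + probes) % size;   /* linear step */
        if (table[slot].key[0] == '\0')          /* free slot found */
        {
            table[slot] = *item;
            return 1;
        }
    }
    return 0;                                    /* table is full */
}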

  In an open hash table (by far the most common), each element in the table is the head of a linked list of data items. In this model, a collision is handled by adding the colliding element to the linked list that begins at the slot in the table that both keys reference. This approach has the advantage that the table does not implicitly limit the number of elements it can hold, whereas the closed table cannot accept more data items than it has elements.
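
  For comparison, here is a minimal sketch of insertion into an open table; the Node type and the explicit copy of the word are illustrative choices, not the article's code:

#include <stdlib.h>
#include <string.h>

/* Hypothetical chain node for the open (chained) scheme. */
typedef struct Node {
    char        *word;
    unsigned     count;
    struct Node *next;
} Node;

/* Insert a word into an open table: a colliding word is simply pushed
 * onto the list hanging off the slot that both keys reference. */
int open_insert(Node *table[], size_t size, const char *word,
                unsigned int (*hash)(const char *))
{
    size_t  slot = hash(word) % size;
    Node   *n;

    for (n = table[slot]; n != NULL; n = n->next)
    {
        if (strcmp(n->word, word) == 0)
        {
            n->count++;                    /* word already present */
            return 1;
        }
    }

    n = malloc(sizeof *n);
    if (n == NULL)
        return 0;
    n->word = malloc(strlen(word) + 1);
    if (n->word == NULL)
    {
        free(n);
        return 0;
    }
    strcpy(n->word, word);
    n->count = 1;
    n->next  = table[slot];                /* push onto the head of the chain */
    table[slot] = n;
    return 1;
}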

Hash Functions 

  Hashing is useful anywhere a wide range of possible values needs to be stored in a small amount of memory and be retrieved with simple, near-random access. As such, hash tables occur frequently in database engines and especially in language processors such as compilers and assemblers, where they elegantly serve as symbol tables. In such applications, the table is a principal data structure. Because all accesses to the table must be done through the hash function, the function must fulfill two requirements: It must be fast and it must do a good job distributing keys throughout the table. The latter requirement minimizes collisions and prevents data items with similar values from hashing to just one part of the table.

  In most programs, the hash function occupies an insignificant amount of the execution time. In fact, it is difficult to write a useful program in which the hash function occupies as much as 1 percent of the execution time. If your profiler tells you otherwise, you need to change hash functions. I present here two excellent, general-purpose hash functions. HashPJW() (see Example 1) is based on work by Peter J. Weinberger of AT&T Bell Labs, and is the better known of the two. ElfHash() (Example 2) is a variant of HashPJW() that is used in UNIX object files that use the ELF format. The functions accept a pointer to a string and return an integer. The final hash key is the modulo generated by dividing the returned integer by the size of the table. The portable version of Weinberger's algorithm in Example 1 is attributable to Allen Holub. Unfortunately, Holub's version contains an error that was picked up in subsequent texts (including the first printing of Practical Algorithms for Programmers, by John Rex and myself). The bug does not prevent the code from working; it simply gives suboptimal hashes. The version in Example 1 is correct. Example 2 is taken from the "System V Application Binary Interface" specification. This algorithm is occasionally misprinted (see the Programming in Standard C manual for UnixWare, for instance) in a form that will give very bad results. The version presented here is correct.

Example 1: HashPJW(). An adaptation of Peter Weinberger's generic hashing algorithm.

/*--- HashPJW ---------------------------------------------------
 * An adaptation of Peter Weinberger's (PJW) generic hashing
 * algorithm based on Allen Holub's version. Accepts a pointer
 * to a datum to be hashed and returns an unsigned integer.
 *-------------------------------------------------------------*/
#include <limits.h>

#define BITS_IN_int     ( sizeof(int) * CHAR_BIT )
#define THREE_QUARTERS  ((int) ((BITS_IN_int * 3) / 4))
#define ONE_EIGHTH      ((int) (BITS_IN_int / 8))
#define HIGH_BITS       ( ~((unsigned int)(~0) >> ONE_EIGHTH ))

unsigned int HashPJW ( const char * datum )
{
    unsigned int hash_value, i;

    for ( hash_value = 0; *datum; ++datum )
    {
        hash_value = ( hash_value << ONE_EIGHTH ) + *datum;
        if (( i = hash_value & HIGH_BITS ) != 0 )
            hash_value =
                ( hash_value ^ ( i >> THREE_QUARTERS )) & ~HIGH_BITS;
    }
    return ( hash_value );
}

  Example 2: ElfHash(). The published hash algorithm used in the UNIX ELF format for object files.

/*--- ElfHash ---------------------------------------------------
 * The published hash algorithm used in the UNIX ELF format
 * for object files. Accepts a pointer to a string to be hashed
 * and returns an unsigned long.
 *-------------------------------------------------------------*/
unsigned long ElfHash ( const unsigned char *name )
{
    unsigned long h = 0, g;

    while ( *name )
    {
        h = ( h << 4 ) + *name++;
        if ( g = h & 0xF0000000 )
            h ^= g >> 24;
        h &= ~g;
    }
    return h;
}

  

  Both functions attain their speed by performing only additions and bit manipulations. No multiplication or division is performed, and the input data is dereferenced just once per byte.
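
  As a quick usage sketch (the table size below is an arbitrary prime chosen for illustration), turning either function's return value into a table index takes a single modulo:

#define WORD_TABLE_SIZE 30241u   /* illustrative prime table size */

unsigned int index_for(const char *word)
{
    return HashPJW(word) % WORD_TABLE_SIZE;   /* final hash key */
}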

  Because a program spends so little time hashing, you should not tinker with an algorithm to optimize its performance. You should focus instead on the quality of its hashes. To do this, you will need a small program that will hash a data set and print statistics on the quality of the hash. The wordlist.exe program (a self-extracting source-code file) is available electronically (see code link at the top of the page) and will test your functions. The file also contains some useful test data sets. The program reads a text file and stores every unique word and a count of its occurrences into a hash table. A variety of command-line switches (all documented) allow you to specify the size of the hash table, print table statistics, and even print all the unique words and their counts.
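
  The statistics such a program reports boil down to a simple walk over the table. A rough sketch, reusing the hypothetical Node type from the open-table example above, might look like this:

#include <stdio.h>

/* Report occupancy plus average and maximum chain length
 * for an open table built from Node chains. */
void chain_stats(Node *table[], size_t size)
{
    size_t i, used = 0, nodes = 0, longest = 0;

    for (i = 0; i < size; ++i)
    {
        size_t len = 0;
        Node  *n;

        for (n = table[i]; n != NULL; n = n->next)
            ++len;
        if (len > 0)
            ++used;
        if (len > longest)
            longest = len;
        nodes += len;
    }
    printf("occupancy %.1f%%, average chain %.2f, longest chain %lu\n",
           100.0 * (double)used / (double)size,
           used ? (double)nodes / (double)used : 0.0,
           (unsigned long)longest);
}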

  This test program shows that the two algorithms mentioned previously do a superb job distributing hash values throughout the table. In 16-bit environments, HashPJW() is marginally better at hashing text while ElfHash() is significantly better at hashing numbers. In 32-bit environments, the algorithms are identical.

Performance Concerns

  While so speedy an algorithm does not lend itself to further optimization, its hashes should be given attention if they are of poor quality. However, this attention should be viewed in context. Even hash-table-intensive programs (compilers do not figure in this group) will rarely be improved by more than 4 or 5 percent by migration from a so-so implementation to a perfect hash algorithm. So if going from poor to best under the most favorable circumstances generates only a 5 percent improvement, common sense suggests that tuning hash functions should be one of the very last optimizations you perform. This is doubly true because hash tuning requires considerable empirical testing: You do not want to improve performance on one data set while degrading it significantly on another.

  Again, unless you have exceptional needs, use one of the aforementioned functions. Several aspects of your hash table can be tuned easily to improve performance, however.

  The first of these is to make sure the table has a prime number of slots. As stated previously, the final hash index is the modulo of the hash value divided by the table size. The widest dispersal of modulo results will occur when the hash value and the table size are relatively prime (no common factors), which is most often assured when the table size is prime. For example, hashing the Bible, which consists of 42,829 unique words, into an open hash table with 30,241 elements (a prime number) shows that 76.6 percent of the slots were used and that the average chain was 1.85 words in length (with a maximum of 6). The same file run into a hash table of 30,240 elements (evenly divisible by integers 2 through 9) fills only 60.7 percent of the slots and the average chain is 2.33 words long (maximum: 10). This is a rather substantial improvement for adding only a single element to the table to make its size prime.
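
  If you want the table size chosen for you, a small helper that rounds a requested size up to the next prime is enough; the one below is illustrative, not taken from the article:

#include <stddef.h>

/* Trial-division primality test; plenty fast for one-off table sizes. */
static int is_prime(size_t n)
{
    size_t d;

    if (n < 2)
        return 0;
    if (n % 2 == 0)
        return n == 2;
    for (d = 3; d * d <= n; d += 2)
        if (n % d == 0)
            return 0;
    return 1;
}

/* Round a requested table size up to the next prime,
 * e.g. next_prime(30240) == 30241. */
size_t next_prime(size_t n)
{
    while (!is_prime(n))
        ++n;
    return n;
}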

  A second easy improvement is increasing the size of the hash table. This optimization is one of the classic memory-for-speed trade-offs and is effective as long as the data set has more elements than your hash table. (I'm still talking about open hash tables here. Closed hash tables will be discussed shortly.) Going back to the example of the Bible, when it was hashed (with HashPJW()) into a table of 7561 elements, 99.7 percent of the slots were occupied and the average chain was 5.68 words long; at 15,121 elements, occupancy was 93.9 percent and the average chain was 3.01 words. The values for a table of 30,241 elements were shown previously. In terms of performance for this example, quadrupling the table size (from 7561 to 30,241 elements) cut the average chain length to roughly a third. Consistent with the (overall) small role played by hash operations, the program's timed performance only improved by 1.7 percent. However, after all other optimizations are performed, this difference could be meaningful.

  Finally, be careful with the construction of the linked lists emanating from the table. Significant savings can be attained by not allocating memory each time a data element is added to the table, but by using a pool of nodes allocated as a single block. Likewise, the lists should be ordered, either by ascending value or by recency of access. Compiler symbol tables, for example, often place the most recently accessed node at the head of the list, on the theory that references to a variable will tend to be grouped together. This view is especially validated in languages such as C and Pascal, which have local variables.
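
  A rough sketch of both ideas, again using the hypothetical Node type from the open-table example, might look like this:

#include <string.h>

/* A pool of nodes carved out of one block, so insertions do not call
 * the allocator every time. The caller fills in block/capacity up front. */
typedef struct {
    Node   *block;
    size_t  used;
    size_t  capacity;
} NodePool;

Node *pool_alloc(NodePool *pool)
{
    return (pool->used < pool->capacity) ? &pool->block[pool->used++] : NULL;
}

/* Lookup with "recency of access" ordering: a found node is unlinked
 * and re-linked at the head of its chain. */
Node *find_move_to_front(Node *table[], size_t size, const char *word,
                         unsigned int (*hash)(const char *))
{
    size_t  slot = hash(word) % size;
    Node   *prev = NULL, *n;

    for (n = table[slot]; n != NULL; prev = n, n = n->next)
    {
        if (strcmp(n->word, word) == 0)
        {
            if (prev != NULL)              /* move the hit to the head */
            {
                prev->next  = n->next;
                n->next     = table[slot];
                table[slot] = n;
            }
            return n;
        }
    }
    return NULL;                           /* not in the table */
}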

Closed Hash Tables

As mentioned previously, the two primary methods of resolving hashing collisions in closed tables are linear and nonlinear rehashing. The idea behind nonlinear rehashing is that it helps disperse colliding values throughout the table.

Closed hash tables are particularly effective when the maximum size of the incoming data set is known and a table with many more elements can be allocated. It has been proven that once a closed table becomes more than 50 percent full, performance deteriorates significantly (see Data Structures and Program Design in C, by Robert Kruse, Bruce Leung, and Clovis Tondo, Prentice Hall, 1991). Closed tables are commonly used for rapid prototyping (they're easy to code quickly) and where fast access (no linked lists) is a priority and memory is easily available.

Traditional computer science has found that, given equal load (the ratio of occupied slots to table size), tables tend to perform better by using nonlinear rehashing than by stepping through the table. This view has generally been accepted without much debate because the math behind it is fairly compelling.

However, hardware advances (of all things!) are beginning to change the math in favor of linear rehashing. Specifically, the reasons are memory latency and CPU cache construction.

Memory Latency

  High-end CPUs today commonly operate at speeds of 250 MHz and above. At 250 MHz, each clock cycle takes 4 nanoseconds. By comparison, the fastest widely available dynamic RAM runs at 60 nanoseconds. In effect, this means that a single memory fetch is approximately 15 times slower than a CPU instruction. From this large disparity, you can see that your programs should avoid memory fetches as much as possible.

  Most CPUs today have data caches on board, and the more your programs use this memory, rather than general DRAM, the better they will perform. As with all caches, their benefit accrues only if the data you need to access is already in the cache (when it is, you have a cache "hit"). CPU caches tend to hold chunks of data brought in over previous operations, and these chunks tend to include data past the byte or bytes used for the one instruction. As a result, cache hits will increase if you access adjoining bytes in subsequent instructions. Hence, you will have far more cache hits if you use linear rehashing and simply query the adjoining table element than if you use nonlinear rehashing.

  How significant is this difference? The program memtime.c (available electronically, see the top of the page) shows you the effects of memory latency. It allocates a programmer-defined block and performs as many reads as there are bytes. In one pass, it does the reads randomly; in the second, it does the reads sequentially. Table 1 presents the results for 10 million reads into a 10-MB table.

Table 1: The results for 10 million reads into a 10-MB table.

Processor/Memory                   Sequential    Random
100-MHz Pentium/60-ns memory       1 second      7 seconds
66-MHz Cyrix 486/70-ns memory      2 seconds     9 seconds
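
  memtime.c itself is not reproduced here, but a rough sketch of such a measurement might look like the following; the 10-MB block, the use of clock(), and the doubled-up rand() calls (to widen the index range) are all illustrative choices:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BLOCK_SIZE (10UL * 1024UL * 1024UL)   /* 10 MB */

int main(void)
{
    unsigned char *block = malloc(BLOCK_SIZE);
    volatile unsigned char sink = 0;          /* keeps reads from being optimized away */
    clock_t t0, t1, t2;
    size_t  i;

    if (block == NULL)
        return 1;
    memset(block, 0, BLOCK_SIZE);             /* touch every page once */

    t0 = clock();
    for (i = 0; i < BLOCK_SIZE; ++i)          /* sequential pass */
        sink ^= block[i];
    t1 = clock();
    for (i = 0; i < BLOCK_SIZE; ++i)          /* random pass */
        sink ^= block[(((size_t)rand() << 15) ^ (size_t)rand()) % BLOCK_SIZE];
    t2 = clock();

    printf("sequential: %.2f s   random: %.2f s\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    free(block);
    return 0;
}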

  Predictably, the ratio becomes even more unfavorable when the program runs on operating systems with paged memory, such as UNIX. When the previous program was run on a 486/66-MHz PC with 70-ns memory running UnixWare, the sequential reads took 3 seconds and the random reads took 35 seconds.

  The point is clear: Memory accesses that occur together should be located together.

  You could argue that saving a handful of seconds across 10 million memory accesses is hardly worth the effort. In many cases this will be true. Nonetheless, coding for this optimization may often be a simple matter of choosing between equal efforts (as in the case of rehashing approaches). Moreover, every indication suggests that CPU speed will continue to increase, while DRAM access speeds have remained stable for several years. As time passes, the benefits of considering memory latency will increase.

  In the specific case of rehashing, the linear approach has an additional benefit: A new hash key does not have to be computed. So, for closed hash tables, keep them sparsely populated and use linear rehashing.

  Memory latency is bound to become more significant in other areas of programming. Bear this in mind as you code.
