Algorithm | hash

A basic requirement is that the hash function should provide a uniform distribution of hash values. A non-uniform distribution increases the number of collisions and the cost of resolving them.

A critical statistic for a hash table is called the load factor. This is simply the number of entries divided by the number of buckets, that is, n/k where n is the number of entries and k is the number of buckets.
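As a small illustration of the n/k definition above, the following C++ sketch computes the load factor and applies a growth check. The names `load_factor` and `should_grow`, and the 0.75 threshold, are assumptions made for this example; they do not come from the text.

```cpp
#include <cassert>
#include <cstddef>

// Load factor n/k: number of entries divided by number of buckets.
double load_factor(std::size_t entries, std::size_t buckets) {
    assert(buckets > 0);
    return static_cast<double>(entries) / static_cast<double>(buckets);
}

// Hypothetical resize policy: grow once the load factor exceeds max_load.
// The default of 0.75 is an assumed threshold, not a value from the text.
bool should_grow(std::size_t entries, std::size_t buckets, double max_load = 0.75) {
    return load_factor(entries, buckets) > max_load;
}
```

With 3 entries in 4 buckets the load factor is 0.75, so under this assumed policy the table would grow only when a fourth entry is added.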

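Different kinds of hash tables use different methods, and their performance differs accordingly. The problem they all share is collisions. The collision resolution strategies are: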

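1. Open addressing: \(hash_i=(hash(key)+d_i) \,\bmod\, m,\ i=1,2,\ldots,k\,(k \le m-1)\), where hash(key) is the hash function, m is the table length, d_i is the increment sequence, and i is the number of collisions so far. The increment sequence can be chosen in the following ways:
\(d_i=1,2,3,\ldots,m-1\) is called linear probing; that is, d_i=i, or some other linear function of i. It amounts to scanning the table slot by slot until an empty slot is found, and storing the entry in that slot.
\(d_i=\pm 1^2, \pm 2^2, \pm 3^2, \ldots, \pm k^2\ (k \le m/2)\) is called quadratic probing. In contrast to linear probing, after the i-th collision it checks whether the slot at offset d_i=i^2 is empty and, if so, stores the entry there.
d_i = a pseudo-random number sequence is called pseudo-random probing.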

Well-known probe sequences include:

Linear probing, in which the interval between probes is fixed (usually 1)
Quadratic probing, in which the interval between probes is increased by adding the successive outputs of a quadratic polynomial to the starting value given by the original hash computation
Double hashing, in which the interval between probes is computed by another hash function
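The three probe sequences listed above can be sketched as index functions. This is a minimal illustration, not a full table: the function names are made up for the example, the quadratic version uses only the positive branch \(d_i = i^2\), and the arithmetic assumes m is small enough that the products do not overflow `std::size_t`.

```cpp
#include <cassert>
#include <cstddef>

// Linear probing: the i-th probe is (h + i) mod m.
std::size_t linear_probe(std::size_t h, std::size_t i, std::size_t m) {
    assert(m > 0);
    return (h % m + i % m) % m;
}

// Quadratic probing (positive branch only): the i-th probe is (h + i^2) mod m.
std::size_t quadratic_probe(std::size_t h, std::size_t i, std::size_t m) {
    assert(m > 0);
    std::size_t step = ((i % m) * (i % m)) % m;
    return (h % m + step) % m;
}

// Double hashing: the i-th probe is (h1 + i * h2) mod m, where h2 comes from
// a second hash function. h2 must be non-zero mod m, or the probe never moves.
std::size_t double_hash_probe(std::size_t h1, std::size_t h2,
                              std::size_t i, std::size_t m) {
    assert(m > 0 && h2 % m != 0);
    return (h1 % m + ((i % m) * (h2 % m)) % m) % m;
}
```

For a table of m = 7 slots and a home position of 5, the third linear probe lands on slot 1, the second quadratic probe on slot 2, and the second double-hashing probe with h2 = 3 on slot 4.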

A drawback of all these open addressing schemes is that the number of stored entries cannot exceed the number of slots in the bucket array. Open addressing schemes also put more stringent requirements on the hash function: besides distributing the keys more uniformly over the buckets, the function must also minimize the clustering of hash values that are consecutive in the probe order.

Open addressing only saves memory if the entries are small (less than four times the size of a pointer) and the load factor is not too small. If the load factor is close to zero (that is, there are far more buckets than stored entries), open addressing is wasteful even if each entry is just two words.

Generally speaking, open addressing is better used for hash tables with small records that can be stored within the table (internal storage) and fit in a cache line. It is particularly suitable for elements of one word or less. If the table is expected to have a high load factor, the records are large, or the data is variable-sized, chained hash tables often perform as well or better.

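Open addressing can perform very well: the table is one contiguous array, so inserting an entry needs no additional memory allocation.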
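To make the contiguous-array point concrete, here is a minimal sketch of an open-addressing set of int keys that uses linear probing. The class name `LinearProbingSet` and its interface are assumptions for illustration. It has a fixed capacity and deliberately omits deletion, which would need tombstones or re-insertion to keep probe chains intact.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-capacity open-addressing set with linear probing (illustrative sketch).
class LinearProbingSet {
public:
    explicit LinearProbingSet(std::size_t capacity)
        : keys_(capacity), used_(capacity, 0), count_(0) {
        assert(capacity > 0);
    }

    // Returns false if the key is already present or the table is full.
    bool insert(int key) {
        const std::size_t m = keys_.size();
        const std::size_t home = slot(key);
        for (std::size_t i = 0; i < m; ++i) {
            const std::size_t pos = (home + i) % m;  // d_i = i
            if (!used_[pos]) {
                keys_[pos] = key;
                used_[pos] = 1;
                ++count_;
                return true;
            }
            if (keys_[pos] == key) return false;  // duplicate
        }
        return false;  // every slot is occupied
    }

    // Follows the same probe sequence; an empty slot ends the search.
    bool contains(int key) const {
        const std::size_t m = keys_.size();
        const std::size_t home = slot(key);
        for (std::size_t i = 0; i < m; ++i) {
            const std::size_t pos = (home + i) % m;
            if (!used_[pos]) return false;
            if (keys_[pos] == key) return true;
        }
        return false;
    }

    std::size_t size() const { return count_; }

private:
    // Non-negative remainder, so negative keys still map to a valid slot.
    std::size_t slot(int key) const {
        const long long m = static_cast<long long>(keys_.size());
        return static_cast<std::size_t>(((key % m) + m) % m);
    }

    std::vector<int> keys_;
    std::vector<char> used_;  // 1 if the slot holds a key
    std::size_t count_;
};
```

All storage is allocated once in the constructor; `insert` only writes into existing slots, which is also why the number of stored entries can never exceed the number of slots.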

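2. Separate chaining: all elements that hash to the same position are kept in a linked list. One implementation strategy is to treat the colliding entries at each position as a stack; whether a new element is inserted at the front or the back of the list depends entirely on which is more convenient.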

Chained hash tables with linked lists are popular because they require only basic data structures with simple algorithms, and can use simple hash functions that are unsuitable for other methods.

Chained hash tables also inherit the disadvantages of linked lists. When storing small keys and values, the space overhead of the next pointer in each entry record can be significant. An additional disadvantage is that traversing a linked list has poor cache performance, making the processor cache ineffective.

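3. Rehashing: hash_i = hash_i(key), i=1,2,...,k, where the hash_i are a sequence of hash functions. When the previous hash computation results in a collision, a new hash address is computed, and this repeats until no collision occurs. This method is less prone to clustering, but it increases computation time.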
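A rough sketch of the rehashing idea: try each hash function in turn and take the first address that is free. The two hash functions `hash_mod` and `hash_div`, the `HashFn` type, and the `rehash_slot` helper are all hypothetical choices for this example, and keys are assumed to be non-negative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A hash function maps a key to a slot index in a table of m slots.
typedef std::size_t (*HashFn)(int key, std::size_t m);

// Two illustrative hash functions (assumed, not from the text).
std::size_t hash_mod(int key, std::size_t m) {
    return static_cast<std::size_t>(key) % m;
}
std::size_t hash_div(int key, std::size_t m) {
    return (static_cast<std::size_t>(key) / m) % m;
}

// Tries hash_1(key), hash_2(key), ... and returns the first free slot,
// or -1 if every function in the list collides.
long rehash_slot(int key, const std::vector<HashFn>& hashes,
                 const std::vector<char>& occupied) {
    assert(key >= 0 && !occupied.empty());
    for (std::size_t i = 0; i < hashes.size(); ++i) {
        const std::size_t pos = hashes[i](key, occupied.size());
        if (!occupied[pos]) return static_cast<long>(pos);
    }
    return -1;
}
```

In a 5-slot table where slot 2 is taken, key 17 collides under `hash_mod` (17 mod 5 = 2) but `hash_div` sends it to slot 3. Key 12 maps to slot 2 under both functions, so the helper reports failure; a real table would need more functions or a fallback strategy.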

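    #include <cstring>
    #include <iostream>

    // Separate-chaining hash table for int values:
    // each bucket is the head of a singly linked list.
    class HashTable {
    public:
        struct List {
            int val;
            List* next;
            List(int val) : val(val), next(NULL) {}
        };

        HashTable() {
            table = new List*[TABLE_SIZE];
            // Every bucket starts out empty.
            std::memset(table, 0, sizeof(List*) * TABLE_SIZE);
            count = 0;
        }

        ~HashTable() {
            // Free every node in every bucket, then the bucket array.
            for (int i = 0; i < TABLE_SIZE; ++i) {
                List* p = table[i], *tmp;
                while (p) {
                    tmp = p->next;
                    delete p;
                    p = tmp;
                }
            }
            delete[] table;
        }

        // Inserts at the front of the bucket's list; duplicates are allowed.
        void insert(int val) {
            count++;
            int index = hash(val);
            List* elem = new List(val);
            elem->next = table[index];
            table[index] = elem;
        }

        // Removes the first node holding val, if there is one.
        void remove(int val) {
            int index = hash(val);
            // p points at the link that leads to the current node,
            // so the head and interior nodes are unlinked the same way.
            List **p = &table[index], *tmp;
            while (*p) {
                if ((*p)->val == val) {
                    count--;
                    tmp = (*p)->next;
                    delete *p;
                    *p = tmp;
                    return;
                }
                p = &((*p)->next);
            }
        }

        int size() const {
            return count;
        }

        void print() const {
            for (int i = 0; i < TABLE_SIZE; ++i) {
                List* p = table[i];
                std::cout << i << ": ";
                while (p) {
                    std::cout << p->val << " ";
                    p = p->next;
                }
                std::cout << std::endl;
            }
        }

    private:
        List** table;
        int count;

        // Maps val to a bucket index; the adjustment keeps the result
        // non-negative when val is negative.
        int hash(int val) const {
            int r = val % TABLE_SIZE;
            return r < 0 ? r + TABLE_SIZE : r;
        }

        // The original value is missing from the source; 13 is an assumed placeholder.
        enum { TABLE_SIZE = 13 };
    };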
