[SimHash] the Hash-based Similarity Detection Algorithm
The current information explosion has led to a growing number of applications that must handle large volumes of data. Much of this data is redundant, especially in mass media and web crawling/analytics, and it wastes valuable resources (power, bandwidth, CPU, storage, and so on). This has increased interest in algorithms that process the input data in restricted ways.
Traditional hash algorithms have two problems here. First, they assume that the data fits in main memory, which is unreasonable when dealing with massive data such as multimedia data or web crawler/analytics repositories. Second, a traditional hash can only identify identical data: two inputs that differ slightly get unrelated hash values. This brings to light the importance of SimHash, which maps similar inputs to similar fingerprints.
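To make the second problem concrete, here is a minimal sketch. The sample strings and the helper name `md5_as_int` are illustrative assumptions, not from the original post. It hashes two nearly identical sentences with MD5 and counts how many of the 128 digest bits differ. With a cryptographic hash, a one-character change typically flips roughly half of the bits, so the digests say nothing about how similar the inputs are.

```python
import hashlib


def md5_as_int(text: str) -> int:
    """Return the 128-bit MD5 digest of `text` as an integer (illustrative helper)."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest(), "big")


# Two inputs that differ by a single trailing character (made-up sample data).
a = md5_as_int("the quick brown fox jumps over the lazy dog")
b = md5_as_int("the quick brown fox jumps over the lazy dogs")

# XOR leaves a 1 wherever the digests disagree; count those bits.
differing_bits = bin(a ^ b).count("1")

print("identical digests:", a == b)
print("differing bits out of 128:", differing_bits)
```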
SimHash has 5 steps: Tokenize, Hash, Weigh Values, Merge, Dimensionality Reduction
Tokenize
Tokenize your data and assign a weight to each token. The weights and the tokenize function depend on your business needs.
Hash (MD5, SHA-1)
Calculate each token's hash value and convert it to binary, for example (101011).
Weigh Values
For each hash value, compute hash*w as follows: each 1 bit becomes w and each 0 bit becomes -w, so (101011) -> (w,-w,w,-w,w,w).
Merge
Add up the tokens' weighted vectors position by position to merge them into one vector V. For example, merging (4 -4 -4 4 -4 4) and (5 -5 5 -5 5 5) gives (4+5 -4-5 -4+5 4-5 -4+5 4+5), which is (9 -9 1 -1 1 9).
Dimensionality Reduction
Finally, the signs of the elements of V correspond to the bits of the final fingerprint: a positive element becomes 1 and a negative element becomes 0. For example, (9 -9 1 -1 1 9) -> (1 0 1 0 1 1), so we get 101011 as the fingerprint.
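The five steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`token_bits`, `weigh`, `merge`, `reduce_to_bits`, `simhash`) are hypothetical, MD5 truncated to 64 bits stands in for the token hash, the caller supplies tokens with their weights, and a sum of exactly zero is mapped to bit 0.

```python
import hashlib
from typing import Dict, List


def token_bits(token: str, bits: int = 64) -> str:
    """Step 2: hash a token with MD5 and keep the low `bits` bits as a binary string."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big") & ((1 << bits) - 1)
    return format(value, "0{}b".format(bits))


def weigh(hash_bits: str, w: int) -> List[int]:
    """Step 3: turn each 1 bit into +w and each 0 bit into -w."""
    return [w if b == "1" else -w for b in hash_bits]


def merge(vectors: List[List[int]]) -> List[int]:
    """Step 4: add the weighted vectors position by position."""
    return [sum(column) for column in zip(*vectors)]


def reduce_to_bits(v: List[int]) -> str:
    """Step 5: keep only the signs; positive -> 1, otherwise 0 (zero handling is an assumption)."""
    return "".join("1" if x > 0 else "0" for x in v)


def simhash(weighted_tokens: Dict[str, int], bits: int = 64) -> str:
    """Steps 2-5 for tokens that were already extracted and weighted (step 1)."""
    if not weighted_tokens:
        return "0" * bits
    vectors = [weigh(token_bits(t, bits), w) for t, w in weighted_tokens.items()]
    return reduce_to_bits(merge(vectors))


# The 6-bit worked example: hashes 100101 (w=4) and 101011 (w=5).
merged = merge([weigh("100101", 4), weigh("101011", 5)])
print(merged)                  # [9, -9, 1, -1, 1, 9]
print(reduce_to_bits(merged))  # 101011

# Step 1 is application specific; here the tokens and weights are made up.
print(simhash({"simhash": 3, "similarity": 2, "detection": 1}))
```

Keeping tokenization outside `simhash` reflects the point that the tokens and weights depend on the application; a word-frequency count is one common choice, but it is only one option.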
How to use SimHash fingerprints?
Hamming distance can be used to measure the similarity between two pieces of data: calculate the Hamming distance between their 2 fingerprints. The smaller the distance, the more similar the data.
Based on my experience, for 64-bit SimHash values with carefully chosen weights, the distances between similar data often differ appreciably in magnitude from those between dissimilar data.
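How to calculate Hamming distance:
XOR the two binary values. A result bit is 1 only when the two input bits differ, otherwise it is 0. The number of 1s in the XOR result is the Hamming distance.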
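A minimal Python sketch of this calculation follows. The function names and the `threshold` default of 3 are assumptions for illustration; a suitable threshold depends on the fingerprint width and on your data.

```python
def hamming_distance(a: int, b: int) -> int:
    """XOR the fingerprints and count the 1 bits in the result."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: int, b: int, threshold: int = 3) -> bool:
    """Treat two fingerprints as similar when few bits differ (threshold is hypothetical)."""
    return hamming_distance(a, b) <= threshold


# 101011 XOR 100101 = 001110, which contains three 1s.
print(hamming_distance(0b101011, 0b100101))  # 3
print(is_near_duplicate(0b101011, 0b100101))
```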

The SimHash algorithm was introduced by Charikar and is patented by Google.
Python implementation: simhash 0.1.0 (Python Package Index)