[SimHash] the Hash-based Similarity Detection Algorithm

ScottGu 2024-10-01 12:36:34

The current information explosion means that more and more applications must deal with large volumes of data. Much of that data is redundant, especially in mass media and in web crawling and analytics, and processing it wastes valuable resources (power, bandwidth, CPU, storage, and so on). This has led to increased interest in algorithms that process the input data in restricted ways.

Traditional hash algorithms have two problems here. First, they assume that the data fits in main memory, which is unreasonable for massive data sets such as multimedia data or web crawler and analytics repositories. Second, a traditional hash can only identify identical data: a small change in the input produces a completely different hash value, so near-duplicates go undetected. This is where SimHash becomes important.

SimHash in 5 steps: Tokenize, Hash, Weigh Values, Merge, Dimensionality Reduction

Tokenize
- Tokenize your data and assign a weight to each token. The tokenizer and the weights depend on your application.
Hash (MD5, SHA1)
- Calculate each token's hash value and convert it to binary, for example (101011).
Weigh Values
- For each hash value, apply the token's weight w bit by bit: a 1 bit becomes w and a 0 bit becomes -w, so (101011) -> (w, -w, w, -w, w, w).
Merge
- Add up the weighted vectors of all tokens, position by position, to merge them into one vector. For example, merging (4 -4 -4 4 -4 4) and (5 -5 5 -5 5 5) gives (4+5 -4-5 -4+5 4-5 -4+5 4+5), which is (9 -9 1 -1 1 9).
Dimensionality Reduction
- Finally, the signs of the elements of the merged vector determine the bits of the final fingerprint: a positive element becomes 1 and anything else becomes 0. For example, (9 -9 1 -1 1 9) -> (1 0 1 0 1 1), so the fingerprint is 101011.
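
The five steps can be sketched in Python roughly as follows. This is a minimal illustration, not the implementation of any particular library: the whitespace tokenizer, the use of term frequency as the weight, the choice of MD5 truncated to 64 bits, and all function names are assumptions made for the example.

```python
import hashlib
from collections import Counter


def token_bits(token, bits=64):
    """Step 2: hash a token and return its low `bits` bits as a bit string."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big") & ((1 << bits) - 1)
    return format(value, "0{}b".format(bits))


def weigh(bit_string, w):
    """Step 3: a 1 bit becomes +w, a 0 bit becomes -w."""
    return [w if b == "1" else -w for b in bit_string]


def merge(vectors):
    """Step 4: add the weighted vectors position by position."""
    return [sum(column) for column in zip(*vectors)]


def reduce_to_fingerprint(vector):
    """Step 5: a positive element becomes 1, anything else becomes 0."""
    return "".join("1" if x > 0 else "0" for x in vector)


def simhash(text, bits=64):
    """Return the SimHash fingerprint of `text` as a bit string."""
    # Step 1 (assumption): whitespace tokens, weight = term frequency.
    weights = Counter(text.lower().split())
    if not weights:
        return "0" * bits
    vectors = [weigh(token_bits(t, bits), w) for t, w in weights.items()]
    return reduce_to_fingerprint(merge(vectors))


# The merge example with two 6-bit hashes, weights 4 and 5:
merged = merge([weigh("100101", 4), weigh("101011", 5)])
print(merged)                         # [9, -9, 1, -1, 1, 9]
print(reduce_to_fingerprint(merged))  # 101011
```

In a real application the tokenizer and weighting scheme (for example, n-grams with TF-IDF weights) would replace the simple choices made here.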

How to use SimHash fingerprints?

The Hamming distance between two fingerprints measures how similar the two pieces of data are: the smaller the distance, the more similar the data.

In my experience, for 64-bit SimHash values with carefully chosen weights, the distances between similar data differ appreciably in magnitude from the distances between dissimilar data.

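How to calculate the Hamming distance:

XOR the two fingerprints. A result bit is 1 only where the two input bits differ, and 0 otherwise, so the number of 1 bits in the XOR result is the Hamming distance.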
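
As a small sketch of this calculation, assuming the fingerprints are held as Python integers (the function names and the threshold value below are illustrative assumptions, not taken from any library):

```python
def hamming_distance(a, b):
    """Count the bit positions where fingerprints a and b differ."""
    # XOR leaves a 1 exactly where the two bits differ; count those 1s.
    return bin(a ^ b).count("1")


def is_near_duplicate(a, b, threshold=3):
    """Treat two fingerprints as similar if they differ in few bits.

    The threshold is an assumed value; tune it for your own data.
    """
    return hamming_distance(a, b) <= threshold


# 101011 XOR 100101 = 001110, which has three 1 bits.
print(hamming_distance(0b101011, 0b100101))  # 3
```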

The SimHash algorithm was introduced by Charikar and is patented by Google.

simhash 0.1.0 : Python Package Index
