[SimHash] the Hash-based Similarity Detection Algorithm
The current information explosion has resulted in an increasing number of applications that need to deal with large volumes of data. Much of this data is redundant, especially in mass media and web crawling/analytics, and it wastes precious resources (power, bandwidth, CPU, storage, etc.). This has increased interest in algorithms that process the input data in restricted ways.
Traditional hash algorithms have two problems. First, they assume that the data fits in main memory, which is unreasonable when dealing with massive data such as multimedia data and web crawler/analytics repositories. Second, a traditional hash can only identify identical data. This brings to light the importance of SimHash.
SimHash has 5 steps: Tokenize, Hash, Weigh Values, Merge, Dimensionality Reduction
Tokenize
Tokenize your data and assign a weight to each token. The weights and the tokenize function depend on your business.
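As a rough illustration of this step, the sketch below splits text into lowercase words and uses each word's count as its weight. Both choices, and the function name `tokenize`, are assumptions made for the example. A real system would pick the tokenizer and weighting scheme that suit its data.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens; weight = number of occurrences.

    Word splitting and count-based weights are assumptions for this sketch.
    """
    words = re.findall(r"\w+", text.lower())
    return dict(Counter(words))

print(tokenize("the cat sat on the mat"))
# {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
```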
Hash (MD5, SHA-1)
Calculate each token's hash value and write it in binary, for example (101011).
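One possible way to do this in Python is shown below. It is only a sketch: it uses MD5 from the standard `hashlib` module and keeps the top `nbits` bits of the digest. The helper name `token_bits` and the choice to truncate to 64 bits are assumptions, not something the algorithm prescribes.

```python
import hashlib

def token_bits(token, nbits=64):
    """Return the top `nbits` bits (nbits <= 128) of the token's MD5 digest
    as a list of 0/1 values, most significant bit first."""
    digest = hashlib.md5(token.encode("utf-8")).digest()  # 16 bytes = 128 bits
    value = int.from_bytes(digest, "big") >> (128 - nbits)
    return [(value >> (nbits - 1 - i)) & 1 for i in range(nbits)]

bits = token_bits("cat")
print(len(bits))  # 64
```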
Weigh values
For each hash value, apply the token's weight w bit by bit: a 1 bit becomes +w and a 0 bit becomes -w. For example, (101011) -> (w, -w, w, -w, w, w).
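The mapping can be sketched in a couple of lines. The function name `weigh` and the use of a bit string as input are assumptions for illustration.

```python
def weigh(bits, w):
    """Map a bit string to a weighted vector: '1' -> +w, '0' -> -w."""
    return [w if b == "1" else -w for b in bits]

print(weigh("101011", 5))  # [5, -5, 5, -5, 5, 5]
```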
Merge
Add up the weighted vectors of all tokens, position by position, to merge them into one vector V. For example, merging (4 -4 -4 4 -4 4) and (5 -5 5 -5 5 5) gives (4+5 -4-5 -4+5 4-5 -4+5 4+5), which is (9 -9 1 -1 1 9).
Dimensionality Reduction
Finally, the signs of the elements of V correspond to the bits of the final fingerprint: a positive element becomes 1 and a negative element becomes 0. For example, (9 -9 1 -1 1 9) -> (1 0 1 0 1 1), so we get 101011 as the fingerprint.
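Putting the five steps together, a minimal sketch might look like the following. It assumes the input is already a dictionary of token weights, uses the top 64 bits of MD5 as the token hash, and treats a sum of exactly zero as a 0 bit. The function names and these choices are assumptions for illustration, not a reference implementation. The last lines replay the two weighted vectors from the merge example across all six positions.

```python
import hashlib

def merge(vectors):
    """Add weighted vectors position by position."""
    return [sum(column) for column in zip(*vectors)]

def to_fingerprint(v):
    """Positive element -> bit 1, zero or negative -> bit 0; returns an int."""
    fp = 0
    for x in v:
        fp = (fp << 1) | int(x > 0)
    return fp

def simhash(weights, nbits=64):
    """SimHash of a {token: weight} dict, hashing tokens with MD5 (assumed)."""
    vectors = []
    for token, w in weights.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") >> (128 - nbits)  # keep top nbits
        bits = [(h >> (nbits - 1 - i)) & 1 for i in range(nbits)]
        vectors.append([w if b else -w for b in bits])
    return to_fingerprint(merge(vectors))

# Replaying the worked example: weights 4 and 5 applied to two 6-bit hashes
v = merge([[4, -4, -4, 4, -4, 4], [5, -5, 5, -5, 5, 5]])
print(v)                                 # [9, -9, 1, -1, 1, 9]
print(format(to_fingerprint(v), "06b"))  # 101011
```

Calling `simhash({"cat": 1, "mat": 2})` returns a 64-bit integer that can be compared with other fingerprints.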
How to use SimHash fingerprints?
The Hamming distance between two fingerprints measures how similar the two pieces of data are: the smaller the distance, the more similar the data.
In my experience, for 64-bit SimHash values with carefully chosen weights, the distances between similar data often differ appreciably in magnitude from those between dissimilar data.
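How to calculate the Hamming distance:
XOR the two fingerprints. Each result bit is 1 only when the two input bits differ, and 0 otherwise. The number of 1 bits in the XOR of the two binary values is the Hamming distance.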
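With fingerprints stored as Python integers, the XOR-and-count procedure is short. In the sketch below, the helper `is_near_duplicate` and its default threshold of 3 bits are assumptions for illustration. A suitable threshold depends on the data and the weights.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two integer fingerprints differ."""
    return bin(a ^ b).count("1")

def is_near_duplicate(a, b, threshold=3):
    """True if the fingerprints differ in at most `threshold` bits
    (the default of 3 is an assumed value, tune it for your data)."""
    return hamming_distance(a, b) <= threshold

# 101011 XOR 100101 = 001110, which has three 1 bits
print(hamming_distance(0b101011, 0b100101))  # 3
```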
The SimHash algorithm was introduced by Charikar and is patented by Google.
simhash 0.1.0 : Python Package Index