The current information explosion has resulted in an increasing number of applications that need to deal with large volumes of data. While many of the data contains useless redundancy data, especially in mass media, web crawler/analytic fields, wasted many precious resources (power, bandwidth, CPU and storage, etc.). This has resulted in an increased interest in algorithms that process the input data in restricted ways.

But traditional hash algorithms have two problems, first it assumes that the data fits in main memory, it is unreasonable when dealing with massive data such as multimedia data, web crawler/analytic repositories and so on. And second, traditional hash can only indentify the identical data. this brings to light the importance of simhash.

 

Simhash 5 steps: Tokenize, Hash, Weigh Values, Merge, Dimensionality Reduction

  • tokenize

    • tokenize your data, assign weights to each token, weights and tokenize function are depend on your business

  • hash (md5, SHA1)

    • calculate token's hash value and convert it to binary (101011 )

  • weigh values

    • for each hash value, do hash*w, in this way: (101011 ) -> (w,-w,w,-w,w,w)

  • merge

    • add up tokens' values, to merge to 1 hash, for example, merge (4 -4 -4 4 -4 4) and (5 -5 5 -5 5 5) , results to (4+5 -4+-5 -4+5 4+-5 -4+5 4+5),which is (9 -9 1 -1 1)

  • Dimensionality Reduction

    • Finally, signs of elements of V corresponds to the bits of the final fingerprint, for example (9 -9 1 -1 1) -> (1 0 1 0 1), we get 10101 as the fingerprint.

How to use SimHash fingerprints?

Hamming distance can be used to find the similarity between two given data, calculate the Hamming distance between 2 fingerprints.

Based on my experience, for 64 bit SimHash values, with elaborate weight values,  distance of similar data often differ appreciably in magnitude from those unsimilar data.

how to calculate Hamming distance:

  XOR, 只有两个位不同时结果是1 ,否则为0,两个二进制value“异或”后得到1的个数 为海明距离 。

SimHash algorithm, introduced by Charikarand is patented by Google.

simhash 0.1.0 : Python Package Index

[SimHash] the Hash-based Similarity Detection Algorithm的更多相关文章

  1. A Node Influence Based Label Propagation Algorithm for Community detection in networks 文章算法实现的疑问

    这是我最近看到的一篇论文,思路还是很清晰的,就是改进的LPA算法.改进的地方在两个方面: (1)结合K-shell算法计算量了节点重重要度NI(node importance),标签更新顺序则按照NI ...

  2. VIPS: a VIsion based Page Segmentation Algorithm

    VIPS: a VIsion based Page Segmentation Algorithm VIPS: a VIsion based Page Segmentation Algorithm In ...

  3. MBMD(MobileNet-based tracking by detection algorithm)作者答疑

    If you fail to install and run this tracker, please email me (zhangyunhua@mail.dlut.edu.cn) Introduc ...

  4. anomaly detection algorithm

    anomaly detection algorithm 以上就是异常监测算法流程

  5. Floyd判圈算法 Floyd Cycle Detection Algorithm

    2018-01-13 20:55:56 Floyd判圈算法(Floyd Cycle Detection Algorithm),又称龟兔赛跑算法(Tortoise and Hare Algorithm) ...

  6. Floyd's Cycle Detection Algorithm

    Floyd's Cycle Detection Algorithm http://www.siafoo.net/algorithm/10 改进版: http://www.siafoo.net/algo ...

  7. 从时序异常检测(Time series anomaly detection algorithm)算法原理讨论到时序异常检测应用的思考

    1. 主要观点总结 0x1:什么场景下应用时序算法有效 历史数据可以被用来预测未来数据,对于一些周期性或者趋势性较强的时间序列领域问题,时序分解和时序预测算法可以发挥较好的作用,例如: 四季与天气的关 ...

  8. 个性探测综述阅读笔记——Recent trends in deep learning based personality detection

    目录 abstract 1. introduction 1.1 个性衡量方法 1.2 应用前景 1.3 伦理道德 2. Related works 3. Baseline methods 3.1 文本 ...

  9. 论文阅读笔记五十二:CornerNet-Lite: Efficient Keypoint Based Object Detection(CVPR2019)

    论文原址:https://arxiv.org/pdf/1904.08900.pdf github:https://github.com/princeton-vl/CornerNet-Lite 摘要 基 ...

随机推荐

  1. snip

    首先明确物体太小太大都不好检测(都从roi的角度来分析):   1.小物体: a.本身像素点少,如果从anchor的点在gt像素内来说,能提取出来的正样本少    b.小物体会出现iou过低.具体来说 ...

  2. 多线程之Lock

    Java并发编程:Lock 在上一篇文章中我们讲到了如何使用关键字synchronized来实现同步访问.本文我们继续来探讨这个问题,从Java 5之后,在java.util.concurrent.l ...

  3. Kafka设计解析(三)Kafka High Availability (下)

    转载自 技术世界,原文链接 Kafka设计解析(三)- Kafka High Availability (下) 摘要 本文在上篇文章基础上,更加深入讲解了Kafka的HA机制,主要阐述了HA相关各种场 ...

  4. Java笔试题解析(二)——2015届唯品会校招

    曾经总是看别人写的笔经面经.今天自己最终能够写自己亲身经历的一篇了 T-T. 前阵子去了唯品会的秋招宣讲会,华工场(如今才知道原来找家互联网公司工作的人好多).副总裁介绍了VIP的商业模式是逛街式的购 ...

  5. expdp导出时报错ora-16000

    一.问题现象:在对数据库进行expdp导出时发生报错ora-16000,脚本如下: nohup expdp "'/ as sysdba'" schemas=shp DIRECTOR ...

  6. DQL-分页查询

    关键字 :limit 一.应用场景当要查询的条目数太多,一页显示不全二.语法 select 查询列表from 表limit [offset,]size;注意:offset代表的是起始的条目索引,默认从 ...

  7. mysql索引和外键

    innodb外键: 1.CASCADE:从父表删除或更新会自动删除或更新子表中匹配的行 2.SET NULL:从父表删除或更新行,会设置子表中的外键列为NULL,但必须保证子表列没有指定NOT NUL ...

  8. html 头标签 <meta http-equiv 属性应用。

    比较常用的 1. 网页出现乱码时设置字符集 <meta http-equiv="content-Type" content="text/html; charset= ...

  9. 20155210 潘滢昊2016-2017-2 《Java程序设计》第9周学习总结

    20155210 2016-2017-2 <Java程序设计>第9周学习总结 教材学习内容总结 JDBC驱动的四种类型(按操作方式分类的): JDBC-ODBC Bridge Driver ...

  10. WPF 自定义ComboBox样式,自定义多选控件

    原文:WPF 自定义ComboBox样式,自定义多选控件 一.ComboBox基本样式 ComboBox有两种状态,可编辑和不可编辑状态.通过设置IsEditable属性可以切换控件状态. 先看基本样 ...