The current information explosion has resulted in an increasing number of applications that need to deal with large volumes of data. While many of the data contains useless redundancy data, especially in mass media, web crawler/analytic fields, wasted many precious resources (power, bandwidth, CPU and storage, etc.). This has resulted in an increased interest in algorithms that process the input data in restricted ways.

But traditional hash algorithms have two problems, first it assumes that the data fits in main memory, it is unreasonable when dealing with massive data such as multimedia data, web crawler/analytic repositories and so on. And second, traditional hash can only indentify the identical data. this brings to light the importance of simhash.

 

Simhash 5 steps: Tokenize, Hash, Weigh Values, Merge, Dimensionality Reduction

  • tokenize

    • tokenize your data, assign weights to each token, weights and tokenize function are depend on your business

  • hash (md5, SHA1)

    • calculate token's hash value and convert it to binary (101011 )

  • weigh values

    • for each hash value, do hash*w, in this way: (101011 ) -> (w,-w,w,-w,w,w)

  • merge

    • add up tokens' values, to merge to 1 hash, for example, merge (4 -4 -4 4 -4 4) and (5 -5 5 -5 5 5) , results to (4+5 -4+-5 -4+5 4+-5 -4+5 4+5),which is (9 -9 1 -1 1)

  • Dimensionality Reduction

    • Finally, signs of elements of V corresponds to the bits of the final fingerprint, for example (9 -9 1 -1 1) -> (1 0 1 0 1), we get 10101 as the fingerprint.

How to use SimHash fingerprints?

Hamming distance can be used to find the similarity between two given data, calculate the Hamming distance between 2 fingerprints.

Based on my experience, for 64 bit SimHash values, with elaborate weight values,  distance of similar data often differ appreciably in magnitude from those unsimilar data.

how to calculate Hamming distance:

  XOR, 只有两个位不同时结果是1 ,否则为0,两个二进制value“异或”后得到1的个数 为海明距离 。

SimHash algorithm, introduced by Charikarand is patented by Google.

simhash 0.1.0 : Python Package Index

[SimHash] the Hash-based Similarity Detection Algorithm的更多相关文章

  1. A Node Influence Based Label Propagation Algorithm for Community detection in networks 文章算法实现的疑问

    这是我最近看到的一篇论文,思路还是很清晰的,就是改进的LPA算法.改进的地方在两个方面: (1)结合K-shell算法计算量了节点重重要度NI(node importance),标签更新顺序则按照NI ...

  2. VIPS: a VIsion based Page Segmentation Algorithm

    VIPS: a VIsion based Page Segmentation Algorithm VIPS: a VIsion based Page Segmentation Algorithm In ...

  3. MBMD(MobileNet-based tracking by detection algorithm)作者答疑

    If you fail to install and run this tracker, please email me (zhangyunhua@mail.dlut.edu.cn) Introduc ...

  4. anomaly detection algorithm

    anomaly detection algorithm 以上就是异常监测算法流程

  5. Floyd判圈算法 Floyd Cycle Detection Algorithm

    2018-01-13 20:55:56 Floyd判圈算法(Floyd Cycle Detection Algorithm),又称龟兔赛跑算法(Tortoise and Hare Algorithm) ...

  6. Floyd's Cycle Detection Algorithm

    Floyd's Cycle Detection Algorithm http://www.siafoo.net/algorithm/10 改进版: http://www.siafoo.net/algo ...

  7. 从时序异常检测(Time series anomaly detection algorithm)算法原理讨论到时序异常检测应用的思考

    1. 主要观点总结 0x1:什么场景下应用时序算法有效 历史数据可以被用来预测未来数据,对于一些周期性或者趋势性较强的时间序列领域问题,时序分解和时序预测算法可以发挥较好的作用,例如: 四季与天气的关 ...

  8. 个性探测综述阅读笔记——Recent trends in deep learning based personality detection

    目录 abstract 1. introduction 1.1 个性衡量方法 1.2 应用前景 1.3 伦理道德 2. Related works 3. Baseline methods 3.1 文本 ...

  9. 论文阅读笔记五十二:CornerNet-Lite: Efficient Keypoint Based Object Detection(CVPR2019)

    论文原址:https://arxiv.org/pdf/1904.08900.pdf github:https://github.com/princeton-vl/CornerNet-Lite 摘要 基 ...

随机推荐

  1. VMware虚拟机更换根用户( su: Authentication failure问题)

    su命令不能切换root,提示su: Authentication failure,只要你sudo passwd root过一次之后,下次再su的时候只要输入密码就可以成功登录了.

  2. 【vue】vue依赖安装如vue-router、vue-resource、vuex等

    方式一: 最直接的方式为在 package.json中添加如图依赖配置,然后项目 cnpm install即可 方式二: 根据vue项目的搭建教程,接下来记录下如何在Vue-cli创建的项目中安装vu ...

  3. golden gate 加initial load 在rac 上的配置

    前言goldengate 11g 在oracle 11g rac 上的配置 (源是rac+asm , 目标是单数据库实例) 源端: 1. 配置tnsnames [oracle@rac1 admin]$ ...

  4. 说说MySQL索引

    前言 关于索引,这是一个非常重要的知识点,同样,在面试的时候也会被经常的问到: 本文描述了索引的结构,介绍了InnoDB的索引方案等知识点,感兴趣的可以看一下: 引入 本文参考文章:MySQL的索引 ...

  5. Detect Changes in Network Connectivity

    Some times you will need a mechanism to check whether changes to network occurring during running yo ...

  6. Ajax第一天——入门与基本概念

    什么是Ajax Ajax被认为是(Asynchronous JavaScript and XML的缩写).异步的js和xml 异步和同步:同步->类似打电话,接完一个再接下一个:异步:-> ...

  7. JavaWeb总结(四)

    使用Servlet发送服务器端响应信息 Servlet API中定义一个专门的接口类javax.servlet.http.HttpServletResponse用于创建HTTP响应,包括HTTP协议的 ...

  8. 一个java程序员该具备的工具

    http://itindex.net/detail/37115-java-%E7%A8%8B%E5%BA%8F%E5%91%98-%E5%B7%A5%E5%85%B7 文章很详细.

  9. 洛咕 P3961 [TJOI2013]黄金矿工

    甚至都不是树形背包= = 把每条线抠出来,这一条线就是个链的依赖关系,随便背包一下 // luogu-judger-enable-o2 #include<bits/stdc++.h> #d ...

  10. cogs1685 【NOI2014】魔法森林 Link-Cut Tree

    LCT练手好题啊. SPFA的做♂FA是把边按照a排序,然后加一条权值为b的边跑SPFA,不断更新答案.很好的做♂FA,但复杂度无♂FA保证. LCT的做♂FA类似,也是把边按照a排序,然后也是加一条 ...