The current information explosion has produced a growing number of applications that must deal with large volumes of data. Much of this data is redundant, especially in mass media and web crawling/analytics, and the redundancy wastes precious resources (power, bandwidth, CPU, storage, etc.). This has led to increased interest in algorithms that process the input data in restricted ways.

But traditional hash algorithms have two problems. First, they assume that the data fits in main memory, which is unreasonable when dealing with massive data such as multimedia data or web crawler/analytics repositories. Second, a traditional hash can only identify identical data. This brings to light the importance of SimHash.

 

SimHash has 5 steps: Tokenize, Hash, Weigh Values, Merge, Dimensionality Reduction

  • tokenize

    • Tokenize your data and assign a weight to each token. The weights and the tokenize function depend on your business needs.

  • hash (e.g. MD5, SHA-1)

    • Calculate each token's hash value and convert it to binary, e.g. (101011).

  • weigh values

    • For each bit of a token's hash, replace 1 with +w and 0 with -w, where w is the token's weight: (101011) -> (w,-w,w,-w,w,w)

  • merge

    • Add up the tokens' weighted values position by position to merge them into a single vector V. For example, merging (4 -4 -4 4 -4 4) and (5 -5 5 -5 5 5) gives (4+5 -4+-5 -4+5 4+-5 -4+5 4+5), which is (9 -9 1 -1 1 9).

  • Dimensionality Reduction

    • Finally, the signs of the elements of V give the bits of the final fingerprint: a positive element becomes 1 and a negative element becomes 0. For example, (9 -9 1 -1 1 9) -> (1 0 1 0 1 1), so we get 101011 as the fingerprint.
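
The five steps can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names (token_hash, fingerprint_from_hashes, simhash) are hypothetical, MD5 truncated to 64 bits is just one possible choice of hash, and treating a sum of exactly zero as a 0 bit is an assumption the text does not specify. Tokenizing and weighting are left to the caller, since they depend on the application.

```python
import hashlib


def token_hash(token, bits=64):
    """Hash a token with MD5 and keep the top `bits` bits (assumes bits <= 128)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (128 - bits)


def fingerprint_from_hashes(hashes_and_weights, bits=64):
    """Weigh, merge and reduce a list of (hash, weight) pairs into one fingerprint."""
    v = [0] * bits
    for h, w in hashes_and_weights:
        for i in range(bits):
            # Weigh values: a 1 bit contributes +w, a 0 bit contributes -w.
            # Merge: contributions are summed per bit position.
            if (h >> i) & 1:
                v[i] += w
            else:
                v[i] -= w
    fingerprint = 0
    for i in range(bits):
        # Dimensionality reduction: keep only the sign of each sum.
        # Assumption: a sum of exactly 0 is mapped to a 0 bit.
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def simhash(weighted_tokens, bits=64):
    """Compute a fingerprint from (token, weight) pairs supplied by the caller."""
    pairs = [(token_hash(token, bits), w) for token, w in weighted_tokens]
    return fingerprint_from_hashes(pairs, bits)


# The 6-bit worked example: 100101 with weight 4 and 101011 with weight 5.
example = fingerprint_from_hashes([(0b100101, 4), (0b101011, 5)], bits=6)
print(format(example, "06b"))
```

In practice the (token, weight) pairs would come from your own tokenizer, for example words weighted by how often they occur.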

How to use SimHash fingerprints?

Hamming distance can be used to measure the similarity between two pieces of data: calculate the Hamming distance between their two fingerprints.

Based on my experience, for 64-bit SimHash values with carefully chosen weights, the distances between similar data often differ appreciably in magnitude from those between dissimilar data.

How to calculate Hamming distance:

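  XOR: a result bit is 1 only when the two input bits differ, and 0 otherwise. The number of 1 bits in the XOR of the two binary values is the Hamming distance.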
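
A short Python sketch of this calculation, with fingerprints held as plain integers. The helper names hamming_distance and is_similar are hypothetical, and the max_distance threshold is something you would have to tune on your own data; no particular value is implied here.

```python
def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    # XOR leaves a 1 exactly where the two inputs disagree.
    return bin(a ^ b).count("1")


def is_similar(a, b, max_distance):
    """Treat two fingerprints as similar if they differ in at most max_distance bits."""
    return hamming_distance(a, b) <= max_distance


# 101011 XOR 100101 = 001110, which contains three 1 bits.
print(hamming_distance(0b101011, 0b100101))
```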

The SimHash algorithm was introduced by Charikar and is patented by Google.

simhash 0.1.0 : Python Package Index
