Repost: Locality Sensitive Hashing

Motivation
The task of finding nearest neighbours is very common. Think of applications like finding duplicate or similar documents, or audio/video search. Using brute force to check all possible combinations will give you the exact nearest neighbour, but it's not scalable at all. Approximate algorithms for this task have been an area of active research. Although these algorithms don't guarantee the exact answer, more often than not they provide a good approximation, and they are faster and scalable.
Locality sensitive hashing (LSH) is one such algorithm. LSH has many applications, including:

- Near-duplicate detection: LSH is commonly used to deduplicate large quantities of documents, webpages, and other files.
- Genome-wide association study: Biologists often use LSH to identify similar gene expressions in genome databases.
- Large-scale image search: Google used LSH along with PageRank to build their image search technology VisualRank.
- Audio/video fingerprinting: In multimedia technologies, LSH is widely used as a fingerprinting technique for A/V data.
In this blog, we’ll try to understand the workings of this algorithm.
General Idea

LSH refers to a family of functions (known as LSH families) to hash data points into buckets so that data points near each other are located in the same buckets with high probability, while data points far from each other are likely to be in different buckets. This makes it easier to identify observations with various degrees of similarity.
Finding similar documents
Let’s try to understand how we can leverage LSH in solving an actual problem. The problem that we’re trying to solve:
Goal: You have been given a large collection of documents. You want to find "near duplicate" pairs.
In the context of this problem, we can break down the LSH algorithm into 3 broad steps:
- Shingling
- Min hashing
- Locality-sensitive hashing

Shingling
In this step, we convert each document into a set of substrings of length k (also known as k-shingles or k-grams). The key idea is to represent each document in our collection as a set of k-shingles.

For example, suppose one of your documents (D) is "Nadal". If we're interested in 2-shingles, then our set is {Na, ad, da, al}. Similarly, the set of 3-shingles is {Nad, ada, dal} (see the code sketch after the list below).
- Similar documents are more likely to share more shingles
- Reordering paragraphs in a document or changing a few words doesn't have much effect on shingles
- A k value of 8–10 is generally used in practice. A small value will result in shingles that are present in most of the documents (bad for differentiating documents)
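Here's a minimal Python sketch of character-level shingling, matching the "Nadal" example above (the function name is just for illustration):

```python
def shingles(doc: str, k: int) -> set:
    """Return the set of k-shingles (length-k substrings) of a document."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

print(shingles("Nadal", 2))  # {'Na', 'ad', 'da', 'al'} (set, order may vary)
print(shingles("Nadal", 3))  # {'Nad', 'ada', 'dal'}
```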
Jaccard Index
We’ve a representation of each document in the form of shingles. Now, we need a metric to measure similarity between documents. Jaccard Index is a good choice for this. Jaccard Index between document A & B can be defined as:

It’s also known as intersection over union (IOU).
A: {Na, ad, da, al} and B: {Na, ad, di, ia}.

Jaccard Index = 2/6
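A one-line Python version of this metric, continuing the example above (names are illustrative):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard Index: size of the intersection over size of the union."""
    return len(a & b) / len(a | b)

A = {"Na", "ad", "da", "al"}
B = {"Na", "ad", "di", "ia"}
print(jaccard(A, B))  # 2/6 ≈ 0.3333
```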
Let’s discuss 2 big issues that we need to tackle:
Time complexity
Now you may be thinking that we can stop here. But if you think about scalability, doing just this won't work. For a collection of n documents, you need to do n*(n-1)/2 comparisons, i.e. O(n²). If you have 1 million documents, the number of comparisons will be about 5*10¹¹ (not scalable at all!).
Space complexity
The document-shingle matrix is a sparse matrix, and storing it as it is would be a big memory overhead. One way to solve this is hashing.
Hashing
The idea of hashing is to convert each document to a small signature using a hashing function H. Suppose a document in our corpus is denoted by d. Then:
- H(d) is the signature and it’s small enough to fit in memory
- If similarity(d1, d2) is high then Probability(H(d1) == H(d2)) is high
- If similarity(d1, d2) is low then Probability(H(d1) == H(d2)) is low
The choice of hashing function is tightly linked to the similarity metric we're using. For Jaccard similarity, the appropriate hashing function is min-hashing.
Min hashing
This is the critical and most magical aspect of this algorithm, so pay attention:
Step 1: Generate a random permutation (π) of the row indices of the document-shingle matrix.
Step 2: The hash value of a column C is the index of the first row (in the permuted order) in which column C has a 1. Do this several times (using different permutations) to create the signature of a column.
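To make the two steps concrete, here is a minimal Python sketch, assuming each document column is represented as the set of row indices (shingle ids) where it has a 1; all names are illustrative:

```python
import random

def minhash_signatures(docs, num_hashes=200, seed=42):
    """Compute a min-hash signature (list of ints) for each document.

    docs: list of sets; each set holds the row indices (shingle ids)
    where that document's column in the shingle matrix has a 1.
    """
    rng = random.Random(seed)
    n_rows = max(max(d) for d in docs) + 1
    signatures = [[] for _ in docs]
    for _ in range(num_hashes):
        perm = list(range(n_rows))
        rng.shuffle(perm)  # Step 1: a random permutation of the rows
        for doc, sig in zip(docs, signatures):
            # Step 2: index of the first permuted row where the column has a 1
            sig.append(min(perm[row] for row in doc))
    return signatures

# Two documents sharing 2 of 6 distinct shingles: true Jaccard = 2/6
docs = [{0, 1, 2, 3}, {0, 1, 4, 5}]
s1, s2 = minhash_signatures(docs)
agreement = sum(a == b for a, b in zip(s1, s2)) / len(s1)
print(agreement)  # ≈ 0.33: the fraction of agreeing signature positions
                  # estimates the Jaccard Index
```

The key property this demonstrates: under a random permutation, the probability that two columns get the same min-hash value equals their Jaccard similarity, so the fraction of matching positions in two signatures is an unbiased estimate of it.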