https://www.cs.utah.edu/~jeffp/teaching/cs5955/L4-Jaccard+Shingle.pdf

https://www.cs.utah.edu/~jeffp/teaching/cs5955/L5-Minhash.pdf

【可测空间  convert the data (homeworks, webpages, emails) into an object in an abstract space that we know how to measure distance 】

We will study how to define the distance between sets, specifically with the Jaccard distance. To illustrate and motivate this study, we will focus on using Jaccard distance to measure the distance between documents. This uses the common “bag of words” model, which is simplistic, but is sufficient for many applications. We start with some big questions. This lecture will only begin to answer them. • Given two homework assignments (reports) how can a computer detect if one is likely to have been plagiarized from the other without understanding the content? • In trying to index webpages, how does Google avoid listing duplicates or mirrors? • How does a computer quickly understand emails, for either detecting spam or placing effective advertisers? (If an ad worked on one email, how can we determine which others are similar?)

【词带将文本段落转化为数值集合 convert documents into sets】

4.2 Documents to Sets How do we apply this set machinery to documents? Bag of words vs. Shingles The first option is the bag of words model, where each document is treated as an unordered set of words. A more general approach is to shingle the document. This takes consecutive words and group them as a single object. A k-shingle is a consecutive set of k words. So the set of all 1-shingles is exactly the bag of words model. An alternative name to k-shingle is an k-gram. These mean the same thing. D1 : I am Sam. D2 : Sam I am. D3 : I do not like green eggs and ham. D4 : I do not like them, Sam I am. The (k = 1)-shingles of D1∪D2∪D3∪D4 are: {[I], [am], [Sam], [do], [not], [like], [green], [eggs], [and], [ham], [them]}.

The (k = 2)-shingles of D1∪D2∪D3∪D4 are: {[I am], [am Sam], [Sam Sam], [Sam I], [am I], [I do], [do not], [not like], [like green], [green eggs], [eggs and], [and ham], [like them], [them Sam]}. The set of k-shingles of a document with n words is at most n − k. The takes space O(kn) to store them all. If k is small, this is not a high overhead. Furthermore, the space goes down as items are repeated.

The set of k-shingles of a document with n words is at most n − k. The takes space O(kn) to store them all. If k is small, this is not a high overhead. Furthermore, the space goes down as items are repeated.

【勘误--k n n-k+1  空间复杂度 space O(kn) 】

【Jaccard 对相似度的度量 Jaccard with Shingles】

4.3 Jaccard with Shingles So how do we put this together. Consider the (k = 2)-shingles for each D1, D2, D3, and D4: D1 : [I am], [am Sam] D2 : [Sam I], [I am] D3 : [I do], [do not], [not like], [like green], [green eggs], [eggs and], [and ham] D4 : [I do], [do not], [not like], [like them], [them Sam], [Sam I], [I am]

Now the Jaccard similarity is as follows: JS(D1, D2) = 1/3 ≈ 0.333 JS(D1, D3) = 0 = 0.0 JS(D1, D4) = 1/8 = 0.125 JS(D2, D3) = 0 = 0.0 JS(D3, D4) = 2/7 ≈ 0.286 JS(D3, D4) = 3/11 ≈ 0.273 Next time we will see how to use this special abstract structure of sets to compute this distance (approximately) very efficiently and at extremely large scale.

Jaccard Similarity and Shingling的更多相关文章

  1. jaccard similarity coefficient 相似度计算

    Jaccard index From Wikipedia, the free encyclopedia     The Jaccard index, also known as the Jaccard ...

  2. Jaccard similarity(杰卡德相似度)和Abundance correlation(丰度相关性)

    杰卡德距离(Jaccard Distance) 是用来衡量两个集合差异性的一种指标,它是杰卡德相似系数的补集,被定义为1减去Jaccard相似系数.而杰卡德相似系数(Jaccard similarit ...

  3. 基于jaccard相似度的LSH

    使用Python通过LSH建立推荐引擎 LSH:一个可以用来处理成百上千行的算法 前提: Python 基础 Pandas 学完本教程之后,解锁成就: 通过建立shingles 为LSH准备训练集和测 ...

  4. 机器学习中的相似性度量(Similarity Measurement)

    机器学习中的相似性度量(Similarity Measurement) 在做分类时常常需要估算不同样本之间的相似性度量(Similarity Measurement),这时通常采用的方法就是计算样本间 ...

  5. 相似性度量(Similarity Measurement)与“距离”(Distance)

    在做分类时常常需要估算不同样本之间的相似性度量(Similarity Measurement),这时通常采用的方法就是计算样本间的“距离”(Distance).采用什么样的方法计算距离是很讲究,甚至关 ...

  6. 相似性分析之Jaccard相似系数

    Jaccard, 又称为Jaccard相似系数(Jaccard similarity coefficient)用于比较有限样本集之间的相似性与差异性.Jaccard系数值越大,样本相似度越高 公式: ...

  7. Dice Similarity Coefficent vs. IoU Dice系数和IoU

    Dice Similarity Coefficent vs. IoU Several readers emailed regarding the segmentation performance of ...

  8. 相似系数_杰卡德距离(Jaccard Distance)

    python机器学习-乳腺癌细胞挖掘(博主亲自录制视频)https://study.163.com/course/introduction.htm?courseId=1005269003&ut ...

  9. 海量数据挖掘MMDS week2: 局部敏感哈希Locality-Sensitive Hashing, LSH

    http://blog.csdn.net/pipisorry/article/details/48858661 海量数据挖掘Mining Massive Datasets(MMDs) -Jure Le ...

随机推荐

  1. Timeout watchdog using a standby thread

    http://codereview.stackexchange.com/questions/84697/timeout-watchdog-using-a-standby-thread he simpl ...

  2. getenv, _wgetenv

    Description The C library function char *getenv(const char *name) searches for the environment strin ...

  3. popcount 算法分析

    转载: http://blog.csdn.net/gaochao1900/article/details/5646211 http://www.cnblogs.com/Martinium/archiv ...

  4. Storyboards Tutorial 04

    设计好后运行发现没有任何变化,是空白的.这是因为你的tableview相关的delegate方法还在.所以首先要屏蔽或者删除在PlayerDetailsViewController.m 如下的操作 # ...

  5. recovery怎么刷机,recovery是什么意思

    转自:http://www.3lian.com/edu/2012/04-11/25212.html Recovery是什么意思? recovery翻译过来就是“恢复”的意思,是开机后通过特殊按键组合( ...

  6. xss---攻击

    xss表示Cross Site Scripting(跨站脚本攻击),它与SQL注入攻击类似,SQL注入攻击中以SQL语句作为用户输入,从而达到查询/修改/删除数据的目的,而在xss攻击中,通过插入恶意 ...

  7. json字符串调整

    碰到比较长的json字符串,不知道哪里出错时,可以找一个正确的json字符串,慢慢把它调整到需要的形式,而不是去分析,字符串太长,一直看,效率太慢,容易看花眼.

  8. AngularJS中选择样式

    代码下载:https://files.cnblogs.com/files/xiandedanteng/angularJSSelectClass.rar 要点,{{ctrl.name}}比<spa ...

  9. [集合]解决system权限3389无法添加的用户情况

    Webshell有了SYSTEM权限,却无法成功添加administrators用户,因此导致无法成功连接3389.总结原因有以下几点:I.杀软篇1,360杀毒软件2,麦咖啡杀毒软件3,卡巴斯基杀毒软 ...

  10. distinct 与order by 一起用

    distinct 后面不要直接跟Order by , 如果要用在子查询用用order by .