https://www.cs.utah.edu/~jeffp/teaching/cs5955/L4-Jaccard+Shingle.pdf

https://www.cs.utah.edu/~jeffp/teaching/cs5955/L5-Minhash.pdf

【Measurable space: convert the data (homeworks, webpages, emails) into an object in an abstract space in which we know how to measure distance】

We will study how to define the distance between sets, specifically with the Jaccard distance. To illustrate and motivate this study, we will focus on using Jaccard distance to measure the distance between documents. This uses the common “bag of words” model, which is simplistic but sufficient for many applications.

We start with some big questions. This lecture will only begin to answer them.

• Given two homework assignments (reports), how can a computer detect whether one is likely to have been plagiarized from the other without understanding the content?
• In trying to index webpages, how does Google avoid listing duplicates or mirrors?
• How does a computer quickly understand emails, either for detecting spam or for placing effective advertisements? (If an ad worked on one email, how can we determine which others are similar?)

【Bag of words: convert text passages into sets (convert documents into sets)】

4.2 Documents to Sets

How do we apply this set machinery to documents?

Bag of words vs. Shingles

The first option is the bag of words model, where each document is treated as an unordered set of words. A more general approach is to shingle the document. This takes consecutive words and groups them as a single object. A k-shingle is a sequence of k consecutive words, so the set of all 1-shingles is exactly the bag of words model. An alternative name for a k-shingle is a k-gram. These mean the same thing.

D1 : I am Sam.
D2 : Sam I am.
D3 : I do not like green eggs and ham.
D4 : I do not like them, Sam I am.

The (k = 1)-shingles of D1∪D2∪D3∪D4 are: {[I], [am], [Sam], [do], [not], [like], [green], [eggs], [and], [ham], [them]}.

The (k = 2)-shingles of D1∪D2∪D3∪D4, with each document shingled on its own so that no shingle spans two documents, are: {[I am], [am Sam], [Sam I], [I do], [do not], [not like], [like green], [green eggs], [eggs and], [and ham], [like them], [them Sam]}.

A document with n words has at most n − k + 1 distinct k-shingles. It takes O(kn) space to store them all. If k is small, this is not a high overhead. Furthermore, the space goes down as items are repeated.

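【Note on the count: a document with n words has n − k + 1 positions at which a k-shingle can start, so it has at most n − k + 1 distinct k-shingles; storing them takes O(kn) space】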
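The shingling step can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `k_shingles` and the tokenization rule (drop punctuation, keep case, split into words) are assumptions made here. Each document is shingled separately, so no shingle spans two documents.

```python
import re

def k_shingles(text, k):
    """Return the set of k-shingles of `text`, each as a tuple of k words."""
    # Assumed tokenization: runs of letters/apostrophes are words; case is kept.
    words = re.findall(r"[A-Za-z']+", text)
    # n words give n - k + 1 starting positions (none if the text is too short).
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

docs = {
    "D1": "I am Sam.",
    "D2": "Sam I am.",
    "D3": "I do not like green eggs and ham.",
    "D4": "I do not like them, Sam I am.",
}

# Union of the shingle sets over all four documents.
all_1 = set().union(*(k_shingles(d, 1) for d in docs.values()))
all_2 = set().union(*(k_shingles(d, 2) for d in docs.values()))

print(len(all_1))  # 11 distinct 1-shingles (the bag of words)
print(len(all_2))  # 12 distinct 2-shingles
```

Storing shingles in a set is what makes repeated items cost nothing extra: D4 repeats [I do], [do not], and [not like] from D3, and the union keeps one copy of each.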

【Jaccard as a measure of similarity: Jaccard with Shingles】

4.3 Jaccard with Shingles

So how do we put this together? Consider the (k = 2)-shingles for each of D1, D2, D3, and D4:

D1 : [I am], [am Sam]
D2 : [Sam I], [I am]
D3 : [I do], [do not], [not like], [like green], [green eggs], [eggs and], [and ham]
D4 : [I do], [do not], [not like], [like them], [them Sam], [Sam I], [I am]

Now the Jaccard similarity is as follows:

JS(D1, D2) = 1/3 ≈ 0.333
JS(D1, D3) = 0 = 0.0
JS(D1, D4) = 1/8 = 0.125
JS(D2, D3) = 0 = 0.0
JS(D2, D4) = 2/7 ≈ 0.286
JS(D3, D4) = 3/11 ≈ 0.273

Next time we will see how to use this special abstract structure of sets to compute this distance (approximately) very efficiently and at extremely large scale.
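These values can be checked directly from the definition JS(A, B) = |A ∩ B| / |A ∪ B|, with the Jaccard distance being 1 − JS(A, B). The sketch below is an illustration written for these notes, not code from the lecture; the function name `jaccard_similarity` is hypothetical, and returning 1.0 for two empty sets is a convention assumed here.

```python
from itertools import combinations

def jaccard_similarity(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two sets."""
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)

# The (k = 2)-shingle sets of the four example documents.
shingles = {
    "D1": {"I am", "am Sam"},
    "D2": {"Sam I", "I am"},
    "D3": {"I do", "do not", "not like", "like green",
           "green eggs", "eggs and", "and ham"},
    "D4": {"I do", "do not", "not like", "like them",
           "them Sam", "Sam I", "I am"},
}

# Print similarity and distance for every pair of documents.
for (name_a, set_a), (name_b, set_b) in combinations(shingles.items(), 2):
    js = jaccard_similarity(set_a, set_b)
    print(f"JS({name_a}, {name_b}) = {js:.3f}, distance = {1 - js:.3f}")
```

For example, D2 and D4 share [Sam I] and [I am], and their union has 2 + 7 − 2 = 7 shingles, which gives 2/7.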
