https://www.cs.utah.edu/~jeffp/teaching/cs5955/L4-Jaccard+Shingle.pdf

https://www.cs.utah.edu/~jeffp/teaching/cs5955/L5-Minhash.pdf

【Measurable space: convert the data (homeworks, webpages, emails) into an object in an abstract space in which we know how to measure distance】

We will study how to define the distance between sets, specifically with the Jaccard distance. To illustrate and motivate this study, we will focus on using Jaccard distance to measure the distance between documents. This uses the common “bag of words” model, which is simplistic but sufficient for many applications.

We start with some big questions. This lecture will only begin to answer them.

• Given two homework assignments (reports), how can a computer detect if one is likely to have been plagiarized from the other without understanding the content?
• In trying to index webpages, how does Google avoid listing duplicates or mirrors?
• How does a computer quickly understand emails, for either detecting spam or placing effective advertisements? (If an ad worked on one email, how can we determine which others are similar?)

【Bag of words turns text passages into sets: convert documents into sets】

4.2 Documents to Sets

How do we apply this set machinery to documents?

Bag of words vs. Shingles

The first option is the bag of words model, where each document is treated as an unordered set of words. A more general approach is to shingle the document. This takes consecutive words and groups them as a single object. A k-shingle is a sequence of k consecutive words, so the set of all 1-shingles is exactly the bag of words model. An alternative name for a k-shingle is a k-gram; the two mean the same thing.

D1 : I am Sam.
D2 : Sam I am.
D3 : I do not like green eggs and ham.
D4 : I do not like them, Sam I am.

The (k = 1)-shingles of D1∪D2∪D3∪D4 are: {[I], [am], [Sam], [do], [not], [like], [green], [eggs], [and], [ham], [them]}.

The (k = 2)-shingles of D1∪D2∪D3∪D4 are: {[I am], [am Sam], [Sam I], [I do], [do not], [not like], [like green], [green eggs], [eggs and], [and ham], [like them], [them Sam]}.

A document with n words has at most n − k + 1 distinct k-shingles, one starting at each of the first n − k + 1 word positions. Storing them all takes O(kn) space. If k is small, this is not a high overhead. Furthermore, the space goes down as items are repeated.
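The shingling step described above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the helper names `tokenize` and `k_shingles` are hypothetical, and the tokenizer assumes that punctuation is dropped and letter case is kept. Shingles are computed within each document, so no shingle spans a document boundary.

```python
import re

def tokenize(text):
    """Split text into word tokens, dropping punctuation (assumption: case is kept)."""
    return re.findall(r"[A-Za-z']+", text)

def k_shingles(text, k):
    """Return the set of k-shingles of text, each a tuple of k consecutive words."""
    words = tokenize(text)
    # A document with n words has n - k + 1 windows of length k (none if n < k).
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

docs = {
    "D1": "I am Sam.",
    "D2": "Sam I am.",
    "D3": "I do not like green eggs and ham.",
    "D4": "I do not like them, Sam I am.",
}

# Union of the per-document shingle sets for k = 1 and k = 2.
all_1 = set().union(*(k_shingles(d, 1) for d in docs.values()))
all_2 = set().union(*(k_shingles(d, 2) for d in docs.values()))
print(len(all_1), len(all_2))
```

Storing shingles in a Python `set` reflects the remark that space goes down as items are repeated: a shingle that occurs several times is kept only once.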

【Erratum: a document with n words has at most n − k + 1 k-shingles (not n − k); the space complexity is O(kn)】

【Jaccard as a measure of similarity: Jaccard with Shingles】

4.3 Jaccard with Shingles

So how do we put this together? Consider the (k = 2)-shingles for each of D1, D2, D3, and D4:

D1 : [I am], [am Sam]
D2 : [Sam I], [I am]
D3 : [I do], [do not], [not like], [like green], [green eggs], [eggs and], [and ham]
D4 : [I do], [do not], [not like], [like them], [them Sam], [Sam I], [I am]

Now the Jaccard similarity is as follows:

JS(D1, D2) = 1/3 ≈ 0.333
JS(D1, D3) = 0
JS(D1, D4) = 1/8 = 0.125
JS(D2, D3) = 0
JS(D2, D4) = 2/7 ≈ 0.286
JS(D3, D4) = 3/11 ≈ 0.273

Next time we will see how to use this special abstract structure of sets to compute this distance (approximately) very efficiently and at extremely large scale.
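The pairwise values can be checked with a short Python sketch that uses the standard definition JS(A, B) = |A ∩ B| / |A ∪ B|. The function name `jaccard_similarity` is hypothetical, the shingle sets are copied by hand from the list for D1 to D4, and returning 0 for two empty sets is a convention assumed here, not something the notes specify.

```python
from fractions import Fraction
from itertools import combinations

def jaccard_similarity(a, b):
    """Return |A ∩ B| / |A ∪ B| as an exact fraction.

    Assumption: the similarity of two empty sets is taken to be 0.
    """
    union = a | b
    if not union:
        return Fraction(0)
    return Fraction(len(a & b), len(union))

# (k = 2)-shingles of each document, written as plain strings.
shingles = {
    "D1": {"I am", "am Sam"},
    "D2": {"Sam I", "I am"},
    "D3": {"I do", "do not", "not like", "like green",
           "green eggs", "eggs and", "and ham"},
    "D4": {"I do", "do not", "not like", "like them",
           "them Sam", "Sam I", "I am"},
}

# Print the similarity of every unordered pair of documents.
for (name_a, set_a), (name_b, set_b) in combinations(shingles.items(), 2):
    js = jaccard_similarity(set_a, set_b)
    print(f"JS({name_a}, {name_b}) = {js} ≈ {float(js):.3f}")
```

Using `Fraction` keeps the exact ratios (such as 2/7 and 3/11) visible next to their decimal approximations. The corresponding Jaccard distance is 1 minus the similarity.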
