Jaccard Similarity and Shingling
https://www.cs.utah.edu/~jeffp/teaching/cs5955/L4-Jaccard+Shingle.pdf
https://www.cs.utah.edu/~jeffp/teaching/cs5955/L5-Minhash.pdf
【可测空间 convert the data (homeworks, webpages, emails) into an object in an abstract space that we know how to measure distance 】
We will study how to define the distance between sets, specifically with the Jaccard distance. To illustrate and motivate this study, we will focus on using Jaccard distance to measure the distance between documents. This uses the common “bag of words” model, which is simplistic, but is sufficient for many applications. We start with some big questions. This lecture will only begin to answer them. • Given two homework assignments (reports) how can a computer detect if one is likely to have been plagiarized from the other without understanding the content? • In trying to index webpages, how does Google avoid listing duplicates or mirrors? • How does a computer quickly understand emails, for either detecting spam or placing effective advertisers? (If an ad worked on one email, how can we determine which others are similar?)
【词带将文本段落转化为数值集合 convert documents into sets】
4.2 Documents to Sets How do we apply this set machinery to documents? Bag of words vs. Shingles The first option is the bag of words model, where each document is treated as an unordered set of words. A more general approach is to shingle the document. This takes consecutive words and group them as a single object. A k-shingle is a consecutive set of k words. So the set of all 1-shingles is exactly the bag of words model. An alternative name to k-shingle is an k-gram. These mean the same thing. D1 : I am Sam. D2 : Sam I am. D3 : I do not like green eggs and ham. D4 : I do not like them, Sam I am. The (k = 1)-shingles of D1∪D2∪D3∪D4 are: {[I], [am], [Sam], [do], [not], [like], [green], [eggs], [and], [ham], [them]}.
The (k = 2)-shingles of D1∪D2∪D3∪D4 are: {[I am], [am Sam], [Sam Sam], [Sam I], [am I], [I do], [do not], [not like], [like green], [green eggs], [eggs and], [and ham], [like them], [them Sam]}. The set of k-shingles of a document with n words is at most n − k. The takes space O(kn) to store them all. If k is small, this is not a high overhead. Furthermore, the space goes down as items are repeated.
The set of k-shingles of a document with n words is at most n − k. The takes space O(kn) to store them all. If k is small, this is not a high overhead. Furthermore, the space goes down as items are repeated.
【勘误--k n n-k+1 空间复杂度 space O(kn) 】
【Jaccard 对相似度的度量 Jaccard with Shingles】
4.3 Jaccard with Shingles So how do we put this together. Consider the (k = 2)-shingles for each D1, D2, D3, and D4: D1 : [I am], [am Sam] D2 : [Sam I], [I am] D3 : [I do], [do not], [not like], [like green], [green eggs], [eggs and], [and ham] D4 : [I do], [do not], [not like], [like them], [them Sam], [Sam I], [I am]
Now the Jaccard similarity is as follows: JS(D1, D2) = 1/3 ≈ 0.333 JS(D1, D3) = 0 = 0.0 JS(D1, D4) = 1/8 = 0.125 JS(D2, D3) = 0 = 0.0 JS(D3, D4) = 2/7 ≈ 0.286 JS(D3, D4) = 3/11 ≈ 0.273 Next time we will see how to use this special abstract structure of sets to compute this distance (approximately) very efficiently and at extremely large scale.
Jaccard Similarity and Shingling的更多相关文章
- jaccard similarity coefficient 相似度计算
Jaccard index From Wikipedia, the free encyclopedia The Jaccard index, also known as the Jaccard ...
- Jaccard similarity(杰卡德相似度)和Abundance correlation(丰度相关性)
杰卡德距离(Jaccard Distance) 是用来衡量两个集合差异性的一种指标,它是杰卡德相似系数的补集,被定义为1减去Jaccard相似系数.而杰卡德相似系数(Jaccard similarit ...
- 基于jaccard相似度的LSH
使用Python通过LSH建立推荐引擎 LSH:一个可以用来处理成百上千行的算法 前提: Python 基础 Pandas 学完本教程之后,解锁成就: 通过建立shingles 为LSH准备训练集和测 ...
- 机器学习中的相似性度量(Similarity Measurement)
机器学习中的相似性度量(Similarity Measurement) 在做分类时常常需要估算不同样本之间的相似性度量(Similarity Measurement),这时通常采用的方法就是计算样本间 ...
- 相似性度量(Similarity Measurement)与“距离”(Distance)
在做分类时常常需要估算不同样本之间的相似性度量(Similarity Measurement),这时通常采用的方法就是计算样本间的“距离”(Distance).采用什么样的方法计算距离是很讲究,甚至关 ...
- 相似性分析之Jaccard相似系数
Jaccard, 又称为Jaccard相似系数(Jaccard similarity coefficient)用于比较有限样本集之间的相似性与差异性.Jaccard系数值越大,样本相似度越高 公式: ...
- Dice Similarity Coefficent vs. IoU Dice系数和IoU
Dice Similarity Coefficent vs. IoU Several readers emailed regarding the segmentation performance of ...
- 相似系数_杰卡德距离(Jaccard Distance)
python机器学习-乳腺癌细胞挖掘(博主亲自录制视频)https://study.163.com/course/introduction.htm?courseId=1005269003&ut ...
- 海量数据挖掘MMDS week2: 局部敏感哈希Locality-Sensitive Hashing, LSH
http://blog.csdn.net/pipisorry/article/details/48858661 海量数据挖掘Mining Massive Datasets(MMDs) -Jure Le ...
随机推荐
- python 设计模式之门面模式
facade:建筑物的表面 门面模式是一个软件工程设计模式,主要用于面向对象编程. 一个门面可以看作是为大段代码提供简单接口的对象,就像类库. 门面模式被归入建筑设计模式.门面模式隐藏系统内部的细 ...
- Ansible进阶之企业级应用
1.环境 cat /etc/hosts 127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 ::1 ...
- Elasticsearch本地环境安装和常用操作
本篇文章首发于我的头条号Elasticsearch本地环境安装和常用操作,欢迎关注我的头条号和微信公众号"大数据技术和人工智能"(微信搜索bigdata_ai_tech)获取更多干 ...
- Beginning Auto Layout Tutorial in iOS 7: Part 6
Gallery example 屏幕有四个分开的相同的矩形,每个矩形有一个label和一个image view.创建一个Gallery的项目.在Main.storyboard中,拖拉一个view大小为 ...
- Redis 架构设计
1.设计层面 (1) 存储小而热的数据 (2) 结合业务数据特点,正确使用内存类型 (3) 冷.热数据分离 2.架构层面 (1) 提前做好容量(内存)规划 (2) 结合持久化模式优劣正确使用,一般建议 ...
- Runtime.getRuntime().exec()----记录日志案例
Runtime.getRuntime().exec()方法主要用于运行外部的程序或命令. Runtime.getRuntime().exec共同拥有六个重载方法: 1.public Process e ...
- Android学习笔记(24):进度条组件ProgressBar及其子类
ProgressBar作为进度条组件使用,它还派生了SeekBar(拖动条)和RatingBar(星级评分条). ProgressBar支持的XML属性: Attribute Name Related ...
- Nginx 一些常用的URL 重写方法
url重写应该不陌生,不管是SEO URL 伪静态的需要,还是在非常流行的wordpress里,重写无处不在. 1. 在 Apache 的写法 RewriteCond %{HTTP_HOST} n ...
- 怎么样自己动手写OS
虽然我现在并不是从事内核方向,却本着探索计算机本质的想法学习的内核,自从写完这个内核以后真的发现对很多东西的理解都更深一层,所以专研内核,对我现在的工作是很有帮助的.我个人强烈建议师弟师妹们尽早地啃一 ...
- 基于友善之臂ARM-tiny4412--uboot源代码分析
/* * armboot - Startup Code for OMAP3530/ARM Cortex CPU-core * * Copyright (c) 2004 Texas Instrument ...