Mining Massive Datasets (MMDS) week 2: Distance Measures for LSH
http://blog.csdn.net/pipisorry/article/details/48882167
Study notes for the Mining Massive Datasets (MMDS) course by Jure Leskovec: distance measures for locality-sensitive hashing (LSH)
Distance Measures
{There are many notions of similarity or distance beyond Jaccard similarity, and which one to use depends on the type of data we have and on what we mean by "similar". Besides, it is possible to combine hash functions from a family to get the S-curve effect that we saw for LSH applied to min-hash matrices. In fact, the construction is essentially the same for any LSH family. We'll conclude this unit by seeing some particular LSH families and how they work for the cosine distance and Euclidean distance.}
Euclidean distance vs. non-Euclidean distance
Note: a Euclidean space is dense: given any two points, their average is also a point in the space. In a non-Euclidean space there may be no reasonable notion of the average of points. In short, an average can always be computed in a Euclidean space, but not necessarily in a non-Euclidean one.
Axioms of Distance Measures
Properties that a distance measure must satisfy
Note: iff = if and only if [Common Latin abbreviations in English-language literature (the most common ones in red)]
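The axioms themselves are not spelled out in the text above. The sketch below assumes the standard four: d(x, y) >= 0, d(x, y) = 0 iff x = y, d(x, y) = d(y, x), and the triangle inequality d(x, y) <= d(x, z) + d(z, y). It checks them by brute force on a small, made-up set of points. The helper name check_axioms and the sample points are illustrative only, and passing the check on a finite sample is evidence, not a proof.

```python
from itertools import product

def check_axioms(points, d, tol=1e-12):
    """Return True if d satisfies the four distance axioms on all given points."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:                      # non-negativity
            return False
        if (d(x, y) <= tol) != (x == y):     # d(x, y) = 0 iff x = y
            return False
        if abs(d(x, y) - d(y, x)) > tol:     # symmetry
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:  # triangle inequality
            return False
    return True

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def squared_euclidean(p, q):
    # Not a distance measure: it violates the triangle inequality.
    return sum((a - b) ** 2 for a, b in zip(p, q))

points = [(0, 0), (1, 0), (2, 0), (0, 3)]
print(check_axioms(points, manhattan))          # True
print(check_axioms(points, squared_euclidean))  # False
```

The squared Euclidean distance fails because going from (0, 0) to (2, 0) costs 4 directly but only 1 + 1 = 2 through (1, 0).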
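Euclidean distance
Note: norms:
Given a vector x = (x1, x2, ..., xn):
L1 norm: the sum of the absolute values of the components; the distance it induces is the Manhattan distance.
L2 norm: the square root of the sum of the squared components, also called the Euclidean norm; the distance it induces is the Euclidean distance.
Lp norm: the sum of the p-th powers of the absolute values of the components, raised to the power 1/p.
L∞ norm: the largest absolute value among the components.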
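The norm definitions above can be sketched in a few lines of plain Python. The function names lp_norm, linf_norm and lp_distance are illustrative, not from the course material.

```python
def lp_norm(x, p):
    """Lp norm: (sum of |x_i|^p) ** (1/p), for p >= 1."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

def linf_norm(x):
    """L-infinity norm: the largest absolute value among the components."""
    return max(abs(v) for v in x)

def lp_distance(x, y, p):
    """Distance induced by the Lp norm: the norm of the difference vector."""
    return lp_norm([a - b for a, b in zip(x, y)], p)

x = [3, -4]
print(lp_norm(x, 1))   # 7.0  (L1, Manhattan)
print(lp_norm(x, 2))   # 5.0  (L2, Euclidean)
print(linf_norm(x))    # 4
print(lp_norm(x, 50))  # close to 4: Lp approaches L-infinity as p grows
```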
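Non-Euclidean distances
Note:
1. Cosine distance requires points to be vectors. If the vectors have real-valued components, they are essentially points in a Euclidean space. But the vectors could have integer components, in which case the space is not Euclidean.
2. Edit distance can be defined in two ways: one allows directly substituting one character for another; the other allows only insertions and deletions, so replacing a character means deleting it and then inserting the new one.
Non-Euclidean distances and proofs that they satisfy the axioms: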
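Jaccard distance
Note: the proof argues by contradiction: if neither inequality held, i.e. minhash(x) = minhash(z) and minhash(z) = minhash(y), then minhash(x) = minhash(y) would follow.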
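For reference, a minimal sketch of the Jaccard distance, taken here as 1 minus the Jaccard similarity |A ∩ B| / |A ∪ B|. The example sets are made up, and treating two empty sets as distance 0 is an assumption of this sketch.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumption: two empty sets count as identical
    return 1.0 - len(a & b) / len(a | b)

x = {1, 2, 3, 4}
y = {2, 3, 5, 7}
z = {2, 3, 4, 5}
print(jaccard_distance(x, y))  # intersection of size 2, union of size 6
# Triangle inequality on this sample: d(x, y) <= d(x, z) + d(z, y)
print(jaccard_distance(x, y) <= jaccard_distance(x, z) + jaccard_distance(z, y))
```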
Cosine distance
Cosine distance is useful for data in the form of vectors, often in a very high-dimensional space.
Note:
1. The length of a vector from the origin is the ordinary Euclidean distance, which we call the L2 norm.
2. No matter how many dimensions the vectors have, any two lines that intersect lie in a plane, and P1 and P2 do intersect, at the origin.
3. If you project P1 onto P2, the length of the projection is the dot product divided by the length of P2. The cosine of the angle between them is then the ratio of the adjacent side (the dot product divided by the length of P2) to the hypotenuse (the length of P1).
Note: vectors here are really directions, not magnitudes. So two vectors with the same direction and different magnitudes are really the same vector. Even a vector and its negation, the reverse of the vector, ought to be thought of as the same vector.
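The projection argument in note 3 gives cos(theta) = (P1 · P2) / (|P1| |P2|), and the cosine distance is the angle theta itself. A minimal sketch follows; reporting the angle in degrees and the example vectors are choices made here, not taken from the notes.

```python
import math

def cosine_distance(p1, p2):
    """Angle between p1 and p2, in degrees (0 to 180)."""
    dot = sum(a * b for a, b in zip(p1, p2))
    n1 = math.sqrt(sum(a * a for a in p1))  # L2 norm of p1
    n2 = math.sqrt(sum(b * b for b in p2))  # L2 norm of p2
    if n1 == 0 or n2 == 0:
        raise ValueError("cosine distance is undefined for the zero vector")
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_theta = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.degrees(math.acos(cos_theta))

print(cosine_distance([0, 0, 1, 1, 1], [1, 0, 0, 1, 1]))  # about 48.19 (cos = 2/3)
print(cosine_distance([1, 0], [0, 1]))                    # 90.0
print(cosine_distance([1, 2], [2, 4]))                    # about 0: same direction
```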
Edit distance
Definition of a subsequence: one string is a subsequence of another if we can get the first by deleting zero or more positions from the second. The positions of the deleted characters do not have to be consecutive.
Two ways to compute the edit distance between x and y
Note: in the first way the edits can also be applied in reverse: we can get from y to x by doing the same edits in reverse, deleting u and v and then inserting a to get x.
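A sketch of the insert/delete edit distance computed through the longest common subsequence (LCS), using the identity d(x, y) = |x| + |y| - 2|LCS(x, y)|. The strings x = "abcde" and y = "bcduve" are assumed here because they agree with the note above (delete u and v, insert a); the function names are illustrative.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def edit_distance(x, y):
    """Edit distance when only insertions and deletions are allowed."""
    return len(x) + len(y) - 2 * lcs_length(x, y)

print(lcs_length("abcde", "bcduve"))     # 4  (the LCS is "bcde")
print(edit_distance("abcde", "bcduve"))  # 3  (one insertion, two deletions)
```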
Hamming distance
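The Hamming distance between two equal-length vectors is the number of positions in which they differ. A minimal sketch with made-up bit strings:

```python
def hamming_distance(u, v):
    """Number of positions at which two equal-length sequences differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(u, v))

print(hamming_distance("10101", "10011"))  # 2 (positions 3 and 4 differ)
```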
Review
Note: distance matrix (the entries are consistent with the insert/delete edit distance)
       he   she  his  hers
he          1    3    2
she              4    3
his                   3
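The matrix can be checked mechanically. The sketch below assumes the entries are insert/delete edit distances and recomputes every pair with a memoized LCS recursion; the variable names are illustrative.

```python
from functools import lru_cache
from itertools import combinations

def edit_distance(x, y):
    """Insert/delete edit distance: |x| + |y| - 2 * |LCS(x, y)|."""
    @lru_cache(maxsize=None)
    def lcs(i, j):
        # Length of the LCS of the suffixes x[i:] and y[j:].
        if i == len(x) or j == len(y):
            return 0
        if x[i] == y[j]:
            return 1 + lcs(i + 1, j + 1)
        return max(lcs(i + 1, j), lcs(i, j + 1))
    return len(x) + len(y) - 2 * lcs(0, 0)

words = ["he", "she", "his", "hers"]
matrix = {(a, b): edit_distance(a, b) for a, b in combinations(words, 2)}
for (a, b), dist in matrix.items():
    print(a, b, dist)
```

Under that assumption the six computed values match the upper triangle of the matrix.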
from:http://blog.csdn.net/pipisorry/article/details/48882167
ref: Distance and similarity measures