Calculate Similarity — the most relevant Metrics in a Nutshell
— a survey of similarity definitions and computation
Zhang Zhibin 张芷彬
Many data science techniques are based on measuring similarity and dissimilarity between objects. For example, K-Nearest-Neighbors uses similarity to classify new data objects. In unsupervised learning, K-Means is a clustering method which uses Euclidean distance to compute the distance between the cluster centroids and its assigned data points. Recommendation engines use neighborhood-based collaborative filtering methods which identify an individual’s neighbors based on the similarity/dissimilarity to the other users.
I will take a look at the most relevant similarity metrics in practice. Measuring similarity between objects can be performed in a number of ways.
Generally we can divide similarity metrics into two different groups:
1. Similarity Based Metrics:
- Pearson’s correlation
- Spearman’s correlation
- Kendall’s Tau
- Cosine similarity
- Jaccard similarity
2. Distance Based Metrics:
- Euclidean distance
- Manhattan distance
Similarity Based Metrics
Similarity-based methods assign the highest values to the most similar objects, implying that they lie in closer neighborhoods.
Pearson’s Correlation
Correlation is a technique for investigating the relationship between two quantitative, continuous variables, for example, age and blood pressure. Pearson’s correlation coefficient is a measure of the strength and direction of a linear relationship. We calculate this metric for the vectors x and y in the following way:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

where

$$ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i $$
The Pearson’s correlation can take a range of values from -1 to +1. It reaches exactly +1 or -1 only when the relationship is perfectly linear; an increase or decrease that is merely monotonic is not enough.
Implementation in Python
import numpy as np
from scipy.stats import pearsonr
import matplotlib.pyplot as plt

# seed the random number generator
np.random.seed(42)

# prepare data
x = np.random.randn(15)
y = x + np.random.randn(15)

# plot x and y with a fitted regression line
plt.scatter(x, y)
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
plt.xlabel('x')
plt.ylabel('y')
plt.show()

# calculate Pearson's correlation
corr, _ = pearsonr(x, y)
print('Pearsons correlation: %.3f' % corr)
Pearsons correlation: 0.810
Spearman’s Correlation
Spearman’s correlation is what is known as a non-parametric statistic, i.e. a statistic whose distribution does not depend on parameters (statistics that follow normal or binomial distributions are examples of parametric statistics). Very often, non-parametric statistics rank the data instead of taking the original values. This is true for Spearman’s correlation coefficient, which is calculated similarly to Pearson’s correlation. The difference between these metrics is that Spearman’s correlation uses the rank of each value.
To calculate Spearman’s correlation we first need to map each of our data values to ranked data values:

If the raw data are [0, -5, 4, 7], the ranked values will be [2, 1, 3, 4]. We can calculate Spearman’s correlation (assuming no tied ranks) in the following way:

$$ r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} $$

where

$$ d_i = \operatorname{rank}(x_i) - \operatorname{rank}(y_i) $$

is the difference between the ranks of corresponding values and $n$ is the number of observations.
Spearman’s correlation benchmarks monotonic relationships, therefore it can be perfect for relationships that are not linear. It can take a range of values from -1 to +1. The following plot clarifies the difference between Pearson’s and Spearman’s correlation.

[Figure: a monotonic but non-linear dataset, for which Spearman’s correlation is higher than Pearson’s. Source: Wikipedia]
For data exploration, I recommend calculating both Pearson’s and Spearman’s correlation. The comparison of both can result in interesting findings. If S>P (as shown above), it means that we have a monotonic relationship, not a linear relationship. Since linearity simplifies the process of fitting a regression algorithm to the dataset, we might want to modify the non-linear, monotonic data using log-transformation to appear linear.
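The log-transformation trick can be illustrated on a small synthetic example (the exponential relationship below is a made-up toy dataset, chosen only for illustration): Spearman’s correlation is perfect on the monotonic raw data, while Pearson’s only becomes perfect after the transform.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# toy monotonic, non-linear data: y grows exponentially with x
x = np.linspace(1, 10, 50)
y = np.exp(x)

p_raw, _ = pearsonr(x, y)          # well below 1: relationship is not linear
s_raw, _ = spearmanr(x, y)         # exactly 1: perfectly monotonic (S > P)
p_log, _ = pearsonr(x, np.log(y))  # log-transform linearizes the data

print('Pearson raw: %.3f  Spearman raw: %.3f  Pearson after log: %.3f'
      % (p_raw, s_raw, p_log))
```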
Implementation in Python
from scipy.stats import spearmanr
# calculate Spearman's correlation
corr, _ = spearmanr(x, y)
print('Spearmans correlation: %.3f' % corr)
Spearmans correlation: 0.836
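The ranking step underlying this value can be made explicit. A minimal sketch using SciPy’s rankdata (the small arrays are invented examples) reproduces the rank mapping above and checks the no-ties formula against scipy:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

raw = np.array([0, -5, 4, 7])
print(rankdata(raw))  # ranks: [2. 1. 3. 4.]

# with no tied values, Spearman's correlation equals
# 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
d = rankdata(x) - rankdata(y)
n = len(x)
manual = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

corr, _ = spearmanr(x, y)
print(manual, corr)  # both 0.8
```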
Cosine Similarity
The cosine similarity calculates the cosine of the angle between two vectors. In order to calculate the cosine similarity we use the following formula:

$$ \cos(\theta) = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} $$
Recall the cosine function: on the left the red vectors point at different angles and the graph on the right shows the resulting function.

[Figure: red vectors at various angles (left) and the corresponding cosine curve (right)]
Accordingly, the cosine similarity can take on values between -1 and +1. If the vectors point in the exact same direction, the cosine similarity is +1. If the vectors point in opposite directions, the cosine similarity is -1.

The cosine similarity is very popular in text analysis. It is used to determine how similar documents are to one another irrespective of their size. The TF-IDF text analysis technique helps convert the documents into vectors where each value in the vector corresponds to the TF-IDF score of a word in the document. Each word has its own axis, and the cosine similarity then determines how similar the documents are.
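As a hedged sketch of this pipeline (the three toy documents are invented for illustration), scikit-learn’s TfidfVectorizer can be combined with cosine_similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "quantum computing uses qubits",
]

# one TF-IDF vector per document; each axis corresponds to one word
tfidf = TfidfVectorizer().fit_transform(docs)

# pairwise cosine similarities between the document vectors
sim = cosine_similarity(tfidf)
print(sim.round(2))
```

The first two documents share most of their words, so their similarity is high; the third shares no words with the first, so their similarity is exactly zero.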
Implementation in Python
We need to reshape the vectors x and y using .reshape(1, -1) to compute the cosine similarity for a single sample.
from sklearn.metrics.pairwise import cosine_similarity
cos_sim = cosine_similarity(x.reshape(1,-1),y.reshape(1,-1))
print('Cosine similarity: %.3f' % cos_sim)
Cosine similarity: 0.773
Jaccard Similarity
Cosine similarity is for comparing two real-valued vectors, while Jaccard similarity is for comparing two binary vectors (sets):

$$ J(A, B) = \frac{|A \cap B|}{|A \cup B|} $$

In set theory it is often helpful to see a visualization of the formula:

[Figure: Venn diagram of two overlapping sets, with the intersection shown relative to the union]
We can see that the Jaccard similarity divides the size of the intersection by the size of the union of the sample sets.
Both cosine similarity and Jaccard similarity are common metrics for calculating text similarity. Calculating the Jaccard similarity is computationally more expensive, as it matches all the terms of one document against those of another document. The Jaccard similarity turns out to be useful for detecting duplicates.
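A minimal duplicate-detection sketch over word sets (the sentences below are made up; a real system would tokenize more carefully):

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)

doc1 = set("the quick brown fox".split())
doc2 = set("the quick brown dog".split())  # near-duplicate of doc1
doc3 = set("completely unrelated words".split())

print(jaccard(doc1, doc2))  # 3 shared words out of 5 distinct -> 0.6
print(jaccard(doc1, doc3))  # no overlap -> 0.0
```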
Implementation in Python
from sklearn.metrics import jaccard_score
A = [1, 1, 1, 0]
B = [1, 1, 0, 1]
jacc = jaccard_score(A, B)
print('Jaccard similarity: %.3f' % jacc)
Jaccard similarity: 0.500
Distance Based Metrics
Distance-based methods identify the most similar objects as those with the lowest distance values.
Euclidean Distance
The Euclidean distance is a straight-line distance between two vectors.
For the two vectors x and y, this can be computed as follows:

$$ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $$
Compared to cosine and Jaccard similarity, Euclidean distance is not used very often in the context of NLP applications. It is appropriate for continuous numerical variables. Euclidean distance is not scale invariant, therefore it is recommended to scale the data prior to computing the distance. Additionally, Euclidean distance multiplies the effect of redundant information in the dataset: if we have five variables which are heavily correlated and take all five as input, we weight this redundancy effect by five.
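The scale-invariance point can be demonstrated with a made-up two-feature dataset in which income (dollars) dwarfs age (years):

```python
import numpy as np
from scipy.spatial import distance
from sklearn.preprocessing import StandardScaler

# made-up data: [income in dollars, age in years]
X = np.array([[30000.0, 25.0],
              [32000.0, 60.0],
              [70000.0, 26.0]])

# raw distance: the income column dominates, the 35-year age gap barely matters
d_raw = distance.euclidean(X[0], X[1])

# standardize each column to zero mean and unit variance first
X_std = StandardScaler().fit_transform(X)
d_std = distance.euclidean(X_std[0], X_std[1])

print('raw: %.1f  scaled: %.3f' % (d_raw, d_std))
```

On the raw data the distance is essentially the income difference (about 2000), while after standardization both features contribute on a comparable scale.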
Implementation in Python
from scipy.spatial import distance
dst = distance.euclidean(x,y)
print('Euclidean distance: %.3f' % dst)
Euclidean distance: 3.273
Manhattan Distance
Different from the Euclidean distance is the Manhattan distance, also called ‘cityblock’ distance, which measures the distance from one vector to another along axis-aligned steps. You can imagine this metric as a way to compute the distance between two points when you are not able to go through buildings.
We calculate the Manhattan distance as follows:

$$ d(x, y) = \sum_{i=1}^{n} |x_i - y_i| $$
The green line gives you the Euclidean distance, while the purple line gives you the Manhattan distance.

[Figure: a straight green line (Euclidean) and a stepped purple path (Manhattan) between the same two points. Source: Quora]
In many ML applications Euclidean distance is the metric of choice. However, for high-dimensional data Manhattan distance is preferable as it yields more robust results.
Implementation in Python
from scipy.spatial import distance
dst = distance.cityblock(x,y)
print('Manhattan distance: %.3f' % dst)
Manhattan distance: 10.468