Calculate Similarity — the most relevant Metrics in a Nutshell

A survey of how similarity is defined and computed

Zhang Zhibin 张芷彬

Many data science techniques are based on measuring similarity and dissimilarity between objects. For example, K-Nearest-Neighbors uses similarity to classify new data objects. In unsupervised learning, K-Means is a clustering method that uses Euclidean distance to compute the distance between each cluster centroid and its assigned data points. Recommendation engines use neighborhood-based collaborative filtering methods, which identify an individual’s neighbors based on similarity or dissimilarity to other users.

I will take a look at the most relevant similarity metrics in practice. Measuring similarity between objects can be performed in a number of ways.

Generally we can divide similarity metrics into two different groups:

  1. Similarity Based Metrics:
  • Pearson’s correlation
  • Spearman’s correlation
  • Kendall’s Tau
  • Cosine similarity
  • Jaccard similarity

  2. Distance Based Metrics:
  • Euclidean distance
  • Manhattan distance

Similarity Based Metrics

  • Similarity-based methods treat the objects with the highest scores as the most similar, since a high score implies that the objects lie in closer neighborhoods.

Pearson’s Correlation

  • Correlation is a technique for investigating the relationship between two quantitative, continuous variables, for example, age and blood pressure. Pearson’s correlation coefficient measures the strength and direction of a linear relationship. For the vectors x and y of length n, it is calculated as:

$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

  • where

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

  • Pearson’s correlation takes values from -1 to +1. A value of exactly +1 or -1 requires a perfectly linear relationship; variables that merely increase or decrease together, without doing so linearly, will not reach +1 or -1.

Code example:

import numpy as np
from scipy.stats import pearsonr
import matplotlib.pyplot as plt

# seed random number generator
np.random.seed(42)
# prepare data
x = np.random.randn(15)
y = x + np.random.randn(15)

# plot x and y
plt.scatter(x, y)
# overlay the least-squares regression line
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
plt.xlabel('x')
plt.ylabel('y')
plt.show()

# calculate Pearson's correlation
corr, _ = pearsonr(x, y)
print('Pearsons correlation: %.3f' % corr)

Pearsons correlation: 0.810
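
To make the range statement concrete, here is a minimal sketch that computes Pearson’s correlation directly from its definition and applies it to an exactly linear and a merely increasing relationship. The helper pearson_manual and the variable t are illustrative names, not part of the original example.

```python
import numpy as np
from scipy.stats import pearsonr

def pearson_manual(a, b):
    """Pearson's r from the definition (hypothetical helper for illustration)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    da = a - a.mean()  # deviations from the mean of a
    db = b - b.mean()  # deviations from the mean of b
    return np.sum(da * db) / np.sqrt(np.sum(da ** 2) * np.sum(db ** 2))

t = np.arange(1.0, 11.0)

r_linear = pearson_manual(t, 3.0 * t + 2.0)  # exactly linear relationship
r_cubic = pearson_manual(t, t ** 3)          # increasing, but not linear
r_scipy, _ = pearsonr(t, t ** 3)             # cross-check against SciPy

print('linear: %.3f' % r_linear)
print('cubic (manual): %.3f' % r_cubic)
print('cubic (scipy): %.3f' % r_scipy)
```

The linear case reaches 1, while the cubic case stays below 1 even though y always increases with x.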

Spearman’s Correlation

Spearman’s correlation is a non-parametric statistic, that is, a statistic whose distribution does not depend on parameters (statistics that follow normal or binomial distributions are examples of parametric statistics). Very often, non-parametric statistics rank the data instead of using the original values. This is true for Spearman’s correlation coefficient, which is calculated similarly to Pearson’s correlation. The difference between the two metrics is that Spearman’s correlation uses the rank of each value.

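To calculate Spearman’s correlation, we first map each data value to its rank. If the raw data are [0, -5, 4, 7], the ranked values are [2, 1, 3, 4]. Spearman’s correlation is then Pearson’s correlation computed on the ranks:

$$\rho = \frac{\sum_{i=1}^{n} (R(x_i) - \bar{R}_x)(R(y_i) - \bar{R}_y)}{\sqrt{\sum_{i=1}^{n} (R(x_i) - \bar{R}_x)^2}\,\sqrt{\sum_{i=1}^{n} (R(y_i) - \bar{R}_y)^2}}$$

where R(x_i) and R(y_i) are the ranks of x_i and y_i, and \bar{R}_x and \bar{R}_y are the mean ranks.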

Spearman’s correlation measures monotonic relationships, so it can report a perfect relationship even when that relationship is not linear. It takes values from -1 to +1. The following plot clarifies the difference between Pearson’s and Spearman’s correlation.

[Figure: Pearson’s versus Spearman’s correlation for the same data. Source: Wikipedia]

For data exploration, I recommend calculating both Pearson’s and Spearman’s correlation, since comparing the two can lead to interesting findings. If Spearman’s correlation S is larger than Pearson’s correlation P (as shown above), the relationship is monotonic but not linear. Since linearity simplifies fitting a regression algorithm to the dataset, we might want to transform the non-linear, monotonic data, for example with a log transformation, so that the relationship becomes approximately linear.
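
As a rough illustration of the S>P case, the sketch below uses synthetic data with an exponential relationship (an assumption chosen for this example; the names u and v are illustrative). It compares both coefficients before and after a log transformation.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

u = np.linspace(1.0, 10.0, 20)
v = np.exp(u)  # strictly increasing, but strongly non-linear

p_raw, _ = pearsonr(u, v)
s_raw, _ = spearmanr(u, v)
p_log, _ = pearsonr(u, np.log(v))  # the log transform undoes the exponential

print('Pearson (raw): %.3f' % p_raw)
print('Spearman (raw): %.3f' % s_raw)
print('Pearson (after log transform): %.3f' % p_log)
```

Spearman’s correlation is 1 on the raw data because the relationship is monotonic, while Pearson’s correlation only reaches 1 after the transformation makes the relationship linear.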

Code example:

from scipy.stats import spearmanr
# calculate Spearman's correlation
corr, _ = spearmanr(x, y)
print('Spearmans correlation: %.3f' % corr)

Spearmans correlation: 0.836
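
The ranking step described above can be sketched with scipy.stats.rankdata. The snippet below reproduces the ranks of [0, -5, 4, 7] and checks that Pearson’s correlation on the ranks matches spearmanr. The arrays a and b are made-up illustration data.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

raw = np.array([0, -5, 4, 7])
ranks = rankdata(raw)  # the smallest value receives rank 1
print(ranks)           # [2. 1. 3. 4.]

# made-up data for illustration
a = np.array([0.0, -5.0, 4.0, 7.0, 1.5])
b = np.array([1.0, -2.0, 30.0, 8.0, 2.0])

rho_from_ranks, _ = pearsonr(rankdata(a), rankdata(b))  # Pearson on the ranks
rho_scipy, _ = spearmanr(a, b)                          # SciPy's Spearman

print('Pearson on ranks: %.3f' % rho_from_ranks)
print('spearmanr: %.3f' % rho_scipy)
```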

Cosine Similarity

The cosine similarity calculates the cosine of the angle between two vectors. For the vectors x and y, we use the following formula:

$$\cos(\theta) = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$

[Figure: the cosine function. On the left, red vectors point at different angles; the graph on the right shows the resulting function.]

Accordingly, the cosine similarity takes values between -1 and +1. If the vectors point in exactly the same direction, the cosine similarity is +1. If they are perpendicular, it is 0. If they point in opposite directions, it is -1.

The cosine similarity is very popular in text analysis. It is used to determine how similar documents are to one another irrespective of their size. The TF-IDF text analysis technique helps convert documents into vectors, where each value in the vector corresponds to the TF-IDF score of a word in the document. Each word has its own axis, and the cosine similarity then determines how similar the documents are.
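
A minimal sketch of the text-analysis use case, assuming scikit-learn’s TfidfVectorizer with its default settings and three made-up documents. The second document is the first one repeated, so it is twice as long but points in the same direction.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# made-up documents for illustration
docs = [
    "the cat sat on the mat",
    "the cat sat on the mat the cat sat on the mat",  # first document repeated
    "stocks fell sharply on monday",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # one TF-IDF row per document
sim = cosine_similarity(tfidf)                 # 3x3 matrix of pairwise similarities

print(sim.round(3))
```

The first two documents come out as maximally similar despite their different lengths, while the third document, which shares only one word with the first, scores close to 0.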

Implementation in Python

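cosine_similarity expects 2-D arrays with one row per sample, so we reshape the vectors x and y using .reshape(1, -1) to compute the cosine similarity for a single sample. The result is a 1×1 matrix, from which we take the single entry.

from sklearn.metrics.pairwise import cosine_similarity
cos_sim = cosine_similarity(x.reshape(1, -1), y.reshape(1, -1))
# cos_sim has shape (1, 1); index the scalar before formatting
print('Cosine similarity: %.3f' % cos_sim[0, 0])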

Cosine similarity: 0.773
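
The boundary values mentioned above can be checked with a small NumPy sketch. The helper cosine_sim and the example vectors are illustrative assumptions, not part of the original code.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between a and b (hypothetical helper; assumes non-zero vectors)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

base = np.array([1.0, 2.0, 0.0])

same_direction = cosine_sim(base, 10.0 * base)  # longer vector, same direction
opposite = cosine_sim(base, -base)              # opposite direction
perpendicular = cosine_sim([1.0, 0.0], [0.0, 1.0])

print('same direction: %.3f' % same_direction)
print('opposite: %.3f' % opposite)
print('perpendicular: %.3f' % perpendicular)
```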

Jaccard Similarity

Cosine similarity compares two real-valued vectors, whereas Jaccard similarity compares two binary vectors (sets).

In set theory it is often helpful to see a visualization of the formula. For two sets A and B:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

[Figure: visualization of the Jaccard formula.]

We can see that the Jaccard similarity divides the size of the intersection by the size of the union of the sample sets.

Both cosine similarity and Jaccard similarity are common metrics for calculating text similarity. Calculating the Jaccard similarity is computationally more expensive, as it matches all the terms of one document against those of another document. The Jaccard similarity is useful for detecting duplicates.

Implementation in Python

from sklearn.metrics import jaccard_score
A = [1, 1, 1, 0]
B = [1, 1, 0, 1]
jacc = jaccard_score(A,B)
print('Jaccard similarity: %.3f' % jacc)

Jaccard similarity: 0.500
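
The same quantity can be sketched directly on Python sets, which is closer to the set-theoretic definition. The helper jaccard, its convention for two empty sets, and the two sample sentences are assumptions made for this illustration.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union (hypothetical helper)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention assumed here: two empty sets count as identical
    return len(a & b) / len(union)

# made-up sentences for illustration
doc1 = "the quick brown fox".split()
doc2 = "the lazy brown dog".split()

# shared words: {the, brown}; all words: {the, quick, brown, fox, lazy, dog}
text_sim = jaccard(doc1, doc2)

# the binary vectors A and B above, written as sets of positions holding a 1
binary_sim = jaccard({0, 1, 2}, {0, 1, 3})

print('Jaccard (sentences): %.3f' % text_sim)
print('Jaccard (binary vectors as sets): %.3f' % binary_sim)
```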

Distance Based Metrics

Distance-based methods treat the objects with the lowest values as the most similar.

Euclidean Distance

The Euclidean distance is the straight-line distance between two vectors.

For the two vectors x and y, this can be computed as follows:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Compared to the cosine and Jaccard similarity, Euclidean distance is not used very often in NLP applications. It is appropriate for continuous numerical variables. Euclidean distance is not scale invariant, so scaling the data before computing the distance is recommended. Additionally, Euclidean distance multiplies the effect of redundant information in the dataset: if we have five heavily correlated variables and take all five as input, the redundant information is effectively weighted by five.

Implementation in Python

from scipy.spatial import distance
dst = distance.euclidean(x,y)
print('Euclidean distance: %.3f' % dst)

Euclidean distance: 3.273
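
To illustrate why scaling matters, the sketch below uses made-up data with two features on very different scales (height in metres, income in dollars) and standardizes each column before measuring distances. The helper euclidean and the z-score scaling are choices made for this example; other scaling schemes are possible.

```python
import numpy as np

# made-up data: columns are height in metres and income in dollars
people = np.array([
    [1.60, 30000.0],
    [1.95, 30400.0],
    [1.62, 30600.0],
])

def euclidean(a, b):
    """Straight-line distance between two vectors (hypothetical helper)."""
    return np.sqrt(np.sum((a - b) ** 2))

# raw distances are dominated by the income column
raw_01 = euclidean(people[0], people[1])
raw_02 = euclidean(people[0], people[2])

# z-score standardization: every column gets mean 0 and standard deviation 1
scaled = (people - people.mean(axis=0)) / people.std(axis=0)
std_01 = euclidean(scaled[0], scaled[1])
std_02 = euclidean(scaled[0], scaled[2])

print('raw:    d(0,1) = %.3f, d(0,2) = %.3f' % (raw_01, raw_02))
print('scaled: d(0,1) = %.3f, d(0,2) = %.3f' % (std_01, std_02))
```

On the raw data, person 1 is the nearest neighbor of person 0 because only income matters. After scaling, the large height difference counts as well and person 2 becomes the nearest neighbor.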

Manhattan Distance

Unlike the Euclidean distance, the Manhattan distance, also called ‘cityblock’ distance, measures the distance from one vector to another along the coordinate axes. You can think of it as the distance between two points when you cannot go through buildings.

We calculate the Manhattan distance as follows:

$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

[Figure: the green line gives the Euclidean distance, while the purple line gives the Manhattan distance. Source: Quora]

In many ML applications Euclidean distance is the metric of choice. However, for high-dimensional data Manhattan distance is often preferable, as it tends to yield more robust results.

Implementation in Python

from scipy.spatial import distance
dst = distance.cityblock(x,y)
print('Manhattan distance: %.3f' % dst)

Manhattan distance: 10.468
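
A small worked example, using two made-up points p and q, shows the difference between the two distances: walking along the axes from (0, 0) to (3, 4) covers 3 + 4 = 7 units, while the straight line is 5 units long.

```python
import numpy as np
from scipy.spatial import distance

# made-up points for illustration
p = np.array([0.0, 0.0])
q = np.array([3.0, 4.0])

manhattan = np.sum(np.abs(p - q))        # |0 - 3| + |0 - 4| = 7
euclid = np.sqrt(np.sum((p - q) ** 2))   # sqrt(9 + 16) = 5
manhattan_scipy = distance.cityblock(p, q)

print('Manhattan: %.1f' % manhattan)
print('Euclidean: %.1f' % euclid)
print('Manhattan (scipy): %.1f' % manhattan_scipy)
```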
