Calculate Similarity — the most relevant Metrics in a Nutshell

—— a survey of how similarity is defined and computed

Zhang Zhibin 张芷彬

Many data science techniques are based on measuring similarity and dissimilarity between objects. For example, K-Nearest-Neighbors uses similarity to classify new data objects. In unsupervised learning, K-Means is a clustering method which uses the Euclidean distance to compute the distance between the cluster centroids and their assigned data points. Recommendation engines use neighborhood-based collaborative filtering methods, which identify an individual’s neighbors based on their similarity/dissimilarity to the other users.

I will take a look at the most relevant similarity metrics in practice. Measuring similarity between objects can be performed in a number of ways.

Generally we can divide similarity metrics into two different groups:

1. Similarity Based Metrics:
  • Pearson’s correlation
  • Spearman’s correlation
  • Kendall’s Tau
  • Cosine similarity
  • Jaccard similarity

2. Distance Based Metrics:
  • Euclidean distance
  • Manhattan distance

Similarity Based Metrics

Similarity-based methods determine the most similar objects as those with the highest values, since higher values imply that the objects live in closer neighborhoods.

Pearson’s Correlation

Correlation is a technique for investigating the relationship between two quantitative, continuous variables, for example, age and blood pressure. Pearson’s correlation coefficient is a measure of the strength and direction of a linear relationship. We calculate this metric for the vectors x and y in the following way:

r = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ (xᵢ − x̄)² · Σᵢ (yᵢ − ȳ)² )

where x̄ and ȳ are the means of x and y.

Pearson’s correlation can take a range of values from -1 to +1. It reaches exactly +1 or -1 only when an increase or decrease in one variable is directly, linearly related to the other; a merely monotonic relationship will not lead to a Pearson’s correlation of 1 or -1.

Implementation in Python

import numpy as np
from scipy.stats import pearsonr
import matplotlib.pyplot as plt

# seed random number generator
np.random.seed(42)

# prepare data
x = np.random.randn(15)
y = x + np.random.randn(15)

# plot x and y with a fitted regression line
plt.scatter(x, y)
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
plt.xlabel('x')
plt.ylabel('y')
plt.show()

# calculate Pearson's correlation
corr, _ = pearsonr(x, y)
print('Pearsons correlation: %.3f' % corr)

Pearsons correlation: 0.810
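As a sanity check, Pearson’s r can also be computed directly from its definition with NumPy (a minimal sketch that regenerates the same x and y as above; the variable names are illustrative):

```python
import numpy as np
from scipy.stats import pearsonr

# regenerate the same data as above
np.random.seed(42)
x = np.random.randn(15)
y = x + np.random.randn(15)

# Pearson's r from the definition: the dot product of the centered
# vectors divided by the product of their norms
xc, yc = x - x.mean(), y - y.mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

r_scipy, _ = pearsonr(x, y)
print('manual: %.3f, scipy: %.3f' % (r_manual, r_scipy))
```

Both values agree, confirming the closed form above is what `pearsonr` computes.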

Spearman’s Correlation

Spearman’s correlation is what is known as a non-parametric statistic, that is, a statistic whose distribution does not depend on parameters (statistics that follow normal distributions or binomial distributions are examples of parametric statistics). Very often, non-parametric statistics rank the data instead of taking the original values. This is true for Spearman’s correlation coefficient, which is calculated similarly to Pearson’s correlation; the difference is that Spearman’s correlation uses the rank of each value.

To calculate Spearman’s correlation we first need to map each of our data to ranked data values:

If the raw data are [0, -5, 4, 7], the ranked values will be [2, 1, 3, 4]. We can calculate Spearman’s correlation in the following way:

rₛ = 1 − (6 Σᵢ dᵢ²) / (n(n² − 1))

where dᵢ is the difference between the ranks of the i-th pair of observations and n is the number of observations (this closed form assumes no tied ranks).
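The rank mapping can be reproduced with `scipy.stats.rankdata` (a quick check of the example above):

```python
from scipy.stats import rankdata

raw = [0, -5, 4, 7]
ranks = rankdata(raw)  # average ranks; here there are no ties
print(ranks)  # [2. 1. 3. 4.]
```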

Spearman’s correlation benchmarks monotonic relationships, therefore it can be perfect for relationships that are monotonic but not linear. It can take a range of values from -1 to +1. The following plot clarifies the difference between Pearson’s and Spearman’s correlation.

[Figure: Pearson’s vs. Spearman’s correlation on a monotonic, non-linear relationship. Source: Wikipedia]

For data exploration, I recommend calculating both Pearson’s and Spearman’s correlation. The comparison of the two can result in interesting findings. If S > P (as shown above), it means that we have a monotonic but not a linear relationship. Since linearity simplifies the process of fitting a regression algorithm to the dataset, we might want to apply a log transformation to the non-linear, monotonic data so that it appears linear.
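A minimal sketch of this idea on synthetic exponential data (the variable names are illustrative): the relationship is monotonic but not linear, so S > P, and a log transformation makes Pearson’s correlation perfect as well.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# monotonic but strongly non-linear synthetic data
x_mono = np.linspace(1, 10, 50)
y_mono = np.exp(x_mono)

p_raw, _ = pearsonr(x_mono, y_mono)
s_raw, _ = spearmanr(x_mono, y_mono)
p_log, _ = pearsonr(x_mono, np.log(y_mono))  # linearize via log

print('Pearson (raw): %.3f' % p_raw)   # noticeably below 1
print('Spearman:      %.3f' % s_raw)   # 1.000, perfectly monotonic
print('Pearson (log): %.3f' % p_log)   # 1.000, linear after transform
```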

Implementation in Python

from scipy.stats import spearmanr

# calculate Spearman's correlation
corr, _ = spearmanr(x, y)
print('Spearmans correlation: %.3f' % corr)

Spearmans correlation: 0.836
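As a cross-check, with no tied values Spearman’s coefficient matches the closed form applied to the rank differences (a minimal sketch that regenerates the same x and y as above):

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

np.random.seed(42)
x = np.random.randn(15)
y = x + np.random.randn(15)

# rank both vectors, then apply 1 - 6*sum(d^2) / (n(n^2 - 1))
d = rankdata(x) - rankdata(y)
n = len(x)
rs_formula = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))

rs_scipy, _ = spearmanr(x, y)
print('formula: %.3f, scipy: %.3f' % (rs_formula, rs_scipy))
```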

Cosine Similarity

The cosine similarity calculates the cosine of the angle between two vectors. In order to calculate the cosine similarity we use the following formula:

cos(θ) = (x · y) / (‖x‖ ‖y‖)

Recall the cosine function: on the left the red vectors point at different angles and the graph on the right shows the resulting function.

Accordingly, the cosine similarity can take on values between -1 and +1. If the vectors point in the exact same direction, the cosine similarity is +1. If the vectors point in opposite directions, the cosine similarity is -1.

The cosine similarity is very popular in text analysis. It is used to determine how similar documents are to one another irrespective of their size. The TF-IDF text analysis technique helps convert documents into vectors, where each value in the vector corresponds to the TF-IDF score of a word in the document. Each word has its own axis; the cosine similarity then determines how similar the documents are.

Implementation in Python

We need to reshape the vectors x and y using .reshape(1, -1) to compute the cosine similarity for a single sample.

from sklearn.metrics.pairwise import cosine_similarity

# index [0][0] extracts the scalar from the returned 1x1 matrix
cos_sim = cosine_similarity(x.reshape(1, -1), y.reshape(1, -1))[0][0]
print('Cosine similarity: %.3f' % cos_sim)

Cosine similarity: 0.773
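The TF-IDF workflow described above can be sketched with scikit-learn’s `TfidfVectorizer` (the toy documents are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "machine learning with python",
]

# one TF-IDF vector per document, one axis per word
tfidf = TfidfVectorizer().fit_transform(docs)

# pairwise cosine similarity between all document vectors
sim = cosine_similarity(tfidf)
print(sim.round(3))
```

The first two documents share most of their words and score much higher against each other than against the third, irrespective of document length.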

Jaccard Similarity

While cosine similarity is used to compare two real-valued vectors, Jaccard similarity is used to compare two binary vectors (sets).

In set theory it is often helpful to see a visualization of the formula:

J(A, B) = |A ∩ B| / |A ∪ B|

We can see that the Jaccard similarity divides the size of the intersection by the size of the union of the sample sets.

Both cosine similarity and Jaccard similarity are common metrics for calculating text similarity. Calculating the Jaccard similarity is computationally more expensive, as it matches all the terms of one document against another document. The Jaccard similarity turns out to be useful for detecting duplicates.

Implementation in Python

from sklearn.metrics import jaccard_score

A = [1, 1, 1, 0]
B = [1, 1, 0, 1]
jacc = jaccard_score(A, B)
print('Jaccard similarity: %.3f' % jacc)

Jaccard similarity: 0.500
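For the duplicate-detection use case, the same metric can be computed directly on word sets instead of binarized vectors (a minimal sketch; the sentences are illustrative):

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)

doc1 = set("the quick brown fox jumps over the lazy dog".split())
doc2 = set("the quick brown fox jumped over a lazy dog".split())

print('Jaccard similarity: %.3f' % jaccard(doc1, doc2))  # 0.700
```

A score close to 1 flags the pair as a likely near-duplicate.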

Distance Based Metrics

Distance-based methods determine the most similar objects as those with the lowest values, since smaller distances imply closer neighborhoods.

Euclidean Distance

The Euclidean distance is a straight-line distance between two vectors.

For the two vectors x and y, this can be computed as follows:

d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )

Compared to the cosine and Jaccard similarity, Euclidean distance is not used very often in the context of NLP applications. It is appropriate for continuous numerical variables. Euclidean distance is not scale invariant, therefore scaling the data prior to computing the distance is recommended. Additionally, Euclidean distance multiplies the effect of redundant information in the dataset: if we have five variables that are heavily correlated and take all five as input, we weight this redundancy effect by five.

Implementation in Python

from scipy.spatial import distance

dst = distance.euclidean(x, y)
print('Euclidean distance: %.3f' % dst)

Euclidean distance: 3.273
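The lack of scale invariance can be seen in a small sketch (the feature values are illustrative): with raw features, the large-scale income column dominates the distance, while standardizing first lets both features contribute comparably.

```python
import numpy as np
from scipy.spatial import distance
from sklearn.preprocessing import StandardScaler

# two features on very different scales: age (years), income (dollars)
data = np.array([[25.0, 50_000.0],
                 [26.0, 90_000.0],
                 [60.0, 52_000.0]])

# raw distances: income differences dominate completely
print(distance.euclidean(data[0], data[1]))  # income gap drives this
print(distance.euclidean(data[0], data[2]))  # the age gap barely registers

# after standardization each feature has zero mean and unit variance
scaled = StandardScaler().fit_transform(data)
print(distance.euclidean(scaled[0], scaled[1]))
print(distance.euclidean(scaled[0], scaled[2]))
```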

Manhattan Distance

Different from the Euclidean distance is the Manhattan distance, also called ‘cityblock’ distance, from one vector to another. You can imagine this metric as a way to compute the distance between two points when you are not able to go through buildings.

We calculate the Manhattan distance as follows:

d(x, y) = Σᵢ |xᵢ − yᵢ|

The green line gives you the Euclidean distance, while the purple line gives you the Manhattan distance.

Source: Quora

In many ML applications Euclidean distance is the metric of choice. However, for high dimensional data Manhattan distance is preferable as it yields more robust results.

Implementation in Python

from scipy.spatial import distance

dst = distance.cityblock(x, y)
print('Manhattan distance: %.3f' % dst)

Manhattan distance: 10.468
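Both distances follow directly from their formulas; a quick NumPy cross-check against scipy (regenerating the same x and y as above):

```python
import numpy as np
from scipy.spatial import distance

np.random.seed(42)
x = np.random.randn(15)
y = x + np.random.randn(15)

diff = x - y
manhattan = np.abs(diff).sum()          # sum of absolute differences
euclidean = np.sqrt((diff ** 2).sum())  # root of sum of squared differences

print('Manhattan: %.3f vs %.3f' % (manhattan, distance.cityblock(x, y)))
print('Euclidean: %.3f vs %.3f' % (euclidean, distance.euclidean(x, y)))
```

Note that for the same pair of vectors, the Manhattan distance is always at least as large as the Euclidean distance.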
