Calculate Similarity — the most relevant Metrics in a Nutshell
— a survey of similarity definitions and computation
Zhang Zhibin 张芷彬
Many data science techniques are based on measuring similarity and dissimilarity between objects. For example, K-Nearest-Neighbors uses similarity to classify new data objects. In unsupervised learning, K-Means is a clustering method which uses Euclidean distance to compute the distance between the cluster centroids and its assigned data points. Recommendation engines use neighborhood-based collaborative filtering methods which identify an individual’s neighbors based on the similarity/dissimilarity to the other users.
I will take a look at the most relevant similarity metrics in practice. Measuring similarity between objects can be performed in a number of ways.
Generally we can divide similarity metrics into two different groups:
1. Similarity Based Metrics:
- Pearson’s correlation
- Spearman’s correlation
- Kendall’s Tau
- Cosine similarity
- Jaccard similarity
2. Distance Based Metrics:
- Euclidean distance
- Manhattan distance
Similarity Based Metrics
Similarity-based methods assign the highest values to the most similar objects, implying that they lie in closer neighborhoods.
Pearson’s Correlation
Correlation is a technique for investigating the relationship between two quantitative, continuous variables, for example, age and blood pressure. Pearson’s correlation coefficient is a measure of the strength and direction of a linear relationship. We calculate this metric for the vectors x and y in the following way:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

where

$$ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i $$
The Pearson’s correlation can take a range of values from -1 to +1. It reaches exactly +1 or -1 only when the relationship is perfectly linear; an increase or decrease that is merely monotonic is not enough.
Implementation in Python
import numpy as np
from scipy.stats import pearsonr
import matplotlib.pyplot as plt

# seed the random number generator
np.random.seed(42)

# prepare data
x = np.random.randn(15)
y = x + np.random.randn(15)

# plot x and y with a fitted regression line
plt.scatter(x, y)
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
plt.xlabel('x')
plt.ylabel('y')
plt.show()

# calculate Pearson's correlation
corr, _ = pearsonr(x, y)
print('Pearsons correlation: %.3f' % corr)
Pearsons correlation: 0.810
Spearman’s Correlation
Spearman’s correlation is what is known as a non-parametric statistic, i.e. a statistic whose distribution does not depend on parameters (statistics that follow normal or binomial distributions are examples of parametric statistics). Very often, non-parametric statistics rank the data instead of taking the original values. This is true for Spearman’s correlation coefficient, which is calculated similarly to Pearson’s correlation. The difference between these metrics is that Spearman’s correlation uses the rank of each value.
To calculate Spearman’s correlation we first need to map each of our data values to ranked data values:

If the raw data are [0, -5, 4, 7], the ranked values will be [2, 1, 3, 4]. We can calculate Spearman’s correlation (assuming no tied ranks) in the following way:

$$ r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} $$

where

$$ d_i = \operatorname{rank}(x_i) - \operatorname{rank}(y_i) $$

is the difference between the ranks of corresponding values and $n$ is the number of observations.
Spearman’s correlation benchmarks monotonic relationships, therefore it can be perfect for relationships that are not linear. It can take a range of values from -1 to +1. The following plot clarifies the difference between Pearson’s and Spearman’s correlation.

[Figure: a monotonic but non-linear dataset, for which Spearman’s correlation is higher than Pearson’s. Source: Wikipedia]
For data exploration, I recommend calculating both Pearson’s and Spearman’s correlation. The comparison of both can result in interesting findings. If S>P (as shown above), it means that we have a monotonic relationship, not a linear relationship. Since linearity simplifies the process of fitting a regression algorithm to the dataset, we might want to modify the non-linear, monotonic data using log-transformation to appear linear.
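The log-transformation trick can be illustrated on a small synthetic example (the exponential relationship below is a made-up toy dataset, chosen only for illustration): Spearman’s correlation is perfect on the monotonic raw data, while Pearson’s only becomes perfect after the transform.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# toy monotonic, non-linear data: y grows exponentially with x
x = np.linspace(1, 10, 50)
y = np.exp(x)

p_raw, _ = pearsonr(x, y)          # well below 1: relationship is not linear
s_raw, _ = spearmanr(x, y)         # exactly 1: perfectly monotonic (S > P)
p_log, _ = pearsonr(x, np.log(y))  # log-transform linearizes the data

print('Pearson raw: %.3f  Spearman raw: %.3f  Pearson after log: %.3f'
      % (p_raw, s_raw, p_log))
```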
Implementation in Python
from scipy.stats import spearmanr
# calculate Spearman's correlation
corr, _ = spearmanr(x, y)
print('Spearmans correlation: %.3f' % corr)
Spearmans correlation: 0.836
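The ranking step underlying this value can be made explicit. A minimal sketch using SciPy’s rankdata (the small arrays are invented examples) reproduces the rank mapping above and checks the no-ties formula against scipy:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

raw = np.array([0, -5, 4, 7])
print(rankdata(raw))  # ranks: [2. 1. 3. 4.]

# with no tied values, Spearman's correlation equals
# 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
d = rankdata(x) - rankdata(y)
n = len(x)
manual = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

corr, _ = spearmanr(x, y)
print(manual, corr)  # both 0.8
```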
Cosine Similarity
The cosine similarity calculates the cosine of the angle between two vectors. In order to calculate the cosine similarity we use the following formula:

$$ \cos(\theta) = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} $$
Recall the cosine function: on the left the red vectors point at different angles and the graph on the right shows the resulting function.

[Figure: red vectors at various angles (left) and the corresponding cosine curve (right)]
Accordingly, the cosine similarity can take on values between -1 and +1. If the vectors point in the exact same direction, the cosine similarity is +1. If the vectors point in opposite directions, the cosine similarity is -1.

The cosine similarity is very popular in text analysis. It is used to determine how similar documents are to one another irrespective of their size. The TF-IDF text analysis technique helps convert the documents into vectors where each value in the vector corresponds to the TF-IDF score of a word in the document. Each word has its own axis, and the cosine similarity then determines how similar the documents are.
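As a hedged sketch of this pipeline (the three toy documents are invented for illustration), scikit-learn’s TfidfVectorizer can be combined with cosine_similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "quantum computing uses qubits",
]

# one TF-IDF vector per document; each axis corresponds to one word
tfidf = TfidfVectorizer().fit_transform(docs)

# pairwise cosine similarities between the document vectors
sim = cosine_similarity(tfidf)
print(sim.round(2))
```

The first two documents share most of their words, so their similarity is high; the third shares no words with the first, so their similarity is exactly zero.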
Implementation in Python
We need to reshape the vectors x and y using .reshape(1, -1) to compute the cosine similarity for a single sample.
from sklearn.metrics.pairwise import cosine_similarity
cos_sim = cosine_similarity(x.reshape(1,-1),y.reshape(1,-1))
print('Cosine similarity: %.3f' % cos_sim)
Cosine similarity: 0.773
Jaccard Similarity
Cosine similarity is for comparing two real-valued vectors, while Jaccard similarity is for comparing two binary vectors (sets):

$$ J(A, B) = \frac{|A \cap B|}{|A \cup B|} $$

In set theory it is often helpful to see a visualization of the formula:

[Figure: Venn diagram of two overlapping sets, with the intersection shown relative to the union]
We can see that the Jaccard similarity divides the size of the intersection by the size of the union of the sample sets.
Both cosine similarity and Jaccard similarity are common metrics for calculating text similarity. Calculating the Jaccard similarity is computationally more expensive, as it matches all the terms of one document against those of another document. The Jaccard similarity turns out to be useful for detecting duplicates.
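A minimal duplicate-detection sketch over word sets (the sentences below are made up; a real system would tokenize more carefully):

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)

doc1 = set("the quick brown fox".split())
doc2 = set("the quick brown dog".split())  # near-duplicate of doc1
doc3 = set("completely unrelated words".split())

print(jaccard(doc1, doc2))  # 3 shared words out of 5 distinct -> 0.6
print(jaccard(doc1, doc3))  # no overlap -> 0.0
```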
Implementation in Python
from sklearn.metrics import jaccard_score
A = [1, 1, 1, 0]
B = [1, 1, 0, 1]
jacc = jaccard_score(A, B)
print('Jaccard similarity: %.3f' % jacc)
Jaccard similarity: 0.500
Distance Based Metrics
Distance-based methods identify the most similar objects as those with the lowest distance values.
Euclidean Distance
The Euclidean distance is a straight-line distance between two vectors.
For the two vectors x and y, this can be computed as follows:

$$ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $$
Compared to cosine and Jaccard similarity, Euclidean distance is not used very often in the context of NLP applications. It is appropriate for continuous numerical variables. Euclidean distance is not scale invariant, therefore it is recommended to scale the data prior to computing the distance. Additionally, Euclidean distance multiplies the effect of redundant information in the dataset: if we have five variables which are heavily correlated and take all five as input, we weight this redundancy effect by five.
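The scale-invariance point can be demonstrated with a made-up two-feature dataset in which income (dollars) dwarfs age (years):

```python
import numpy as np
from scipy.spatial import distance
from sklearn.preprocessing import StandardScaler

# made-up data: [income in dollars, age in years]
X = np.array([[30000.0, 25.0],
              [32000.0, 60.0],
              [70000.0, 26.0]])

# raw distance: the income column dominates, the 35-year age gap barely matters
d_raw = distance.euclidean(X[0], X[1])

# standardize each column to zero mean and unit variance first
X_std = StandardScaler().fit_transform(X)
d_std = distance.euclidean(X_std[0], X_std[1])

print('raw: %.1f  scaled: %.3f' % (d_raw, d_std))
```

On the raw data the distance is essentially the income difference (about 2000), while after standardization both features contribute on a comparable scale.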
Implementation in Python
from scipy.spatial import distance
dst = distance.euclidean(x,y)
print('Euclidean distance: %.3f' % dst)
Euclidean distance: 3.273
Manhattan Distance
Different from the Euclidean distance is the Manhattan distance, also called ‘cityblock’ distance, which measures the distance from one vector to another along axis-aligned steps. You can imagine this metric as a way to compute the distance between two points when you are not able to go through buildings.
We calculate the Manhattan distance as follows:

$$ d(x, y) = \sum_{i=1}^{n} |x_i - y_i| $$
The green line gives you the Euclidean distance, while the purple line gives you the Manhattan distance.

[Figure: a straight green line (Euclidean) and a stepped purple path (Manhattan) between the same two points. Source: Quora]
In many ML applications Euclidean distance is the metric of choice. However, for high-dimensional data Manhattan distance is preferable as it yields more robust results.
Implementation in Python
from scipy.spatial import distance
dst = distance.cityblock(x,y)
print('Manhattan distance: %.3f' % dst)
Manhattan distance: 10.468