Chi Square Distance
The chi-squared distance d(x,y) is a distance between two histograms x=[x_1,...,x_n] and y=[y_1,...,y_n], each with n bins. Both histograms are assumed to be normalized, i.e., their entries sum to one.
The distance measure d is usually defined (although alternative definitions exist) as d(x,y) = sum( (x_i-y_i)^2 / (x_i+y_i) ) / 2, where the sum runs over the bins i = 1,...,n. It is often used in computer vision to compute distances between bag-of-visual-words representations of images.
The name of the distance is derived from Pearson's chi-squared test statistic X²(x,y) = sum( (x_i-y_i)^2 / x_i ) for comparing discrete probability distributions (i.e., histograms). However, unlike the test statistic, d(x,y) is symmetric with respect to x and y, which is often useful in practice, e.g., when you want to construct a kernel out of the histogram distances.
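The definition above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `chi2_distance` and the sample histograms are made up for the example, and skipping bins that are empty in both histograms (to avoid dividing by zero) is an assumption, not something the definition prescribes.

```python
import numpy as np

def chi2_distance(x, y):
    """Symmetric chi-squared distance between two normalized histograms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("histograms must have the same number of bins")
    denom = x + y
    # Assumption: bins that are empty in both histograms contribute nothing.
    mask = denom > 0
    return 0.5 * np.sum((x[mask] - y[mask]) ** 2 / denom[mask])

# Hypothetical 3-bin histograms, each summing to one.
x = np.array([0.2, 0.3, 0.5])
y = np.array([0.1, 0.4, 0.5])
print(chi2_distance(x, y))
print(chi2_distance(y, x))  # same value, since d is symmetric
```

With this definition, two normalized histograms with no overlapping mass are at distance 1, and identical histograms are at distance 0.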
Chi-Square Distance
Consider a frequency table with n rows and p columns. From it we can calculate row profiles and column profiles, and treat each profile as a point, giving n row points or p column points. We can then define distances between these points. The Euclidean distance between the components of the profiles, weighted so that each term is multiplied by the inverse of its frequency, is called the chi-square distance. The name comes from the fact that the mathematical expression defining the distance is identical to the one that appears in the chi-square goodness-of-fit test.
MATHEMATICAL ASPECTS
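Let f_ij denote the entry in the ith row and jth column of the frequency table. The chi-square distance between two rows i and i' is given by

$$d(i,i') = \sqrt{\sum_{j=1}^{p} \frac{1}{f_{\cdot j}} \left( \frac{f_{ij}}{f_{i\cdot}} - \frac{f_{i'j}}{f_{i'\cdot}} \right)^{2}}$$

where
- $f_{i\cdot}$ is the sum of the components of the ith row;
- $f_{\cdot j}$ is the sum of the components of the jth column;
- $\left[ f_{ij} / f_{i\cdot} \right]$ is the ith row profile for j = 1,2,...,p.

Likewise, the distance between two columns j and j' is given by

$$d(j,j') = \sqrt{\sum_{i=1}^{n} \frac{1}{f_{i\cdot}} \left( \frac{f_{ij}}{f_{\cdot j}} - \frac{f_{ij'}}{f_{\cdot j'}} \right)^{2}}$$

where $\left[ f_{ij} / f_{\cdot j} \right]$ is the jth column profile for i = 1,...,n.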
DOMAINS AND LIMITATIONS
The chi-square distance incorporates a weight that is inversely proportional to the total of each row (or column). This increases the importance of small deviations in rows (or columns) that have a small sum, relative to those that have a larger sum.
The chi-square distance has the property of distributional equivalence: the distances between rows (or columns) are unchanged when two columns (or two rows) with identical profiles are aggregated.
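Distributional equivalence can be checked numerically. The sketch below assumes NumPy; the helper `row_distances` and the small table are hypothetical, chosen so that the first two columns have identical profiles (the second is twice the first). Merging those two columns should leave all row-to-row distances unchanged.

```python
import numpy as np

def row_distances(table):
    """Pairwise chi-square distances between the rows of a frequency table."""
    f = np.asarray(table, dtype=float)
    f = f / f.sum()                          # relative frequencies
    row_tot = f.sum(axis=1, keepdims=True)   # f_i.
    col_tot = f.sum(axis=0)                  # f_.j
    profiles = f / row_tot                   # each row profile sums to 1
    # diff[i, k, j] = profile_i[j] - profile_k[j]
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.sqrt((diff ** 2 / col_tot).sum(axis=2))

# Hypothetical table: column 2 is exactly twice column 1,
# so the two columns have identical profiles.
table = np.array([[10, 20, 5],
                  [4, 8, 9],
                  [6, 12, 1]])

# Aggregate the two proportional columns into one.
merged = np.column_stack([table[:, 0] + table[:, 1], table[:, 2]])

before = row_distances(table)
after = row_distances(merged)
print(np.allclose(before, after))
```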
EXAMPLES
Consider a contingency table charting how satisfied employees working for three different businesses are. Let us establish a distance table using the chi-square distance.
Values for the studied variable X can fall into one of three categories:
- X 1: high satisfaction;
- X 2: medium satisfaction;
- X 3: low satisfaction.
The observations collected from samples of individuals from the three businesses are given below:
| | Business 1 | Business 2 | Business 3 | Total |
|---|---|---|---|---|
| X 1 | 20 | 55 | 30 | 105 |
| X 2 | 18 | 40 | 15 | 73 |
| X 3 | 12 | 5 | 5 | 22 |
| Total | 50 | 100 | 50 | 200 |
The relative frequency table is obtained by dividing all of the elements of the table by 200, the total number of observations:
| | Business 1 | Business 2 | Business 3 | Total |
|---|---|---|---|---|
| X 1 | 0.1 | 0.275 | 0.15 | 0.525 |
| X 2 | 0.09 | 0.2 | 0.075 | 0.365 |
| X 3 | 0.06 | 0.025 | 0.025 | 0.11 |
| Total | 0.25 | 0.5 | 0.25 | 1 |
We can calculate the difference in employee satisfaction between the three businesses. The column profile matrix, obtained by dividing each column by its total, is given below:
| | Business 1 | Business 2 | Business 3 | Total |
|---|---|---|---|---|
| X 1 | 0.4 | 0.55 | 0.6 | 1.55 |
| X 2 | 0.36 | 0.4 | 0.3 | 1.06 |
| X 3 | 0.24 | 0.05 | 0.1 | 0.39 |
| Total | 1 | 1 | 1 | 3 |
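As a quick check, the relative frequencies and column profiles can be recomputed from the raw counts. This is a minimal sketch assuming NumPy; the variable names are illustrative only.

```python
import numpy as np

# Rows: X 1, X 2, X 3; columns: Business 1, 2, 3 (counts from the example).
counts = np.array([[20, 55, 30],
                   [18, 40, 15],
                   [12, 5, 5]], dtype=float)

# Relative frequencies: divide every cell by the grand total (200).
rel = counts / counts.sum()

# Row totals of the relative frequencies (the "Total" column).
row_mass = rel.sum(axis=1)

# Column profiles: divide each column by its own total.
col_profiles = counts / counts.sum(axis=0, keepdims=True)

print(np.round(rel, 3))
print(np.round(row_mass, 3))
print(np.round(col_profiles, 2))
```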
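For example, the distance between Business 1 and Business 2, with each term weighted by the inverse of the corresponding row total of the relative frequency table, is

$$d(1,2) = \sqrt{\frac{(0.4-0.55)^2}{0.525} + \frac{(0.36-0.4)^2}{0.365} + \frac{(0.24-0.05)^2}{0.11}} = 0.613$$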
We can calculate d(1,3) and d(2,3) in a similar way. The distances obtained are summarized in the following distance table:
| | Business 1 | Business 2 | Business 3 |
|---|---|---|---|
| Business 1 | 0 | 0.613 | 0.514 |
| Business 2 | 0.613 | 0 | 0.234 |
| Business 3 | 0.514 | 0.234 | 0 |
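The distance table between businesses can be reproduced with a short script. This is a sketch assuming NumPy, using the column-distance definition with weights equal to the inverse of the row totals of the relative frequency table; the variable names are illustrative.

```python
import numpy as np

counts = np.array([[20, 55, 30],
                   [18, 40, 15],
                   [12, 5, 5]], dtype=float)

f = counts / counts.sum()          # relative frequencies
row_mass = f.sum(axis=1)           # f_i. = 0.525, 0.365, 0.11
col_profiles = f / f.sum(axis=0)   # each column now sums to 1

p = f.shape[1]
dist = np.zeros((p, p))
for j in range(p):
    for k in range(p):
        diff = col_profiles[:, j] - col_profiles[:, k]
        # Weighted Euclidean distance with weights 1 / f_i.
        dist[j, k] = np.sqrt(np.sum(diff ** 2 / row_mass))

print(np.round(dist, 3))
```

The rounded entries should agree with the table: 0.613, 0.514 and 0.234 off the diagonal.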
We can also calculate the distances between the rows, in other words the differences between the levels of employee satisfaction. To do this we need the row profile table, obtained by dividing each row by its total:
| | Business 1 | Business 2 | Business 3 | Total |
|---|---|---|---|---|
| X 1 | 0.19 | 0.524 | 0.286 | 1 |
| X 2 | 0.246 | 0.548 | 0.206 | 1 |
| X 3 | 0.546 | 0.227 | 0.227 | 1 |
| Total | 0.982 | 1.299 | 0.719 | 3 |
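For example, the distance between X 1 and X 2, with each term weighted by the inverse of the corresponding column total of the relative frequency table, is

$$d(1,2) = \sqrt{\frac{(0.19-0.246)^2}{0.25} + \frac{(0.524-0.548)^2}{0.5} + \frac{(0.286-0.206)^2}{0.25}} = 0.198$$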
We can calculate d(1,3) and d(2,3) in a similar way. The differences between the degrees of employee satisfaction are finally summarized in the following distance table:
| | X 1 | X 2 | X 3 |
|---|---|---|---|
| X 1 | 0 | 0.198 | 0.835 |
| X 2 | 0.198 | 0 | 0.754 |
| X 3 | 0.835 | 0.754 | 0 |
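The row distances can be recomputed in the same way. The sketch below (assuming NumPy; the helper `pairwise` is a made-up name) computes them twice: once from unrounded row profiles and once from the three-decimal profiles printed in the row profile table. The table values 0.198, 0.835 and 0.754 appear to come from the rounded profiles; with unrounded profiles the distances come out at roughly 0.199, 0.833 and 0.752.

```python
import numpy as np

counts = np.array([[20, 55, 30],
                   [18, 40, 15],
                   [12, 5, 5]], dtype=float)

f = counts / counts.sum()                        # relative frequencies
col_mass = f.sum(axis=0)                         # f_.j = 0.25, 0.5, 0.25
row_profiles = f / f.sum(axis=1, keepdims=True)  # each row sums to 1

def pairwise(profiles, weights):
    """Weighted Euclidean distances between all pairs of profiles."""
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.sqrt((diff ** 2 / weights).sum(axis=2))

# Distances from unrounded profiles.
exact = pairwise(row_profiles, col_mass)

# Distances from the profiles as printed in the row profile table.
printed = np.array([[0.19, 0.524, 0.286],
                    [0.246, 0.548, 0.206],
                    [0.546, 0.227, 0.227]])
rounded = pairwise(printed, col_mass)

print(np.round(exact, 3))
print(np.round(rounded, 3))
```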
http://www.researchgate.net/post/What_is_chi-squared_distance_I_need_help_with_the_source_code
http://www.springerreference.com/docs/html/chapterdbid/60817.html