Chi Square Distance
The chi-squared distance d(x,y) is a distance between two histograms x=[x_1,...,x_n] and y=[y_1,...,y_n], each with n bins. Both histograms are assumed to be normalized, i.e. their entries sum to one.
The distance d is usually defined (although alternative definitions exist) as d(x,y) = (1/2) * sum_i( (x_i-y_i)^2 / (x_i+y_i) ). It is often used in computer vision to compare bag-of-visual-words representations of images.
The name of the distance is derived from Pearson's chi-squared test statistic X²(x,y) = sum_i( (x_i-y_i)^2 / x_i ) for comparing discrete probability distributions (i.e. histograms). However, unlike the test statistic, d(x,y) is symmetric with respect to x and y, which is often useful in practice, e.g. when you want to construct a kernel from the histogram distances.
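To make the two formulas above concrete, here is a minimal pure-Python sketch. The function names `chi2_distance` and `pearson_chi2` and the sample histograms are illustrative assumptions, as is the choice to skip bins that are empty in both histograms (the formula leaves 0/0 undefined there).

```python
def chi2_distance(x, y):
    """Symmetric chi-squared distance: sum((xi - yi)^2 / (xi + yi)) / 2."""
    if len(x) != len(y):
        raise ValueError("histograms must have the same number of bins")
    total = 0.0
    for xi, yi in zip(x, y):
        # Assumption: a bin that is empty in both histograms contributes 0.
        if xi + yi > 0:
            total += (xi - yi) ** 2 / (xi + yi)
    return total / 2


def pearson_chi2(x, y):
    """Pearson-style statistic sum((xi - yi)^2 / xi); requires every xi > 0."""
    return sum((xi - yi) ** 2 / xi for xi, yi in zip(x, y))


# Two hypothetical normalized histograms with three bins each.
x = [0.5, 0.3, 0.2]
y = [0.25, 0.25, 0.5]

print(chi2_distance(x, y), chi2_distance(y, x))  # same value in both orders
print(pearson_chi2(x, y), pearson_chi2(y, x))    # generally two different values
```

Swapping the arguments leaves `chi2_distance` unchanged because both the numerator and the denominator are symmetric in x_i and y_i, whereas `pearson_chi2` divides by x_i only.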
Chi-Square Distance
Consider a frequency table with n rows and p columns. From it we can calculate row profiles and column profiles, and plot the n or p points given by each set of profiles. We can then define distances between these points. The chi-square distance is a weighted Euclidean distance between the components of the profiles, in which each term is weighted by the inverse of its frequency. The name comes from the fact that the mathematical expression defining the distance is identical to the one encountered in the chi-square goodness-of-fit test.
MATHEMATICAL ASPECTS
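Let f_ij denote the relative frequency in the ith row and jth column of a frequency table with n rows and p columns. The chi-square distance between two rows i and i' is given by:

$$d(i,i') = \sqrt{\sum_{j=1}^{p} \frac{1}{f_{\cdot j}} \left( \frac{f_{ij}}{f_{i\cdot}} - \frac{f_{i'j}}{f_{i'\cdot}} \right)^{2}}$$

where
- $f_{i\cdot}$ is the sum of the components of the ith row;
- $f_{\cdot j}$ is the sum of the components of the jth column;
- $\left[ f_{ij} / f_{i\cdot} \right]$ is the ith row profile for j = 1,2,...,p.

Likewise, the distance between two columns j and j' is given by:

$$d(j,j') = \sqrt{\sum_{i=1}^{n} \frac{1}{f_{i\cdot}} \left( \frac{f_{ij}}{f_{\cdot j}} - \frac{f_{ij'}}{f_{\cdot j'}} \right)^{2}}$$

where $\left[ f_{ij} / f_{\cdot j} \right]$ is the jth column profile for i = 1,...,n.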
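The row and column distances can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `row_distance` and `column_distance` are hypothetical, the table is a list of rows holding raw counts, the weights are the relative-frequency marginals, and every row and column total is assumed to be nonzero.

```python
import math


def row_distance(table, i, k):
    """Chi-square distance between rows i and k of a table of counts."""
    grand = sum(sum(row) for row in table)
    f = [[v / grand for v in row] for row in table]   # relative frequencies f_ij
    row_tot = [sum(row) for row in f]                 # row sums f_i.
    col_tot = [sum(col) for col in zip(*f)]           # column sums f_.j
    return math.sqrt(sum(
        (f[i][j] / row_tot[i] - f[k][j] / row_tot[k]) ** 2 / col_tot[j]
        for j in range(len(col_tot))
    ))


def column_distance(table, j, k):
    """Chi-square distance between columns j and k (rows of the transpose)."""
    transposed = [list(col) for col in zip(*table)]
    return row_distance(transposed, j, k)


# Hypothetical 2 x 3 table of counts.
table = [[10, 30, 20],
         [20, 40, 5]]
print(row_distance(table, 0, 1))
print(column_distance(table, 0, 2))
```

Computing the column distance on the transposed table works because the two formulas are mirror images of each other: rows and columns simply swap roles.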
DOMAINS AND LIMITATIONS
The chi-square distance incorporates a weight that is inversely proportional to the total of each row (or column), which increases the importance of small deviations in the rows (or columns) that have a small sum relative to those with a larger sum.
The chi-square distance has the property of distributional equivalence, meaning that it ensures that the distances between rows and columns are invariant when two columns (or two rows) with identical profiles are aggregated.
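The distributional equivalence property can be checked numerically. The sketch below is an illustration under assumptions: the table is hypothetical, its second column is exactly twice the first (so the two columns have identical profiles), and `row_distances` is a helper name introduced here. It compares the row distances before and after the two columns are added together.

```python
import math


def row_distances(table):
    """Pairwise chi-square distances between the rows of a table of counts."""
    grand = sum(map(sum, table))
    row_tot = [sum(r) / grand for r in table]         # relative row totals
    col_tot = [sum(c) / grand for c in zip(*table)]   # relative column totals
    profiles = [[v / grand / rt for v in r] for r, rt in zip(table, row_tot)]
    return [[math.sqrt(sum((a - b) ** 2 / w for a, b, w in zip(p, q, col_tot)))
             for q in profiles] for p in profiles]


# Hypothetical table: column 2 is twice column 1, so their profiles coincide.
original = [[10, 20, 5],
            [30, 60, 10],
            [20, 40, 25]]
# Aggregate the first two columns into one.
merged = [[r[0] + r[1], r[2]] for r in original]

before = row_distances(original)
after = row_distances(merged)
unchanged = all(
    math.isclose(a, b, abs_tol=1e-12)
    for row_a, row_b in zip(before, after)
    for a, b in zip(row_a, row_b)
)
print(unchanged)
```

The check relies on the weights: merging two proportional columns scales the squared profile difference and the column total by matching factors, so the summed contribution stays the same.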
EXAMPLES
Consider a contingency table recording how satisfied the employees of three different businesses are. Let us establish a distance table using the chi-square distance.
Values for the studied variable X can fall into one of three categories:
- X 1: high satisfaction;
- X 2: medium satisfaction;
- X 3: low satisfaction.
The observations collected from samples of individuals from the three businesses are given below:
| | Business 1 | Business 2 | Business 3 | Total |
|---|---|---|---|---|
| X 1 | 20 | 55 | 30 | 105 |
| X 2 | 18 | 40 | 15 | 73 |
| X 3 | 12 | 5 | 5 | 22 |
| Total | 50 | 100 | 50 | 200 |
The relative frequency table is obtained by dividing all of the elements of the table by 200, the total number of observations:
| | Business 1 | Business 2 | Business 3 | Total |
|---|---|---|---|---|
| X 1 | 0.1 | 0.275 | 0.15 | 0.525 |
| X 2 | 0.09 | 0.2 | 0.075 | 0.365 |
| X 3 | 0.06 | 0.025 | 0.025 | 0.11 |
| Total | 0.25 | 0.5 | 0.25 | 1 |
We can calculate the differences in employee satisfaction between the three businesses. The column profile matrix is given below:
| | Business 1 | Business 2 | Business 3 | Total |
|---|---|---|---|---|
| X 1 | 0.4 | 0.55 | 0.6 | 1.55 |
| X 2 | 0.36 | 0.4 | 0.3 | 1.06 |
| X 3 | 0.24 | 0.05 | 0.1 | 0.39 |
| Total | 1 | 1 | 1 | 3 |
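The relative frequencies and column profiles in the tables above can be recomputed from the raw counts. The following is a small sketch; the variable names are illustrative assumptions, and the counts are those of the example.

```python
# Observed counts: rows are X 1, X 2, X 3; columns are Business 1, 2, 3.
counts = [
    [20, 55, 30],   # X 1: high satisfaction
    [18, 40, 15],   # X 2: medium satisfaction
    [12, 5, 5],     # X 3: low satisfaction
]

grand_total = sum(sum(row) for row in counts)
col_totals = [sum(col) for col in zip(*counts)]

# Relative frequencies: every cell divided by the grand total.
rel = [[v / grand_total for v in row] for row in counts]

# Column profiles: every cell divided by its column total.
col_profiles = [[v / col_totals[j] for j, v in enumerate(row)] for row in counts]

for label, row in zip(("X 1", "X 2", "X 3"), col_profiles):
    print(label, [round(v, 3) for v in row])
```

Each column of `col_profiles` sums to one, which is what makes the businesses comparable despite their different sample sizes.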
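This allows us to calculate the distances between the columns. For Business 1 and Business 2, weighting each term by the inverse of the corresponding row total of the relative frequency table:

$$d(1,2) = \sqrt{\frac{(0.4-0.55)^2}{0.525} + \frac{(0.36-0.4)^2}{0.365} + \frac{(0.24-0.05)^2}{0.11}} = 0.613$$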
We can calculate d(1,3) and d(2,3) in a similar way. The distances obtained are summarized in the following distance table:
| | Business 1 | Business 2 | Business 3 |
|---|---|---|---|
| Business 1 | 0 | 0.613 | 0.514 |
| Business 2 | 0.613 | 0 | 0.234 |
| Business 3 | 0.514 | 0.234 | 0 |
We can also calculate the distances between the rows, in other words the differences between the satisfaction levels; to do this we need the row profile table:
| | Business 1 | Business 2 | Business 3 | Total |
|---|---|---|---|---|
| X 1 | 0.19 | 0.524 | 0.286 | 1 |
| X 2 | 0.246 | 0.548 | 0.206 | 1 |
| X 3 | 0.546 | 0.227 | 0.227 | 1 |
| Total | 0.982 | 1.299 | 0.719 | 3 |
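For the rows X 1 and X 2, weighting each term by the inverse of the corresponding column total of the relative frequency table:

$$d(1,2) = \sqrt{\frac{(0.19-0.246)^2}{0.25} + \frac{(0.524-0.548)^2}{0.5} + \frac{(0.286-0.206)^2}{0.25}} = 0.198$$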
We can calculate d(1,3) and d(2,3) in a similar way. The differences between the degrees of employee satisfaction are finally summarized in the following distance table:
| | X 1 | X 2 | X 3 |
|---|---|---|---|
| X 1 | 0 | 0.198 | 0.835 |
| X 2 | 0.198 | 0 | 0.754 |
| X 3 | 0.835 | 0.754 | 0 |
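Both distance tables of this example can be reproduced with a short script. This is a sketch under assumptions: `chi_square_distances` is a hypothetical helper name, and it works on unrounded profiles. For that reason the distances between satisfaction levels come out at roughly 0.199, 0.833 and 0.752, a little off the tabulated 0.198, 0.835 and 0.754, which follow from profiles rounded to three decimals. The distances between businesses agree with the table to three decimals.

```python
import math

# Observed counts: rows are X 1, X 2, X 3; columns are Business 1, 2, 3.
counts = [
    [20, 55, 30],
    [18, 40, 15],
    [12, 5, 5],
]


def chi_square_distances(table):
    """Pairwise chi-square distances between the rows of a table of counts."""
    grand = sum(map(sum, table))
    row_tot = [sum(r) / grand for r in table]         # relative row totals
    col_tot = [sum(c) / grand for c in zip(*table)]   # relative column totals
    profiles = [[v / grand / rt for v in r] for r, rt in zip(table, row_tot)]
    return [[math.sqrt(sum((a - b) ** 2 / w for a, b, w in zip(p, q, col_tot)))
             for q in profiles] for p in profiles]


# Distances between satisfaction levels (rows of the table).
between_levels = chi_square_distances(counts)

# Distances between businesses (columns), obtained from the transposed table.
between_businesses = chi_square_distances([list(c) for c in zip(*counts)])

for row in between_businesses:
    print([round(d, 3) for d in row])
for row in between_levels:
    print([round(d, 3) for d in row])
```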
http://www.researchgate.net/post/What_is_chi-squared_distance_I_need_help_with_the_source_code
http://www.springerreference.com/docs/html/chapterdbid/60817.html