The chi-squared distance d(x,y) is, as you already know, a distance between two histograms x=[x_1,...,x_n] and y=[y_1,...,y_n], each having n bins. Moreover, both histograms are normalized, i.e., their entries sum to one.
The distance measure d is usually defined (although alternative definitions exist) as d(x,y) = sum( (x_i-y_i)^2 / (x_i+y_i) ) / 2. It is often used in computer vision to compute distances between bag-of-visual-words representations of images.
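
As a minimal sketch of this definition in plain Python: the function name `chi2_distance` and the handling of bins that are empty in both histograms (they are skipped, contributing zero) are assumptions made here, not part of the definition itself.

```python
def chi2_distance(x, y):
    """d(x, y) = sum((x_i - y_i)^2 / (x_i + y_i)) / 2 for two normalized histograms."""
    if len(x) != len(y):
        raise ValueError("histograms must have the same number of bins")
    total = 0.0
    for xi, yi in zip(x, y):
        denom = xi + yi
        if denom > 0:  # assumption: a bin that is empty in both histograms adds 0
            total += (xi - yi) ** 2 / denom
    return total / 2.0


x = [0.5, 0.3, 0.2]
y = [0.2, 0.3, 0.5]
print(chi2_distance(x, y))            # distance between two different histograms
print(chi2_distance(x, x))            # identical histograms give 0.0
print(chi2_distance([1, 0], [0, 1]))  # histograms with no shared mass give 1.0
```

With the factor 1/2 and normalized inputs, the distance stays between 0 and 1.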

The name of the distance is derived from Pearson's chi-squared test statistic X²(x,y) = sum( (x_i-y_i)^2 / x_i ) for comparing discrete probability distributions (i.e., histograms). However, unlike the test statistic, d(x,y) is symmetric with respect to x and y, which is often useful in practice, e.g., when you want to construct a kernel out of the histogram distances.
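
The difference in symmetry can be checked numerically. The sketch below is illustrative only: the names `pearson_chi2`, `chi2_distance` and `chi2_kernel` are hypothetical, and the kernel form exp(-gamma * d) with its `gamma` parameter is one common way to turn the distance into a kernel, assumed here rather than taken from the text.

```python
import math


def pearson_chi2(x, y):
    """X^2(x, y) = sum((x_i - y_i)^2 / x_i); bins with x_i == 0 are skipped (assumption)."""
    return sum((xi - yi) ** 2 / xi for xi, yi in zip(x, y) if xi > 0)


def chi2_distance(x, y):
    """Symmetric chi-squared distance d(x, y) = sum((x_i - y_i)^2 / (x_i + y_i)) / 2."""
    return 0.5 * sum((xi - yi) ** 2 / (xi + yi) for xi, yi in zip(x, y) if xi + yi > 0)


def chi2_kernel(x, y, gamma=1.0):
    """Assumed kernel form: k(x, y) = exp(-gamma * d(x, y))."""
    return math.exp(-gamma * chi2_distance(x, y))


x = [0.7, 0.2, 0.1]
y = [0.4, 0.4, 0.2]

print(pearson_chi2(x, y), pearson_chi2(y, x))    # two different values
print(chi2_distance(x, y), chi2_distance(y, x))  # the same value twice
print(chi2_kernel(x, y), chi2_kernel(x, x))      # k(x, x) is exactly 1.0
```

Because d is symmetric, a kernel built this way is symmetric too, which would not hold if the test statistic were used directly.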

Chi-Square Distance

Consider a frequency table with n rows and p columns. From it we can calculate row profiles and column profiles, and plot the n or p points corresponding to each set of profiles. We can then define distances between these points. The Euclidean distance between the components of the profiles, with a weighting applied (each term is weighted by the inverse of its frequency), is called the chi-square distance. The name derives from the fact that the mathematical expression defining the distance is identical to the one encountered in the chi-square goodness-of-fit test.

MATHEMATICAL ASPECTS

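Let $f_{ij}$ be the frequency of the $i$th row and $j$th column in a frequency table with $n$ rows and $p$ columns. The chi-square distance between two rows $i$ and $i'$ is given by the formula:

$$d(i, i') = \sqrt{\sum_{j=1}^{p} \frac{1}{f_{\cdot j}} \left( \frac{f_{ij}}{f_{i\cdot}} - \frac{f_{i'j}}{f_{i'\cdot}} \right)^{2}}$$

where

$f_{i\cdot}$ is the sum of the components of the $i$th row;
$f_{\cdot j}$ is the sum of the components of the $j$th column;
$[f_{ij} / f_{i\cdot}]$ is the $i$th row profile for $j = 1, 2, \ldots, p$.

Likewise, the distance between two columns $j$ and $j'$ is given by:

$$d(j, j') = \sqrt{\sum_{i=1}^{n} \frac{1}{f_{i\cdot}} \left( \frac{f_{ij}}{f_{\cdot j}} - \frac{f_{ij'}}{f_{\cdot j'}} \right)^{2}}$$

where $[f_{ij} / f_{\cdot j}]$ is the $j$th column profile for $i = 1, \ldots, n$.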
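
The row and column distances can be sketched in a few lines of plain Python. The function names `chi2_row_distance` and `chi2_col_distance` are hypothetical, and the sketch assumes the table is first converted to relative frequencies (as in the worked example in this article) and that no row or column sums to zero.

```python
import math


def chi2_row_distance(table, i, k):
    """Chi-square distance between rows i and k of a frequency table (list of rows)."""
    total = sum(sum(row) for row in table)
    freq = [[v / total for v in row] for row in table]   # relative frequencies
    row_sums = [sum(row) for row in freq]                # row totals
    col_sums = [sum(col) for col in zip(*freq)]          # column totals
    return math.sqrt(sum(
        (freq[i][j] / row_sums[i] - freq[k][j] / row_sums[k]) ** 2 / col_sums[j]
        for j in range(len(col_sums))
    ))


def chi2_col_distance(table, j, k):
    """Columns of the table are rows of its transpose, so reuse the row formula."""
    transposed = [list(col) for col in zip(*table)]
    return chi2_row_distance(transposed, j, k)


# Hypothetical table: rows 0 and 1 are proportional, so their profiles coincide.
table = [
    [10, 20, 30],
    [20, 40, 60],
    [5, 5, 5],
]
print(chi2_row_distance(table, 0, 1))  # approximately 0: identical row profiles
print(chi2_row_distance(table, 0, 2))
print(chi2_col_distance(table, 0, 2))
```

Computing the column distance through the transpose keeps a single implementation of the formula, since the two definitions only swap the roles of rows and columns.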

DOMAINS AND LIMITATIONS

The chi-square distance incorporates a weight that is inversely proportional to the total of each row (or column), which increases the importance of small deviations in rows (or columns) that have a small sum relative to those with a larger sum.

The chi-square distance has the property of distributional equivalence, meaning that it ensures that the distances between rows and columns are invariant when two columns (or two rows) with identical profiles are aggregated.
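
Distributional equivalence can be illustrated numerically. In the sketch below the table is invented for the purpose: its first two columns are proportional, so they share the same column profile. The helper names `row_distance` and `merge_columns` are hypothetical.

```python
import math


def row_distance(table, i, k):
    """Chi-square distance between rows i and k, using relative frequencies."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) / total for row in table]
    col_sums = [sum(col) / total for col in zip(*table)]
    return math.sqrt(sum(
        (table[i][j] / total / row_sums[i] - table[k][j] / total / row_sums[k]) ** 2
        / col_sums[j]
        for j in range(len(col_sums))
    ))


def merge_columns(table, a, b):
    """Return a new table in which column b has been added into column a."""
    merged = []
    for row in table:
        new_row = list(row)
        new_row[a] += new_row[b]
        del new_row[b]
        merged.append(new_row)
    return merged


# Hypothetical counts: column 1 is exactly twice column 0 (identical profiles).
table = [
    [10, 20, 5],
    [30, 60, 15],
    [20, 40, 40],
]
merged = merge_columns(table, 0, 1)

for i, k in [(0, 1), (0, 2), (1, 2)]:
    print(i, k, row_distance(table, i, k), row_distance(merged, i, k))
```

Each pair of rows should show the same distance before and after the merge, up to floating-point rounding.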

EXAMPLES

Consider a contingency table charting how satisfied employees working for three different businesses are. Let us establish a distance table using the chi-square distance.

Values for the studied variable X can fall into one of three categories:

  • 1: high satisfaction;
  • 2: medium satisfaction;
  • 3: low satisfaction.

The observations collected from samples of individuals from the three businesses are given below:

 

Satisfaction   Business 1   Business 2   Business 3   Total
1                      20           55           30     105
2                      18           40           15      73
3                      12            5            5      22
Total                  50          100           50     200

The relative frequency table is obtained by dividing all of the elements of the table by 200, the total number of observations:

 

Satisfaction   Business 1   Business 2   Business 3   Total
1                   0.1        0.275        0.15      0.525
2                   0.09       0.2          0.075     0.365
3                   0.06       0.025        0.025     0.11
Total               0.25       0.5          0.25      1
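
The relative frequencies and their margins can be recomputed from the observed counts with a few lines of Python. This is a plain-Python sketch, and the variable names are chosen here for illustration.

```python
# Observed counts: rows are satisfaction levels 1-3, columns are Businesses 1-3.
counts = [
    [20, 55, 30],
    [18, 40, 15],
    [12, 5, 5],
]

total = sum(sum(row) for row in counts)               # total number of observations
freq = [[v / total for v in row] for row in counts]   # relative frequencies
row_totals = [sum(row) for row in freq]               # "Total" column of the table
col_totals = [sum(col) for col in zip(*freq)]         # "Total" row of the table

for row, r_total in zip(freq, row_totals):
    print([round(v, 3) for v in row], round(r_total, 3))
print([round(v, 3) for v in col_totals])
```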

We can calculate the difference in employee satisfaction between the three businesses. The column profile matrix is given below:

 

Satisfaction   Business 1   Business 2   Business 3   Total
1                   0.4         0.55         0.6       1.55
2                   0.36        0.4          0.3       1.06
3                   0.24        0.05         0.1       0.39
Total               1           1            1         3
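
Each column profile is the column of counts divided by its own total. A short sketch of that step, with variable names chosen here for illustration:

```python
# Observed counts: rows are satisfaction levels 1-3, columns are Businesses 1-3.
counts = [
    [20, 55, 30],
    [18, 40, 15],
    [12, 5, 5],
]

col_counts = [sum(col) for col in zip(*counts)]  # column totals: 50, 100, 50

# col_profiles[i][j] is the share of satisfaction level i within business j.
col_profiles = [
    [row[j] / col_counts[j] for j in range(len(col_counts))]
    for row in counts
]

for row in col_profiles:
    print([round(v, 3) for v in row])
```

Every column of `col_profiles` sums to one, which is why the "Total" row of the profile table contains only ones.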

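This allows us to calculate the distances between the different columns. For Businesses 1 and 2, weighting each term by the inverse of the corresponding row total of the relative frequency table:

$$d^{2}(1,2) = \frac{(0.4-0.55)^{2}}{0.525} + \frac{(0.36-0.4)^{2}}{0.365} + \frac{(0.24-0.05)^{2}}{0.11} \approx 0.375, \qquad d(1,2) \approx 0.613$$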

We can calculate d(1,3) and d(2,3) in a similar way. The distances obtained are summarized in the following distance table:

 

             Business 1   Business 2   Business 3
Business 1      0            0.613        0.514
Business 2      0.613        0            0.234
Business 3      0.514        0.234        0
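
The distance table between businesses can be recomputed from the observed counts. The sketch below uses plain Python and hypothetical helper names (`col_profile`, `col_distance`), with the row totals of the relative frequency table as weights.

```python
import math

# Observed counts: rows are satisfaction levels 1-3, columns are Businesses 1-3.
counts = [
    [20, 55, 30],
    [18, 40, 15],
    [12, 5, 5],
]

total = sum(sum(row) for row in counts)
row_weights = [sum(row) / total for row in counts]   # 0.525, 0.365, 0.11
col_counts = [sum(col) for col in zip(*counts)]      # 50, 100, 50


def col_profile(j):
    """Profile of business j: its counts divided by its column total."""
    return [row[j] / col_counts[j] for row in counts]


def col_distance(j, k):
    """Chi-square distance between the profiles of businesses j and k (0-based)."""
    return math.sqrt(sum(
        (a - b) ** 2 / w
        for a, b, w in zip(col_profile(j), col_profile(k), row_weights)
    ))


for j in range(3):
    print([round(col_distance(j, k), 3) for k in range(3)])
```

Rounded to three decimals, the printed matrix should agree with the distance table above.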

We can also calculate the distances between the rows, in other words the differences between the satisfaction levels; to do this we need the row profile table:

 

Satisfaction   Business 1   Business 2   Business 3   Total
1                 0.19         0.524        0.286       1
2                 0.246        0.548        0.206       1
3                 0.546        0.227        0.227       1
Total             0.982        1.299        0.719       3

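This allows us to calculate the distances between the different rows. For satisfaction levels 1 and 2, weighting each term by the inverse of the corresponding column total of the relative frequency table:

$$d^{2}(1,2) = \frac{(0.19-0.246)^{2}}{0.25} + \frac{(0.524-0.548)^{2}}{0.5} + \frac{(0.286-0.206)^{2}}{0.25} \approx 0.039, \qquad d(1,2) \approx 0.198$$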

We can calculate d(1,3) and d(2,3) in a similar way. The differences between the degrees of employee satisfaction are finally summarized in the following distance table:

 

Satisfaction      1         2         3
1                 0         0.198     0.835
2                 0.198     0         0.754
3                 0.835     0.754     0
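
The row distances can be checked directly from the printed row profiles and the column totals of the relative frequency table. This sketch deliberately uses the rounded three-decimal profiles shown in the row profile table; recomputing the profiles from the raw counts without rounding changes the third decimal slightly. The helper name `row_distance` is hypothetical.

```python
import math

# Row profiles as printed in the row profile table (already rounded).
row_profiles = [
    [0.19, 0.524, 0.286],   # satisfaction level 1
    [0.246, 0.548, 0.206],  # satisfaction level 2
    [0.546, 0.227, 0.227],  # satisfaction level 3
]
col_weights = [0.25, 0.5, 0.25]  # column totals of the relative frequency table


def row_distance(i, k):
    """Chi-square distance between satisfaction levels i and k (0-based)."""
    return math.sqrt(sum(
        (a - b) ** 2 / w
        for a, b, w in zip(row_profiles[i], row_profiles[k], col_weights)
    ))


for i in range(3):
    print([round(row_distance(i, k), 3) for k in range(3)])
# [0.0, 0.198, 0.835]
# [0.198, 0.0, 0.754]
# [0.835, 0.754, 0.0]
```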

http://www.researchgate.net/post/What_is_chi-squared_distance_I_need_help_with_the_source_code

http://www.springerreference.com/docs/html/chapterdbid/60817.html
