Chapter 1.6 : Information Theory

 
 


Christopher M. Bishop, PRML, Chapter 1 Introduction

1. Information h(x)

Given a random variable $x$, we ask how much information is received when we observe a specific value of this variable.

  • The amount of information can be viewed as the “degree of surprise” on learning the value of $x$.
  • information: $h(x) = -\log_2 p(x)$ (1.92), where the negative sign ensures that information is positive or zero (a numeric sketch follows this list).
  • the units of $h(x)$:
    • using logarithms to the base of 2: the units of $h(x)$ are bits (‘binary digits’).
    • using logarithms to the base of $e$, i.e., natural logarithms: the units of $h(x)$ are nats.
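To make the units concrete, here is a minimal sketch in plain Python (not from PRML; the event probability below is an arbitrary choice) that evaluates $h(x) = -\log p(x)$ in bits and in nats.

```python
import math

def information(p, base=2.0):
    """Information content h(x) = -log_base p(x) of an event with probability p."""
    return -math.log(p) / math.log(base)

p = 1.0 / 8.0                         # arbitrary event probability
print(information(p, base=2.0))       # 3.0 bits: an event of probability 1/8 carries 3 bits
print(information(p, base=math.e))    # ~2.079 nats: the same quantity in natural-log units
```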

2. Entropy H(x): average amount of information

2.1 Entropy H(x)

Firstly, we interpret the concept of entropy in terms of the average amount of information needed to specify the state of a random variable.

Now suppose that a sender wishes to transmit the value of a random variable $x$ to a receiver. The average amount of information that they transmit in the process is obtained by taking the expectation of (1.92) with respect to the distribution $p(x)$:

  • discrete entropy for a discrete random variable: $H[x] = -\sum_x p(x)\log_2 p(x)$ (1.93)
  • or differential/continuous entropy for a continuous random variable: $H[x] = -\int p(x)\ln p(x)\,dx$
  • Note that $\lim_{p\to 0} p\ln p = 0$, and so we shall take $p(x)\ln p(x) = 0$ whenever we encounter a value for $x$ such that $p(x) = 0$.
  • The nonuniform distribution has a smaller entropy than the uniform one (see the sketch after this list).
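As a quick check of the last bullet, the sketch below (plain Python/NumPy assumed; the nonuniform distribution is the 8-state example used in PRML Section 1.6) computes the discrete entropy in bits for a uniform and a nonuniform distribution over 8 states.

```python
import numpy as np

def discrete_entropy(p, base=2.0):
    """H[x] = -sum_i p_i log(p_i), taking 0 log 0 := 0 by dropping zero-probability states."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

uniform = np.full(8, 1.0 / 8.0)
nonuniform = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])

print(discrete_entropy(uniform))      # 3.0 bits
print(discrete_entropy(nonuniform))   # 2.0 bits -- smaller than for the uniform distribution
```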

2.2 Noiseless coding theorem (Shannon, 1948)

The noiseless coding theorem states that the entropy is a lower bound on the number of bits needed to transmit the state of a random variable.
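The nonuniform 8-state distribution above illustrates the bound: PRML pairs it with a prefix-free code of lengths 1, 2, 3, 4, 6, 6, 6, 6 whose average length equals the entropy. The sketch below (plain Python/NumPy assumed) checks this.

```python
import numpy as np

probs = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])
# Prefix-free code from PRML's example: 0, 10, 110, 1110, 111100, 111101, 111110, 111111
code_lengths = np.array([1, 2, 3, 4, 6, 6, 6, 6])

entropy_bits = -np.sum(probs * np.log2(probs))
avg_length = np.sum(probs * code_lengths)

print(entropy_bits)   # 2.0 bits -- the Shannon lower bound
print(avg_length)     # 2.0 bits -- this code attains the bound
```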

2.3 Alternative view of entropy H(x)

Secondly, we introduce the concept of entropy as it arose in physics, in the context of equilibrium thermodynamics, and was later given a deeper interpretation as a measure of disorder through developments in statistical mechanics.

Consider a set of $N$ identical objects that are to be divided amongst a set of bins, such that there are $n_i$ objects in the $i$th bin. Consider the number of different ways of allocating the objects to the bins.

  • There are $N$ ways to choose the first object, $(N-1)$ ways to choose the second object, and so on, leading to a total of $N!$ ways to allocate all $N$ objects to the bins.
  • However, we don’t wish to distinguish between rearrangements of objects within each bin. In the $i$th bin there are $n_i!$ ways of reordering the objects, and so the total number of ways of allocating the $N$ objects to the bins is given by $W = \dfrac{N!}{\prod_i n_i!}$, which is called the multiplicity.
  • The entropy is then defined as the logarithm of the multiplicity scaled by an appropriate constant: $H = \dfrac{1}{N}\ln W = \dfrac{1}{N}\ln N! - \dfrac{1}{N}\sum_i \ln n_i!$
  • We now consider the limit $N \to \infty$, in which the fractions $n_i/N$ are held fixed, and apply Stirling’s approximation $\ln N! \simeq N\ln N - N$,
  • which gives

    $H = -\lim_{N\to\infty}\sum_i \dfrac{n_i}{N}\ln\dfrac{n_i}{N}$, i.e., $H = -\sum_i p_i \ln p_i$,

  • where we have used $\sum_i n_i = N$, and $p_i = \lim_{N\to\infty}(n_i/N)$ is the probability of an object being assigned to the $i$th bin.
  • microstate: In physics terminology, a specific arrangement of the objects in the bins is called a microstate.
  • macrostate: the overall distribution of occupation numbers, expressed through the ratios $n_i/N$, is called a macrostate.
  • The multiplicity $W$ is also known as the weight of the macrostate.
  • We can interpret the bins as the states $x_i$ of a discrete random variable $X$, where $p(X = x_i) = p_i$. The entropy of the random variable $X$ is then $H[p] = -\sum_i p(x_i)\ln p(x_i)$ (a numeric illustration of the limit follows this list).
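As a numeric illustration of this limit (a sketch in plain Python; the macrostate fractions below are an arbitrary choice), compare $\frac{1}{N}\ln W$ with $-\sum_i p_i\ln p_i$ as $N$ grows.

```python
import math

def entropy_from_multiplicity(counts):
    """(1/N) ln W with W = N! / prod_i n_i!, using lgamma(n + 1) = ln(n!)."""
    N = sum(counts)
    log_W = math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in counts)
    return log_W / N

def entropy_from_probs(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

probs = [0.5, 0.3, 0.2]                       # fractions n_i / N held fixed as N grows
for N in (10, 100, 10_000, 1_000_000):
    counts = [int(round(p * N)) for p in probs]
    print(N, entropy_from_multiplicity(counts))
print("limit:", entropy_from_probs(probs))    # (1/N) ln W approaches -sum_i p_i ln p_i
```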

2.4 Comparison between discrete entropy and continuous entropy

| H(x)         | Discrete distribution X | Continuous distribution X |
|--------------|-------------------------|---------------------------|
| Maximized by | Uniform X               | Gaussian X                |
| Sign         | always ≥ 0              | could be negative         |
  • Maximum entropy H(x) :

    • In the case of discrete distributions, the maximum entropy configuration corresponds to an equal distribution of probabilities across the possible states of the variable.
    • For a continuous variable, the distribution that maximizes the differential entropy is the Gaussian [see Page 54 in PRML].
  • Is H(x) negative or positive?
    • The discrete entropy in (1.93) is always $\geq 0$, because $0 \leq p(x_i) \leq 1$ implies $-p(x_i)\log_2 p(x_i) \geq 0$. It will equal its minimum value of 0 when one of the $p(x_i) = 1$ and all other $p(x_{j\neq i}) = 0$.
    • The differential entropy can be negative, because $H[x] < 0$ in (1.110) for $\sigma^2 < 1/(2\pi e)$.

      If we evaluate the differential entropy of the Gaussian, we obtain $H[x] = \frac{1}{2}\{1 + \ln(2\pi\sigma^2)\}$ (1.110). This result also shows that the differential entropy, unlike the discrete entropy, can be negative, because $H[x] < 0$ in (1.110) for $\sigma^2 < 1/(2\pi e)$ (see the numeric check after this list).
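A quick numeric check of (1.110), as a sketch in plain Python: the differential entropy of a univariate Gaussian crosses zero at $\sigma^2 = 1/(2\pi e)$ and is negative below that variance.

```python
import math

def gaussian_differential_entropy(sigma2):
    """H[x] = 0.5 * (1 + ln(2 * pi * sigma^2)) for a univariate Gaussian, eq. (1.110)."""
    return 0.5 * (1.0 + math.log(2.0 * math.pi * sigma2))

threshold = 1.0 / (2.0 * math.pi * math.e)    # variance at which H[x] = 0
print(gaussian_differential_entropy(1.0))         # ~1.419 nats (positive)
print(gaussian_differential_entropy(threshold))   # 0.0
print(gaussian_differential_entropy(0.01))        # ~ -0.88 nats (negative)
```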

2.5 Conditional entropy H(y|x)

  • Conditional entropy:
    Suppose we have a joint distribution $p(x, y)$ from which we draw pairs of values of $x$ and $y$. If a value of $x$ is already known, then the additional information needed to specify the corresponding value of $y$ is given by $-\ln p(y|x)$. Thus the average additional information needed to specify $y$ can be written as $H[y|x] = -\iint p(y, x)\ln p(y|x)\,dy\,dx$ (1.111),
    which is called the conditional entropy of $y$ given $x$.
  • It is easily seen, using the product rule, that the conditional entropy satisfies the relation $H[x, y] = H[y|x] + H[x]$ (1.112), where $H[x, y]$ is the differential entropy (i.e., continuous entropy) of $p(x, y)$, and $H[x]$ is the differential entropy of the marginal distribution $p(x)$.
  • From (1.112) we see that

    the information needed to describe $x$ and $y$ is given by the sum of the information needed to describe $x$ alone plus the additional information required to specify $y$ given $x$ (a numeric check of (1.112) follows this list).
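The sketch below (plain Python/NumPy assumed; the 2×3 joint table is an arbitrary example) verifies the relation (1.112) numerically for a discrete joint distribution, where the integrals become sums.

```python
import numpy as np

# Arbitrary joint distribution p(x, y); rows index x, columns index y. Entries sum to 1.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])

def H(p):
    """Entropy in nats of an array of probabilities, with 0 ln 0 := 0."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p_x = p_xy.sum(axis=1)                               # marginal p(x)
p_y_given_x = p_xy / p_x[:, None]                    # conditional p(y|x)
H_y_given_x = -np.sum(p_xy * np.log(p_y_given_x))    # H[y|x]

print(H(p_xy))                 # joint entropy H[x, y]
print(H_y_given_x + H(p_x))    # same value: H[x, y] = H[y|x] + H[x]
```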

3. Relative entropy or KL divergence

3.1 Relative entropy or KL divergence

Problem: How to relate the notion of entropy to pattern recognition?
Description: Consider some unknown distribution $p(x)$, and suppose that we have modelled this using an approximating distribution $q(x)$.

  • Relative entropy, Kullback-Leibler divergence, or KL divergence between the distributions $p(x)$ and $q(x)$: If we use $q(x)$ to construct a coding scheme for the purpose of transmitting values of $x$ to a receiver, then the average additional amount of information (in nats) required to specify the value of $x$ (assuming we choose an efficient coding scheme) as a result of using $q(x)$ instead of the true distribution $p(x)$ is given by
    $\mathrm{KL}(p\|q) = -\int p(x)\ln q(x)\,dx - \left(-\int p(x)\ln p(x)\,dx\right) = -\int p(x)\ln\dfrac{q(x)}{p(x)}\,dx$ (1.113),
    and, in the discrete case, $\mathrm{KL}(p\|q) = \sum_k p_k \ln\dfrac{p_k}{q_k}$ (2.110 in Ref-1).

  • It can be rewritten as $\mathrm{KL}(p\|q) = \sum_k p_k \ln p_k - \sum_k p_k \ln q_k = -H(p) + H(p, q)$ (2.111) [see Ref-1].

  • Cross Entropy: $H(p, q) \triangleq -\sum_k p_k \ln q_k$ is called the cross entropy,

    • Understanding of Cross Entropy: One can show that the cross entropy $H(p, q)$ is the average number of bits (or nats) needed to encode data coming from a source with distribution $p$ when we use model $q$ to define our codebook.
    • Understanding of (“Regular”) Entropy: Hence the “regular” entropy $H(p) = H(p, p)$ is the expected number of bits if we use the true model.
    • Understanding of Relative Entropy: So the KL divergence is the difference between these (as shown in (2.111)). In other words, the KL divergence is the average number of extra bits (or nats) needed to encode the data, due to the fact that we used the approximating distribution $q$ to encode the data instead of the true distribution $p$.
  • Asymmetric: Note that the KL divergence is not a symmetrical quantity, that is to say $\mathrm{KL}(p\|q) \not\equiv \mathrm{KL}(q\|p)$.

  • KL divergence is a way to measure the dissimilarity of two probability distributions, $p$ and $q$ [see Ref-1] (see the sketch after this list).
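A discrete sketch (plain Python/NumPy assumed; the two distributions are arbitrary) that computes $\mathrm{KL}(p\|q)$ directly, recovers it as cross entropy minus entropy (2.111), and shows the asymmetry.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # "true" distribution (arbitrary)
q = np.array([0.4, 0.4, 0.2])   # approximating distribution (arbitrary)

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl(p, q):
    return np.sum(p * np.log(p / q))

print(kl(p, q))                           # direct definition, in nats
print(cross_entropy(p, q) - entropy(p))   # same value: KL(p||q) = H(p, q) - H(p)
print(kl(q, p))                           # a different value: KL is not symmetric
```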

3.2 Information inequality [see Ref-1]

The “extra number of bits” interpretation should make it clear that $\mathrm{KL}(p\|q) \geq 0$, and that the KL divergence is equal to zero if and only if $p = q$. We now give a proof of this important result.

Proof:

  • 1) Convex functions: To do this we first introduce the concept of convex functions. A function $f(x)$ is said to be convex if it has the property that every chord lies on or above the function, as shown in Figure 1.31.

    • Convexity then implies $f(\lambda a + (1-\lambda)b) \leq \lambda f(a) + (1-\lambda)f(b)$ for any $0 \leq \lambda \leq 1$ (1.114).
  • 2) Jensen’s inequality:
    • Using the technique of proof by induction, we can show from (1.114) that a convex function $f$ satisfies $f\!\left(\sum_{i=1}^{M}\lambda_i x_i\right) \leq \sum_{i=1}^{M}\lambda_i f(x_i)$ (1.115), where $\lambda_i \geq 0$ and $\sum_i \lambda_i = 1$, for any set of points $\{x_i\}$. The result (1.115) is known as Jensen’s inequality.
    • If we interpret the $\lambda_i$ as the probability distribution over a discrete variable $x$ taking the values $\{x_i\}$, then (1.115) can be written $f(\mathbb{E}[x]) \leq \mathbb{E}[f(x)]$ (1.116). For continuous variables, Jensen’s inequality takes the form $f\!\left(\int x\,p(x)\,dx\right) \leq \int f(x)\,p(x)\,dx$ (1.117).
  • 3) Apply Jensen’s inequality in the form (1.117) to the KL divergence (1.113) to give $\mathrm{KL}(p\|q) = -\int p(x)\ln\dfrac{q(x)}{p(x)}\,dx \geq -\ln\int q(x)\,dx = 0$ (1.118), where we have used the fact that $-\ln x$ is a convex function (in fact, $-\ln x$ is a strictly convex function, so the equality will hold if, and only if, $q(x) = p(x)$ for all $x$), together with the normalization condition $\int q(x)\,dx = 1$.
  • 4) Similarly, let $A = \{x : p(x) > 0\}$ be the support of $p(x)$, and apply Jensen’s inequality in the form (1.115) to the discrete form of the KL divergence (2.110) to get [see Ref-1] $-\mathrm{KL}(p\|q) = \sum_{x\in A} p(x)\ln\dfrac{q(x)}{p(x)} \leq \ln\sum_{x\in A} p(x)\dfrac{q(x)}{p(x)} = \ln\sum_{x\in A} q(x) \leq \ln\sum_{x} q(x) = \ln 1 = 0$, where the first inequality follows from Jensen’s. Since $\ln x$ is a strictly concave (i.e., the negative of a convex) function, we have equality in the first inequality (2.115 in Ref-1) iff $q(x)/p(x) = c$ for some constant $c$. We have equality in the second inequality (2.116 in Ref-1) iff $\sum_{x\in A} q(x) = \sum_{x} q(x) = 1$, which implies $c = 1$.
  • 5) Hence $\mathrm{KL}(p\|q) = 0$ iff $p(x) = q(x)$ for all $x$ (a numerical check follows this list).
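As a numerical sanity check of the information inequality (a sketch, plain Python/NumPy assumed), draw many random pairs of discrete distributions and confirm that $\mathrm{KL}(p\|q)$ never goes below zero, and equals zero when $p = q$.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    return np.sum(p * np.log(p / q))

min_kl = np.inf
for _ in range(10_000):
    p = rng.dirichlet(np.ones(5))    # random distribution over 5 states
    q = rng.dirichlet(np.ones(5))    # another random distribution
    min_kl = min(min_kl, kl(p, q))

print(min_kl)      # strictly positive over all random draws with p != q
print(kl(p, p))    # 0.0 when the two distributions coincide
```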

3.3 How to use KL divergence

Note that:

  • we can interpret the KL divergence as a measure of the dissimilarity of the two distributions $p(x)$ and $q(x)$.
  • If we use a distribution $q(x)$ that is different from the true distribution $p(x)$, then we must necessarily have a less efficient coding, and on average the additional information that must be transmitted is (at least) equal to the Kullback-Leibler divergence between the two distributions.

Problem description:

  • Suppose that data is being generated from an unknown distribution $p(x)$ that we wish to model.
  • We can try to approximate this distribution using some parametric distribution $q(x|\theta)$, governed by a set of adjustable parameters $\theta$, for example a multivariate Gaussian.
  • One way to determine $\theta$ is to minimize the KL divergence between $p(x)$ and $q(x|\theta)$ with respect to $\theta$.
  • We cannot do this directly because we don’t know $p(x)$. Suppose, however, that we have observed a finite set of training points $x_n$, for $n = 1, \ldots, N$, drawn from $p(x)$. Then the expectation with respect to $p(x)$ can be approximated by a finite sum over these points, using (1.35), so that $\mathrm{KL}(p\|q) \simeq \dfrac{1}{N}\sum_{n=1}^{N}\{-\ln q(x_n|\theta) + \ln p(x_n)\}$ (1.119). (This follows by writing $\mathrm{KL}(p\|q) = \mathbb{E}_{p(x)}[-\ln q(x|\theta) + \ln p(x)]$ and replacing the expectation by the sample average, as in (1.35).)
  • The first term is the negative log likelihood function for $\theta$ under the distribution $q(x|\theta)$ evaluated using the training set; the second term is independent of $\theta$.
  • Thus we see that minimizing this KL divergence is equivalent to maximizing the likelihood function (illustrated in the sketch after this list).
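A sketch of this equivalence (plain Python/NumPy assumed; the data and parameter perturbations are arbitrary): the $\theta$-dependent part of the approximated KL in (1.119) is the average negative log likelihood, and it is smallest at the maximum likelihood estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.5, size=1_000)   # samples from the "unknown" p(x)

def avg_neg_log_likelihood(mu, sigma2, x):
    """Average negative log likelihood under a Gaussian q(x | mu, sigma2);
    this is the theta-dependent part of the approximated KL in (1.119)."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / (2 * sigma2))

mu_mle, sigma2_mle = data.mean(), data.var()        # maximum likelihood estimates

print(avg_neg_log_likelihood(mu_mle, sigma2_mle, data))         # smallest value
print(avg_neg_log_likelihood(mu_mle + 0.5, sigma2_mle, data))   # larger
print(avg_neg_log_likelihood(mu_mle, 2.0 * sigma2_mle, data))   # larger
```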

3.4 Mutual information

Now consider the joint distribution $p(x, y)$ between two sets of variables $x$ and $y$.

Mutual information between the variables $x$ and $y$:

  • If $x$ and $y$ are independent, $p(x, y) = p(x)p(y)$.
  • If $x$ and $y$ are not independent, we can gain some idea of whether they are “close” to being independent by considering the KL divergence between the joint distribution and the product of the marginals, given by $I[x, y] \equiv \mathrm{KL}(p(x, y)\|p(x)p(y)) = -\iint p(x, y)\ln\dfrac{p(x)p(y)}{p(x, y)}\,dx\,dy$ (1.120), which is called the mutual information between the variables $x$ and $y$.
  • Using the sum and product rules of probability, we see that the mutual information is related to the conditional entropy
    through $I[x, y] = H[x] - H[x|y] = H[y] - H[y|x]$ (1.121).

Understanding of Mutual information:

  • Thus we can view the mutual information as the reduction in the uncertainty about $x$ by virtue of being told the value of $y$ (or vice versa).
  • From a Bayesian perspective, we can view $p(x)$ as the prior distribution for $x$ and $p(x|y)$ as the posterior distribution after we have observed new data $y$. The mutual information therefore represents the reduction in uncertainty about $x$ as a consequence of the new observation $y$ (a numeric sketch follows this list).
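A discrete sketch (plain Python/NumPy assumed; the 2×2 joint table is arbitrary) that computes the mutual information both as the KL divergence in (1.120) and via the entropy identity (1.121).

```python
import numpy as np

p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.55]])     # arbitrary joint distribution p(x, y)

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# I[x, y] as the KL divergence between the joint and the product of marginals (1.120)
I_kl = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))

# I[x, y] = H[y] - H[y|x] (1.121)
H_y_given_x = -np.sum(p_xy * np.log(p_xy / p_x[:, None]))
I_entropy = H(p_y) - H_y_given_x

print(I_kl, I_entropy)    # the two values agree and are non-negative
```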

Reference

[1]: Section 2.8.2, Page 57, Kevin P. Murphy. 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.

 
