Tree - Information Theory

This will be a series of posts about tree models and related ensemble methods, including but not limited to Random Forest, AdaBoost, Gradient Boosting, and XGBoost.

I will start with some basics of information theory, which is an important piece of tree models. For related topics, I highly recommend the tutorial slides from Andrew Moore.

What is information?

Andrew uses a communication system to explain information. Suppose we want to transmit a sequence made up of 4 characters (ABCDADCB...) using binary code (0 and 1). How many bits do we need to encode each character?

The takeaway here is that the more bits you need, the more information the message contains.

I think the first encoding that comes to mind is the following:
A = 00, B = 01, C = 10, D = 11. So on average, 2 bits are needed for each character.

Can we use fewer bits on average?

Yes! As long as these 4 characters are not uniformly distributed.

Really? Let's formulate the problem using expectation.

\[ E(N) = \sum_{k \in \{A,B,C,D\}} n_k \cdot p(x=k) \]

where P( x=k ) is the probability of character k in the whole sequence, and n_k is the number of bits needed to encode k. For example, with P( x=A ) = 1/2, P( x=B ) = 1/4, P( x=C ) = 1/8, P( x=D ) = 1/8, the characters can be encoded in the following way: A=0, B=10, C=110, D=111. No codeword is a prefix of another, so the bit stream can be decoded unambiguously.

Basically, we can take advantage of the probabilities and assign shorter codes to more probable characters. Now our average is 1/2 * 1 + 1/4 * 2 + 1/8 * 3 + 1/8 * 3 = 1.75 bits < 2!
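The expectation above is easy to check numerically. The following is a minimal sketch that only uses the probabilities and code lengths from the example; the names `probs`, `fixed_len`, `variable_len`, and `expected_bits` are introduced here for illustration.

```python
# Probabilities of each character, taken from the example above.
probs = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}

# Code lengths in bits: the 2-bit fixed code vs. the variable-length code.
fixed_len = {"A": 2, "B": 2, "C": 2, "D": 2}
variable_len = {"A": 1, "B": 2, "C": 3, "D": 3}

def expected_bits(lengths, probs):
    """E(N) = sum over k of n_k * P(x = k)."""
    return sum(lengths[k] * probs[k] for k in probs)

avg_fixed = expected_bits(fixed_len, probs)
avg_variable = expected_bits(variable_len, probs)
print(avg_fixed, avg_variable)  # 2.0 1.75
```

If the four characters were equally likely, the same function would return 2.5 bits for the variable-length code, which is why the gain depends on the distribution being non-uniform.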

Do you notice any other pattern here?

The number of bits needed for each character is related to its probability: bits = -log( p ).
Here log has base 2, due to the binary encoding.

We can understand this from 2 angles:

How many values can n bits represent? $2^n$. If all values are equally likely, each has probability $p = 1/2^n$, leading to n = log(1/p) = -log(p).
Transmitting 2 characters independently: P( x1=A, x2=B ) = P( x1=A ) * P( x2=B ), while N( x1, x2 ) = N( x1 ) + N( x2 ), where N(x) is the number of bits. Probabilities multiply while bits add, so probability and information are linked via the log.
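Both angles can be checked with a few lines of Python. This is a small sketch, assuming the example probabilities from above; the helper name `bits` is made up for illustration.

```python
import math

def bits(p):
    """Information content, in bits, of an event with probability p."""
    return -math.log2(p)

probs = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}

# -log2(p) reproduces the code lengths 1, 2, 3, 3 used in the example.
code_lengths = {k: bits(p) for k, p in probs.items()}
print(code_lengths)

# Independent characters: probabilities multiply, bits add.
joint = probs["A"] * probs["B"]
assert math.isclose(bits(joint), bits(probs["A"]) + bits(probs["B"]))
```

The lengths come out as whole numbers only because every probability here is a power of 1/2; for other distributions -log2(p) is generally not an integer.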

In summary, let's use H( X ) to represent the average information of X, which is also known as entropy.

When X is discrete, $H(X) = -\sum_i p_i \log_2 p_i$
When X is continuous, $H(X) = -\int_x p(x) \log_2 p(x) \, dx$
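For the discrete case, the formula translates almost directly into code. Below is a minimal sketch, assuming the usual convention that terms with $p_i = 0$ contribute nothing; the function name `entropy` is chosen here for illustration.

```python
import math

def entropy(probs):
    """H(X) = -sum_i p_i * log2(p_i), skipping terms with p_i == 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_skewed = entropy([1/2, 1/4, 1/8, 1/8])   # distribution from the encoding example
h_uniform = entropy([1/4, 1/4, 1/4, 1/4])  # four equally likely characters
print(h_skewed, h_uniform)  # 1.75 2.0
```

The skewed distribution gives 1.75 bits, matching the average code length found earlier, and the uniform one gives the full 2 bits.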

Deeper Dive into Entropy

1. Intuition of Entropy

I like the way Bishop describes entropy in the book Pattern Recognition and Machine Learning: entropy is 'how big the surprise is'. In the following post on tree models, people prefer to use the term 'impurity'.

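Therefore, if X is a random variable, the more spread out its distribution is, the higher its entropy. For example, a uniform distribution over four outcomes has an entropy of 2 bits, while a distribution that puts all its mass on a single outcome has an entropy of 0.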
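To make the 'spread' intuition concrete, here is a small sketch that compares a few distributions over four outcomes, from fully concentrated to uniform. The intermediate distributions are made up purely for illustration.

```python
import math

def entropy(probs):
    """Entropy in bits; p * log2(1/p) equals -p * log2(p), and p = 0 terms are skipped."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Illustrative distributions, ordered from least to most spread out.
distributions = {
    "all mass on one outcome": [1.0, 0.0, 0.0, 0.0],
    "strongly peaked": [0.7, 0.1, 0.1, 0.1],
    "mildly peaked": [0.4, 0.3, 0.2, 0.1],
    "uniform": [0.25, 0.25, 0.25, 0.25],
}

entropies = {name: entropy(p) for name, p in distributions.items()}
for name, h in entropies.items():
    print(f"{name:>24}: {h:.3f} bits")
```

The printed entropies increase from 0 bits for the concentrated case up to 2 bits for the uniform case.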

2. Conditional Entropy

When learning probability, we first learn how to calculate probability and joint probability, and then come to conditional probability. In the same way, let's now discuss conditional entropy.

H( Y | X ) answers the question: given X, how surprising is Y now? If X and Y are independent, then H( Y | X ) = H( Y ) (no reduction in surprise). From the relationship between probability and entropy, we get the following:
\[P(X,Y) = P(Y|X) * P(X)\]

\[H(X,Y) = H(Y|X) + H(X)\]

The equation above can also be proved directly from the definition of entropy. Give it a try! Now let's go through an example from Andrew's tutorial to see what conditional entropy actually is.

X = college Major, Y = Like 'Gladiator'

X	Y
Math	YES
History	NO
CS	YES
Math	NO
Math	NO
CS	YES
History	NO
Math	YES

Let's compute the entropy of Y and the conditional entropy of Y given X from this table:
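As a worked check, here is the computation spelled out from the table. It assumes the convention $0 \cdot \log_2 0 = 0$ and uses the weighted-average form of conditional entropy for discrete X (the same form summarized later in this post). Y has 4 YES and 4 NO in total; Math has 2 YES and 2 NO, History has 2 NO, and CS has 2 YES.

\[H(Y) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1\]

\[H(Y|X=\text{Math}) = 1, \quad H(Y|X=\text{History}) = 0, \quad H(Y|X=\text{CS}) = 0\]

\[H(Y|X) = \tfrac{1}{2} \cdot 1 + \tfrac{1}{4} \cdot 0 + \tfrac{1}{4} \cdot 0 = 0.5\]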

Here we see H( Y | X ) = 0.5 < H( Y ) = 1, meaning that knowing X reduces our uncertainty about Y.
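The relation H(X,Y) = H(Y|X) + H(X) can also be checked numerically on this table. The sketch below estimates every entropy from the empirical frequencies of the 8 rows; the helper `entropy` and the variable names are introduced here for illustration.

```python
import math
from collections import Counter

# (Major, likes 'Gladiator') rows copied from the table above.
rows = [
    ("Math", "YES"), ("History", "NO"), ("CS", "YES"), ("Math", "NO"),
    ("Math", "NO"), ("CS", "YES"), ("History", "NO"), ("Math", "YES"),
]

def entropy(values):
    """Empirical entropy (bits) of a list of hashable outcomes."""
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in Counter(values).values())

xs = [x for x, _ in rows]
ys = [y for _, y in rows]

h_x = entropy(xs)     # H(X)
h_y = entropy(ys)     # H(Y)
h_xy = entropy(rows)  # H(X, Y): each (x, y) pair is treated as one outcome

# H(Y | X) as the weighted average of H(Y | X = v) over the values v of X.
h_y_given_x = sum(
    (count / len(rows)) * entropy([y for x, y in rows if x == v])
    for v, count in Counter(xs).items()
)

print(h_x, h_y, h_xy, h_y_given_x)  # 1.5 1.0 2.0 0.5
assert math.isclose(h_xy, h_y_given_x + h_x)
```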

When X is continuous, conditional entropy can be derived in the following way:

We draw ( x , y ) from the joint distribution P( x , y ). Given x, the additional information in y is -log( P( y | x ) ). Taking the expectation over the joint distribution, we get:

\[H(Y|X) = \int_y\int_x{ - p(y,x)\log{p(y|x)} dx dy} =\int_x{H(Y|x)p(x) dx} \]
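The step between the two sides of this equation can be filled in as follows. This is a sketch that only uses the factorization $p(y,x) = p(y|x)p(x)$ and writes $H(Y|x)$ for the entropy of the conditional distribution $p(y|x)$ at a fixed x:

\[H(Y|x) = -\int_y{p(y|x)\log{p(y|x)} dy}\]

\[\int_y\int_x{ - p(y,x)\log{p(y|x)} dx dy} = \int_x{p(x) \left( -\int_y{p(y|x)\log{p(y|x)} dy} \right) dx} = \int_x{H(Y|x)p(x) dx}\]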

In summary

When X is discrete, $H(Y|X) = \sum_j{ H(Y|x=v_j) p(x=v_j)}$
When X is continuous, $H(Y|X) = \int_x{ H(Y|x)p(x) dx}$

3. Information Gain

If we follow the logic above, information gain is the reduction of surprise in Y given X. So can you guess how IG is defined?

IG = H( Y ) - H( Y | X )

In our example above, IG = 1 - 0.5 = 0.5. Information gain will be used frequently in the next topic, tree models, because each tree split aims at lowering the 'surprise' in Y, where the ideal case is that Y is constant in each leaf. Therefore, a split with higher information gain is preferred.
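To illustrate how a tree might use this, the sketch below scores candidate splits on the 'Gladiator' table by information gain. It compares the multiway split on Major with the three binary splits of the form "Major == m" vs. the rest. This is only an illustration under those assumptions, not a full tree-building algorithm, and the function names are made up here.

```python
import math
from collections import Counter

# (Major, likes 'Gladiator') rows copied from the example table.
rows = [
    ("Math", "YES"), ("History", "NO"), ("CS", "YES"), ("Math", "NO"),
    ("Math", "NO"), ("CS", "YES"), ("History", "NO"), ("Math", "YES"),
]

def entropy(labels):
    """Empirical entropy (bits) of a list of labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(groups):
    """IG = H(Y) - sum_j (|group_j| / N) * H(Y within group_j)."""
    all_labels = [y for g in groups for y in g]
    n = len(all_labels)
    h_cond = sum(len(g) / n * entropy(g) for g in groups if g)
    return entropy(all_labels) - h_cond

majors = sorted({x for x, _ in rows})

# Multiway split: one child per major.
ig_multiway = information_gain([[y for x, y in rows if x == m] for m in majors])

# Binary splits: "Major == m" on one side, everything else on the other.
ig_binary = {
    m: information_gain([
        [y for x, y in rows if x == m],
        [y for x, y in rows if x != m],
    ])
    for m in majors
}

print(f"multiway: {ig_multiway:.3f}")
for m, ig in ig_binary.items():
    print(f"Major == {m}: {ig:.3f}")
```

The multiway split recovers IG = 0.5. Among the binary splits, separating Math from the rest gains nothing, because both sides stay half YES and half NO, while isolating History or CS produces one pure child and therefore a positive gain.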

So far, most of what is needed for tree models has been covered. If you are still with me, let's talk about a few other interesting topics related to information theory.