Entropy
In information theory, entropy measures the uncertainty of a random variable. The entropy $H(Y)$ of a random variable $Y$ with $n$ different possible values is defined as:
$$
H(Y) = -\sum_{i=1}^{n} P(y_i)\log_2 P(y_i)
$$
where $P(y_i)$ is the probability that the random variable $Y$ equals $y_i$, one of the $n$ possible values of $Y$. Here $n$ is the number of distinct outcomes; for example, if the possible outcomes are 1, 2, and 3, then $n = 3$, and if those outcomes are equally likely, each has probability $P(y_i) = 1/3$.
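For illustration, here is a minimal Python sketch of this formula; the `entropy` helper name is just illustrative and not from any particular library:

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Three equally likely outcomes (1, 2, 3), each with probability 1/3:
print(round(entropy([1/3, 1/3, 1/3]), 3))  # log2(3) ≈ 1.585 bits
```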
Example
$X_1$ | $X_2$ | Y |
---|---|---|
T | F | T |
T | T | T |
F | F | F |
F | T | T |
T | F | F |
$Y$ has two values, T and F, so $n = 2$. $Y$ equals T in 3 of the 5 rows and F in the other 2, so $P(Y=T) = 3/5$ and $P(Y=F) = 2/5$. Therefore, we can get:
$$
H(Y) = -\sum_{i=1}^{n} P(y_i)\log_2 P(y_i)
$$
$$
H(Y) = -\frac{3}{5}\log_2\frac{3}{5}-\frac{2}{5}\log_2\frac{2}{5} \approx 0.971
$$
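The same value can be checked numerically; this small snippet simply plugs the two probabilities into the formula:

```python
import math

p_t, p_f = 3/5, 2/5  # P(Y=T) and P(Y=F) from the table
h_y = -(p_t * math.log2(p_t) + p_f * math.log2(p_f))
print(round(h_y, 3))  # 0.971
```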
Conditional entropy
The conditional entropy $H(Y|X=x_j)$ of $Y$ given $X=x_j$ is:
$$
H(Y|X=x_j) = -\sum_{i=1}^{n} P(y_i|X=x_j)\log_2 P(y_i|X=x_j)
$$
After the split, the overall conditional entropy $H(Y|X)$ is the average of these per-value entropies, weighted by the probability of each of the $k$ possible values of $X$:
$$
H(Y|X) = \sum_{j=1}^{k} P(x_j)\,H(Y|X=x_j)
$$
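Below is a minimal Python sketch of how $H(Y|X)$ could be estimated from a list of $(x, y)$ samples; the `entropy` and `conditional_entropy` names are illustrative, not from any library. Applied to the $(X_1, Y)$ pairs of the example that follows, it reproduces the weighted entropy derived step by step below:

```python
import math
from collections import Counter, defaultdict

def entropy(probabilities):
    # H = -sum(p * log2(p)), skipping zero-probability terms
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def conditional_entropy(pairs):
    """H(Y|X) estimated from a list of (x, y) samples."""
    by_x = defaultdict(list)
    for x, y in pairs:
        by_x[x].append(y)
    n = len(pairs)
    total = 0.0
    for ys in by_x.values():
        probs = [c / len(ys) for c in Counter(ys).values()]
        total += (len(ys) / n) * entropy(probs)  # P(x_j) * H(Y|X=x_j)
    return total

# (X1, Y) pairs from the example table below:
pairs = [("T", "T"), ("T", "T"), ("F", "F"), ("F", "T"), ("T", "F")]
print(round(conditional_entropy(pairs), 3))  # 0.951
```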
Example
$X_1$ | $X_2$ | Y |
---|---|---|
T | F | T |
T | T | T |
F | F | F |
F | T | T |
T | F | F |
Suppose we split the data based on the value of $X_1$. $X_1$ has two possible values: T and F. We can compute the conditional entropy for $X_1=T$ and $X_1=F$.
Compute $X_1=T$
$X_1$ | Y |
---|---|
T | T |
T | T |
T | F |
After splitting, the rows with $X_1=T$ contain two values of $Y$: T in 2 of the 3 rows and F in 1. Therefore, $P(Y=T|X_1=T) = 2/3$ and $P(Y=F|X_1=T) = 1/3$.
We have the following entropy when $X_1=T$:
$$
H(Y|X_1=T) = -\sum_{i=1}^{n} P(y_i|X_1=T)\log_2 P(y_i|X_1=T)
$$
$$
H(Y|X_1=T) = -\frac{2}{3}\log_2\frac{2}{3}-\frac{1}{3}\log_2\frac{1}{3} \approx 0.918
$$
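This value can be checked with a couple of lines of Python:

```python
import math

# Y is T in 2 of the 3 rows with X1 = T, and F in the remaining row.
h_t = -(2/3 * math.log2(2/3) + 1/3 * math.log2(1/3))
print(round(h_t, 3))  # 0.918
```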
Compute $X_1=F$
$X_1$ | Y |
---|---|
F | T |
F | F |
After splitting, the rows with $X_1=F$ contain two values of $Y$: T in 1 of the 2 rows and F in the other. Therefore, $P(Y=T|X_1=F) = 1/2$ and $P(Y=F|X_1=F) = 1/2$.
We have the following entropy when $X_1=F$:
$$
H(Y|X_1=F) = -\sum_{i=1}^{n} P(y_i|X_1=F)\log_2 P(y_i|X_1=F)
$$
$$
H(Y|X_1=F) = -\frac{1}{2}\log_2\frac{1}{2}-\frac{1}{2}\log_2\frac{1}{2} = 1
$$
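And the same quick check for the $X_1=F$ branch:

```python
import math

# Y is T in 1 of the 2 rows with X1 = F, and F in the other.
h_f = -(1/2 * math.log2(1/2) + 1/2 * math.log2(1/2))
print(h_f)  # 1.0
```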
Entropy after split
$X_1$ | Y |
---|---|
T | T |
T | T |
T | F |
F | T |
F | F |
From the table, $X_1=T$ in 3 of the 5 rows and $X_1=F$ in the other 2, so $P(X_1=T)=3/5$ and $P(X_1=F)=2/5$. The overall conditional entropy is:
$$
H(Y|X_1) = \frac{3}{5}\times 0.918+\frac{2}{5}\times 1 \approx 0.951
$$
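This weighted average can also be verified directly from the two branch entropies computed above:

```python
# Weight each branch entropy by the probability of that branch of X1.
h_y_given_x1 = 3/5 * 0.918 + 2/5 * 1.0
print(round(h_y_given_x1, 3))  # 0.951
```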
Information Gain
Information gain $I(X,Y)$ is defined as the expected reduction in the entropy of the target variable $Y$ after splitting on variable $X$.
$$
I(X,Y)=H(Y)-H(Y|X)
$$
In the previous example, the information gain from splitting on $X_1$ is:
$$
I(X_1,Y)=H(Y)-H(Y|X_1) \approx 0.971-0.951 = 0.020
$$
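Putting everything together, here is a self-contained Python sketch that computes the information gain directly from the example table; `entropy` and `information_gain` are illustrative helper names, not from any particular library. For comparison it also prints the gain for a split on $X_2$, which turns out to be higher on this data:

```python
import math
from collections import Counter, defaultdict

# (X1, X2, Y) rows from the example table.
rows = [("T", "F", "T"), ("T", "T", "T"), ("F", "F", "F"),
        ("F", "T", "T"), ("T", "F", "F")]

def entropy(values):
    """Empirical Shannon entropy (bits) of a list of labels."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def information_gain(rows, feature_index, target_index=2):
    """I(X, Y) = H(Y) - H(Y|X), estimated from the rows."""
    y = [r[target_index] for r in rows]
    groups = defaultdict(list)
    for r in rows:
        groups[r[feature_index]].append(r[target_index])
    h_cond = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(y) - h_cond

print(round(information_gain(rows, 0), 3))  # split on X1: 0.02
print(round(information_gain(rows, 1), 3))  # split on X2: 0.42
```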