Information entropy, defined as easy as pie!
- By Alexei Kurgansky
- Claude Shannon, Cross entropy, Entropy, Information entropy, Kullback-Leibler divergence
Let’s assume that in summer we want to send our friends in St. Petersburg a telegram from Hamburg every day about the weather conditions: sunny, partly cloudy, overcast, rainy. The weather is a random event for us; however, we expect sunny weather on about $1/2$ of the summer days, partly cloudy on $1/4$ of the days, overcast on $1/8$ of the days and rainy on $1/8$ of the days. Only zeros ($0$) and ones ($1$) can be sent by telegraph, and one symbol costs $1$ Euro. If the messages are encoded with the symbol pairs $00$, $01$, $10$ and $11$, two bits are sent every day at a cost of $2$ Euro, and the telegrams cost $184$ Euro in total for the $92$ summer days. If instead you code the messages “sunny”, “partly cloudy”, “overcast”, “rainy” with $0$, $10$, $110$, $111$, then on average
$$ (1/2)\cdot 1 + (1/4)\cdot 2 + (1/8)\cdot 3 + (1/8)\cdot 3 = 7/4 $$
bits are sent per day, and for $92$ summer days about $7/4 \cdot 1 \cdot 92 = 161$ Euro are spent. You can see the difference: some codes are more expensive and some are cheaper, and it is obvious that there is a cheapest code. If the coding is so clever that it incurs the lowest total cost for the weather telegrams over the summer, then the average number of bits sent per day is called the information entropy of this event. The generalization of the definition to arbitrary random events is now simple.
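To make the cost comparison concrete, here is a minimal Python sketch of the calculation above; the probabilities and code lengths come from the example, while the function and variable names are my own illustration:

```python
# Average bits per day and total summer cost (1 Euro per bit, 92 days)
# for the two codings from the telegram example.
probs = {"sunny": 1/2, "partly cloudy": 1/4, "overcast": 1/8, "rainy": 1/8}

fixed_lengths = {w: 2 for w in probs}                  # 00, 01, 10, 11
variable_lengths = {"sunny": 1, "partly cloudy": 2,    # 0, 10, 110, 111
                    "overcast": 3, "rainy": 3}

def average_bits(probs, lengths):
    """Expected number of bits sent per day."""
    return sum(probs[w] * lengths[w] for w in probs)

for name, lengths in [("fixed", fixed_lengths), ("variable", variable_lengths)]:
    avg = average_bits(probs, lengths)
    print(f"{name}: {avg:.2f} bits/day, {avg * 92:.0f} Euro for 92 days")
# fixed: 2.00 bits/day, 184 Euro for 92 days
# variable: 1.75 bits/day, 161 Euro for 92 days
```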
How do you calculate information entropy?
This question is related to the question of what coding is. Assume that there is a territory of messages, represented by the first rectangle in the figure, and let the area of the entire territory be equal to $1$. If the territory is divided into two parts, each part is given the code name (coordinate) $0$ or $1$, as shown in the second rectangle; the areas of $0$ and $1$ are equal to $1/2$. If each of the regions $0$ and $1$ is in turn divided into two parts, these smaller regions receive the codes $00$, $01$, $10$, $11$, as shown in the third rectangle, and their areas are equal to $1/4$. This shows what the codes look like for regions of area $1/8$, $1/16$, etc. It can be seen that the area of a region of the message territory is equal to $\frac{1}{2^n}$, where $n$ is the length of its code. In other words: if $q$ is the area of a region, its code length is $\log_2 \frac{1}{q}=-\log_2 q$. The larger the area, the shorter the code. The fourth and fifth rectangles show examples of the placement of messages over the message territory.
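The relationship between the area of a region and its code length can be checked directly; this little snippet only restates $-\log_2 q$ for the areas mentioned above:

```python
import math

# Halving the area of a region adds one bit to its code:
# a region of area q gets a code of length -log2(q).
for q in [1/2, 1/4, 1/8, 1/16]:
    print(f"area {q}: code length {-math.log2(q):.0f} bits")
# area 0.5: code length 1 bits
# area 0.25: code length 2 bits
# area 0.125: code length 3 bits
# area 0.0625: code length 4 bits
```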
Let $N$ messages be given, such that the $i$th message has probability $p_i$ and occupies a region of the message territory with area $q_i$, and hence a code of length $-\log_2 q_i$. The average number of bits sent daily with this coding is
$$ -p_1\cdot \log_2 q_1 -\ldots -p_N\cdot \log_2 q_N. $$
For which values of $q_i$ is this sum minimal? Recall that $$ p_1 +\ldots+p_N = 1 $$ and that there is the side condition $$ q_1 +\ldots+q_N = 1. $$ Let’s solve the problem with Lagrange multipliers, keeping in mind that the $p_i$ are constants and the $q_i$ are variables. The Lagrange function is $$ \Lambda(q_1,\ldots,q_N,\lambda) =-p_1\cdot \log_2 q_1 -\ldots-p_N\cdot \log_2 q_N + \lambda\cdot (q_1+\ldots+q_N-1). $$ Let’s take the derivatives of $\Lambda$ with respect to $q_i$ and set them equal to $0$: $$ -\frac{p_i}{q_i\cdot \ln 2} + \lambda = 0. $$ Reminder: $$ \frac{d}{dx}\log_2x = \frac{d}{dx} \frac{\ln x}{\ln 2} = \frac{1}{x \ln 2}. $$ From this follows $$ p_i = \lambda\cdot q_i\cdot\ln 2. $$ If you sum these equations over all $i$ and use $p_1+\ldots+p_N = 1$ and $q_1+\ldots+q_N = 1$, you get $$ \lambda\cdot\ln 2 = 1. $$ Inserting $\lambda\cdot \ln 2 = 1$ into $p_i = \lambda\cdot q_i\cdot\ln 2$ gives $p_i = q_i$. Optimal coding is therefore achieved when $q_i = p_i$, which means that the entropy $H$ is given by the formula $$ H = -p_1\cdot \log_2 p_1 -\ldots-p_N\cdot \log_2 p_N. $$
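The Lagrange argument can also be checked numerically. The following sketch, assuming NumPy and SciPy are available (the variable names are mine), minimizes the average bit count over the areas $q_i$ under the constraint that they sum to $1$, and indeed lands at $q_i = p_i$:

```python
import numpy as np
from scipy.optimize import minimize

# Probabilities from the telegram example.
p = np.array([1/2, 1/4, 1/8, 1/8])

def avg_bits(q):
    """Average bits per message when regions of area q carry messages with probabilities p."""
    return -np.sum(p * np.log2(q))

result = minimize(
    avg_bits,
    x0=np.full(4, 0.25),                                  # start from uniform areas
    bounds=[(1e-9, 1)] * 4,                               # keep every q_i positive
    constraints=[{"type": "eq", "fun": lambda q: q.sum() - 1}],
)
print(np.round(result.x, 3))   # ~ [0.5, 0.25, 0.125, 0.125], i.e. q = p
print(round(result.fun, 4))    # ~ 1.75, the entropy H of the distribution p
```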
Cross entropy
Note that the $q_i$ behave like probabilities: they are greater than $0$ and their sum is equal to $1$. The quantity $$ H(p,q)=-p_1\cdot \log_2 q_1 -\ldots-p_N\cdot \log_2 q_N $$ is called the cross entropy of the two probability distributions $(p_i)$ and $(q_i)$, even though the $q_i$ do not actually play the role of probabilities here. The value $H(p,q)$ gives the average bit cost for transmitting messages that are coded according to the areas $q_i$ but occur with probabilities $p_i$.
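As a direct translation of the formula into code, a cross-entropy function might look like this (a sketch with my own names, using the probabilities from the telegram example):

```python
import math

def cross_entropy(p, q):
    """H(p, q): average bits when coding for areas q while messages occur with probabilities p."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

p = [1/2, 1/4, 1/8, 1/8]    # actual message probabilities
q = [1/4, 1/4, 1/4, 1/4]    # areas the (suboptimal) fixed-length code was built for

print(cross_entropy(p, q))  # 2.0  -> the fixed-length code 00, 01, 10, 11
print(cross_entropy(p, p))  # 1.75 -> coding with q = p recovers the entropy
```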
Kullback-Leibler divergence
The difference between $$ -p_1\cdot \log_2 q_1 -\ldots-p_N\cdot \log_2 q_N $$ and $$ -p_1\cdot \log_2 p_1 -\ldots-p_N\cdot \log_2 p_N $$ is the average number of additional bits spent on message transmission due to suboptimal coding. This quantity is called the Kullback-Leibler divergence between the distributions $(p_i)$ and $(q_i)$.
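In code, the Kullback-Leibler divergence is just this difference; the sketch below (my own names, same example distributions as above) computes it directly:

```python
import math

def kl_divergence(p, q):
    """D(p || q) = H(p, q) - H(p): average extra bits caused by coding for q instead of p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

p = [1/2, 1/4, 1/8, 1/8]
q = [1/4, 1/4, 1/4, 1/4]

print(kl_divergence(p, q))  # 0.25 = 2.0 - 1.75 extra bits per message
print(kl_divergence(p, p))  # 0.0  -> no penalty when the code matches the distribution
```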
One remark
For the sake of simplicity, let us now assume that there are only $2$ weather conditions in summer: good weather $G$ and bad weather $S$. Let us also assume the probabilities $P(G)=3/4$, $P(S)=1/4$. Accordingly, the information entropy of the weather is equal to $$ -\frac{3}{4}\cdot \log_2 \frac{3}{4} -\frac{1}{4}\cdot \log_2 \frac{1}{4} \approx 0.8113. $$ For the optimal coding of the event $S$ you need $-\log_2 \frac{1}{4}=2$ bits. Everything is fine so far. However, for $G$ you need about $-\log_2 \frac{3}{4}\approx 0.415$ bits. This is less than $1$ bit and does not match reality, since by definition only whole symbols $0$ and $1$ can be written in the telegrams. There is no way around it: we have to encode both events with $1$ bit each, $0$ and $1$, just as in the case $P(G)=P(S)=1/2$. So if we want to send a telegram every day, we cannot save money and are forced to telegraph $92$ bits over $92$ days. The average cost is then $1$ bit per day, which seems to contradict the theory.
However, you can actually save money by sending a single telegram with the weather report for all days, not every day but only at the end of the summer. You can see this in an example with $2$ days. With $P(G)=P(S)=1/2$ the probabilities of all possible $2$-day variants are the same: $$ P(GG)=P(GS)=P(SG)=P(SS)=1/4. $$ The information entropy is $2$ for $2$-day weather events and $1$ for $1$-day weather events; the weather entropy per day is therefore $1$. In the case $P(G)=3/4$, $P(S)=1/4$ we have $P(GG)=9/16$, $P(GS)=P(SG)=3/16$, $P(SS)=1/16$. Optimal coding of the event $GG$ requires approximately $-\log_2 \frac{9}{16} \approx 0.83$ bits. Similarly, the events $GS$, $SG$, $SS$ require about $2.415$, $2.415$ and $4$ bits respectively. An optimal coding looks like this: $GG\rightarrow 0$, $GS\rightarrow 10$, $SG\rightarrow 110$, $SS\rightarrow 111$. With this coding, the average report for both days costs only $$ \frac{9}{16}\cdot 1 + \frac{3}{16}\cdot 2 + \frac{3}{16}\cdot 3 + \frac{1}{16}\cdot 3 = 1.6875 \quad (\text{bits}) $$ and for one day: $$ \frac{1.6875}{2} = 0.84375 \quad (\text{bits}). $$ The average cost per day with this coding is $0.84375$ bits. Already cheaper. The longer the sequence of events that is coded at once, the closer the coding gets to optimal. As the block length goes to infinity, the average cost per day approaches the theoretical information entropy value in practice.
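The block-coding idea from this remark can be simulated. The sketch below, assuming only the Python standard library, builds a Huffman code (an optimal whole-bit prefix code, which for the $2$-day case reproduces the code $GG\rightarrow 0$, $GS\rightarrow 10$, $SG\rightarrow 110$, $SS\rightarrow 111$ given above) for blocks of $n$ days and shows the cost per day moving toward the entropy $\approx 0.8113$:

```python
import heapq
import itertools
import math

def huffman_lengths(probs):
    """Code lengths of an optimal prefix code for the given block probabilities."""
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    lengths = [0] * len(probs)
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least probable nodes; every leaf below them gets one bit longer.
        p1, _, leaves1 = heapq.heappop(heap)
        p2, j, leaves2 = heapq.heappop(heap)
        for idx in leaves1 + leaves2:
            lengths[idx] += 1
        heapq.heappush(heap, (p1 + p2, j, leaves1 + leaves2))
    return lengths

p_day = {"G": 3/4, "S": 1/4}
entropy = -sum(p * math.log2(p) for p in p_day.values())
print(f"entropy per day: {entropy:.4f} bits")            # 0.8113

for n in [1, 2, 4, 8]:
    block_probs = [math.prod(p_day[c] for c in block)
                   for block in itertools.product("GS", repeat=n)]
    lengths = huffman_lengths(block_probs)
    cost_per_day = sum(p * l for p, l in zip(block_probs, lengths)) / n
    print(f"blocks of {n} day(s): {cost_per_day:.5f} bits/day")
# n = 1 gives 1.00000, n = 2 gives 0.84375, and longer blocks edge closer to 0.8113.
```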