Information entropy defined as easy as pie!

Let’s assume that in summer we want to send our friends a telegram from Hamburg to St. Petersburg every day about the weather conditions: sunny, slightly cloudy, overcast, rainy. The weather is a random event for us, but we expect sunny weather on about $1/2$ of the summer days, slightly cloudy on $1/4$, overcast on $1/8$ and rainy on $1/8$. Only zeros ($0$) and ones ($1$) can be sent by telegraph, and one symbol costs $1$ Euro. If the messages are encoded with the symbol pairs $00$, $01$, $10$ and $11$, two bits are sent every day at a total cost of $2$ Euro, so the telegrams for the $92$ summer days cost $184$ Euro in total. If instead you encode the messages “sunny”, “slightly cloudy”, “overcast”, “rainy” with $0$, $10$, $110$, $111$, then on average every day

$ (1/2)\cdot 1 + (1/4)\cdot 2 + (1/8)\cdot 3 + (1/8)\cdot 3 = 7/4 $

bits are sent, and for the $92$ summer days only about $7/4 \cdot 1 \cdot 92 = 161$ Euro are spent. The difference is clear: some codes are more expensive and some are cheaper, and clearly there is a cheapest one.
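This arithmetic can be checked with a short sketch (Python; the probabilities and code lengths are exactly those of the example, the variable names are mine):

```python
# Expected bits per day for the two telegram encodings in the example.
probs = [1/2, 1/4, 1/8, 1/8]          # sunny, slightly cloudy, overcast, rainy

fixed_lengths = [2, 2, 2, 2]          # codes 00, 01, 10, 11
variable_lengths = [1, 2, 3, 3]       # codes 0, 10, 110, 111

fixed_avg = sum(p * n for p, n in zip(probs, fixed_lengths))
variable_avg = sum(p * n for p, n in zip(probs, variable_lengths))

# At 1 Euro per bit over 92 summer days:
print(fixed_avg, fixed_avg * 92)        # 2.0 bits/day, 184.0 Euro
print(variable_avg, variable_avg * 92)  # 1.75 bits/day, 161.0 Euro
```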
If the coding is so clever that it incurs the lowest total telegram cost for the event reports over the summer, then the average number of bits sent per day is called the information entropy of this event.
The generalization of this definition to arbitrary random events is now straightforward.
How do you calculate information entropy?
This question is related to the question of what a coding actually is. Assume that there is a territory of messages, represented by the first rectangle in the figure, and that the area of the entire territory is equal to $1$. If the territory is divided into two parts, each part is given the code name (coordinate) $0$ or $1$, as shown in the second rectangle; the areas of $0$ and $1$ are equal to $1/2$. If each of the regions $0$ and $1$ is again divided into two parts, these smaller regions receive the codes $00$, $01$, $10$, $11$, as shown in the third rectangle, and their areas are equal to $1/4$. This shows what the codes look like for regions with areas $1/8$, $1/16$ and so on. One sees that the area of a message region is equal to $\frac{1}{2^n}$, where $n$ is the length of its code. In other words: if $q$ is the area of a region, its code length is $\log_2 \frac{1}{q}=-\log_2 q$. The larger the area, the shorter the code. The fourth and fifth rectangles show examples of the placement of messages over the message territory.
(Figure: the message territory and its successive subdivisions)
Let $N$ messages be given, such that the probability of the $i$-th message is $p_i$, and let the $i$-th message occupy a region of the message territory with area $q_i$ and hence a code of length $-\log_2 q_i$. The average number of bits sent daily with this coding is $ -p_1\cdot \log_2 q_1 -\ldots -p_N\cdot \log_2 q_N. $
For which values of $q_i$ is this sum minimal?
Recall that $$ p_1 +\ldots+p_N = 1 $$ and that we have the side condition $$ q_1 +\ldots+q_N = 1. $$ Let’s solve the problem with Lagrange multipliers, keeping in mind that the $p_i$ are constants and the $q_i$ are variables. The Lagrange function is $$ \Lambda(q_1,\ldots,q_N,\lambda) =-p_1\cdot \log_2 q_1 -\ldots-p_N\cdot \log_2 q_N + \lambda\cdot (q_1+\ldots+q_N-1). $$ Setting the derivatives of $\Lambda$ with respect to $q_i$ equal to $0$ gives $$ -\frac{p_i}{q_i\cdot \ln 2} + \lambda = 0. $$ As a reminder, $$ \frac{d}{dx}\log_2 x = \frac{d}{dx} \frac{\ln x}{\ln 2} = \frac{1}{x \ln 2}. $$ From this follows $$ p_i = \lambda\cdot q_i\cdot\ln 2. $$ Summing these equations over all $i$ and using $p_1+\ldots+p_N = q_1+\ldots+q_N = 1$ yields $$ \lambda\cdot\ln 2 = 1. $$ Inserting $\lambda\cdot \ln 2 = 1$ into $p_i = \lambda\cdot q_i\cdot\ln 2$ gives $q_i = p_i$. Optimal coding is therefore achieved when $q_i = p_i$, which means that the formula for the entropy $H$ is as defined: $$ H = -p_1\cdot \log_2 p_1 -\ldots-p_N\cdot \log_2 p_N. $$
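To make the result tangible, here is a small numerical sketch (Python; the helper names `entropy` and `cross_entropy` are my own): coding according to any areas $q \ne p$ costs more on average than coding according to $p$ itself.

```python
import math

def cross_entropy(p, q):
    """Average bits per message: codes follow areas q, messages follow p."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

def entropy(p):
    """Minimal average bits per message, attained at q = p."""
    return cross_entropy(p, p)

p = [1/2, 1/4, 1/8, 1/8]       # the telegram example
H = entropy(p)                 # 1.75 bits, as computed above

# A few other valid area splits q all cost at least H:
for q in ([1/4, 1/4, 1/4, 1/4], [0.4, 0.3, 0.2, 0.1]):
    assert cross_entropy(p, q) >= H
print(H)  # 1.75
```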
Cross entropy
Note that the $q_i$ behave like probabilities: they are greater than $0$ and their sum is equal to $1$. The quantity $$ H(p,q)=-p_1\cdot \log_2 q_1 -\ldots-p_N\cdot \log_2 q_N $$ is called the cross entropy of the two probability distributions $(p_i)$ and $(q_i)$, even though the $q_i$ do not actually play the role of probabilities here. The value $H(p,q)$ gives the average bit cost for transmitting messages that are coded according to the regions $q_i$ but occur with probabilities $p_i$.
Kullback-Leibler divergence
The difference between $$ -p_1\cdot \log_2 q_1 -\ldots-p_N\cdot \log_2 q_N $$ and $$ -p_1\cdot \log_2 p_1 -\ldots-p_N\cdot \log_2 p_N $$ is the average number of additional bits spent on message transmission due to suboptimal coding. This quantity is called the Kullback-Leibler divergence between the distributions $(p_i)$ and $(q_i)$.
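As a sketch (Python; the function name `kl_divergence` is my own), the extra cost of the fixed two-bit coding from the opening example is exactly $2 - 7/4 = 1/4$ bit per day:

```python
import math

def kl_divergence(p, q):
    """Average extra bits per message: coding follows q, messages follow p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

p = [1/2, 1/4, 1/8, 1/8]     # actual weather probabilities
q = [1/4, 1/4, 1/4, 1/4]     # areas implied by the fixed codes 00, 01, 10, 11

print(kl_divergence(p, q))   # 0.25 extra bits per day
print(kl_divergence(p, p))   # 0.0: optimal coding wastes nothing
```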
One remark
For the sake of simplicity, let us now assume that there are only $2$ weather conditions in summer: good weather $G$ and bad weather $S$, with probabilities $P(G)=3/4$, $P(S)=1/4$. Accordingly, the information entropy of the weather is $$ -\frac{3}{4}\cdot \log_2 \frac{3}{4} -\frac{1}{4}\cdot \log_2 \frac{1}{4} \approx 0.8113. $$ For the optimal coding of the event $S$ you need $-\log_2 \frac{1}{4}=2$ bits; so far, so good. For $G$, however, you need about $-\log_2 \frac{3}{4}\approx 0.415$ bits. This is less than $1$ bit and clashes with the reality that, by definition, only whole symbols $0$ and $1$ can be written in the telegrams. There is no way around it: we have to encode both events with $1$ bit each, $0$ and $1$, just as in the case $P(G)=P(S)=1/2$. So if we want to send a telegram every day, we cannot save money and are forced to telegraph $92$ bits in $92$ days. The average cost is then $1$ bit per day, which seems to contradict the entropy value of $0.8113$.

However, you can actually save money by sending a single telegram with a weather report for all days, not daily but only at the end of the summer. You can see this in an example with $2$ days. With $P(G)=P(S)=1/2$ the probabilities of all possible $2$-day variants are equal: $$ P(GG)=P(GS)=P(SG)=P(SS)=1/4. $$ The information entropy is $2$ for the $2$-day weather events, i.e. $1$ per day, so nothing changes here. In the case $P(G)=3/4$, $P(S)=1/4$, however, we have $P(GG)=9/16$, $P(GS)=P(SG)=3/16$, $P(SS)=1/16$. Optimal coding of the event $GG$ requires approximately $-\log_2 \frac{9}{16} \approx 0.83$ bits; similarly, the events $GS$, $SG$, $SS$ need about $2.415$, $2.415$ and $4$ bits respectively. The optimal coding looks like this: $GG\rightarrow 0$, $GS\rightarrow 10$, $SG\rightarrow 110$, $SS\rightarrow 111$.
With this coding, the average report for the two days costs only $$ \frac{9}{16}\cdot 1 + \frac{3}{16}\cdot 2 + \frac{3}{16}\cdot 3 + \frac{1}{16}\cdot 3 = 1.6875 \quad (\text{bits}), $$ i.e. per day: $$ \frac{1.6875}{2} = 0.84375 \quad (\text{bits}). $$ That is already cheaper than $1$ bit per day. The longer the sequence of events that is coded at once, the closer the coding comes to the optimum: as the block length goes to infinity, the average cost per day approaches the theoretical information entropy value of about $0.8113$ in practice.
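The convergence can be checked numerically by building an optimal prefix code (Huffman coding, of which the hand-constructed codes above are instances) for blocks of $n$ days. A sketch in Python; the helper `huffman_avg_length` is my own:

```python
import heapq
import math
from itertools import product

p_good, p_bad = 3/4, 1/4
H = -(p_good * math.log2(p_good) + p_bad * math.log2(p_bad))  # ~ 0.8113

def huffman_avg_length(probs):
    """Average code length of an optimal prefix (Huffman) code for probs."""
    # Heap entries: (probability, tie-breaker id, accumulated average length).
    heap = [(p, i, 0.0) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    next_id = len(probs)
    while len(heap) > 1:
        p1, _, l1 = heapq.heappop(heap)
        p2, _, l2 = heapq.heappop(heap)
        # Each merge adds one bit to every code word below this node,
        # contributing p1 + p2 to the average length.
        heapq.heappush(heap, (p1 + p2, next_id, l1 + l2 + p1 + p2))
        next_id += 1
    return heap[0][2]

# Average bits per day when whole n-day sequences are Huffman-coded:
for n in (1, 2, 4, 8):
    block_probs = [math.prod(seq) for seq in product((p_good, p_bad), repeat=n)]
    print(n, huffman_avg_length(block_probs) / n)
# n=1 gives 1.0 and n=2 gives 0.84375 as in the text; larger n approaches H.
```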

Alexei Kurgansky
