## Section 7.1 An Introduction to Entropy

The proofs in this chapter will be more probabilistic in nature than any of the other proofs given in this course so far. However, the probability theory we will use is quite basic, requiring nothing more than an understanding of expectation and conditional probability. A key idea is the notion of the “entropy” of a random variable, which is defined as follows.

### Definition 7.1.1.

Let \(X\) be a discrete random variable with finite range \(S\text{.}\) The *entropy* of \(X\) is defined to be

\begin{equation*}
H(X) = \sum_{x\in S}\mathbb{P}(X=x) \log_2\left(\frac{1}{\mathbb{P}(X=x)}\right)\text{.}
\end{equation*}

Here and throughout, \(\log\) denotes the base-\(2\) logarithm \(\log_2\text{.}\) A nice way to think of the quantity \(\log\left(\frac{1}{\mathbb{P}(X=x)}\right)\) is as quantifying the amount of “surprise” that you feel upon seeing that \(X=x\text{.}\) If \(X=x\) is something that happens with probability close to \(1\text{,}\) i.e. it is very likely, then the amount of surprise that you feel from seeing that \(X=x\) is close to zero. For example, the fact that the sun rises in the morning and sets in the evening is never really surprising, because the probability that it would not happen is so low. Accordingly, if \(\mathbb{P}(X=x)\) is close to \(1\text{,}\) then \(\log\left(\frac{1}{\mathbb{P}(X=x)}\right)\) is close to zero. Conversely, it is surprising when unlikely events happen, and this is reflected by the fact that \(\log\left(\frac{1}{\mathbb{P}(X=x)}\right)\) is large when \(\mathbb{P}(X=x)\) is small. Given this intuition, one can think of the entropy \(H(X)\) as the “expected surprise” of the random variable \(X\text{.}\)
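This “surprise” intuition is easy to experiment with. Below is a small Python sketch (not part of the text; the function name is our own) that computes the entropy of Definition 7.1.1 for a fair coin and for a heavily biased coin:

```python
from math import log2

def entropy(dist):
    """Entropy (in bits) of a distribution given as {value: probability}."""
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

# A fair coin is maximally surprising: 1 bit of entropy.
fair = {"H": 0.5, "T": 0.5}
# A heavily biased coin is rarely surprising, so its entropy is small.
biased = {"H": 0.99, "T": 0.01}

print(entropy(fair))    # 1.0
print(entropy(biased))  # about 0.081
```

The rare outcome of the biased coin contributes a large surprise term \(\log_2(100)\text{,}\) but its tiny probability keeps the expected surprise small.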

### Example 7.1.3.

Suppose that \(X\) is uniformly distributed on a finite set \(S\text{.}\) That is, for every \(x\in S\text{,}\) we have \(\mathbb{P}(X=x)=\frac{1}{|S|}\text{.}\) Then

\begin{equation*}
H(X) = \sum_{x\in S}\mathbb{P}(X=x)\log\left(\frac{1}{\mathbb{P}(X=x)}\right) = \sum_{x\in S}\frac{\log(|S|)}{|S|} = \log(|S|)\text{.}
\end{equation*}

So, in general, if \(X\) is uniformly distributed on a finite set, then \(H(X)\) is the logarithm of the cardinality of the range of \(X\text{.}\)
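As a quick numerical check of this computation, the following sketch (hypothetical code, not from the text) computes the entropy of the uniform distribution on an \(8\)-element set:

```python
from math import log2

def entropy(dist):
    """Entropy (base-2) of a distribution {value: probability}."""
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

# Uniform distribution on an 8-element set: entropy is log2(8) = 3 bits.
uniform = {x: 1 / 8 for x in range(8)}
print(entropy(uniform))  # 3.0
```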

We will need to build up some basic facts about entropy. Many of these are related to the notions of “conditional” probabilities, random variables and entropy; we start by giving the relevant definitions. Anyone who has taken Math 352 at UVic, or any other probability or statistics course, will be familiar with several of these definitions.

### Definition 7.1.4.

Given two events \(A\) and \(B\) in a probability space such that \(\mathbb{P}(B)\neq 0\text{,}\) the *conditional probability* of \(A\) *given* \(B\) is defined by

\begin{equation*}
\mathbb{P}(A|B)=\frac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)}\text{.}
\end{equation*}

### Definition 7.1.5.

Say that events \(A\) and \(B\) are *independent* if \(\mathbb{P}(A\cap B)=\mathbb{P}(A)\mathbb{P}(B)\text{.}\)

### Definition 7.1.7.

Two discrete random variables \(X\) and \(Y\) on the same probability space are *independent* if the events \(\{X=x\}\) and \(\{Y=y\}\) are independent for every \(x\) and \(y\text{.}\)

### Definition 7.1.8.

Given discrete random variables \(X\) and \(Y\text{,}\) and \(y\) in the range of \(Y\) with \(\mathbb{P}(Y=y)\neq 0\text{,}\) let the *conditional entropy of \(X\) given \(Y=y\)* be given by

\begin{equation*}
H(X\mid Y=y) = \sum_{x}\mathbb{P}(X=x\mid Y=y)\log\left(\frac{1}{\mathbb{P}(X=x\mid Y=y)}\right)
\end{equation*}

where this sum is over all \(x\) such that \(\mathbb{P}(X=x\mid Y=y)\neq 0\text{.}\)

### Definition 7.1.9.

For discrete random variables \(X\) and \(Y\text{,}\) the *conditional entropy of \(X\) given \(Y\)* is

\begin{equation*}
H(X\mid Y) = \sum_{y}\mathbb{P}(Y=y)H(X\mid Y=y)\text{.}
\end{equation*}
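Definitions 7.1.8 and 7.1.9 translate directly into code. The sketch below (our own illustration; the joint distribution is made up) computes \(H(X\mid Y)\) from a joint distribution given as a dictionary:

```python
from math import log2
from collections import defaultdict

def cond_entropy(joint):
    """H(X | Y) from a joint distribution {(x, y): probability},
    following Definitions 7.1.8 and 7.1.9."""
    # Marginal distribution of Y.
    p_y = defaultdict(float)
    for (_, y), p in joint.items():
        p_y[y] += p
    total = 0.0
    for y, py in p_y.items():
        # H(X | Y = y), summing over x with P(X=x | Y=y) != 0.
        h_y = sum(
            (p / py) * log2(py / p)
            for (_, yy), p in joint.items()
            if yy == y and p > 0
        )
        total += py * h_y
    return total

# A hypothetical joint distribution, for illustration only.
joint = {(0, 0): 0.25, (1, 0): 0.25, (0, 1): 0.5}
print(cond_entropy(joint))  # 0.5: Y=0 leaves one fair bit, Y=1 determines X
```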

With all of these new definitions, it’s time for a few examples.

### Example 7.1.10.

Suppose that \(X\) and \(Y\) are independent. Then, for any \(x\) and any \(y\) in the range of \(Y\) with \(\mathbb{P}(Y=y)\neq 0\text{,}\) we have

\begin{equation*}
\mathbb{P}(X=x\mid Y=y) = \mathbb{P}(X=x)\text{.}
\end{equation*}

Therefore, \(H(X\mid Y=y)\) is simply equal to \(H(X)\text{.}\) Also,

\begin{equation*}
H(X\mid Y)=\sum_{y}\mathbb{P}(Y=y)H(X\mid Y=y)=\sum_{y}\mathbb{P}(Y=y)H(X)=H(X)\text{.}
\end{equation*}
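This calculation can be checked numerically. In the sketch below (a hypothetical fair coin \(X\) and biased coin \(Y\text{,}\) chosen independent so that the joint law factors), \(H(X\mid Y)\) computed from Definition 7.1.9 agrees with \(H(X)\):

```python
from math import log2

def entropy(dist):
    """Entropy (base-2) of a distribution {value: probability}."""
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

# Independent X (fair coin) and Y (biased coin); the joint law factors.
p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.3, 1: 0.7}
joint = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}

# H(X | Y) = sum_y P(Y=y) H(X | Y=y), with P(X=x | Y=y) = joint / P(Y=y).
h_x_given_y = sum(
    p_y[y] * entropy({x: joint[(x, y)] / p_y[y] for x in p_x})
    for y in p_y
)

# Independence means conditioning on Y changes nothing: H(X | Y) = H(X).
print(entropy(p_x), h_x_given_y)
```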

The entropy \(H(X)\) of \(X\) can be thought of as the amount of “information” carried by the random variable \(X\text{.}\) With this interpretation in mind, Example 7.1.10 illustrates that, if \(X\) and \(Y\) are independent, then “revealing” the information that \(Y=y\) does not reduce the amount of information contained in \(X\text{.}\) That is, if \(X\) and \(Y\) are independent, then knowing that \(Y=y\) tells us nothing about the value of \(X\text{.}\) We are still equally “surprised” by seeing the value of \(X\) as we would be if we had known nothing about \(Y\text{.}\) An example to think about: let \(X\) be the indicator function of you winning the lottery and \(Y\) the indicator function of you forgetting to take out the garbage. Whether or not you forgot to take out the garbage has no effect on how surprised you would be to win the lottery, as these two events have nothing to do with one another.

The next example illustrates another extreme. Say that a random variable \(X\) is *determined* by a random variable \(Y\) if, for every \(y\) in the range of \(Y\text{,}\) there exists \(f(y)\) such that \(\mathbb{P}(X=f(y)\mid Y=y)=1\text{.}\)

### Example 7.1.11.

Suppose that \(X\) is determined by \(Y\text{.}\) Then, for each \(y\) in the range of \(Y\text{,}\) the only nonzero term in the sum defining \(H(X\mid Y=y)\) is the one with \(x=f(y)\text{,}\) which has \(\mathbb{P}(X=f(y)\mid Y=y)=1\text{.}\) Hence

\begin{equation*}
H(X\mid Y=y) = \sum_{x}\mathbb{P}(X=x\mid Y=y)\log\left(\frac{1}{\mathbb{P}(X=x\mid Y=y)}\right) = 1\cdot\log(1) = 0\text{.}
\end{equation*}

Therefore, we also have that \(H(X\mid Y) = 0\text{.}\)

Example 7.1.11 illustrates the fact that, when \(X\) is determined by \(Y\text{,}\) if we already know the value of \(Y\text{,}\) then there is no additional surprise from learning the value of \(X\text{.}\) Thus, \(H(X\mid Y)=0\text{.}\)

### Example 7.1.12.

Let \(Y\) be uniformly distributed on \(\{0,1\}\text{.}\) Let \(X\) be the random variable such that, if \(Y=0\text{,}\) then \(X\) is uniformly distributed on \(\{4,5\}\) and, if \(Y=1\text{,}\) then \(X\) is uniformly distributed on \(\{6,7\}\text{.}\) If we think about \(X\) in isolation, then we see that it is equally likely to take any of the values in \(\{4,5,6,7\}\text{.}\) Therefore, by the result of Example 7.1.3,

\begin{equation*}
H(X)=\log(4)\text{.}
\end{equation*}

Also, by definition,

\begin{equation*}
H(Y)=\log(2)\text{.}
\end{equation*}

Now, let’s think about \(H(X\mid Y)\text{.}\) By going through a calculation similar to Example 7.1.3, we see that \(H(X\mid Y=0)=H(X\mid Y=1)=\log(2)\text{.}\) Thus, we have

\begin{equation*}
H(X\mid Y)=\log(2)\text{.}
\end{equation*}

Note that, if we think about \(H(Y\mid X)\) instead, then the situation is different. Knowing the value of \(X\) completely determines \(Y\text{;}\) therefore, \(H(Y\mid X)=0\text{.}\)
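The three entropies in this example can be verified numerically. The Python sketch below (our own illustration, not part of the text) encodes the joint distribution of \((X,Y)\) and checks \(H(X)=\log(4)\text{,}\) \(H(X\mid Y)=\log(2)\) and \(H(Y\mid X)=0\text{,}\) with logs in base \(2\):

```python
from math import log2

def entropy(dist):
    """Entropy (base-2) of a distribution {value: probability}."""
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

# Joint law of Example 7.1.12: Y is uniform on {0,1}; given Y=0,
# X is uniform on {4,5}; given Y=1, X is uniform on {6,7}.
joint = {(4, 0): 0.25, (5, 0): 0.25, (6, 1): 0.25, (7, 1): 0.25}
p_x = {4: 0.25, 5: 0.25, 6: 0.25, 7: 0.25}
p_y = {0: 0.5, 1: 0.5}

h_x = entropy(p_x)  # log2(4) = 2
h_x_given_y = sum(
    p_y[y] * entropy({x: p / p_y[y] for (x, yy), p in joint.items() if yy == y})
    for y in p_y
)                   # log2(2) = 1
h_y_given_x = sum(
    p_x[x] * entropy({y: p / p_x[x] for (xx, y), p in joint.items() if xx == x})
    for x in p_x
)                   # 0: the value of X pins down Y

print(h_x, h_x_given_y, h_y_given_x)  # 2.0 1.0 0.0
```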

The following lemma collates several of the basic facts that we will need about entropy and conditional entropy.

### Lemma 7.1.13.

Let \(X,Y,Z,X_1,\dots,X_n\) be discrete random variables and let \(S\) be the range of \(X\text{.}\) The following hold:

(a) \(H(X)\leq \log(|S|)\text{,}\) where equality holds if and only if \(X\) is uniformly distributed on \(S\text{;}\)

(b) \(H(X_1,\dots,X_n) = H(X_1)+H(X_2\mid X_1) + \cdots + H(X_n\mid X_1,\dots,X_{n-1})\text{;}\)

(c) if \(Z\) is determined by \(Y\text{,}\) then \(H(Y,Z)=H(Y)\text{;}\) and

(d) \(H(X\mid Y)\leq H(X)\text{,}\) with equality if and only if \(X\) and \(Y\) are independent.
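Before turning to the proof, the chain rule in property b can be sanity-checked numerically. The sketch below (a made-up joint distribution, not from the text) verifies \(H(X_1,X_2)=H(X_1)+H(X_2\mid X_1)\) for one dependent pair:

```python
from math import log2
from collections import defaultdict

def entropy(dist):
    """Entropy (base-2) of a distribution {value: probability}."""
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

# A hypothetical joint distribution for (X1, X2); X1 and X2 are dependent.
joint = {("a", 0): 0.5, ("a", 1): 0.25, ("b", 1): 0.25}

# Marginal distribution of X1.
p1 = defaultdict(float)
for (x1, _), p in joint.items():
    p1[x1] += p

# H(X2 | X1) = sum_{x1} P(X1=x1) H(X2 | X1=x1).
h2_given_1 = sum(
    p1[x1] * entropy({x2: p / p1[x1] for (y1, x2), p in joint.items() if y1 == x1})
    for x1 in p1
)

# Chain rule (property b with n = 2): H(X1, X2) = H(X1) + H(X2 | X1).
print(entropy(joint), entropy(p1) + h2_given_1)  # both equal 1.5 (up to rounding)
```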

### Proof.

We have already seen in Example 7.1.3 that the uniform distribution has entropy equal to the log of the size of its range. The inequality in Item a follows from Jensen’s Inequality (Theorem C.0.3) and the fact that the function \(x\log(x)\) is convex. The case of equality follows from the fact that \(x\log(x)\) is actually strictly convex; we omit the details.

Now, let’s think about property b. First, consider the case \(n=2\text{.}\) In this case,

\begin{equation*}
H(X_1,X_2)= \sum_{(x_1,x_2)}\mathbb{P}((X_1,X_2)=(x_1,x_2))\log\left(\frac{1}{\mathbb{P}((X_1,X_2)=(x_1,x_2))}\right)
\end{equation*}

\begin{equation*}
= \sum_{(x_1,x_2)}\mathbb{P}(X_1=x_1 \text{ and } X_2=x_2)\log\left(\frac{1}{\mathbb{P}(X_1=x_1 \text{ and } X_2=x_2)}\right)
\end{equation*}

\begin{equation*}
= \sum_{(x_1,x_2)}\mathbb{P}(X_1=x_1)\mathbb{P}(X_2=x_2\mid X_1=x_1)\log\left(\frac{1}{\mathbb{P}(X_1=x_1)\mathbb{P}(X_2=x_2\mid X_1=x_1)}\right)
\end{equation*}

\begin{equation*}
=\sum_{(x_1,x_2)}\mathbb{P}(X_1=x_1)\mathbb{P}(X_2=x_2\mid X_1=x_1)\left[\log\left(\frac{1}{\mathbb{P}(X_1=x_1)}\right)+\log\left(\frac{1}{\mathbb{P}(X_2=x_2\mid X_1=x_1)}\right)\right]\text{.}
\end{equation*}

Now, split this into two separate sums. The first one is

\begin{equation*}
\sum_{x_1}\sum_{x_2}\mathbb{P}(X_1=x_1)\mathbb{P}(X_2=x_2\mid X_1=x_1)\log\left(\frac{1}{\mathbb{P}(X_1=x_1)}\right) = H(X_1)
\end{equation*}

and the second is

\begin{equation*}
\sum_{x_1}\sum_{x_2}\mathbb{P}(X_1=x_1)\mathbb{P}(X_2=x_2\mid X_1=x_1)\log\left(\frac{1}{\mathbb{P}(X_2=x_2\mid X_1=x_1)}\right) = H(X_2\mid X_1)
\end{equation*}

as desired. Now, for \(n\geq3\text{,}\) we proceed by induction. Let \(Y_1=(X_1,X_2)\) and, for \(2\leq i\leq n-1\text{,}\) let \(Y_i=X_{i+1}\text{.}\) Then, by induction,

\begin{equation*}
H(Y_1,\dots,Y_{n-1})=H(Y_1)+H(Y_2\mid Y_1)+\dots+H(Y_{n-1}\mid Y_1,\dots,Y_{n-2})\text{.}
\end{equation*}

But we also know that \(H(Y_1)=H(X_1,X_2)=H(X_1)+H(X_2\mid X_1)\text{.}\) So, if we substitute this for \(H(Y_1)\) and write out the above expression in terms of \(X_1,\dots,X_n\text{,}\) then we get the result.
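To spell out the last step: conditioning on \(Y_1=(X_1,X_2)\) is the same as conditioning on both \(X_1\) and \(X_2\text{,}\) so \(H(Y_i\mid Y_1,\dots,Y_{i-1})=H(X_{i+1}\mid X_1,\dots,X_i)\) for each \(2\leq i\leq n-1\text{,}\) and the displayed identity becomes

\begin{equation*}
H(X_1,\dots,X_n) = H(Y_1)+H(Y_2\mid Y_1)+\dots+H(Y_{n-1}\mid Y_1,\dots,Y_{n-2})
\end{equation*}

\begin{equation*}
= \left[H(X_1)+H(X_2\mid X_1)\right]+H(X_3\mid X_1,X_2)+\dots+H(X_n\mid X_1,\dots,X_{n-1})\text{.}
\end{equation*}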
