
Section 7.1 An Introduction to Entropy

The proofs in this chapter will be more probabilistic in nature than any of the other proofs given in this course so far. However, the probability theory we will use is quite basic, requiring nothing more than an understanding of expectation and conditional probability. A key idea is the notion of the “entropy” of a random variable, which is defined as follows.

Definition 7.1.1.

Let \(X\) be a discrete random variable. The range of \(X\) is the set \(\rng(X)=\{x:\mathbb{P}(X=x)>0\}\text{.}\)

Definition 7.1.2.

Let \(X\) be a discrete random variable. The entropy of \(X\) is defined to be
\begin{equation*} \mathbb{H}(X) = \sum_{x\in \rng(X)}\mathbb{P}(X=x) \log_2\left(\frac{1}{\mathbb{P}(X=x)}\right)\text{.} \end{equation*}
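To make the formula concrete, here is a minimal Python sketch of the definition; the function name entropy and the representation of a distribution as a dictionary mapping outcomes to their probabilities are our own illustrative choices, not notation from the text.

```python
from math import log2

def entropy(dist):
    """Entropy, in bits, of a distribution given as {outcome: probability}.

    Only outcomes with positive probability contribute, mirroring the sum
    over rng(X) in Definition 7.1.2.
    """
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

# A fair six-sided die: uniform over 6 outcomes, entropy log2(6), about 2.585 bits.
print(entropy({face: 1 / 6 for face in range(1, 7)}))
```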

Remark 7.1.3.

Most of the time, the base of the logarithm is not important. Note that, since \(\log_a(x)=\frac{\log_b(x)}{\log_b(a)}\text{,}\) changing the base in the definition of entropy boils down to nothing more than multiplying the whole expression by a constant.
Let us try to build up some intuition about what the entropy of a random variable is measuring. For this, it is valuable to speak in terms of “information.” We think of information as being measured in “bits;” i.e. a sequence of zeros and ones of length \(n\) has the capacity to store \(n\) bits of information. The entropy of a random variable can be thought of as measuring the expected number of bits of information that are “revealed” by learning the outcome of a random experiment.
To begin to understand this, let us start with the simplest random variable, a fair coin flip. A coin flip has two possible outcomes, both of which are equally likely. If one reveals the outcome of a single coin flip to you, then they have conveyed exactly one bit of information to you. And, indeed, the entropy of a coin flip agrees with this intuitive definition; if \(X\) is the outcome of a coin flip, then
\begin{equation*} \mathbb{H}(X) = \mathbb{P}(X=\text{H})\log_2\left(\frac{1}{\mathbb{P}(X=\text{H})}\right)+\mathbb{P}(X=\text{T})\log_2\left(\frac{1}{\mathbb{P}(X=\text{T})}\right) \end{equation*}
\begin{equation*} =\frac{1}{2}\log_2(2) + \frac{1}{2}\log_2(2)=1. \end{equation*}
So, indeed, a coin flip does have entropy equal to one. Analogously, if you take \(n\) independent coin flips, there are \(2^n\) outcomes, each of which occurs with probability \(2^{-n}\text{.}\) Plugging this into the entropy formula, we get that the entropy is \(n\text{,}\) which matches up with the intuition that revealing the outcome of \(n\) independent coin flips conveys \(n\) bits of information.
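As a quick numerical check of this claim, the sketch below (reusing the entropy helper from the previous snippet) builds the uniform distribution over all binary strings of length \(n\) and confirms that its entropy is \(n\text{.}\)

```python
from math import log2

def entropy(dist):
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

n = 5
# The 2**n equally likely outcomes of n independent fair coin flips,
# encoded as binary strings of length n.
flips = {format(i, f"0{n}b"): 2 ** -n for i in range(2 ** n)}
print(entropy(flips))  # 5.0, i.e. n bits
```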
The intuition that entropy measures the number of bits of information that is conveyed when revealing the outcome of a random variable is easy to see when that random variable is chosen uniformly at random from a range of cardinality \(2^n\text{.}\) In other situations, while it may be a little harder to see the connections to revealing bits of information, this is still the correct intuition. We will try to explain why this is the case through various examples in this section.
One of the basic principles of entropy is that, among all random variables with a given range of cardinality \(n\text{,}\) the maximum entropy is achieved by the uniform random variable. In terms of information, this says that the average amount of information that can be gleaned from the outcome of a random experiment is achieved when that experiment is as unbiased as possible; such experiments have the least “predictable” and therefore most “informative” outcomes.

Proposition 7.1.4.

If \(X\) is a discrete random variable with a finite range, then \(\mathbb{H}(X)\leq\log_2(|\rng(X)|)\text{,}\) with equality if and only if \(X\) is uniform on its range.

Proof.

Let’s do the easy part first. If \(X\) is uniform on its range, then
\begin{equation*} \mathbb{H}(X)=\sum_{x\in\rng(X)}\mathbb{P}(X=x)\log_2\left(\frac{1}{\mathbb{P}(X=x)}\right) \end{equation*}
\begin{equation*} =\sum_{x\in\rng(X)}\left(\frac{1}{|\rng(X)|}\right)\log_2\left(|\rng(X)|\right) = \log_2(|\rng(X)|). \end{equation*}
So, \(\mathbb{H}(X)=\log_2(|\rng(X)|)\) when \(X\) is uniform on its range.
Now, for the general bound and the characterization of equality, let us first label the elements of \(\rng(X)\) by \(x_1,\dots,x_n\text{.}\) Observe that
\begin{equation*} \mathbb{H}(X)=\sum_{i=1}^n\mathbb{P}(X=x_i)\log_2\left(\frac{1}{\mathbb{P}(X=x_i)}\right) \end{equation*}
\begin{equation*} =-\sum_{i=1}^n\mathbb{P}(X=x_i)\log_2\left(\mathbb{P}(X=x_i)\right). \end{equation*}
Now, observe that \(x\log_2(x)\) is a strictly convex function from \((0,\infty)\) to \(\mathbb{R}\) (its second derivative is \(\frac{1}{x\ln(2)}\text{,}\) which is positive on this interval). So, if we apply Jensen’s Inequality with \(\varphi(x)=x\log_2(x)\text{,}\) \(a_i=1\) and \(z_i=\mathbb{P}(X=x_i)\) for \(1\leq i\leq n\text{,}\) we get
\begin{equation*} -\sum_{i=1}^n\mathbb{P}(X=x_i)\log_2\left(\mathbb{P}(X=x_i)\right)=-\sum_{i=1}^n\varphi(z_i) \end{equation*}
\begin{equation*} \leq -\left(\sum_{i=1}^na_i\right)\varphi\left(\frac{\sum_{i=1}^na_iz_i}{\sum_{i=1}^na_i}\right) = -n\varphi(1/n) = -n(1/n)\log_2(1/n)=\log_2(n). \end{equation*}
Moreover, since \(x\log_2(x)\) is strictly convex, this bound is tight if and only if all of the \(z_i\)s are equal; that is, if and only if \(X\) is uniform on its range.
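A small numerical illustration of this maximality principle, in the same illustrative Python style as before: on a four-element range, the uniform distribution attains \(\log_2(4)=2\) bits, while a biased distribution on the same range falls strictly below it.

```python
from math import log2

def entropy(dist):
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

uniform = {x: 1 / 4 for x in "abcd"}
biased = {"a": 0.7, "b": 0.1, "c": 0.1, "d": 0.1}

print(entropy(uniform))  # 2.0, which equals log2(4)
print(entropy(biased))   # about 1.357, strictly smaller
```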
A powerful concept in entropy is the notion of conditional entropy \(\mathbb{H}(X\mid Y)\) where \(X\) and \(Y\) are two random variables. Essentially, this measures the expected amount of “additional” information that one gains from learning the outcome of \(X\) given that the outcome of \(Y\) is already known. Before getting into the technical definition, let us discuss a little example. Suppose that I generate a uniformly random element \(X\) of \(\{0,1\}^2\) and I don’t tell you what it is. Then, I generate a random variable \(Y\) by letting it equal \(0\) if the two coordinates of \(X\) sum to an even number and letting \(Y=X\) otherwise. Suppose that I now reveal \(Y\) to you (but \(X\) is still hidden). The entropy of \(Y\) is
\begin{equation*} \frac{1}{2}\log_2(2) + \frac{1}{4}\log_2(4) + \frac{1}{4}\log_2(4) = \frac{3}{2}. \end{equation*}
Of course, the entropy of \(X\) is \(2\text{,}\) so, on average, learning the outcome of \(X\) conveys half a bit more information than learning the outcome of \(Y\text{.}\) This matches up with the fact that the entropy of \(Y\) is less than that of \(X\) by \(1/2\text{.}\)
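These two entropies are easy to verify numerically; the sketch below (the dictionary encoding of the joint distribution is our own) recovers both marginals from the joint distribution of \((X,Y)\) and computes their entropies.

```python
from math import log2

def entropy(dist):
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

# X is uniform on {0,1}^2; Y = X if the two bits differ, and Y = 0 otherwise.
joint = {}
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    joint[(x, x if x[0] != x[1] else 0)] = 1 / 4

X_marginal, Y_marginal = {}, {}
for (x, y), p in joint.items():
    X_marginal[x] = X_marginal.get(x, 0) + p
    Y_marginal[y] = Y_marginal.get(y, 0) + p

print(entropy(X_marginal))  # 2.0 = H(X)
print(entropy(Y_marginal))  # 1.5 = H(Y)
```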
Now, suppose that, after I have already shown you \(Y\text{,}\) I reveal the random variable \(X\) to you. How much total information is conveyed by the pair \((X,Y)\text{?}\) Well, if you think about it, if we know the outcome of \(X\text{,}\) then the outcome of \(Y\) can already be deduced from that information. Therefore, when we know the outcome of \(X\text{,}\) we get no additional information from \(Y\text{.}\) So, the entropy of \((X,Y)\) is equal to the entropy of \(X\text{,}\) which is 2. (Note: If you wanted to check this, you could do so using the original formula for entropy). In terms of the notion of conditional entropy that we roughly described in the previous paragraph, this means that \(\mathbb{H}(Y\mid X)=0\text{.}\) On the other hand, when we reveal \(X\) after revealing \(Y\text{,}\) we gain half a bit of information on average. This translates to the statement that \(\mathbb{H}(X\mid Y)=\frac{1}{2}\text{.}\) With this example in mind, let us now develop a general definition of conditional entropy and study its properties. First, we need the notion of conditional probability.

Definition 7.1.5.

Given two events \(A\) and \(B\) in a probability space such that \(\mathbb{P}(B)\neq 0\text{,}\) the conditional probability of \(A\) given \(B\) is defined by
\begin{equation*} \mathbb{P}(A|B)=\frac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)}\text{.} \end{equation*}

Definition 7.1.6.

Given two discrete random variables \(X\) and \(Y\) and \(y\in \rng(Y)\text{,}\) let \(\rng(X\mid Y=y)=\{x: \mathbb{P}(X=x\mid Y=y)>0\}\text{.}\)

Definition 7.1.7.

Given discrete random variables \(X\) and \(Y\) and \(y\in\rng(Y)\text{,}\) the conditional entropy of \(X\) given \(Y=y\) is
\begin{equation*} \mathbb{H}(X\mid Y=y) = \sum_{x\in\rng(X\mid Y=y)}\mathbb{P}(X=x\mid Y=y)\log_2\left(\frac{1}{\mathbb{P}(X=x\mid Y=y)}\right). \end{equation*}
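Here is a minimal sketch of this definition applied to the running example; the helper name entropy_given and the dictionary encoding of the joint distribution are our own illustrative choices.

```python
from math import log2

def entropy(dist):
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

# Running example: X uniform on {0,1}^2, Y = X if the bits differ, else Y = 0.
joint = {}
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    joint[(x, x if x[0] != x[1] else 0)] = 1 / 4

def entropy_given(joint, y):
    """H(X | Y = y): renormalize the outcomes consistent with Y = y, then take entropy."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)
    conditional = {x: p / p_y for (x, yy), p in joint.items() if yy == y}
    return entropy(conditional)

print(entropy_given(joint, (0, 1)))  # 0.0
print(entropy_given(joint, (1, 0)))  # 0.0
print(entropy_given(joint, 0))       # 1.0
```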
Let’s return to the variables \(X\) and \(Y\) from earlier; that is, \(X\) is a uniformly random binary string of length two, \(Y=X\) if the two bits of \(X\) are different, and \(Y=0\) otherwise. If we reveal that \(Y=(0,1)\) or \(Y=(1,0)\text{,}\) then no information is gained by learning the outcome of \(X\text{,}\) since it is equal to \(Y\) with probability 1. This matches with the fact that \(\mathbb{H}(X\mid Y=(0,1))=\mathbb{H}(X\mid Y=(1,0))=1\cdot\log_2(1) = 0\text{.}\) On the other hand, if \(Y=0\text{,}\) then there is a 50/50 chance that \(X\) is \((0,0)\) or \((1,1)\text{.}\) Thus, there are two possible outcomes, each equally likely, which suggests that the expected amount of information gained by learning the outcome of \(X\) after having revealed that \(Y=0\) is 1 bit. This matches with the fact that \(\mathbb{H}(X\mid Y=0)=\frac{1}{2}\log_2(2)+\frac{1}{2}\log_2(2)=1\text{.}\) Finally, we present the general notion of conditional entropy, which will be very useful.

Definition 7.1.8.

For discrete random variables \(X\) and \(Y\text{,}\) the conditional entropy of \(X\) given \(Y\) is
\begin{equation*} \mathbb{H}(X\mid Y) = \sum_{y\in\rng(Y)}\mathbb{P}(Y=y)\mathbb{H}(X\mid Y=y)\text{.} \end{equation*}
Earlier, we intuited that, if conditional entropy measures the amount of additional information gleaned from learning the outcome of one variable after another, then, in our running example, we would have \(\mathbb{H}(Y\mid X)=0\) and \(\mathbb{H}(X\mid Y)=\frac{1}{2}\text{.}\) And, in fact, this is exactly what we would get by computing conditional entropy via the above formula; we leave it to the reader to check this themselves.
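For readers who would like to see the check carried out, one possible computation (the helper conditional_entropy is our own, implementing Definition 7.1.8 directly) is sketched below.

```python
from math import log2

def entropy(dist):
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

def conditional_entropy(joint):
    """H(first | second) for a joint distribution {(x, y): probability}:
    average H(X | Y = y) over y, weighted by P(Y = y)."""
    total = 0.0
    for y in {y for (_, y) in joint}:
        p_y = sum(p for (_, yy), p in joint.items() if yy == y)
        conditional = {x: p / p_y for (x, yy), p in joint.items() if yy == y}
        total += p_y * entropy(conditional)
    return total

# Running example: X uniform on {0,1}^2, Y = X if the bits differ, else Y = 0.
joint = {}
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    joint[(x, x if x[0] != x[1] else 0)] = 1 / 4

swapped = {(y, x): p for (x, y), p in joint.items()}
print(conditional_entropy(joint))    # 0.5 = H(X | Y)
print(conditional_entropy(swapped))  # 0.0 = H(Y | X)
```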
Next, we present the very valuable “Chain Rule” of entropy, which formalizes the idea that conditional entropy measures the expected amount of additional information gained by learning the outcome of one variable when the outcome of others are already known.

Theorem 7.1.9. (Chain Rule)

If \(X_1,\dots,X_n\) are discrete random variables, then
\begin{equation*} \mathbb{H}(X_1,\dots,X_n)=\mathbb{H}(X_1)+\mathbb{H}(X_2\mid X_1)+\mathbb{H}(X_3\mid X_1,X_2)+\cdots+\mathbb{H}(X_n\mid X_1,\dots,X_{n-1})\text{.} \end{equation*}

Proof.

First, consider the case \(n=2\text{.}\) In this case,
\begin{equation*} \mathbb{H}(X_1,X_2)= \sum_{(x_1,x_2)}\mathbb{P}((X_1,X_2)=(x_1,x_2))\log_2\left(\frac{1}{\mathbb{P}((X_1,X_2)=(x_1,x_2))}\right) \end{equation*}
\begin{equation*} = \sum_{(x_1,x_2)}\mathbb{P}(X_1=x_1 \text{ and } X_2=x_2)\log_2\left(\frac{1}{\mathbb{P}(X_1=x_1 \text{ and } X_2=x_2)}\right) \end{equation*}
\begin{equation*} = \sum_{(x_1,x_2)}\mathbb{P}(X_1=x_1)\mathbb{P}(X_2=x_2\mid X_1=x_1)\log_2\left(\frac{1}{\mathbb{P}(X_1=x_1)\mathbb{P}(X_2=x_2\mid X_1=x_1)}\right) \end{equation*}
\begin{equation*} =\sum_{(x_1,x_2)}\mathbb{P}(X_1=x_1)\mathbb{P}(X_2=x_2\mid X_1=x_1)\left[\log_2\left(\frac{1}{\mathbb{P}(X_1=x_1)}\right)+\log_2\left(\frac{1}{\mathbb{P}(X_2=x_2\mid X_1=x_1)}\right)\right]\text{.} \end{equation*}
Now, split this into two separate sums. The first one is
\begin{equation*} \sum_{x_1}\sum_{x_2}\mathbb{P}(X_1=x_1)\mathbb{P}(X_2=x_2\mid X_1=x_1)\log_2\left(\frac{1}{\mathbb{P}(X_1=x_1)}\right) = \mathbb{H}(X_1) \end{equation*}
and the second is
\begin{equation*} \sum_{x_1}\sum_{x_2}\mathbb{P}(X_1=x_1)\mathbb{P}(X_2=x_2\mid X_1=x_1)\log_2\left(\frac{1}{\mathbb{P}(X_2=x_2\mid X_1=x_1)}\right) = \mathbb{H}(X_2\mid X_1) \end{equation*}
as desired.
Now, for \(n\geq3\text{,}\) we apply the case \(n=2\) and induction. Let \(Y_1=(X_1,\dots,X_{n-1})\) and \(Y_2=X_n\text{.}\) Then, using the case \(n=2\text{,}\) we get
\begin{equation*} \mathbb{H}(X_1,\dots,X_n)=\mathbb{H}(Y_1,Y_2)=\mathbb{H}(Y_1)+\mathbb{H}(Y_2\mid Y_1)=\mathbb{H}(X_1,\dots,X_{n-1})+\mathbb{H}(X_n\mid X_1,\dots,X_{n-1}). \end{equation*}
By induction, the first term can be broken down using the Chain Rule which gives us the result we wanted.
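As a numerical illustration of the Chain Rule on the running example, both decompositions of the joint entropy give the same total of 2 bits; the sketch below reuses the illustrative helpers from the earlier snippets.

```python
from math import log2

def entropy(dist):
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

def conditional_entropy(joint):
    """H(first | second) for a joint distribution {(x, y): probability}."""
    total = 0.0
    for y in {y for (_, y) in joint}:
        p_y = sum(p for (_, yy), p in joint.items() if yy == y)
        conditional = {x: p / p_y for (x, yy), p in joint.items() if yy == y}
        total += p_y * entropy(conditional)
    return total

# Running example: X uniform on {0,1}^2, Y determined by X.
joint = {}
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    joint[(x, x if x[0] != x[1] else 0)] = 1 / 4

X, Y = {}, {}
for (x, y), p in joint.items():
    X[x] = X.get(x, 0) + p
    Y[y] = Y.get(y, 0) + p
swapped = {(y, x): p for (x, y), p in joint.items()}

print(entropy(joint))                             # 2.0 = H(X, Y)
print(entropy(X) + conditional_entropy(swapped))  # 2.0 + 0.0 = H(X) + H(Y | X)
print(entropy(Y) + conditional_entropy(joint))    # 1.5 + 0.5 = H(Y) + H(X | Y)
```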
The last thing we need to address is the effect of independence, or lack thereof, on conditional entropy. Intuitively speaking, two variables \(X\) and \(Y\) are independent if knowing the outcome of \(X\) has no influence on the distribution of \(Y\text{.}\) As an example, one could think of \(X\) and \(Y\) as independent coin flips. Given our intuitive understanding of conditional entropy as the amount of additional information gained by revealing one random variable after another, it is intuitively clear that, if \(X\) and \(Y\) are independent, then \(\mathbb{H}(Y\mid X)\) should just be equal to \(\mathbb{H}(Y)\text{,}\) as we still get just as much information if we learn the outcome of \(Y\) after \(X\) as we would have if we didn’t know anything about \(X\) at all. This checks out with our coin flip examples. Each coin flip imparts 1 bit of information, and they are all independent, and so the entropy of \(n\) coin flips should be \(1+1+\cdots+1=n\text{,}\) which matches with the value that one gets by computing the entropy with the original formula. Let’s now formally define independence and verify our intuition about conditional entropy of independent random variables. It will actually be worthwhile to talk in terms of a more general notion of conditional independence defined below.

Definition 7.1.10.

Let \(X,Y,Z\) be discrete random variables. We say that \(X\) and \(Y\) are conditionally independent given \(Z\) if it holds that
\begin{equation*} \mathbb{P}(X=x\mid Y=y,Z=z)=\mathbb{P}(X=x\mid Z=z) \end{equation*}
and
\begin{equation*} \mathbb{P}(Y=y\mid X=x,Z=z)=\mathbb{P}(Y=y\mid Z=z) \end{equation*}
for all \(z\in\rng(Z),x\in\rng(X\mid Z=z)\) and \(y\in\rng(Y\mid Z=z)\text{.}\)
As a special case, we say that discrete random variables \(X\) and \(Y\) are independent if they are conditionally independent given \(Z\text{,}\) where \(Z\) is a constant random variable; i.e., \(\mathbb{P}(Z=z)=1\) for some \(z\text{.}\)
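A small numerical check that independence leaves conditional entropy unchanged, in the same illustrative style as before: for two independent fair coin flips, conditioning on one flip does not reduce the entropy of the other.

```python
from math import log2

def entropy(dist):
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

def conditional_entropy(joint):
    """H(first | second) for a joint distribution {(x, y): probability}."""
    total = 0.0
    for y in {y for (_, y) in joint}:
        p_y = sum(p for (_, yy), p in joint.items() if yy == y)
        conditional = {x: p / p_y for (x, yy), p in joint.items() if yy == y}
        total += p_y * entropy(conditional)
    return total

coin = {"H": 1 / 2, "T": 1 / 2}
# Independence: the joint probability is the product of the marginals.
joint = {(a, b): coin[a] * coin[b] for a in coin for b in coin}

print(entropy(coin))               # 1.0, the entropy of a single flip
print(conditional_entropy(joint))  # 1.0 as well: conditioning changes nothing
```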

Lemma 7.1.11.

If \(X\) and \(Y\) are conditionally independent given \(Z\text{,}\) then \(\mathbb{H}(X\mid Y,Z)=\mathbb{H}(X\mid Z)\text{.}\) In particular, if \(X\) and \(Y\) are independent, then \(\mathbb{H}(X\mid Y)=\mathbb{H}(X)\text{.}\)

Proof.

See the exercises at the end of this chapter.
At the other end of the spectrum, let us talk about the conditional entropy of variables that are highly dependent. We say that a random variable \(Y\) is determined by another random variable \(X\) if there exists a function \(f:\rng(X)\to\rng(Y)\) such that \(\mathbb{P}(Y=f(x)\mid X=x) = 1\) for every \(x\in\rng(X)\text{.}\) This is the situation with the “running example” from earlier in this section. As we said earlier, in this case, we should have \(\mathbb{H}(Y\mid X)=0\) since the outcome of \(Y\) can already be deduced from that of \(X\) and so no further information will be gleaned by revealing \(Y\text{.}\) The next lemma confirms this intuition.

Lemma 7.1.12.

If \(Y\) is determined by \(X\text{,}\) then \(\mathbb{H}(Y\mid X)=0\text{.}\)

Proof.

See the exercises at the end of this chapter.