from __init__ import install_dependencies

await install_dependencies()

As mentioned in previous lectures, the following two lists, coin_flips and dice_rolls, simulate random flips of a coin and rolls of a die:

# Do NOT modify any variables defined here because some tests rely on them
import random
import math
import matplotlib.pyplot as plt
import ipywidgets as widgets
%matplotlib widget
%reload_ext divewidgets

random.seed(0)  # for reproducible results.
num_trials = 200
coin_flips = ["H" if random.random() <= 1 / 2 else "T" for i in range(num_trials)]
dice_rolls = [random.randint(1, 6) for i in range(num_trials)]
print("coin flips: ", *coin_flips)
print("dice rolls: ", *dice_rolls)

What is the difference between the two random processes? Can we say that one process has more information content than the other?

In this lab, we will use dictionaries to store distributions and then compute their information content using information theory, which was introduced by Claude E. Shannon in his seminal work in 1948. It has numerous applications:

  • Compression (to keep files small).
  • Communications (to send data to mobile phones).
  • Cybersecurity (to protect data using provably secure methods).
  • Machine learning (to identify relevant features for learning).

The first conference on Information Theory was held in London in 1950, where Alan Turing also contributed.

Entropy

The following code, shown in the lecture, uses a dictionary to store the (empirical) distribution of a sequence efficiently, without storing outcomes that have zero counts:

# Do NOT modify any variables defined here because some tests rely on them
def distribute(seq):
    """Returns a dictionary where each value in a key-value pair is
    the probability of the associated key occurring in the sequence.
    """
    p = {}
    for i in seq:
        p[i] = p.get(i, 0) + 1 / len(seq)
    return p


# tests
coin_flips_dist = distribute(coin_flips)
dice_rolls_dist = distribute(dice_rolls)
print("Distribution of coin flips:", coin_flips_dist)
print("Distribution of dice rolls:", dice_rolls_dist)

You can verify that the probabilities $p_i$ sum to 1 for the empirical distributions of coin flips and dice rolls:

assert math.isclose(sum(coin_flips_dist.values()), 1) and math.isclose(
    sum(dice_rolls_dist.values()), 1
)

How to measure the information content?

The information content of a random outcome can be measured by the entropy of its distribution $\M{p}=(p_i)_{i\in \mc{S}}$, defined as $H(\M{p}) := \sum_{i\in \mc{S}} p_i \log_2 \frac{1}{p_i}$ bits, where $\mc{S}$ is the set of possible outcomes. For instance, if $\M{p}=(p_{H},p_{T})=(0.5,0.5)$, then

$$
\begin{aligned}
H(\M{p}) &= 0.5 \log_2 \frac{1}{0.5} + 0.5 \log_2 \frac{1}{0.5} \\
&= 0.5 + 0.5 = 1 \text{ bit,}
\end{aligned}
$$

i.e., an outcome of a fair coin flip has one bit of information content, as expected.

On the other hand, if $\M{p}=(p_{H},p_{T})=(1,0)$, then

$$
\begin{aligned}
H(\M{p}) &= 1 \log_2 \frac{1}{1} + 0 \log_2 \frac{1}{0} \\
&= 0 + 0 = 0 \text{ bits,}
\end{aligned}
$$

i.e., an outcome of a biased coin flip that always comes up heads has no information content, again as expected. (The term $0 \log_2 \frac{1}{0}$ is taken to be $0$, its limiting value.)

def entropy(dist):
    """Returns the entropy (in bits) of a distribution dist, which is stored as a
    dictionary mapping each possible outcome to its probability."""
    # YOUR CODE HERE
    raise NotImplementedError
# tests
assert math.isclose(entropy({"H": 0.5, "T": 0.5}), 1)
assert math.isclose(entropy({"H": 1, "T": 0}), 0)
assert math.isclose(entropy(dict.fromkeys(range(1, 7), 1 / 6)), math.log2(6))
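
If you are unsure where to start, here is a minimal sketch of one possible implementation (not necessarily the intended solution); it skips zero-probability outcomes, which implements the convention $0 \log_2 \frac{1}{0} = 0$ used above:

import math  # already imported above; repeated so this sketch is self-contained


def entropy(dist):
    """Returns the entropy (in bits) of a distribution dist, stored as a
    dictionary mapping each outcome to its probability."""
    # Sum p_i * log2(1/p_i) over outcomes with nonzero probability.
    return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)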

Uniform distribution maximizes entropy

Intuitively,

  • for a large enough number of fair coin flips, we should have $\mc{S}=\{H,T\}$ and $p_H=p_T=0.5$, i.e., an equal chance of heads and tails;
  • for a large enough number of fair dice rolls, we should have $p_i=\frac{1}{6}$ for all $i\in \mc{S}=\{1,2,3,4,5,6\}$.

The interactive plots below show the empirical distribution of the first n outcomes approaching the uniform distribution as n increases:
def plot_distribution(seq, *args, **kwargs):
    """Interactively plots the empirical distribution of seq[:n] for a slider-controlled n."""
    fig = plt.figure(*args, **kwargs)
    plt.xlabel("Outcomes")
    plt.title("Distribution")
    plt.ylim(0, 1)
    dist_plot = ()  # placeholder for the current stem plot

    @widgets.interact(
        n=widgets.IntSlider(
            value=1,
            min=1,
            max=len(seq),
            step=1,
            description="n:",
            continuous_update=False,
        )
    )
    def plot(n):
        plt.figure(fig.number)
        nonlocal dist_plot
        if dist_plot:  # remove the previous stem plot, if any
            dist_plot.remove()
        dist = distribute(seq[:n])
        dist_plot = plt.stem(dist.keys(), dist.values())
        plt.show()
plot_distribution(coin_flips, num=1, clear=True)
plot_distribution(dice_rolls, num=2, clear=True)
def uniform(outcomes):
    """Returns the uniform distribution (dict) over distinct items in outcomes."""
    # YOUR CODE HERE
    raise NotImplementedError
# tests
assert uniform("HT") == {"H": 0.5, "T": 0.5}
assert uniform("HTH") == {"H": 0.5, "T": 0.5}
fair_dice_dist = uniform(range(1, 7))
assert all(
    math.isclose(fair_dice_dist[k], v)
    for k, v in {
        1: 0.16666666666666666,
        2: 0.16666666666666666,
        3: 0.16666666666666666,
        4: 0.16666666666666666,
        5: 0.16666666666666666,
        6: 0.16666666666666666,
    }.items())
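
A minimal sketch of one possible implementation, assuming that repeated outcomes should be counted only once (as the test with "HTH" above suggests):

def uniform(outcomes):
    """Returns the uniform distribution (dict) over distinct items in outcomes."""
    distinct = set(outcomes)  # keep each distinct outcome once
    return {outcome: 1 / len(distinct) for outcome in distinct}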

What is the entropy for uniform distributions?

By definition,

$$
\begin{aligned}
H(\M{p}) &:= \sum_{i\in \mc{S}} p_i \log_2 \frac{1}{p_i} \\
&= \sum_{i\in \mc{S}} \frac{1}{|\mc{S}|} \log_2 |\mc{S}| = \log_2 |\mc{S}| \quad \text{(bits)}
\end{aligned}
$$

Entropy thus reduces to the formula in Lab 1 for the number of bits required to encode a set of integers or characters. Moreover, $\log_2 |\mc{S}|$ is the maximum possible entropy for a given finite set $\mc{S}$ of possible outcomes.

Use this result to test whether you have implemented both entropy and uniform correctly:

assert all(
    math.isclose(entropy(uniform(range(n))), math.log2(n)) for n in range(1, 100)
)

Joint distribution and its entropy

If we duplicate a sequence of outcomes, the total information content should remain unchanged, not doubled, because the duplicate contains the same information as the original. We will verify this fact by creating a joint distribution.

def jointly_distribute(seq1, seq2):
    """Returns the joint distribution of the tuple (i,j) of outcomes from zip(seq1,seq2)."""
    # YOUR CODE HERE
    raise NotImplementedError
# tests
assert jointly_distribute("HT", "HT") == {("H", "H"): 0.5, ("T", "T"): 0.5}
assert jointly_distribute("HHTT", "HTHT") == {
    ("H", "H"): 0.25,
    ("H", "T"): 0.25,
    ("T", "H"): 0.25,
    ("T", "T"): 0.25}
coin_flips_duplicate_dist = {
    ("T", "T"): 0.5350000000000004,
    ("H", "H"): 0.4650000000000003}
coin_flips_duplicate_ans = jointly_distribute(coin_flips, coin_flips)
assert all(
    math.isclose(coin_flips_duplicate_ans[i], pi)
    for i, pi in coin_flips_duplicate_dist.items())
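
A minimal sketch of one possible implementation, which counts pairs from zip(seq1, seq2) in the same way distribute counts individual outcomes:

def jointly_distribute(seq1, seq2):
    """Returns the joint distribution of the tuple (i, j) of outcomes from zip(seq1, seq2)."""
    pairs = list(zip(seq1, seq2))  # pair up corresponding outcomes
    p = {}
    for pair in pairs:
        p[pair] = p.get(pair, 0) + 1 / len(pairs)
    return p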

If you have implemented entropy and jointly_distribute correctly, you can verify that duplicating a sequence gives the same entropy:

assert math.isclose(
    entropy(jointly_distribute(coin_flips, coin_flips)), entropy(distribute(coin_flips))
)
assert math.isclose(
    entropy(jointly_distribute(dice_rolls, dice_rolls)), entropy(distribute(dice_rolls))
)

However, for two sequences generated independently, the joint entropy is roughly the sum of the individual entropies.

coin_flips_entropy = entropy(distribute(coin_flips))
dice_rolls_entropy = entropy(distribute(dice_rolls))
cf_dr_entropy = entropy(jointly_distribute(coin_flips, dice_rolls))
print(
    f"""Entropy of coin flip: {coin_flips_entropy}
Entropy of dice roll: {dice_rolls_entropy}
Sum of the above entropies: {coin_flips_entropy + dice_rolls_entropy}
Joint entropy: {cf_dr_entropy}"""
)

Conditional distribution and entropy

The conditional distribution of one sequence given another specifies, for each outcome $i$ observed in the first sequence, a distribution $q_{j|i}$ over the outcomes $j$ of the second sequence. For example, suppose we want to compute the distribution of coin flips given dice rolls. The following assigns q_H_1 and q_T_1 to the values $q_{H|1}$ and $q_{T|1}$ respectively:

coin_flips_1 = [j for i, j in zip(dice_rolls, coin_flips) if i == 1]
q_H_1 = coin_flips_1.count("H") / len(coin_flips_1)
q_T_1 = coin_flips_1.count("T") / len(coin_flips_1)
print("Coin flips given dice roll is 1:", coin_flips_1)
print(
    'Distribution of coin flip given dice roll is 1: {{ "H": {}, "T": {}}}'.format(
        q_H_1, q_T_1
    )
)
assert math.isclose(q_H_1 + q_T_1, 1)

Note that q_H_1 + q_T_1 is 1, as expected. Similarly, we can assign q_H_2 and q_T_2 to the values $q_{H|2}$ and $q_{T|2}$ respectively.

coin_flips_2 = [j for i, j in zip(dice_rolls, coin_flips) if i == 2]
q_H_2 = coin_flips_2.count("H") / len(coin_flips_2)
q_T_2 = coin_flips_2.count("T") / len(coin_flips_2)
print("Coin flips given dice roll is 2:", coin_flips_2)
print(
    'Distribution of coin flip given dice roll is 2: {{ "H": {}, "T": {}}}'.format(
        q_H_2, q_T_2
    )
)
assert math.isclose(q_H_2 + q_T_2, 1)

Finally, we want to store the conditional distribution as a nested dictionary so that q[i] points to the distribution

$$
\begin{bmatrix} q_{j|i} \end{bmatrix}_{j\in \mc{T}} \quad \text{for } i\in \mc{S}.
$$
q = {}
q[1] = dict(zip("HT", (q_H_1, q_T_1)))
q[2] = dict(zip("HT", (q_H_2, q_T_2)))
...  # entries q[i] for the other possible outcomes i of the dice rolls
q
def conditionally_distribute(seq1, seq2):
    """Returns the conditional distribution q of seq2 given seq1 such that
    q[i] is a dictionary for observed outcome i in seq1 and
    q[i][j] is the probability of observing j in seq2 given the
    corresponding outcome in seq1 is i."""
    # YOUR CODE HERE
    raise NotImplementedError
# tests
cf_given_dr_dist = {
    4: {"T": 0.5588235294117647, "H": 0.4411764705882353},
    1: {"T": 0.46511627906976744, "H": 0.5348837209302325},
    3: {"H": 0.5135135135135135, "T": 0.4864864864864865},
    6: {"H": 0.5454545454545454, "T": 0.45454545454545453},
    2: {"T": 0.7586206896551724, "H": 0.2413793103448276},
    5: {"T": 0.5416666666666666, "H": 0.4583333333333333}}
cf_given_dr_ans = conditionally_distribute(dice_rolls, coin_flips)
assert all(
    math.isclose(cf_given_dr_ans[i][j], v)
    for i, d in cf_given_dr_dist.items()
    for j, v in d.items())
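
A minimal sketch of one possible implementation, which groups the outcomes of seq2 by the corresponding outcome in seq1 and normalizes each group using the distribute function defined earlier:

def conditionally_distribute(seq1, seq2):
    """Returns the conditional distribution q of seq2 given seq1 such that
    q[i][j] is the probability of observing j in seq2 given the corresponding
    outcome in seq1 is i."""
    groups = {}
    for i, j in zip(seq1, seq2):
        groups.setdefault(i, []).append(j)  # collect seq2 outcomes for each i
    return {i: distribute(js) for i, js in groups.items()}
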
The conditional entropy of $\M{q}$ given $\M{p}$ is the entropy of the conditional distributions averaged over the conditioning outcomes:

$$
H(\M{q}|\M{p}) := \sum_{i\in \mc{S}} p_i \sum_{j\in \mc{T}} q_{j|i} \log_2 \frac{1}{q_{j|i}} \quad \text{(bits)}.
$$

def conditional_entropy(p, q):
    """Returns the conditional entropy (in bits) of q given p, where p is a
    dictionary of probabilities p_i and q is a nested dictionary with q[i][j]
    equal to the conditional probability of j given i."""
    # YOUR CODE HERE
    raise NotImplementedError
# tests
cf_given_dr_dist = {
    4: {"T": 0.5588235294117647, "H": 0.4411764705882353},
    1: {"T": 0.46511627906976744, "H": 0.5348837209302325},
    3: {"H": 0.5135135135135135, "T": 0.4864864864864865},
    6: {"H": 0.5454545454545454, "T": 0.45454545454545453},
    2: {"T": 0.7586206896551724, "H": 0.2413793103448276},
    5: {"T": 0.5416666666666666, "H": 0.4583333333333333}}
assert (
    conditional_entropy(
        {"H": 0.5, "T": 0.5}, {"H": {"H": 0.5, "T": 0.5}, "T": {"H": 0.5, "T": 0.5}})
    == 1)
assert (
    conditional_entropy(
        {"H": 0, "T": 1}, {"H": {"H": 0.5, "T": 0.5}, "T": {"H": 0.5, "T": 0.5}})
    == 1)
assert (
    conditional_entropy(
        {"H": 0.5, "T": 0.5}, {"H": {"H": 1, "T": 0}, "T": {"H": 0, "T": 1}})
    == 0)
assert (
    conditional_entropy(
        {"H": 0.5, "T": 0.5}, {"H": {"H": 1, "T": 0}, "T": {"H": 0.5, "T": 0.5}})
    == 0.5)
assert math.isclose(
    conditional_entropy(dice_rolls_dist, cf_given_dr_dist), 0.9664712793722372)
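
A minimal sketch of one possible implementation of the weighted-average formula above, skipping zero-probability terms as in the entropy sketch:

import math  # already imported above; repeated so this sketch is self-contained


def conditional_entropy(p, q):
    """Returns the conditional entropy H(q|p) in bits, where p is a dictionary
    of probabilities p_i and q is a nested dictionary with q[i][j] = q_{j|i}."""
    # Sum p_i * q_{j|i} * log2(1/q_{j|i}) over pairs with nonzero probability.
    return sum(
        p_i * q_ji * math.log2(1 / q_ji)
        for i, p_i in p.items()
        if p_i > 0
        for q_ji in q[i].values()
        if q_ji > 0
    )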

The joint probability $p_{ij}$ over $i\in \mc{S}$ and $j\in \mc{T}$ can be calculated as follows:

$$
p_{ij} = p_{i} q_{j|i},
$$

where, as introduced earlier, $p_i$ is the probability of $i$ and $q_{j|i}$ is the conditional probability of $j$ given $i$.

def joint_distribution(p, q):
    """Returns the joint distribution (dict) mapping each pair (i, j) of outcomes
    to the joint probability p_i * q_{j|i}."""
    # YOUR CODE HERE
    raise NotImplementedError
# tests
assert joint_distribution(
    {"H": 0.5, "T": 0.5}, {"H": {"H": 0.5, "T": 0.5}, "T": {"H": 0.5, "T": 0.5}}
) == {("H", "H"): 0.25, ("H", "T"): 0.25, ("T", "H"): 0.25, ("T", "T"): 0.25}
assert joint_distribution(
    {"H": 0, "T": 1}, {"H": {"H": 0.5, "T": 0.5}, "T": {"H": 0.5, "T": 0.5}}
) == {("H", "H"): 0.0, ("H", "T"): 0.0, ("T", "H"): 0.5, ("T", "T"): 0.5}
assert joint_distribution(
    {"H": 0.5, "T": 0.5}, {"H": {"H": 1, "T": 0}, "T": {"H": 0, "T": 1}}
) == {("H", "H"): 0.5, ("H", "T"): 0.0, ("T", "H"): 0.0, ("T", "T"): 0.5}

Finally, a fundamental identity relating the joint and conditional entropies is the chain rule:

$$
H(\M{p}\M{q}) = H(\M{p}) + H(\M{q}|\M{p}),
$$

where $\M{p}\M{q}$ denotes the joint distribution with probabilities $p_{ij} = p_i q_{j|i}$.
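
One way to see why the identity holds is to split the logarithm of the product and use $\sum_{j\in \mc{T}} q_{j|i} = 1$ (with zero-probability terms treated as zero, as before):

$$
\begin{aligned}
H(\M{p}\M{q}) &= \sum_{i\in \mc{S}}\sum_{j\in \mc{T}} p_i q_{j|i} \log_2 \frac{1}{p_i q_{j|i}} \\
&= \sum_{i\in \mc{S}} p_i \log_2 \frac{1}{p_i} \sum_{j\in \mc{T}} q_{j|i} + \sum_{i\in \mc{S}} p_i \sum_{j\in \mc{T}} q_{j|i} \log_2 \frac{1}{q_{j|i}} \\
&= H(\M{p}) + H(\M{q}|\M{p}).
\end{aligned}
$$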

If you have implemented jointly_distribute, conditionally_distribute, entropy, and conditional_entropy correctly, you can verify the identity as follows:

def validate_chain_rule(seq1, seq2):
    p = distribute(seq1)
    q = conditionally_distribute(seq1, seq2)
    pq = jointly_distribute(seq1, seq2)
    H_pq = entropy(pq)
    H_p = entropy(p)
    H_q_p = conditional_entropy(p, q)
    print(
        f"""Entropy of seq1: {H_p}
Conditional entropy of seq2 given seq1: {H_q_p}
Sum of the above entropies: {H_p + H_q_p}
Joint entropy: {H_pq}"""
    )
    assert math.isclose(H_pq, H_p + H_q_p)
validate_chain_rule(coin_flips, coin_flips)
validate_chain_rule(dice_rolls, coin_flips)
References
  1. Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379–423. doi:10.1002/j.1538-7305.1948.tb01338.x