# Execute this cell first to import the required libraries
import os

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

%matplotlib widget
if not os.getenv(
    "NBGRADER_EXECUTION"
):
    %load_ext jupyter_ai
    %ai update chatgpt dive:chat
    # Uncomment the following line to use the Azure model
    # %ai update chatgpt dive-azure:gpt4o
%%ai chatgpt -f text
Add comments to the first code cell to explain the code concisely:
--
{In[1]}

This notebook will demonstrate the fundamental concepts of bias and consistency using Monte Carlo simulations.[1]

Estimation Problem

Given a coin that is possibly biased, i.e., a coin toss lands heads with an unknown probability $p \in [0,1]$, how can we estimate $p$ based on a sequence of coin tosses? More precisely:

A convenient way to obtain the i.i.d. sample is to simulate the random process using a pseudorandom number generator:

# Initialize random number generator with a seed
rng = np.random.default_rng(seed=0)

# Generate the probability of head randomly
p = rng.random()

# Set the number of coin tosses we want to simulate
n = 5000

# Use the choice function to simulate n coin tosses,
# with "H" (heads) and "T" (tails) as possible outcomes,
# and the probability of heads and tails given by p and 1-p, respectively.
coin_tosses = rng.choice(["H", "T"], size=n, p=[p, 1 - p])
coin_tosses

The above code uses NumPy,[2] a popular Python package for efficient computation on multi-dimensional arrays:

  • default_rng returns a random number generator. Setting the seed ensures reproducibility of the results (see the quick check after this list).
  • choice is a method of the generator that generates a random sample from a given list of possible outcomes.
  • p keeps the value of the unknown probability $p$, which is picked uniformly at random from the unit interval $[0,1)$.
  • coin_tosses is a NumPy array of "H" and "T", which denote the outcomes “heads” and “tails” respectively.
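
As a quick illustration of that reproducibility (the names rng_check, p_check, and tosses_check below are only for this sketch), re-creating the generator with the same seed and repeating the same calls yields an identical sequence:

rng_check = np.random.default_rng(seed=0)  # same seed as before
p_check = rng_check.random()               # same first call, so it recovers the same p
tosses_check = rng_check.choice(["H", "T"], size=n, p=[p_check, 1 - p_check])
(tosses_check == coin_tosses).all()        # expected to be True
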
%%ai chatgpt -f text
Why should we set a random seed in a Monte Carlo simulation?
%%ai chatgpt -f text
What is a NumPy array and why should we use it to store and process
sequences of numbers?

To obtain the i.i.d. sample $\R{Z}^n$ in (3), simply run

coin_tosses == "H"

which gives a boolean NumPy array of True (treated as 1) and False (treated as 0), obtained by element-wise comparison with "H".
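
For instance, a minimal sketch (the name Z below is only illustrative) converts the boolean array to integers and counts the heads:

Z = (coin_tosses == "H").astype(int)  # 1 for heads, 0 for tails
Z[:10], Z.sum()                       # first 10 samples and the total number of heads
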

M-Estimation

A natural way to estimate $p$ is the M-estimate (sample average estimate)

$$
\begin{align}
\R{\hat{p}} &:= \frac{1}{n} \sum_{i=0}^{n-1} \R{Z}_i \\
&= \frac{\abs{\{0 \leq i < n \mid \R{Z}_i = 1\}}}{n},
\end{align}
$$

which is the fraction of the coin tosses coming up heads. The observed distribution $(\R{\hat{p}}, 1-\R{\hat{p}})$ of heads and tails is called the empirical distribution.

The estimate can be implemented as follows using the mean method of a NumPy array:

(coin_tosses == "H").mean()

Is the estimate good? The following is one desirable property:

YOUR ANSWER HERE

An unbiased estimate, while correct in expectation, can be far from the ground truth, especially when the variance is large, as the sketch below illustrates. It is desirable for an estimate to be nearly correct with high probability, which can be stated more formally below:
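
The following rough sketch (the generator rng_demo and the names num_trials, n_small, and phats are assumptions of this sketch, not part of the original notebook) repeats a small experiment many times: the estimate is correct on average but can deviate noticeably from $p$ in any single run.

rng_demo = np.random.default_rng(seed=2)              # a separate generator for this sketch
num_trials, n_small = 10000, 10                       # many repetitions of a small experiment
samples = rng_demo.random((num_trials, n_small)) < p  # each entry is heads with probability p
phats = samples.mean(axis=1)                          # one estimate per repetition
print(f"average of estimates: {phats.mean():.4f} (ground truth p = {p:.4f})")
print(f"standard deviation:   {phats.std():.4f} (theory: {(p * (1 - p) / n_small) ** 0.5:.4f})")
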

%%ai chatgpt -f text
Explain in simple words in one paragraph the difference between convergence
in probability and almost sure convergence.

YOUR ANSWER HERE

%%ai chatgpt -f text
Explain in one paragraph how the LLN can be used to show that 
an M-estimate is consistent.
%%ai chatgpt -f text
Explain in one paragraph what Chernoff bound is and how to use it to prove the LLN.

To illustrate consistency, the following code generates and plots the estimate $\R{\hat{p}}$ for different sample sizes $n$.

size = 5000
n = np.arange(1, size + 1)
phat = (coin_tosses == "H").cumsum() / n  # use first n tosses to estimate
sigma = (p * (1 - p) / n) ** 0.5  # true standard deviations of the estimates

# Create Figure 1, or clear it if it exists
plt.figure(1, clear=True)

# plot the ground truth p
plt.axhline(p, color="red", label=r"$p$")

# fill the region 2 sigma away from p
plt.fill_between(
    n, p - 2 * sigma, p + 2 * sigma, color="red", alpha=0.2, label=r"$p\pm 2\sigma$"
)

# plot the estimates phat
plt.plot(
    n,
    phat,
    marker=".",
    color="blue",
    linestyle="",
    markersize=1,
    label=r"$\hat{\mathsf{p}}$",
)

# configure the plot
plt.ylim([0, 1])
plt.xlim([0, n.size])
plt.title(r"Plot of ${\hat{p}}$ vs sample size")
plt.xlabel("sample size")
plt.ylabel("probability")
plt.legend()
plt.show()

For a consistent estimator, it is expected that the estimates (blue dots) converge to the ground truth (red line) as the sample size increases.

In addition, observe that the estimates mostly fall within 2 standard deviations of the ground truth, i.e., the rate of convergence follows the rate at which the standard deviation $\sigma = \sqrt{p(1-p)/n}$ decays.

The proof uses the central limit theorem (CLT): as $n$ goes to $\infty$, the normalized estimate $\frac{\R{\hat{p}} - p}{\sigma}$ converges in distribution to the standard Gaussian/normal distribution plotted as follows:

plt.figure(num=2, clear=True)

# plot the standard normal distribution
x = np.linspace(-4, 4, 8 * 10 + 1)
plt.plot(x, norm.pdf(x), color="red", label=r"$\frac{1}{\sqrt{2\pi}}e^{-x^2/2}$")

# Fill the area under curve within certain number of standard deviations from the mean
for i in range(3, 0, -1):
    plt.fill_between(
        x,
        norm.pdf(x),
        alpha=2 ** (-i),
        color="blue",
        label=rf"$\Pr(|\hat{{p}}-p|\leq{i}\sigma)\approx {(norm.cdf(i)- norm.cdf(-i))*100:.3g}\%$",
        where=(abs(x) <= i),
    )

# Label the plot
plt.title(
    r"Standard normal distribution for $\frac{{\hat{p}}-{p}}{\sigma}$ as $n\to \infty$"
)
plt.xlabel(r"x")
plt.ylabel(r"probability density")
plt.legend()
plt.show()
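
As a rough numerical check of the coverage percentages above (the generator rng_clt and the names n_fixed and k_trials are assumptions of this sketch), one can simulate many independent estimates at a fixed sample size and count how often they fall within $2\sigma$ of $p$:

rng_clt = np.random.default_rng(seed=3)  # a separate generator for this check
n_fixed, k_trials = 1000, 100000
phats_fixed = rng_clt.binomial(n_fixed, p, size=k_trials) / n_fixed
sigma_fixed = (p * (1 - p) / n_fixed) ** 0.5
coverage = (abs(phats_fixed - p) <= 2 * sigma_fixed).mean()
print(f"fraction within 2 sigma: {coverage:.3f} (CLT predicts about 0.954)")
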
%%ai chatgpt -f text
Give a scenario where LLN holds but CLT does not.
%%ai chatgpt -f text
Explain in simple words how to prove and understand the CLT.

A Coin Tossing Game

To understand the concept of bias in estimation, imagine playing the coin-tossing game:

  • You win if a coin toss comes up heads.
  • You get to choose 1 out of $m$ coins $i \in \{0,\dots,m-1\}$, each with an unknown probability $p_i$ of coming up heads.
  • You can flip each coin $n$ times before making your choice.

How to play the game?

A particular strategy for playing the game is to

  1. estimate the chance $p_i$ by the empirical probability $\R{\hat{p}}_i$ for each coin $i$, and
  2. select the coin (with ties broken arbitrarily)

$$\R{J} := \arg\max_i \R{\hat{p}}_i.$$

It is easy to see that the chance of winning with the given strategy is $E[p_{\R{J}}]$. Is the strategy optimal? Can a player evaluate or estimate the chance of winning without knowing the $p_i$'s? Is the following a good estimate:

$$\R{\hat{p}}_{\R{J}} = \max_i \R{\hat{p}}_i$$

How to evaluate the estimate?

Consider the simple case $n=1$, $m=2$, and $p_0 = p_1 = 0.5$. We have the following four equally likely events:

$$
\begin{array}{ccc}
\R{\hat{p}}_0 & \R{\hat{p}}_1 & \max_i \R{\hat{p}}_i \\ \hline
0 & 0 & 0 \\
0 & 1 & 1 \\
1 & 0 & 1 \\
1 & 1 & 1
\end{array}
$$

YOUR ANSWER HERE

Instead of carrying out the exact analysis, which involves order statistics, we will conduct a Monte Carlo simulation of the coin-tossing game. The simulation can be verified by hand-calculating the closed-form solution for $n=1$, $m=2$, and $p_0 = p_1 = 0.5$.
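
For example, a minimal sketch of that verification (the generator rng_simple below is an assumption of this sketch) simulates the $n=1$, $m=2$, $p_0 = p_1 = 0.5$ case directly and compares the average of $\max_i \R{\hat{p}}_i$ with the value obtained from the table above:

rng_simple = np.random.default_rng(seed=1)                   # a separate generator for this check
phat_simple = rng_simple.binomial(1, 0.5, size=(2, 100000))  # one toss for each of the 2 coins
phat_simple.max(axis=0).mean()                               # compare with the hand calculation
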

The following initializes the list p_list of probabilities of head for different coins:

m = 2
p_list = np.array([0.4] * (m - 1) + [0.6])
# To generate the probability randomly instead, use
# p_list = rng.random(m)
p_list

Instead of generating a sequence of coin tosses, we will simulate $\R{\hat{p}}_i$ directly using the binomial distribution, since

$$n \R{\hat{p}}_i \sim \operatorname{Binomial}(n, p_i).$$
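
As a quick sanity check before relying on this shortcut (the generator rng_sanity and the names n_check and k_check are assumptions of this sketch), estimates drawn via the binomial distribution agree on average with estimates computed from raw simulated tosses:

rng_sanity = np.random.default_rng(seed=4)  # a separate generator so the main simulation is unaffected
n_check, k_check = 10, 100000
raw = rng_sanity.choice(["H", "T"], size=(k_check, n_check), p=[p_list[0], 1 - p_list[0]])
direct = (raw == "H").mean(axis=1)                                          # estimates from raw tosses
shortcut = rng_sanity.binomial(n_check, p_list[0], size=k_check) / n_check  # binomial shortcut
direct.mean(), shortcut.mean()                                              # both should be close to p_list[0]

The game simulation below then draws all the estimates, for every sample size, coin, and trial, in a single vectorized call:
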
size = 10
n = np.arange(1, size + 1)
k = 100000
phat = rng.binomial(
    n.reshape(-1, 1, 1), p_list.reshape(1, -1, 1), (size, m, k)
) / n.reshape(-1, 1, 1)
max_phat = phat.max(axis=1)
max_phat

max_phat is a 2-dimensional array of samples of $\max_i \R{\hat{p}}_i$:

  • The first axis indexes samples obtained from different numbers of tosses.
  • The second axis indexes k independent samples for the same number of tosses.

The k independent samples can be used to approximate $E[\max_i \R{\hat{p}}_i]$ as follows.

E_max_phat = max_phat.mean(axis=-1)
E_max_phat

Similarly, the winning probability can be approximated as follows:

# p_list[phat.argmax(axis=1)] gives p_J, the head probability of the chosen coin J;
# averaging over the k trials (last axis) approximates E[p_J]
win_prob = p_list[phat.argmax(axis=1)].mean(axis=-1)
win_prob

The following plots compare the probabilities as functions of $n$.

plt.figure(3, clear=True)
plt.axhline(p_list.max(), color="red", label=r"$\max_i p_i$")
plt.plot(
    n,
    E_max_phat,
    linestyle="--",
    marker=".",
    color="blue",
    markersize=10,
    label=r"$E[\max_i{\hat{p}}_i]$",
)
plt.plot(
    n,
    win_prob,
    linestyle=":",
    marker="x",
    color="green",
    markersize=10,
    label="winning probability",
)

plt.ylim([0, 1])
plt.xlim([n[0], n[-1]])
plt.title(r"Plot of $E[\max_i{\hat{p}}_i]$ vs $n$")
plt.xlabel("$n$")
plt.ylabel("probability")
plt.legend()
plt.show()

YOUR ANSWER HERE

YOUR ANSWER HERE

YOUR ANSWER HERE

%%ai chatgpt -f text
State in simple words the uniform LLN and how it differs from the usual LLN.
Explain what makes uniform convergence possible and why it can be applied to
data mining and machine learning.

Glossary

Footnotes
  1. See the official documentation for more details.

  2. Indeed, one can consider the stronger notion of convergence

    $$P\left[\lim_{n\to \infty} f(\R{Z}^n) = E[\R{Z}]\right] = 1,$$

    i.e., the estimate converges to the ground truth almost surely.