import pandas as pd
import plotly.graph_objects as go

Man vs Machine Rematch

Segment Challenge Results

def plot_man_vs_machine():

    # Load the data
    rf_data = pd.read_csv("RF.csv")
    human_data = pd.read_csv("human.csv")

    # Create a combined dataframe with an additional column to distinguish the datasets
    rf_data["source"] = "RF"
    human_data["source"] = "Human"
    combined_data = pd.concat([rf_data, human_data])

    # Exclude data points with missing values
    combined_data = combined_data.dropna()

    # Keep only the points whose accuracy is not exceeded by any point at the same or smaller depth
    def filter_max_accuracy_points(data):
        data = data.sort_values(by="depth")
        filtered_data = []

        for i, row in data.iterrows():
            if not any(
                (data["depth"] <= row["depth"]) & (data["accuracy"] > row["accuracy"])
            ):
                filtered_data.append(row)

        return pd.DataFrame(filtered_data)

    # Apply the filtering function for each source
    max_accuracy_points = (
        combined_data.groupby("source")
        .apply(filter_max_accuracy_points, include_groups=False)
        .reset_index(drop=True)
    )

    # Create the scatter plot using go.Scatter
    fig = go.Figure()

    # Add traces for each source
    for source in combined_data["source"].unique():
        source_data = combined_data[combined_data["source"] == source]
        fig.add_trace(
            go.Scatter(
                x=source_data["depth"],
                y=source_data["accuracy"],
                mode="markers+text",
                text=source_data["name"],
                name=source,
                textfont=dict(color="rgba(0,0,0,0)"),  # Make text transparent
                marker=dict(size=10),
            )
        )

    # Update layout with labels and title
    fig.update_layout(
        title="Man vs Machine", xaxis_title="Tree Depth", yaxis_title="Accuracy"
    )

    # Add hover information
    fig.update_traces(hovertemplate="<b>%{text}</b><br>Accuracy: %{y}<br>Depth: %{x}")

    # Add annotations for the points with the highest accuracy
    for i, row in max_accuracy_points.iterrows():
        fig.add_annotation(
            x=row["depth"],
            y=row["accuracy"],
            text=f"{row['name']}, {row['accuracy']}",
            showarrow=True,
            arrowhead=2,
            ax=20,
            ay=-30,
            bgcolor="rgba(255, 255, 255, 0.6)",
            opacity=1,
            font=dict(size=10),
            hovertext=f"{row['name']}, {row['accuracy']}",
        )

    return fig


man_vs_machine_fig = plot_man_vs_machine()
man_vs_machine_fig.show()

Two heads are better than one

  • Bing/Baidu/Google translation.
  • The story in Chinese and its translation to English.
  • Can we combine two poor classifiers into a good classifier?
  • What is the benefit of doing so?
  • Accuracies of $\hat{f}_1$ and $\hat{f}_2$ are both ________%. Are they good?
  • Can we combine them into a better classifier $\hat{f}(x) := g(\hat{f}_1(x), \hat{f}_2(x))$? (A toy sketch of such a combination follows this list.)
  • $\underline{\kern3em}\{\hat{f}_1(x), \hat{f}_2(x)\}$ achieves an accuracy of ______________________%.
  • How does it work in general?
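
As a toy illustration (not the in-class example), here is a minimal sketch in which the labels follow an XOR rule, so each single-feature base classifier is no better than random guessing, yet the combination rule $g := \text{XOR}$ recovers the labels exactly. The names f1, f2, and g below are illustrative.

import numpy as np

# Toy illustration (not the in-class example): XOR-style labels.
rng = np.random.default_rng(0)
X = rng.random((10_000, 2))
y = (X[:, 0] > 0.5) ^ (X[:, 1] > 0.5)       # ground truth

f1 = X[:, 0] > 0.5                          # base classifier 1: uses feature 0 only
f2 = X[:, 1] > 0.5                          # base classifier 2: uses feature 1 only
g = f1 ^ f2                                 # combined classifier g(f1, f2) := f1 XOR f2

print("accuracy of f1:", (f1 == y).mean())  # ~0.5
print("accuracy of f2:", (f2 == y).mean())  # ~0.5
print("accuracy of g :", (g == y).mean())   # 1.0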

Architecture

  • Base classifiers $\hat{f}_j$'s are simple but possibly have weak preliminary predictions $\hat{y}_j$'s.
  • The combined classifier $\hat{f}$ uses the combination rule $g$ to merge the $\hat{y}_j$'s into a good final prediction $\hat{y}$.

Architecture for probabilistic classifiers

  • Base classifiers $\hat{f}_j$'s are simple but possibly have weak probability estimates $\hat{P}_{\R{Y}|\RM{X}}^{(j)}(\cdot \mid \M{x})$'s.
  • The combined classifier $\hat{f}$ uses the combination rule $g$ to merge the $\hat{P}_{\R{Y}|\RM{X}}^{(j)}(\cdot \mid \M{x})$'s into a good final prediction $\hat{P}_{\R{Y}|\RM{X}}(\cdot \mid \M{x})$.

How to get good performance?

  • Reduce risk by avoiding underfitting and overfitting.
  • For many loss functions $L$ (0-1 loss, sum of squared errors, ...):

    $$\underbrace{\E[L(\R{Y}, \hat{f}_{\R{W}}(\RM{X}))]}_{\text{Risk}} \leq \underbrace{\E[L(\R{Y}, \bar{f}(\RM{X}))]}_{\text{Bias}} + \underbrace{\E[L(\bar{f}(\RM{X}), \hat{f}_{\R{W}}(\RM{X}))]}_{\text{Variance}}$$

    where
  • $\bar{f} := \M{x} \mapsto \E[\hat{f}_{\R{W}}(\M{x})]$ is the expected predictor ($\R{W}$ is a random variable. Why?).
  • Variance is the dependence of $\hat{f}_{\R{W}}(\RM{X})$ on the training data; a large variance corresponds to overfitting.
  • Bias is the deviation of $\bar{f}(\RM{X})$ from $\R{Y}$; a large bias corresponds to underfitting. (A sketch of why the bound holds for the 0-1 loss follows this list.)
  • See the bias-variance trade-off.
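
For intuition (this step is not spelled out above), the bound can be checked directly for the 0-1 loss $L(a, b) := \mathbb{1}(a \neq b)$, which satisfies a pointwise triangle inequality:

$$\mathbb{1}(\R{Y} \neq \hat{f}_{\R{W}}(\RM{X})) \leq \mathbb{1}(\R{Y} \neq \bar{f}(\RM{X})) + \mathbb{1}(\bar{f}(\RM{X}) \neq \hat{f}_{\R{W}}(\RM{X})),$$

since the right-hand side is $0$ only if $\R{Y} = \bar{f}(\RM{X}) = \hat{f}_{\R{W}}(\RM{X})$; taking expectations of both sides gives the risk bound.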

Bias and variance for probabilistic classifiers

  • For probabilistic classifiers,
    $$\underbrace{\E\left[L\left(P_{\R{Y}|\RM{X}}(\cdot \mid \RM{X}), P_{\hat{\R{Y}}|\RM{X}, \R{W}}(\cdot \mid \RM{X}, \R{W})\right)\right]}_{\text{Risk}} \leq \underbrace{\E\left[L\left(P_{\R{Y}|\RM{X}}(\cdot \mid \RM{X}), P_{\hat{\R{Y}}|\RM{X}}(\cdot \mid \RM{X})\right)\right]}_{\text{Bias}} + \underbrace{I(\hat{\R{Y}}; \R{W} \mid \RM{X})}_{\text{Variance}}$$

    where
    • $f_{\R{W}}(\M{x}) := P_{\hat{\R{Y}}|\RM{X}, \R{W}}(\cdot \mid \M{x}, \R{W})$ implies

      $$\bar{f}(\M{x}) = \E\left[P_{\hat{\R{Y}}|\RM{X}, \R{W}}(\cdot \mid \M{x}, \R{W})\right] = P_{\hat{\R{Y}}|\RM{X}}(\cdot \mid \M{x}),$$

      called m______________;
    • $P_{\R{Y}|\RM{X}}(\cdot \mid \RM{X})$, instead of $\R{Y}$, is used as the ground truth;
    • the information (or Kullback–Leibler) divergence is used as the loss function

      $$L(Q, P) := D_{KL}(P \parallel Q) := \int_{\mathcal{Y}} dP \log \frac{dP}{dQ};$$

    • the variance term becomes the conditional mutual information

      $$\E\left[D_{KL}\left(P_{\hat{\R{Y}}|\RM{X}, \R{W}}(\cdot \mid \RM{X}, \R{W}) \parallel P_{\hat{\R{Y}}|\RM{X}}(\cdot \mid \RM{X})\right)\right] = I(\hat{\R{Y}}; \R{W} \mid \RM{X}) \quad \because I(\RM{X}; \R{W}) = 0.$$

How to reduce variance and bias?

  • Base classifiers should be diverse, i.e., capture as many different pieces of relevant information as possible to reduce ______.
  • The combination rule should reduce variance by smoothing out the noise while aggregating relevant information into the final decision.

Bagging (Bootstrap Aggregation): Base classifiers

  • Construct $m$ bootstrap samples.
  • Construct a base classifier for each bootstrap sample.
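
A minimal sketch of these two steps, assuming scikit-learn's DecisionTreeClassifier as the base learner; X, y, and m are illustrative placeholders, not the course dataset.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative data; X, y, and m are placeholders, not the course dataset.
X, y = make_classification(n_samples=500, random_state=0)
m = 10                                   # number of bootstrap samples
rng = np.random.default_rng(0)

base_classifiers = []
for _ in range(m):
    # Step 1: a bootstrap sample draws |D| examples uniformly with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: fit one base classifier on each bootstrap sample.
    base_classifiers.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))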

Bagging (Bootstrap Aggregation): Majority voting

$$\hat{f}(\M{x}) := \arg\max_{\hat{y}} \overbrace{\left( \sum_{j} \mathbb{1} \left( \hat{f}_j(\M{x}) = \hat{y} \right) \right)}^{\left| \left\{ j \mid \hat{f}_j(\M{x}) = \hat{y} \right\} \right| \,=\, \underline{\kern3em}}$$
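
A sketch of the voting rule itself, assuming the base predictions are collected row-wise in an array preds (the name and the toy numbers are illustrative):

import numpy as np

# Toy predictions from m = 3 base classifiers on 5 examples (class labels 0/1).
preds = np.array([[0, 1, 1, 0, 1],
                  [0, 1, 0, 0, 1],
                  [1, 1, 1, 0, 0]])

def majority_vote(preds):
    # For each example (column), count the votes per class and return the class
    # y_hat maximizing |{j : f_j(x) = y_hat}|.
    counts = np.apply_along_axis(np.bincount, 0, preds, minlength=preds.max() + 1)
    return counts.argmax(axis=0)

print(majority_vote(preds))  # [0 1 1 0 1]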

Example

  • Accuracy = _________________________%.

Is it always good to follow the majority?

  • Accuracy = _________________________%.
  • It is beneficial to return 0 more often because _________________________.
  • How to do this in general?

Sum rule and threshold moving

  • $\hat{f}(\M{x}) = 1$ iff

    $$\frac{1}{2} \left[ \hat{f}_1(\M{x}) + \hat{f}_2(\M{x}) \right] > \underline{\hspace{2cm}}$$

  • Binary classification: Choose $\hat{f}(\M{x}) = 1$ iff

    $$\frac{1}{m} \sum_{t} \hat{f}_t(\M{x}) > \gamma$$

    for some chosen threshold $\gamma$.

  • What about multi-class classification?

Bagging (Bootstrap Aggregation): Average of probabilities

$$\hat{f}(\M{x}) := \frac{1}{m} \sum_{t} \hat{f}_t(\M{x}) > \gamma$$
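
A sketch of averaging probabilities with threshold moving, assuming each base classifier outputs an estimate of $P(\R{Y} = 1 \mid \M{x})$; the array probs, the toy numbers, and the thresholds are illustrative.

import numpy as np

# Toy probability estimates of P(Y=1|x) from m = 3 base classifiers on 4 examples.
probs = np.array([[0.9, 0.4, 0.2, 0.6],
                  [0.8, 0.3, 0.4, 0.6],
                  [0.7, 0.6, 0.1, 0.6]])

avg = probs.mean(axis=0)    # (1/m) * sum_t f_t(x)

gamma = 0.5                 # default threshold
print(avg > gamma)          # -> [ True False False  True]

gamma = 0.7                 # raise the threshold to return 0 more often
print(avg > gamma)          # -> [ True False False False]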

Other techniques to diversify base classifiers

  • Random forest: Bagging with modified decision tree induction
    • Forest-RI: For each split, consider random i___________________ s___________________ where only $F$ randomly chosen features are considered.
    • Forest-RC: For each split, consider $F$ random l___________________ c___________________ of $L$ randomly chosen features.
  • Voting (weka.classifiers.meta.Vote) and Stacking (weka.classifiers.meta.Stacking):
    • Use different classification algorithms.
  • Adaptive boosting (Adaboost):
    • Each base classifier tries to _______________________________ made by previous base classifiers.
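
These techniques are also available off the shelf; the sketch below uses scikit-learn counterparts purely as an illustration (it is unrelated to the Weka classes named above, and all hyperparameter values are arbitrary).

from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              StackingClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # illustrative data

models = {
    # Forest-RI style: each split considers a random subset of features (max_features).
    "random forest": RandomForestClassifier(n_estimators=100, max_features="sqrt"),
    # Voting over different classification algorithms.
    "voting": VotingClassifier([("tree", DecisionTreeClassifier(max_depth=3)),
                                ("lr", LogisticRegression(max_iter=1000))],
                               voting="soft"),
    # Stacking: a meta classifier (logistic regression by default) combines base outputs.
    "stacking": StackingClassifier([("tree", DecisionTreeClassifier(max_depth=3)),
                                    ("lr", LogisticRegression(max_iter=1000))]),
    # Adaboost: each round reweights the examples the previous rounds got wrong.
    "adaboost": AdaBoostClassifier(n_estimators=50),
}

for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())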

Other techniques to combine decisions

  • Random forest: Bagging with modified decision tree induction
    • Majority voting
    • Average of probabilities
  • Voting
    • Majority voting or median
    • Average/product/minimum/maximum probabilities
  • Stacking: Use a meta classifier.
  • Adaptive boosting (Adaboost): 2003 Gödel Prize winner
    • Weighted majority voting

What is Adaboost?

  • An ensemble method that learns from mistakes:
    • Combined classifier: Majority voting, but with more weight on more accurate base classifiers.

      $$\hat{f}(\M{x}) := \arg\max_{\hat{y}} \sum_{t} w_t \cdot \mathbb{1}(\hat{f}_t(\M{x}) = \hat{y})$$

      where

      $$w_t := \frac{1}{2} \ln \left( \frac{1 - \text{error}(\hat{f}_t)}{\text{error}(\hat{f}_t)} \right)$$

      is the amount of say of $\hat{f}_t$ and $\text{error}(\hat{f}_t)$ is the error rate w.r.t. $D_t$. (See the precise formula below.)
    • Base classifiers: Train $\hat{f}_t$ sequentially in $t$ on $D_t$, obtained by bagging $(\M{x}_i, \R{Y}_i) \in D$ with sampling probabilities

      $$p_i^{(t)} := \frac{p_i^{(t-1)}}{Z_t} \times \begin{cases} e^{w_{t-1}}, & \hat{f}_{t-1}(\M{x}_i) \neq \R{Y}_i \text{ (incorrectly classified example)} \\ e^{-w_{t-1}}, & \text{otherwise (correctly classified example)} \end{cases}$$

      starting with $p_i^{(1)} := \frac{1}{|D|}$ and with $Z_t > 0$ chosen so that $\sum_{i} p_i^{(t)} = 1$.
    • Compute the error rate

      $$\text{error}(\hat{f}_t) := \sum_{i} p_i^{(t)} \cdot \mathbb{1}(\hat{f}_t(\M{x}_i) \neq \R{Y}_i).$$
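
A minimal from-scratch sketch of the procedure above, assuming binary labels and scikit-learn decision stumps as base classifiers; $D_t$ is obtained by resampling with probabilities $p^{(t)}$ as in the formulas, and the dataset and constants are illustrative. For instance, an error rate of $0.3$ gives an amount of say $w_t = \tfrac{1}{2}\ln(0.7/0.3) \approx 0.42$.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative data and constants; binary labels y in {0, 1}.
X, y = make_classification(n_samples=500, random_state=0)
n, T = len(X), 20
rng = np.random.default_rng(0)

p = np.full(n, 1 / n)                      # p_i^(1) := 1/|D|
classifiers, says = [], []

for t in range(T):
    # Train f_t on D_t: a sample of |D| examples drawn with probabilities p^(t).
    idx = rng.choice(n, size=n, replace=True, p=p)
    f_t = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])

    # Error rate w.r.t. p^(t): sum of p_i over misclassified examples.
    miss = f_t.predict(X) != y
    error = np.clip((p * miss).sum(), 1e-10, 1 - 1e-10)

    # Amount of say w_t = (1/2) ln((1 - error) / error).
    w_t = 0.5 * np.log((1 - error) / error)
    classifiers.append(f_t)
    says.append(w_t)

    # Up-weight misclassified examples, down-weight the rest, then renormalize.
    p = p * np.exp(np.where(miss, w_t, -w_t))
    p = p / p.sum()

# Weighted majority vote: argmax over y_hat of sum_t w_t * 1(f_t(x) = y_hat).
votes = np.zeros((n, 2))
for w_t, f_t in zip(says, classifiers):
    votes[np.arange(n), f_t.predict(X)] += w_t
print("training accuracy:", (votes.argmax(axis=1) == y).mean())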

Machine vs Machine

def plot_machine_vs_machine():

    # Load the data
    rf_data = pd.read_csv("RF.csv")
    adb_data = pd.read_csv("ADB.csv")

    # Create a combined dataframe with an additional column to distinguish the datasets
    rf_data["source"] = "RF"
    adb_data["source"] = "ADB"
    combined_data = pd.concat([rf_data, adb_data])

    # Exclude data points with missing values
    combined_data = combined_data.dropna()

    # Keep only the points whose accuracy is not exceeded by any point at the same or smaller depth
    def filter_max_accuracy_points(data):
        data = data.sort_values(by="depth")
        filtered_data = []

        for i, row in data.iterrows():
            if not any(
                (data["depth"] <= row["depth"]) & (data["accuracy"] > row["accuracy"])
            ):
                filtered_data.append(row)

        return pd.DataFrame(filtered_data)

    # Apply the filtering function for each source
    max_accuracy_points = (
        combined_data.groupby("source")
        .apply(filter_max_accuracy_points, include_groups=False)
        .reset_index(drop=True)
    )

    # Create the scatter plot using go.Scatter
    fig = go.Figure()

    # Add traces for each source
    for source in combined_data["source"].unique():
        source_data = combined_data[combined_data["source"] == source]
        fig.add_trace(
            go.Scatter(
                x=source_data["depth"],
                y=source_data["accuracy"],
                mode="markers+text",
                text=source_data["name"],
                name=source,
                textfont=dict(color="rgba(0,0,0,0)"),  # Make text transparent
                marker=dict(size=10),
            )
        )

    # Update layout with labels and title
    fig.update_layout(
        title="Machine vs Machine", xaxis_title="Tree Depth", yaxis_title="Accuracy"
    )

    # Add hover information
    fig.update_traces(hovertemplate="<b>%{text}</b><br>Accuracy: %{y}<br>Depth: %{x}")

    # Add annotations for the points with the highest accuracy
    for i, row in max_accuracy_points.iterrows():
        fig.add_annotation(
            x=row["depth"],
            y=row["accuracy"],
            text=f"{row['name']}, {row['accuracy']}",
            showarrow=True,
            arrowhead=2,
            ax=20,
            ay=-30,
            bgcolor="rgba(255, 255, 255, 0.6)",
            opacity=1,
            font=dict(size=10),
            hovertext=f"{row['name']}, {row['accuracy']}",
        )

    return fig


machine_vs_machine_fig = plot_machine_vs_machine()
machine_vs_machine_fig.show()

References
  1. Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140. 10.1007/bf00058655
  2. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. 10.1023/a:1010933404324