
Decision Tree Induction with scikit-learn

City University of Hong Kong
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets, tree

%matplotlib widget
if not os.getenv("NBGRADER_EXECUTION"):
    %load_ext jupyter_ai
    %ai update chatgpt dive:chat
    # %ai update chatgpt dive-azure:gpt4o

Decision Tree Induction

In this notebook, we will use scikit-learn to build decision trees on the iris dataset from the sklearn.datasets package.

iris = datasets.load_iris()

Recall that the classification task is to train a model that can automatically classify the species (target) based on the lengths and widths of the petals and sepals (input features).
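
For a quick look at the dataset, the feature names, class names, and dataset size can be printed directly:

print(iris.feature_names)  # the 4 input features
print(iris.target_names)  # the 3 species to classify
print(iris.data.shape, iris.target.shape)  # 150 labelled samples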

To build a decision tree, we can use DecisionTreeClassifier from sklearn.tree and apply its fit method on the training set.

clf_gini = tree.DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
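
As a quick sanity check, the fitted classifier can predict the species for given feature vectors and report its training accuracy:

print(clf_gini.predict(iris.data[:4]))  # predictions for the first few samples
print(clf_gini.score(iris.data, iris.target))  # resubstitution (training) accuracy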

To display the decision tree, we can use the function plot_tree from sklearn.tree:

plt.figure(num=1, clear=True)
tree.plot_tree(clf_gini)
plt.show()

For each node:

  • ___ <= ___ is the splitting criterion for internal nodes, satisfied only by samples going left.
  • gini = ... shows the impurity index. By default, the algorithm uses the Gini impurity index to find the best binary split. Observe that the index decreases down the tree towards the leaves.
  • value = [_, _, _] shows the number of associated instances for each of the three classes.
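
The same information can also be read programmatically from the fitted estimator's tree_ attribute; a minimal sketch for the root node (node 0):

node = 0  # the root node
print("feature  :", iris.feature_names[clf_gini.tree_.feature[node]])
print("threshold:", clf_gini.tree_.threshold[node])
print("impurity :", clf_gini.tree_.impurity[node])
print("samples  :", clf_gini.tree_.n_node_samples[node])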

Additional options may be provided to customize the look of the decision tree:

options = {
    "feature_names": iris.feature_names,
    "class_names": iris.target_names,
    "label": "root",
    "filled": True,
    "node_ids": True,
    "proportion": True,
    "rounded": True,
    "fontsize": 7,
}  # store options as dict for reuse

plt.figure(num=2, clear=True, figsize=(9, 9))
tree.plot_tree(clf_gini, **options)  # ** unpacks dict as keyword arguments
plt.show()

Each node now indicates the majority class, which may be used as the decision. The majority classes are also color coded. Observe that the color gets lighter towards the root as the class distribution becomes more impure. In particular, the iris setosa is distinguished immediately after checking the petal width/length.
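
As a complementary view, the same tree can be printed as plain text with sklearn.tree.export_text:

print(tree.export_text(clf_gini, feature_names=iris.feature_names))

The code cell below is an exercise: define clf_entropy as a decision tree trained on the iris dataset with entropy as the impurity measure, so that the subsequent cell can plot it.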

# YOUR CODE HERE
raise NotImplementedError

plt.figure(num=3, clear=True, figsize=(9, 9))
tree.plot_tree(clf_entropy, **options)
plt.show()

YOUR ANSWER HERE

%%ai chatgpt -f text
What is one-hot encoding, and why is it suitable for converting categorical 
features to numeric values for scikit-learn decision trees, which do not 
support categorical data?

Splitting Criterion

To induce a good decision tree efficiently, the splitting criterion is chosen

  • greedily to maximize the reduction in impurity and
  • recursively starting from the root.

Overview using pandas

To get a rough idea of which features are good to split on, we will operate on the iris dataset using a pandas DataFrame.

# store the input features in a DataFrame first
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# append the target values to the last column
df["target"] = iris.target
df.target = df.target.astype("category").cat.rename_categories(
    dict(zip(range(3), iris.target_names))
)
df

To display some statistics of the input features for different classes:

df.groupby("target", observed=False).boxplot(rot=90, layout=(1, 3))
df.groupby("target", observed=False).agg(["mean", "std"]).round(2)
plt.show()

YOUR ANSWER HERE

Measuring impurity

Suppose nearly all instances of a dataset belong to the same class. In that case, we can return the majority class as the decision without further splitting. A measure of impurity is the Gini impurity index, defined for a class distribution $p = (p_1, \dots, p_k)$ as

\begin{align} g(p) := 1 - \sum_{i=1}^k p_i^2, \end{align}

which is zero precisely when all instances belong to a single class.

We can represent a distribution simply as a NumPy array. To return the empirical class distributions of the iris dataset:

def dist(values):
    """
    Compute the empirical distribution of the given 1D array of values.

    This function calculates the empirical distribution of the input array,
    returning a 1D array of probabilities. The order of the probabilities in
    the output array is immaterial.

    Parameters
    ----------
    values : array-like
        A 1D array of values for which the empirical distribution is to be 
        computed.

    Returns
    -------
    probabilities : ndarray
        A 1D array of probabilities corresponding to the empirical distribution
        of the input values.

    Examples
    --------
    >>> import numpy as np
    >>> values = np.array([1, 2, 2, 3, 3, 3])
    >>> dist(values)
    array([0.16666667, 0.33333333, 0.5       ])
    """
    counts = np.unique(values, return_counts=True)[-1]
    return counts / counts.sum()


print(f"Distribution of target: {dist(iris.target).round(3)}")

The Gini impurity index can be implemented as follows:

def g(p):
    """Returns the Gini impurity of the distribution p."""
    return 1 - (p**2).sum()


_code = In[-1].rsplit(maxsplit=1)[0]  # store the code for chatting with LLM
g?
print(f"Gini(D) = {g(dist(iris.target)):.3g}")

%%ai chatgpt -f text
Improve the docstring to conform to NumPy style:
--
{_code}
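
As a cross-check (assuming clf_gini from earlier is still in memory), the impurity stored at the root node of the fitted tree should coincide with g applied to the empirical class distribution:

root_impurity = clf_gini.tree_.impurity[0]  # Gini impurity at the root node
print(root_impurity, g(dist(iris.target)))
assert np.isclose(root_impurity, g(dist(iris.target)))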

Another measure of impurity is the entropy from information theory, defined for a class distribution $p = (p_1, \dots, p_k)$ as

\begin{align} h(p) := -\sum_{i=1}^k p_i \log_2 p_i, \end{align}

with the convention that $0 \log_2 0 = 0$:

def h(p):
    # Improve the docstring below to conform to NumPy style.
    """Returns the entropy of distribution p (1D array)."""
    p = np.array(p)
    p = p[(p > 0) & (p < 1)]  # 0 log 0 = 1 log 1 = 0
    # YOUR CODE HERE
    raise NotImplementedError


h?
print(f"Info(D): {h(dist(iris.target)):.3g}")
Source
# tests
assert np.isclose(h([1 / 2, 1 / 2]), 1)
assert np.isclose(h([1, 0]), 0)
assert np.isclose(h([1 / 2, 1 / 4, 1 / 4]), 1.5)
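
A completed h can also be sanity-checked against SciPy's entropy function (assuming SciPy is installed; it is not imported above):

from scipy.stats import entropy  # assumed available alongside scikit-learn

p = np.array([1 / 2, 1 / 4, 1 / 4])
print(entropy(p, base=2))  # approximately 1.5, matching the last test above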

Drop in impurity

We will consider the binary splitting criterion $\R{X}\leq s$ in particular, which gives

\begin{align} \Delta \Gini_{\R{X}\leq s}(D) = g(\hat{p}_{\R{Y}}) - \left[\hat{P}[\R{X}\leq s]\, g(\hat{p}_{\R{Y}|\R{X}\leq s}) + \hat{P}[\R{X}> s]\, g(\hat{p}_{\R{Y}|\R{X}> s})\right] \end{align}

where

  • $\R{Y}$ denotes the target,
  • $\hat{P}$ denotes the empirical distribution, and
  • $\hat{p}_{\R{Y}|\R{X}\leq s}$, $\hat{p}_{\R{Y}|\R{X}> s}$, and $\hat{p}_{\R{Y}}$ denote the empirical probability mass functions of $\R{Y}$ with or without conditioning.

def drop_in_gini(X, Y, s):
    """
    Calculate the drop in Gini impurity for a binary split.

    This function computes the reduction in Gini impurity of the target `Y`
    when the input feature `X` is split at the threshold `s`.

    Parameters
    ----------
    X : 1D array-like
        Input feature values for different instances.
    Y : 1D array-like
        Target values corresponding to `X`.
    s : float
        Splitting point for `X`.

    Returns
    -------
    float
        The reduction in Gini impurity resulting from the split.
    """
    S = X <= s
    q = S.mean()
    return g(dist(Y)) - q * g(dist(Y[S])) - (1 - q) * g(dist(Y[~S]))


X, Y = df["petal width (cm)"], df.target
print(f"Drop in Gini: {drop_in_gini(X, Y, 0.8):.4g}")

To compute the best splitting point for a given input feature, we check every midpoint between consecutive observed feature values:

def find_best_split_pt(X, Y, gain_function):
    """
    Find the best splitting point and the maximum gain for a binary split.

    This function identifies the optimal splitting point `s` that maximizes the
    gain as evaluated by the provided `gain_function` for the split `X <= s` 
    and target `Y`.

    Parameters
    ----------
    X : 1D array-like
        Input feature values for different instances.
    Y : 1D array-like
        Target values corresponding to `X`.
    gain_function : function
        A function that evaluates the gain for a splitting criterion `X <= s`.
        For example, `drop_in_gini`.

    Returns
    -------
    tuple
        (s, g) where `s` is the best splitting point and `g` is the maximum 
        gain.

    See Also
    --------
    drop_in_gini : Function to calculate the drop in Gini impurity.
    """
    values = np.sort(np.unique(X))
    split_pts = (values[1:] + values[:-1]) / 2
    gain = np.array([gain_function(X, Y, s) for s in split_pts])
    i = np.argmax(gain)
    return split_pts[i], gain[i]


print(
    """Best split point: {0:.3g}
Maximum gain: {1:.3g}""".format(
        *find_best_split_pt(X, Y, drop_in_gini)
    )
)
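
Putting the pieces together, the greedy and recursive strategy described at the start of this section can be sketched as follows; this is only a conceptual sketch reusing dist, g, drop_in_gini, and find_best_split_pt from above (the function induce_tree and its nested-dict output are hypothetical, not scikit-learn's implementation):

def induce_tree(X, Y, depth=0, max_depth=2):
    """Conceptual sketch of greedy, recursive binary tree induction.

    X is a DataFrame of input features and Y a Series of class labels.
    """
    if Y.nunique() == 1 or depth >= max_depth:
        return {"predict": Y.mode().iloc[0]}  # pure or depth-limited node
    # greedy step: evaluate the best binary split of every non-constant feature
    candidates = [
        (f, *find_best_split_pt(X[f], Y, drop_in_gini))
        for f in X.columns
        if X[f].nunique() > 1
    ]
    if not candidates:
        return {"predict": Y.mode().iloc[0]}
    feature, s, best_gain = max(candidates, key=lambda t: t[-1])
    if best_gain <= 0:
        return {"predict": Y.mode().iloc[0]}
    left = X[feature] <= s
    # recursive step: induce the left and right subtrees on the two partitions
    return {
        f"{feature} <= {s:.3g}": {
            "left": induce_tree(X[left], Y[left], depth + 1, max_depth),
            "right": induce_tree(X[~left], Y[~left], depth + 1, max_depth),
        }
    }


induce_tree(df[iris.feature_names], df.target)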

The following ranks the features according to the gains of their best binary splits:

rank_by_gini = pd.DataFrame(
    {
        "feature": feature,
        **(lambda s, g: {"split point": s, "gain": g})(
            *find_best_split_pt(df[feature], df.target, drop_in_gini)
        ),
    }
    for feature in iris.feature_names
).sort_values(by="gain", ascending=False)
rank_by_gini

Using the entropy to measure impurity, we have the following alternative gain function:

We will again consider the binary splitting criterion $\R{X}\leq s$ in particular, which gives

\begin{align} \Gain_{\R{X}\leq s}(D) = h(\hat{p}_{\R{Y}}) - \left[\hat{P}[\R{X}\leq s]\, h(\hat{p}_{\R{Y}|\R{X}\leq s}) + \hat{P}[\R{X}> s]\, h(\hat{p}_{\R{Y}|\R{X}> s})\right] \end{align}

def gain(X, Y, s):
    """
    Calculate the information gain for a binary split.

    This function computes the information gain of the target `Y` when the 
    input feature `X` is split at the threshold `s`.

    Parameters
    ----------
    X : 1D array-like
        Input feature values for different instances.
    Y : 1D array-like
        Target values corresponding to `X`.
    s : float
        Splitting point for `X`.

    Returns
    -------
    float
        The information gain resulting from the split.
    """
    S = X <= s
    q = S.mean()
    # YOUR CODE HERE
    raise NotImplementedError


print(f"Information gain: {gain(X, Y, 0.8):.4g}")
Source
# tests
rank_by_entropy = pd.DataFrame(
    {
        "feature": feature,
        **(lambda s, g: {"split point": s, "gain": g})(
            *find_best_split_pt(df[feature], df.target, gain)
        ),
    }
    for feature in iris.feature_names
).sort_values(by="gain", ascending=False)
rank_by_entropy

The C4.5 induction algorithm uses information gain ratio instead of information gain:

For the binary split $\R{X}\leq s$,

\begin{align} \operatorname{SplitInfo}_{\R{X}\leq s}(D) := h\left(\hat{P}[\R{X}\leq s], \hat{P}[\R{X}> s]\right) \end{align}

in terms of the empirical distribution. The information gain ratio is then the information gain normalized by the split information:

\begin{align} \operatorname{GainRatio}_{\R{X}\leq s}(D) := \frac{\Gain_{\R{X}\leq s}(D)}{\operatorname{SplitInfo}_{\R{X}\leq s}(D)}. \end{align}
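
For instance, the split information of the earlier split petal width (cm) <= 0.8 can be computed directly with NumPy (a sketch that does not rely on the h above):

q = (X <= 0.8).mean()  # fraction of samples with petal width <= 0.8
split_info = -q * np.log2(q) - (1 - q) * np.log2(1 - q)
print(f"SplitInfo: {split_info:.3g}")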

def gain_ratio(X, Y, split_pt): 
    # Add docstring here
    S = X <= split_pt
    q = S.mean()
    # YOUR CODE HERE
    raise NotImplementedError

# tests
rank_by_gain_ratio = pd.DataFrame(
    {
        "feature": feature,
        **(lambda s, g: {"split point": s, "gain": g})(
            *find_best_split_pt(df[feature], df.target, gain_ratio)
        ),
    }
    for feature in iris.feature_names
).sort_values(by="gain", ascending=False)
rank_by_gain_ratio
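
To compare the three criteria side by side (a sketch that runs only after the gain and gain_ratio exercises above are completed), the rankings can be joined on the feature names:

pd.concat(
    {
        "Gini": rank_by_gini.set_index("feature"),
        "entropy": rank_by_entropy.set_index("feature"),
        "gain ratio": rank_by_gain_ratio.set_index("feature"),
    },
    axis=1,
)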

YOUR ANSWER HERE