import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets, tree
%matplotlib widget
if not os.getenv("NBGRADER_EXECUTION"):
    %load_ext jupyter_ai
    %ai update chatgpt dive:chat
    # %ai update chatgpt dive-azure:gpt4o
Decision Tree Induction¶
In this notebook, we will use scikit-learn to build decision trees on the iris dataset from the sklearn.datasets package.
iris = datasets.load_iris()
Recall that the classification task is to train a model that can automatically classify the species (target) based on the lengths and widths of the petals and sepals (input features).
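Before building any model, it can help to glance at the dataset structure. The following optional check is not part of the original walkthrough; it only uses standard attributes of the object returned by load_iris:
# optional sanity check (not required by the notebook)
print(iris.data.shape)           # (150, 4): 150 samples, 4 input features
print(iris.feature_names)        # sepal/petal lengths and widths in cm
print(iris.target_names)         # the three species names
print(np.bincount(iris.target))  # 50 instances per class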
To build a decision tree, we can use DecisionTreeClassifier from sklearn.tree and apply its fit method on the training set.
clf_gini = tree.DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
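As an optional sanity check (again, not part of the original flow), the fitted classifier can make predictions and report its accuracy on the training set:
# optional: resubstitution check on the training set
print(clf_gini.predict(iris.data[:5]))         # predicted classes of the first 5 samples
print(clf_gini.score(iris.data, iris.target))  # training accuracy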
To display the decision tree, we can use the function plot_tree from sklearn.tree:
plt.figure(num=1, clear=True)
tree.plot_tree(clf_gini)
plt.show()
For each node:
- ___ <= ___ is the splitting criterion for internal nodes, satisfied only by samples going left.
- gini = ... shows the impurity index. By default, the algorithm uses the Gini impurity index to find the best binary split. Observe that the index decreases down the tree towards the leaves.
- value = [_, _, _] shows the number of associated instances for each of the three classes.
How is the tree stored?
The information of the decision tree is stored in the tree_ attribute of the classifier. For more details, run help(clf_gini.tree_).
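For instance, a few of the arrays stored in tree_ can be inspected directly. The attributes used below (node_count, feature, threshold, children_left, children_right, impurity) are documented attributes of scikit-learn's tree object; this snippet is only an optional illustration:
# optional illustration of the low-level tree representation
t = clf_gini.tree_
print(t.node_count)                             # total number of nodes
print(t.feature[0], t.threshold[0])             # feature index and threshold of the root split
print(t.children_left[0], t.children_right[0])  # node ids of the root's children
print(t.impurity[0])                            # Gini impurity at the root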
Additional options may be provided to customize the look of the decision tree:
options = {
"feature_names": iris.feature_names,
"class_names": iris.target_names,
"label": "root",
"filled": True,
"node_ids": True,
"proportion": True,
"rounded": True,
"fontsize": 7,
} # store options as dict for reuse
plt.figure(num=2, clear=True, figsize=(9, 9))
tree.plot_tree(clf_gini, **options) # ** unpacks dict as keyword arguments
plt.show()
Each node now indicates the majority class, which may be used as the decision. The majority classes are also color coded. Observe that the color gets lighter towards the root as the class distribution becomes more impure. In particular, the iris setosa is distinguished immediately after checking the petal width/length.
# YOUR CODE HERE
raise NotImplementedError
plt.figure(num=3, clear=True, figsize=(9, 9))
tree.plot_tree(clf_entropy, **options)
plt.show()
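If the code omitted above is meant to fit a second tree using entropy as the impurity measure, as the name clf_entropy and the later discussion of entropy suggest, one possible sketch (to be placed before the plotting code) is:
# a possible sketch, assuming an entropy-based tree is intended here
clf_entropy = tree.DecisionTreeClassifier(criterion="entropy", random_state=0).fit(
    iris.data, iris.target
)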
YOUR ANSWER HERE
%%ai chatgpt -f text
What is one-hot encoding, and why is it suitable for converting categorical
features to numeric values for scikit-learn decision trees, which do not
support categorical data?
Splitting Criterion¶
To induce a good decision tree efficiently, the splitting criterion is chosen
- greedily to maximize the reduction in impurity and
- recursively starting from the root.
Overview using pandas¶
To get a rough idea of which features are good to split on, we will use a pandas DataFrame to operate on the iris dataset.
# write the input features first
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
# append the target values to the last column
df["target"] = iris.target
df.target = df.target.astype("category").cat.rename_categories(
dict(zip(range(3), iris.target_names))
)
df
To display some statistics of the input features for different classes:
df.groupby("target", observed=False).boxplot(rot=90, layout=(1, 3))
plt.show()
# keep the summary table as the last expression so it is displayed as the cell output
df.groupby("target", observed=False).agg(["mean", "std"]).round(2)
YOUR ANSWER HERE
Measuring impurity¶
Suppose nearly all instances of a dataset belong to the same class. In that case, we can return the majority class as the decision without further splitting. A measure of impurity is the Gini impurity index, defined as follows:

$$\operatorname{Gini}(D) := \operatorname{Gini}(\hat{p}_Y) := 1 - \sum_{k} \hat{p}_Y(k)^2,$$

where $\hat{p}_Y(k)$ is the fraction of instances in the dataset $D$ that belong to class $k$.
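For example, the iris dataset has 50 instances of each of the three classes, so the empirical class distribution is uniform and

$$\operatorname{Gini}(D) = 1 - 3\left(\tfrac{1}{3}\right)^2 = \tfrac{2}{3} \approx 0.667,$$

which matches the value computed by the code below.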
We can represent a distribution simply as a NumPy array. To return the empirical class distributions of the iris dataset:
def dist(values):
"""
Compute the empirical distribution of the given 1D array of values.
This function calculates the empirical distribution of the input array,
returning a 1D array of probabilities. The order of the probabilities in
the output array is immaterial.
Parameters
----------
values : array-like
A 1D array of values for which the empirical distribution is to be
computed.
Returns
-------
probabilities : ndarray
A 1D array of probabilities corresponding to the empirical distribution
of the input values.
Examples
--------
>>> import numpy as np
>>> values = np.array([1, 2, 2, 3, 3, 3])
>>> dist(values)
array([0.16666667, 0.33333333, 0.5 ])
"""
counts = np.unique(values, return_counts=True)[-1]
return counts / counts.sum()
print(f"Distribution of target: {dist(iris.target).round(3)}")
The Gini impurity index can be implemented as follows:
def g(p):
"""Returns the Gini impurity of the distribution p."""
return 1 - (p**2).sum()
_code = In[-1].rsplit(maxsplit=1)[0] # store the code for chatting with LLM
g?
print(f"Gini(D) = {g(dist(iris.target)):.3g}")
%%ai chatgpt -f text
Improve the docstring to conform to NumPy style:
--
{_code}
Another measure of impurity uses the information measure called entropy in information theory:

$$h(\hat{p}_Y) := -\sum_{k} \hat{p}_Y(k)\log_2 \hat{p}_Y(k),$$

with the convention that $0 \log_2 0 = 0$.
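For example, since the three iris classes are equally likely, the entropy of the class distribution is

$$h(\hat{p}_Y) = -3 \cdot \tfrac{1}{3}\log_2\tfrac{1}{3} = \log_2 3 \approx 1.58 \text{ bits}.$$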
def h(p):
# Improve the docstring below to conform to NumPy style.
"""Returns the entropy of distribution p (1D array)."""
p = np.array(p)
p = p[(p > 0) & (p < 1)] # 0 log 0 = 1 log 1 = 0
# YOUR CODE HERE
raise NotImplementedError
h?
print(f"Info(D): {h(dist(iris.target)):.3g}")
# tests
assert np.isclose(h([1 / 2, 1 / 2]), 1)
assert np.isclose(h([1, 0]), 0)
assert np.isclose(h([1 / 2, 1 / 4, 1 / 4]), 1.5)
Drop in impurity¶
We will consider the binary splitting criterion $X \le s$ in particular, which gives the drop in Gini impurity

$$\Delta\operatorname{Gini}_{X\le s}(Y) := \operatorname{Gini}(\hat{p}_Y) - \left[\hat{P}[X\le s]\,\operatorname{Gini}(\hat{p}_{Y|X\le s}) + \hat{P}[X>s]\,\operatorname{Gini}(\hat{p}_{Y|X>s})\right],$$

where
- $Y$ denotes the target,
- $\hat{P}$ denotes the empirical distribution, and
- $\hat{p}_Y$, $\hat{p}_{Y|X\le s}$, and $\hat{p}_{Y|X>s}$ denote the empirical probability mass functions of $Y$ with or without conditioning on the splitting criterion.
def drop_in_gini(X, Y, s):
"""
Calculate the drop in Gini impurity for a binary split.
This function computes the reduction in Gini impurity of the target `Y`
when the input feature `X` is split at the threshold `s`.
Parameters
----------
X : 1D array-like
Input feature values for different instances.
Y : 1D array-like
Target values corresponding to `X`.
s : float
Splitting point for `X`.
Returns
-------
float
The reduction in Gini impurity resulting from the split.
"""
S = X <= s
q = S.mean()
return g(dist(Y)) - q * g(dist(Y[S])) - (1 - q) * g(dist(Y[~S]))
X, Y = df["petal width (cm)"], df.target
print(f"Drop in Gini: {drop_in_gini(X, Y, 0.8):.4g}")
To compute the best splitting point for a given input feature, we check every midpoint between consecutive observed feature values:
def find_best_split_pt(X, Y, gain_function):
"""
Find the best splitting point and the maximum gain for a binary split.
This function identifies the optimal splitting point `s` that maximizes the
gain as evaluated by the provided `gain_function` for the split `X <= s`
and target `Y`.
Parameters
----------
X : 1D array-like
Input feature values for different instances.
Y : 1D array-like
Target values corresponding to `X`.
gain_function : function
A function that evaluates the gain for a splitting criterion `X <= s`.
For example, `drop_in_gini`.
Returns
-------
tuple
(s, g) where `s` is the best splitting point and `g` is the maximum
gain.
See Also
--------
drop_in_gini : Function to calculate the drop in Gini impurity.
"""
values = np.sort(np.unique(X))
split_pts = (values[1:] + values[:-1]) / 2
gain = np.array([gain_function(X, Y, s) for s in split_pts])
i = np.argmax(gain)
return split_pts[i], gain[i]
print(
"""Best split point: {0:.3g}
Maximum gain: {1:.3g}""".format(
*find_best_split_pt(X, Y, drop_in_gini)
)
)
The following ranks the features according to the gains of their best binary splits:
rank_by_gini = pd.DataFrame(
{
"feature": feature,
**(lambda s, g: {"split point": s, "gain": g})(
*find_best_split_pt(df[feature], df.target, drop_in_gini)
),
}
for feature in iris.feature_names
).sort_values(by="gain", ascending=False)
rank_by_gini
Using the entropy to measure impurity, we have the following alternative gain function. We will again consider the binary splitting criterion $X \le s$ in particular, which gives the information gain

$$\operatorname{Gain}_{X\le s}(Y) := h(\hat{p}_Y) - \left[\hat{P}[X\le s]\,h(\hat{p}_{Y|X\le s}) + \hat{P}[X>s]\,h(\hat{p}_{Y|X>s})\right].$$
def gain(X, Y, s):
"""
Calculate the information gain for a binary split.
This function computes the information gain of the target `Y` when the
input feature `X` is split at the threshold `s`.
Parameters
----------
X : 1D array-like
Input feature values for different instances.
Y : 1D array-like
Target values corresponding to `X`.
s : float
Splitting point for `X`.
Returns
-------
float
The information gain resulting from the split.
"""
S = X <= s
q = S.mean()
# YOUR CODE HERE
raise NotImplementedError
print(f"Information gain: {gain(X, Y, 0.8):.4g}")
# tests
rank_by_entropy = pd.DataFrame(
{
"feature": feature,
**(lambda s, g: {"split point": s, "gain": g})(
*find_best_split_pt(df[feature], df.target, gain)
),
}
for feature in iris.feature_names
).sort_values(by="gain", ascending=False)
rank_by_entropy
The C4.5 induction algorithm uses the information gain ratio instead of the information gain:

$$\operatorname{GainRatio}_{X\le s}(Y) := \frac{\operatorname{Gain}_{X\le s}(Y)}{\operatorname{SplitInfo}_{X\le s}}, \quad\text{where}\quad \operatorname{SplitInfo}_{X\le s} := h\bigl(\bigl[\hat{P}[X\le s],\, \hat{P}[X>s]\bigr]\bigr)$$

is the entropy of the split proportions.
def gain_ratio(X, Y, split_pt):
# Add docstring here
S = X <= split_pt
q = S.mean()
# YOUR CODE HERE
raise NotImplementedError
# tests
rank_by_gain_ratio = pd.DataFrame(
{
"feature": feature,
**(lambda s, g: {"split point": s, "gain": g})(
*find_best_split_pt(df[feature], df.target, gain_ratio)
),
}
for feature in iris.feature_names
).sort_values(by="gain", ascending=False)
rank_by_gain_ratio
YOUR ANSWER HERE