
Different Classifiers with scikit-learn

City University of Hong Kong

import os

import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display
from ipywidgets import interact
from sklearn import datasets, neighbors, preprocessing, tree
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from util import plot_decision_regions

%matplotlib widget
if not os.getenv(
    "NBGRADER_EXECUTION"
):
    %load_ext jupyter_ai
    %ai update chatgpt dive:chat
    # %ai update chatgpt dive-azure:gpt4o

Normalization of Attributes

For this notebook, we consider the binary classification problem on the breast cancer dataset in (Street et al. 1993):

# load the dataset from sklearn
dataset = datasets.load_breast_cancer()

# create a DataFrame to help further analysis
df = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
df["target"] = dataset.target
df.target = df.target.astype("category").cat.rename_categories(
    dict(zip(range(3), dataset.target_names))
)
df  # display an overview of the data

The goal is to train a classifier to diagnose whether a breast mass is malignant or benign. The target class distribution is shown as follows:

plt.figure(num=1, clear=True)
display(df.target.value_counts())
df.target.value_counts().plot(kind="bar", title="counts of different classes", rot=0)
plt.show()

The input features are characteristics of cell images obtained from a fine needle aspirate (FNA). Let us see if an LLM can give more information on the input features:

%%ai chatgpt -f text
Explain each input feature of the fine needle analysis in one line:
--
{df.columns}

To gain insights into how each input feature varies with different class values, we can examine the statistics of each input feature grouped by the class values:

def show_feature_statistics(df, **kwargs):
    # One box-plot panel per class, sharing the y-axis for comparison.
    grps = df.groupby("target", observed=False)
    fig, axes = plt.subplots(
        nrows=1, ncols=len(grps), sharey=True, clear=True, figsize=(10, 9),
        layout="constrained", squeeze=False, **kwargs
    )
    for (name, grp), ax in zip(grps, axes[0]):
        grp.boxplot(rot=90, fontsize=7, ax=ax).set_title(name)
    plt.show()

show_feature_statistics(df, num=2)

From the above plots, we can observe that the features mean area and worst area have much larger ranges than the other features.

YOUR ANSWER HERE

Min-max Normalization

We can normalize a numeric feature $\R{Z}$ to the unit interval as follows:

\begin{align} \R{Z}' := \frac{\R{Z} - a}{b - a} \end{align}

where $a$ and $b$ are respectively the minimum and maximum possible values of $\R{Z}$.

In practice, $a$ and $b$ may be unknown because the distribution of $\R{Z}$ is unknown. We therefore perform the normalization on the samples: the min-max normalization of the sequence (in $i$) of $z_i$ is the sequence of

\begin{align} z'_i := \frac{z_i - \min_j z_j}{\max_j z_j - \min_j z_j}, \end{align}

where $\min_j z_j$ and $\max_j z_j$ are respectively the minimum and maximum sample values. It follows that $0 \leq z'_i \leq 1$, and each inequality holds with equality for some index $i$.
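
As a quick check of the sample formula, the following minimal sketch normalizes a short list of made-up values using only the Python standard library. The helper name `minmax` and the numbers are illustrative assumptions, not part of the notebook.

```python
def minmax(values):
    """Min-max normalize a non-constant sequence of numbers to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max normalization is undefined for a constant sequence")
    return [(z - lo) / (hi - lo) for z in values]


z = [2.0, 4.0, 10.0]  # hypothetical sample values
z_normalized = minmax(z)
print(z_normalized)  # [0.0, 0.25, 1.0]
```

The smallest sample maps to 0 and the largest to 1, which is the sense in which both bounds are attained.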

An implementation is as follows:

def minmax_normalize(df, suffix=" (min-max normalized)"):
    """
    Min-max normalize numerical attributes of the input DataFrame.

    Parameters
    ----------
    df : DataFrame
        Input DataFrame to be min-max normalized. May contain both numeric and 
        categorical attributes.
    suffix : str, optional
        Suffix to append to the names of normalized attributes (default is 
        " (min-max normalized)").

    Returns
    -------
    DataFrame
        A copy of the input DataFrame with its numeric attributes replaced by their 
        min-max normalization. The normalized features are renamed with the suffix 
        appended to the end of their original names.
    """
    df = df.copy()  # avoid overwriting the original dataframe
    min_values = df.select_dtypes(include="number").min()  # Skip categorical features
    max_values = df[min_values.index].max()

    # min-max normalize
    df[min_values.index] = (df[min_values.index] - min_values) / (
        max_values - min_values
    )

    # rename normalized features
    df.rename(columns={c: c + suffix for c in min_values.index}, inplace=True)

    return df

Renaming normalized features helps differentiate them from the original ones. The statistics of the normalized features are given below:

df_minmax_normalized = minmax_normalize(df)
assert df_minmax_normalized.target.to_numpy().base is df.target.to_numpy().base

show_feature_statistics(df_minmax_normalized, num=3)

We can now see how instances of different classes differ in input features other than mean area and worst area. In particular, both mean concavity and worst concavity are substantially higher for malignant examples than for benign examples.

Standard Normalization

Min-max normalization is not appropriate for features with unbounded support, where $b - a = \infty$ in (1). For i.i.d. samples, the normalization factor $\max_j z_j - \min_j z_j$ in (2) will approach $\infty$ as the number of samples goes to infinity.
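
To see this growth numerically, here is a small simulation sketch. It assumes standard Gaussian samples as one example of a distribution with unbounded support; the seed and sample sizes are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]

sizes = [10, 100, 1_000, 10_000, 100_000]
ranges = []
for n in sizes:
    prefix = samples[:n]  # nested prefixes, so the range cannot shrink
    ranges.append(max(prefix) - min(prefix))

for n, r in zip(sizes, ranges):
    print(f"n = {n:>6}: sample range = {r:.2f}")
```

For Gaussian samples the range grows slowly, but it keeps growing, so the normalization factor depends on how many samples happen to be available.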

Let us inspect the distribution of each feature using histplot provided by the package seaborn, which is imported with

import seaborn as sns

@interact(
    feature=dataset.feature_names, kernel_density_estimation=True, group_by_class=False
)
def plot_distribution(feature, kernel_density_estimation, group_by_class):
    grps = df.groupby("target", observed=False) if group_by_class else [('', df)]
    fig, axes = plt.subplots(nrows=1, ncols=len(grps), clear=True, figsize=(10, 5), layout="constrained", num=4, squeeze=False, sharey=True)
    for grp, ax in zip(grps, axes[0]):
        sns.histplot(data=grp[1], x=feature, kde=kernel_density_estimation, ax=ax)
    plt.show()

Play with the above widgets to check if the input features have unbounded support.

For a feature $\R{Z}$ with unbounded support, one may use the $z$-score/standard normalization instead:

\begin{align} \R{Z}' := \frac{\R{Z} - E[\R{Z}]}{\sqrt{\operatorname{Var}(\R{Z})}}. \end{align}

Since the distribution of $\R{Z}$ is unknown, we normalize the sequence of i.i.d. samples $z_i$ using its sample mean $\mu$ and standard deviation $\sigma$ to the sequence of

\begin{align} z'_i := \frac{z_i - \mu}{\sigma}. \end{align}

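
The sample formula can be tried on a plain list first. The sketch below uses made-up values and the standard library's `statistics.stdev`, which divides by $n - 1$. That choice is an assumption here; dividing by $n$ (`statistics.pstdev`) is the other common convention.

```python
import statistics

z = [2.0, 4.0, 4.0, 6.0, 9.0]  # hypothetical sample values

mu = statistics.mean(z)
sigma = statistics.stdev(z)  # sample standard deviation (divides by n - 1)

z_standardized = [(z_i - mu) / sigma for z_i in z]

# Up to floating-point error, the mean is 0 and the standard deviation is 1.
print(round(statistics.mean(z_standardized), 12))
print(round(statistics.stdev(z_standardized), 12))
```

Note that pandas `std()` also divides by $n - 1$ by default, whereas NumPy's `std()` divides by $n$ by default, so the two give slightly different normalized values.
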
def standard_normalize(df, suffix=" (standard normalized)"):
    """Returns a DataFrame with numerical attributes of the input DataFrame
    standard normalized.

    Parameters
    ----------
    df: DataFrame
        Input to be standard normalized. May contain both numeric
        and categorical attributes.
    suffix: string
        Suffix to append to the names of normalized attributes.

    Returns
    -------
    DataFrame:
        A new copy of df that retains the categorical attributes but with the
        numeric attributes replaced by their standard normalization.
        The normalized features are renamed with the suffix appended to the end
        of their original names.
    """
    # YOUR CODE HERE
    raise NotImplementedError


df_standard_normalized = standard_normalize(df)
show_feature_statistics(df_standard_normalized, num=5)
# tests
assert np.isclose(
    df_standard_normalized.select_dtypes(include="number").mean(), 0
).all()
assert np.isclose(df_standard_normalized.select_dtypes(include="number").std(), 1).all()

Nearest Neighbor Classification

To create a $k$-nearest-neighbor ($k$-NN) classifier, we can use sklearn.neighbors.KNeighborsClassifier. The following fits a 1-NN classifier to the entire dataset and reports its training accuracy.

X, Y = df[dataset.feature_names], df.target
kNN1 = neighbors.KNeighborsClassifier(n_neighbors=1)
kNN1.fit(X, Y)

print("Training accuracy: {:0.3g}".format(kNN1.score(X, Y)))

YOUR ANSWER HERE

To avoid overly optimistic performance estimates, the following uses 10-fold cross validation to compute the accuracies of 1-NN trained on datasets with and without normalization.

cv = StratifiedKFold(n_splits=10, random_state=0, shuffle=True)

dfs = {"None": df, "Min-max": df_minmax_normalized}

acc = pd.DataFrame(columns=dfs.keys())
for norm in dfs:
    acc[norm] = cross_val_score(
        kNN1,
        dfs[norm].loc[:, lambda df: ~df.columns.isin(["target"])],
        # not [dataset.feature_names] since normalized features are renamed
        dfs[norm]["target"],
        cv=cv,
    )

acc.agg(["mean", "std"]).round(3)

The accuracies show that normalization improves the performance of 1-NN. More precisely, the accuracy improvement of $\sim 5\%$ appears statistically significant because it is at least twice the standard deviation of $\sim 2\%$.
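
One intuition for why normalization can matter for nearest-neighbor methods is that a feature with a large range dominates the Euclidean distance. The following toy sketch illustrates this with a hand-written 1-NN on made-up points. The helper names and the data are hypothetical, and the class labels are borrowed only for illustration.

```python
import math


def nearest_label(query, points, labels):
    """Return the label of the training point closest to query (1-NN)."""
    distances = [math.dist(query, p) for p in points]
    return labels[distances.index(min(distances))]


def minmax_columns(points):
    """Return per-column (min, max) pairs of a list of equal-length tuples."""
    return [(min(col), max(col)) for col in zip(*points)]


def rescale(point, bounds):
    """Min-max normalize one point using the given per-column bounds."""
    return tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(point, bounds))


# Hypothetical training points: (large-range feature, small-range feature)
points = [(1000.0, 0.10), (1020.0, 0.90), (2000.0, 0.12)]
labels = ["benign", "malignant", "benign"]
query = (1015.0, 0.11)

raw_prediction = nearest_label(query, points, labels)

bounds = minmax_columns(points)
scaled_points = [rescale(p, bounds) for p in points]
scaled_prediction = nearest_label(rescale(query, bounds), scaled_points, labels)

print(raw_prediction, scaled_prediction)  # malignant benign
```

Without scaling, the second feature barely affects the distance, so the query is matched to a point that differs greatly in that feature. After scaling, both features contribute on comparable terms and the nearest neighbor changes.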

%%ai chatgpt -f text
Explain and introduce the formulae for the paired t-test and the 
corrected resampled paired t-test.

%%ai chatgpt -f text
How to improve the statistical significance when comparing the 
performance of two learning algorithms?
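
For reference, the two test statistics mentioned in the prompts can be sketched as follows. This is a minimal illustration under stated assumptions: the per-fold accuracies and fold sizes are made up, the function names are hypothetical, and the correction follows the form commonly attributed to Nadeau and Bengio, in which the factor $1/k$ is replaced by $1/k + n_{\text{test}}/n_{\text{train}}$.

```python
import math
import statistics


def paired_t_statistic(diffs):
    """Standard paired t-statistic for per-fold accuracy differences."""
    k = len(diffs)
    return statistics.mean(diffs) / math.sqrt(statistics.variance(diffs) / k)


def corrected_resampled_t_statistic(diffs, n_train, n_test):
    """Corrected resampled t-statistic in the style of Nadeau and Bengio.

    The variance factor 1/k is inflated to 1/k + n_test/n_train to account
    for the overlap between the training sets of different folds.
    """
    k = len(diffs)
    scale = 1 / k + n_test / n_train
    return statistics.mean(diffs) / math.sqrt(scale * statistics.variance(diffs))


# Hypothetical per-fold accuracies of two classifiers on the same 10 folds
acc_a = [0.96, 0.95, 0.98, 0.93, 0.97, 0.95, 0.96, 0.98, 0.94, 0.97]
acc_b = [0.91, 0.93, 0.92, 0.89, 0.93, 0.90, 0.92, 0.95, 0.88, 0.91]
diffs = [a - b for a, b in zip(acc_a, acc_b)]

# Hypothetical fold sizes for 10-fold cross validation on about 570 examples
t_plain = paired_t_statistic(diffs)
t_corrected = corrected_resampled_t_statistic(diffs, n_train=513, n_test=57)
print(f"paired t = {t_plain:.2f}, corrected t = {t_corrected:.2f}")
```

Both statistics are compared against a t-distribution with $k - 1$ degrees of freedom. The corrected statistic is always the smaller of the two in magnitude, so it is more conservative about declaring a difference significant.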

Data Leak

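The accuracies computed for the normalizations above suffer from a subtle issue that renders them overly optimistic: the normalization factors are computed from the entire dataset, which includes the test examples of every fold.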

This issue can be resolved by computing the normalization factors from the training set instead of the entire dataset. To do so, we will create a pipeline using the following:

from sklearn import preprocessing
from sklearn.pipeline import make_pipeline

  • Like the filtered classifier in Weka, sklearn.pipeline provides the function make_pipeline to combine a filter with a classifier.
  • sklearn.preprocessing provides different filters for preprocessing features, e.g., StandardScaler and MinMaxScaler for standard and min-max normalization, respectively.

Creating a pipeline is especially useful for cross validation, where the normalization factors must be recomputed for each fold.
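
To make the per-fold recomputation concrete, here is a standard-library sketch of what happens conceptually in each fold: the factors are learned from the training part only and then applied to the held-out part. The data, the fold assignment, and the helper names are illustrative assumptions.

```python
def fit_minmax(train):
    """Learn the normalization factors (min and max) from training values only."""
    return min(train), max(train)


def apply_minmax(values, lo, hi):
    """Normalize values with factors learned elsewhere."""
    return [(v - lo) / (hi - lo) for v in values]


values = [3.0, 8.0, 1.0, 9.0, 5.0, 12.0]  # hypothetical feature values
n_folds = 3

held_out_per_fold = []
for fold in range(n_folds):
    test = values[fold::n_folds]  # every n_folds-th value is held out
    train = [v for i, v in enumerate(values) if i % n_folds != fold]
    lo, hi = fit_minmax(train)  # factors never see the held-out values
    held_out_per_fold.append(apply_minmax(test, lo, hi))
    print(f"fold {fold}: min={lo}, max={hi}, held-out -> {held_out_per_fold[-1]}")
```

In the last fold the held-out values fall outside $[0, 1]$ because the training part did not contain the overall minimum and maximum. This is expected when no information leaks from the held-out data.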

kNN1_standard_normalized = make_pipeline(preprocessing.StandardScaler(), kNN1)
acc["Standard"] = cross_val_score(kNN1_standard_normalized, X, Y, cv=cv)
acc["Standard"].agg(["mean", "std"]).round(3)

# YOUR CODE HERE
raise NotImplementedError
acc["Min-max"].agg(["mean", "std"]).round(5)

%%ai chatgpt -f text
Is there a data leak if I manually preprocess the data to maximize the 
classification accuracy?

Decision Regions

Since sklearn does not provide any function to plot the decision regions of a classifier, we provide the function plot_decision_regions in a module util defined in util.py of the current directory:

from util import plot_decision_regions
plot_decision_regions?
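
The internals of plot_decision_regions are not shown here. A common way to obtain decision regions, which may differ from what the util module does, is to classify every node of a grid that covers the feature plane. The sketch below does this for a hand-written 1-NN on made-up points and prints the result as text; all names and data are hypothetical.

```python
import math

# Hypothetical 2-D training points with two class labels
points = [(1.0, 1.0), (2.0, 1.5), (4.0, 4.0), (5.0, 3.5)]
labels = ["a", "a", "b", "b"]


def predict_1nn(query):
    """Label of the training point nearest to query."""
    nearest = min(range(len(points)), key=lambda i: math.dist(query, points[i]))
    return labels[nearest]


def decision_grid(n=12, lo=0.0, hi=6.0):
    """Classify every node of an n-by-n grid covering [lo, hi] x [lo, hi]."""
    step = (hi - lo) / (n - 1)
    rows = []
    for r in range(n):
        y = hi - r * step  # first row is the top of the plot
        rows.append("".join(predict_1nn((lo + c * step, y)) for c in range(n)))
    return rows


grid = decision_grid()
print("\n".join(grid))
```

A finer grid (larger `n`) gives a sharper picture of the boundary at the cost of more predictions.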

The following plots the decision region for a selected pair of input features.

if not os.getenv(
    "NBGRADER_EXECUTION"
):
    fig, ax = plt.subplots(nrows=1, ncols=1, clear=True, figsize=(10, 10), layout="constrained", num=6, sharey=True)
    @interact(
        normalization=["None", "Min-max", "Standard"],
        feature1=dataset.feature_names,
        feature2=dataset.feature_names,
        k=widgets.IntSlider(1, 1, 5, continuous_update=False),
        resolution=widgets.IntSlider(1, 1, 4, continuous_update=False),
    )
    def decision_regions_kNN(
        normalization,
        feature1=dataset.feature_names[0],
        feature2=dataset.feature_names[1],
        k=1,
        resolution=1,
    ):
        scaler = {
            "Min-max": preprocessing.MinMaxScaler,
            "Standard": preprocessing.StandardScaler,
        }
        kNN = neighbors.KNeighborsClassifier(n_neighbors=k)
        if normalization != "None":
            kNN = make_pipeline(scaler[normalization](), kNN)
        kNN.fit(df[[feature1, feature2]].to_numpy(), df.target.to_numpy())
        ax.clear()
        plot_decision_regions(
            df[[feature1, feature2]], df.target, kNN, N=resolution * 100,
            ax=ax
        )
        ax.set_title("Decision region for {}-NN".format(k))
        ax.set_xlabel(feature1)
        ax.set_ylabel(feature2)
        plt.show()

Interact with the widgets to:

  • Learn the effect on the decision regions/boundaries with different normalizations and choices of $k$.
  • Learn to draw the decision boundaries for 1-NN with min-max normalization.

if not os.getenv(
    "NBGRADER_EXECUTION"
):
    fig, ax = plt.subplots(nrows=1, ncols=1, clear=True, figsize=(10, 10), layout="constrained", num=7, sharey=True)
    @interact(
        normalization=["None", "Min-max", "Standard"],
        feature1=dataset.feature_names,
        feature2=dataset.feature_names,
        resolution=widgets.IntSlider(1, 1, 4, continuous_update=False),
    )
    def decision_regions_DT(
        normalization,
        feature1=dataset.feature_names[0],
        feature2=dataset.feature_names[1],
        resolution=1,
    ):
        scaler = {
            "Min-max": preprocessing.MinMaxScaler,
            "Standard": preprocessing.StandardScaler,
        }
        # YOUR CODE HERE
        raise NotImplementedError
        ax.clear()
        plot_decision_regions(
            df[[feature1, feature2]], df.target, DT, N=resolution * 100,
            ax=ax
        )
        ax.set_title("Decision region for Decision Tree")
        ax.set_xlabel(feature1)
        ax.set_ylabel(feature2)
        plt.show()

YOUR ANSWER HERE

%%ai chatgpt -f text
Is it possible to visualize the decision region of a classifier on 
high-dimensional data?

References
  1. William Wolberg, O. M. (1993). Breast Cancer Wisconsin (Diagnostic). UCI Machine Learning Repository. 10.24432/C5DW2B
  2. Street, W. N., Wolberg, W. H., & Mangasarian, O. L. (1993). Nuclear feature extraction for breast tumor diagnosis. In R. S. Acharya & D. B. Goldgof (Eds.), Biomedical Image Processing and Biomedical Visualization. SPIE. 10.1117/12.148698
  3. Nadeau, C., & Bengio, Y. (2003). Inference for the Generalization Error. Machine Learning, 52(3), 239–281. 10.1023/a:1024068626366