import os
import logging
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython import display

%matplotlib widget
if not os.getenv(
    "NBGRADER_EXECUTION"
):
    %load_ext jupyter_ai
    %ai update chatgpt dive:chat
    # %ai update chatgpt dive-azure:gpt4o

In this notebook, you will use Weka to compare different classifiers trained using different algorithms and parameters.

Noise Curve

Complete the tutorial exercises in [Witten11] Ex 17.2.6 to 17.2.11 using the dataset glass.arff described at the beginning of [Witten11] 17.2.

The video below demonstrates how to use the Explorer interface to train a 3-Nearest-Neighbor (3NN) classifier with 50% of the training data corrupted by noise.

Weka provides a convenient interface, called the Experimenter, to compare the performance of different classification algorithms on different datasets. This is demonstrated by the video below.

A more flexible way is to use python-weka-wrapper3. Start the Java Virtual Machine and load the glass.arff dataset:

import weka.core.jvm as jvm
from weka.classifiers import Classifier, Evaluation, FilteredClassifier
from weka.core.classes import Random
from weka.core.converters import Loader
from weka.filters import Filter

jvm.start(logging_level=logging.ERROR)
loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_url(
    "https://raw.githubusercontent.com/Waikato/weka-3.8/master/wekadocs/data/"
    + "glass.arff"
)
data.class_is_last()
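
It is worth a quick sanity check that the dataset loaded correctly before training anything:

# Basic properties of the loaded Instances object
print(data.num_instances, "instances,", data.num_attributes, "attributes")
print("class attribute:", data.class_attribute.name)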

We can then create a filtered classifier with the following tools:

from weka.classifiers import Classifier, Evaluation, FilteredClassifier
from weka.core.classes import Random
from weka.filters import Filter
add_noise = Filter(classname="weka.filters.unsupervised.attribute.AddNoise")  # corrupts a percentage of class labels
IBk = Classifier(classname="weka.classifiers.lazy.IBk")  # k-nearest-neighbor classifier
fc = FilteredClassifier()  # applies the filter to the training folds only, so the test folds stay clean
fc.filter = add_noise
fc.classifier = IBk
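
The filtered classifier behaves like any other classifier. As a quick check that the pieces fit together (training on the full dataset here only illustrates the API; accuracy estimates come from the cross-validation below):

fc.build_classifier(data)  # filter the training data, then train IBk on it
print(fc.classify_instance(data.get_instance(0)))  # predicted class index of the first instance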

To compute the 10-fold cross-validated accuracy of 3-NN classification with 50% noise:

add_noise.options = ["-P", str(50), "-S", str(0)]  # -P: percentage of noise, -S: random seed
IBk.options = ["-K", str(3)]  # -K: number of nearest neighbors
evl = Evaluation(data)
evl.crossvalidate_model(fc, data, 10, Random(1))
evl.percent_correct
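
The exercise repeats this evaluation over a grid of noise levels and values of k, so it helps to wrap the steps above in a helper. A minimal sketch (the function name cv_accuracy and its defaults are our own, not part of the assignment):

def cv_accuracy(noise_percent, k, folds=10, seed=1):
    """Return the cross-validated accuracy of k-NN with the given class noise."""
    add_noise.options = ["-P", str(noise_percent), "-S", str(0)]  # noise level, fixed filter seed
    IBk.options = ["-K", str(k)]  # number of nearest neighbors
    evl = Evaluation(data)
    evl.crossvalidate_model(fc, data, folds, Random(seed))
    return evl.percent_correct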
noise_df = pd.DataFrame(columns=["k=1", "k=3", "k=5"], dtype=float)
noise_df.insert(0, "Percentage Noise", np.arange(0, 101, 10))

# YOUR CODE HERE
raise NotImplementedError

display.display(noise_df.round(2))

plt.figure(num=1, figsize=(8, 5), clear=True)
for k in ["1", "3", "5"]:
    plt.plot(
        noise_df["Percentage Noise"], noise_df["k=" + k], label="k=" + k, marker="o"
    )
plt.legend()
plt.xlabel("Percentage Noise")
plt.ylabel("Accuracy (%)")
plt.title(r"Training IB$k$ on noisy data for different $k$")
plt.show()
%%ai chatgpt -f text
Explain how the noise curve can show whether a learning algorithm is prone
to overfitting.

YOUR ANSWER HERE

YOUR ANSWER HERE

%%ai chatgpt -f text
Is it possible to overfit even when the training data has no noise, where
noise is defined as irregularities irrelevant to the general pattern?

Training Curve
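
To plot the training curve, each classifier must be trained on an increasing fraction of the data. One possible approach (a sketch, not necessarily the intended solution; the variable names are our own) is to wrap the unsupervised Resample filter in a FilteredClassifier, so that only the training folds are subsampled during cross-validation:

resample = Filter(
    classname="weka.filters.unsupervised.instance.Resample",
    options=["-Z", "50", "-no-replacement", "-S", "0"],  # keep 50% of each training fold
)
J48 = Classifier(classname="weka.classifiers.trees.J48")  # C4.5 decision tree learner
fc_j48 = FilteredClassifier()
fc_j48.filter = resample
fc_j48.classifier = J48
evl_j48 = Evaluation(data)
evl_j48.crossvalidate_model(fc_j48, data, 10, Random(1))
evl_j48.percent_correct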

train_df = pd.DataFrame(columns=["IBk", "J48"], dtype=float)
train_df.insert(0, "Percentage of Training Set", np.arange(10, 101, 10))

# YOUR CODE HERE
raise NotImplementedError

display.display(train_df.round(2))

plt.figure(num=3, figsize=(8, 5), clear=True)
for clf in ["IBk", "J48"]:
    plt.plot(
        train_df["Percentage of Training Set"], train_df[clf], label=clf, marker="o"
    )
plt.legend()
plt.xlabel("Percentage of Training Set")
plt.ylabel("Accuracy (%)")
plt.title(r"Training IB$k$ and J48 with different amounts of data")
plt.show()
%%ai chatgpt -f text
Explain how the training curve can show whether a learning algorithm is prone
to underfitting.

YOUR ANSWER HERE

YOUR ANSWER HERE

YOUR ANSWER HERE

%%ai chatgpt -f text
Is it always possible to find the best fit (model) for a given training set?

Classification Boundaries

Complete the tutorial exercises in [Witten11] Ex 17.3.1 to 17.3.6 using the boundary visualizer for different classifiers on the iris.2D.arff dataset (NOT iris.arff).

For OneR, note that the boundary is decided based on two conditions in Appendix A of [Holte93], sketched in code after Figure 1:

  • (3a) Minimum size of the optimal class should be at least minBucketSize, and
  • (3b) the optimal class of the smallest attribute value just above the boundary should be different from the optimal class just below the boundary.
Figure 1: OneR decision boundary
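
To make conditions (3a) and (3b) concrete, the toy sketch below constructs the interval boundaries for a single numeric attribute. It assumes distinct attribute values, so the optimal class of a value is simply its label; Weka's OneR additionally handles ties and missing values.

from collections import Counter


def oner_boundaries(values, labels, min_bucket_size=6):
    """Greedily cut a sorted numeric attribute into intervals (toy version).

    A boundary is placed only when (3a) the current interval's optimal
    class covers at least min_bucket_size values (6 is Weka's default -B),
    and (3b) the label of the next value differs from that optimal class.
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    boundaries, counts = [], Counter()
    for (v, c), nxt in zip(pairs, pairs[1:] + [None]):
        counts[c] += 1
        optimal, size = counts.most_common(1)[0]
        if nxt is not None and size >= min_bucket_size and nxt[1] != optimal:
            boundaries.append((v + nxt[0]) / 2)  # cut midway between values
            counts = Counter()  # start a new interval
    return boundaries

For example, oner_boundaries([1, 2, 3, 10, 11, 12], list("aaabbb"), min_bucket_size=3) returns [6.5]: the first three values give class a an optimal count of 3, and the next value's class b differs, so a single cut is placed between 3 and 10.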

%%ai chatgpt -f text
Explain the following two rules for deciding how values are partitioned into intervals so that every interval satisfies these constraints:
- (a) there is at least one class that is "optimal" for more than SMALL of the values in the interval. This constraint does not apply to the rightmost interval. 
- (b) If $V[I]$ is the smallest value for attribute $A$ in the training set that is larger than the values in interval $I$ then there is no class $C$ that is optimal both for $V[I]$ and for interval $I$.

YOUR ANSWER HERE

YOUR ANSWER HERE

YOUR ANSWER HERE

YOUR ANSWER HERE

YOUR ANSWER HERE

YOUR ANSWER HERE

%%ai chatgpt -f text
How does Weka's BoundaryVisualizer plot the decision boundaries, especially
when there are more than two input features?