import os
import logging
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython import display
%matplotlib widget
if not os.getenv("NBGRADER_EXECUTION"):
    %load_ext jupyter_ai
    %ai update chatgpt dive:chat
    # %ai update chatgpt dive-azure:gpt4o
In this notebook, you will use Weka to compare different classifiers trained using different algorithms and parameters.
Noise Curve
Complete the tutorial exercises in [Witten11] Ex 17.2.6 to 17.2.11 using the dataset glass.arff described at the beginning of [Witten11] 17.2.
The video below demonstrates how to use the Explorer interface to train a 3-Nearest-Neighbor (3-NN) classifier with 50% of the training data corrupted by noise.
Weka provides a convenient interface, called the Experimenter, to compare the performance of different classification algorithms on different datasets. This is demonstrated in the video below.
A more flexible way is to use python-weka-wrapper3. Start the Java Virtual Machine and load the glass.arff dataset:
import weka.core.jvm as jvm
from weka.classifiers import Classifier, Evaluation, FilteredClassifier
from weka.core.classes import Random
from weka.core.converters import Loader
from weka.filters import Filter
jvm.start(logging_level=logging.ERROR)  # start the JVM, showing only errors from Weka
loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_url(
    "https://raw.githubusercontent.com/Waikato/weka-3.8/master/wekadocs/data/"
    + "glass.arff"
)
data.class_is_last()  # use the last attribute (the glass type) as the class
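As a quick sanity check that the dataset loaded correctly, you can inspect a few properties of the Instances object. A minimal sketch; the value in the first comment is what glass.arff should give:
print(data.num_instances)  # number of examples (214 for glass.arff)
print(data.num_attributes)  # number of attributes, including the class
print(data.class_attribute.name)  # name of the class attribute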
We can then create a filtered classifier that corrupts the training data with noise before training a nearest-neighbor classifier:
add_noise = Filter(classname="weka.filters.unsupervised.attribute.AddNoise")  # randomly changes a percentage of class values
IBk = Classifier(classname="weka.classifiers.lazy.IBk")  # k-nearest-neighbor classifier
fc = FilteredClassifier()  # applies the filter to the training data only, not the test data
fc.filter = add_noise
fc.classifier = IBk
To compute the 10-fold cross-validated accuracy of 3-NN classification with 50% noise:
add_noise.options = [
    "-P", str(50),  # percentage noise
    "-S", str(0),  # random seed
]
IBk.options = ["-K", str(3)] # number of nearest neighbors
evl = Evaluation(data)
evl.crossvalidate_model(fc, data, 10, Random(1))  # 10-fold cross-validation with seed 1
evl.percent_correct  # accuracy in percent
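Besides the overall accuracy, the Evaluation object exposes further statistics that help interpret the result. An optional sketch using two other members of python-weka-wrapper3's Evaluation API:
print(evl.summary())  # overall statistics such as kappa and error rates
print(evl.confusion_matrix)  # rows are actual classes, columns are predicted classes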
noise_df = pd.DataFrame(columns=["k=1", "k=3", "k=5"], dtype=float)
noise_df.insert(0, "Percentage Noise", np.arange(0, 101, 10))
# YOUR CODE HERE
raise NotImplementedError
display.display(noise_df.round(2))
plt.figure(num=1, figsize=(8, 5), clear=True)
for k in ["1", "3", "5"]:
    plt.plot(
        noise_df["Percentage Noise"], noise_df["k=" + k], label="k=" + k, marker="o"
    )
plt.legend()
plt.xlabel("Percentage Noise")
plt.ylabel("Accuracy (%)")
plt.title(r"Training IB$k$ on noisy data for different $k$")
plt.show()
%%ai chatgpt -f text
Explain how the noise curve can show whether a learning algorithm is prone
to overfitting.
YOUR ANSWER HERE
YOUR ANSWER HERE
%%ai chatgpt -f text
Is it possible to overfit even when the training data has no noise, where
noise is defined as irregularities irrelevant to the general pattern?
Training Curve
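The exercise below plots the training curve, i.e., the cross-validated accuracies of IBk and J48 trained on increasing fractions of the training set. J48, Weka's implementation of the C4.5 decision-tree learner, can be instantiated like IBk above; a minimal sketch with the options left at their defaults:
J48 = Classifier(classname="weka.classifiers.trees.J48")  # C4.5 decision tree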
train_df = pd.DataFrame(columns=["IBk", "J48"], dtype=float)
train_df.insert(0, "Percentage of Training Set", np.arange(10, 101, 10))
# YOUR CODE HERE
raise NotImplementedError
display.display(train_df.round(2))
plt.figure(num=3, figsize=(8, 5), clear=True)
for clf in ["IBk", "J48"]:
    plt.plot(
        train_df["Percentage of Training Set"], train_df[clf], label=clf, marker="o"
    )
plt.legend()
plt.xlabel("Percentage of Training Set")
plt.ylabel("Accuracy (%)")
plt.title(r"Training IB$k$ and J48 with different amounts of data")
plt.show()
%%ai chatgpt -f text
Explain how the training curve can show whether a learning algorithm is prone
to underfitting.
YOUR ANSWER HERE
YOUR ANSWER HERE
YOUR ANSWER HERE
%%ai chatgpt -f text
Is it always possible to find the best fit (model) for given training data?
Classification Boundaries
Complete the tutorial exercises in [Witten11] Ex 17.3.1 to 17.3.6 using the boundary visualizer for different classifiers on the iris.2D.arff dataset (NOT iris.arff).
For OneR, note that the boundary is decided based on two conditions in Appendix A of [Holte93]:
- (3a) the minimum size of the optimal class should be at least minBucketSize, and
- (3b) the optimal class of the smallest attribute value just above the boundary should be different from the optimal class just below the boundary.
Figure 1: OneR decision boundary
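To relate condition (3a) to the boundary you see, you can vary OneR's minimum bucket size and inspect the resulting rule. A minimal sketch, assuming iris.2D.arff sits in the same wekadocs/data directory as glass.arff and the JVM from the earlier cells is still running:
# load the 2-feature iris dataset (assumed URL, mirroring the glass.arff cell above)
iris2d = loader.load_url(
    "https://raw.githubusercontent.com/Waikato/weka-3.8/master/wekadocs/data/"
    + "iris.2D.arff"
)
iris2d.class_is_last()
# "-B" sets minBucketSize, the minimum size in condition (3a); 6 is Weka's default
oneR = Classifier(classname="weka.classifiers.rules.OneR", options=["-B", "6"])
oneR.build_classifier(iris2d)
print(oneR)  # prints the single rule, including the chosen intervals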
%%ai chatgpt -f text
Explain the following two rules for deciding how values are partitioned into intervals, so that every interval satisfies these constraints:
- (a) there is at least one class that is "optimal" for more than SMALL of the values in the interval. This constraint does not apply to the rightmost interval.
- (b) If $V[I]$ is the smallest value for attribute $A$ in the training set that is larger than the values in interval $I$ then there is no class $C$ that is optimal both for $V[I]$ and for interval $I$.
YOUR ANSWER HERE
YOUR ANSWER HERE
YOUR ANSWER HERE
YOUR ANSWER HERE
YOUR ANSWER HERE
YOUR ANSWER HERE
%%ai chatgpt -f text
How does Weka's BoundaryVisualizer plot the decision boundaries, especially
when there are more than two input features?