import logging
import os
import pprint
import numpy as np
import weka.core.jvm as jvm
import weka.core.packages as packages
from weka.classifiers import (
    Classifier,
    Evaluation,
    FilteredClassifier,
    SingleClassifierEnhancer,
)
from weka.core.classes import Random, complete_classname
from weka.core.converters import Loader
from weka.filters import Filter
if not os.getenv("NBGRADER_EXECUTION"):
    %load_ext jupyter_ai
    %ai update chatgpt dive:chat
    # %ai update chatgpt dive-azure:gpt4o
Setup¶
In this notebook, we will train classifiers properly on the skewed dataset for detecting microcalcifications in mammograms.
In particular, we will use the meta classifier ThresholdSelector and the filter SMOTE (Synthetic Minority Over-sampling Technique). These need to be installed as additional packages in WEKA. To do so, we have imported the packages module:
import weka.core.packages as packages
Package support must also be enabled when starting the Java virtual machine:
jvm.start(packages=True, logging_level=logging.ERROR)
The following prints the information of the packages we will install:
pkgs = ["thresholdSelector", "SMOTE"]
for item in packages.all_packages():
    if item.name in pkgs:
        pprint.pp(item.metadata)
You may install the packages using the Weka package manager. To install them in python-weka-wrapper3, run the following code:
for pkg in pkgs:
    if not packages.is_installed(pkg):
        print(f"Installing {pkg}...")
        packages.install_package(f"/data/pkgs/{pkg}.zip")
    else:
        print(f"Skipping {pkg}, already installed.")
else:
    print("Done.")
By default, packages are installed under your home directory ~/wekafiles/packages/:
!ls ~/wekafiles/packages
After restarting the kernel, check that the packages have been successfully installed using complete_classname, which is imported by:
from weka.core.classes import complete_classname
print(complete_classname("ThresholdSelector"))
print(complete_classname("SMOTE"))
print(packages.installed_packages())
We will use the same mammography dataset from OpenML and J48 as the base classifier. The following loads the dataset into the notebook:
loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_url("https://www.openml.org/data/download/52214/phpn1jVwe")
data.class_is_last()
pos_class = 1
clf = Classifier(classname="weka.classifiers.trees.J48")
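Before tuning anything, it is worth confirming how skewed the class distribution is. The following is a minimal check, assuming the class attribute is nominal, using attribute_stats from python-weka-wrapper3:
# Sketch: count instances per class to confirm the skew
# (assumes the class attribute is nominal).
stats = data.attribute_stats(data.class_index)
print("Class counts:", stats.nominal_counts)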
Threshold Selector¶
The meta classifier ThresholdSelector uses the threshold-moving technique to optimize a performance measure you specify, which can be the precision, recall, F-score, etc.[1]
The following shows how to maximize recall:
tsc = SingleClassifierEnhancer(classname="weka.classifiers.meta.ThresholdSelector")
tsc.options = ["-M", "RECALL"]
tsc.classifier = clf
evl = Evaluation(data)
evl.crossvalidate_model(tsc, data, 10, Random(1))
print(f"maximum recall: {evl.recall(pos_class):.3g}")
The maximum recall is 100%, as expected, since the threshold can be moved to the extreme so that every instance is classified as positive.
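To see why recall can always be pushed to 100% by threshold moving, here is a minimal numpy sketch with hypothetical scores (illustration only, not the classifier's actual outputs): lowering the threshold on the positive-class probability trades precision for recall, and at the extreme every instance is predicted positive.
# Hypothetical positive-class probabilities and true labels (illustration only)
scores = np.array([0.9, 0.8, 0.6, 0.4, 0.3, 0.2])
labels = np.array([1, 0, 1, 1, 0, 0])
for threshold in [0.7, 0.5, 0.0]:
    pred = (scores >= threshold).astype(int)  # predict positive above the threshold
    tp = np.sum((pred == 1) & (labels == 1))
    recall = tp / np.sum(labels == 1)
    precision = tp / max(np.sum(pred == 1), 1)
    print(f"threshold={threshold}: recall={recall:.2f}, precision={precision:.2f}")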
# YOUR CODE HERE
raise NotImplementedError
max_precision, max_f
%%ai chatgpt -f text
For multi-class classification, how should the threshold moving scheme work?
Cost-sensitive Classifier¶
In addition to precision and recall, we can build a classifier to minimize a cost with specific weights on the numbers of correctly and incorrectly classified instances of the different classes:
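As a sketch of the objective (assuming Weka's convention that entry $C_{ij}$ of the cost matrix is the cost of predicting class $j$ for an instance whose actual class is $i$), the classifier is judged by the total cost
$$\text{total cost} = \sum_{i}\sum_{j} C_{ij}\, N_{ij},$$
where $N_{ij}$ is the number of instances of actual class $i$ that are predicted as class $j$.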
Weka provides a convenient interface for cost/benefit analysis:
- In the explorer interface, train J48 on the mammography dataset with 10-fold cross-validation.
- Right-click on the result in the result list.
- Choose Cost/Benefit analysis and 1 as the positive class value.
- Specify the cost matrix.
- Click Minimize Cost/Benefit to minimize the cost.
# YOUR CODE HERE
raise NotImplementedError
cost_matrix
The following test cell demonstrates how to train a meta classifier to minimize the cost defined using the cost matrix you provided.
# tests
csc = SingleClassifierEnhancer(
    classname="weka.classifiers.meta.CostSensitiveClassifier",
    options=[
        "-cost-matrix",
        "["
        + " ; ".join(
            " ".join(str(entry) for entry in cost_matrix[:, i]) for i in range(2)
        )
        + "]",
        "-S",
        "1",
    ],
)
csc.classifier = clf
evl = Evaluation(data)
evl.crossvalidate_model(csc, data, 10, Random(1))
precision = evl.precision(pos_class)
print(f"maximum precision: {precision:.3g}")
%%ai chatgpt -f text
For the cost-benefit analysis, is there an implementation that optimizes a more
general cost function, which may be non-linear with respect to the counts of
TP, FP, TN, and FN?
SMOTE¶
Synthetic Minority Over-sampling TEchnique (SMOTE) (Chawla et al., 2002) is a filter that up-samples the minority class. Instead of duplicating existing instances, it creates new samples as convex combinations of existing ones.[2]
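As a minimal illustration of the idea (a numpy sketch, not Weka's actual SMOTE implementation), a synthetic sample is placed on the line segment between a minority-class instance and one of its nearest minority-class neighbours:
# Sketch of SMOTE's interpolation step (illustration only)
rng = np.random.default_rng(0)
x = np.array([1.0, 2.0])            # a minority-class instance
neighbour = np.array([2.0, 3.0])    # one of its nearest minority-class neighbours
alpha = rng.random()                # random interpolation weight in [0, 1)
synthetic = x + alpha * (neighbour - x)  # convex combination of the two points
print(synthetic)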
smote = Filter(classname="weka.filters.supervised.instance.SMOTE")
print("Default smote.options:", smote.options)
# YOUR CODE HERE
raise NotImplementedError
print("Your smote.options:", smote.options)
# tests
fc = FilteredClassifier()
fc.filter = smote
fc.classifier = clf
evl = Evaluation(data)
evl.crossvalidate_model(fc, data, 10, Random(1))
f_score = evl.f_measure(pos_class)
print(f"F-score by SMOTE: {f_score:.3g}")
%%ai chatgpt -f text
In SMOTE, the synthetic data are generated from existing data using randomness
that is independent of the data, which should be regarded as noise. How can this
be better than simple upsampling? In other words, no new relevant information is
generated because no new data is collected.
- Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/jair.953