import logging
import os

if not os.getenv("NBGRADER_EXECUTION"):
    %load_ext jupyter_ai
    %ai update chatgpt dive:chat
    # %ai update chatgpt dive-azure:gpt4o

In this notebook, you will compete with your classmates and your machine by
- handcrafting a decision tree using Weka UserClassifier, and
- using python-weka-wrapper to build the J48 (C4.5) decision tree as a comparison.

Let’s find out who is the most intelligent!
Interactive Decision Tree Construction
import logging
import os

if not os.getenv("NBGRADER_EXECUTION"):
    import weka.core.jvm as jvm
    import weka.core.packages as packages

    # Start the JVM with Weka package support enabled
    jvm.start(packages=True, logging_level=logging.ERROR)

    pkg, version = "userClassifier", "1.0.2"
    if not packages.is_installed(pkg):
        print(f"Installing {pkg}...")
        packages.install_package(pkg, version=version)
        print("Done.")
    else:
        print(f"Skipping {pkg}, already installed.")
Follow the instructions above [Witten11] Ex 17.2.12 to
- install the package UserClassifier,
- hand-build a decision tree using segment-challenge.arff as the training set, and
- test the performance using segment-test.arff as the test set.
YOUR ANSWER HERE
YOUR ANSWER HERE
Get ready to dive into the thrilling world of decision trees! It’s time to showcase your data science prowess and outshine your classmates. Here’s what you need to do:
YOUR ANSWER HERE
YOUR ANSWER HERE
%%ai chatgpt -f text
I am in a competition to hand build the best decision tree using the
UserClassifier package of Weka. Can you describe in one paragraph how to use
the scatter plots to find pairs of attributes to split? I cannot do detailed
calculations. How to avoid overfitting?
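As a rough illustration of what a hand-built tree amounts to, the sketch below writes one as plain Python rules: a rectangular region selected on a scatter plot of two attributes, followed by a threshold on a third. The attribute names (`attr_a`, `attr_b`, `attr_c`), the class labels, and all thresholds are hypothetical placeholders, not values from segment-challenge.arff, and UserClassifier itself is operated through its GUI rather than through code like this.

```python
def in_rectangle(x, y, x_min, x_max, y_min, y_max):
    """Return True if the point (x, y) lies inside the axis-aligned rectangle."""
    return x_min <= x <= x_max and y_min <= y <= y_max


def classify(instance):
    """Classify one instance with a tiny hand-built tree (hypothetical thresholds)."""
    # Node 1: a rectangle selected on the scatter plot of attr_a against attr_b
    if in_rectangle(instance["attr_a"], instance["attr_b"], 0.0, 1.0, 0.0, 1.0):
        return "class_1"
    # Node 2: among the remaining instances, a single threshold on attr_c
    if instance["attr_c"] > 5.0:
        return "class_2"
    # Everything else falls into the last leaf
    return "class_3"


examples = [
    {"attr_a": 0.5, "attr_b": 0.5, "attr_c": 9.0},
    {"attr_a": 2.0, "attr_b": 0.5, "attr_c": 9.0},
    {"attr_a": 2.0, "attr_b": 0.5, "attr_c": 1.0},
]
labels = [classify(e) for e in examples]
print(labels)
```

Each extra region fitted to a handful of training points makes the rules more specific to the training set, which is one way overfitting creeps in; a few large, clean regions are usually a safer choice.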
Python Weka Wrapper
To see if your hand-built classifier can beat the machine, use J48 (C4.5) to build a decision tree. Instead of using the Weka Explorer Interface, you will run Weka directly from the notebook using python-weka-wrapper3.
Because Weka is written in Java, we need to start the Java virtual machine first.
import weka.core.jvm as jvm
import logging
jvm.start(logging_level=logging.ERROR)
Loading dataset
To load the dataset, create an ArffLoader as follows:
from weka.core.converters import Loader
loader = Loader(classname="weka.core.converters.ArffLoader")
The loader has the method load_url to load data from the web, such as the Weka GitHub repository:
weka_data_path = (
    "https://raw.githubusercontent.com/Waikato/weka-3.8/master/wekadocs/data/"
)
trainset = loader.load_url(
    weka_data_path + "segment-challenge.arff"
)  # use load_file to load from file instead
For classification, we have to specify the class attribute. For instance, the method class_is_last mutates trainset to have the last attribute as the class attribute:
trainset.class_is_last()
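To make "the last attribute as the class attribute" concrete, here is a minimal sketch that reads the header of a tiny ARFF document in plain Python. The inline data (`toy`, `width`, `height`, `label`) is made up for illustration, and the parser is a simplification that ignores quoted names containing spaces; Weka's own ArffLoader is what the notebook actually relies on.

```python
# A made-up ARFF document, used only to illustrate the header layout
arff_text = """@relation toy
@attribute width numeric
@attribute height numeric
@attribute label {small,large}
@data
1.0,2.0,small
3.0,4.0,large
"""

relation = None
attributes = []
for line in arff_text.splitlines():
    parts = line.split(None, 2)  # keyword, name, rest of the declaration
    if not parts:
        continue
    keyword = parts[0].lower()
    if keyword == "@relation":
        relation = parts[1]
    elif keyword == "@attribute":
        attributes.append(parts[1])
    elif keyword == "@data":
        break  # the header ends where the data section starts

# By convention the class attribute is often the last one declared
class_attribute = attributes[-1]
print(relation, len(attributes), class_attribute)
```

In this toy file the relation name is `toy`, there are three attributes, and `label` is the one a call such as class_is_last would mark as the class.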
from weka.core.dataset import Instances
# YOUR CODE HERE
raise NotImplementedError
print(Instances.summary(testset))
# tests
assert testset.relationname == "segment"
assert testset.num_instances == 810
assert testset.num_attributes == 20
Training using J48
To train a decision tree using J48, we create the classifier and then apply the method build_classifier on the training set.
from weka.classifiers import Classifier
J48 = Classifier(classname="weka.classifiers.trees.J48")
J48.build_classifier(trainset)
J48
To visualize the tree, generate an SVG file:
import pygraphviz as pgv
from IPython.display import SVG
# Create a PyGraphviz AGraph object from the DOT data
pgv.AGraph(string=J48.graph).draw('J48tree.svg', prog='dot')
# Display the SVG file
SVG(filename="J48tree.svg")
How to edit the decision tree?
J48.graph is a piece of code written in a domain-specific language called DOT. You can save the DOT source instead of the rendered image, so that you can edit it further. To do so:
- Save the string to a text file such as J48tree.gv.
- Edit/preview it in vscode using the Graphviz Interactive Preview extension. To install the extension:
  - Run the command in a terminal: install-vscode-extension tintinweb.graphviz-interactive-preview@0.3.5
  - Reload the vscode window with the command > Developer: Reload Window.
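The saving step can be sketched with the standard library alone. The DOT text below is a small made-up stand-in so that the example runs without Weka; in the notebook you would assign `dot_source = J48.graph` instead. The variable names `dot_source` and `dot_path` are illustrative.

```python
from pathlib import Path

# Stand-in for the string returned by J48.graph (an assumption for illustration);
# in the notebook, use: dot_source = J48.graph
dot_source = """digraph J48Tree {
N0 [label="attribute-a"]
N0 -> N1 [label="<= 1.0"]
N1 [label="class-x" shape=box]
N0 -> N2 [label="> 1.0"]
N2 [label="class-y" shape=box]
}
"""

# Write the DOT source to a text file that can be edited by hand
dot_path = Path("J48tree.gv")
dot_path.write_text(dot_source, encoding="utf-8")

# Read the file back to confirm that it was saved intact
saved = dot_path.read_text(encoding="utf-8")
print(saved.splitlines()[0])
```

After editing the file, it should be possible to render it again with PyGraphviz by reading it back, for example with `pgv.AGraph(filename="J48tree.gv")` followed by the same `draw` call used earlier.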
There are also online editors available such as:
Evaluation
To evaluate the decision tree on the training set:
from weka.classifiers import Evaluation
J48train = Evaluation(trainset)
J48train.test_model(J48, trainset)
train_accuracy = J48train.percent_correct
print(f"Training accuracy: {train_accuracy:.4g}%")
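For intuition about the number printed above, the sketch below computes a percentage of correct predictions by hand. It assumes unweighted instances, uses made-up labels rather than the segment data, and the helper name `percent_correct` is chosen here only to mirror the Weka property; it is not the library's implementation.

```python
def percent_correct(actual, predicted):
    """Return the percentage of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one instance")
    hits = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * hits / len(actual)


# Made-up labels for illustration only (not taken from the segment data)
actual = ["sky", "grass", "path", "sky", "window"]
predicted = ["sky", "grass", "window", "sky", "window"]

accuracy = percent_correct(actual, predicted)
print(f"Accuracy: {accuracy:.4g}%")
```

Because the tree was fitted to the training set, accuracy measured on that same set tends to be optimistic; a separate test set gives a fairer comparison.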
# YOUR CODE HERE
raise NotImplementedError
print(f"Test accuracy: {test_accuracy:.4g}%")
YOUR ANSWER HERE
To stop the Java virtual machine, run the following line. To restart jvm, you must restart the kernel.
jvm.stop()