Man vs Machine

import os

if not os.getenv("NBGRADER_EXECUTION"):
    %load_ext jupyter_ai
    %ai update chatgpt dive:chat
    # %ai update chatgpt dive-azure:gpt4o

In this notebook, you will compete with your classmates and your machine by

handcrafting a decision tree using Weka UserClassifier, and
using python-weka-wrapper to build the J48 (C4.5) decision tree as a comparison.

Let’s find out who is the most intelligent!

Interactive Decision Tree Construction

import logging
import os

if not os.getenv("NBGRADER_EXECUTION"):
    import weka.core.jvm as jvm
    import weka.core.packages as packages

    jvm.start(packages=True, logging_level=logging.ERROR)
    pkg, version = "userClassifier", "1.0.2"
    if not packages.is_installed(pkg):
        print(f"Installing {pkg}...")
        packages.install_package(pkg, version=version)
        print("Done.")
    else:
        print(f"Skipping {pkg}, already installed.")

Follow the instructions above ([Witten11] Ex 17.2.12) to

install the package UserClassifier,
hand-build a decision tree using segment-challenge.arff as the training set, and
test the performance using segment-test.arff as the test set.

YOUR ANSWER HERE

Get ready to dive into the thrilling world of decision trees! It’s time to showcase your data science prowess and outshine your classmates. Here’s what you need to do:

Exercise 3

Include the model and result summary sections from the result buffer of your best hand-built decision tree. Your answer should look like:

=== Classifier model (full training set) ===

Split on ...

Time taken to build model: ...

=== Confusion Matrix ===

...

=== Summary ===

Correctly Classified Instances ...

Try your best to beat your classmates and the machines:

Build at least two decision trees and pick the best one.
Share your result (and knowledge) on the discussion page on Interactive Decision Tree Construction.
See if your classmates have posted better decision trees and give them a like if they have.

YOUR ANSWER HERE

%%ai chatgpt -f text
I am in a competition to hand build the best decision tree using the 
UserClassifier package of Weka. Can you describe in one paragraph how to use
the scatter plots to find pairs of attributes to split? I cannot do detailed
calculations. How to avoid overfitting?

Python Weka Wrapper

To see if your hand-built classifier can beat the machine, use J48 (C4.5) to build a decision tree. Instead of using the Weka Explorer Interface, you will run Weka directly from the notebook using python-weka-wrapper3.

Because Weka is written in Java, we need to start the Java virtual machine (JVM) first.

import weka.core.jvm as jvm
import logging

jvm.start(logging_level=logging.ERROR)

Loading dataset

To load the dataset, create an ArffLoader as follows:

from weka.core.converters import Loader

loader = Loader(classname="weka.core.converters.ArffLoader")

The loader has the method load_url to load data from the web, such as the Weka GitHub repository:

weka_data_path = (
    "https://raw.githubusercontent.com/Waikato/weka-3.8/master/wekadocs/data/"
)
trainset = loader.load_url(
    weka_data_path + "segment-challenge.arff"
)  # use load_file to load from file instead

For classification, we have to specify the class attribute. For instance, the method class_is_last mutates trainset to have the last attribute as the class attribute:

trainset.class_is_last()

from weka.core.dataset import Instances

# YOUR CODE HERE
raise NotImplementedError
print(Instances.summary(testset))


# tests
assert testset.relationname == "segment"
assert testset.num_instances == 810
assert testset.num_attributes == 20

# hidden tests

Training using J48

To train a decision tree using J48, we create the classifier and then apply the method build_classifier on the training set.

from weka.classifiers import Classifier

J48 = Classifier(classname="weka.classifiers.trees.J48")
J48.build_classifier(trainset)
J48
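To get a feel for what the machine optimizes when it picks a split, here is a minimal pure-Python sketch of the gain ratio criterion that C4.5 is based on. It is a simplification under stated assumptions: the toy data, the helper names `entropy` and `gain_ratio`, and the single numeric attribute are all hypothetical, and the real J48 implementation adds further refinements (such as pruning) that are not shown here.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold` on a numeric attribute."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    total = len(labels)
    weights = [len(left) / total, len(right) / total]
    # Information gain: drop in entropy after the split.
    info_gain = entropy(labels) - (
        weights[0] * entropy(left) + weights[1] * entropy(right)
    )
    # Split information: entropy of the partition sizes themselves.
    split_info = -sum(w * log2(w) for w in weights)
    return info_gain / split_info


# Hypothetical toy data: one numeric attribute and two classes.
values = [1.0, 2.0, 3.0, 4.0]
labels = ["sky", "sky", "grass", "grass"]
perfect = gain_ratio(values, labels, threshold=2.0)
uneven = gain_ratio(values, labels, threshold=1.0)
print(perfect, uneven)
```

The threshold that separates the two classes cleanly scores higher than the one that peels off a single instance, which is the same intuition you can apply by eye when choosing regions in the UserClassifier scatter plots.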

To visualize the tree, generate an SVG file:

import pygraphviz as pgv
from IPython.display import SVG

# Create a PyGraphviz AGraph object from the DOT data
pgv.AGraph(string=J48.graph).draw('J48tree.svg', prog='dot')

# Display the SVG file
SVG(filename="J48tree.svg")

How to edit the decision tree?

J48.graph is a string written in DOT, the graph description language used by Graphviz. You can save the DOT source instead of the rendered image, so that you can edit it further. To do so:

Save the string to a text file such as J48tree.gv.
Edit/preview it in VS Code using the Graphviz Interactive Preview extension. To install the extension:
1. Run the command in a terminal:
```
install-vscode-extension tintinweb.graphviz-interactive-preview@0.3.5
```
2. Reload the VS Code window with the command > Developer: Reload Window.
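Saving the DOT source to a file can be sketched as follows. The DOT text here is a hypothetical stand-in so that the snippet runs without the JVM; in the notebook you would assign `J48.graph` to `dot_source` instead.

```python
from pathlib import Path

# Stand-in DOT source (hypothetical); in the notebook, use J48.graph instead.
dot_source = """digraph J48Tree {
N0 [label="attribute-a"]
N1 [label="class-1" shape=box]
N2 [label="class-2" shape=box]
N0->N1 [label="<= 1.0"]
N0->N2 [label="> 1.0"]
}
"""

# Write the DOT text to a .gv file that can be opened and edited in VS Code.
gv_path = Path("J48tree.gv")
gv_path.write_text(dot_source, encoding="utf-8")
print(gv_path.read_text(encoding="utf-8"))
```

After editing the file, you should be able to render it the same way as before by passing `filename="J48tree.gv"` to `pgv.AGraph` in place of `string=J48.graph`.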

Online Graphviz editors are also available.

Evaluation

To evaluate the decision tree on the training set:

from weka.classifiers import Evaluation

J48train = Evaluation(trainset)
J48train.test_model(J48, trainset)
train_accuracy = J48train.percent_correct
print(f"Training accuracy: {train_accuracy:.4g}%")
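The percentage of correctly classified instances is the sum of the diagonal of the confusion matrix divided by the total number of instances. The following pure-Python sketch shows that calculation on a made-up matrix; the helper `percent_correct` and the numbers are hypothetical and only illustrate how the summary figure relates to the confusion matrix in the Weka output.

```python
def percent_correct(confusion):
    """Percentage of instances on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return 100.0 * correct / total


# Hypothetical 3-class confusion matrix: rows are actual classes,
# columns are predicted classes.
confusion = [
    [48, 2, 0],
    [3, 45, 2],
    [0, 5, 45],
]
accuracy = percent_correct(confusion)
print(f"Accuracy: {accuracy:.4g}%")
```

You can use the same arithmetic to double-check the summary of your hand-built tree against its confusion matrix.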

# YOUR CODE HERE
raise NotImplementedError
print(f"Test accuracy: {test_accuracy:.4g}%")

# hidden tests

YOUR ANSWER HERE

To stop the Java virtual machine, run the following line. To restart the JVM afterwards, you must restart the kernel.

jvm.stop()