Hierarchical Clustering with Weka

import os
import matplotlib.pyplot as plt
import numpy as np
from Bio import Phylo

%matplotlib widget
if not os.getenv(
    "NBGRADER_EXECUTION"
):
    %load_ext jupyter_ai
    %ai update chatgpt dive:chat
    # %ai update chatgpt dive-azure:gpt4o

In this notebook, we will further explore clustering the instances in the Iris 2D dataset, deliberately excluding the class attribute. Specifically, we will employ the hierarchical clustering algorithm known as agglomerative nesting (AGNES), utilizing various cluster distance measures.

Classes-to-clusters evaluation¶

Similar to the last tutorial, use the Cluster panel to cluster the iris.2D dataset:

Using the Preprocess panel, load iris.2D.arff from the Weka data folder.
Using the Cluster panel, choose the Clusterer as HierarchicalClusterer.
The default number of clusters is $k=2$ . Change it to $k=3$ instead, i.e., set numClusters to 3.
Select Classes to clusters evaluation as the cluster mode.
Click Start to run the clustering algorithm.

Unlike other cluster modes, the class attribute is ignored automatically for Classes to clusters evaluation, so there is no need to click the ignore attributes button below the Cluster mode to ignore class.

Single-linkage method¶

The default linkage criterion (linkType) is SINGLE, which means that the hierarchical clustering algorithm iteratively links/merges two clusters $C$ and $C'$ into one to minimize the cluster distance defined as

\begin{align} \op{dist}(C, C'):=\min_{\M{p}\in C, \M{p'}\in C'} \op{dist}(\M{p}, \M{p'}), \end{align}

(1)

namely, the minimum distance between points in different clusters. This gives the single-linkage clustering solution.

# YOUR CODE HERE
raise NotImplementedError
error_rate

# hidden tests

YOUR ANSWER HERE

%%ai chatgpt -f text
What are the pros and cons of single-linkage method as compared to 
the k-means clustering?

Complete-linkage method¶

Repeat the hierarchical clustering procedure with linkType set to COMPLETE. The distance between two clusters is now measured by

\begin{align} \op{dist}(C, C'):=\max_{\M{p}\in C, \M{p'}\in C'} \op{dist}(\M{p}, \M{p'}), \end{align}

(2)

which is the maximum distance between points in different clusters. The measure gives the complete-linkage clustering solution.

# YOUR CODE HERE
raise NotImplementedError
error_rate

# hidden tests

YOUR ANSWER HERE

%%ai chatgpt -f text
What are the pros and cons of the single-linkage method as compared to 
the complete-linkage method?

%%ai chatgpt -f text
What are the pros and cons of the complete-linkage method as compared to 
the k-means clustering?

Dendrogram¶

To visualize the dendrogram in Weka,

right-click the clustering result buffer, and
select Visualize tree.

You should see the dendrogram below for the complete linkage method:

Complete-linkage dendrogram (first cluster only)

There are two issues with the dendrogram:

The dendrogram is incomplete and contains nodes only for the first cluster. The agglomerative clustering stopped immediately after having numClusters = 3 clusters.
The leaf nodes have names that are fractions. By default, the names are taken from the last attribute other than the ignored class attribute. In this case, a name such as 0.2 is a petalwidth, which is not helpful in identifying the samples or evaluating the clustering solution.

To fix the above issues:

Set numClusters to 1 so that the agglomerative clustering continues to merge all nodes into a single cluster.
In the preprocess panel, create a new attribute, ID, as the last attribute to name the leaf node uniquely. You can copy the following configuration into the Filter panel:
```
weka.filters.MultiFilter -F "weka.filters.unsupervised.attribute.AddID -C last -N ID" -F "weka.filters.unsupervised.attribute.NumericToNominal -R last" -F "weka.filters.unsupervised.attribute.NominalToString -C last"
```
Note
The above uses MultiFilter to compose three filters sequentially into one meta-filter:
1. AddID creates a unique numeric ID for each sample.
2. NumericToNominal followed by NominalToString converts the ID to a string. The conversion ensures that the clustering algorithm uses it to name the leaf but not in the calculation of cluster distances.
In the Clusterer Panel, the class attribute may be changed to ID. Switch it back to Class instead.

After applying the filter and rerunning the clustering algorithm with numClusters = 1 and class attribute ignored, the resulting dendrogram is as follows:

YOUR ANSWER HERE

The visualization is hard to read/analyze, especially for a large data set. The height is also not labeled.

Indeed, the dendrogram is printed in the result buffer as a phylogenetic tree in the Newick format with distances and leaf names. E.g., the dendrogram obtained by the complete linkage method is printed as:

Cluster 0
(((((((((((1:0,2:0):0,29:0):0,(5:0,9:0):0):0,((34:0,48:0):0,50:0):0):0.01695,((3:0,43:0):0,(37:0,39:0):0):0.01695):0.01695,((4:0,(8:0,11:0):0):0,(28:0,(40:0,49:0):0):0):0.0339):0.01982,(((7:0,(18:0,46:0):0):0.01695,(41:0,42:0):0.01695):0.01695,20:0.0339):0.01982):0.03625,(((10:0,38:0):0,(33:0,35:0):0):0.01695,13:0.01695):0.07301):0.01746,(14:0.04498,((15:0,36:0):0.0339,23:0.0339):0.01108):0.06245):0.11748,(((((6:0.01695,27:0.01695):0.01695,((16:0,22:0):0,32:0):0.0339):0.0339,17:0.0678):0.02982,(24:0.04498,44:0.04498):0.05264):0.07663,((((((((12:0,26:0):0,30:0):0,31:0):0,47:0):0.01695,21:0.01695):0.02803,19:0.04498):0.00873,25:0.05371):0.04391,45:0.09762):0.07663):0.05066):1.11921,(((((((((51:0,64:0):0.01695,77:0.01695):0.06263,(((56:0.01695,59:0.01695):0.01695,88:0.0339):0.01982,((66:0,76:0):0.0339,92:0.0339):0.01982):0.02586):0.01804,74:0.09762):0.04423,(((((54:0,72:0):0,90:0):0.01695,(89:0,100:0):0.01695):0.0339,((75:0,98:0):0.01695,(95:0,97:0):0.01695):0.0339):0.02873,(91:0.0339,96:0.0339):0.04568):0.06227):0.10672,(((((((((52:0,(79:0,85:0):0):0,67:0):0,69:0):0.01695,55:0.01695):0.01695,87:0.0339):0.01982,(57:0.0339,86:0.0339):0.01982):0.03625,107:0.08996):0.00766,62:0.09762):0.06153,(((53:0,73:0):0.0339,(120:0.01695,134:0.01695):0.01695):0.05114,(78:0.04498,84:0.04498):0.04006):0.07411):0.08942):0.14931,(((((71:0,(127:0,139:0):0):0.01695,(124:0,128:0):0.01695):0.04879,(((102:0,143:0):0.01695,147:0.01695):0.02803,150:0.04498):0.02076):0.04169,(((111:0.01695,114:0.01695):0.01695,122:0.0339):0.04568,(112:0.04498,148:0.04498):0.03459):0.02785):0.1693,(((104:0.01695,(117:0,138:0):0.01695):0.0678,(109:0.0339,126:0.0339):0.05085):0.09518,(130:0.08996,135:0.08996):0.08996):0.0968):0.12116):0.1883,((((58:0,94:0):0.0339,(61:0,80:0):0.0339):0.06054,99:0.09443):0.10278,((60:0.06574,65:0.06574):0.10434,(((63:0.01695,68:0.01695):0.05085,82:0.0678):0.02982,((70:0.01695,81:0.01695):0.03676,(83:0.01695,93:0.01695):0.03676):0.04391):0.07246):0.02714):0.38897):0.24263,(((((101:0.01695,110:0.01695):0.07301,(136:0.0339,144:0.0339):0.05607):0.01746,((137:0,141:0):0.04498,145:0.04498):0.06245):0.09715,((((103:0.0339,125:0.0339):0.05085,((113:0.01695,129:0.01695):0.01695,140:0.0339):0.05085):0.01288,((105:0.0339,133:0.0339):0.01108,121:0.04498):0.05264):0.0868,(115:0.06574,(((116:0.01695,146:0.01695):0.01695,142:0.0339):0.01695,149:0.05085):0.01489):0.11868):0.02016):0.1177,(((106:0.04498,118:0.04498):0.05264,119:0.09762):0.13421,((108:0.05371,131:0.05371):0.05619,(123:0.05085,132:0.05085):0.05905):0.12193):0.09046):0.50654):0.5153)

For the above to be a proper newick format,

Cluster 0 should be removed as it is not part of the Newick format; and
a semi-colon ; should be added to the end to complete the definition of a tree.

The text file complete.tree fixes the above issues:

The tree can be loaded using:

from Bio import Phylo

cl_tree = Phylo.read("complete.tree", "newick")


def plot_tree(tree, **kwargs):
    Phylo.draw(
        tree,
        branch_labels=lambda c: (l := c.branch_length)
        and l > 0.1
        and f"{l:.4g}"
        or None,
        **kwargs,
    )
    plt.gcf().set_size_inches(10, 20)
    plt.title("Phylogenetic tree")
    plt.tight_layout()


plot_tree(cl_tree)

To obtain the 3 clusters returned by Weka with $k=3$ :

cl_clusters = [cl_tree.clade[0], cl_tree.clade[1, 0], cl_tree.clade[1, 1]]
cl_clusters

To plot the clusters in different colors:

cl_tree.root.color = "grey"
for c, color in zip(cl_clusters, ["red", "blue", "green"]):
    c.color = color
plot_tree(cl_tree)

To obtain the leaf nodes for each of the clusters:

cl_clusters_leaves = [[i.name for i in c.get_terminals()] for c in cl_clusters]
for j, leaves in enumerate(cl_clusters_leaves):
    print(f'Cluster {j} with {len(leaves)} nodes: \n{" ".join(leaves)}\n')

To obtain the cophenetic distance of the root, i.e., the cluster containing all nodes, we can compute the total branch distance from the root to any leaf node, say node 1:

cl_d0 = cl_tree.distance("1")
cl_d0

The second largest cophenetic distance at which the root splits into 2 clusters is:

cl_d1 = cl_d0 - cl_tree.root.clades[1].branch_length
cl_d1

The minimum and maximum cophenetic distance at which we have 3 clusters is:

cl_max_distance, cl_min_distance = cl_d1, cl_d1 - min(
    c.branch_length for c in cl_clusters
)
cl_max_distance, cl_min_distance

plot_tree(
    cl_tree,
    axvspan=((cl_d0 - cl_max_distance, cl_d0 - cl_min_distance), {"facecolor": "0.8"}),
)

# YOUR CODE HERE
raise NotImplementedError
sl_max_distance, sl_min_distance

Source

# tests
assert np.isclose(sl_max_distance - sl_min_distance, 0.006629999999999997, rtol=1e-3)

# hidden tests

To obtain the largest cophenetic distance at which we have at least three clusters:

# YOUR CODE HERE
raise NotImplementedError

YOUR ANSWER HERE

%%ai chatgpt -f text
Explain how to measure the stability of a cluster in hierarchical clustering?