import os
import matplotlib.pyplot as plt
import numpy as np
from Bio import Phylo
%matplotlib widget
if not os.getenv(
"NBGRADER_EXECUTION"
):
%load_ext jupyter_ai
%ai update chatgpt dive:chat
# %ai update chatgpt dive-azure:gpt4o
In this notebook, we will further explore clustering the instances in the Iris 2D dataset, deliberately excluding the class attribute. Specifically, we will employ the hierarchical clustering algorithm known as agglomerative nesting (AGNES), utilizing various cluster distance measures.
Classes-to-clusters evaluation¶
Similar to the last tutorial, use the Cluster
panel to cluster the iris.2D
dataset:
- Using the
Preprocess
panel, loadiris.2D.arff
from the Weka data folder. - Using the
Cluster
panel, choose theClusterer
asHierarchicalClusterer
. - The default number of clusters is . Change it to instead, i.e., set
numClusters
to 3. - Select
Classes to clusters evaluation
as thecluster mode
. - Click
Start
to run the clustering algorithm.
Unlike other cluster modes, the class attribute is ignored automatically for Classes to clusters evaluation
, so there is no need to click the ignore attributes
button below the Cluster mode
to ignore class
.
Single-linkage method¶
The default linkage criterion (linkType
) is SINGLE
, which means that the hierarchical clustering algorithm iteratively links/merges two clusters and into one to minimize the cluster distance defined as
namely, the minimum distance between points in different clusters. This gives the single-linkage clustering solution.
# YOUR CODE HERE
raise NotImplementedError
error_rate
YOUR ANSWER HERE
%%ai chatgpt -f text
What are the pros and cons of single-linkage method as compared to
the k-means clustering?
Complete-linkage method¶
Repeat the hierarchical clustering procedure with linkType
set to COMPLETE
. The distance between two clusters is now measured by
which is the maximum distance between points in different clusters. The measure gives the complete-linkage clustering solution.
Assign to error_rate
the fraction (NOT percentage) of incorrectly clustered instances.
# YOUR CODE HERE
raise NotImplementedError
error_rate
Visualize the clustering assignments again and explain whether the complete linkage solution is better than the single-linkage solution.
YOUR ANSWER HERE
%%ai chatgpt -f text
What are the pros and cons of the single-linkage method as compared to
the complete-linkage method?
%%ai chatgpt -f text
What are the pros and cons of the complete-linkage method as compared to
the k-means clustering?
Dendrogram¶
To visualize the dendrogram in Weka,
- right-click the clustering result buffer, and
- select
Visualize tree
.
You should see the dendrogram below for the complete linkage method:
There are two issues with the dendrogram:
- The dendrogram is incomplete and contains nodes only for the first cluster. The agglomerative clustering stopped immediately after having
numClusters = 3
clusters. - The leaf nodes have names that are fractions. By default, the names are taken from the last attribute other than the ignored class attribute. In this case, a name such as 0.2 is a
petalwidth
, which is not helpful in identifying the samples or evaluating the clustering solution.
To fix the above issues:
Set
numClusters
to 1 so that the agglomerative clustering continues to merge all nodes into a single cluster.In the preprocess panel, create a new attribute,
ID
, as the last attribute to name the leaf node uniquely. You can copy the following configuration into theFilter
panel:weka.filters.MultiFilter -F "weka.filters.unsupervised.attribute.AddID -C last -N ID" -F "weka.filters.unsupervised.attribute.NumericToNominal -R last" -F "weka.filters.unsupervised.attribute.NominalToString -C last"
NoteThe above uses
MultiFilter
to compose three filters sequentially into one meta-filter:AddID
creates a unique numeric ID for each sample.NumericToNominal
followed byNominalToString
converts the ID to a string. The conversion ensures that the clustering algorithm uses it to name the leaf but not in the calculation of cluster distances.
In the
Clusterer
Panel, the class attribute may be changed toID
. Switch it back toClass
instead.
After applying the filter and rerunning the clustering algorithm with numClusters = 1
and class
attribute ignored, the resulting dendrogram is as follows:
For the single-linkage method, give the dendrogram where
- the root is the cluster of all nodes, and
- each leaf name a unique ID.
If you are running Weka via the remote desktop app, you should temporarily increase the resolution, e.g., Application->Settings->Display->3200x1800
.
YOUR ANSWER HERE
The visualization is hard to read/analyze, especially for a large data set. The height is also not labeled.
Indeed, the dendrogram is printed in the result buffer as a phylogenetic tree in the Newick format with distances and leaf names. E.g., the dendrogram obtained by the complete linkage method is printed as:
Cluster 0
(((((((((((1:0,2:0):0,29:0):0,(5:0,9:0):0):0,((34:0,48:0):0,50:0):0):0.01695,((3:0,43:0):0,(37:0,39:0):0):0.01695):0.01695,((4:0,(8:0,11:0):0):0,(28:0,(40:0,49:0):0):0):0.0339):0.01982,(((7:0,(18:0,46:0):0):0.01695,(41:0,42:0):0.01695):0.01695,20:0.0339):0.01982):0.03625,(((10:0,38:0):0,(33:0,35:0):0):0.01695,13:0.01695):0.07301):0.01746,(14:0.04498,((15:0,36:0):0.0339,23:0.0339):0.01108):0.06245):0.11748,(((((6:0.01695,27:0.01695):0.01695,((16:0,22:0):0,32:0):0.0339):0.0339,17:0.0678):0.02982,(24:0.04498,44:0.04498):0.05264):0.07663,((((((((12:0,26:0):0,30:0):0,31:0):0,47:0):0.01695,21:0.01695):0.02803,19:0.04498):0.00873,25:0.05371):0.04391,45:0.09762):0.07663):0.05066):1.11921,(((((((((51:0,64:0):0.01695,77:0.01695):0.06263,(((56:0.01695,59:0.01695):0.01695,88:0.0339):0.01982,((66:0,76:0):0.0339,92:0.0339):0.01982):0.02586):0.01804,74:0.09762):0.04423,(((((54:0,72:0):0,90:0):0.01695,(89:0,100:0):0.01695):0.0339,((75:0,98:0):0.01695,(95:0,97:0):0.01695):0.0339):0.02873,(91:0.0339,96:0.0339):0.04568):0.06227):0.10672,(((((((((52:0,(79:0,85:0):0):0,67:0):0,69:0):0.01695,55:0.01695):0.01695,87:0.0339):0.01982,(57:0.0339,86:0.0339):0.01982):0.03625,107:0.08996):0.00766,62:0.09762):0.06153,(((53:0,73:0):0.0339,(120:0.01695,134:0.01695):0.01695):0.05114,(78:0.04498,84:0.04498):0.04006):0.07411):0.08942):0.14931,(((((71:0,(127:0,139:0):0):0.01695,(124:0,128:0):0.01695):0.04879,(((102:0,143:0):0.01695,147:0.01695):0.02803,150:0.04498):0.02076):0.04169,(((111:0.01695,114:0.01695):0.01695,122:0.0339):0.04568,(112:0.04498,148:0.04498):0.03459):0.02785):0.1693,(((104:0.01695,(117:0,138:0):0.01695):0.0678,(109:0.0339,126:0.0339):0.05085):0.09518,(130:0.08996,135:0.08996):0.08996):0.0968):0.12116):0.1883,((((58:0,94:0):0.0339,(61:0,80:0):0.0339):0.06054,99:0.09443):0.10278,((60:0.06574,65:0.06574):0.10434,(((63:0.01695,68:0.01695):0.05085,82:0.0678):0.02982,((70:0.01695,81:0.01695):0.03676,(83:0.01695,93:0.01695):0.03676):0.04391):0.07246):0.02714):0.38897):0.24263,(((((101:0.01695,110:0.01695):0.07301,(136:0.0339,144:0.0339):0.05607):0.01746,((137:0,141:0):0.04498,145:0.04498):0.06245):0.09715,((((103:0.0339,125:0.0339):0.05085,((113:0.01695,129:0.01695):0.01695,140:0.0339):0.05085):0.01288,((105:0.0339,133:0.0339):0.01108,121:0.04498):0.05264):0.0868,(115:0.06574,(((116:0.01695,146:0.01695):0.01695,142:0.0339):0.01695,149:0.05085):0.01489):0.11868):0.02016):0.1177,(((106:0.04498,118:0.04498):0.05264,119:0.09762):0.13421,((108:0.05371,131:0.05371):0.05619,(123:0.05085,132:0.05085):0.05905):0.12193):0.09046):0.50654):0.5153)
For the above to be a proper newick format,
Cluster 0
should be removed as it is not part of the Newick format; and- a semi-colon
;
should be added to the end to complete the definition of a tree.
The text file complete.tree
fixes the above issues:
The tree can be loaded using:
from Bio import Phylo
cl_tree = Phylo.read("complete.tree", "newick")
def plot_tree(tree, **kwargs):
Phylo.draw(
tree,
branch_labels=lambda c: (l := c.branch_length)
and l > 0.1
and f"{l:.4g}"
or None,
**kwargs,
)
plt.gcf().set_size_inches(10, 20)
plt.title("Phylogenetic tree")
plt.tight_layout()
plot_tree(cl_tree)
Different from a dendrogram which plots the tree against cophenetic distance, the above plots the phylogenetic tree against the branch length from the root.
To obtain the 3 clusters returned by Weka with :
cl_clusters = [cl_tree.clade[0], cl_tree.clade[1, 0], cl_tree.clade[1, 1]]
cl_clusters
The longer the branch length, the more stable the cluster is, as a longer branch length means the cluster is less easy to split up or merge with other clusters.
To plot the clusters in different colors:
cl_tree.root.color = "grey"
for c, color in zip(cl_clusters, ["red", "blue", "green"]):
c.color = color
plot_tree(cl_tree)
To obtain the leaf nodes for each of the clusters:
cl_clusters_leaves = [[i.name for i in c.get_terminals()] for c in cl_clusters]
for j, leaves in enumerate(cl_clusters_leaves):
print(f'Cluster {j} with {len(leaves)} nodes: \n{" ".join(leaves)}\n')
The larger the cluster, the more stable it is as such a cluster is supported by many similar instances.
To obtain the cophenetic distance of the root, i.e., the cluster containing all nodes, we can compute the total branch distance from the root to any leaf node, say node 1:
cl_d0 = cl_tree.distance("1")
cl_d0
The second largest cophenetic distance at which the root splits into 2 clusters is:
cl_d1 = cl_d0 - cl_tree.root.clades[1].branch_length
cl_d1
The minimum and maximum cophenetic distance at which we have 3 clusters is:
cl_max_distance, cl_min_distance = cl_d1, cl_d1 - min(
c.branch_length for c in cl_clusters
)
cl_max_distance, cl_min_distance
plot_tree(
cl_tree,
axvspan=((cl_d0 - cl_max_distance, cl_d0 - cl_min_distance), {"facecolor": "0.8"}),
)
For more examples, see Biopython Tutorial and Cookbook.
For the single-linkage clustering solution, assign to sl_max_distance
and sl_min_distance
the maximum and minimum cophenetic distance at which we can obtain 3 clusters.
# YOUR CODE HERE
raise NotImplementedError
sl_max_distance, sl_min_distance
Source
# tests
assert np.isclose(sl_max_distance - sl_min_distance, 0.006629999999999997, rtol=1e-3)
To obtain the largest cophenetic distance at which we have at least three clusters:
Similar to the complete-linkage method, plot the phylogenetic tree for the single-linkage method. Furthermore,
- shade the range of branch length at which we can obtain 3 clusters, and
- color the 3 clusters returned by Weka differently.
# YOUR CODE HERE
raise NotImplementedError
By comparing the dendrograms obtained by the single-linkage and complete-linkage methods, explain why the complete-linkage method gives a better solution for the iris dataset.
Hint
Consider the range of cophenetic distance that gives 3 clusters instead of 2.
YOUR ANSWER HERE
%%ai chatgpt -f text
Explain how to measure the stability of a cluster in hierarchical clustering?