load("datamining.mac")$Centroid-based methods¶
Partitional clusters¶
A cluster assignment is a list of cluster indices, where the -th index is the cluster label of the -th instance in a data set.
block(
[C: [2,2,1,1]],
to_clusters(C)
);In the above, to_cluster(C) converts a cluster assignment C to a pair where
- the first element is the list of unique cluster indices in ascending order, and
- the second element is the corresponding list of clusters, each of which is a list of row indices of C in ascending order associated with the cluster index.
Compute centroids from clusters¶
The centroids can be computed for a list of clusters of data points by centroids(Cs, cfns) where
Csis a list of data sets all with- the features given by the list
cfnsfor calculating the centroids.
block(
[
fns: ['i, 'X_1, 'X_2],
cfns: ['X_1, 'X_2],
data_1, data_2
],
data_1: build_data(fns, lambda([i], [i, random(2), random(2)]),2),
data_2: build_data(fns, lambda([i], [i, 3+random(2), random(2)]),2),
cs: centroids([data_1, data_2], cfns),
[data_1, data_2, cs]
);The centroids are returned as a data set on cfns where the -th instance is the centroid of the cluster Cs[j].
A data set can be split into clusters using split_data_by_clusters(data, C) where
datais the data set to split, andCis the cluster assignment in the form of a list of cluster indices, one for each instance of thedata.
Centroids can also be obtained for a cluster assignment by centroids_for_clusters(data, cfns, C) where
cfnsis the list of features to use for calculating the centroids.
The total variation (sum of squared distances) of the data points to their cluster centers can be computed by variation(data,C,cs) where
csis the data where the -th instance is the center of the cluster indataaccording to the cluster assignmentC.
block(
[
fns: ['X_1, 'X_2],
cfns: ['X_1, 'X_2],
C, data_1, data_2, data,t,ids, ps, Cs
],
data_1: build_data(fns, lambda([i], [random(2), random(2)]),2),
data_2: build_data(fns, lambda([i], [3+random(2), random(2)]),2),
data: stack_data(data_1, data_2),
C: [1, 1, 2, 2],
t:split_data_by_clusters(data, C),
ids: t[1][1],
ps: t[1][2],
Cs: t[2],
cs:centroids_for_clusters(data, cfns, C),
var: variation(data,C,cs),
[data, ids, ps, Cs, cs, var]
);In the above example, ids is a list of unique cluster indices sorted in ascending order, corresponding to which
psis the list of fractional cluster sizes,Csis the list of clusters, andcsis the data where the -th instance is the centroid of the -th cluster.
Compute clusters from centroids¶
Data points can be assigned to their nearest cluster centers using clusters_for_centroids(data, cs), where
datais the data set to cluster, andcsis the data containing cluster centers.
block(
[
fns: ['X_1, 'X_2],
cfns: ['X_1, 'X_2],
data_1, data_2, data, C, cs
],
data_1: build_data(fns, lambda([i], [random(2), random(2)]),3),
data_2: build_data(fns, lambda([i], [3+random(2), random(2)]),3),
data: stack_data(data_1, data_2),
C: [1,2,2,2,1,1],
cs: centroids_for_clusters(data, cfns, C),
[data, cs, clusters_for_centroids(data, cs)]
);An instance is assigned to cluster only if the -th cluster center in cs is the closest cluster center to the instance.
Helper functions¶
The following are some helper functions used to implement the centroid-based method:
block(
[
p: matrix([0,0]),
q: [1,1],
P: matrix(
[0,0],
[0,1]
),
Q: [[1,1],
[1,0]]
],
[centroid(Q), dist(p,q), sq_dist(p,q), nearest_neighbor(p,Q), nearest_neighbors(P,Q)]
);In the above example:
centroid(Q)is the centroid of the points (rows) in the clusterQ, whereQcan be a matrix or a list of lists of coordinates.dist(p,q)is the Euclidean distance from pointpto pointq, where a point can be a row vector or a list of coordinates.sq_dist(p,q)is the squared distance betweenpandq.nearest_neighbor(p,Q)is iff the -th point (row) in the clusterQis the first nearest neighbor of pointp.nearest_neighbors(P,Q)is a list ofnearest_neighbors(p,Q)wherepis a point of the cluster P.
Hierarchical clustering¶
Cluster distances¶
pairwise_cluster_dists(data, C, dist) returns the sorted list of pairwise cluster distances for data with clustering assignment C and the cluster distance dist. The returned list consists of elements [[i,j],d] where d is the distance of
Cluster i and Cluster j.
block(
[
fns: ['X_1, 'X_2],
cfns: ['X_1, 'X_2],
data_1, data_2, data, C
],
data_1: build_data(fns, lambda([i], [random(2), random(2)]),2),
data_2: build_data(fns, lambda([i], [3+random(2), random(2)]),2),
data: stack_data(data_1, data_2),
C: [4,4,1,2],
dists: [min_dist, max_dist, centroid_dist, ward_dist],
[
data,
build_data_from_list(
["metric", "pairwise cluster distances"],
map(lambda([d], [d, pairwise_cluster_dists(data, C, d)]), dists)
)
]
);The possible cluster distances are:
min_dist(P, Q): The closest distance between rows ofPand rows ofQ.max_dist(P, Q): The furthest distance between rows ofPand rows ofQ.centroid_dist(P, Q): The distance between the centroid of rows ofPand the centroid of row onQ.ward_dist(P, Q): The total variation (squared distances) of all the rows ofPandQ.
AGNES¶
Clusters can be agglomerated using agglomerate(data, C, ids) where
idsis a list of cluster indices to merge, and- the returned cluster assignment with have a new index
lmax(C)+1for the merged clusters.
block(
[
fns: ['X_1, 'X_2],
cfns: ['X_1, 'X_2],
data_1, data_2, data, C, pcds
],
data_1: build_data(fns, lambda([i], [random(2), random(2)]),2),
data_2: build_data(fns, lambda([i], [3+random(2), random(2)]),2),
data: stack_data(data_1, data_2),
C: [4,4,1,2],
pcds: pairwise_cluster_dists(data, C, min_dist),
ids: pcds[1][1],
[data, C, ids, agglomerate(data, C, ids)]
);An agglomerative nesting algorithm (AGNES) for clustering can be performed with agnes(data, dist, k) where k is the number of clusters to merge to, and dist is the cluster distance to minimize by merging two clusters;
block(
[
fns: ['X_1, 'X_2],
cfns: ['X_1, 'X_2],
data_1, data_2, data
],
data_1: build_data(fns, lambda([i], [random(2), random(2)]),2),
data_2: build_data(fns, lambda([i], [3+random(2), random(2)]),2),
data: stack_data(data_1, data_2),
[data, agnes(data, min_dist, 2)]
);agnes returns a pair where
- the first element is the desired cluster assignment for k clusters, and
- the second element is a list where the
l-th element is[C, [i,j], d]whereCis the cluster assignment right before thel-th agglomeration,iandjare the cluster indices to be merged to a new cluster, which will have indexlmax(C)+1, anddis the distance of Clusteriandjcalculated usingdist.
Density-based methods¶
The core distances of points in a data set data can be computed by core_dists(data, MinPts, eps) where
MinPts and eps are the usual parameters for OPTICS/DBSCAN.
block(
[
fns: ['X_1, 'X_2],
cfns: ['X_1, 'X_2],
data_1, data_2, data,
MinPts: 3,
eps: 3,
cdm
],
data_1: build_data(fns, lambda([i], [random(2), random(2)]),2),
data_2: build_data(fns, lambda([i], [3+random(2), random(2)]),2),
data: stack_data(data_1, data_2),
cdm: core_dists(data, MinPts, eps),
[data, cdm, feature_values(cdm, "core_dist"), feature_values(cdm, 'd[1])]
);The core distances and other related quantities are returned as a data set where the -th instance consists of the following for the -th instance in data:
"core_dist": The core distance, which isinfby convention if there are fewer thanMinPtsdata points within theeps-neighborhood."neighbors": The list of indices corresponding to points in theeps-neighborhood.'d[j]: The distance to the -th instance.
DBSCAN¶
dbscan(data, MinPts, eps) applies DBSCAN to cluster points in data where a core point has at least MinPts data points in its eps-neighborhood:
block(
[
fns: ['X_1, 'X_2],
cfns: ['X_1, 'X_2],
data_1, data_2, data,
MinPts: 3,
eps: 1,
cdm
],
data_1: build_data(fns, lambda([i], [random(2), random(2)]),3),
data_2: build_data(fns, lambda([i], [5+random(2), random(2)]),3),
data: stack_data(data_1, data_2),
cdm: core_dists(data, MinPts, eps),
[data, cdm, dbscan(data, MinPts, eps)]
);dbscan returns a list consisting of
- the set of clusters (sets of indices),
- the set of noise points,
- the set of core points, and
- the set of border points.
OPTICS¶
optics(data, MinPts, eps) applies OPTICS to clusters points in data where a core point has at least MinPts data points in its eps-neighborhood:
block(
[
fns: ['X_1, 'X_2],
data_1, data_2, data,
MinPts: 3,
eps: 1,
cdm
],
data_1: build_data(fns, lambda([i], [random(2), random(2)]),3),
data_2: build_data(fns, lambda([i], [5+random(2), random(2)]),3),
data: stack_data(data_1, data_2),
cdm: core_dists(data, MinPts, eps),
[data, cdm, optics(data, MinPts, eps)]
);optics returns a pair consisting of
- the data containing
- a column of reachability distances, and
- a column containing the corresponding list of nodes reached, and
- the set of noise points.
Evaluation metrics¶
Pairwise correctness¶
pairwise_correctness(C, L) takes
- the cluster assignment
Cand - the ground truth categorization
L,
and it returns a pair where
- the first element is the pairwise correctness matrix, and
- the second element is the accuracy measured in terms of the average correctness.
block(
[
L: [1,1,2,2,3,3],
C: [1,1,1,2,2,2]
],
pairwise_correctness(C, L)
);B-Cubed precision and recall¶
B-Cubed precision and recall can be computed using BCubed(C, L) for the cluster assignment C and categorization L:
block(
[
L: [1,1,2,2,3,3],
C: [1,1,1,2,2,2]
],
BCubed(C, L)
);- The first element of the returned list is a list
[precision, recall]of the overall precision and recall. - The second element contains more detailed statistics per node. E.g., for the -th instance, the column TPj contains the list of index such that is a true positive, while the feature TP is the total count of such index . The remaining columns give the precision and recall for each instance.
Classes-to-clusters evaluation¶
classes_to_clusters_eval(L, C) carry the classes-to-clusters evaluation on the categorization L and cluster assignment C.
The returned list consists of
- the accuracy maximized over the classes-to-clusters assignment,
- the assignment in the form of a list of
[l,c]wherelis a class index andcis a cluster index, and - the list consisting of
- the list
lof unique class labels in ascending order, - the list
cof unique cluster labels in ascending order, and - the list
lstof list of counts wherelst[i][j]is the counts of instances associated with class indexl[i]and cluster indexc[j].
- the list
block(
[
C: [1,1,2,2,3,3],
L: [1,1,1,2,2,2]
],
classes_to_clusters_eval(L, C)
);Silhouette analysis¶
Silhouette coefficients can be computed by silhouette(data, C) for the data set data with cluster assignment C as follows:
block(
[
fns: ['X_1, 'X_2],
cfns: ['X_1, 'X_2],
data_1, data_2, data,
C
],
data_1: build_data(fns, lambda([i], [random(2), random(2)]),2),
data_2: build_data(fns, lambda([i], [3+random(2), random(2)]),2),
data: stack_data(data_1, data_2),
C: [2,2,1,1],
[data, silhouette(data, C)]
);silhouette(data, C) returns data with columns:
'a: mean intra-cluster distance.'b: mean nearest-cluster distance."nearest": index of the nearest cluster.'s: silhouette coefficient.