load("datamining.mac")$
Centroid-based methods¶
Partitional clusters¶
A cluster assignment is a list of cluster indices, where the i-th index is the cluster label of the i-th instance in a data set.
block(
[C: [2,2,1,1]],
to_clusters(C)
);
In the above, to_clusters(C)
converts a cluster assignment C
to a pair where
- the first element is the list of unique cluster indices in ascending order, and
- the second element is the corresponding list of clusters, each of which is the list of row indices (positions in C), in ascending order, associated with that cluster index.
Compute centroids from clusters¶
The centroids can be computed for a list of clusters of data points by centroids(Cs, cfns) where
- Cs is a list of data sets, all containing the features given by the list cfns, and
- cfns is the list of features used for calculating the centroids.
block(
[
fns: ['i, 'X_1, 'X_2],
cfns: ['X_1, 'X_2],
data_1, data_2, cs
],
data_1: build_data(fns, lambda([i], [i, random(2), random(2)]),2),
data_2: build_data(fns, lambda([i], [i, 3+random(2), random(2)]),2),
cs: centroids([data_1, data_2], cfns),
[data_1, data_2, cs]
);
The centroids are returned as a data set on cfns, where the j-th instance is the centroid of the cluster Cs[j].
A data set can be split into clusters using split_data_by_clusters(data, C) where
- data is the data set to split, and
- C is the cluster assignment in the form of a list of cluster indices, one for each instance of data.
Centroids can also be obtained for a cluster assignment by centroids_for_clusters(data, cfns, C)
where
cfns
is the list of features to use for calculating the centroids.
The total variation (sum of squared distances) of the data points to their cluster centers can be computed by variation(data, C, cs) where cs is the data set whose j-th instance is the center of the j-th cluster in data according to the cluster assignment C.
block(
[
fns: ['X_1, 'X_2],
cfns: ['X_1, 'X_2],
C, data_1, data_2, data, t, ids, ps, Cs, cs, var
],
data_1: build_data(fns, lambda([i], [random(2), random(2)]),2),
data_2: build_data(fns, lambda([i], [3+random(2), random(2)]),2),
data: stack_data(data_1, data_2),
C: [1, 1, 2, 2],
t:split_data_by_clusters(data, C),
ids: t[1][1],
ps: t[1][2],
Cs: t[2],
cs:centroids_for_clusters(data, cfns, C),
var: variation(data,C,cs),
[data, ids, ps, Cs, cs, var]
);
In the above example,
- ids is the list of unique cluster indices sorted in ascending order,
- ps is the corresponding list of fractional cluster sizes,
- Cs is the corresponding list of clusters, and
- cs is the data set whose j-th instance is the centroid of the j-th cluster.
Compute clusters from centroids¶
Data points can be assigned to their nearest cluster centers using clusters_for_centroids(data, cs), where
- data is the data set to cluster, and
- cs is the data set containing the cluster centers.
block(
[
fns: ['X_1, 'X_2],
cfns: ['X_1, 'X_2],
data_1, data_2, data, C, cs
],
data_1: build_data(fns, lambda([i], [random(2), random(2)]),3),
data_2: build_data(fns, lambda([i], [3+random(2), random(2)]),3),
data: stack_data(data_1, data_2),
C: [1,2,2,2,1,1],
cs: centroids_for_clusters(data, cfns, C),
[data, cs, clusters_for_centroids(data, cs)]
);
An instance is assigned to cluster j only if the j-th cluster center in cs is the closest cluster center to the instance.
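The two operations above can be alternated to refine a clustering. The following is a minimal k-means-style sketch; the loop itself is illustrative rather than a function provided by datamining.mac, and it assumes clusters_for_centroids returns a cluster assignment list as described above.
block(
[
fns: ['X_1, 'X_2],
cfns: ['X_1, 'X_2],
data_1, data_2, data, C, cs, step
],
data_1: build_data(fns, lambda([i], [random(2), random(2)]),3),
data_2: build_data(fns, lambda([i], [3+random(2), random(2)]),3),
data: stack_data(data_1, data_2),
C: [1,1,2,2,1,2],  /* rough initial cluster assignment */
for step:1 thru 2 do (
    cs: centroids_for_clusters(data, cfns, C),  /* update the cluster centers */
    C: clusters_for_centroids(data, cs)         /* reassign points to their nearest centers */
),
[C, cs, variation(data, C, cs)]
);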
Helper functions¶
The following are some helper functions used to implement the centroid-based method:
block(
[
p: matrix([0,0]),
q: [1,1],
P: matrix(
[0,0],
[0,1]
),
Q: [[1,1],
[1,0]]
],
[centroid(Q), dist(p,q), sq_dist(p,q), nearest_neighbor(p,Q), nearest_neighbors(P,Q)]
);
In the above example:
- centroid(Q) is the centroid of the points (rows) in the cluster Q, where Q can be a matrix or a list of lists of coordinates.
- dist(p,q) is the Euclidean distance from point p to point q, where a point can be a row vector or a list of coordinates.
- sq_dist(p,q) is the squared distance between p and q.
- nearest_neighbor(p,Q) is j iff the j-th point (row) in the cluster Q is the first nearest neighbor of point p.
- nearest_neighbors(P,Q) is the list of nearest_neighbor(p,Q) where p ranges over the points (rows) of P.
Hierarchical clustering¶
Cluster distances¶
pairwise_cluster_dists(data, C, dist) returns the sorted list of pairwise cluster distances for data with cluster assignment C and the cluster distance dist. The returned list consists of elements [[i,j], d] where d is the distance between Cluster i and Cluster j.
block(
[
fns: ['X_1, 'X_2],
cfns: ['X_1, 'X_2],
data_1, data_2, data, C, dists
],
data_1: build_data(fns, lambda([i], [random(2), random(2)]),2),
data_2: build_data(fns, lambda([i], [3+random(2), random(2)]),2),
data: stack_data(data_1, data_2),
C: [4,4,1,2],
dists: [min_dist, max_dist, centroid_dist, ward_dist],
[
data,
build_data_from_list(
["metric", "pairwise cluster distances"],
map(lambda([d], [d, pairwise_cluster_dists(data, C, d)]), dists)
)
]
);
The possible cluster distances, compared in the sketch below, are:
- min_dist(P, Q): The closest distance between rows of P and rows of Q.
- max_dist(P, Q): The furthest distance between rows of P and rows of Q.
- centroid_dist(P, Q): The distance between the centroid of the rows of P and the centroid of the rows of Q.
- ward_dist(P, Q): The total variation (sum of squared distances) of all the rows of P and Q.
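The following is a minimal sketch comparing the four cluster distances on the same pair of clusters. It assumes that min_dist, max_dist, centroid_dist, and ward_dist accept clusters given as matrices of row points, in the same way as the helper functions centroid and nearest_neighbor above.
block(
[
P: matrix(
[0,0],
[0,1]
),
Q: matrix(
[3,0],
[3,1]
)
],
/* the four cluster distances between P and Q */
[min_dist(P,Q), max_dist(P,Q), centroid_dist(P,Q), ward_dist(P,Q)]
);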
AGNES¶
Clusters can be agglomerated using agglomerate(data, C, ids) where
- ids is a list of cluster indices to merge, and
- the returned cluster assignment will have a new index lmax(C)+1 for the merged clusters.
block(
[
fns: ['X_1, 'X_2],
cfns: ['X_1, 'X_2],
data_1, data_2, data, C, pcds, ids
],
data_1: build_data(fns, lambda([i], [random(2), random(2)]),2),
data_2: build_data(fns, lambda([i], [3+random(2), random(2)]),2),
data: stack_data(data_1, data_2),
C: [4,4,1,2],
pcds: pairwise_cluster_dists(data, C, min_dist),
ids: pcds[1][1],
[data, C, ids, agglomerate(data, C, ids)]
);
An agglomerative nesting algorithm (AGNES) for clustering can be performed with agnes(data, dist, k), where k is the number of clusters to merge down to, and dist is the cluster distance minimized when merging two clusters:
block(
[
fns: ['X_1, 'X_2],
cfns: ['X_1, 'X_2],
data_1, data_2, data
],
data_1: build_data(fns, lambda([i], [random(2), random(2)]),2),
data_2: build_data(fns, lambda([i], [3+random(2), random(2)]),2),
data: stack_data(data_1, data_2),
[data, agnes(data, min_dist, 2)]
);
agnes returns a pair (see the sketch below) where
- the first element is the desired cluster assignment for k clusters, and
- the second element is a list whose l-th element is [C, [i,j], d] where
  - C is the cluster assignment right before the l-th agglomeration,
  - i and j are the indices of the clusters merged into a new cluster, which will have index lmax(C)+1, and
  - d is the distance between Cluster i and Cluster j calculated using dist.
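The merge history can be inspected step by step. The following sketch relies only on the return structure described above and lists, for each agglomeration, the pair of merged cluster indices and the corresponding cluster distance.
block(
[
fns: ['X_1, 'X_2],
data_1, data_2, data, t
],
data_1: build_data(fns, lambda([i], [random(2), random(2)]),2),
data_2: build_data(fns, lambda([i], [3+random(2), random(2)]),2),
data: stack_data(data_1, data_2),
t: agnes(data, min_dist, 2),
/* [[i,j], d] for each agglomeration step recorded in t[2] */
map(lambda([s], [s[2], s[3]]), t[2])
);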
Density-based methods¶
The core distances of points in a data set data can be computed by core_dists(data, MinPts, eps) where MinPts and eps are the usual parameters of OPTICS/DBSCAN, i.e., a core point must have at least MinPts data points within its eps-neighborhood.
block(
[
fns: ['X_1, 'X_2],
cfns: ['X_1, 'X_2],
data_1, data_2, data,
MinPts: 3,
eps: 3,
cdm
],
data_1: build_data(fns, lambda([i], [random(2), random(2)]),2),
data_2: build_data(fns, lambda([i], [3+random(2), random(2)]),2),
data: stack_data(data_1, data_2),
cdm: core_dists(data, MinPts, eps),
[data, cdm, feature_values(cdm, "core_dist"), feature_values(cdm, 'd[1])]
);
The core distances and other related quantities are returned as a data set whose i-th instance consists of the following for the i-th instance in data:
- "core_dist": The core distance, which is inf by convention if there are fewer than MinPts data points within the eps-neighborhood.
- "neighbors": The list of indices corresponding to points in the eps-neighborhood.
- 'd[j]: The distance to the j-th instance.
DBSCAN¶
dbscan(data, MinPts, eps)
applies DBSCAN to cluster points in data
where a core point has at least MinPts
data points in its eps
-neighborhood:
block(
[
fns: ['X_1, 'X_2],
cfns: ['X_1, 'X_2],
data_1, data_2, data,
MinPts: 3,
eps: 1,
cdm
],
data_1: build_data(fns, lambda([i], [random(2), random(2)]),3),
data_2: build_data(fns, lambda([i], [5+random(2), random(2)]),3),
data: stack_data(data_1, data_2),
cdm: core_dists(data, MinPts, eps),
[data, cdm, dbscan(data, MinPts, eps)]
);
dbscan
returns a list consisting of
- the set of clusters (sets of indices),
- the set of noise points,
- the set of core points, and
- the set of border points.
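These four components can be unpacked by indexing the returned list. The following is a minimal sketch; the string labels are only illustrative.
block(
[
fns: ['X_1, 'X_2],
data_1, data_2, data, res
],
data_1: build_data(fns, lambda([i], [random(2), random(2)]),3),
data_2: build_data(fns, lambda([i], [5+random(2), random(2)]),3),
data: stack_data(data_1, data_2),
res: dbscan(data, 3, 1),
/* clusters, noise points, core points and border points, in that order */
["clusters"=res[1], "noise"=res[2], "core"=res[3], "border"=res[4]]
);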
OPTICS¶
optics(data, MinPts, eps)
applies OPTICS to cluster points in data
where a core point has at least MinPts
data points in its eps
-neighborhood:
block(
[
fns: ['X_1, 'X_2],
data_1, data_2, data,
MinPts: 3,
eps: 1,
cdm
],
data_1: build_data(fns, lambda([i], [random(2), random(2)]),3),
data_2: build_data(fns, lambda([i], [5+random(2), random(2)]),3),
data: stack_data(data_1, data_2),
cdm: core_dists(data, MinPts, eps),
[data, cdm, optics(data, MinPts, eps)]
);
optics returns a pair consisting of
- the data set containing
  - a column of reachability distances, and
  - a column containing the corresponding list of nodes reached, and
- the set of noise points.
Evaluation metrics¶
Pairwise correctness¶
pairwise_correctness(C, L) takes
- the cluster assignment C, and
- the ground-truth categorization L,
and returns a pair where
- the first element is the pairwise correctness matrix, and
- the second element is the accuracy measured as the average correctness.
block(
[
L: [1,1,2,2,3,3],
C: [1,1,1,2,2,2]
],
pairwise_correctness(C, L)
);
B-Cubed precision and recall¶
B-Cubed precision and recall can be computed using BCubed(C, L)
for the cluster assignment C
and categorization L
:
block(
[
L: [1,1,2,2,3,3],
C: [1,1,1,2,2,2]
],
BCubed(C, L)
);
- The first element of the returned list is a list [precision, recall] of the overall precision and recall.
- The second element contains more detailed statistics per node. E.g., for the j-th instance, the column TPj contains the list of indices i such that the pair (i, j) is a true positive, while the feature TP is the total count of such indices i. The remaining columns give the precision and recall for each instance (see the sketch below).
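To make these definitions concrete, the following sketch recomputes the per-instance and overall B-Cubed precision and recall directly from C and L using only standard Maxima functions; the overall values can be compared with the first element returned by BCubed(C, L).
block(
[
L: [1,1,2,2,3,3],
C: [1,1,1,2,2,2],
n, tp, prec, rec
],
n: length(C),
/* tp(i): number of instances in the same cluster as instance i and with the same class label */
tp: lambda([i], length(sublist(makelist(j, j, 1, n), lambda([j], is(C[j]=C[i] and L[j]=L[i]))))),
/* per-instance precision: tp(i) over the size of instance i's cluster */
prec: makelist(tp(i)/length(sublist(C, lambda([c], is(c=C[i])))), i, 1, n),
/* per-instance recall: tp(i) over the size of instance i's class */
rec: makelist(tp(i)/length(sublist(L, lambda([v], is(v=L[i])))), i, 1, n),
/* overall B-Cubed precision and recall as averages over all instances */
[apply("+", prec)/n, apply("+", rec)/n]
);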
Classes-to-clusters evaluation¶
classes_to_clusters_eval(L, C) carries out the classes-to-clusters evaluation on the categorization L and cluster assignment C.
The returned list consists of
- the accuracy maximized over the classes-to-clusters assignments,
- the assignment in the form of a list of [l, c] where l is a class index and c is a cluster index, and
- the list consisting of
  - the list l of unique class labels in ascending order,
  - the list c of unique cluster labels in ascending order, and
  - the list lst of lists of counts where lst[i][j] is the count of instances associated with class label l[i] and cluster label c[j].
block(
[
C: [1,1,2,2,3,3],
L: [1,1,1,2,2,2]
],
classes_to_clusters_eval(L, C)
);
Silhouette analysis¶
Silhouette coefficients can be computed by silhouette(data, C)
for the data set data
with cluster assignment C
as follows:
block(
[
fns: ['X_1, 'X_2],
cfns: ['X_1, 'X_2],
data_1, data_2, data,
C
],
data_1: build_data(fns, lambda([i], [random(2), random(2)]),2),
data_2: build_data(fns, lambda([i], [3+random(2), random(2)]),2),
data: stack_data(data_1, data_2),
C: [2,2,1,1],
[data, silhouette(data, C)]
);
silhouette(data, C) returns a data set with the following columns:
- 'a: the mean intra-cluster distance.
- 'b: the mean nearest-cluster distance.
- "nearest": the index of the nearest cluster.
- 's: the silhouette coefficient.
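The overall quality of a clustering can be summarized by the mean silhouette coefficient. The following sketch extracts the 's column with feature_values (the accessor used in the core-distance example above) and averages it.
block(
[
fns: ['X_1, 'X_2],
data_1, data_2, data, C, s
],
data_1: build_data(fns, lambda([i], [random(2), random(2)]),2),
data_2: build_data(fns, lambda([i], [3+random(2), random(2)]),2),
data: stack_data(data_1, data_2),
C: [2,2,1,1],
/* list of silhouette coefficients, one per instance */
s: feature_values(silhouette(data, C), 's),
/* mean silhouette coefficient */
apply("+", s)/length(s)
);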