Clustering - CS5483

load("datamining.mac")$

Centroid-based methods¶

Partitional clusters¶

A cluster assignment is a list of cluster indices, where the $i$ -th index is the cluster label of the $i$ -th instance in a data set.

block(
    [C: [2,2,1,1]],
    to_clusters(C)
);

In the above, to_cluster(C) converts a cluster assignment C to a pair where

the first element is the list of unique cluster indices in ascending order, and
the second element is the corresponding list of clusters, each of which is a list of row indices of C in ascending order associated with the cluster index.

Compute centroids from clusters¶

The centroids can be computed for a list of clusters of data points by centroids(Cs, cfns) where

Cs is a list of data sets all with
the features given by the list cfns for calculating the centroids.

block(
    [
        fns: ['i, 'X_1, 'X_2],
        cfns: ['X_1, 'X_2],
        data_1, data_2
    ],
    data_1: build_data(fns, lambda([i], [i, random(2), random(2)]),2),
    data_2: build_data(fns, lambda([i], [i, 3+random(2), random(2)]),2),
    cs: centroids([data_1, data_2], cfns),
    [data_1, data_2, cs]
);

The centroids are returned as a data set on cfns where the $j$ -th instance is the centroid of the $j$ cluster Cs[j].

A data set can be split into clusters using split_data_by_clusters(data, C) where

data is the data set to split, and
C is the cluster assignment in the form of a list of cluster indices, one for each instance of the data.

Centroids can also be obtained for a cluster assignment by centroids_for_clusters(data, cfns, C) where

cfns is the list of features to use for calculating the centroids.

The total variation (sum of squared distances) of the data points to their cluster centers can be computed by variation(data,C,cs) where

cs is the data where the $i$ -th instance is the center of the $i$ cluster in data according to the cluster assignment C.

block(
    [
        fns: ['X_1, 'X_2],
        cfns: ['X_1, 'X_2],
        C, data_1, data_2, data,t,ids, ps, Cs
    ],
    data_1: build_data(fns, lambda([i], [random(2), random(2)]),2),
    data_2: build_data(fns, lambda([i], [3+random(2), random(2)]),2),
    data: stack_data(data_1, data_2),
    C: [1, 1, 2, 2],
    t:split_data_by_clusters(data, C),
    ids: t[1][1],
    ps: t[1][2],
    Cs: t[2],
    cs:centroids_for_clusters(data, cfns, C),
    var: variation(data,C,cs),
    [data, ids, ps, Cs, cs, var]
);

In the above example, ids is a list of unique cluster indices sorted in ascending order, corresponding to which

ps is the list of fractional cluster sizes,
Cs is the list of clusters, and
cs is the data where the $j$ -th instance is the centroid of the $j$ -th cluster.

Compute clusters from centroids¶

Data points can be assigned to their nearest cluster centers using clusters_for_centroids(data, cs), where

data is the data set to cluster, and
cs is the data containing cluster centers.

block(
    [
        fns: ['X_1, 'X_2],
        cfns: ['X_1, 'X_2],
        data_1, data_2, data, C, cs
    ],
    data_1: build_data(fns, lambda([i], [random(2), random(2)]),3),
    data_2: build_data(fns, lambda([i], [3+random(2), random(2)]),3),
    data: stack_data(data_1, data_2),
    C: [1,2,2,2,1,1],
    cs: centroids_for_clusters(data, cfns, C),
    [data, cs, clusters_for_centroids(data, cs)]
);

An instance is assigned to cluster $i$ only if the $i$ -th cluster center in cs is the closest cluster center to the instance.

Helper functions¶

The following are some helper functions used to implement the centroid-based method:

block(
    [
        p: matrix([0,0]),
        q: [1,1],
        P: matrix(
            [0,0],
            [0,1]
        ),
        Q: [[1,1],
            [1,0]]
    ],
    [centroid(Q), dist(p,q), sq_dist(p,q), nearest_neighbor(p,Q), nearest_neighbors(P,Q)]
);

In the above example:

centroid(Q) is the centroid of the points (rows) in the cluster Q, where Q can be a matrix or a list of lists of coordinates.
dist(p,q) is the Euclidean distance from point p to point q, where a point can be a row vector or a list of coordinates.
sq_dist(p,q) is the squared distance between p and q.
nearest_neighbor(p,Q) is $i$ iff the $i$ -th point (row) in the cluster Q is the first nearest neighbor of point p.
nearest_neighbors(P,Q) is a list of nearest_neighbors(p,Q) where p is a point of the cluster P.

Hierarchical clustering¶

Cluster distances¶

pairwise_cluster_dists(data, C, dist) returns the sorted list of pairwise cluster distances for data with clustering assignment C and the cluster distance dist. The returned list consists of elements [[i,j],d] where d is the distance of Cluster i and Cluster j.

block(
    [
        fns: ['X_1, 'X_2],
        cfns: ['X_1, 'X_2],
        data_1, data_2, data, C
    ],
    data_1: build_data(fns, lambda([i], [random(2), random(2)]),2),
    data_2: build_data(fns, lambda([i], [3+random(2), random(2)]),2),
    data: stack_data(data_1, data_2),
    C: [4,4,1,2],
    dists: [min_dist, max_dist, centroid_dist, ward_dist],
    [
        data,
        build_data_from_list(
            ["metric", "pairwise cluster distances"],
            map(lambda([d], [d, pairwise_cluster_dists(data, C, d)]), dists)
        )
    ]
);

The possible cluster distances are:

min_dist(P, Q): The closest distance between rows of P and rows of Q.
max_dist(P, Q): The furthest distance between rows of P and rows of Q.
centroid_dist(P, Q): The distance between the centroid of rows of P and the centroid of row on Q.
ward_dist(P, Q): The total variation (squared distances) of all the rows of P and Q.

AGNES¶

Clusters can be agglomerated using agglomerate(data, C, ids) where

ids is a list of cluster indices to merge, and
the returned cluster assignment with have a new index lmax(C)+1 for the merged clusters.

block(
    [
        fns: ['X_1, 'X_2],
        cfns: ['X_1, 'X_2],
        data_1, data_2, data, C,  pcds
    ],
    data_1: build_data(fns, lambda([i], [random(2), random(2)]),2),
    data_2: build_data(fns, lambda([i], [3+random(2), random(2)]),2),
    data: stack_data(data_1, data_2),
    C: [4,4,1,2],
    pcds: pairwise_cluster_dists(data, C, min_dist),
    ids: pcds[1][1],
    [data, C, ids, agglomerate(data, C, ids)]
);

An agglomerative nesting algorithm (AGNES) for clustering can be performed with agnes(data, dist, k) where k is the number of clusters to merge to, and dist is the cluster distance to minimize by merging two clusters;

block(
    [
        fns: ['X_1, 'X_2],
        cfns: ['X_1, 'X_2],
        data_1, data_2, data
    ],
    data_1: build_data(fns, lambda([i], [random(2), random(2)]),2),
    data_2: build_data(fns, lambda([i], [3+random(2), random(2)]),2),
    data: stack_data(data_1, data_2),
    [data, agnes(data, min_dist, 2)]
);

agnes returns a pair where

the first element is the desired cluster assignment for k clusters, and
the second element is a list where the l-th element is [C, [i,j], d] where
- C is the cluster assignment right before the l-th agglomeration,
- i and j are the cluster indices to be merged to a new cluster, which will have index lmax(C)+1, and
- d is the distance of Cluster i and j calculated using dist.

Density-based methods¶

The core distances of points in a data set data can be computed by core_dists(data, MinPts, eps) where MinPts and eps are the usual parameters for OPTICS/DBSCAN.

block(
    [
        fns: ['X_1, 'X_2],
        cfns: ['X_1, 'X_2],
        data_1, data_2, data,
        MinPts: 3,
        eps: 3,
        cdm
    ],
    data_1: build_data(fns, lambda([i], [random(2), random(2)]),2),
    data_2: build_data(fns, lambda([i], [3+random(2), random(2)]),2),
    data: stack_data(data_1, data_2),
    cdm: core_dists(data, MinPts, eps),
    [data, cdm, feature_values(cdm, "core_dist"), feature_values(cdm, 'd[1])]
);

The core distances and other related quantities are returned as a data set where the $i$ -th instance consists of the following for the $i$ -th instance in data:

"core_dist": The core distance, which is inf by convention if there are fewer than MinPts data points within the eps-neighborhood.
"neighbors": The list of indices corresponding to points in the eps-neighborhood.
'd[j]: The distance to the $j$ -th instance.

DBSCAN¶

dbscan(data, MinPts, eps) applies DBSCAN to cluster points in data where a core point has at least MinPts data points in its eps-neighborhood:

block(
    [
        fns: ['X_1, 'X_2],
        cfns: ['X_1, 'X_2],
        data_1, data_2, data,
        MinPts: 3,
        eps: 1,
        cdm
    ],
    data_1: build_data(fns, lambda([i], [random(2), random(2)]),3),
    data_2: build_data(fns, lambda([i], [5+random(2), random(2)]),3),
    data: stack_data(data_1, data_2),
    cdm: core_dists(data, MinPts, eps),
    [data, cdm, dbscan(data, MinPts, eps)]
);

dbscan returns a list consisting of

the set of clusters (sets of indices),
the set of noise points,
the set of core points, and
the set of border points.

OPTICS¶

optics(data, MinPts, eps) applies OPTICS to clusters points in data where a core point has at least MinPts data points in its eps-neighborhood:

block(
    [
        fns: ['X_1, 'X_2],
        data_1, data_2, data,
        MinPts: 3,
        eps: 1,
        cdm
    ],
    data_1: build_data(fns, lambda([i], [random(2), random(2)]),3),
    data_2: build_data(fns, lambda([i], [5+random(2), random(2)]),3),
    data: stack_data(data_1, data_2),
    cdm: core_dists(data, MinPts, eps),
    [data, cdm, optics(data, MinPts, eps)]
);

optics returns a pair consisting of

the data containing
- a column of reachability distances, and
- a column containing the corresponding list of nodes reached, and
the set of noise points.

Evaluation metrics¶

Pairwise correctness¶

pairwise_correctness(C, L) takes

the cluster assignment C and
the ground truth categorization L,

and it returns a pair where

the first element is the pairwise correctness matrix, and
the second element is the accuracy measured in terms of the average correctness.

block(
    [
        L: [1,1,2,2,3,3],
        C: [1,1,1,2,2,2]
    ],
    pairwise_correctness(C, L)
);

B-Cubed precision and recall¶

B-Cubed precision and recall can be computed using BCubed(C, L) for the cluster assignment C and categorization L:

block(
    [
        L: [1,1,2,2,3,3],
        C: [1,1,1,2,2,2]
    ],
    BCubed(C, L)
);

The first element of the returned list is a list [precision, recall] of the overall precision and recall.
The second element contains more detailed statistics per node. E.g., for the $i$ -th instance, the column TPj contains the list of index $j$ such that $(i,j)$ is a true positive, while the feature TP is the total count of such index $j$ . The remaining columns give the precision and recall for each instance.

Classes-to-clusters evaluation¶

classes_to_clusters_eval(L, C) carry the classes-to-clusters evaluation on the categorization L and cluster assignment C. The returned list consists of

the accuracy maximized over the classes-to-clusters assignment,
the assignment in the form of a list of [l,c] where l is a class index and c is a cluster index, and
the list consisting of
- the list l of unique class labels in ascending order,
- the list c of unique cluster labels in ascending order, and
- the list lst of list of counts where lst[i][j] is the counts of instances associated with class index l[i] and cluster index c[j].

block(
    [
        C: [1,1,2,2,3,3],
        L: [1,1,1,2,2,2]
    ],
    classes_to_clusters_eval(L, C)
);

Silhouette analysis¶

Silhouette coefficients can be computed by silhouette(data, C) for the data set data with cluster assignment C as follows:

block(
    [
        fns: ['X_1, 'X_2],
        cfns: ['X_1, 'X_2],
        data_1, data_2, data,
        C
    ],
    data_1: build_data(fns, lambda([i], [random(2), random(2)]),2),
    data_2: build_data(fns, lambda([i], [3+random(2), random(2)]),2),
    data: stack_data(data_1, data_2),
    C: [2,2,1,1],
    [data, silhouette(data, C)]
);

silhouette(data, C) returns data with columns:

'a: mean intra-cluster distance.
'b: mean nearest-cluster distance.
"nearest": index of the nearest cluster.
's: silhouette coefficient.

Maxima

Classification

Maxima

Frequent-Pattern Analysis