This Jupyter notebook demonstrates how to cluster the iris.2D dataset using density-based methods. It uses the R language and can be run live using an R kernel.
Setup
The following code loads the iris data and creates the iris.2D data set:
data("iris") # load the iris data set
x <- as.matrix(iris[,1:2]) # keep the two input attributes: sepal length and sepal width
plot(x)
DBSCAN and OPTICS are implemented in the following package:
library(dbscan) # for DBSCAN and OPTICS
help(package="dbscan") # More information about the package
DBSCAN
DBSCAN is implemented by the function dbscan:
?dbscan
To apply DBSCAN to the iris data set with eps = 0.3 and minPts = 4:
db <- dbscan(x, eps = .3, minPts = 4)
db
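A common heuristic for choosing eps is to look for a knee in the sorted k-nearest-neighbor distances. The sketch below assumes the kNNdist and kNNdistplot functions from the dbscan package and uses k = minPts - 1 = 3; the dashed line at 0.3 only marks the eps value used above and is not a claim that it is the best choice.

```r
library(dbscan)

data("iris")
x <- as.matrix(iris[, 1:2])

# Distance from each point to its 3rd nearest neighbor (k = minPts - 1)
knn_d <- kNNdist(x, k = 3)

# Sorted k-NN distances; a knee in this curve suggests a candidate eps
kNNdistplot(x, k = 3)
abline(h = 0.3, lty = 2)  # the eps value used in the DBSCAN call above

# Fraction of points whose 3rd nearest neighbor lies within eps = 0.3
mean(knn_d <= 0.3)
```

Points whose k-NN distance lies well above the chosen eps are the ones most likely to end up as noise.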
To visualize the clustering solution, we can plot the points in different clusters with different colors:
pairs(x, col = db$cluster + 1L)
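DBSCAN assigns the label 0 to noise points, which is why the plot above adds 1 to the cluster label before using it as a color. As a minimal sketch, the cluster sizes and the noise points can be inspected like this (the choice of plotting symbols is only illustrative):

```r
library(dbscan)

data("iris")
x <- as.matrix(iris[, 1:2])
db <- dbscan(x, eps = 0.3, minPts = 4)

# Cluster sizes; label 0 collects the noise points
table(db$cluster)

# Row indices of the points labelled as noise
noise_idx <- which(db$cluster == 0)
length(noise_idx)

# Draw noise points as crosses and clustered points as filled circles
plot(x, col = db$cluster + 1L, pch = ifelse(db$cluster == 0, 4, 19))
```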
YOUR ANSWER HERE
For each data point, we can calculate the local outlier factor (LOF), which quantifies how much a point is locally an outlier using the reachability distance:
lof <- lof(x, minPts=5)
pairs(x, cex = lof) # plot the points with symbol size scaled by the LOF score
When calculating the Local Outlier Factor (LOF), the reachability distances are used to estimate the local density of a point, which is then compared to the local densities of its neighbors. If a point’s local density is significantly lower than that of its neighbors, the point is more isolated and therefore likely to be an outlier. In other words, larger reachability distances mean that the point is farther from its neighbors than those neighbors are from each other. LOF scores close to 1 indicate a density similar to that of the neighborhood, while scores well above 1 indicate a local outlier.
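To make the scores concrete, the following sketch ranks the points by LOF and highlights those above a cutoff. The cutoff of 1.5 is a hypothetical value chosen for illustration, not a recommendation from the package; the name lof_scores is used instead of lof to avoid shadowing the function.

```r
library(dbscan)

data("iris")
x <- as.matrix(iris[, 1:2])

# LOF score for every point, same minPts as above
lof_scores <- lof(x, minPts = 5)

# The five points with the highest LOF scores
top <- head(order(lof_scores, decreasing = TRUE), 5)
cbind(x[top, ], lof = lof_scores[top])

# Highlight points above a hypothetical cutoff of 1.5
threshold <- 1.5
plot(x, pch = 19, col = ifelse(lof_scores > threshold, "red", "grey"))
```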
OPTICS
The Ordering Points To Identify the Clustering Structure (OPTICS) algorithm is implemented by the function optics:
?optics
To apply the OPTICS algorithm with parameters eps = 1 (maximum radius of the neighborhood) and minPts = 4 (minimum number of points required in the neighborhood to compute the density or core distance):
opt <- optics(x, eps=1, minPts = 4)
plot(opt)
opt
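The reachability plot is drawn from two components of the returned object: the cluster ordering and the reachability distance of each point. A minimal sketch for inspecting them directly, assuming the order and reachdist fields of the optics object:

```r
library(dbscan)

data("iris")
x <- as.matrix(iris[, 1:2])
opt <- optics(x, eps = 1, minPts = 4)

# Row indices of the points in the order OPTICS visited them
head(opt$order)

# Reachability distances arranged in visiting order (the bar heights in plot(opt))
reach <- opt$reachdist[opt$order]
head(reach)

# Summary of the finite reachability distances
# (a point with no predecessor has an undefined, infinite reachability)
summary(reach[is.finite(reach)])
```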
To identify clusters, we can apply a threshold, for instance 0.3, to the reachability distance. A valley of points with reachability distances below this threshold is considered to be a cluster:
opt <- extractDBSCAN(opt, eps_cl = .3)
plot(opt)
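Cutting the reachability plot at 0.3 should give a clustering close to the DBSCAN result for eps = 0.3 and the same minPts, although border points may be assigned differently. The sketch below cross-tabulates the two label vectors as a rough check; the cluster numbering of the two methods need not match, so look at the pattern of the table rather than at its diagonal.

```r
library(dbscan)

data("iris")
x <- as.matrix(iris[, 1:2])

# Direct DBSCAN clustering
db <- dbscan(x, eps = 0.3, minPts = 4)

# DBSCAN-like clustering extracted from the OPTICS ordering
opt <- optics(x, eps = 1, minPts = 4)
opt_db <- extractDBSCAN(opt, eps_cl = 0.3)

# Cross-tabulate the labels (0 = noise in both)
table(dbscan = db$cluster, optics = opt_db$cluster)
```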
Another method to identify cluster boundaries is to use the minimum steepness, i.e., the minimum rate of change in reachability distance. This approach contrasts with simply applying a fixed threshold to the reachability distance, which may not capture more subtle, context-dependent boundaries. To use the minimum steepness method, we can call the extractXi function instead of extractDBSCAN.
# YOUR CODE HERE
stop("Not implemented yet")
plot(opt)
hullplot(x,opt)
opt
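As one possible way to approach the minimum steepness method described above, extractXi takes the steepness threshold through its xi argument. The value xi = 0.05 below is an assumption chosen only for illustration, and the result is stored under the hypothetical name opt_xi; other values will produce different, possibly nested, clusters.

```r
library(dbscan)

data("iris")
x <- as.matrix(iris[, 1:2])
opt <- optics(x, eps = 1, minPts = 4)

# Extract clusters from steep up/down areas of the reachability plot.
# xi = 0.05 is an assumed example value, not one given in this notebook.
opt_xi <- extractXi(opt, xi = 0.05)

# Reachability plot with the extracted clusters marked
plot(opt_xi)

# Convex hulls of the extracted clusters in the data space
hullplot(x, opt_xi)

# Start and end positions (in OPTICS order) of each extracted cluster
opt_xi$clusters_xi
```

Unlike a fixed reachability threshold, this extraction can return a hierarchy of clusters, so it is worth trying a few values of xi and comparing the plots.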