Cluster sharpening using ksharp

## 
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
## 
##     as.dendrogram

 

Introduction

Clustering assigns data points to groups, i.e clusters, so that each group is internally coherent and also different from the others. In practice, however, clusters often turn out to be almost merging with one another. Sharpening in this context refers to increasing contrasts between cluster groups.

The concept of cluster sharpening goes back at least to the 1980s and the works of Tukey (Cleveland 1988). Methods that specifically mention cluster sharpening include dendrogram sharpening (Stanberry, Nandy, and Cordes 2003) and variations on the k-means algorithm (Garcı́a-Escudero et al. 2008). Some R packages that produce clusterings and provide controls on the level of sharpening include tclust (Fritz, Garcia-Escudero, and Mayo-Iscar 2012) and trimcluster (Hennig 2018).

In contrast to methods that produce de-novo clusterings with specific sharpness properties, the ksharp package aims to adjust existing clusterings. The package provides a general interface for this purpose along with several distinct algorithms. All are compatible with clusterings produced using the stats package, the cluster package (Maechler et al. 2019), the dbscan package (Hahsler, Piekenbrock, and Doran 2019), as well as custom-made assignments.

This vignette first explains cluster sharpening in the methods section. It then demonstrates the algorithms implemented in the package on a range of examples. Throughout, most code is hidden for readability, but is available in the package source.

Methods

Toy example

As a form of motivation and illustration, let’s look at the toy dataset below.

The raw data shows two circular blobs that are nearby on the plane (above, left). k-means clustering using two colors partitions the points into groups that appear symmetrical, except on the boundary between the groups (above, right). The distorted shapes suggest that the true groups may be in fact overlapping and, therefore, certain items assigned to one group may should belong to the other.

The ambiguous points can be marked by demanding increasing levels of sharpness.

With a mild level of sharpening (above, left), a few of the points on the boundary fade into a third group, displayed in gray. Increasing the sharpening threshold makes this group larger. At the cost of discarding some of the data, the leftover groups become more distinct and well-separated (above, right).

Using ksharp

To achieve this effect using the ksharp package, first we cluster the dataset kdata.1 using k-means.

library(ksharp)
km1 = kmeans(kdata.1, centers=2)
table(km1$cluster)
##     1    2 
##   242  258

The algorithm partitions the data into groups of roughly equal size. Next, we sharpen the clustering using command ksharp.

ks1 = ksharp(km1, threshold=0.05, data=kdata.1)
table(ks1$cluster)
##     0    1    2 
##    25  231  244

The sharpening function creates an additional small group with cluster index 0 containing 5% of the data.

On the first run, the sharpening command requires access to the original data and can take a moment to compute various sharpening-related metrics. But once computed, it is straightforward to re-use those metrics to adjust the level of sharpening. We can adjust the size of the left-out group by varying the sharpness threshold.

ks1.10 = ksharp(ks1, threshold=0.1)
table(ks1.10$cluster)
##     0    1    2 
##    50  222  228
ks1.20 = ksharp(ks1, threshold=0.2)
table(ks1.20$cluster)
##     0    1    2 
##   100  202  198

By increasing the threshold, we delegate a larger proportion of the points to the group with index 0, reproducing the effect in the figure above.

Sharpening heuristics

Just like there are many ways to define a clustering algorithm, there are also many conceivable procedures to mask out ambiguous points in a clustering. ksharp implements three, which can be toggled via an argument called method.

  • method="silhouette" is based on silhouette widths. A silhouette width is defined for each data point as the ratio of two distances. In the numerator, there is the average distance to all other points in its cluster. The denominator is the average distance to elements in the nearest adjacent cluster. Sharpening is achieved via a threshold on these silhouette widths.

  • method="medoids" is based on ratios of distances to cluster centers. This is similar to the method using silhouette widths, but only requires computation of distances to representatives of each cluster. Sharpening can be achieved through a threshold on the ratio of distances to the self-cluster and the nearest adjacent cluster.

  • method="neighbors" is based on local neighborhoods. A neighborhood is defined for each point as the set of n of its nearest neighbors. A sharp neighborhood is one where all the neighbors belong to the same cluster. Sharpening can be achieved through a threshold on the size of the neighborhood.

Each of these methods has characterisitic properties that may be suitable/unsuitable for certain datasets.

Examples

Two overlapping clusters

Before moving to more intricate examples, let’s revisit the dataset with the two overlapping blobs using the three sharpening methods.

Each row represents output from a sharpening method. From left to right, increasing the threshold masks out more points. Although the results differ slightly across methods, the changes here appear to be minor.

Non-overlapping, non-spherical clusters

Another dataset with two groups is kdata.2.

Here, points are arranged in non-circular shapes (above, left) and this can confuse the k-means algorithm (above, center). To avoid that, we can create the initial clustering using a density-based algorithm instead, dbscan (above, right).

Sharpening of a dbscan clustering proceeds in the same way as before. (A caveat is that dbscan provides cluster assignments in unnamed lists. ksharp requires a named list, so the initial clustering must be adjusted manually before sharpening.)

The silhouette and medoid methods (top and middle) tend to remove points from the left and right sides of the rectangles. The neighbor-based method (bottom) instead subtracts from the upper and lower sides.

Non-convex groups

Differences between sharpening methods can also be observed when data form groups with non-convex shapes. As an example, let’s use dataset kdata.3 with a clustering by dbscan.

The result from dbscan contains three main groups. It also already assigns some points to a noise group, which has a cluster index of 0. But we can sharpen the result further.

An interesting feature that appears here is that the presence of a noise group in the original clustering also removes points in the real clusters that might overlap with the noise.

Noise points

The last example consists of groups that are well defined, but are surrounded by noise, kdata.4.

Here, the initial clustering is generated by k-means with a manual group seeding. As k-means partitions all points to the specified number of clusters, the surrounding noise points are assigned to groups along the four dense regions near the center. Sharpening can in this case remove some of the noise points.

Interestingly, an already-sharp cluster assignment can be produced in this case by dbscan, which can produce a noise group in its raw output. The above result thus reproduces a similar effect using a different algorithm.

Discussion

Sharpening is a procedure that adjusts an existing clustering in order to better bring out the distinctions between cluster groups. Package ksharp implements a specific form of cluster sharpening called ‘sharpening by excision’ (Cleveland, 1988). This works by removing some boundary points from a dataset to increase contrast between the remaining groups.

There are several possible algorithms for cluster sharpening. The ksharp command delivers three different ones. The object output by function ksharp contains details relevant to all of them. This makes it possible to adjust the sharpening level of already-sharpened clusterings in time O(N), where N is the number of data points. However, the initial creation of these details requires recalculation and sorting of the distance matrix and can thus be expensive on large datasets. Speeding up these calculations using alternative implementations is a possible avenue for development.

References

Cleveland, William S. 1988. The Collected Works of John w. Tukey: Graphics 1965-1985. Vol. 5. CRC Press.
Fritz, Heinrich, Luis A. Garcia-Escudero, and Agustin Mayo-Iscar. 2012. tclust: An R Package for a Trimming Approach to Cluster Analysis.” Journal of Statistical Software 47 (12): 1–26. https://doi.org/10.18637/jss.v047.i12.
Garcı́a-Escudero, Luis A, Alfonso Gordaliza, Carlos Matrán, Agustin Mayo-Iscar, et al. 2008. “A General Trimming Approach to Robust Cluster Analysis.” The Annals of Statistics 36 (3): 1324–45.
Hahsler, Michael, Matthew Piekenbrock, and Derek Doran. 2019. dbscan: Fast Density-Based Clustering with R.” Journal of Statistical Software 91 (1): 1–30. https://doi.org/10.18637/jss.v091.i01.
Hennig, Christian. 2018. Trimcluster: Cluster Analysis with Trimming. https://CRAN.R-project.org/package=trimcluster.
Maechler, Martin, Peter Rousseeuw, Anja Struyf, Mia Hubert, and Kurt Hornik. 2019. Cluster: Cluster Analysis Basics and Extensions.
Stanberry, Larissa, Rajesh Nandy, and Dietmar Cordes. 2003. “Cluster Analysis of fMRI Data Using Dendrogram Sharpening.” Human Brain Mapping 20 (4): 201–19.

Appendix

Implementation notes

Function ksharp is the main API function of the package. This function is designed to be compatible with clusterings produced by k-means and by functions from the cluster package, e.g. pam. It is also compatible with output from dbscan after an adjustment of that clustering output to include data-point names.

The function can also sharpen self-made clusterings, as long as they are prepared as a list with a component cluster consisting of a named vector associating each data point to a numeric cluster id.

Beside the ksharp function, the package also exports other functions tat compute auxiliary information such as silhouette widths. These implementations differ slightly from existing ones from package cluster; see the documentation, e.g. help(silinfo) for details.

Session info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ksharp_0.1.0.1 dbscan_1.2-0   cluster_2.1.6  Rcssplot_1.1.0
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.37     R6_2.5.1          fastmap_1.2.0     xfun_0.49        
##  [5] maketools_1.3.1   cachem_1.1.0      knitr_1.48        htmltools_0.5.8.1
##  [9] generics_0.1.3    rmarkdown_2.29    buildtools_1.0.0  lifecycle_1.0.4  
## [13] cli_3.6.3         sass_0.4.9        jquerylib_0.1.4   compiler_4.4.2   
## [17] highr_0.11        sys_3.4.3         tools_4.4.2       evaluate_1.0.1   
## [21] bslib_0.8.0       Rcpp_1.0.13-1     yaml_2.3.10       jsonlite_1.8.9   
## [25] rlang_1.1.4