--- title: "Cluster sharpening using `ksharp`" output: html_document: toc: true fig_width: 7.2 fig_height: 5.4 fig_retina: 2 bibliography: ksharp.bib cls: biomed-central.cls vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{ksharp vignette} %\VignetteEncoding{UTF-8} --- ```{r, echo=FALSE} suppressMessages(library(Rcssplot)) library(cluster) library(dbscan) library(ksharp) set.seed(300299) RcssDefaultStyle <- Rcss("ksharp.Rcss") ``` ```{r, echo=FALSE} dbscan.named = function(data, eps) { result = dbscan(data, eps=eps) names(result$cluster) = rownames(data) result$data = data result } ``` ```{r, echo=FALSE} plotClusters <- function(data, clusters=NULL, main="", Rcssclass="sharp") { # when clusters not specified, assume all belong to class "plain" if (is.null(clusters)) { clusters = setNames(rep("plain", nrow(data)), rownames(data)) } if (is.null(names(clusters))) { names(clusters) = rownames(data) } RcssCompulsoryClass = RcssGetCompulsoryClass(Rcssclass) titlepos = RcssValue("ksharp", "title.pos", default=c(0, 1)) padding = RcssValue("ksharp", "padding", default=0.1) xlim = range(data[,1]) ylim = range(data[,2]) if (xlim[2]-xlim[1] > ylim[2]-ylim[1]) { ylim = xlim } else { xlim = ylim } xysize = xlim[2]-xlim[1] xylim = xlim+(c(-1,1)*xysize*padding) xysize = xylim[2]-xylim[1] rm(xlim, ylim) plot(xylim, xylim) rect(xylim[1], xylim[1], xylim[2], xylim[2]) # create cluster names like C0, C1, etc cnames = paste0("C", clusters) clusters = split(names(clusters), cnames) # add points to the chart for (iname in names(clusters)) { idata = data[clusters[[iname]], , drop=FALSE] points(idata[,1], idata[,2], Rcssclass=iname) } text(xylim[1] + titlepos[1]*xysize, xylim[1] + titlepos[2]*xysize, main, Rcssclass="main") } emptyplot = function(n) { for (i in 1:n) { plot(c(0, 1), c(0, 1), type="n") } } ``` ```{r, echo=FALSE} # override table to remove spurious empty line table = function(x) { x = base::table(x) width = 5 padspaces = function(y) { if (length(y)>1) { return(sapply(y, padspaces)) } y = as.character(y) paste(c(rep(" ", width-nchar(y)), y), collapse="") } line1 = paste(sapply(names(x), padspaces), collapse="") line2 = paste(sapply(x, padspaces), collapse="") cat(line1, "\n") cat(line2, "\n") } ```   # Introduction Clustering assigns data points to groups, i.e clusters, so that each group is internally coherent and also different from the others. In practice, however, clusters often turn out to be almost merging with one another. Sharpening in this context refers to increasing contrasts between cluster groups. The concept of cluster sharpening goes back at least to the 1980s and the works of Tukey [@cleveland1988collected]. Methods that specifically mention cluster sharpening include dendrogram sharpening [@stanberry2003cluster] and variations on the k-means algorithm [@garcia2008general]. Some R packages that produce clusterings and provide controls on the level of sharpening include `tclust` [@tclust] and `trimcluster` [@trimcluster]. In contrast to methods that produce *de-novo* clusterings with specific sharpness properties, the `ksharp` package aims to adjust *existing* clusterings. The package provides a general interface for this purpose along with several distinct algorithms. All are compatible with clusterings produced using the `stats` package, the `cluster` package [@cluster], the `dbscan` package [@dbscan], as well as custom-made assignments. This vignette first explains cluster sharpening in the methods section. It then demonstrates the algorithms implemented in the package on a range of examples. Throughout, most code is hidden for readability, but is available in the package source. # Methods ## Toy example As a form of motivation and illustration, let's look at the toy dataset below. ```{r, echo=FALSE} K1 = kdata.1 km1 = kmeans(K1, 2) ``` ```{r, echo=FALSE, fig.width=7.2, fig.height=1.8} oldpar = par(mfrow=c(1,4)) plotClusters(K1, main="raw data", Rcssclass="unsharp") plotClusters(K1, km1$cluster, main="kmeans (k=2)", Rcssclass="unsharp") emptyplot(2) # this vignette uses Rcssplot, which changes the behavior of par a little # resetting par is achieved either via graphics:: or dev.off() graphics::par(oldpar) ``` The raw data shows two circular blobs that are nearby on the plane (above, left). k-means clustering using two colors partitions the points into groups that appear symmetrical, except on the boundary between the groups (above, right). The distorted shapes suggest that the true groups may be in fact overlapping and, therefore, certain items assigned to one group may should belong to the other. The ambiguous points can be marked by demanding increasing levels of sharpness. ```{r, echo=FALSE, fig.width=7.2, fig.height=1.8} sharp1 = ksharp(km1, data=K1) oldpar = par(mfrow=c(1,4)) thresholds = c(0.05, 0.1, 0.2, 0.4) for (t in thresholds) { plotClusters(K1, ksharp(sharp1, t, method="sil")$cluster, main=paste0("k=2, threshold=", t)) } graphics::par(oldpar) ``` With a mild level of sharpening (above, left), a few of the points on the boundary fade into a third group, displayed in gray. Increasing the sharpening threshold makes this group larger. At the cost of discarding some of the data, the leftover groups become more distinct and well-separated (above, right). ## Using `ksharp` To achieve this effect using the `ksharp` package, first we cluster the dataset `kdata.1` using k-means. ```{r} library(ksharp) km1 = kmeans(kdata.1, centers=2) table(km1$cluster) ``` The algorithm partitions the data into groups of roughly equal size. Next, we sharpen the clustering using command `ksharp`. ```{r} ks1 = ksharp(km1, threshold=0.05, data=kdata.1) table(ks1$cluster) ``` The sharpening function creates an additional small group with cluster index `0` containing 5\% of the data. On the first run, the sharpening command requires access to the original data and can take a moment to compute various sharpening-related metrics. But once computed, it is straightforward to re-use those metrics to adjust the level of sharpening. We can adjust the size of the left-out group by varying the sharpness threshold. ```{r} ks1.10 = ksharp(ks1, threshold=0.1) table(ks1.10$cluster) ks1.20 = ksharp(ks1, threshold=0.2) table(ks1.20$cluster) ``` By increasing the threshold, we delegate a larger proportion of the points to the group with index `0`, reproducing the effect in the figure above. ## Sharpening heuristics Just like there are many ways to define a clustering algorithm, there are also many conceivable procedures to mask out ambiguous points in a clustering. `ksharp` implements three, which can be toggled via an argument called `method`. - `method="silhouette"` is based on silhouette widths. A silhouette width is defined for each data point as the ratio of two distances. In the numerator, there is the average distance to all other points in its cluster. The denominator is the average distance to elements in the nearest adjacent cluster. Sharpening is achieved via a threshold on these silhouette widths. - `method="medoids"` is based on ratios of distances to cluster centers. This is similar to the method using silhouette widths, but only requires computation of distances to representatives of each cluster. Sharpening can be achieved through a threshold on the ratio of distances to the self-cluster and the nearest adjacent cluster. - `method="neighbors"` is based on local neighborhoods. A neighborhood is defined for each point as the set of `n` of its nearest neighbors. A sharp neighborhood is one where all the neighbors belong to the same cluster. Sharpening can be achieved through a threshold on the size of the neighborhood. Each of these methods has characterisitic properties that may be suitable/unsuitable for certain datasets. # Examples ## Two overlapping clusters Before moving to more intricate examples, let's revisit the dataset with the two overlapping blobs using the three sharpening methods. ```{r, echo=FALSE} oldpar = par(mfrow=c(3,4)) methods = c("silhouette", "medoid", "neighbor") for (m in methods) { for (t in thresholds) { plotClusters(K1, ksharp(sharp1, t, method=m)$cluster, main=paste0(m, ", ", t)) } } graphics::par(oldpar) ``` Each row represents output from a sharpening method. From left to right, increasing the threshold masks out more points. Although the results differ slightly across methods, the changes here appear to be minor. ## Non-overlapping, non-spherical clusters Another dataset with two groups is `kdata.2`. ```{r, echo=FALSE} K2 = kdata.2 K2dist = dist(K2) km2 = kmeans(K2, 2) db2 = dbscan.named(K2, 0.3) sharp2 = ksharp(db2, method="neighbor") ``` ```{r, echo=FALSE, fig.width=7.2, fig.height=1.8} oldpar = par(mfrow=c(1,4)) plotClusters(K2, cluster=NULL, main="raw", Rcssclass="unsharp") plotClusters(K2, km2$cluster, main="kmeans, k=2", Rcssclass="unsharp") plotClusters(K2, db2$cluster, main="dbscan, k=2", Rcssclass="unsharp") emptyplot(1) graphics::par(oldpar) ``` Here, points are arranged in non-circular shapes (above, left) and this can confuse the k-means algorithm (above, center). To avoid that, we can create the initial clustering using a density-based algorithm instead, `dbscan` (above, right). Sharpening of a `dbscan` clustering proceeds in the same way as before. (A caveat is that `dbscan` provides cluster assignments in unnamed lists. `ksharp` requires a named list, so the initial clustering must be adjusted manually before sharpening.) ```{r, echo=FALSE} oldpar = par(mfrow=c(3,4)) for (m in methods) { for (t in thresholds) { plotClusters(K2, ksharp(sharp2, t, method=m)$cluster, main=paste0(m, ", ", t)) } } graphics::par(oldpar) ``` The silhouette and medoid methods (top and middle) tend to remove points from the left and right sides of the rectangles. The neighbor-based method (bottom) instead subtracts from the upper and lower sides. ## Non-convex groups Differences between sharpening methods can also be observed when data form groups with non-convex shapes. As an example, let's use dataset `kdata.3` with a clustering by `dbscan`. ```{r, echo=FALSE, fig.height=1.8} K3 = kdata.3 db3 = dbscan.named(K3, 0.25) oldpar = par(mfrow=c(1,4)) plotClusters(K3, cluster=NULL, main="raw", Rcssclass="unsharp") plotClusters(K3, db3$cluster, main="dbscan", Rcssclass="unsharp") emptyplot(2) graphics::par(oldpar) ``` The result from `dbscan` contains three main groups. It also already assigns some points to a noise group, which has a cluster index of 0. But we can sharpen the result further. ```{r, echo=FALSE} sharp3 = ksharp(db3, data=K3, method="neighbor") oldpar = par(mfrow=c(3,4)) for (m in methods) { for (t in thresholds) { plotClusters(K3, ksharp(sharp3, t, method=m)$cluster, main=paste0(m, ", ", t)) } } graphics::par(oldpar) ``` An interesting feature that appears here is that the presence of a noise group in the original clustering also removes points in the real clusters that might overlap with the noise. ## Noise points The last example consists of groups that are well defined, but are surrounded by noise, `kdata.4`. ```{r, echo=FALSE} K4 = kdata.4 K4.centers = matrix(c(-3,-3,-3,3,3,-3,3,3), ncol=2, byrow=T) km4 = kmeans(K4, K4.centers) sharp4 = ksharp(km4, data=K4) ``` ```{r, echo=FALSE} oldpar = par(mfrow=c(3,4)) thresholds2 = c(0.05, 0.1, 0.2, 0.4) for (m in methods) { for (t in thresholds2) { plotClusters(K4, ksharp(sharp4, t, method=m)$cluster, main=paste0(m, ", ", t)) } } graphics::par(oldpar) ``` Here, the initial clustering is generated by k-means with a manual group seeding. As k-means partitions all points to the specified number of clusters, the surrounding noise points are assigned to groups along the four dense regions near the center. Sharpening can in this case remove some of the noise points. Interestingly, an already-sharp cluster assignment can be produced in this case by `dbscan`, which can produce a noise group in its raw output. The above result thus reproduces a similar effect using a different algorithm. # Discussion Sharpening is a procedure that adjusts an existing clustering in order to better bring out the distinctions between cluster groups. Package `ksharp` implements a specific form of cluster sharpening called 'sharpening by excision' (Cleveland, 1988). This works by removing some boundary points from a dataset to increase contrast between the remaining groups. There are several possible algorithms for cluster sharpening. The `ksharp` command delivers three different ones. The object output by function `ksharp` contains details relevant to all of them. This makes it possible to adjust the sharpening level of already-sharpened clusterings in time O(N), where N is the number of data points. However, the initial creation of these details requires recalculation and sorting of the distance matrix and can thus be expensive on large datasets. Speeding up these calculations using alternative implementations is a possible avenue for development. # References
# Appendix ## Implementation notes Function `ksharp` is the main API function of the package. This function is designed to be compatible with clusterings produced by k-means and by functions from the `cluster` package, e.g. `pam`. It is also compatible with output from `dbscan` after an adjustment of that clustering output to include data-point names. The function can also sharpen self-made clusterings, as long as they are prepared as a list with a component `cluster` consisting of a named vector associating each data point to a numeric cluster id. Beside the `ksharp` function, the package also exports other functions tat compute auxiliary information such as silhouette widths. These implementations differ slightly from existing ones from package `cluster`; see the documentation, e.g. `help(silinfo)` for details. ## Session info ```{r} sessionInfo() ```