ksharp
##
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
##
## as.dendrogram
Clustering assigns data points to groups, i.e clusters, so that each group is internally coherent and also different from the others. In practice, however, clusters often turn out to be almost merging with one another. Sharpening in this context refers to increasing contrasts between cluster groups.
The concept of cluster sharpening goes back at least to the 1980s and
the works of Tukey (Cleveland 1988).
Methods that specifically mention cluster sharpening include dendrogram
sharpening (Stanberry, Nandy, and Cordes
2003) and variations on the k-means algorithm (Garcı́a-Escudero et al. 2008). Some R packages
that produce clusterings and provide controls on the level of sharpening
include tclust
(Fritz,
Garcia-Escudero, and Mayo-Iscar 2012) and
trimcluster
(Hennig
2018).
In contrast to methods that produce de-novo clusterings with
specific sharpness properties, the ksharp
package aims to
adjust existing clusterings. The package provides a general
interface for this purpose along with several distinct algorithms. All
are compatible with clusterings produced using the stats
package, the cluster
package (Maechler et al. 2019), the dbscan
package (Hahsler, Piekenbrock, and Doran
2019), as well as custom-made assignments.
This vignette first explains cluster sharpening in the methods section. It then demonstrates the algorithms implemented in the package on a range of examples. Throughout, most code is hidden for readability, but is available in the package source.
As a form of motivation and illustration, let’s look at the toy dataset below.
The raw data shows two circular blobs that are nearby on the plane (above, left). k-means clustering using two colors partitions the points into groups that appear symmetrical, except on the boundary between the groups (above, right). The distorted shapes suggest that the true groups may be in fact overlapping and, therefore, certain items assigned to one group may should belong to the other.
The ambiguous points can be marked by demanding increasing levels of sharpness.
With a mild level of sharpening (above, left), a few of the points on the boundary fade into a third group, displayed in gray. Increasing the sharpening threshold makes this group larger. At the cost of discarding some of the data, the leftover groups become more distinct and well-separated (above, right).
ksharp
To achieve this effect using the ksharp
package, first
we cluster the dataset kdata.1
using k-means.
## 1 2
## 242 258
The algorithm partitions the data into groups of roughly equal size.
Next, we sharpen the clustering using command ksharp
.
## 0 1 2
## 25 231 244
The sharpening function creates an additional small group with
cluster index 0
containing 5% of the data.
On the first run, the sharpening command requires access to the original data and can take a moment to compute various sharpening-related metrics. But once computed, it is straightforward to re-use those metrics to adjust the level of sharpening. We can adjust the size of the left-out group by varying the sharpness threshold.
## 0 1 2
## 50 222 228
## 0 1 2
## 100 202 198
By increasing the threshold, we delegate a larger proportion of the
points to the group with index 0
, reproducing the effect in
the figure above.
Before moving to more intricate examples, let’s revisit the dataset with the two overlapping blobs using the three sharpening methods.
Each row represents output from a sharpening method. From left to right, increasing the threshold masks out more points. Although the results differ slightly across methods, the changes here appear to be minor.
Another dataset with two groups is kdata.2
.
Here, points are arranged in non-circular shapes (above, left) and
this can confuse the k-means algorithm (above, center). To avoid that,
we can create the initial clustering using a density-based algorithm
instead, dbscan
(above, right).
Sharpening of a dbscan
clustering proceeds in the same
way as before. (A caveat is that dbscan
provides cluster
assignments in unnamed lists. ksharp
requires a named list,
so the initial clustering must be adjusted manually before
sharpening.)
The silhouette and medoid methods (top and middle) tend to remove points from the left and right sides of the rectangles. The neighbor-based method (bottom) instead subtracts from the upper and lower sides.
Differences between sharpening methods can also be observed when data
form groups with non-convex shapes. As an example, let’s use dataset
kdata.3
with a clustering by dbscan
.
The result from dbscan
contains three main groups. It
also already assigns some points to a noise group, which has a cluster
index of 0. But we can sharpen the result further.
An interesting feature that appears here is that the presence of a noise group in the original clustering also removes points in the real clusters that might overlap with the noise.
The last example consists of groups that are well defined, but are
surrounded by noise, kdata.4
.
Here, the initial clustering is generated by k-means with a manual group seeding. As k-means partitions all points to the specified number of clusters, the surrounding noise points are assigned to groups along the four dense regions near the center. Sharpening can in this case remove some of the noise points.
Interestingly, an already-sharp cluster assignment can be produced in
this case by dbscan
, which can produce a noise group in its
raw output. The above result thus reproduces a similar effect using a
different algorithm.
Sharpening is a procedure that adjusts an existing clustering in
order to better bring out the distinctions between cluster groups.
Package ksharp
implements a specific form of cluster
sharpening called ‘sharpening by excision’ (Cleveland, 1988). This works
by removing some boundary points from a dataset to increase contrast
between the remaining groups.
There are several possible algorithms for cluster sharpening. The
ksharp
command delivers three different ones. The object
output by function ksharp
contains details relevant to all
of them. This makes it possible to adjust the sharpening level of
already-sharpened clusterings in time O(N), where N is the number of
data points. However, the initial creation of these details requires
recalculation and sorting of the distance matrix and can thus be
expensive on large datasets. Speeding up these calculations using
alternative implementations is a possible avenue for development.
Function ksharp
is the main API function of the package.
This function is designed to be compatible with clusterings produced by
k-means and by functions from the cluster
package,
e.g. pam
. It is also compatible with output from
dbscan
after an adjustment of that clustering output to
include data-point names.
The function can also sharpen self-made clusterings, as long as they
are prepared as a list with a component cluster
consisting
of a named vector associating each data point to a numeric cluster
id.
Beside the ksharp
function, the package also exports
other functions tat compute auxiliary information such as silhouette
widths. These implementations differ slightly from existing ones from
package cluster
; see the documentation,
e.g. help(silinfo)
for details.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ksharp_0.1.0.1 dbscan_1.2-0 cluster_2.1.6 Rcssplot_1.1.0
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.37 R6_2.5.1 fastmap_1.2.0 xfun_0.49
## [5] maketools_1.3.1 cachem_1.1.0 knitr_1.48 htmltools_0.5.8.1
## [9] generics_0.1.3 rmarkdown_2.29 buildtools_1.0.0 lifecycle_1.0.4
## [13] cli_3.6.3 sass_0.4.9 jquerylib_0.1.4 compiler_4.4.2
## [17] highr_0.11 sys_3.4.3 tools_4.4.2 evaluate_1.0.1
## [21] bslib_0.8.0 Rcpp_1.0.13-1 yaml_2.3.10 jsonlite_1.8.9
## [25] rlang_1.1.4