The R package umap provides an interface to uniform manifold
approximation and projection (UMAP) algorithms. Several implementations
now exist, including versions of the python package umap-learn. This
vignette explains some aspects of interfacing with the python package.
(For general information on usage of the umap package, see the
introductory vignette.)
As prep, let’s load the package and prepare a small dataset.
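A minimal sketch of this preparation, assuming the small dataset is the four numeric columns of the built-in iris data, stored under the name iris.data that the later calls use:

```r
library(umap)

# keep only the numeric measurement columns of the built-in iris dataset
# (assumption: this is the "small dataset" the text refers to)
iris.data <- iris[, grep("Sepal|Petal", colnames(iris))]
head(iris.data, 3)
```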
The basic command to perform dimensional reduction is umap. By default,
this function uses an implementation written in R. To use the python
package umap-learn instead, that package and its dependencies must be
installed separately (see the python package index or the package
source). The R package reticulate is also required (use
install.packages('reticulate') and library('reticulate')).
Once these are installed, the python implementation is activated by
specifying method="umap-learn" in the call to umap.
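A minimal sketch of such a call, assuming the data are the numeric columns of iris stored as iris.data. The availability check through reticulate is an addition for illustration, so that the snippet does nothing when umap-learn (python module name umap) is missing:

```r
library(umap)
iris.data <- iris[, grep("Sepal|Petal", colnames(iris))]

# illustrative guard: proceed only if reticulate and the python
# module "umap" (installed by umap-learn) can both be found
has_umap_learn <- requireNamespace("reticulate", quietly = TRUE) &&
  reticulate::py_module_available("umap")

if (has_umap_learn) {
  # delegate the calculation to the python implementation
  iris.umap_learn <- umap(iris.data, method = "umap-learn")
  # the embedding coordinates are stored in the layout component
  print(head(iris.umap_learn$layout))
}
```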
(This command is not actually executed in the vignette because
umap-learn may not be available on the rendering system. If umap-learn
is available, the command should execute quietly and create a new
object iris.umap_learn that contains an embedding.)
As covered in the introductory vignette, tuning parameters can be set
via a configuration object and via explicit arguments in the umap
function call. The default configuration is accessible as object
umap.defaults.
## umap configuration parameters
## n_neighbors: 15
## n_components: 2
## metric: euclidean
## n_epochs: 200
## input: data
## init: spectral
## min_dist: 0.1
## set_op_mix_ratio: 1
## local_connectivity: 1
## bandwidth: 1
## alpha: 1
## gamma: 1
## negative_sample_rate: 5
## a: NA
## b: NA
## spread: 1
## random_state: NA
## transform_state: NA
## knn: NA
## knn_repeats: 1
## verbose: FALSE
## umap_learn_args: NA
Note the entry umap_learn_args toward the end. It is set to NA by
default, which indicates that appropriate arguments will be selected
automatically and passed to umap-learn.
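The two ways of setting tuning parameters can be sketched as follows. The parameter values and the object names custom.config and iris.tuned are chosen only for illustration, and the call uses the default R implementation so that it runs without umap-learn:

```r
library(umap)
iris.data <- iris[, grep("Sepal|Petal", colnames(iris))]

# route 1: copy the default configuration and modify selected entries
custom.config <- umap.defaults
custom.config$n_neighbors <- 10
custom.config$min_dist <- 0.2

# umap_learn_args is left untouched, so it keeps its default of NA
is.na(custom.config$umap_learn_args)

# route 2: pass a further setting as an explicit argument in the call
iris.tuned <- umap(iris.data, config = custom.config, n_epochs = 100)
dim(iris.tuned$layout)
```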
After executing dimensional reduction, the output object contains a copy of the configuration with the values actually used to produce the output. We can examine the effective configuration that was used for our embedding.
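A minimal sketch of that inspection, assuming the embedding was stored under the name iris.umap_learn as described above. The exists() guard is an addition so that the snippet is harmless when no such object has been created:

```r
# assumes iris.umap_learn was produced by an earlier call of the form
# umap(..., method = "umap-learn"); otherwise nothing is printed
has_result <- exists("iris.umap_learn")

if (has_result) {
  # full effective configuration used for the embedding
  print(iris.umap_learn$config)
  # names of the settings that were handed to the python package
  print(iris.umap_learn$config$umap_learn_args)
}
```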
(Again, this command is not executed in the vignette because umap-learn
may not be available on the rendering system. When umap-learn is
available, this should produce a configuration printout.)
The entry for umap_learn_args should contain a vector naming all the
arguments passed from the configuration object to the python package.
An entry in the configuration should also reveal the version of the
python package used to perform the calculation.
A configuration object can contain many components, but not all of them
are necessarily used in a calculation. To verify that a setting is
actually passed to umap-learn, ensure that it appears in
umap_learn_args in the output.
As an example, consider setting foo and n_epochs during the function
call.
## (not evaluated in vignette)
iris.foo <- umap(iris.data, method="umap-learn", foo=4, n_epochs=100)
iris.foo$config
Inspecting the output configuration will reveal that both foo and
n_epochs are recorded (in the latter case, the default value is
replaced by the new value). However, foo should not appear in
umap_learn_args. This means that foo was not actually passed on to
umap-learn.
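This check can be automated with a small helper. The function passed_to_umap_learn below is hypothetical (it is not part of the umap package), and mock_result is a hand-made stand-in for a real output object, with made-up umap_learn_args, so that the sketch runs without umap-learn:

```r
# hypothetical helper: for each requested name, report whether it is
# listed in umap_learn_args of the configuration stored in the output
passed_to_umap_learn <- function(result, arg_names) {
  setNames(arg_names %in% result$config$umap_learn_args, arg_names)
}

# stand-in for an output object such as iris.foo; the contents of
# umap_learn_args here are invented for illustration only
mock_result <- list(config = list(
  foo = 4,
  n_epochs = 100,
  umap_learn_args = c("n_neighbors", "n_components", "n_epochs")
))

passed <- passed_to_umap_learn(mock_result, c("foo", "n_epochs"))
passed
```

With a real result, the same check would read passed_to_umap_learn(iris.foo, c("foo", "n_epochs")).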
Different versions of umap-learn take different parameters as input.
The R package is coded to work with umap-learn versions 0.2, 0.3, 0.4,
and 0.5. It will adjust arguments automatically to suit those versions.
Note, however, that some arguments accepted by newer versions of
umap-learn are not set in the default configuration object. To use
those features (see the python package documentation), set the
appropriate arguments manually, either by preparing a custom
configuration object or by specifying the arguments during the umap
function call.
It is possible to set umap_learn_args manually while calling umap.
## (not evaluated in vignette)
iris.custom <- umap(iris.data, method="umap-learn",
                    umap_learn_args=c("n_neighbors", "n_epochs"))
iris.custom$config
Here, only the two specified arguments have been passed on to the calculation.
Summary of R session:
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] umap_0.2.11.0 rmarkdown_2.28
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.3 knitr_1.48 rlang_1.1.4 xfun_0.48
## [5] highr_0.11 png_0.1-8 jsonlite_1.8.9 openssl_2.2.2
## [9] buildtools_1.0.0 askpass_1.2.1 htmltools_0.5.8.1 maketools_1.3.1
## [13] sys_3.4.3 sass_0.4.9 grid_4.4.1 evaluate_1.0.1
## [17] jquerylib_0.1.4 fastmap_1.2.0 yaml_2.3.10 lifecycle_1.0.4
## [21] compiler_4.4.1 RSpectra_0.16-2 Rcpp_1.0.13 lattice_0.22-6
## [25] digest_0.6.37 R6_2.5.1 reticulate_1.39.0 bslib_0.8.0
## [29] Matrix_1.7-1 tools_4.4.1 cachem_1.1.0