parDist
Feb 9, 2018 #parallelDist
Problem Statement:
We have a large dataset and need to compute its distance matrix (e.g. for clustering purposes). For an N × P matrix the cost is roughly
N(N-1)/2 * 3P
operations: there are N(N-1)/2 pairs of rows, and each pairwise (Euclidean) distance takes about 3P arithmetic operations.
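To get a feel for the scale, here is a quick back-of-the-envelope count for the 5,000 × 2 sample generated in the next section (just illustrative arithmetic):
N <- 5000; P <- 2
N * (N - 1) / 2 * 3 * P  # roughly 75 million elementary operations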
Create Sample
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# 5,000 two-dimensional points: the means recycle over c(5, 5, 10, 10, 15, 15),
# so the rows fall into three clusters centred at (5, 5), (10, 10) and (15, 15)
pts <- rnorm(1e4, mean = c(5, 5, 10, 10, 15, 15), sd = 1) %>% matrix(ncol = 2, byrow = TRUE)
pts %>% plot(xlab = "x", ylab = "y")
Package ‘parallelDist’
parallelDist
is a fast parallelized alternative to R’s native dist
function. The package is mainly implemented in C++ and leverages the RcppParellel
package to parallelize the distance computations. In addition, it also uses Armadillo
linear algebra library to optimize matrix operations during distance calculations.
In short, to compute distance matrix for large data object, use parallelDist
because it is much faster.
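To see the speed-up on a given machine, a rough timing comparison with system.time() can be run on the sample points; exact numbers will vary with hardware and the number of available threads, so treat this as a sketch rather than a benchmark.
# Serial dist() vs. parallel parDist() on the same data
system.time(stats::dist(pts, method = "euclidean"))
system.time(parallelDist::parDist(pts, method = "euclidean"))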
Demo
Say we wish to compute a distance matrix in order to calculate silhouette widths for the sample points above and determine the optimal number of clusters (which we already know is 3).
# By default, parDist returns a dist object.
# We convert it to a matrix once so it can be reused via silhouette()'s
# dmatrix argument, which can be slightly more efficient than passing a dist
dist.euclidean <- parallelDist::parDist(pts, method = "euclidean") %>% as.matrix()
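# Optional sanity check (a quick sketch): parDist agrees with stats::dist
# on a small subset, up to floating-point tolerance
all.equal(
  as.matrix(stats::dist(pts[1:10, ])),
  as.matrix(parallelDist::parDist(pts[1:10, ])),
  check.attributes = FALSE
)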
# A helper for looping: for a given k, run k-means and
# return the average silhouette width
compare_silhouette <- function(k){
  kmeans(pts, centers = k, nstart = 20, iter.max = 50)$cluster %>%
    # use dmatrix instead of dist so the precomputed matrix is reused
    cluster::silhouette(dmatrix = dist.euclidean) %>%
    summary() %>%
    # extract the average silhouette width
    `[[`('avg.width')
}
# Here we try out various numbers of clusters
res <- lapply(2:5, compare_silhouette)
library(ggplot2)
data.frame(x = 2:5, y = unlist(res)) %>%
  ggplot(aes(x, y)) +
  geom_line() +
  geom_point() +
  theme_light()
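Under the average silhouette width criterion, the preferred candidate is the k with the largest value, which for this well-separated sample is expected to be 3. A small convenience line (reusing the res object above) pulls it out:
# candidate k with the highest average silhouette width
(2:5)[which.max(unlist(res))]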