Do you use hierarchical clustering packages like R’s hclust
or Python’s scipy.cluster.hierarchy.linkage
in your workflow? If so, you’re using an \(O(N^3)\) algorithm1 and should switch to the fastcluster
package, which provides \(O(N^2)\) routines for the most commonly used types of clustering.
fastcluster
is implemented in C++, with interfaces for C++, R, and Python. In particular, the Python interface mirrors scipy.cluster.hierarchy.linkage
, and the R interface mirrors stats::hclust
and flashClust::flashClust
, so switching over is a no-brainer.
For a performance comparison provided by the package’s author, take a look here. fastcluster
is described in a Journal of Statistical Software publication from 2013, with algorithmic details in a 2011 arXiv paper. The key algorithms used to get to \(O(N^2)\) are a variant of Prim’s algorithm for minimum spanning trees and the nearest-neighbor chain algorithm. Note that centroid and median linkage still take \(O(N^3)\) time with this package.
I haven’t read the details of these algorithms yet, but if they are correct, then the Wikipedia articles on hierarchical clustering and UPGMA (average linkage clustering) could use an update2.