DBSCAN vs. HDBSCAN: Which One Should You Use?
The two density-based clustering algorithms DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) share many similarities, but some key differences make it easier to choose the right one.
This article describes some key differences between DBSCAN and HDBSCAN and helps you choose the best algorithm for your clustering application.
Distance scale parameter (eps)
DBSCAN uses two main parameters for clustering: epsilon (eps) and minPts (min_samples). eps is the maximum distance between two points for them to be considered neighbors, whereas minPts is a density threshold: the minimum number of points that must lie within the eps-neighborhood of a point for it to count as a core point (that is, to be part of a dense region).
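To make the two parameters concrete, here is a minimal sketch using the scikit-learn DBSCAN class. The toy data and the specific eps and min_samples values are made up for illustration and are not recommended defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (invented for this sketch): two tight groups and one isolated point
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
    [10.0, 0.0],
])

# eps: neighborhood radius around each point
# min_samples: number of points (the point itself included) that must fall
# inside that radius for the point to be a core point
model = DBSCAN(eps=0.3, min_samples=3).fit(X)

print(model.labels_)               # one label per point; -1 marks noise
print(model.core_sample_indices_)  # indices of the points found to be core points
```

With these values, the first four points should end up in one cluster, the next four in another, and the isolated point should be labeled as noise (-1).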
HDBSCAN is an extension of DBSCAN and uses three main parameters, min_cluster_size, min_samples, and cluster_selection_epsilon, each of which may have a significant effect on the clustering. I have covered the details of each of these parameters in a separate article.
In HDBSCAN, you can usually get a good clustering using only the min_cluster_size and min_samples parameters. Instead of fixing a single eps, HDBSCAN effectively considers all eps values and selects the clusters that remain most stable across them.
Hence, HDBSCAN eliminates the need to set the eps parameter, which makes it easier to use than DBSCAN.
Clusters with variable densities
Density-based clustering algorithms can discover clusters of arbitrary shapes and effectively identify noise or outliers.
DBSCAN calculates the density around each data point and can identify clusters with high densities. However, because a single eps value applies to the whole dataset, it struggles to find clusters with varying densities.
HDBSCAN addresses this limitation and can find clusters with varying densities. This is mostly because it does not depend on a single fixed eps value. HDBSCAN constructs a hierarchy of clusters, prunes it to find stable clusters, and detects clusters at different scales.
Prediction of new points
DBSCAN does not support predicting cluster labels for new points from a fitted model. The scikit-learn DBSCAN class has no predict method for assigning new points to existing clusters.
However, you can make predictions for new points with HDBSCAN, using the approximate_predict function from the hdbscan package.
Please see the separate article that covers the prediction of new points using HDBSCAN.
Packages
Both DBSCAN and HDBSCAN are implemented in the scikit-learn Python package (HDBSCAN since version 1.3). You can also use the standalone hdbscan Python package for HDBSCAN.