DBSCAN vs. HDBSCAN: Which One Should You Use?

2024-06-01

The two density-based clustering algorithms DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) share many similarities, but some key differences make it easier to choose the right one.

This article describes some key differences between DBSCAN and HDBSCAN and helps you choose the best algorithm for your clustering application.

Distance scale `eps` parameter

DBSCAN uses two main parameters for clustering: epsilon (eps) and minPts (min_samples). eps is the neighborhood radius: two points are treated as neighbors if the distance between them is at most eps. minPts is a density threshold: the minimum number of points that must lie within eps of a point (in scikit-learn, the point itself is counted) for it to be a core point, that is, part of a dense region.
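As a minimal sketch of how these two parameters are passed to scikit-learn's DBSCAN, the example below clusters a small synthetic dataset. The data, the eps value, and the min_samples value are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic data (an assumption for illustration): three compact, well-separated blobs
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [10, 0]],
    cluster_std=0.3,
    random_state=42,
)

# eps: neighborhood radius; min_samples: points needed within eps to make a core point
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_  # one label per point; -1 marks noise

n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))
print("clusters:", n_clusters, "noise points:", n_noise)
```

Changing eps changes the result directly: a much smaller value turns more points into noise, and a much larger one starts merging neighboring clusters.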

HDBSCAN is an extension of DBSCAN and uses three main parameters: min_cluster_size, min_samples, and cluster_selection_epsilon, each of which can have a significant effect on clustering. I have covered the details of each of these parameters in this article.

In HDBSCAN, you can usually get good clustering by tuning only the min_cluster_size and min_samples parameters. Instead of fixing a single eps, HDBSCAN considers the whole range of eps values, builds a cluster hierarchy across them, and selects the most stable clusters.

Hence, HDBSCAN eliminates the need to set the eps parameter, which makes it easier to use than DBSCAN.

Clusters with variable densities

Density-based clustering algorithms can discover clusters of arbitrary shapes and effectively identify noise or outliers.

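DBSCAN estimates the density around each data point using a single global eps, so it works well when all clusters have similar densities. However, it struggles when cluster densities vary: an eps small enough for the dense clusters marks the sparse clusters as noise, while an eps large enough for the sparse clusters can merge the dense ones.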

HDBSCAN addresses this limitation and can find clusters with varying densities. This is mostly because it does not commit to a single eps value. HDBSCAN constructs a hierarchy of clusters, prunes it to find stable clusters, and can therefore detect clusters at different density scales.

Prediction of new points

DBSCAN does not support predicting cluster labels for new points from a fitted model. The DBSCAN class in scikit-learn has no predict method for assigning new points to existing clusters.
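The sketch below checks that a fitted scikit-learn DBSCAN model has no predict method, and then shows one possible workaround. The helper assign_to_nearest_core is hypothetical and not part of scikit-learn: it gives a new point the label of its nearest core sample if that sample lies within eps, and labels it as noise otherwise. The data and parameter values are assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic data (an assumption for illustration): two compact blobs
X, _ = make_blobs(
    n_samples=200, centers=[[0, 0], [5, 5]], cluster_std=0.3, random_state=0
)
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

# DBSCAN offers fit and fit_predict, but no predict for unseen points
print("has predict:", hasattr(db, "predict"))

def assign_to_nearest_core(model, new_points):
    """Hypothetical helper, not a scikit-learn API: label each new point
    with the cluster of its nearest core sample, or -1 if farther than eps."""
    core_points = model.components_
    core_labels = model.labels_[model.core_sample_indices_]
    # Pairwise Euclidean distances, shape (n_new, n_core)
    dists = np.linalg.norm(
        new_points[:, None, :] - core_points[None, :, :], axis=2
    )
    nearest = dists.argmin(axis=1)
    labels = core_labels[nearest]  # fancy indexing returns a copy
    labels[dists.min(axis=1) > model.eps] = -1
    return labels

new_points = np.array([[0.1, -0.1], [5.0, 5.1], [20.0, 20.0]])
new_labels = assign_to_nearest_core(db, new_points)
print(new_labels)
```

This workaround does not change the fitted clusters; it only attaches new points to them, and it assumes Euclidean distance.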

However, you can make predictions for new points using HDBSCAN. You can use the approximate_predict function from the hdbscan package to assign new points to the fitted clusters.

Please read this article which covers the prediction of new points using the HDBSCAN.

Packages

Both DBSCAN and HDBSCAN are implemented in the scikit-learn Python package (HDBSCAN from version 1.3 onward). You can also use the hdbscan Python package for HDBSCAN.