Understanding Differences in HDBSCAN Parameters

2024-04-18

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is an extension of the DBSCAN algorithm. It has three main parameters (min_cluster_size, min_samples, and cluster_selection_epsilon) that control the clustering process.

This article describes the major differences between min_cluster_size, min_samples, and cluster_selection_epsilon in HDBSCAN and how they control the clustering process.

`min_cluster_size`

The min_cluster_size parameter sets the minimum number of points a group must contain to be treated as a cluster. Groups smaller than this are not reported as clusters, which filters out small clusters and noise.

min_cluster_size should be set based on the input dataset and the number of clusters we intend to achieve. A higher value tends to reduce the number of clusters, because small clusters are either merged into larger ones or discarded as noise.

To capture fine-grained clusters, keep this value low, since clusters with only a few points can also be important.

`min_samples`

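The min_samples parameter comes from DBSCAN, where it is usually called minPts. It defines how many neighboring data points are needed for a point to count as a core point, that is, part of a dense region. In DBSCAN, a core point has at least minPts data points within an eps radius. HDBSCAN has no fixed eps; it uses min_samples to compute each point's core distance (the distance to its min_samples-th nearest neighbor), so a larger value makes the clustering more conservative and labels more points as noise.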

By default, min_samples is set to the same value as min_cluster_size in HDBSCAN. Changing it can have a significant effect on the clustering.

Lowering min_samples can help recover clusters that are lost to noise when min_cluster_size, and with it the default min_samples, is set too high.
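
To see why min_samples makes the clustering more or less conservative, it helps to look at the core distance: the distance from a point to its k-th nearest neighbor, with k playing the role of min_samples. The following is a minimal standard-library sketch, not the library's implementation. The core_distance helper and the toy points are hypothetical, and the sketch assumes that a point does not count as its own neighbor (implementations may differ by one here).

```python
import math


def core_distance(points, index, k):
    """Distance from points[index] to its k-th nearest other point."""
    target = points[index]
    distances = sorted(
        math.dist(target, other)
        for j, other in enumerate(points)
        if j != index  # assumption: a point is not its own neighbor
    )
    return distances[k - 1]


# Five points on a line: a tight group of four and one straggler.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]

for k in (1, 3):
    distances = [core_distance(points, i, k) for i in range(len(points))]
    print(f"k={k}: {distances}")
```

With k=1 every point in the tight group has a core distance of 1.0 and the straggler has 7.0. With k=3 the group's values rise to between 2.0 and 3.0 and the straggler's to 9.0. Larger values of k make every point look as if it sits in a sparser region, which is why raising min_samples pushes more points into noise.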

`cluster_selection_epsilon`

cluster_selection_epsilon is a distance threshold: clusters that would separate only at distances below this value are merged instead of being split further. Its role is similar to that of eps in DBSCAN.

Setting cluster_selection_epsilon keeps clusters intact up to that threshold and helps prevent dense regions from being split into many small clusters.