Understanding Differences in HDBSCAN Parameters
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is an extension to the DBSCAN algorithm and has three main parameters (min_cluster_size, min_samples, and cluster_selection_epsilon) to control the clustering process.
This article describes the major differences between min_cluster_size, min_samples, and cluster_selection_epsilon in HDBSCAN and how they control the clustering process.
min_cluster_size
The min_cluster_size parameter sets the minimum number of points required to form a cluster. It helps filter out small clusters and noise.
min_cluster_size should be set based on the input dataset and the number of clusters we intend to achieve. A higher value of min_cluster_size can reduce the number of clusters, because groups smaller than the threshold are either absorbed into larger clusters or treated as noise.
To keep fine-grained clusters, keep this value low, since clusters with few points can also be important.
min_samples
The min_samples parameter comes from DBSCAN. It defines the number of data points required to form a core point (a point in a dense region). In DBSCAN terms, a core point has at least minPts data points within an eps radius, and min_samples plays the role of minPts.
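The core point definition can be written out directly. The following is a minimal sketch using only the standard library; the function name core_point_mask and the sample points are hypothetical, and it assumes that a point counts as its own neighbour (the convention scikit-learn's DBSCAN uses).

```python
from math import dist

def core_point_mask(points, eps, min_pts):
    """Return a list of booleans: True where the point is a core point.

    A point is a core point if at least min_pts points (itself included)
    lie within distance eps of it.
    """
    mask = []
    for p in points:
        neighbours = sum(1 for q in points if dist(p, q) <= eps)
        mask.append(neighbours >= min_pts)
    return mask

# Four points on a unit square plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
mask = core_point_mask(points, eps=1.5, min_pts=4)
print(mask)  # [True, True, True, True, False]
```

Each corner of the square has all four square points within 1.5 units (the diagonal is about 1.414), so it qualifies; the isolated point only has itself.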
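By default, the value of min_samples is the same as min_cluster_size in HDBSCAN. Changing this value has a significant effect on clustering: a larger min_samples makes the clustering more conservative, so more points are labelled as noise.
Because of the shared default, raising min_cluster_size also raises min_samples. Lowering min_samples explicitly can help restore clusters that are lost when min_cluster_size is too high.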
cluster_selection_epsilon
cluster_selection_epsilon specifies a distance threshold: clusters that would only separate at distances below this threshold are not split any further and are kept as one cluster. In this respect it is similar to the eps value in DBSCAN.
Setting cluster_selection_epsilon keeps clusters intact up to that threshold and helps prevent a dense region from being split into many small clusters.