How to Choose Optimal Hyperparameters for DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) has two main hyperparameters:
eps
(epsilon) and MinPts
(minimum number of points).
The eps
parameter defines the radius for searching the neighboring points within a cluster, whereas
MinPts
defines the minimum number of points required to form a core point (dense regions). A core point
has at least minPts
data points within a eps
radius.
This article describes the rules and methods for choosing the optimal values for MinPts
and eps
for forming the clusters.
Choosing MinPts
The MinPts
parameter is crucial in DBSCAN as it determines the minimum number of points required to
form a dense regions and clusters.
There is no pre-defined rule or method to determine MinPts
for the DBSCAN. The following rules might
help to determine the appropriate MinPts
for the DBSCAN.
- The
MinPts
should be equal to or larger than the the dimensionality of the dataset. For example, if you have four features, theMinPts
should be equal to or greater than the four. - Sometimes, you should also try setting the
MinPts
= 2*dimensionality. For example, if you have two features, theMinPts
should be 4. This is a good rule of thumb and a starting point for initiating DBSCAN clustering. - The larger values for the
MinPts
could be necessary if the dataset is noisy and larger. - If you set a lower value such as 1 or 2 for
MinPts
, it may not be useful for forming meaningful clusters. - You can also visualize the clusters with different values of
MinPts
and then selectMinPts
which gives the best clustering result.
MinPts
could be challenging. Domain knowledge
is often required to set optimum value for theMinPts
. The above rules are good starting
points for initiating the DBSCAN clustering and work in most of cases.Choosing eps
To compute the optimal value for eps
, you can use the k-nearest neighbor (kNN) distances method. In this method,
we compute the average distance of each data point to the k-nearest neighbor.
Intitally, you will need to provide the value for the number of neighbors, which can be the same as the
MinPts
.
The following examples illustrates how to estimae the eps
.
# import package
from sklearn.neighbors import NearestNeighbors
import numpy as np
import matplotlib.pyplot as plt
# create random dataset with 2 features
X = np.random.randn(500, 2)
Compute the k-nearest neighbors (calculate distance of points to its k-th nearest neighbor).
# initialize the value of k for kNN which can be same as MinPts
k = 4
# Compute k-nearest neighbors
# you need to add 1 to k as this function also return
# distance to itself (first column is zero)
nbrs = NearestNeighbors(n_neighbors=k+1).fit(X)
# get distances
dist, ind = nbrs.kneighbors(X)
Drop first column (which is zero) and sort the distances in ascending order,
k_dist = np.sort(dist[:, -1])
Plot the kNN graph with sorted distance,
plt.plot(k_dist)
plt.xlabel('Distance sorted points')
plt.ylabel(f'{k}-Distance')
plt.show()
In the above kNN distance plot, you can find the optimal value of eps
by looking at the “knee” or “elbow” point (where there is significant change) in the kNN curve. You can see
that there is a knee at 0.25 and indicates a good value for the eps
.
This eps
point indicates a natural threshold for distinguishing between cluster points and noise points in the dataset.
It can sometimes be difficult to visually identify the ideal value for eps
. For such cases, you should evaluate the values within a certain range
where there is a significant change, and run DBSCAN with different eps
values within that range to determine the optimal eps
value.
Additionally, you can also use automatic methods to identify the eps
value from the kNN graph.