How to Choose Optimal Hyperparameters for DBSCAN

2024-04-04 621 words 3 minutes

Contents

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) has two main hyperparameters: eps (epsilon) and MinPts (minimum number of points).

The eps parameter defines the radius for searching the neighboring points within a cluster, whereas MinPts defines the minimum number of points required to form a core point (dense regions). A core point has at least minPts data points within a eps radius.

This article describes the rules and methods for choosing the optimal values for MinPts and eps for forming the clusters.

Choosing `MinPts`

The MinPts parameter is crucial in DBSCAN as it determines the minimum number of points required to form a dense regions and clusters.

There is no pre-defined rule or method to determine MinPts for the DBSCAN. The following rules might help to determine the appropriate MinPts for the DBSCAN.

The MinPts should be equal to or larger than the the dimensionality of the dataset. For example, if you have four features, the MinPts should be equal to or greater than the four.
Sometimes, you should also try setting the MinPts= 2*dimensionality. For example, if you have two features, the MinPts should be 4. This is a good rule of thumb and a starting point for initiating DBSCAN clustering.
The larger values for the MinPts could be necessary if the dataset is noisy and larger.
If you set a lower value such as 1 or 2 for MinPts, it may not be useful for forming meaningful clusters.
You can also visualize the clusters with different values of MinPts and then select MinPts which gives the best clustering result.

Tip

Setting the optimal value for MinPts could be challenging. Domain knowledge is often required to set optimum value for theMinPts. The above rules are good starting points for initiating the DBSCAN clustering and work in most of cases.

Choosing `eps`

To compute the optimal value for eps, you can use the k-nearest neighbor (kNN) distances method. In this method, we compute the average distance of each data point to the k-nearest neighbor.

Intitally, you will need to provide the value for the number of neighbors, which can be the same as the MinPts.

The following examples illustrates how to estimae the eps.

# import package
from sklearn.neighbors import NearestNeighbors
import numpy as np
import matplotlib.pyplot as plt


# create random dataset with 2 features
X = np.random.randn(500, 2)

Compute the k-nearest neighbors (calculate distance of points to its k-th nearest neighbor).

# initialize the value of k for kNN which can be same as MinPts
k = 4

# Compute k-nearest neighbors
# you need to add 1 to k as this function also return 
# distance to itself (first column is zero)
nbrs = NearestNeighbors(n_neighbors=k+1).fit(X)

# get distances
dist, ind = nbrs.kneighbors(X)

Drop first column (which is zero) and sort the distances in ascending order,

k_dist = np.sort(dist[:, -1])

Plot the kNN graph with sorted distance,

plt.plot(k_dist)
plt.xlabel('Distance sorted points')
plt.ylabel(f'{k}-Distance')
plt.show()

/images/dbscan_py/e_estimate_knn.png — kNN plot for estimating eps

In the above kNN distance plot, you can find the optimal value of eps by looking at the “knee” or “elbow” point (where there is significant change) in the kNN curve. You can see that there is a knee at 0.25 and indicates a good value for the eps.

This eps point indicates a natural threshold for distinguishing between cluster points and noise points in the dataset.

It can sometimes be difficult to visually identify the ideal value for eps. For such cases, you should evaluate the values within a certain range where there is a significant change, and run DBSCAN with different eps values within that range to determine the optimal eps value.

Additionally, you can also use automatic methods to identify the eps value from the kNN graph.

Contents

How to Choose Optimal Hyperparameters for DBSCAN

Choosing MinPts

Choosing eps

Choosing `MinPts`

Choosing `eps`