# How to Choose Optimal Hyperparameters for DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) has two main hyperparameters:
`eps`

(epsilon) and `MinPts`

(minimum number of points).

The `eps`

parameter defines the radius for searching the neighboring points within a cluster, whereas
`MinPts`

defines the minimum number of points required to form a core point (dense regions). A core point
has at least `minPts`

data points within a `eps`

radius.

This article describes the rules and methods for choosing the optimal values for `MinPts`

and `eps`

for forming the clusters.

## Choosing `MinPts`

The `MinPts`

parameter is crucial in DBSCAN as it determines the minimum number of points required to
form a dense regions and clusters.

There is no pre-defined rule or method to determine `MinPts`

for the DBSCAN. The following rules might
help to determine the appropriate `MinPts`

for the DBSCAN.

- The
`MinPts`

should be equal to or larger than the the dimensionality of the dataset. For example, if you have four features, the`MinPts`

should be equal to or greater than the four. - Sometimes, you should also try setting the
`MinPts`

= 2*dimensionality. For example, if you have two features, the`MinPts`

should be 4. This is a good rule of thumb and a starting point for initiating DBSCAN clustering. - The larger values for the
`MinPts`

could be necessary if the dataset is noisy and larger. - If you set a lower value such as 1 or 2 for
`MinPts`

, it may not be useful for forming meaningful clusters. - You can also visualize the clusters with different values of
`MinPts`

and then select`MinPts`

which gives the best clustering result.

`MinPts`

could be challenging. Domain knowledge
is often required to set optimum value for the`MinPts`

. The above rules are good starting
points for initiating the DBSCAN clustering and work in most of cases.## Choosing `eps`

To compute the optimal value for `eps`

, you can use the k-nearest neighbor (kNN) distances method. In this method,
we compute the average distance of each data point to the k-nearest neighbor.

Intitally, you will need to provide the value for the number of neighbors, which can be the same as the
`MinPts`

.

The following examples illustrates how to estimae the `eps`

.

```
# import package
from sklearn.neighbors import NearestNeighbors
import numpy as np
import matplotlib.pyplot as plt
# create random dataset with 2 features
X = np.random.randn(500, 2)
```

Compute the k-nearest neighbors (calculate distance of points to its k-th nearest neighbor).

```
# initialize the value of k for kNN which can be same as MinPts
k = 4
# Compute k-nearest neighbors
# you need to add 1 to k as this function also return
# distance to itself (first column is zero)
nbrs = NearestNeighbors(n_neighbors=k+1).fit(X)
# get distances
dist, ind = nbrs.kneighbors(X)
```

Drop first column (which is zero) and sort the distances in ascending order,

```
k_dist = np.sort(dist[:, -1])
```

Plot the kNN graph with sorted distance,

```
plt.plot(k_dist)
plt.xlabel('Distance sorted points')
plt.ylabel(f'{k}-Distance')
plt.show()
```

In the above kNN distance plot, you can find the optimal value of `eps`

by looking at the **“knee” or “elbow” point (where there is significant change)** in the kNN curve. You can see
that there is a knee at 0.25 and indicates a good value for the `eps`

.

This `eps`

point indicates a natural threshold for distinguishing between cluster points and noise points in the dataset.

It can sometimes be difficult to visually identify the ideal value for `eps`

. For such cases, you should evaluate the values within a certain range
where there is a significant change, and run DBSCAN with different `eps`

values within that range to determine the optimal `eps`

value.

Additionally, you can also use automatic methods to identify the `eps`

value from the kNN graph.