How to Predict New Points in DBSCAN in Python

2024-04-18

You can fit a DBSCAN model using the DBSCAN class in the sklearn.cluster module of the scikit-learn Python package. However, scikit-learn's DBSCAN has no predict() method for assigning new points to existing clusters.
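The limitation can be checked directly. The following is a minimal sketch; the random data, the seed, and the eps and min_samples values are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative data; the seed and the eps/min_samples values are assumptions
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 2))

dbscan = DBSCAN(eps=0.5, min_samples=5).fit(X_demo)

# Labels exist only for the points the model was fitted on
labels = dbscan.labels_

# DBSCAN offers fit() and fit_predict(), but no predict() for unseen points
has_predict = hasattr(dbscan, "predict")
print(has_predict)  # False
```

Note that fit_predict() clusters the data passed to it from scratch, so it does not assign new points to clusters found earlier.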

You can use the HDBSCAN Python package if you want to predict new points and assign them to existing clusters.

This article describes how to use HDBSCAN, a density-based extension of DBSCAN, to predict the cluster of new points.

Create a training and testing dataset

Create sample training and testing datasets to fit the model and predict new points.

# import package
import numpy as np

# create random training dataset with 2 features
X = np.random.randn(200, 2)

# create a testing dataset 
Y = np.random.randn(5, 2)
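Because np.random.randn() is called without a seed, the datasets (and therefore the cluster labels shown later) change on every run. If reproducible results are wanted, one option is a seeded generator. This is a sketch only; the seed value 42 is an arbitrary assumption, and it will not reproduce the exact labels printed in this article.

```python
import numpy as np

# Seeded generator so the same data is produced on every run (seed is arbitrary)
rng = np.random.default_rng(42)

# training dataset: 200 samples, 2 features
X = rng.standard_normal((200, 2))

# testing dataset: 5 new points, 2 features
Y = rng.standard_normal((5, 2))

print(X.shape, Y.shape)  # (200, 2) (5, 2)
```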

Fit the DBSCAN model

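We will use HDBSCAN to fit a density-based clustering model on the training dataset. HDBSCAN is a hierarchical extension of the DBSCAN algorithm, so its clusters are not guaranteed to match those produced by scikit-learn's DBSCAN.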

HDBSCAN has two main hyperparameters, min_samples and min_cluster_size, which have a significant effect on clustering.

Larger values of min_cluster_size generally result in fewer clusters. By default, the min_samples value is the same as min_cluster_size.

Setting the prediction_data parameter to True tells HDBSCAN to cache extra data during fitting, which approximate_predict() needs in order to predict new points quickly.

Tip

HDBSCAN eliminates the need to set the eps parameter (unlike DBSCAN). This allows HDBSCAN to find clusters of varying densities, which DBSCAN handles poorly with a single eps value.

# import package
import hdbscan


# fit the HDBSCAN model and cache the data needed for prediction
model = hdbscan.HDBSCAN(min_cluster_size=4, prediction_data=True).fit(X)

# get clusters
model.labels_

array([ 1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  0, -1,  1, -1,  1,
        1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1, -1,  1,
        0,  1, -1,  1, -1,  1, -1,  1,  1,  1,  1,  1,  1, -1,  1,  0, -1,
        1,  1,  1,  1,  1,  1,  1, -1,  1, -1,  1,  1, -1,  1, -1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1, -1,  1,  1, -1,  1,
        1,  1,  1,  1,  1, -1,  1,  1,  0,  1, -1,  1, -1,  1,  1, -1,  0,
        1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1, -1,  1,  1,  1,  1,  1,
        1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  0,  1,  1,
        1,  1,  1,  1,  1, -1, -1, -1, -1,  1, -1,  1, -1,  1,  1,  1,  1,
       -1, -1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,
        1,  1, -1,  1, -1,  1,  1,  1,  1,  1,  1, -1,  1])

The cluster labels indicate that there are two clusters (0 and 1). The remaining data points are labeled -1, meaning they are noise points that could not be assigned to any cluster.
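Reading a long label array by eye is error-prone, so the counts can be summarized with NumPy. This sketch uses a short hypothetical label array in place of model.labels_ so that it runs on its own.

```python
import numpy as np

# Hypothetical label array; in the article this would be model.labels_
labels = np.array([1, -1, 1, 0, 1, -1, 0, 1])

# Count how many points carry each label
values, counts = np.unique(labels, return_counts=True)
label_counts = {int(v): int(c) for v, c in zip(values, counts)}

# -1 marks noise, so it is excluded from the cluster count
n_clusters = len([v for v in label_counts if v != -1])
n_noise = label_counts.get(-1, 0)

print(label_counts)         # {-1: 2, 0: 2, 1: 4}
print(n_clusters, n_noise)  # 2 2
```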

Predict new points

You can use the approximate_predict() function from the hdbscan package to predict the cluster of new points.

approximate_predict() takes the fitted model and an array of new points, and returns the predicted cluster labels along with the membership strength of each point.

# predict new points
test_labels, strengths = hdbscan.approximate_predict(model, Y)

# see cluster labels for new points
test_labels

array([ 1,  1,  1, -1, -1], dtype=int32)

The prediction results indicate that three of the new points belong to cluster 1 and the remaining two are predicted as noise points.
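The strengths array returned by approximate_predict() alongside the labels can be used to discard weak assignments. The sketch below assumes a hypothetical helper keep_confident() and a threshold of 0.5; the strength values are made up for illustration and are not output from the model above.

```python
import numpy as np

def keep_confident(labels, strengths, threshold=0.5):
    """Relabel predictions whose strength is below threshold as noise (-1)."""
    labels = np.asarray(labels)
    strengths = np.asarray(strengths)
    return np.where(strengths >= threshold, labels, -1)

# Labels as printed in the article; strengths are hypothetical values
test_labels = np.array([1, 1, 1, -1, -1])
strengths = np.array([0.9, 0.4, 0.7, 0.0, 0.0])

filtered = keep_confident(test_labels, strengths)
print(filtered)  # [ 1 -1  1 -1 -1]
```

The threshold is a judgment call: a higher value keeps only points that sit firmly inside a cluster, at the cost of labeling more points as noise.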