How to Predict New Points in DBSCAN in Python
You can fit a DBSCAN model using the DBSCAN class from the sklearn.cluster module of the scikit-learn Python package. However, scikit-learn's DBSCAN has no prediction function for assigning new points to existing clusters.
If you want to assign new points to existing clusters, you can use the HDBSCAN Python package, which implements an extension of DBSCAN and supports prediction on new data.
This article describes how to use HDBSCAN to predict cluster labels for new points.
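As a quick check of the claim above, the following minimal sketch inspects the scikit-learn DBSCAN class for a predict method. It assumes scikit-learn is installed; the variable names are illustrative only.

```python
from sklearn.cluster import DBSCAN

# DBSCAN can cluster the data it was fitted on (fit / fit_predict),
# but it does not expose a separate predict() for unseen points.
has_predict = hasattr(DBSCAN, "predict")
has_fit_predict = hasattr(DBSCAN, "fit_predict")

print("predict available:", has_predict)
print("fit_predict available:", has_fit_predict)
```

Note that fit_predict() refits the model on whatever data it receives, so it is not a substitute for predicting new points against an existing clustering.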
Create a training and testing dataset
Create sample training and testing datasets to fit the DBSCAN model and predict the new points.
# import package
import numpy as np
# create random training dataset with 2 features
X = np.random.randn(200, 2)
# create a testing dataset
Y = np.random.randn(5, 2)
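The datasets above are drawn without a fixed seed, so the cluster labels change from run to run. If you want repeatable results, one option is a seeded NumPy generator, as in the sketch below. The seed value 42 is an arbitrary choice, and with it the labels will not match the output shown later in this article.

```python
import numpy as np

# seeded random generator so that repeated runs give the same data
rng = np.random.default_rng(42)  # 42 is an arbitrary seed (assumption)

# training dataset: 200 points with 2 features
X = rng.standard_normal((200, 2))

# testing dataset: 5 new points with 2 features
Y = rng.standard_normal((5, 2))

print(X.shape, Y.shape)
```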
Fit the DBSCAN model
We will use the HDBSCAN package to fit the clustering model on the training dataset. HDBSCAN is an extension of the DBSCAN algorithm.
HDBSCAN has two main hyperparameters, min_samples and min_cluster_size, which have a significant effect on clustering.
Larger values of min_cluster_size tend to result in fewer clusters. By default, min_samples is set to the same value as min_cluster_size.
The other parameter, prediction_data, is important for prediction: setting it to True makes the model cache the extra data it needs, which speeds up the prediction of new data points.
HDBSCAN does not require the eps parameter (unlike DBSCAN). This allows HDBSCAN to find clusters with varying densities, which is not possible with DBSCAN.
# import package
import hdbscan
# fit the HDBSCAN model (prediction_data=True caches data for later prediction)
model = hdbscan.HDBSCAN(min_cluster_size=4, prediction_data=True).fit(X)
# get clusters
model.labels_
array([ 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, -1, 1, -1, 1,
1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, -1, 1,
0, 1, -1, 1, -1, 1, -1, 1, 1, 1, 1, 1, 1, -1, 1, 0, -1,
1, 1, 1, 1, 1, 1, 1, -1, 1, -1, 1, 1, -1, 1, -1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1,
1, 1, 1, 1, 1, -1, 1, 1, 0, 1, -1, 1, -1, 1, 1, -1, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, -1, 1, 1, 1, 1, 1,
1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
1, 1, 1, 1, 1, -1, -1, -1, -1, 1, -1, 1, -1, 1, 1, 1, 1,
-1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1,
1, 1, -1, 1, -1, 1, 1, 1, 1, 1, 1, -1, 1])
The cluster labels indicate that there are two clusters (0 and 1). The remaining data points are labeled -1, i.e. they are noise points that could not be assigned to any cluster.
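Reading a long label array by eye is error-prone, so it can help to summarize it. The sketch below counts the points per label with NumPy. The short labels array is a hypothetical stand-in for model.labels_, used only to keep the example self-contained.

```python
import numpy as np

# hypothetical stand-in for model.labels_ (not the real output above)
labels = np.array([1, -1, 1, 0, 0, 1, -1, 1])

# count how many points received each label
values, counts = np.unique(labels, return_counts=True)
summary = dict(zip(values.tolist(), counts.tolist()))

# -1 marks noise, so it is excluded from the cluster count
n_clusters = len(set(labels.tolist()) - {-1})
n_noise = int(np.sum(labels == -1))

print(summary)
print("clusters:", n_clusters, "noise points:", n_noise)
```

With the fitted model from this article, you would pass model.labels_ in place of the hypothetical array.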
Predict new points
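You can use the approximate_predict() function from the HDBSCAN package to predict the cluster for new points.
approximate_predict() takes the fitted model and the new points as an array. It returns the predicted cluster labels along with the membership strength of each point.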
# predict new points
test_labels, strengths = hdbscan.approximate_predict(model, Y)
# see cluster labels for new points
test_labels
array([ 1, 1, 1, -1, -1], dtype=int32)
The prediction results indicate that three of the new points belong to cluster 1 and the remaining two points are predicted as noise.
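The strengths array returned alongside the labels can be used to keep only the more confident assignments. The sketch below is one possible way to do this, not part of the HDBSCAN API: the strengths values and the min_strength threshold are hypothetical, chosen only for illustration, while test_labels mirrors the output above.

```python
import numpy as np

# labels as shown in the prediction output above
test_labels = np.array([1, 1, 1, -1, -1])

# hypothetical membership strengths (illustrative values, not real output)
strengths = np.array([0.9, 0.4, 0.7, 0.0, 0.0])

# hypothetical cutoff below which an assignment is treated as noise
min_strength = 0.5

# keep the label where the strength is high enough, otherwise mark as noise (-1)
confident_labels = np.where(strengths >= min_strength, test_labels, -1)

print(confident_labels)
```

A suitable threshold depends on the data, so it is worth inspecting the strengths from your own model before choosing one.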