Visualize DBSCAN Clusters in R

2023-12-11 587 words 3 minutes

Contents

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised clustering algorithm that groups data points that are close to each other in a high-density region and separates data points from low-density regions (noisy points).

Visualizing the results of the DBSCAN cluster is essential to understanding the structure and patterns of your data, and the effectiveness of DBSCAN.

This article focuses on various methods for visualizing the DBSCAN results in R using dbscan and ggpairs functions.

1 Get sample dataset

We will use the iris dataset for DBSCAN analysis

# load the iris dataset
data(iris)

# view first few rows
head(iris, 2)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa

# keep first four features as we do not need target (species) feature
iris_data = iris[, -5]
head(iris_data, 2)

  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2

2 Perform DBSCAN in R

In R, you can perform the DBSCAN using dbscan() function from dbscan package.

The dbscan() function requires eps and minPts parameters to calculate DBSCAN clusters. You can read this article if you’re unsure how to calculate these parameters.

# import package
# install.packages("dbscan")
library(dbscan)

# perform DBSCAN
dbscan(iris_data, eps = 0.4, minPts = 4)
# output
DBSCAN clustering for 150 objects.
Parameters: eps = 0.4, minPts = 4
Using euclidean distances and borderpoints = TRUE
The clustering contains 4 cluster(s) and 25 noise points.

 0  1  2  3  4 
25 47 38 36  4 

Available fields: cluster, eps, minPts, dist, borderPoints

# -1 value represents noisy points could not assigned to any cluster

DBSCAN output indicates four clusters (1, 2, 3 and 4). With 47 data points, cluster 1 is the largest, and with 4 points, cluster 4 is the smallest.

Cluster 0 represents the noise points that could not be assigned to any cluster.

3 Visualize DBSCAN clusters

The visualization of the DBSCAN clusters is necessary to understand the structure of your data.

If you have a two-dimensional dataset, you can simply create the scatter plot. When you have a high-dimensional dataset, you need to use either a pair plot or dimensionality reduction techniques before visualization.

3.1 Pair plot

As there are four features (sepal length, sepal width, petal length, and petal width), we will use the pairwise plot among those variables to see the clusters.

Here, we will use the ggpairs() function from the GGally package to visualize the pairwise scatter plot among the four features.

# load package
# install.packages("GGally")
library(GGally)

# pair plot for clusters
clus <- dbscan(iris_data, eps = 0.4, minPts = 4)
ggpairs(iris_data, aes(color=as.factor(clus$cluster)))

/images/dbscan_r/dbscan_pair_plot.png — DBSCAN pair plot

The pairwise scatter plot displays the individual clusters with different color for each pair of features.

3.2 Dimensionality reduction plot

As there are four features, it is hard to visualize them in one plot. In this case, we can use dimensionality reduction methods such as PCA to view the scatter plot in two dimensions (reduce 4-dimensional data into 2 dimensions).

We will use the fpc and factoextra packages to visualize the DBSCAN clusters in 2-dimensional space.

# install.packages("factoextra")
# install.packages("fpc")
library(factoextra)
library(fpc)

# perform DBSCAN with fpc package
clus <- fpc::dbscan(iris_data, eps = 0.4, MinPts = 4)

# visualize the clusters in 2-dimensional space
fviz_cluster(clus, iris_data, geom = "point", pointsize = 2)

Tip

For DBSCAN clustering, both the dbscan and fpc packages can be used, and they produces similar results

/images/dbscan_r/dim_red_dbscan.png — DBSCAN dimensionality reduction plot]

The above 2-dimensional plot of 4 features visualizes the well-separated DBSCAN clusters.