How to Implement k-Nearest Neighbors (kNN) in Python

2023-12-20 742 words 4 minutes

Contents

1 kNN background

k-Nearest Neighbors (kNN) is a supervised machine learning algorithm widely used for classification and regression analysis.

The kNN algorithm uses the rule of majority for classification and the average of nearest neighbors’ values for regression.

The variable parameter (k), also known as nearest neighbours is a crucial parameter in kNN. An appropriate value of k is crucial; a small value may be sensitive to noise, whereas a large value may smooth out patterns.

Based on distance measures such as Euclidean distance, the kNN determines k nearest data points and applies the majority rule to classify them.

The kNN algorithm is illustrated in the following figure. Using k=7, the kNN identifies the 7 closest data points from the query point. Among the 7 points identified, 5 are classified as class 1 and 2 as class 2. According to the majority rule, the 7 points fall into class 1.

/images/knn/knn_algorithm.png — kNN classification algorithm

2 Implement kNN in Python

2.1 Requirements

You need to install scikit-learn (sklearn), and matplotlib Python packages for implementing and visualizing the kNN.

2.2 Generate a sample dataset

We will generate a random dataset with two features and three classes using the make_blobs function from the sklearn package.

# load packages
from sklearn.datasets import make_blobs

# generate random dataset
data, classes = make_blobs(n_samples=300, n_features=2, 
                              centers=3, cluster_std=0.5, 
                              random_state=0)
# check data dimensions
data.shape
(300, 2)

classes.shape
(300,)

If you visualize this dataset as a scatterplot, you should see three different clusters (classes).

# load packages
import matplotlib.pyplot as plt

plt.scatter(data[:,0], data[:,1], c=classes)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

/images/knn/knn_scatterplot.png — scatterplot

2.3 Training and testing split

The training and testing splitting of the dataset is essential for the kNN algorithm. The training dataset will be used for fitting the kNN model and the subsequent model will be used for performance evaluation using the testing dataset.

In addition, the training dataset is also helpful for the hyperparameter tuning of the k parameter and selecting the best k value for the best generalization to test data.

We will use the train_test_split function from sklearn to split the dataset into 80% training and 20% testing. You can change this proportion as per your requirements.

# load packages
from sklearn.model_selection import train_test_split

# split dataset
X_train, X_test, y_train, y_test = train_test_split(data, classes, 
                                                    test_size=0.2, 
                                                    random_state=12)

2.4 Fit the kNN classifier model

You can fit the kNN model in Python using the KNeighborsClassifier function from the sklearn package. This function needs the number of neighbors hyperparameter (n_neighbors or k) for fitting the kNN classification model.

The hyperparameter k is a critical for building an efficient kNN model. A common rule of thumb is to choose k as the square root of the number of data points in your dataset. For example, if the training dataset has 240 data points, then we should set k = 15.

# load packages
from sklearn.neighbors import KNeighborsClassifier

# initialize kNN classifier
knn_model = KNeighborsClassifier(n_neighbors=15)
# fit the model on training dataset
knn_model.fit(X_train, y_train)

Tip

The optimal value of k depends on the dataset and the problem. Testing multiple values and assessing their impact on the model’s performance is recommended practice.

2.5 Perform predictions

Once you have fitted the kNN model using the training dataset, the fitted kNN model can be used for performing prediction on unseen testing data.

You can use the predict method to perform prediction using a fitted model.

# perform prediction on testing dataset
pred = knn_model.predict(X_test)

2.6 Evaluate the accuracy of the kNN model

The accuracy of the fitted kNN model can be calculated using the accuracy_score function from sklearn. It represents the proportion of correctly classified classes among the total classes in the test dataset.

The predicted classes are compared to the true classes to calculate accuracy.

# load package
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, pred)
print(accuracy)
1.0

The accuracy of the fitted kNN model is 1.0 (or 100%). It means that the fitted model is a perfect classifier

Tip

Accuracy alone is no appropriate measure to evaluate the performance of the fitted model, especially when the dataset is highly imbalanced. In such cases, the model should be evaluated with other measures (e.g. ROC, AUPRC) in addition to accuracy.

2.7 Visualize kNN predicted classification

The visualization of the predicted classification is useful in understanding how the kNN classified new data points based on the fitted model.

# load package
import matplotlib.pyplot as plt

plt.scatter(X_test[:,0], X_test[:,1], c=pred)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()