# How to Implement k-Nearest Neighbors (kNN) in Python

## 1 kNN background

**k-Nearest Neighbors** (kNN) is a supervised machine learning algorithm widely used for classification and regression
analysis.

The kNN algorithm uses a majority vote among the nearest neighbors for classification and the average of the nearest neighbors’ values for regression.
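To make the regression rule concrete, here is a minimal NumPy sketch. It is not how scikit-learn implements kNN; the helper name `knn_regress` and the toy values are made up for illustration.

```python
import numpy as np

def knn_regress(X_train, y_train, query, k):
    """Predict a value as the mean target of the k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# hypothetical one-feature toy data
X_toy = np.array([[1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([1.0, 2.0, 3.0, 10.0])

# the 3 nearest points to 2.2 have targets 2.0, 3.0 and 1.0
pred_value = knn_regress(X_toy, y_toy, np.array([2.2]), k=3)
print(pred_value)
```

The distant point at 10.0 is ignored because it is not among the three nearest neighbors, so it has no influence on the average.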

The parameter `k`, the number of nearest neighbors consulted for each prediction, is the key hyperparameter in kNN. Choosing an appropriate value matters: a small `k` makes predictions sensitive to noise, whereas a large `k` may smooth out real patterns in the data.

Using a distance measure such as Euclidean distance, kNN finds the k data points nearest to a query point and applies the majority rule to classify that query point.

The kNN algorithm is illustrated in the following figure. With k=7, kNN identifies the 7 data points closest to the query point. Of these 7 points, 5 belong to class 1 and 2 belong to class 2. By the majority rule, the query point is assigned to class 1.
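The same 5-versus-2 vote can be reproduced with a small from-scratch sketch. The function `knn_classify` and the coordinates below are hypothetical, chosen only so that the 7 nearest points split 5 to 2; this is an illustration of the idea, not the scikit-learn implementation.

```python
from collections import Counter

import numpy as np

def knn_classify(X_train, y_train, query, k):
    """Return the majority class among the k nearest points, plus the vote counts."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[nearest].tolist())
    # most_common(1) returns [(label, count)] for the top-voted class
    return votes.most_common(1)[0][0], votes

# hypothetical toy points around a query at the origin
X_toy = np.array([
    [0.1, 0.0], [0.0, 0.2], [-0.2, 0.1], [0.3, 0.3], [-0.1, -0.3],  # class 1, near
    [0.4, 0.0], [0.0, -0.5],                                        # class 2, near
    [5.0, 5.0], [6.0, 5.0], [5.0, 6.0],                             # class 2, far
])
y_toy = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])

label, votes = knn_classify(X_toy, y_toy, np.array([0.0, 0.0]), k=7)
print(label, dict(votes))
```

Note that class 2 has as many points as class 1 in the toy data overall, yet the query is still assigned to class 1, because only the 7 nearest points vote.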

## 2 Implement kNN in Python

### 2.1 Requirements

You need to install the `scikit-learn` (`sklearn`) and `matplotlib` Python packages to implement and visualize kNN.

### 2.2 Generate a sample dataset

We will generate a random dataset with two features and three classes using the `make_blobs` function from the `sklearn` package.

```
# load packages
from sklearn.datasets import make_blobs
# generate random dataset
data, classes = make_blobs(n_samples=300, n_features=2,
                           centers=3, cluster_std=0.5,
                           random_state=0)
# check data dimensions
print(data.shape)     # (300, 2)
print(classes.shape)  # (300,)
```
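Before modeling, it can help to confirm how many samples each class received. The sketch below regenerates the same dataset and counts the labels with `np.unique`; when `n_samples` is an integer, `make_blobs` divides the samples evenly among the centers, so the three counts should be equal.

```python
import numpy as np
from sklearn.datasets import make_blobs

# regenerate the same dataset as above
data, classes = make_blobs(n_samples=300, n_features=2,
                           centers=3, cluster_std=0.5,
                           random_state=0)

# count how many samples carry each class label
labels, counts = np.unique(classes, return_counts=True)
print(labels)
print(counts)
```

Balanced classes matter for kNN because a heavily over-represented class tends to dominate the neighbor vote.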

If you visualize this dataset as a scatterplot, you should see three different clusters (classes).

```
# load packages
import matplotlib.pyplot as plt
plt.scatter(data[:,0], data[:,1], c=classes)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
```

### 2.3 Training and testing split

Splitting the dataset into training and testing sets is essential for evaluating a kNN model. The training dataset is used to fit the kNN model, and the fitted model is then evaluated on the testing dataset.

In addition, the training dataset is used for tuning the hyperparameter `k`, that is, selecting the value of `k` that generalizes best to unseen data such as the test set.

We will use the `train_test_split` function from `sklearn` to split the dataset into 80% training and 20% testing. You can
change this proportion as per your requirements.

```
# load packages
from sklearn.model_selection import train_test_split
# split dataset
X_train, X_test, y_train, y_test = train_test_split(data, classes,
                                                    test_size=0.2,
                                                    random_state=12)
```
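As an optional variation (not part of the split used in this tutorial), `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both subsets. The sketch below shows the resulting shapes and per-class counts.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

data, classes = make_blobs(n_samples=300, n_features=2,
                           centers=3, cluster_std=0.5,
                           random_state=0)

# stratify=classes is an assumption added here; the tutorial's split omits it
X_train, X_test, y_train, y_test = train_test_split(
    data, classes, test_size=0.2, random_state=12, stratify=classes)

print(X_train.shape, X_test.shape)
# per-class sample counts in each subset
print(np.bincount(y_train), np.bincount(y_test))
```

Stratifying matters most for small or imbalanced datasets, where a purely random split can leave a class under-represented in the test set.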

### 2.4 Fit the kNN classifier model

You can fit the kNN model in Python using the `KNeighborsClassifier` class from the `sklearn` package. It takes the number of neighbors (`n_neighbors`, i.e. `k`) as a hyperparameter for fitting the kNN classification model.

The hyperparameter `k` is critical for building an accurate kNN model. A common rule of thumb is to choose `k` close to the square root of the number of data points in your training dataset. For example, if the training dataset has 240 data points, the square root is about 15.5, so we set `k` = 15.
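The arithmetic behind this rule of thumb can be checked in a few lines. The variable names are illustrative, and rounding to the nearest integer is one reasonable convention rather than a fixed part of the rule.

```python
import math

n_train = 240                  # 80% of the 300 generated samples
root = math.sqrt(n_train)      # roughly 15.49
k = round(root)                # nearest integer to the square root
print(root, k)
```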

```
# load packages
from sklearn.neighbors import KNeighborsClassifier
# initialize kNN classifier
knn_model = KNeighborsClassifier(n_neighbors=15)
# fit the model on training dataset
knn_model.fit(X_train, y_train)
```

The best value of *k* depends on the dataset and the problem. Testing multiple values and assessing their impact on the model’s performance is recommended practice.
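One common way to test multiple values is cross-validation on the training data only, which keeps the test set untouched. The sketch below uses `cross_val_score` with 5 folds; the candidate list `candidate_ks` is an arbitrary illustration, not a recommended grid.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

data, classes = make_blobs(n_samples=300, n_features=2,
                           centers=3, cluster_std=0.5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    data, classes, test_size=0.2, random_state=12)

candidate_ks = [1, 3, 5, 9, 15, 25]  # hypothetical values to compare
mean_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validated accuracy on the training data
    scores = cross_val_score(model, X_train, y_train, cv=5)
    mean_scores[k] = scores.mean()

# pick the k with the highest mean cross-validated accuracy
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print(best_k)
```

On an easy dataset like this one, several values of `k` may score equally well; in that case `max` simply returns the first of the tied candidates.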

### 2.5 Perform predictions

Once you have fitted the kNN model using the training dataset, the fitted kNN model can be used for performing prediction on unseen testing data.

You can use the `predict` method to perform prediction using a fitted model.

```
# perform prediction on testing dataset
pred = knn_model.predict(X_test)
```
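Besides hard class labels, a fitted `KNeighborsClassifier` also provides `predict_proba`. With the default uniform weights, each value is the fraction of the `k` neighbors that belong to that class, which shows how decisive the vote was. A brief sketch, refitting the same model so the block runs on its own:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data, classes = make_blobs(n_samples=300, n_features=2,
                           centers=3, cluster_std=0.5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    data, classes, test_size=0.2, random_state=12)

knn_model = KNeighborsClassifier(n_neighbors=15)
knn_model.fit(X_train, y_train)

# neighbor vote fractions for the first five test points
proba = knn_model.predict_proba(X_test[:5])
print(knn_model.classes_)  # column order of the probabilities
print(proba)
```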

### 2.6 Evaluate the accuracy of the kNN model

The accuracy of the fitted kNN model can be calculated using the `accuracy_score` function from `sklearn`. It represents the
proportion of correctly classified samples among all samples in the test dataset.

The predicted classes are compared to the true classes to calculate accuracy.

```
# load package
from sklearn.metrics import accuracy_score
# compare true classes with predicted classes
accuracy = accuracy_score(y_test, pred)
print(accuracy)  # 1.0
```

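The accuracy of the fitted kNN model is 1.0 (or 100%), meaning that it classified every sample in the test dataset correctly. This is plausible here because the clusters were generated with a small spread (`cluster_std=0.5`); it does not imply perfect accuracy on noisier data.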
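Accuracy is a single number, so a per-class breakdown can be a useful complement. The sketch below builds a confusion matrix with `confusion_matrix` from `sklearn.metrics`; rows are true classes and columns are predicted classes, so correct predictions sit on the diagonal. The pipeline is repeated so the block runs on its own.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data, classes = make_blobs(n_samples=300, n_features=2,
                           centers=3, cluster_std=0.5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    data, classes, test_size=0.2, random_state=12)

knn_model = KNeighborsClassifier(n_neighbors=15)
knn_model.fit(X_train, y_train)
pred = knn_model.predict(X_test)

# labels fixes the row/column order to classes 0, 1, 2
cm = confusion_matrix(y_test, pred, labels=[0, 1, 2])
print(cm)
```

Any off-diagonal entry would show which pair of classes the model confuses, something the overall accuracy cannot reveal.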

### 2.7 Visualize kNN predicted classification

The visualization of the predicted classification is useful in understanding how the kNN classified new data points based on the fitted model.

```
# load package
import matplotlib.pyplot as plt
plt.scatter(X_test[:,0], X_test[:,1], c=pred)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
```