
An essential algorithm in a Machine Learning practitioner’s toolkit has to be K Nearest Neighbours (or KNN, for short). While it may be one of the simplest algorithms, it is also a very powerful one and is used in many real-world applications. In this tutorial, I will be doing the following:
- Explain the KNN algorithm and how it works
- Implement it from scratch using Python
So, without further ado, let’s get this Machine Learning party started!
K Nearest Neighbours explained
This is a common machine learning algorithm that can be used for classification as well as regression. The steps the algorithm takes are the following:

- First, it takes a set of features and labels as input, making it a supervised learning algorithm.
- As well as the features and labels, it also takes in an n_neighbours parameter.
- To predict a new sample’s class, it does the following:
- It measures the distance between the new sample and every training sample, keeping the N closest (as specified by the n_neighbours parameter)
- Based on those closest neighbours, it classifies the sample by majority vote. Specifically, if there are 3 neighbours, with 2 belonging to class A and 1 belonging to class B, the new sample will be classified as class A (see the short sketch below).
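To make the majority vote concrete, here is a minimal sketch using collections.Counter (the same helper the from-scratch implementation later in this post relies on); the neighbour labels here are made up purely for illustration:

from collections import Counter
# Hypothetical labels of the 3 nearest neighbours: two of class "A", one of class "B"
neighbour_labels = ["A", "A", "B"]
# most_common(1) returns a list containing the single (label, count) pair with the highest count
predicted_class = Counter(neighbour_labels).most_common(1)[0][0]
print(predicted_class)  # prints "A"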
In order to fill some gaps in your knowledge, let’s run through a concrete example.

Let’s assume the following:
- You are trying to predict whether a patient has heart cancer based on blood type, BMI, etc.
- You have separated your dataset into training and validation sets
- You have scaled your features (I will discuss this later)
- You have fit a KNN algorithm and set the n_neighbours parameter to 3
If you were to feed in an instance from the validation set, the following would happen:
- The algorithm will measure the distance between the new instance and the training set instances using a distance metric (I will also discuss this later!)
- Based on the n_neighbours parameter (which, in this example, was set to 3), it will use the three training examples closest to the new sample to classify the new point
- It will then take a vote, and whichever class has the most votes is the class that the new sample will be assigned to. Concretely:
- If two of the closest instances are labelled 1 (has heart cancer) and one is labelled 0 (does not have heart cancer), the new sample will be predicted to be in class 1
- This process is repeated for all instances in the validation set
Cleaning up some KNN Jargon

Distance Metric
This is the metric used to calculate the distance between two points. It is usually the Euclidean Distance, which, for two points p and q with n features each, is given by the following formula:

d(p, q) = √( (p₁ − q₁)² + (p₂ − q₂)² + … + (pₙ − qₙ)² )
Feature Scaling
This is an important preprocessing step for many Machine Learning algorithms, especially those that rely on distance calculations (like our friend KNN). It essentially scales our features so that they lie in a similar range. Think of it like a house and a scaled model of that house: the shape of both is the same (they are both houses), but the size is different (5m != 500m). We do this for the following reasons:
- It speeds up algorithms
- Some algorithms are sensitive to scale. In other words, if the features have different scales, there is a chance that more weight is given to features with higher magnitudes. This will impact the performance of the machine learning algorithm and obviously, we do not want our algorithm to be biased towards one feature.
To demonstrate this, let’s suppose we have three features, named A, B and C, and that we compute:
- the distance between A and B before scaling
- the distance between B and C before scaling
- the distance between A and B after scaling
- the distance between B and C after scaling
After scaling, the distances become much more comparable and unbiased than they were before scaling, and no single feature dominates the calculation. If you want a great tutorial on feature scaling, please check out the Analytics Vidhya blog post on the topic.
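To make this concrete in code, here is a minimal sketch (with made-up numbers, not from the original example) showing how standardising two features of very different magnitudes changes the distance between samples:

import numpy as np

# Hypothetical data: two features on very different scales
# (e.g. age in years and income in dollars)
X = np.array([[25.0, 40000.0],
              [30.0, 42000.0],
              [27.0, 90000.0]])

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

print(euclidean(X[0], X[1]))  # dominated almost entirely by the income column

# Standardise each feature: subtract its mean, divide by its standard deviation
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(euclidean(X_scaled[0], X_scaled[1]))  # both features now contribute comparably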
Picking the n_neighbors parameter
While there is no predetermined way of selecting the optimal number of neighbours, some methods do exist. One popular method is the elbow method, where you run a loop, fit the model for a range of neighbour values and record the error for each. If you were to plot the errors against the number of neighbours, your plot would look something like this:

Usually, the optimal point is around the elbow, where the error stops decreasing rapidly (so in our case, possibly 5 or 6). The problems with this method are the following:
- It is possible to overfit the test set, so make sure to cross-validate (the short sketch after this list does exactly that).
- Sometimes, the optimal value of K might not be very clear, so it may be hard to judge what the optimal number really is.
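To make the elbow method concrete, here is a minimal sketch using scikit-learn’s KNeighborsClassifier and cross_val_score (this post builds its own KNN from scratch below; the sketch is just for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Record the cross-validated error for a range of K values
errors = []
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    # 1 - mean accuracy across the 5 folds gives the error rate for this K
    errors.append(1 - cross_val_score(knn, X, y, cv=5).mean())

# The "elbow" is roughly where the error stops dropping rapidly
for k, err in zip(range(1, 21), errors):
    print(k, round(err, 3))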
A good tip when implementing KNN is to set K to an odd number. That way, there is no need to deal with tied voting scenarios.
Getting Technical with the code
It’s great to know how to use ML libraries like scikit-learn to code algorithms, but what can really level up your ML skills is learning how to build the algorithms from scratch. So, we will be doing just that: creating a KNN classifier from scratch!

NOTE: the link to the code can be found here; however, I recommend going through the blog post before checking out the code to get a complete understanding of what is going on.
First things first, let’s import our libraries:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from collections import Counter
iris = load_iris()
X, y = iris.data, iris.target
Yes, the only reason we are importing scikit-learn is to use the iris dataset and to split the data. Besides that, we are using plain NumPy and collections!
Next, let’s create our train and test set:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1810)
Nothing out of the ordinary here, so let’s move swiftly along.
Now, I mentioned that feature scaling is an important preprocessing step for KNN. However, our data already lies in a similar range, so we can skip this step. With real-world data, however, it is very rare that we get this lucky.
Using an OOP Approach for coding the Algorithm
Clean and reusable code is key for any Data Scientist or Machine Learning Engineer. Therefore, to follow Software Engineering principles, I will be creating a KNN class to make the code reusable and pristine.
First, we define our class name and pass in some parameters, namely:
- X (the features)
- y (the label vector)
- n_neighbors (the number of neighbours we desire)
class KNN:
    def __init__(self, X, y, n_neighbors=3):
        self.X = X
        self.y = y
        self.n_neighbors = n_neighbors
Next, we convert our Euclidean distance formula from above into code and make it a method of the class:
    def euclidean(self, x1, x2):
        return np.sqrt(np.sum((x1 - x2) ** 2))
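As a quick sanity check (a hypothetical call, not part of the original walkthrough), the distance between the points (0, 0) and (3, 4) should come out as 5:

# Hypothetical sanity check with two made-up points
knn = KNN(X_train, y_train)
print(knn.euclidean(np.array([0, 0]), np.array([3, 4])))  # 5.0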
Now the heavy lifting. I will initially show you the code and then explain what is going on:
    def fit_knn(self, X_test):
        # distance from the test instance to every training instance
        distances = [self.euclidean(X_test, x) for x in self.X]
        # indices of the n_neighbors closest training instances
        k_nearest = np.argsort(distances)[:self.n_neighbors]
        # labels of those closest training instances
        k_nearest_labels = [self.y[i] for i in k_nearest]
        # majority vote: the most common label among the neighbours
        most_common = Counter(k_nearest_labels).most_common(1)[0][0]
        return most_common
We first create a method named fit_knn that does, well, the fitting: it classifies a single test instance against the training data. More specifically, the following is being done:
- The distance between the test instance and every data point in the training set is measured
- We get the K nearest points (K being our parameter for the number of neighbours, which, in our case, is 3)
- The code then looks up the labels of the nearest neighbours we found to be closest to our new test instance
- The most common class among them is counted and returned by the method as the prediction
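To go from a single prediction to predictions for the whole test set, the class also gets a small predict method that simply calls fit_knn on every test instance (you will see it again in the full code below):

    def predict(self, X_test):
        # classify every test instance individually
        preds = [self.fit_knn(x) for x in X_test]
        return preds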
Finally, to top it all off, we make predictions:
knn = KNN(X_train, y_train)
preds = knn.predict(X_test)
Now, let’s evaluate our model and see how well it did classifying the new samples:
accuracy = (preds == y_test).mean()
OUT: 1.0
So, the full code is the following:
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1810)

class KNN:
    def __init__(self, X, y, n_neighbors=3):
        self.X = X
        self.y = y
        self.n_neighbors = n_neighbors

    def euclidean(self, x1, x2):
        return np.sqrt(np.sum((x1 - x2) ** 2))

    def fit_knn(self, X_test):
        distances = [self.euclidean(X_test, x) for x in self.X]
        k_nearest = np.argsort(distances)[:self.n_neighbors]
        k_nearest_labels = [self.y[i] for i in k_nearest]
        most_common = Counter(k_nearest_labels).most_common(1)[0][0]
        return most_common

    def predict(self, X_test):
        preds = [self.fit_knn(x) for x in X_test]
        return preds

knn = KNN(X_train, y_train)
preds = knn.predict(X_test)
accuracy = (preds == y_test).mean()
Additional tasks to do:
- Try playing around with the n_neighbors parameter. What changes?
- Try scaling your features and see if the results differ. Are they better? Or worse?
- Try fitting the model to a different dataset and see how changing K affects the results
- See if you can implement the elbow method yourself to determine the optimal number of neighbours
Thank you very much for reading this. I hope you have learned something new, or at least refreshed your memory. Now, that was long! So, my last wish to you is to take a break, rest your mind and enjoy life!
