
Let’s Make a KNN Classifier from Scratch

You'll need only NumPy and a bit of time. Source code included.

Coding out entire machine learning algorithms from scratch can be a tedious process. That’s why we have libraries. The other reason is that our own implementation most likely can’t compete with out-of-the-box solutions like the ones from Scikit-Learn, at least not without putting in a lot of time and effort.

Photo by Alvin Engler on Unsplash

That, however, doesn’t mean that you shouldn’t code out the algorithms from scratch. Here are some benefits of doing so:

  1. You have complete control over the code
  2. You’ll bring your understanding of the algorithm to a whole new level

Today I want to show you how easy it is to code a simple K-Nearest Neighbors algorithm with nothing but the NumPy library. Sure, it won’t be state-of-the-art and there will be plenty of things left to optimize, but it’s a good start. By the end of this article, I want you to feel like you could have invented it.

In case you’re not eager to write the code yourself, here’s the GitHub repo.

Without further ado, let’s get started. The only library you’ll need is NumPy (for now), so make sure to import it.


What is KNN?

KNN stands for K-Nearest Neighbors. It’s a classification algorithm that predicts the class of a target variable based on a defined number of nearest neighbors. It calculates the distance from the instance you want to classify to every instance in the training dataset, and then assigns the class that appears most often among the k nearest instances.

If you’re reading this for the first time it might sound confusing, but it’s actually a very simple algorithm.

As there are tons of theoretical explanations out there, I decided not to dive any deeper into it, and instead focus my time on the implementation side. If you, however, lack the theoretical background, feel free to refer to this article (or any others you find online):

A Quick Introduction to K-Nearest Neighbors Algorithm


Distance between two vectors

You can think of the rows, the Xs in your dataset, as vectors. With that idea in mind, you probably know there are ways of measuring the distance between vectors. Two that immediately come to mind are:

  1. Euclidean distance
  2. Cosine distance

In this article, I’ll focus my attention on the first one, and we’ll code it out from scratch. In a nutshell, the Euclidean distance is just the straight-line distance between two data points in Euclidean space [1]. It can be calculated as follows:
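For two points p and q with n coordinates each:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$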

It really is a simple formula and you don’t have to know the number of dimensions beforehand. Let’s code it out in Python:
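Here’s a minimal version, assuming both inputs are array-likes of the same length (the function name euclidean_distance is my choice; the code in the repo might be structured a bit differently):

```python
import numpy as np

def euclidean_distance(p, q):
    # Straight-line distance between two equally sized vectors
    p, q = np.asarray(p), np.asarray(q)
    return np.sqrt(np.sum((p - q) ** 2))
```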

And that’s all there is to it!

Let’s now make two quick checks:

  • On two identical vectors (the distance should be 0)
  • On two different vectors (the distance should not be 0)
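Something along these lines does the trick (the vectors themselves are arbitrary):

```python
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(euclidean_distance(a, a))  # 0.0 - identical vectors
print(euclidean_distance(a, b))  # 5.196... - sqrt(27), clearly not 0
```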

Algorithm Implementation

Okay, the previous part clearly works, so it’s now time to code out the heart of the algorithm. I’ll declare a function predict() that takes in three parameters:

  • k: the value for k
  • train_set: the entire training matrix, including the target variable
  • test_instance: a single vector without the target variable value

Here are the steps required:

  1. Calculate the Euclidean distance between the test_instance and each row of the train_set
  2. Sort the distances from lowest to highest
  3. Keep only the k smallest distances
  4. Get the target variable values of the k train_set rows with the smallest distances
  5. Whichever target variable class has the majority wins

As I’ve said, the algorithm is very simple and shouldn’t be a problem to code out. Just for the sake of it, I’ve commented every line of the code so there’s no way for you to get lost:
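Here’s a sketch of what that can look like. I’m assuming the target variable sits in the last column of train_set; adjust the indexing if yours lives elsewhere:

```python
def predict(k, train_set, test_instance):
    # 1. Distance from the test instance to every training row
    #    (the target variable is assumed to be in the last column)
    distances = []
    for row in train_set:
        dist = euclidean_distance(row[:-1], test_instance)
        distances.append((dist, row[-1]))

    # 2. Sort by distance, lowest first
    distances.sort(key=lambda pair: pair[0])

    # 3. & 4. Keep the k closest rows and grab their target values
    k_nearest_labels = [label for _, label in distances[:k]]

    # 5. The majority class among the k neighbors wins
    values, counts = np.unique(k_nearest_labels, return_counts=True)
    return values[np.argmax(counts)]
```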

That was quite a bit of code to write, I know, but it will be worth it.

Now, there’s no point in implementing the algorithm if we can’t evaluate it somehow. That’s why I’ve decided to implement a basic accuracy reporting function, which will report back how well the algorithm performed on a test set:
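A minimal version might look like this, again assuming the target variable is in the last column of each test row:

```python
def evaluate(k, train_set, test_set):
    # Fraction of test rows classified correctly
    correct = 0
    for row in test_set:
        if predict(k, train_set, row[:-1]) == row[-1]:
            correct += 1
    return correct / len(test_set)
```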


Let’s make a Class

If you ask me, everything looks way messier than it should. That’s mainly because we have functions all over the place. So let’s fix that by throwing all of those functions inside a class called, you guessed it, KNearestNeighbors. There are some slight modifications to make along the way, but as those are trivial I decided not to comment on them.

Anyway, here’s the full class:
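Here’s one way it can look. The method names fit(), predict(), and evaluate() are my choice, and the target variable is still assumed to be in the last column of the training data:

```python
import numpy as np


class KNearestNeighbors:
    def __init__(self, k=3):
        self.k = k
        self.train_set = None

    @staticmethod
    def _euclidean_distance(p, q):
        # Straight-line distance between two equally sized vectors
        return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

    def fit(self, train_set):
        # KNN is a lazy learner: "training" just stores the data
        self.train_set = np.asarray(train_set)

    def predict(self, test_instance):
        # Distance to every training row (target in the last column)
        distances = [
            (self._euclidean_distance(row[:-1], test_instance), row[-1])
            for row in self.train_set
        ]
        distances.sort(key=lambda pair: pair[0])

        # Majority vote among the k nearest neighbors
        k_nearest_labels = [label for _, label in distances[:self.k]]
        values, counts = np.unique(k_nearest_labels, return_counts=True)
        return values[np.argmax(counts)]

    def evaluate(self, test_set):
        # Fraction of test rows classified correctly
        test_set = np.asarray(test_set)
        correct = sum(self.predict(row[:-1]) == row[-1] for row in test_set)
        return correct / len(test_set)
```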


Let’s Evaluate

The time has come to evaluate our algorithm. For simplicity, I’ve decided to use the famous Iris dataset which can be loaded from Scikit-Learn.

Here’s the code to get you going:
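I’m using Pandas purely for convenience when inspecting the data:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Features and target packed into a single DataFrame
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

print(df.head())
```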

Run it and take a quick look at the first few rows to make sure everything loaded correctly.


Now we’ll need to split the dataset into training and testing parts somehow. That’s easy to do; here’s the code:
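The 80/20 ratio below is my choice; any sensible split will do:

```python
import numpy as np

# Convert to a NumPy array and shuffle the rows in place
data = df.to_numpy()
np.random.shuffle(data)

# Keep roughly 80% for training, the rest for testing
split_point = int(len(data) * 0.8)
train_set, test_set = data[:split_point], data[split_point:]
```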

Let’s now make an instance of the KNearestNeighbors class and set k to 3. Then we can train the model and evaluate its performance easily:
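Using the method names from the class sketch above, it boils down to a few lines:

```python
model = KNearestNeighbors(k=3)
model.fit(train_set)

accuracy = model.evaluate(test_set)
print(f'Accuracy with k = 3: {accuracy:.3f}')
```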

K = 3 seems to be a reasonable choice. Just like that, we’ve got an accuracy of 94.6%, which is amazing.

But how do we know if that’s the optimal value for K?

Well, we don’t; we’ll need to check manually. All it boils down to is wrapping the code from above in a loop:
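I’m stopping at 13 here, since that’s the largest value discussed below:

```python
for k in range(1, 15, 2):  # odd values only: 1, 3, 5, ..., 13
    model = KNearestNeighbors(k=k)
    model.fit(train_set)
    print(f'k = {k:2d} -> accuracy = {model.evaluate(test_set):.3f}')
```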

If you’re wondering why I’ve decided to use only odd numbers, it’s because odd values of K rule out ties in a two-class problem. With three classes, as in Iris, ties are still possible, but much less likely.

The values 3 and 5 for K seem to yield the same accuracy, and we can set K = 13 to get an even better model. The accuracy scores might be different for you, that’s because the train and test sets are sampled randomly – but there shouldn’t be any drastic differences.


What’s Next?

This was a rather simple implementation of the K-Nearest Neighbors algorithm. It works well, but there are still some things you could do, like:

  • Trying out other distance algorithms (like cosine distance)
  • Standardizing the data

The latter is actually pretty much essential for KNN to work properly, because your inputs might not be in the same units. Think about it like this: 2.5 meters and 250 centimeters are the same to you, but the algorithm would treat the latter as more important simply because of its larger scale.

The Iris dataset doesn’t suffer from that problem, so I decided to skip that part. On a real-world dataset, you probably shouldn’t.
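If you do need it on another dataset, a minimal z-score standardization sketch could look like this (compute the statistics on the training features only, then apply them to both sets; the target is assumed to be in the last column):

```python
# Column-wise mean and standard deviation of the training features
mean = train_set[:, :-1].mean(axis=0)
std = train_set[:, :-1].std(axis=0)

# Apply the same transformation to both sets
train_set[:, :-1] = (train_set[:, :-1] - mean) / std
test_set[:, :-1] = (test_set[:, :-1] - mean) / std
```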

Thanks for reading, I hope everything is working on your machine. If not, just download the code from my GitHub repo.




References

[1] https://en.wikipedia.org/wiki/Euclidean_distance

