ML from Scratch: K-Nearest Neighbors Classifier

A Complete Guide to the KNN Classification Algorithm, where We Will See How to Implement a KNN-Based Machine Learning Model from Scratch, while Understanding the Mathematics Behind it.

Aman Sharma
Towards Data Science
15 min read · Sep 13, 2020

--

When it comes to solving classification problems via machine learning, there’s a wide variety of algorithm choices available for almost any data type or niche problem that one might be dealing with. These algorithmic choices can be broadly categorized into two groups, which are as follows.

  1. Parametric Algorithms: The algorithms belonging to this category rely on algebraic mathematical equations that use a set of weights and biases (collectively known as parameters) in order to predict the final discrete outcome for a given set of data. The model size, i.e., the total number of trainable parameters in the model, can vary from just a few (as in the case of traditional machine learning algorithms) to millions or even billions (as generally seen in the case of artificial neural nets).
  2. Non-Parametric Algorithms: The classification algorithms in this category are rather unique in the sense that they don’t use any trainable parameters at all. This means that, unlike their parametric counterparts, the models in this category don’t rely on a set of weights and biases to generate predictions, nor do they make any assumptions about the data. Rather, in order to predict, say, the target class of a new data point, these models use some sort of comparison technique that helps them determine the final outcome.

Today, in this article, we are going to study one such non-parametric classification algorithm in detail: the K-Nearest Neighbors (KNN) algorithm.

This is going to be a project-based guide, where in the first part, we will be understanding the basics of the KNN algorithm. This will then be followed by a project where we will be implementing a KNN model from scratch using basic PyData libraries like NumPy and Pandas, while understanding the mathematical foundations of the algorithm.

So buckle up, and let’s get started!

KNN Classifier Basics

KNN Classification (Image by author)

To begin with, the KNN algorithm is one of the classic supervised machine learning algorithms that is capable of both binary and multi-class classification. Non-parametric by nature, KNN can also be used as a regression algorithm. However, for the scope of this article, we will only focus on the classification aspect of KNN.

KNN classification at a glance-

→ Supervised algorithm

→ Non-parametric

→ Used for both regression and classification

→ Support for both binary and multi-class classification

Before we move any further, let us first break down the definition and understand a few of the terms that we came across.

  • KNN is a “supervised” algorithm- In layman's terms, this means that the data used for training a KNN model is labeled, i.e., each training instance comes with a known target class.
  • KNN is used for both “binary” and “multi-class classification”- In machine learning terminology, a classification problem is one where, given a list of discrete values as possible prediction outcomes (known as target classes), the aim of the model is to determine which target class a given data point belongs to. For binary classification problems, the number of possible target classes is 2. On the other hand, a multi-class classification problem, as the name suggests, has more than 2 possible target classes. A KNN classifier can be used to solve either kind of classification problem.

With that done, we have a rough idea regarding what KNN is. But now, a very important question arises.

How does a KNN Classifier work?

As we read earlier, KNN is a non-parametric algorithm. Therefore, training a KNN classifier doesn’t require going through the more traditional approach of iterating over the training data for multiple epochs in order to optimize a set of parameters.

Rather, the training process in the case of KNN works quite differently. Training a KNN model simply involves fitting (i.e., saving) all the training data instances into the computer's memory at once, which technically requires only a single training cycle.

After this is done, during the inference stage, where the model has to predict the target class for a completely new data point, the model simply compares this new data point with the existing training data instances. Then, finally, on the basis of this comparison, the model assigns the new data point to its target class.

But now another question arises. What exactly is this comparison that we are talking about, and how does it occur? Well quite honestly, the answer to this question is hidden in the name of the algorithm itself — K-Nearest Neighbors.

To understand this better, let us dive deeper into how the inference process works.

  • As the first step, our KNN model calculates the distance of this new data point from every single data point within the ‘fitted’ training data.
  • Then, in the next step, the algorithm selects ‘k’ number of training data points that are closest to this new data point in terms of the calculated distance.
  • Finally, the algorithm compares the target label of these ‘k’ points that are the nearest neighbors to our new data point. The target label with the highest frequency among these k-neighbors is assigned as the target class to the new data point.

And that’s how the KNN classification algorithm works.

Regarding the calculation of the distances between the data points, we will be using the Euclidean distance formula. We will understand this distance calculation in the next section, where we will code our own KNN-based machine learning model from scratch.

So now onto the fun, practical part! We will begin by having a quick glance at the problem statement that we are addressing via our project.

Understanding the Problem Statement

For this project, we will be working on the famous UCI Red Wine Dataset. The aim of this project is to create a machine learning solution that can predict the quality of a red wine sample.

This is a multi-class classification problem. The target variable, i.e., the ‘quality’ of the wine, takes a discrete integer value ranging from 0 to 10, where a quality score of 10 denotes a wine of the highest quality standards.

Now that we have understood the problem, let us begin with the project by importing all the necessary project dependencies, which include the necessary PyData modules and the dataset.

Importing Project Dependencies

In the first step, let us import all the necessary Python modules.
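
The original import cell isn't embedded here, but a minimal set of imports for this project would look roughly like the following (NumPy and Pandas for data handling, Matplotlib for the plots shown later):

```python
# Core PyData modules used throughout the project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```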

Now, let’s import our dataset.
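
A minimal sketch, assuming the UCI Red Wine Quality CSV has been downloaded locally as winequality-red.csv (the UCI file uses a semicolon as its separator):

```python
# Load the red wine dataset (hypothetical local path)
df = pd.read_csv("winequality-red.csv", sep=";")

# Peek at the first few rows
print(df.head())
```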

A quick glance at the dataset (image by author)

Now that we have imported the dataset, let us try to understand what each of the columns in our data denotes.

Understanding the Data

The following is a brief description of all the individual columns within our dataset.

Description of the dataset (image by author)

As we discussed earlier, the ‘quality’ column is the target variable for this project. The rest of the columns in the dataset represent the feature variables that will be used for training the model.

Now that we know what the different columns in our dataset represent, let us move on to the next section where we will be doing some pre-processing and exploration of our data.

Data Wrangling and EDA

Data wrangling (or preprocessing) involves analyzing the data to see if it needs any sort of cleaning or scaling so that it can be prepared for training the model.

As the first step of data preprocessing, we will check if there are any null values within our data that need to be dealt with.
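
A quick way to do this with Pandas, roughly:

```python
# Count missing values in each column
print(df.isnull().sum())

# Column names, dtypes, and non-null counts
df.info()
```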

Column description and null value count (image by author)

As we can see, there are no null values within our dataset. This is a good thing since we won’t have to deal with any missing data. Now, let us have a look at the statistical analysis of the data.
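
Something along these lines:

```python
# Summary statistics (mean, std, min/max, quartiles) for every column
print(df.describe())
```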

One prominent observation from the above statistical analysis is that there’s a visible inconsistency in the range of values across the different columns of our dataset. To be clearer, the values in some columns are of the order of 1e-1, while in a few others they can go as high as the order of 1e+2. Because of this inconsistency, a feature weight bias might arise at the time of training the model. What this basically means is that some features might end up affecting the final prediction more than others. Therefore, in order to prevent this weight imbalance, we will have to scale our data.

For this scaling, we will be standardizing our data. Standardization typically means rescaling data in a way such that each feature column has a mean of 0 and a standard deviation of 1 (unit variance).

The following is the mathematical formula for standard scaling.
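
The formula image isn't reproduced here, but written out, standard scaling transforms every value x in a feature column as:

x_scaled = (x - μ) / σ

where μ is the mean of that column and σ is its standard deviation.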

Now that we know the mathematical formula, let us go ahead and implement this from scratch in Python.

Step-1: Separating the feature matrix and the target array.

Step-2: Declaring the standardization function.

Step-3: Performing standardization on the feature set.
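
The individual code gists aren't embedded here, but a minimal sketch of these three steps (assuming the DataFrame is named df and using ‘quality’ as the target column, as described earlier) could look like this:

```python
# Step-1: Separate the feature matrix (X) from the target array (y)
X = df.drop("quality", axis=1).values
y = df["quality"].values

# Step-2: Standardization function: (x - mean) / std, computed column-wise
def standardize(feature_matrix):
    mean = feature_matrix.mean(axis=0)
    std = feature_matrix.std(axis=0)
    return (feature_matrix - mean) / std

# Step-3: Standardize the feature set
X = standardize(X)
```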

Standardized data (image by author)

With this, we are done standardizing our data. This should take care of the feature weight bias.

Now for the last part of our data wrangling and EDA section, we will have a look at the distribution of values across the target column of our dataset.
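
One way to produce such a plot (a rough sketch, reusing the df DataFrame from above):

```python
# Bar plot of how many samples fall into each quality rating
df["quality"].value_counts().sort_index().plot(kind="bar")
plt.xlabel("Quality rating")
plt.ylabel("Number of samples")
plt.show()
```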

Target label counts (Image by author)

Some of the observations from the above-given graph are-

  • Most wine samples in our data are rated 5 and 6, followed by 7.
  • No wine sample is rated above 8 or below 3. This implies that a wine sample of extremely high quality (9 or 10) or of very low quality (0, 1, or 2) can either be thought of as a hypothetical ideal case, or that the data is suffering from sampling bias, where samples of extreme quality didn’t get any representation within the survey.
  • Our last assumption of a sampling bias within the data is further strengthened as we notice that a majority of wine samples are rated 5 or 6.
  • A model trained on such data will produce biased results, where it is more likely to classify a wine sample as a 5 or 6 on the quality scale as compared to, say, 3 or 8. To know more about sampling bias or how to deal with it, check out this article of mine.

Now that we are done exploring our data, let us move on to the final part of our project, where we will code our multi-class KNN classifier model using NumPy.

Modeling and Evaluation

As the first step of our modeling process, we will first split our dataset into training and test sets. This is done because training and evaluating your model on the same data is considered a bad practice.

Let’s see how to implement the code to split the dataset using Python.

Step-1: Declaring the split function.

Step-2: Running the splitting function on our standardized dataset.
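
The original gists aren't shown here, but a from-scratch split along the lines described above might look like this (the 80/20 ratio and the random seed are assumptions, not necessarily the author's choices):

```python
# Step-1: A simple random train/test split implemented with NumPy
def train_test_split(X, y, test_size=0.2, seed=42):
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))            # shuffled row indices
    split_point = int(len(X) * (1 - test_size))  # size of the training set
    train_idx, test_idx = indices[:split_point], indices[split_point:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Step-2: Run the split on the standardized data
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```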

Shape of the splits (Image by author)

Now that we have created our training and validation sets, we will finally see how to implement the model.

On the basis of what we have learned till now, here are the steps involved in creating a KNN model.

Step-1: Training the model- As we read earlier, in the case of KNN, this simply means saving the training dataset within the memory. We have already done that when we created our training and validation data splits.

Step-2: Calculating the distance- A part of the inference process in the KNN algorithm, the process of calculating the distance is an iterative process where we calculate the Euclidean distance of a data point (basically, a data instance/row) in the test data from every single data point within the training data.

Calculation of distance (Image by author)

Now, let us understand how the Euclidean distance formula works so that we can implement it for our model.

  • Let us consider two data points A and B.

→ A = [a0, a1, a2, a3,…. an], where ai is a feature that represents the data point A.

Similarly, B= [b0, b1, b2, b3,…. bn].

Therefore, the Euclidean distance between the data points A and B is calculated using the following formula-

Euclidean distance formula (image by author)
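
In plain notation, the formula is:

d(A, B) = √((a0 - b0)² + (a1 - b1)² + … + (an - bn)²)

i.e., the square root of the sum of squared differences between the corresponding features of A and B.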

Let’s now implement this distance calculation step in Python.

  • Step-2.1: Declaring Python function to calculate Euclidean distance between 2 points.
  • Step-2.2: Declaring a Python function to calculate the distance from each point in the training data
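
The gists for these two helpers aren't embedded here; a minimal sketch of both (the function names are mine, not necessarily the author's) could look like this:

```python
# Step-2.1: Euclidean distance between two feature vectors
def euclidean_distance(point_a, point_b):
    return np.sqrt(np.sum((point_a - point_b) ** 2))

# Step-2.2: Distance of one test point from every point in the training data
def distances_from_training(train_X, test_point):
    return np.array([euclidean_distance(train_row, test_point)
                     for train_row in train_X])
```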

Before we move on to the next step, let us test our distance function.
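
For example, something along these lines:

```python
# Distances of the first test point from all the training points
print(distances_from_training(X_train, X_test[0]))
```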

As we can see, our distance function successfully calculated the distance of the first point in our test data from all the training data points. Now we can move on to the next step.

Step-3: Selecting k-nearest neighbors and making the prediction- This is the final step of the inference stage, where the algorithm selects the k-closest training data points to the test data point based on the distances calculated in step 2. Then, we consider the target labels of these k-nearest neighboring points. The label with the highest occurring frequency is assigned as the target class to the test data point.

Let us see how to implement this in Python.

Step-3.1: Defining the KNN Classification function.
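
A minimal sketch of such a function, built on the distance helpers above (using collections.Counter for the majority vote is my implementation choice, not necessarily the author's):

```python
from collections import Counter

# Step-3.1: Predict the class of a single test point from its k nearest neighbors
def knn_predict(train_X, train_y, test_point, k):
    dists = distances_from_training(train_X, test_point)  # distances to all training points
    nearest_idx = np.argsort(dists)[:k]                    # indices of the k closest points
    nearest_labels = train_y[nearest_idx]
    return Counter(nearest_labels).most_common(1)[0][0]    # majority vote
```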

Step-3.2: Running inference on our test dataset.
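
Running it over the whole test set might then look like the following (k=3 here is an arbitrary placeholder value):

```python
# Step-3.2: Predict a label for every instance in the test set
y_pred = np.array([knn_predict(X_train, y_train, test_point, k=3)
                   for test_point in X_test])
print(y_pred)
```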

Array of predicted values (Image by author)

With this, we have completed the modeling and inference process. As a final step, we will evaluate our model’s performance. For this, we will be using a simple accuracy function that calculates the proportion of correct predictions made by our model.

Let’s have a look at how to implement the accuracy function in Python.

Step-1: Defining the accuracy function.

Step-2: Checking the accuracy of our model.
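
A rough sketch of both steps:

```python
# Step-1: Accuracy as the fraction of predictions that match the true labels
def accuracy(y_true, y_pred):
    return np.mean(y_true == y_pred)

# Step-2: Accuracy of the scratch model on the test set
print(accuracy(y_test, y_pred))
```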

Initial model accuracy

Step-3: Comparing with the accuracy of a KNN classifier built using the Scikit-Learn library.
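
A sketch of this comparison, fitting scikit-learn's KNeighborsClassifier with the same k-value used for the scratch model:

```python
from sklearn.neighbors import KNeighborsClassifier

# Fit the scikit-learn KNN classifier and score it on the same test split
sk_model = KNeighborsClassifier(n_neighbors=3)
sk_model.fit(X_train, y_train)
print(sk_model.score(X_test, y_test))
```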

Sklearn accuracy with the same k-value as scratch model

An interesting observation here! Though our model didn’t perform very well (with only 57% of predictions correct), it has the exact same accuracy as a Scikit-Learn KNN model. This means the model that we defined from scratch was at least able to replicate the performance of a pre-defined model, which is an achievement in itself!

However, I believe we can further improve the model’s performance to some extent. Therefore, as the last part of our project, we will find the best value of the hyperparameter ‘k’ for which our model gives the highest accuracy.

Model Optimization

Before we actually go on to finding the best k-value, first, let us understand the importance of the k-value in the K-Nearest Neighbors algorithm.

  • The k-value in the KNN algorithm determines the number of training data points that are to be considered while determining the class of the test data points.
  • Impact of a low k-value: If the k-value is very low, say, 1 or 2, the model will become very sensitive to outliers in the data. Outliers can be defined as the extreme instances within the data that do not follow the general trends within the data. Because of this, the predictions of the model become very unstable.
  • Impact of a high k-value: Now, as the k-value in the KNN algorithm increases, a weird trend is observed. At first, an increase is observed in the stability of the algorithm. One reason for this can be that as we consider more neighbors for predicting the target class of a test data point, because of majority voting, the effect of outliers decreases. However, as we continue to increase the k-value, after a certain point, we start to observe a decline in the stability of the algorithm, and the model accuracy starts to deteriorate.

The following graph roughly represents the relation between the k-value and stability of a KNN classifier model.

k-value vs stability (Image by author)

Now, let us finally evaluate the model for a range of different k-values. The one with the highest accuracy will be chosen as the final k-value for our model.
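
One way to run this sweep (the range of 1–20 is an assumption; any reasonable range of k-values works):

```python
# Evaluate the scratch model for a range of k-values
for k in range(1, 21):
    y_pred_k = np.array([knn_predict(X_train, y_train, p, k=k) for p in X_test])
    print(f"k = {k:2d}  accuracy = {accuracy(y_test, y_pred_k):.3f}")
```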

Model accuracy w/ different k-values (Image by author)

As we can see, k-value 1 has the highest accuracy. But as we discussed earlier, for k=1, the model will be very sensitive to outliers. Hence, we will go with k=8 which has the second-highest accuracy. Let us observe the results with k=8.
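
For example:

```python
# Final model with k=8
y_pred_final = np.array([knn_predict(X_train, y_train, p, k=8) for p in X_test])
print(accuracy(y_test, y_pred_final))
```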

As you can see, we got a performance boost here! Just by tweaking the hyperparameter ‘k’, our model’s accuracy bumped up by almost 3 percent.

With this, we come to the end of our project.

To finish things off, let us have a quick rundown of all that we learned today, as well as some of the key takeaways from this lesson.

Conclusions

In this article, we had an in-depth analysis of the K-Nearest Neighbors classification algorithm. We understood how the algorithm uses Euclidean distance between the data instances as a criterion of comparison, on the basis of which it predicts the target class for a particular data instance.

In the second part of this guide, we went through a step-by-step process of creating a KNN classification model from scratch, primarily using Python and NumPy.

Though our model was not able to give a stellar performance, at least we were able to match the performance of a predefined Scikit-Learn model. And while we were able to increase the model’s accuracy up to 60% via hyperparameter optimization, the performance was still not satisfactory.

This has a lot to do with how the data was structured. As we observed earlier, there was a huge sampling bias in the data. This certainly affected our model’s performance. Another possible reason for the poor performance is that the data contained a large number of outliers. All this brings us to a very important part that I left for the very end, where we will have a look at the advantages and disadvantages of the KNN classification algorithm.

Advantages of KNN-

  • As we discussed earlier, being a non-parametric algorithm, KNN doesn’t require multiple training cycles in order to adapt to the trends within the training data. As a result of this, KNN has an almost negligible training time, and in fact, is one of the fastest machine learning algorithms when it comes to training.
  • The implementation of KNN is very easy, as compared to some other, more complex classification algorithms.

Disadvantages of KNN–

  • When it comes to inference, KNN is very compute-intensive. For inference on each test data instance, the algorithm has to calculate its distance from every single point in the training data. In terms of time complexity, for n-number of training data instances and m-number of test data instances, the complexity of the algorithm evaluates to O(m x n).
  • As the dimensionality (i.e., the total number of features) and the scale of the dataset (i.e., the total number of data instances) increases, the model size also increases, which in turn impacts the performance and speed of the model. Therefore, KNN is not a good algorithm choice when it comes to high dimensional, large scale datasets.
  • The KNN algorithm is very sensitive to outliers in the data. Even a slight increase in the noise within the dataset might drastically affect the model’s performance.

With this, we finally come to an end of today’s learning session. In another article of mine, I have directly pitted a bunch of machine learning algorithms against each other. There, you can check out how KNN fares against other classification algorithms like logistic regression, decision tree classifiers, random forest ensembles, etc.

By the way, this was the fourth article in my ML from Scratch series, where I cover different machine learning algorithms and their mathematical foundations in detail. If you are interested in learning more, the other articles in this series are-

Link to the project GitHub files.

If you liked the article and would love to keep seeing more articles in the ML from Scratch series, make sure you hit that follow button.

Happy learning!


Programming computers to do the work for me and using data to solve problems are my passion. My machine and I are learning. Connect on Twitter @amansharma2910