
In this article, we will understand how the k-Nearest Neighbors (kNN) algorithm works and build it from the ground up. We will also evaluate our algorithm using k-Fold Cross Validation, which is likewise developed from scratch.
After completing this tutorial you will know:
- How to code the k-Nearest Neighbors algorithm step-by-step
- How to use k-Nearest Neighbors to make a prediction for new data
- How to code k-Fold Cross Validation step-by-step
- How to evaluate k-Nearest Neighbors on a real dataset using k-Fold Cross Validation
Prerequisites: Basic understanding of Python and the concept of classes and objects from Object-oriented Programming (OOP)
k-Nearest Neighbors
k-Nearest Neighbors, kNN for short, is a very simple but powerful technique used for making predictions. The principle behind kNN is to use the "most similar historical examples to the new data."
'k' is the number of similar neighbors (records) considered for a new data point.
The entire training dataset is initially stored. When a prediction is required for new data, kNN considers the k most similar neighbors (records) and uses this feature similarity to decide which class the new data point belongs to.
Once we compute the distance or similarity, we choose the k closest records. From these k closest records, we make the prediction by returning the most common outcome or by taking the average. As such, kNN can be used for both classification and regression problems.
The kNN algorithm doesn't have a training phase: the model simply holds the data and does no work until a prediction is required. For this reason, kNN is often referred to as a "lazy learning" method.
k-Nearest Neighbors in 4 easy steps
- Choose a value for k
- Find the distance of the new point to each record of training data
- Get the k-Nearest Neighbors
- For a classification problem, the new data point belongs to the class that most of the neighbors belong to; for a regression problem, the prediction can be the average or a weighted average of the labels of the k-Nearest Neighbors
Building kNN from scratch using Python
You can follow along using the code available in my GitHub.
You can also install it by using:
pip install simple-kNN
GitHub repo for PyPI package version: https://github.com/chaitanyakasaraneni/simple-kNN
Step 1: Choosing a k value
The choice of k has a drastic impact on the results we obtain from kNN. It is better to choose an odd number (e.g., k = 3 or k = 5) so that, at least for binary classification, the majority vote cannot end in a tie.
Step 2: Calculating Distance
The next step is to calculate the distance between two rows in the dataset.
Problem- or data-specific methods are used to calculate the distance or similarity between two records. For tabular or vector data, Euclidean distance is generally a good starting point. There are several other similarity or distance metrics, such as Manhattan distance and Hamming distance.
Euclidean distance is defined as the square root of the sum of the squared differences between two points. It is also known as the L2 norm.
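Writing the two records as p and q with n features each, it can be expressed as:

d(p, q) = sqrt( (p₁ − q₁)² + (p₂ − q₂)² + … + (pₙ − qₙ)² )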

Manhattan distance is the sum of the absolute values of the differences between two points. It is also known as the L1 norm.
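Using the same notation:

d(p, q) = |p₁ − q₁| + |p₂ − q₂| + … + |pₙ − qₙ|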

Hamming distance is used for categorical variables. In simple terms, it counts the positions at which the two categorical vectors differ.
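Again with the same notation:

d(p, q) = δ(p₁, q₁) + δ(p₂, q₂) + … + δ(pₙ, qₙ)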

where 'δ' checks the equality of the two elements: it is 1 when they differ and 0 when they are the same.
In Python, we create a separate class that holds the methods to calculate the distance between two vectors.
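Below is a minimal sketch of what such a class can look like. The class name distanceMetric matches how it is referred to later in this article, but the method names and exact code in the linked repository may differ.

```python
import math

class distanceMetric:
    """Holds methods to compute the distance between two equal-length vectors."""

    def euclidean(self, row1, row2):
        # Square root of the sum of squared differences (L2 norm)
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(row1, row2)))

    def manhattan(self, row1, row2):
        # Sum of the absolute differences (L1 norm)
        return sum(abs(a - b) for a, b in zip(row1, row2))

    def hamming(self, row1, row2):
        # Number of positions at which the two (categorical) vectors differ
        return sum(a != b for a, b in zip(row1, row2))
```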
We shall utilize this class to find the nearest neighbors in the next step.
Step 3: Get Nearest Neighbors
Neighbors for a new piece of data are the k closest instances in the training dataset, obtained using the distance metrics defined above.
To locate the neighbors for a new piece of data within a dataset, we must first calculate the distance between each record in the dataset and the new piece of data. We can do this by creating an object of the distanceMetric class that we defined above.
Once distances are calculated, we must sort all of the records in the training dataset by their distance to the new data. We can then select the top k to return as the most similar neighbors.
We can do this by keeping track of each record together with its distance in a list, sorting that list by distance, and then retrieving the top k records as the neighbors.
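Below is a sketch of this step. It assumes each row (training and test) stores its feature values first and its class label in the last column, and that dm is an instance of the distanceMetric class from Step 2; the function and parameter names are illustrative rather than taken from the repository.

```python
def getNeighbors(trainData, testRow, k, dm, metric='euclidean'):
    """Return the k training rows closest to testRow under the chosen metric."""
    distances = []
    for trainRow in trainData:
        # Compare feature columns only (the last column holds the label)
        if metric == 'euclidean':
            dist = dm.euclidean(trainRow[:-1], testRow[:-1])
        elif metric == 'manhattan':
            dist = dm.manhattan(trainRow[:-1], testRow[:-1])
        else:
            dist = dm.hamming(trainRow[:-1], testRow[:-1])
        distances.append((trainRow, dist))
    # Sort every record by its distance to the new data point
    distances.sort(key=lambda pair: pair[1])
    # Keep the k closest records as the neighbors
    return [pair[0] for pair in distances[:k]]
```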
Now that we know how to get the top k neighbors from the dataset, we will use them to make predictions.
Step 4: Predictions
In this step, we use the top k similar neighbors collected from the training dataset to make predictions.
In the case of classification, we can return the most represented class among the neighbors.
We can achieve this with the max() function on the list of output values from the neighbors: given the list of class values observed in the neighbors, max() is applied to the set of unique class values, using each value's count in that list as the key, and returns the most frequent class.
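In code, that majority vote fits in one line. A small helper (with a hypothetical name) could look like this, where neighbors is the list of rows returned by the previous step:

```python
def predictClassification(neighbors):
    # The class label is assumed to be the last column of each neighbor row
    outputValues = [row[-1] for row in neighbors]
    # max() over the unique class values, ranked by how often each one appears
    return max(set(outputValues), key=outputValues.count)
```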
Below is the complete kNN class:
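The original code listing is not reproduced here, so the following is a compact sketch that combines the steps above into a single class. The class and method names (kNNClassifier, fit, predict) and their exact signatures are assumptions; refer to the linked repository for the author's actual implementation. It relies on the distanceMetric class defined earlier.

```python
class kNNClassifier:
    """k-Nearest Neighbors for classification, built on the distanceMetric class above."""

    def __init__(self, k=3, metric='euclidean'):
        self.k = k
        self.metric = metric
        self.dm = distanceMetric()

    def fit(self, trainData):
        # Lazy learning: just store the training data until predictions are needed
        self.trainData = trainData

    def _distance(self, row1, row2):
        if self.metric == 'euclidean':
            return self.dm.euclidean(row1, row2)
        if self.metric == 'manhattan':
            return self.dm.manhattan(row1, row2)
        return self.dm.hamming(row1, row2)

    def _getNeighbors(self, testRow):
        # Rank every training record by its distance to the new row (label column excluded)
        distances = [(trainRow, self._distance(trainRow[:-1], testRow[:-1]))
                     for trainRow in self.trainData]
        distances.sort(key=lambda pair: pair[1])
        return [pair[0] for pair in distances[:self.k]]

    def predict(self, testData):
        # Majority vote over the labels of the k nearest neighbors for every test row
        predictions = []
        for testRow in testData:
            neighbors = self._getNeighbors(testRow)
            outputValues = [row[-1] for row in neighbors]
            predictions.append(max(set(outputValues), key=outputValues.count))
        return predictions
```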
Now that we have our predictions, we need to evaluate the performance of our model. For this, we shall use k-Fold Cross Validation which is defined in the next part.
k-Fold Cross Validation
This technique involves randomly dividing the dataset into k groups, or folds, of approximately equal size. One fold is held out for testing and the model is trained on the remaining k-1 folds; this is repeated until each fold has served as the test set once.

There are many variants of k-Fold Cross Validation, such as stratified and repeated k-Fold Cross Validation.
In our approach, we calculate the accuracy after each fold, and the accuracy of k-Fold Cross Validation is computed by taking the average of the accuracies over the k folds.
Building kFCV from scratch using Python
As a first step, we divide the dataset into k folds.
Then, for each fold, we run the kNN algorithm, get predictions, and evaluate the performance using accuracy as the evaluation metric.
We need two methods: one to split the data into k folds, and one to evaluate the kNN algorithm on each fold. Both methods are combined into a single class, sketched below.
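The kFCVEvaluate signature follows the usage shown further down in this article; the helper method name crossValSplit is illustrative, and the class reuses the kNNClassifier sketched earlier, so the repository's implementation may differ in its details.

```python
from random import randrange

class kFoldCV:
    """k-Fold Cross Validation for the kNN classifier sketched above."""

    def crossValSplit(self, dataset, foldCount):
        # Randomly divide the dataset into foldCount folds of (roughly) equal size;
        # leftover records (when the split is uneven) are simply dropped
        datasetSplit = []
        datasetCopy = list(dataset)
        foldSize = len(dataset) // foldCount
        for _ in range(foldCount):
            fold = []
            while len(fold) < foldSize:
                index = randrange(len(datasetCopy))
                fold.append(datasetCopy.pop(index))
            datasetSplit.append(fold)
        return datasetSplit

    def kFCVEvaluate(self, data, foldCount, neighborCount, distanceMetric):
        # Train on k-1 folds, test on the held-out fold, and average the accuracies;
        # distanceMetric is the metric name string passed through to the classifier
        folds = self.crossValSplit(data, foldCount)
        scores = []
        for fold in folds:
            trainSet = [row for otherFold in folds if otherFold is not fold
                        for row in otherFold]
            knn = kNNClassifier(k=neighborCount, metric=distanceMetric)
            knn.fit(trainSet)
            predicted = knn.predict(fold)
            actual = [row[-1] for row in fold]
            correct = sum(p == a for p, a in zip(predicted, actual))
            scores.append(correct / len(actual) * 100)
        print('Fold accuracies: %s' % scores)
        print('Mean accuracy: %.3f%%' % (sum(scores) / len(scores)))
```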
We can execute this by creating an object of the kFoldCV class and calling its evaluate method as shown below.
kfcv = kFoldCV()
kfcv.kFCVEvaluate(data, foldCount, neighborCount, distanceMetric)
kfcv.kFCVEvaluate() then splits the data into the specified number of folds and evaluates the kNN algorithm by considering the top k neighbors, using the distance metric specified.
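As a quick, hypothetical illustration, here is what such a call could look like on a tiny made-up dataset in which the last column of each row is the class label:

```python
# Toy dataset: two features per row, class label (0 or 1) in the last column
data = [
    [2.7, 2.5, 0], [1.4, 2.3, 0], [3.3, 4.4, 0], [1.3, 1.8, 0],
    [7.6, 2.7, 1], [8.6, 0.2, 1], [7.9, 3.5, 1], [9.1, 2.1, 1],
]

kfcv = kFoldCV()
# 4 folds, 3 neighbors, Euclidean distance
kfcv.kFCVEvaluate(data, foldCount=4, neighborCount=3, distanceMetric='euclidean')
```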
Examples and implementation can be seen in my GitHub repository.
Conclusion
In this blog, we have seen:
- The kNN algorithm
- Some distance metrics used in the kNN algorithm
- Predictions using the kNN algorithm
- Evaluating the kNN algorithm using k-Fold Cross Validation
I hope you gained some knowledge from reading this article. Please remember that this article is just an overview, based on my understanding of the kNN algorithm and the k-Fold Cross Validation technique gathered from various online sources.