
Overview
K-Nearest Neighbors (KNN) is a classification and regression algorithm that uses nearby points to generate predictions. It takes a point, finds the K nearest points, and predicts a label for that point, with K being user defined, e.g., 1, 2, 6. For classification, the algorithm uses the most frequent class among the neighbors. For regression, the algorithm averages the values of the K nearest points to predict a continuous value. KNN is a supervised learning algorithm that is, in a sense, lazy when it comes to training (more on this later).
This post will serve as a high-level overview of what’s happening under the hood of KNN when performing classification. I’ll start by going over the different distance metrics and move into code examples with KNN from Scikit-Learn. For the coding examples, I will be using the Titanic dataset from Kaggle.
Similarity Quantified by Distance
KNN, being a distance-based classifier, can use different distance metrics to calculate similarity. The three I will cover in this post are Euclidean distance, Manhattan distance, and Minkowski distance.
Euclidean Distance

Euclidean distance is the most common metric used, and is derived from the Pythagorean theorem. Euclidean distance simply refers to the straight-line distance between two points. The formula for calculating Euclidean distance:

d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)

The formula above states that, for each dimension, we calculate the length of that side of the triangle by subtracting one point's value from the other's, square the difference, and add it to a running total. The Euclidean distance is the square root of that running total.

Take a look at the plot above: to calculate the Euclidean distance, we would just subtract x1 from x2, square it, do the same for y1 and y2, add the two results, and take the square root. This can easily be coded using Python's built-in math module. The code below calculates the Euclidean distance between two points. In three dimensions, the distance is calculated the same way, just with an additional z term.
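The snippet below is a minimal sketch of that calculation; the two points and their values are made up purely for illustration.

import math

# Two made-up points in 2-dimensional space
point_a = (1, 2)
point_b = (4, 6)

# Square the difference in each dimension, sum the squares,
# then take the square root of the running total
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(point_a, point_b)))
print(euclidean)  # 5.0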

Manhattan Distance

The Manhattan distance is a pretty simple metric we can use to calculate distance. Take a look at the plot below, and imagine the grid is an overhead map of city blocks. To get from point A to point B in the context of Manhattan distance, we couldn't just go straight through all the skyscrapers! We'd have to go n blocks east or west, and n blocks north or south.

Manhattan distance differs from Euclidean distance in that the difference between the two points in each dimension is taken as an absolute value. Back to walking the blocks of Manhattan to get from point A to point B: if we were to go west first, we'd have a negative value. We still want this move to count, so we take the absolute value of the difference. The formula for Manhattan distance:

d(p, q) = |p1 - q1| + |p2 - q2| + ... + |pn - qn|

The formula above states that the distance between point p and point q equals the sum, over each dimension, of the absolute difference between the two points' values. We can also calculate the Manhattan distance in Python. This time, instead of the sqrt() function from the math module, we will use the built-in abs() function.
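Here is a comparable sketch for Manhattan distance, reusing the same made-up points:

# Two made-up points in 2-dimensional space
point_a = (1, 2)
point_b = (4, 6)

# Sum the absolute differences in each dimension
manhattan = sum(abs(a - b) for a, b in zip(point_a, point_b))
print(manhattan)  # 7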

Minkowski Distance

The Minkowski distance is a generalized distance metric on a normed vector space, i.e., a vector space equipped with a norm (a function that assigns a length to each vector). This metric is actually a generalization of the Manhattan and Euclidean distances. The formula for Minkowski distance:

d(p, q) = (|p1 - q1|^c + |p2 - q2|^c + ... + |pn - qn|^c)^(1/c)

The formula above is similar to the Euclidean and Manhattan distance formulas, but with the exponent c as a parameter. When c is set to 1, the formula is the same as Manhattan distance. When c is set to 2, it's the same as Euclidean distance.
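To make that relationship concrete, here is a small (hypothetical) helper that takes the exponent c as an argument; with the same made-up points as before, c=1 reproduces the Manhattan result and c=2 the Euclidean one. In scikit-learn, this exponent is exposed as the p parameter of KNeighborsClassifier.

# Generalized Minkowski distance with exponent c
def minkowski(p, q, c):
    return sum(abs(a - b) ** c for a, b in zip(p, q)) ** (1 / c)

point_a = (1, 2)
point_b = (4, 6)
print(minkowski(point_a, point_b, 1))  # 7.0 (Manhattan)
print(minkowski(point_a, point_b, 2))  # 5.0 (Euclidean)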
Under the Hood of K-Nearest Neighbors

When performing classification tasks with KNN, the algorithm stores the training split in memory and measures the distance from each training point to the new point. Since the model simply stores the training set, it isn't really "training" anything when we call fit. All of the real work takes place in the prediction step. It is worth noting that because of this "lazy training", KNN can require quite a bit of memory. Being a non-parametric algorithm, it doesn't learn coefficients.
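Before we move to the Titanic data, here is a bare-bones sketch of what that prediction step does; the points, labels, and K-value are made up for illustration.

import math
from collections import Counter

# "Training" is just storing the data in memory
X_train = [(22.0, 7.25), (38.0, 71.28), (26.0, 7.92), (35.0, 53.10), (54.0, 51.86)]
y_train = [0, 1, 1, 1, 0]

def predict(new_point, k=3):
    # The real work: measure every distance, sort, and take a majority vote
    distances = [(math.dist(new_point, point), label)
                 for point, label in zip(X_train, y_train)]
    k_nearest = sorted(distances)[:k]
    return Counter(label for _, label in k_nearest).most_common(1)[0][0]

print(predict((30.0, 50.0)))  # 1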
Let's go ahead and put this into code for a better understanding. For the purposes of visualizing KNN, I will only be using the Age, Fare, and Survived features from the Titanic dataset. The Titanic dataset is for binary classification, with the target being the Survived feature. The dataset I'm loading has previously been cleaned. For a description of the features in the dataset, see the data dictionary below.
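The preprocessing and plotting steps are roughly the sketch below; the file name cleaned_titanic.csv and the random_state values are assumptions, not the exact ones used.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the previously cleaned dataset (file name assumed)
df = pd.read_csv('cleaned_titanic.csv')
X = df[['Age', 'Fare']]
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale Age and Fare so neither feature dominates the distance calculation
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Scatterplot of 40 sampled training points, colored by the Survived label
sample = (pd.DataFrame(X_train_scaled, columns=['Age', 'Fare'])
          .assign(Survived=y_train.values)
          .sample(40, random_state=42))
plt.scatter(sample['Age'], sample['Fare'], c=sample['Survived'])
plt.xlabel('Age (scaled)')
plt.ylabel('Fare (scaled)')
plt.show()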


The scatterplot above shows 40 sampled data points (passengers) from the scaled train split in blue and orange. The orange points are passengers that survived and the blue points are passengers that did not. The new data point in green is from the scaled test set. If we take the new data point and use a K-value of 3, the model looks for the 3 nearest points and classifies the new point according to whichever class label appears more frequently among them. It looks like 2 of the nearest points are a 1 (survived) and 1 is a 0 (did not survive); since 2 > 1, the prediction for this data point would be that the passenger survived.
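Continuing from the sketch above, the same K=3 vote can be reproduced with scikit-learn's KNeighborsClassifier (covered more fully in the next section); which test point plays the role of the green one is arbitrary here.

from sklearn.neighbors import KNeighborsClassifier

# Fit on the two scaled features and classify one point from the scaled test set
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)
print(knn.predict(X_test_scaled[:1]))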

KNN in Scikit-Learn
Now to perform some actual classification with the KNeighborsClassifier from Sklearn. Along with the documentation, sklearn also provides a user guide on implementing KNN for classification.
It's always a good idea to look over the classifier's parameters and their defaults (see the docstring below).
help(KNeighborsClassifier())
Classifier implementing the k-nearest neighbors vote. Read more in the scikit-learn User Guide.

Parameters
----------
n_neighbors : int, default=5
    Number of neighbors to use by default for kneighbors queries.
weights : {'uniform', 'distance'} or callable, default='uniform'
    Weight function used in prediction. Possible values:
    - 'uniform' : uniform weights. All points in each neighborhood are weighted equally.
    - 'distance' : weight points by the inverse of their distance. In this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.
    - [callable] : a user-defined function which accepts an array of distances and returns an array of the same shape containing the weights.
algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'
    Algorithm used to compute the nearest neighbors:
    - 'ball_tree' will use BallTree
    - 'kd_tree' will use KDTree
    - 'brute' will use a brute-force search.
    - 'auto' will attempt to decide the most appropriate algorithm based on the values passed to fit.
    Note: fitting on sparse input will override the setting of this parameter, using brute force.
leaf_size : int, default=30
    Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
p : int, default=2
    Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
metric : str or callable, default='minkowski'
    The distance metric to use for the tree. The default metric is minkowski, and with p=2 is equivalent to the standard Euclidean metric. See the documentation of DistanceMetric for a list of available metrics. If metric is "precomputed", X is assumed to be a distance matrix and must be square during fit. X may be a sparse graph, in which case only "nonzero" elements may be considered neighbors.
metric_params : dict, default=None
    Additional keyword arguments for the metric function.
n_jobs : int, default=None
    The number of parallel jobs to run for neighbors search. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See the Glossary for more details. Doesn't affect the fit method.

Attributes
----------
classes_ : array of shape (n_classes,)
    Class labels known to the classifier.
effective_metric_ : str or callable
    The distance metric used. It will be the same as the metric parameter or a synonym of it, e.g. 'euclidean' if the metric parameter is set to 'minkowski' and p is set to 2.
effective_metric_params_ : dict
    Additional keyword arguments for the metric function. For most metrics this will be the same as the metric_params parameter, but may also contain the p parameter value if the effective_metric_ attribute is set to 'minkowski'.
outputs_2d_ : bool
    False when y's shape is (n_samples,) or (n_samples, 1) during fit, otherwise True.
Since we didn't include any categorical features in the examples above, I'll begin by loading the dataframe again, this time keeping both the numerical and categorical features. Then, I'll scale and encode the features before fitting the classifier.
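A sketch of that preprocessing step follows; the exact column names kept after cleaning are an assumption on my part.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv('cleaned_titanic.csv')  # file name assumed
numeric = ['Age', 'Fare', 'SibSp', 'Parch']
categorical = ['Pclass', 'Sex', 'Embarked']

X = df[numeric + categorical]
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale the numeric features and one-hot encode the categorical ones
preprocess = ColumnTransformer([
    ('num', StandardScaler(), numeric),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),
])
X_train_prep = preprocess.fit_transform(X_train)
X_test_prep = preprocess.transform(X_test)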

Now that we have preprocessed train and test splits, we can move on to fitting a KNN classifier. Below, I will instantiate the KNN classifier with the default parameters. There will be a K-value of 5 (5 neighbors) and a p-value (power parameter) of 2 (Euclidean distance).
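Something along these lines, with the variable names carrying over from the preprocessing sketch above:

from sklearn.neighbors import KNeighborsClassifier

# Default parameters: n_neighbors=5 and p=2 (Euclidean distance)
knn = KNeighborsClassifier()
knn.fit(X_train_prep, y_train)
print('Train accuracy:', knn.score(X_train_prep, y_train))
print('Test accuracy:', knn.score(X_test_prep, y_test))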

Not too shabby: the KNN classifier with default parameters performed pretty well. The model is performing better on the train set than the test set, indicating our model may be overfit. Let's try the same classification but with Manhattan distance (a power parameter of 1).
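The only change from the previous sketch is the power parameter:

# p=1 switches the Minkowski metric to Manhattan distance
knn_manhattan = KNeighborsClassifier(p=1)
knn_manhattan.fit(X_train_prep, y_train)
print('Train accuracy:', knn_manhattan.score(X_train_prep, y_train))
print('Test accuracy:', knn_manhattan.score(X_test_prep, y_test))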

Switching the distance metric didn't change the performance by much. Let's move on and try different K-values to see how the performance compares.
Finding the Best K-Value
Unfortunately, there isn't a set K-value that gets the best performance out of our model. However, we do have the ability to iterate over different K-values and compare the performance. Since KNN makes predictions by majority vote, it makes sense to use only odd K-values: an odd K-value means the vote can't be evenly split between the two classes. One way to visually compare K-values is to create an accuracy plot (see below).
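A sketch of that loop and plot, continuing from the earlier variables (the range of K-values tried here is my own choice):

import matplotlib.pyplot as plt

# Fit a classifier for each odd K-value and record train/test accuracy
k_values = list(range(1, 26, 2))
train_acc, test_acc = [], []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train_prep, y_train)
    train_acc.append(model.score(X_train_prep, y_train))
    test_acc.append(model.score(X_test_prep, y_test))

plt.plot(k_values, train_acc, label='Train accuracy')
plt.plot(k_values, test_acc, label='Test accuracy')
plt.xlabel('K-value')
plt.ylabel('Accuracy')
plt.legend()
plt.show()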

Looking at the plot above, it looks like the best K-value is between 11 and 13. Having a K-value of 11 results in:
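Refitting at K=11 along these lines produces those scores:

# Refit with the chosen K-value and report train/test accuracy
best_knn = KNeighborsClassifier(n_neighbors=11)
best_knn.fit(X_train_prep, y_train)
print('Train accuracy:', best_knn.score(X_train_prep, y_train))
print('Test accuracy:', best_knn.score(X_test_prep, y_test))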

We can see that this is the best K-value: at 11, the train and test accuracy lines are at their highest point while still being close to each other, indicating the model isn't overfit or underfit.
Conclusion
The K-Neighbors Classifier is a good model to learn when getting into classification with supervised learning. Even though it doesn't really ‘train’, understanding how the different distance metrics are used to perform classification will be helpful when it comes to finding the best K-value. The fact that KNN stores the training set in memory means that as the number of data points and features grows, the computation time will grow as well. Thank you for your time, and I hope this post will assist you in your future classification endeavors with KNN!
References
- 1.6. Nearest Neighbors. (n.d.). Retrieved from https://scikit-learn.org/stable/modules/neighbors.html
- Glossary. (n.d.). Retrieved from http://rosalind.info/glossary/euclidean-distance/
- Bogomolny, A. (n.d.). The Distance Formula. Retrieved from https://www.cut-the-knot.org/pythagoras/DistanceFormula.shtml
- Bogomolny, A. (n.d.). Pythagorean Theorem. Retrieved from https://www.cut-the-knot.org/pythagoras/index.shtml
- Manhattan Distance. (n.d.). Retrieved from https://www.sciencedirect.com/topics/mathematics/manhattan-distance
- De Raum Zeit Minkowski [Image]. (n.d.). Retrieved from https://commons.wikimedia.org/wiki/File:De_Raum_Zeit_Minkowski_005.jpg