K-Nearest Neighbors Algorithm

A Breakdown of Classification With KNN

Photo by Dil on Unsplash

Overview

K-Nearest Neighbors (KNN) is a classification and regression algorithm that uses nearby points to generate predictions. It takes a point, finds the K nearest points, and predicts a label for that point, where K is defined by the user (e.g., 1, 2, or 6). For classification, the algorithm uses the most frequent class among the neighbors. For regression, it uses the average of the K nearest data points to predict a continuous value. KNN is a supervised learning algorithm that is, in a sense, lazy when it comes to training (more on this later).

This post will serve as a high-level overview of what’s happening under the hood of KNN when performing classification. I’ll start by going over the different distance metrics and move into code examples with KNN from Scikit-Learn. For the coding examples, I will be using the Titanic dataset from Kaggle.


Similarity Quantified by Distance

KNN, being a distance-based classifier, can use different distance metrics to calculate similarity. The three I will cover in this post are Euclidean distance, Manhattan distance, and Minkowski distance.

Euclidean Distance

Image by German Vizulis on Shutterstock

Euclidean distance is the most commonly used metric and is derived from the Pythagorean theorem. It is simply the straight-line distance between two points. The formula for calculating Euclidean distance:
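
d(p, q) = √( (p₁ − q₁)² + (p₂ − q₂)² + … + (pₙ − qₙ)² )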

The formula above states that, for each dimension, we calculate the length of that side of the triangle by subtracting one point’s value from the other’s, then square it and add it to a running total. The Euclidean distance is the square root of that running total.

Image by author

Take a look at the plot above: to calculate the Euclidean distance we would subtract X1 from X2, square it, do the same for Y1 and Y2, add the two, and take the square root. This can be easily coded using Python’s built-in math module. The code gist below calculates the Euclidean distance between two points. In a 3-dimensional space, the distance is calculated the same way, just with an additional z term.
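
Since the original gist isn’t shown here, a minimal sketch of that calculation with math.sqrt() and placeholder sample points could look like this:

import math

def euclidean_distance(p, q):
    # Sum the squared difference in each dimension, then take the square root.
    return math.sqrt(sum((p_i - q_i) ** 2 for p_i, q_i in zip(p, q)))

print(euclidean_distance((1, 2), (4, 6)))  # 5.0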

Manhattan Distance

Photo by Michael Discenza on Unsplash

The Manhattan distance is a pretty simple metric we can use to calculate distance. Take a look at the plot below, and imagine the grid is an overview map of city blocks. In order to get from point A to point B in the context of Manhattan distance, we couldn’t just go straight through all the skyscrapers! We’d have to go a certain number of blocks east or west, and a certain number of blocks north or south.

Image by author

Manhattan distance differs from Euclidean distance in how the difference between two points is measured: it uses the absolute value of the difference in each dimension. Going back to walking the blocks of Manhattan to get from point A to point B, if we were to go west first, we’d have a negative value. We still want this move to count, so we take the absolute value of the difference. The formula for Manhattan distance:
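
d(x, y) = |x₁ − y₁| + |x₂ − y₂| + … + |xₙ − yₙ|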

The formula above states that the distance between point x and point y equals the sum, over each dimension, of the absolute difference between the x value and the y value. We can also calculate the Manhattan distance in plain Python. This time, instead of math.sqrt(), we will use the built-in abs() function.
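
A minimal sketch, mirroring the Euclidean helper above but summing absolute differences instead, could look like this:

def manhattan_distance(p, q):
    # Sum the absolute difference in each dimension (no squaring, no square root).
    return sum(abs(p_i - q_i) for p_i, q_i in zip(p, q))

print(manhattan_distance((1, 2), (4, 6)))  # 7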

Minkowski Distance

Hermann Minkowski, Public domain, via Wikimedia Commons

The Minkowski distance is a generalized distance metric over a normed vector space (a vector space equipped with a norm, a function that assigns a length to each vector). This distance metric is actually a generalization of both the Manhattan and Euclidean distances. The formula for Minkowski distance:
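
d(x, y) = ( |x₁ − y₁|^c + |x₂ − y₂|^c + … + |xₙ − yₙ|^c )^(1/c)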

The formula above is similar to the Euclidean and Manhattan distance formulas, except the exponent c is now a parameter. When c is set to 1, the formula is the same as the Manhattan distance. When c is set to 2, it’s the same as the Euclidean distance.
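
To make the relationship concrete, here is a small sketch of a generalized helper that takes the exponent c as a parameter:

def minkowski_distance(p, q, c):
    # c = 1 reproduces Manhattan distance, c = 2 reproduces Euclidean distance.
    return sum(abs(p_i - q_i) ** c for p_i, q_i in zip(p, q)) ** (1 / c)

print(minkowski_distance((1, 2), (4, 6), c=1))  # 7.0 (Manhattan)
print(minkowski_distance((1, 2), (4, 6), c=2))  # 5.0 (Euclidean)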


Under the Hood of K-Nearest Neighbors

Image by Andre Valadao on Shutterstock

When performing classification tasks with KNN, the algorithm stores the training split in memory and measures the distance from each training point to the new point. Since the model simply stores the training set in memory, it isn’t really "training" when we call fit; all of the real work takes place in the prediction step. It is worth noting that because of this "lazy training", KNN can require quite a bit of memory. Being a non-parametric algorithm, it doesn’t learn any coefficients.
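
To illustrate the idea, here is a toy sketch (not the library’s implementation) that reuses the euclidean_distance helper from earlier and classifies a new point by majority vote among its K nearest training points:

from collections import Counter

def knn_predict(X_train, y_train, new_point, k=3):
    # Measure the distance from every stored training point to the new point.
    distances = [euclidean_distance(x, new_point) for x in X_train]
    # Take the indices of the k closest training points.
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
    # Majority vote among the labels of those neighbors.
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]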

Let’s go ahead and put this into code for a better understanding. For the purposes of visualizing KNN, I will only be using the Age, Fare, and Survived features from the Titanic dataset. The Titanic dataset is a binary classification problem, with the target being the Survived feature. The dataset I’m loading in has previously been cleaned. For a description of the features in the dataset, see the data dictionary below.
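
As a rough sketch of that setup (the file name, split size, and random seed below are assumptions, since the original gist isn’t shown):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("cleaned_titanic.csv")  # hypothetical cleaned file

X = df[["Age", "Fare"]]
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Scale the features, fitting the scaler on the train split only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)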

Data dictionary (image by author)
Scatterplot of 40 sampled passengers from the scaled train split (image by author)

The scatterplot above shows 40 sampled data points (passengers) from the scaled train split in blue and orange. The orange points are passengers that survived and the blue points are passengers that did not. The new data point in green is from the scaled test set. If we take the new data point and use a K-value of 3, the model looks for the 3 nearest points and classifies the new point according to whichever class label appears more frequently among them. It looks like 2 of the nearest points are labeled 1 (survived) and 1 is labeled 0 (did not survive); since 2 > 1, the prediction for this data point is that the passenger survived.

Image by author

KNN in Scikit-Learn

Now let’s perform some actual classification with the KNeighborsClassifier from Scikit-Learn. Along with the documentation, Scikit-Learn also provides a user guide on using KNN for classification.

It’s always a good idea to look over the classifier’s parameters and their defaults (see the docstring output below).

help(KNeighborsClassifier())
Classifier implementing the k-nearest neighbors vote.
Read more in the :ref:User Guide <classification>.
Parameters
----------
n_neighbors : int, default=5
Number of neighbors to use by default for :meth:kneighbors queries.
weights : {‘uniform’, ‘distance’} or callable, default=’uniform’
weight function used in prediction. Possible values:
– ‘uniform’ : uniform weights. All points in each neighborhood
are weighted equally.
– ‘distance’ : weight points by the inverse of their distance.
in this case, closer neighbors of a query point will have a
greater influence than neighbors which are further away.
– [callable] : a user-defined function which accepts an
array of distances, and returns an array of the same shape
containing the weights.
algorithm : {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, default=’auto’
Algorithm used to compute the nearest neighbors:
– ‘ball_tree’ will use :class:BallTree
– ‘kd_tree’ will use :class:KDTree
– ‘brute’ will use a brute-force search.
– ‘auto’ will attempt to decide the most appropriate algorithm
based on the values passed to :meth:fit method.
Note: fitting on sparse input will override the setting of
this parameter, using brute force.
leaf_size : int, default=30
Leaf size passed to BallTree or KDTree. This can affect the
speed of the construction and query, as well as the memory
required to store the tree. The optimal value depends on the
nature of the problem.
p : int, default=2
Power parameter for the Minkowski metric. When p = 1, this is
equivalent to using manhattan_distance (l1), and euclidean_distance
(l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
metric : str or callable, default=’minkowski’
the distance metric to use for the tree. The default metric is
minkowski, and with p=2 is equivalent to the standard Euclidean
metric. See the documentation of :class:DistanceMetric for a
list of available metrics.
If metric is "precomputed", X is assumed to be a distance matrix and
must be square during fit. X may be a :term:sparse graph,
in which case only "nonzero" elements may be considered neighbors.
metric_params : dict, default=None
Additional keyword arguments for the metric function.
n_jobs : int, default=None
The number of parallel jobs to run for neighbors search.
None means 1 unless in a :obj:joblib.parallel_backend context.
-1 means using all processors. See :term:Glossary <n_jobs>
for more details.
Doesn’t affect :meth:fit method.
Attributes
----------
classes_ : array of shape (n_classes,)
Class labels known to the classifier
effective_metric_ : str or callable
The distance metric used. It will be same as the metric parameter
or a synonym of it, e.g. ‘euclidean’ if the metric parameter set to
‘minkowski’ and p parameter set to 2.
effective_metric_params_ : dict
Additional keyword arguments for the metric function. For most metrics
will be same with metric_params parameter, but may also contain the
p parameter value if the effective_metric_ attribute is set to
‘minkowski’.
outputs_2d_ : bool
False when y‘s shape is (n_samples, ) or (n_samples, 1) during fit
otherwise True.

Since we didn’t include any categorical features in the examples above, I’ll begin by loading the dataframe again, this time keeping both the numerical and categorical features. Then, I’ll scale and encode the features before fitting the classifier.
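
A sketch of that preprocessing (the file name and the exact categorical columns are assumptions):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.read_csv("cleaned_titanic.csv")  # hypothetical cleaned file
numerical = ["Age", "Fare"]              # assumed numerical features
categorical = ["Sex", "Embarked"]        # assumed categorical features

X = df[numerical + categorical]
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Scale the numerical features and one-hot encode the categorical features.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numerical),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)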


Now that we have preprocessed train and test splits, we can move on to fitting a KNN classifier. Below, I will instantiate the KNN classifier with the default parameters: a K-value of 5 (5 neighbors) and a power parameter p of 2 (Euclidean distance).
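
A sketch of that fit, using the preprocessed splits from above (the accuracy numbers shown in the original notebook output aren’t reproduced here):

from sklearn.neighbors import KNeighborsClassifier

# Default parameters: n_neighbors=5, p=2 (Euclidean distance).
knn = KNeighborsClassifier()
knn.fit(X_train_processed, y_train)

print("Train accuracy:", knn.score(X_train_processed, y_train))
print("Test accuracy:", knn.score(X_test_processed, y_test))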

Not too shabby: the KNN classifier with default parameters performed pretty well. The model performs better on the train set than on the test set, indicating it may be overfit. Let’s try the same classification but with Manhattan distance (a power value of 1).
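
The same sketch with the power parameter switched to 1:

# p=1 switches the Minkowski metric to Manhattan distance.
knn_manhattan = KNeighborsClassifier(p=1)
knn_manhattan.fit(X_train_processed, y_train)

print("Train accuracy:", knn_manhattan.score(X_train_processed, y_train))
print("Test accuracy:", knn_manhattan.score(X_test_processed, y_test))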

Switching to Manhattan distance didn’t improve the performance by much. Let’s move on to trying different K-values and see how the performance compares.


Finding the Best K-Value

Unfortunately, there isn’t a set K-value that guarantees the best performance from our model. However, we can iterate over different K-values and compare the performance. Since KNN makes predictions by voting on the most frequent class, it makes sense to use only odd K-values: an odd K-value means the vote can’t be split evenly between the two classes. One way to visually compare K-values is to create an accuracy plot (see below).
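
A sketch of that comparison, continuing from the earlier snippets (the range of K-values is an assumption):

import matplotlib.pyplot as plt

k_values = range(1, 30, 2)  # odd K-values only
train_acc, test_acc = [], []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_processed, y_train)
    train_acc.append(knn.score(X_train_processed, y_train))
    test_acc.append(knn.score(X_test_processed, y_test))

plt.plot(k_values, train_acc, label="Train accuracy")
plt.plot(k_values, test_acc, label="Test accuracy")
plt.xlabel("K-value")
plt.ylabel("Accuracy")
plt.legend()
plt.show()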

Looking at the plot above, the best K-value appears to be between 11 and 13.

A K-value of 11 stands out as the best choice: the train and test accuracy lines are at their highest point while still being close together at 11, indicating the model isn’t overfit or underfit.

Conclusion

The K-Nearest Neighbors classifier is a good model to learn when getting into classification with supervised learning. Even though it doesn’t really ‘train’, understanding how the different distance metrics are used to perform classification will be helpful when it comes to finding the best K-value. The fact that KNN stores the training set in memory means that as the number of data points and features grows, the computation time will grow as well. Thank you for your time, and I hope this post will assist you in your future classification endeavors with KNN!
