
What is the K-Nearest Neighbor?

An introduction with a Python example

Photo by Jon Tyson on Unsplash

K-Nearest Neighbor Algorithm

K-Nearest Neighbor (KNN) is an easy-to-understand, yet essential and broadly applicable, supervised Machine Learning technique. To understand the intuition behind KNN, examine the scatterplot below. The plot shows the relationship between two arbitrary dimensions, x and y. The blue points represent members of group A and the orange points represent the members of group B. This will represent the training data for KNN.

Now suppose a new, unclassified data point is presented and plotted on the graph. KNN would look at the K nearest points (the nearest neighbors), take a majority vote, and classify accordingly. Note that K is set beforehand and represents how many points should be taken to make a vote.

For example, if K = 1, KNN would look only at the single nearest data point and give the new data point the same classification. In the example below, the "X" represents a new data point for classification. Because the X is closest to a known data point in group B, "X" would also be classified as group B.

Now suppose K = 3. KNN would look at the 3 nearest data points and take a vote for classification. If 2 or more of the nearest neighbors belong to a group, the new data point is classified with the majority.

In the example below, the new data point "X" moves. Of the 3 nearest points, 2 belong to group A and 1 belongs to group B. Because most of the points belong to group A, the new data point "X" is classified as group A.

If a tie occurs (which can happen when K is even, such as K = 2), one common resolution is to take the majority among the K-1 nearest neighbors instead.
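
To make the voting procedure concrete, here is a minimal sketch of the classification step in Python. The points and labels below are made up for illustration and are not the article's data set: compute the distance from the new point to every training point, keep the K closest, and take a majority vote.

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, new_point, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - new_point, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Made-up example: two small clusters and a new point near group B
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_classify(X_train, y_train, np.array([5.5, 5.0]), k=3))  # prints "B"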

Pros of KNN

  • Non-parametric: KNN makes no assumptions about the underlying data. Consequently, it may be applied to a broad set of problems without the need to worry about the properties of the data.
  • Lazy Learning: The algorithm has no training phase; instead, the calculation happens at the moment of classification. This makes KNN a fairly dynamic machine learning technique, since additional data can be added at any time without re-training.
  • Highly Non-Linear Data: Because no assumptions are made on the data and because no formal model is calculated, KNN works well to predict highly non-linear data.
  • Multi-Class Problems: Unlike some other algorithms that require tweaking for classifications involving more than 2 classes, KNN can be generalized into as many classes as necessary.
  • Intuitive: The algorithm is relatively simple to understand and interpret, even for a non-technical audience.

Cons of KNN

  • Memory-Intensive: Because the entire training set must be stored and every new data point compared against it, KNN can demand a lot of memory and processing power at classification time, especially on bigger data sets.
  • Curse of Dimensionality: Like other algorithms that use distance as a metric, KNN struggles to predict data with a lot of input variables.
  • Sensitive to Outliers: Outliers present a fundamental issue to KNN. Because the algorithm simply takes the nearest neighbors, however far away they may be, outliers can skew its predictions.
  • Missing Data: KNN has no built-in way to handle missing values. If a value is missing, the distance cannot be computed and the data point cannot be classified reliably.

Choosing an Optimal K

Since K is the only parameter to tune, great care should be given to selecting a good value. In general, there are two common approaches: a rule-of-thumb estimate and the Elbow Method.

As a good point of reference, taking the square root of the number of observations as K is sometimes suggested. For example, if 100 observations are fed into KNN, K = 10 would work as a quick estimation. Note that this is more of a rule of thumb than a rigorous method.
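
As a quick sketch of that rule of thumb (the variable names here are arbitrary); rounding to the nearest odd number is also common, since an odd K avoids ties in two-class problems:

import numpy as np

n_observations = 100                     # number of training samples
k = int(round(np.sqrt(n_observations)))  # square-root rule of thumb -> 10
print(k)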

The more empirical approach, however, is the Elbow Method. Based on the principle of diminishing marginal returns, the idea is to run KNN on test data, progressively increasing the K-value and observing how it affects model performance.

If done visually, the optimized value of K sits at the "elbow" of the graph (the point of maximum curvature, in more technical terms): the point where the best performance is reached before the cost of a larger K outweighs the benefit. The red circle on the illustration below demonstrates the principle.

Real-world data, however, won’t always be as clear. If the data becomes noisier, choosing the smallest possible K at a local minimum of the error rate is a viable option.

A Demonstration in Python

As with many other machine learning algorithms, the Scikit-Learn module offers a great implementation of KNN.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

In addition to the KNN module, the StandardScaler is imported to standardize the data, and pandas and NumPy are imported to handle the data.

# Store the data in a dictionary
data = {
    "X1": [1,1,3,4,5,2,0,4,0.5,3.3,1.1,4.7,0.2,2,4.5,3.3,2.5],
    "X2": [1,2,4,4,6,1,1,5,0.5,4.2,1.4,5.2,2,0.03,5.1,4.8,2.5],
    "Member": [A,A,B,B,B,A,A,B,A,B,A,B,A,A,B,B,A]
}
# Convert the data into a dataframe
df = pd.DataFrame.from_dict(data)

Next, the data set is placed into a dictionary. This set of numbers was actually used to generate the examples at the beginning of the article. The dictionary is then converted into a DataFrame for convenience.

# Separate the independent and dependent variables
features = df.filter(["X1", "X2"])
category = df["Member"]
scaler = StandardScaler()
scaler.fit(features)
scaled_features = scaler.transform(features)

Because KNN uses distance as a metric and the inputs may not use the same scale, the StandardScaler is used to standardize the numerical features. This is an essential step that prevents features measured on larger scales from dominating the distance calculation.
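
As a quick check (not part of the original walkthrough), the scaler's output can be reproduced by hand: each column is shifted to zero mean and divided by its population standard deviation.

# Manual z-score, equivalent to StandardScaler (population std, ddof=0)
manual = (features - features.mean()) / features.std(ddof=0)
print(np.allclose(manual.values, scaled_features))  # True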

k = 5
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(scaled_features, category)

The final step actually fits the KNN by calling the classifier with the scaled features and the category labels as arguments. Note the n_neighbors argument, which sets K, the number of nearest neighbors to use.
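
With the model fitted, classifying a new observation takes one more call. The point below is purely hypothetical, and note that it must be passed through the same fitted scaler as the training data:

# Hypothetical new observation, scaled with the already-fitted scaler
new_obs = pd.DataFrame({"X1": [2.5], "X2": [3.0]})
print(knn.predict(scaler.transform(new_obs)))  # prints the predicted group, 'A' or 'B'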

If the Elbow Method should be used, however, a slightly different approach is required.

# Import train_test_split and divide the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(scaled_features, category, test_size=0.30)
# Import Matplotlib for visualization
import matplotlib.pyplot as plt
# Create an empty list to catch the error rate
error_rate = []
# Iterate through K = 1 up to 20, capped at the training-set size
# (K cannot exceed the number of training points)
k_range = range(1, min(20, len(X_train)) + 1)
for i in k_range:

    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,Y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != Y_test))
# Plot the error rate for each K
plt.figure(figsize=(10,6))
plt.plot(k_range, error_rate, color='blue', linestyle='dashed', marker='o', markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
plt.show()

First, the data is split into training and test subsets (which should be standard procedure anyway). Next, the model is trained and evaluated on the test data for K = 1, K = 2, and so on up to 20, capped at the size of the training set since K cannot exceed the number of training points. Finally, the error rates are plotted.
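
If a programmatic choice is preferred to eyeballing the elbow, the K with the lowest recorded error rate can be read off directly (with a data set this small, the result will vary from run to run):

# K value with the lowest test error; ties resolve to the smaller K
best_k = k_range[int(np.argmin(error_rate))]
print(f"Lowest error rate at K = {best_k}")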

Conclusions

KNN is a simple but powerful supervised machine learning technique. Its robust, assumption-free approach allows it to be applied to a wide variety of problems. Additionally, having a single parameter, K, makes tuning relatively easy. A simple and easy-to-use implementation in Python makes using KNN a matter of a few lines.

