
An intuitive guide to kNN with implementation

A vector is known by the company it keeps

Image Source: Unsplash (Photo by Jon Tyson)

kNN is one of the simplest classification algorithms and, as a result, remains one of the ‘darlings’ of the community. There have been quite a few surveys on popular machine learning algorithms, and kNN rarely falls outside the top 10. You can check our video here and our code [here](https://www.kaggle.com/saptarsi/k-nn-sg) on kNN.

Let’s first look at the intuition. In any class attended by students of multiple disciplines, students of the same discipline tend to sit together. kNN leverages this idea.

Let’s start with an example to understand how it works.

Fig 1: Illustration of KNN (Source: Author)

There are three types (classes) of birds. The task is to find the type (class) of the new bird marked with the question mark.

We look at its three nearest neighbors (encircled), observe that the majority of them are ducks (2 out of 3), and conclude that this unlabeled observation is a duck. In this example, we have looked at the three nearest neighbors, so the value of k has been taken as 3.
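
To make the idea concrete, here is a minimal from-scratch sketch of the same majority vote with k = 3; the points, labels, and the helper function knn_predict are made up purely for illustration.

import numpy as np
from collections import Counter

# Made-up 2-D points (say, wing span and weight) with their bird classes
points = np.array([[1.0, 1.2], [1.1, 0.9], [3.0, 3.2], [2.9, 3.1], [5.0, 0.8]])
labels = np.array(['duck', 'duck', 'eagle', 'eagle', 'penguin'])

def knn_predict(query, points, labels, k=3):
    # Euclidean distance from the query point to every labeled point
    distances = np.linalg.norm(points - query, axis=1)
    # Indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels
    return Counter(labels[nearest]).most_common(1)[0][0]

print(knn_predict(np.array([1.2, 1.0]), points, labels, k=3))  # -> 'duck'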

Birds of a feather flock together

Image Source: Unsplash (Photo by Fungai Tichawangana)

Hence, there are two design decisions to be considered:

  • How many neighbors are to be considered?

If we keep k small, the algorithm picks up local patterns, but it may also pick up noise. A very high value of k looks at the global pattern; in the worst case, it behaves like the prior probability in naive Bayes: irrespective of where the data point lies, it will be assigned to the majority class.

  • Which distance measure is to be used?

It can be Euclidean, Manhattan, Hamming distance, cosine distance (one minus the cosine similarity), etc. Interested readers can look at reference 1, a book devoted entirely to distance measures.
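
As a quick illustration of how these measures differ, here is a small sketch using scipy.spatial.distance; the two vectors u and v are made-up toy values.

from scipy.spatial import distance

u = [1, 0, 1, 1]
v = [0, 0, 1, 0]

print(distance.euclidean(u, v))  # straight-line (L2) distance
print(distance.cityblock(u, v))  # Manhattan (L1) distance
print(distance.hamming(u, v))    # fraction of positions that differ
print(distance.cosine(u, v))     # cosine distance = 1 - cosine similarity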

Implementation:

The following code is enough to import and instantiate the classifier.

from sklearn.neighbors import KNeighborsClassifier
# Initialize the classifier with k = 15 neighbors
knn = KNeighborsClassifier(n_neighbors=15)

What are the important parameters of kNN in scikit?

  • n_neighbors: Same meaning as ‘k’; the default value is 5.
  • weights: The possible values are ‘uniform’ and ‘distance’. By default it is ‘uniform’, where all neighbors have an equal vote; with ‘distance’, nearer neighbors get more weight than farther ones.
  • algorithm: The safest option is ‘auto’. This controls how the neighbor distances are computed, which is the most computationally expensive part of kNN.
  • p: If p = 2 it is the Euclidean distance, and if p = 1 it is the Manhattan distance; this applies when the metric is Minkowski. For two d-dimensional vectors x1 and x2, the Minkowski distance is (Σ_i |x1_i − x2_i|^p)^(1/p), as shown in the equation below.
Minkowski Distance (Source: Author)
  • metric: By default it is ‘minkowski’; there are many options depending on the data type of the two vectors, some of which are listed in the table below.
Figure 2: Different distance metrics based on the type of data (Source: Author)

There is also an option to create your own distance metric, but that’s a separate story.
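
Putting a few of these parameters together, a typical instantiation might look like the sketch below; the particular values are illustrative choices, not recommendations.

from sklearn.neighbors import KNeighborsClassifier

# 15 neighbors, distance-weighted voting, Manhattan distance (Minkowski with p=1)
knn = KNeighborsClassifier(
    n_neighbors=15,
    weights='distance',
    algorithm='auto',
    metric='minkowski',
    p=1
)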

How to determine the optimal value of ‘k’?

The easiest solution is to run the algorithm with different values of ‘k’ and check the test-set accuracy; where the accuracy saturates, we can stop.

import numpy as np

# Initialize an array with different choices of 'k' (odd values from 1 to 99)
nc = np.arange(1, 100, 2)
# Create an array to store the accuracy values
acc = np.empty(len(nc))
i = 0
for k in np.nditer(nc):
    # Initialize kNN with different values of k
    knn = KNeighborsClassifier(n_neighbors=int(k))
    knn.fit(x_train, y_train)
    # Record the test-set accuracy
    acc[i] = knn.score(x_test, y_test)
    i = i + 1

Next, we plot the accuracy against the different values of ‘k’.

import pandas as pd
import matplotlib.pyplot as plt

# Accuracy values indexed by the corresponding value of 'k'
x = pd.Series(acc, index=nc)
x.plot()
# Add title and axis names
plt.title('Neighbor vs Accuracy')
plt.xlabel('Count of Neighbors')
plt.ylabel('Accuracy')
plt.show()
Figure 3: Determining the optimal value of ‘k’ (Source: Author)

As per Figure 3, we can stop at a small value of ‘k’, somewhere around 6 or 7.
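
Checking accuracy on a single test set can be noisy, so a common alternative is to pick ‘k’ by cross-validation. Below is a minimal sketch with GridSearchCV, assuming the same x_train and y_train arrays used above.

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Search odd values of 'k' from 1 to 99 with 5-fold cross-validation
param_grid = {'n_neighbors': list(range(1, 100, 2))}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(x_train, y_train)

print(search.best_params_)  # the value of 'k' with the best cross-validated accuracy
print(search.best_score_)   # the corresponding mean accuracy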

Why do the distance-based algorithms need scaling?

Again, let’s take an example. Say we have data from three customers with two attributes, age and salary: three vectors, each two-dimensional. We have to determine which of the other two is the nearest neighbor of C1.

Figure 4: Need for Scaling (Source: Author)

If we do the calculation, we will find that both distances come to 1, hence C2 and C3 are equidistant from C1. However, a change of 1 Rs in salary and a change of 1 year in age are not equivalent, and since the range of salary is much larger than that of age, salary will dominate (drown out) age in the Euclidean distance calculation.
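
A tiny numeric sketch, with made-up customer values, makes the problem visible.

import numpy as np

# Made-up customers: [age in years, salary in Rs]
c1 = np.array([25, 50000])
c2 = np.array([26, 50000])  # differs from C1 by 1 year of age
c3 = np.array([25, 50001])  # differs from C1 by 1 Rs of salary

print(np.linalg.norm(c1 - c2))  # 1.0
print(np.linalg.norm(c1 - c3))  # 1.0 -> "equidistant", although the gaps are not comparable

# With realistic numbers, salary differences dwarf age differences
c4 = np.array([55, 50200])      # 30 years older, 200 Rs more salary
print(np.linalg.norm(c1 - c4))  # ~202.2 -> driven almost entirely by salary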

One standard technique to bring the attributes to the same range is normalization, or min-max scaling, given by Equation 2.

Equation 2 (Normalization): x_scaled = (x − x_min) / (x_max − x_min)

A fictitious illustration with age is shown below.

Figure 5: Illustration of Min-Max Scaling (Source: Author)

So all values are mapped between 0 and 1, with the minimum value mapped to 0 and the maximum value mapped to 1.

# Importing MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
# Initialization with the target range [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
# Fitting and transforming the values
x_scaled = scaler.fit_transform(x)
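
In practice, the scaler and the classifier are often chained so that exactly the same transformation is applied to the training and test data. Here is a minimal sketch using a scikit-learn Pipeline, assuming the x_train/x_test/y_train/y_test splits from earlier.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

# Scale first, then classify; the scaler is fit only on the training data
model = Pipeline([
    ('scaler', MinMaxScaler(feature_range=(0, 1))),
    ('knn', KNeighborsClassifier(n_neighbors=15))
])
model.fit(x_train, y_train)
print(model.score(x_test, y_test))  # test-set accuracy with scaled features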

Critical Remarks:

  • kNN is a very simple algorithm and does not assume anything about the data distribution; hence it is called non-parametric.
  • Determining ‘k’ is not easy.
  • As the number of dimensions increases, many distance measures stop working well (the curse of dimensionality).
  • Finally, the computation of the neighborhood matrix is very expensive.
  • It can be trivially extended to regression, where the predicted value of the target variable is the average of its k nearest neighbors’ values (see the sketch after this list).
  • It is also tolerant to changes in the input data, as it starts its computation only when asked to classify; for this reason it is often called a lazy learner.
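
As mentioned in the list above, the regression version simply replaces the majority vote with an average. Here is a minimal sketch using scikit-learn’s KNeighborsRegressor on made-up one-dimensional data.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up 1-D regression data: y is roughly 2*x plus a little noise
x = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

# The prediction is the average of the 3 nearest neighbors' target values
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(x, y)
print(reg.predict([[3.6]]))  # average of the targets for x = 3, 4 and 5 (about 8.03)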

Reference:

[1] Deza MM, Deza E. Encyclopedia of Distances. Springer, Berlin, Heidelberg; 2009.

[2] https://towardsdatascience.com/knn-k-nearest-neighbors-1-a4707b24bd1d

