Towards machine learning — K Nearest Neighbour (KNN)

A simple K Nearest Neighbour (KNN) classification of car classes according to their fuel consumption and engine size.

Karlo Leskovar
Towards Data Science

--

Introduction

Machine Learning (ML) is a very popular term nowadays. I think it is impossible to spend a day online without coming across it. Like many before me, and I bet many after me, some time ago I started my journey into this interesting area.

There are many well written and well thought out online resources on this topic. I am not trying to reinvent the wheel; I will just share my perspective and thoughts from going through a popular online course by the University of Michigan, Applied Machine Learning in Python. My goal is also to emphasize some of the key insights of the K Nearest Neighbour (KNN) method, which I hope will help a beginner (like me 😊) when starting with this popular machine learning method.

To make things a bit more interesting, I used the fuel economy dataset from Udacity.

KNN in short

The KNN algorithm is a type of supervised ML often used in classification and regression applications. KNN classifiers are instance-based (memory-based) classifiers, which means they memorize the training instances and rely on the similarity between feature vectors to predict the class of a new (unseen) data point. In general, we need a training dataset on which our model is fit, and we evaluate the model's performance on an independent test dataset to check its accuracy.

How exactly does KNN work? Well, let us say we have a dataset (array) of points which need to be classified. Check the image below.

Source: author

Then, a new data point (brown) is added at the location (50, 40). We check the K nearest points (“neighbours”) to that new point. We assume K = 5, so we need to find the 5 nearest points to the new entry (the brown point).

Source: author

With K = 5, the 5 nearest points in terms of Euclidean distance are found (light purple lines). Out of those 5 points, 4 are red, therefore our brown point gets classified into the “red” corner.

K = 5 represents a balance between a noisy model sensitive to outliers and possibly mislabelled data (a small K, e.g. K = 1 or K = 2) and a large K, which causes underfitting: the result is less detailed, or in the worst-case scenario everything becomes one class. The right K depends on the dataset. Most often K is an odd number, to avoid ties when a point is equally distant from 2 known classes.
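The distance-and-vote procedure described above can be sketched in a few lines of NumPy. The coordinates below are illustrative, not the actual points from the figures:

```python
import numpy as np
from collections import Counter

# Toy 2-D points with two known classes ("red" / "blue")
points = np.array([[48, 42], [52, 38], [47, 39], [55, 44], [51, 41],
                   [10, 80], [12, 85], [8, 78]])
labels = ["red", "red", "red", "blue", "red", "blue", "blue", "blue"]

def knn_predict(new_point, k=5):
    # Euclidean distance from the new point to every known point
    dists = np.linalg.norm(points - np.array(new_point), axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k closest points
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]        # majority class wins

print(knn_predict((50, 40)))  # → red
```

Four of the five nearest neighbours of (50, 40) are red, so the new point is voted into the red class, exactly as in the figure.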

Enough theory, let us get practical! 😊

Python implementation

Reading and examining the data

First we import the dependencies; in this case we need numpy, pandas, matplotlib and scikit-learn.
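A typical import block for this kind of workflow (the exact imports in the original notebook may differ slightly):

```python
# Core numerics, data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# scikit-learn pieces used later: splitting the data and the KNN classifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
```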

We also need some data. In this example, the popular car fuel economy dataset from Udacity is used. This is a dataset of 3929 cars with some basic information about each, such as make, model, year, transmission type, vehicle class, number of cylinders, fuel consumption, CO2 output, etc.

Here, I will give an example of how to classify a car of unknown vehicle class into the appropriate class using the KNN algorithm, with the number of cylinders, displacement, combined fuel consumption and CO2 output as features.

We check the unique values of the VClass column to get all the vehicle classes present in our dataset; there are 5 of them. We also create a new column with numeric class labels for plotting purposes.
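A sketch of this step. In the real workflow the frame comes from the Udacity CSV (e.g. `pd.read_csv("fuel_econ.csv")` — the file name is an assumption); here a tiny stand-in frame with the relevant columns is used so the snippet is self-contained:

```python
import pandas as pd

# Stand-in for the 3929-row fuel economy dataset; column names follow
# the Udacity file (VClass, cylinders, displ, comb, co2)
df = pd.DataFrame({
    "make":      ["Ford", "BMW", "Toyota", "Audi"],
    "VClass":    ["Subcompact Cars", "Large Cars", "Compact Cars", "Midsize Cars"],
    "cylinders": [4, 8, 4, 6],
    "displ":     [1.6, 4.4, 1.8, 3.0],
    "comb":      [32, 17, 40, 24],
    "co2":       [277, 519, 222, 370],
})

print(df["VClass"].unique())   # all vehicle classes present in the data

# Numeric labels for plotting: one integer code per class
df["label"] = df["VClass"].astype("category").cat.codes
print(df[["VClass", "label"]])
```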

As our model needs to be trained first, we split the data into training and test sets (the latter used later for evaluation). To do so, we use the train_test_split function from the scikit-learn library.

With a large dataset (3929 entries) we don’t need to worry much about the train and/or test size. Therefore, the test_size argument of the function is left at None (which defaults to 0.25), meaning the test set contains 25% of the data.
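The split can be sketched as follows; the small synthetic feature matrix stands in for the real columns (cylinders, displ, comb, co2):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in feature matrix and target (the real X comes from the dataset)
X = pd.DataFrame({"cylinders": [4, 8, 4, 6] * 10,
                  "displ":     [1.6, 4.4, 1.8, 3.0] * 10,
                  "comb":      [32, 17, 40, 24] * 10,
                  "co2":       [277, 519, 222, 370] * 10})
y = pd.Series(["Subcompact", "Large", "Compact", "Midsize"] * 10)

# test_size is left at the default None, which falls back to 0.25
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=11)
print(len(X_train), len(X_test))  # → 30 10
```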

It’s always good practice to examine the data. For this purpose, a feature pair plot is used.
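A pair plot can be produced e.g. with pandas' built-in scatter_matrix (seaborn's pairplot is a common alternative; which one the original notebook used is an assumption). Random stand-in features replace the real training frame here:

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend; drop for live use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in for the training features (cylinders, displ, comb, co2)
df_train = pd.DataFrame(rng.normal(size=(100, 4)),
                        columns=["cylinders", "displ", "comb", "co2"])

# One scatter panel per feature pair, histograms on the diagonal
axes = pd.plotting.scatter_matrix(df_train, figsize=(8, 8), diagonal="hist")
print(axes.shape)  # → (4, 4)
```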

Pair plot of the training dataset

Some interesting patterns can be observed. For example, as the cylinder count increases, the displacement and CO2 output increase, while the average miles-per-gallon range decreases. Also, larger fuel consumption produces more CO2.

Another great way to observe the data is a 3D plot. Here CO2, cylinder count and displacement are plotted on the x, y and z axes, respectively.
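A minimal sketch of such a 3D scatter with matplotlib (random stand-in columns replace the real CO2, cylinder and displacement values):

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend; drop for live use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
co2, cyl, displ = rng.normal(size=(3, 100))   # stand-in data columns

fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(projection="3d")          # request 3-D axes
ax.scatter(co2, cyl, displ)
ax.set_xlabel("CO2")
ax.set_ylabel("Cylinders")
ax.set_zlabel("Displacement")
```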

3D plot — CO2 vs. Cylinders vs. Displacement

Again, one can observe that cars with more cylinders and larger displacement tend to produce more CO2, although some smaller cars with 4 or 6 cylinders can also produce significant amounts of CO2 (probably sporty cars with high power output). But in general, the larger the engine, the more CO2 it produces.

Setting up and testing the model

First we need to choose K, the number of nearest neighbours. As explained earlier, for this example K = 5 is chosen.

Then the model is trained, and the accuracy is displayed.
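The fit-score-predict pattern looks as follows. Synthetic stand-in data replaces the real split here, and the two prediction rows (meant to mimic a small and a large car) use made-up feature values, so the printed accuracy will not match the article's 0.4933:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Two synthetic clusters in (cylinders, displ, comb, co2) space
small = rng.normal([4, 1.6, 32, 250], [0.5, 0.2, 3, 20], size=(50, 4))
large = rng.normal([8, 4.4, 17, 500], [0.5, 0.3, 2, 30], size=(50, 4))
X = np.vstack([small, large])
y = np.array(["Subcompact"] * 50 + ["Large"] * 50)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=11)

knn = KNeighborsClassifier(n_neighbors=5)    # K = 5, as chosen above
knn.fit(X_train, y_train)
print(f"accuracy: {knn.score(X_test, y_test):.4f}")

# Classify new cars; the feature values are rough guesses for illustration
print(knn.predict([[4, 1.6, 32, 260],        # Fiesta-like
                   [8, 4.4, 17, 520]]))      # M5-like
```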

The accuracy of the model is 0.4933, which doesn’t look especially good on paper. But for the two cars which I’ve tried, the model gave good predictions.

A Ford Fiesta is a “Subcompact” and a BMW M5 is a “Large car”.

Does accuracy change with different K?

Here we will check if changing the number of nearest neighbours K can change the accuracy of our model. To do so, we write a simple for loop. The random_state is also varied, which leads to different train/test split combinations.
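The loop might look like this; synthetic, partially overlapping classes stand in for the real data, and the specific random_state values (11 and 44) are taken from the plots below:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Two overlapping synthetic classes in 4-D feature space
X = np.vstack([rng.normal(0, 1.5, size=(100, 4)),
               rng.normal(1, 1.5, size=(100, 4))])
y = np.array([0] * 100 + [1] * 100)

accuracies = {}
for random_state in (11, 44):        # different train/test split combinations
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=random_state)
    accuracies[random_state] = [
        KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_te, y_te)
        for k in range(1, 26)]       # try K = 1 .. 25

print({rs: max(acc) for rs, acc in accuracies.items()})
```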

random_state = 11

For this dataset and this train/test split, larger K values lead to smaller accuracy. It is possible that a different train/test split would yield different results.

random_state = 44

It is interesting to observe that the split with random_state = 44 produces both a larger maximum and a larger minimum accuracy.

Can a different train/test ratio change the accuracy of the model?

And last but not least, we will check if changing the train/test split ratio can increase the accuracy of our model. First we generate a list t of possible train/test ratios. Again, K is set to 5. Then a loop is created where, for each possible ratio in t, 500 simulations are run and the accuracy of each is calculated. The mean accuracy of the 500 runs for each ratio is then plotted as a scatter plot to observe the behaviour.
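A scaled-down sketch of that experiment (synthetic stand-in data again; the run count is reduced here from the article's 500 so the snippet finishes quickly):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.5, size=(100, 4)),
               rng.normal(1, 1.5, size=(100, 4))])
y = np.array([0] * 100 + [1] * 100)

t = [0.2, 0.4, 0.6, 0.8]     # candidate test-set fractions
n_runs = 20                  # the article uses far more runs per ratio
mean_acc = []
for ratio in t:
    scores = []
    for run in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=ratio, random_state=run)
        scores.append(KNeighborsClassifier(n_neighbors=5)
                      .fit(X_tr, y_tr).score(X_te, y_te))
    mean_acc.append(np.mean(scores))  # average accuracy for this ratio

print(dict(zip(t, np.round(mean_acc, 3))))
```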

random_state = 11
random_state = 44

As observed from the plot, the accuracy decreases with a decrease in training sample size. Changing the random_state seems to have no effect here, which makes sense, since we run 500 simulations for each of the train/test ratios and then calculate the mean accuracy.

Conclusion

Here, a short implementation of a KNN classifier on the popular car fuel consumption dataset is presented. First, a brief explanation of the KNN method is given. The dataset is then examined to spot patterns in the data. Finally, we define a KNN classifier with K set to 5, which means that a new data point is classified according to its 5 nearest neighbours.

For the two independent test examples, a Ford Fiesta and a BMW M5, the classification works fine. The model is also evaluated for different values of K, where for this particular dataset smaller values of K seem to yield a more accurate model. Changing the random_state to a different value can alter this behaviour, but not significantly.

Also, an increase in the training-to-test size ratio should yield better results. Here, due to the large number of simulations, changing the random_state value has less impact.


PhD → Hydrology & Deep Learning (ANN hydrological models). PostDoc at University of Zagreb.