Multi-Class Classification Using K-Nearest Neighbours

In this article, learn what multi-class classification is and how it works.

Vatsal Sheth
Towards Data Science


Photo by Markus Spiske on Unsplash

INTRODUCTION:

Classification is a classic machine learning task. Binary classification categorises your output into one of two classes, i.e. your output can be one of two things. For example, a bank wants to know whether a customer will be able to pay his/her monthly instalments or not. We can use machine learning algorithms to determine the output of this problem, which will be either Yes or No (two classes). But what if you want to classify something that has more than two categories and isn't as simple as a yes/no problem?

This is where multi-class classification comes in. Multi-class classification can be defined as classifying instances into one of three or more classes. In this article we are going to do multi-class classification using K-Nearest Neighbours (KNN). KNN is a very simple algorithm that assumes similar things are in close proximity to each other: if a data point is near another data point, it assumes the two belong to similar classes. To learn more about the KNN algorithm, I would suggest you check out this article:
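Before moving on, it may help to see the core idea in code. Below is a minimal, illustrative sketch of a nearest-neighbour vote on made-up data; knn_predict is a toy helper written just for this example, not the scikit-learn implementation we use later in the article.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # indices of the k nearest training points
    nearest = np.argsort(distances)[:k]
    # majority vote among the neighbours' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# made-up data: two 2-D clusters labelled 0 and 1
points = np.array([[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])
print(knn_predict(points, labels, np.array([1.2, 0.8])))  # the query sits near the first cluster, so this prints 0

In practice we will rely on scikit-learn's KNeighborsClassifier, which implements the same idea with more efficient data structures and more options.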

Now that we are through all the basics, let's get to some implementation. We are going to use multiple Python libraries: pandas (to read our dataset), scikit-learn (to train and evaluate our model), and Seaborn and Matplotlib (to visualise our data). If you don't already have these libraries installed, you can install them using pip or Anaconda on your PC/laptop. Another option, which I would personally suggest, is to use Google Colab and perform the experiment online with all the libraries pre-installed. The dataset we are going to use is the Iris flower dataset; it has 4 features for each of its 150 data points and is categorised into 3 different species, i.e. 50 flowers of each species. The dataset can be downloaded from the following link:

As we get started with our code, the first step is to import all the required libraries.

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

Once you've imported the libraries, the next step is to read the data. We will use the pandas library for this. While reading, we will also check whether there are any null values and count the number of different species in the data (it should be 3, as our dataset has 3 species). We will also assign each of the three species a particular number: 0, 1 and 2.

df = pd.read_csv('IRIS.csv')
df.head()
Image by Author (top 5 rows of the original dataset)
df['species'].unique()

Output: array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

df.isnull().values.any()

Output: False

df['species'] = df['species'].map({'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}).astype(int)  # mapping species to numbers
df.head()
Image by Author (new table with mapped numbers for output)

Now that we are done with importing the libraries and the CSV file, the next step is exploratory data analysis (EDA). EDA is necessary for any problem, as it helps us visualise the data and draw some initial conclusions just by looking at it, before running any algorithms. We plot the pairwise relationships between all the features using Seaborn, as well as a scatter plot of sepal length against sepal width using the same library.

plt.close()
sns.set_style("whitegrid")
sns.pairplot(df, hue="species", height=3)
plt.show()

Output:

Image by Author (pair plot of all 4 features)
sns.set_style("whitegrid")
sns.FacetGrid(df, hue="species", height=5) \
    .map(plt.scatter, "sepal_length", "sepal_width") \
    .add_legend()
plt.show()

Output:

Image by Author

Inferences from EDA:

  1. While Setosa can be easily identified, Virginica and Versicolor have some overlap.
  2. Petal length and petal width are the most useful features for distinguishing the three flower types.

After the EDA and before training our model on the dataset, the last thing left to do is normalisation. Normalisation means bringing the values of different features onto the same scale. As different features have different scales, normalising helps the model optimise its parameters more efficiently. We normalise all our inputs to the range 0 to 1. Here, X is our inputs (hence we drop the species column) and Y is our output (the 3 classes).
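To make the rescaling explicit, min-max normalisation maps each value to (x - min) / (max - min), so the smallest value of a feature becomes 0 and the largest becomes 1. Here is a tiny illustrative sketch on a made-up column; the actual preprocessing below uses scikit-learn's MinMaxScaler.

import numpy as np

column = np.array([4.3, 5.8, 7.9])  # made-up sepal lengths in cm
scaled = (column - column.min()) / (column.max() - column.min())
print(scaled)  # the smallest value maps to 0, the largest to 1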

x_data = df.drop(['species'], axis=1)
y_data = df['species']
min_max_scaler = preprocessing.MinMaxScaler()
X_data_minmax = min_max_scaler.fit_transform(x_data)
data = pd.DataFrame(X_data_minmax, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
data.head()
Image by Author (normalised dataset)

Finally, we have reached the point of training the model. We use the built-in KNN classifier from scikit-learn. We split our input and output data into training and testing sets, so that we can train the model on the training data and check its accuracy on the testing data. We choose an 80%-20% split for our training and testing data.

X_train, X_test, y_train, y_test = train_test_split(data, y_data, test_size=0.2, random_state=1)
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)
ypred = knn_clf.predict(X_test)  # these are the predicted output values

Output:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=5, p=2, weights='uniform')

Here, n_neighbors=5 is simply the scikit-learn default rather than a value the classifier tuned for us; the optional sketch below shows one way to compare a few different values of k. After that, our final step is to evaluate the model: we calculate the confusion matrix, the precision and recall, and the overall accuracy.
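A minimal sketch, assuming the X_train and y_train split created above: it reports the mean 5-fold cross-validation accuracy for a handful of candidate values of n_neighbors. This step is optional and not part of the original workflow.

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# compare a few candidate values of k using 5-fold cross-validation on the training data
for k in [1, 3, 5, 7, 9]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    print(k, round(scores.mean(), 3))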

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, ypred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, ypred)
print("Classification Report:")
print(result1)
result2 = accuracy_score(y_test, ypred)
print("Accuracy:", result2)

Output:

Image by Author (Results for our model)

Summary/Conclusion:

We successfully implemented the KNN algorithm for the Iris dataset. We found the most impactful features through our EDA and normalised our dataset for improved accuracy. We achieved an accuracy of 96.67% with our algorithm, and we also obtained the confusion matrix and the classification report. From the classification report and the confusion matrix we can see that the model occasionally misidentifies Versicolor as Virginica.

That is how one can do multi-class classification using the KNN algorithm. I hope you learned something new and meaningful today.

Thank you.

