Final Showdown of Machine Learning Algorithms

CNN, MobileNet, KNN, Random Forest, and MLP. Which algorithm is the best?

Shubh Patni
Towards Data Science


The Story


It all started with my young cousin, who was lost in his own world, doodling in his drawing book. I asked him what he was doing. He replied that he was drawing a cat; it didn’t look like a cat at all. He then asked me to play a game in which I would guess what he was drawing. It was fun for a while, but I soon got bored. I didn’t want to hurt his feelings by refusing to play, so I used my computer vision and Python skills to build a doodle classifier. The question was how to implement it: there are hundreds of ways to classify a doodle, and I had to pick the most accurate one, the one that takes the least time to train, occupies the least memory, requires the least processing power, and doesn’t need terabytes of data to give meaningful results.

After some web surfing, I found the top five algorithms that seemed best suited to the task, but each website I visited told a different story. Some said CNN was the best, while others said MobileNet. I thought: well, let’s test all of them. I found a great dataset of labeled doodles, free to download, from a Kaggle competition.

Image classification is a vast and ever-changing topic: there are tons of algorithms for all kinds of applications, and new algorithms and new uses for them emerge every day. It was therefore difficult to handpick a few algorithms, given that each one has hundreds of variations. So, this post investigates which algorithm performs best for doodle classification specifically.

I will also consider how reliable these algorithms would be in other situations, such as handwritten character classification and number plate recognition.

What Is Covered

  1. A brief introduction to ML techniques used in research
  2. Evaluation Metrics
  3. Details on parameters chosen for research
  4. Results
  5. Limitations and Conclusion

Let’s Start With a Brief Intro to Machine Learning Algorithms Used


There are thousands of algorithms for doodle classification; here I have shortlisted a few famous ones to explore -

1) Random Forest

We can use the random forest algorithm for both classification and regression. It is like a decision tree, except that it uses hundreds of decision trees to reach a conclusion. A decision tree separates data into classes based on shared features: for each data point it checks whether a certain feature is present, and data points with the same features fall into the same class. In a random forest, we take many decision trees and randomly give each one only a few features to check; for example, if we have 100 features, we might give 10 random features to each tree. Some trees will assign incorrect classes, but many will be right, and we take the majority vote to build our classification model.
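To make the idea concrete, here is a minimal scikit-learn sketch; the synthetic data and all parameter values are stand-ins, not the doodle setup used later. With 100 features, max_features='sqrt' lets each split consider about 10 randomly chosen features, and the forest predicts by majority vote.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 samples with 100 features each.
X, y = make_classification(n_samples=1000, n_features=100,
                           n_informative=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# max_features='sqrt' gives each split ~10 of the 100 features,
# mirroring the "10 random features per tree" example above.
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                                n_jobs=-1, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # accuracy of the majority vote
```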

Research paper on the random forest algorithm: Leo Breiman

A great article on the random forest algorithm by Tony Yiu

2) KNN

K-nearest neighbors (KNN) can be used as both a classification and a regression algorithm. In KNN, the existing data points are separated into several classes, and the goal is to predict the class of a new sample point. To do this, KNN uses a distance formula to calculate the distance between the new point and the existing data points; based on these distances, it defines region boundaries for each class. Any new data point falls into one of these regions and is assigned that class.
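As a quick illustration of this distance-based voting, here is a minimal NumPy sketch; the points, labels, and k value are made up for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    distances = np.linalg.norm(X_train - x_new, axis=1)  # distance to every training point
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                     # majority vote among neighbors

X_train = np.array([[0, 0], [0, 1], [5, 5], [6, 5]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([5, 6]), k=3))  # falls in class 1's region
```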

Research paper on KNN: Gongde Guo

A great article on KNN by Renu Khandelwal

3) MLP

A multi-layer perceptron (MLP) is a form of feed-forward artificial neural network. An MLP has many layers, with a logistic function in its hidden layers and a softmax function at the output layer. The algorithm takes a single large vector as input and performs matrix operations between the input layer and the first hidden layer; the result passes through a logistic function, and its output feeds into the next hidden layer. This process repeats until the network reaches the output layer, where the softmax function produces a single prediction.
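Here is a minimal NumPy sketch of one forward pass as just described, with a logistic (sigmoid) hidden layer and a softmax output; the weights are random placeholders rather than trained values, and the layer sizes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.random(784)                                      # a flattened 28*28 doodle as one big vector
W1, b1 = rng.standard_normal((128, 784)), np.zeros(128)  # input -> hidden
W2, b2 = rng.standard_normal((5, 128)), np.zeros(5)      # hidden -> output

h = sigmoid(W1 @ x + b1)      # matrix operation followed by the logistic function
probs = softmax(W2 @ h + b2)  # softmax turns the output layer into class probabilities
print(probs, probs.sum())     # probabilities over 5 classes, summing to 1
```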

Research paper on MLP: Popescu Marius

A great article on MLP by Jorge Leonel

4) CNN

A convolutional neural network (CNN) is one of the easiest deep learning computer vision algorithms to implement. First, it takes an input image of a given size and creates multiple filters/feature detectors for it (each initially a randomly generated matrix of a given size). A filter aims to recognize certain patterns in an image: it slides across the image, and at each position a matrix multiplication is performed between the filter and the patch of the image it covers. After gathering features this way, we apply an activation function, most commonly the rectified linear unit (ReLU), to increase non-linearity and preserve only the important features. We then use a max-pooling function, which keeps the largest value in each window of a given size (e.g., a 2*2 window reduces 4 values to 1), shrinking the output to make the network faster. The last step is to flatten the final matrix and pass it as input to a basic artificial neural network (ANN) to get class predictions.
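The sliding-filter, ReLU, and max-pooling steps can be sketched in a few lines of NumPy; the image, the filter values, and the window sizes here are illustrative only.

```python
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # elementwise multiply the filter with the patch it covers, then sum
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(fmap, size=2):
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.random.rand(28, 28)  # one 28*28 doodle
kernel = np.random.randn(3, 3)  # a randomly initialized feature detector
features = np.maximum(conv2d(image, kernel), 0)  # ReLU keeps only strong responses
pooled = max_pool(features)     # downsample to speed up later layers
print(pooled.shape)             # (13, 13)
```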

Research paper on CNN: Keiron Teilo O’Shea

A great article on CNN by Nagesh Singh Chauhan

5) MobileNet

The MobileNet architecture uses depth-wise separable convolutions, each comprising a depth-wise convolution and a point-wise convolution. A depth-wise convolution is a channel-wise Dk*Dk spatial convolution: if an image has 3 channels (R, G, B), we get 3 separate Dk*Dk spatial convolutions, one per channel. A point-wise convolution has a kernel of size 1*1*M, where M is the number of channels coming out of the depth-wise step, in this case 3. We slide this single 1*1*3 kernel over the depth-wise output to produce a Dk*Dk*1 output; by creating N such kernels, each producing a Dk*Dk*1 map, we obtain a final output of shape Dk*Dk*N. Chaining the depth-wise convolution into the point-wise convolution completes the block. This architecture reduces training time, since there are fewer parameters to tune, while having only a minor effect on accuracy.
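In Keras, one depth-wise separable block can be sketched as below; the 3*3 kernel size and the 64 point-wise filters are assumptions for illustration, not values from the MobileNet paper.

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(32, 32, 3))  # 3 input channels (R, G, B)
# Depth-wise step: one Dk*Dk (here 3*3) spatial convolution per channel.
x = tf.keras.layers.DepthwiseConv2D(kernel_size=3, padding='same')(inputs)
# Point-wise step: N (here 64) kernels of size 1*1*3 that mix the channels.
x = tf.keras.layers.Conv2D(filters=64, kernel_size=1)(x)
model = tf.keras.Model(inputs, x)
model.summary()  # far fewer parameters than one full 3*3 convolution with 64 filters
```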

Research paper on MobileNet: Andrew G. Howard

A great article on MobileNet by Sik-Ho Tsang

Evaluation Metrics

Sample of doodles used for research

Above is a sample of the doodles used for this research.

I trained my machine learning models on the Kaggle Quick, Draw! dataset, which comprises 50 million images of different types of doodles. From it, I took 5 randomly chosen types of doodles and split the images into two parts: 35,000 for training and 15,000 for testing. I measured the training time of each algorithm, and on the test set I calculated the mean average precision, accuracy, and recall of each algorithm.

Evaluation metrics:

  1. Training Time
  2. Mean Average Precision
  3. Accuracy
  4. Recall
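For reference, these metrics can be computed with scikit-learn roughly as follows. The data and the KNN model below are hypothetical stand-ins for the actual doodles and classifiers, and macro-averaged precision over the 5 classes is used here as a simple proxy for mean average precision.

```python
import time
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-ins for the doodle data: 5 classes of flattened 28*28 images.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((500, 784)), rng.integers(0, 5, 500)
X_test, y_test = rng.random((100, 784)), rng.integers(0, 5, 100)

model = KNeighborsClassifier(n_neighbors=5)
start = time.time()
model.fit(X_train, y_train)
training_time = time.time() - start  # wall-clock training time

y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, average='macro', zero_division=0))
print("recall:", recall_score(y_test, y_pred, average='macro', zero_division=0))
print("training time (s):", training_time)
```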

More on evaluation metrics by Shashwat Tiwari

Also, a good post by Shervin Minaee

Details on Parameters Chosen


1) Random forest

n_estimators — the number of decision trees in the forest. [10, 50, 100]

max_features — the number of features to consider when looking for the best split. ['auto', 'sqrt']

max_depth — the maximum number of levels in each tree. [2, 4, 6, 8, 10]

n_jobs — the number of processes to run in parallel; usually set to -1 to run as many as possible at a time.

criterion — the function used to measure the quality of a split, which drives how the model reduces its loss. ['gini', 'entropy']

I used 'auto' as max_features, 8 as max_depth, -1 as n_jobs, and 'entropy' as the criterion, because these usually give the best results.

Graph to find the optimum number of trees

However, to find the optimal number of trees, I used GridSearchCV, which tries every combination of the given parameters and tabulates the results. As can be seen from the figure, there is no significant increase in the test score after 80 trees, so I decided to train my classifier with 80 trees.
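The sweep looked roughly like the sketch below; the dummy data stands in for the doodles, and including 80 in the grid is an assumption based on where the curve flattens.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_train, y_train = rng.random((500, 784)), rng.integers(0, 5, 500)  # stand-in doodles

# Fixed parameters as chosen above; 'sqrt' is what 'auto' meant for
# classifiers in older scikit-learn versions.
base = RandomForestClassifier(max_features='sqrt', max_depth=8,
                              criterion='entropy', n_jobs=-1)
search = GridSearchCV(base, {'n_estimators': [10, 50, 80, 100]}, cv=3)
search.fit(X_train, y_train)
print(search.best_params_)  # the test score plateaus around 80 trees
```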

2) K-Nearest Neighbors (KNN)

n_neighbors — the number of nearest data points to compare. [2, 5, 8]

n_jobs — the number of processes to run in parallel; usually set to -1 to run as many as possible at a time.

I kept the default values for the other parameters because they give the best results.

However, to find the optimum value of n_neighbors, I used GridSearchCV, and this is the graph I got:

Graph to find the optimum value of n_neighbors

According to the graph, the test score declines after n_neighbors = 5, which makes 5 the optimum number of neighbors.
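That sweep corresponds to something like this sketch, again with dummy stand-in data.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.random((500, 784)), rng.integers(0, 5, 500)  # stand-in doodles

search = GridSearchCV(KNeighborsClassifier(n_jobs=-1),
                      {'n_neighbors': [2, 5, 8]}, cv=3)
search.fit(X_train, y_train)
print(search.best_params_)  # the graph above peaks at n_neighbors=5
```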

3) Multi-Layer Perceptron (MLP)

alpha — the L2 regularization (penalty) term. It is often mistaken for the learning rate, but in scikit-learn's MLPClassifier it controls how strongly large weights are penalized. [0.01, 0.0001, 0.00001]

hidden_layer_sizes — a tuple giving the number of hidden nodes in each layer. [(50,50), (100,100,100), (750,750)]

activation — the non-linear function applied at each node; it emphasizes the important characteristics of an image and suppresses irrelevant information. ['relu', 'tanh', 'logistic']

solver — also known as the optimizer; the technique used to train the weights of the network. ['sgd', 'adam']

batch_size — the number of images to be processed at once. [200, 100, 200]

I chose 'relu' as the activation and 'adam' as the solver because these parameters give the best results.

However, to choose the number of hidden layers and alpha, I used GridSearchCV.

Table to find the optimum alpha and hidden_layer_sizes

As can be seen in the table, the best result is obtained when alpha is 0.001 and hidden_layer_sizes is (784,784). Therefore, I decided to use those parameters.
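Put together, the chosen configuration corresponds to something like the scikit-learn sketch below; the dummy data and the small max_iter are assumptions to keep the example quick.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.random((500, 784)), rng.integers(0, 5, 500)  # stand-in doodles

mlp = MLPClassifier(hidden_layer_sizes=(784, 784),  # two hidden layers of 784 nodes
                    alpha=0.001, activation='relu', solver='adam',
                    batch_size=200, max_iter=20)
mlp.fit(X_train, y_train)
print(mlp.score(X_train, y_train))
```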

4) Convolutional Neural Networks (CNN)

learning_rate — how fast the network adjusts the weights along the gradient. [0.01, 0.0001, 0.00001]

hidden_layer_sizes — a tuple giving the number of hidden nodes in each layer. [(50,50), (100,100,100), (750,750)]

activation — the non-linear function applied at each node; it emphasizes the important characteristics of an image and suppresses irrelevant information. ['relu', 'tanh', 'logistic']

solver — also known as the optimizer; the technique used to train the weights of the network. ['sgd', 'adam']

batch_size — the number of images to be processed at once. [200, 100, 200]

epochs — the number of complete passes the model makes over the training data. [10, 20, 200]

I chose 'relu' as the activation function and 'adam' as the solver because these usually give the best results. In the network, I added 3 convolution layers, 2 max-pool layers, 3 dropout layers, and a softmax activation at the end. I did not use GridSearchCV here because there are too many possible combinations to try, and the results would not differ much.
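In Keras, the described network (3 convolution layers, 2 max-pool layers, 3 dropout layers, and a softmax output) could look like the sketch below; the filter counts and dropout rates are assumptions, since the text does not specify them.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Conv2D(64, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Conv2D(64, 3, activation='relu'),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(5, activation='softmax'),  # 5 doodle classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```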

5) MobileNet

input_shape — a tuple of the image dimensions. [(32,32,1), (128,128,3)]

alpha — the width multiplier of the network, which scales the number of filters in each layer. [<1, >1, 1]

activation — the non-linear function applied at each node; it emphasizes the important characteristics of an image and suppresses irrelevant information. ['relu', 'tanh', 'logistic']

optimizer — also known as the solver; the technique used to train the weights of the network. ['sgd', 'adam']

batch_size — the number of images to be processed at once. [200, 100, 200]

epochs — the number of complete passes the model makes over the training data. [10, 20, 200]

classes — the number of classes to classify. [2, 4, 10]

loss — the method used to calculate the loss, i.e. the difference between the predicted and actual values. ['categorical_crossentropy', 'RMSE']

First, I resized the 28*28 images to 140*140, as MobileNet requires images of at least 32*32; the final input_shape I used was therefore (140, 140, 1), where 1 is the number of image channels (black and white in this case). I set alpha to 1 because it usually gives the best results. The activation function was left at its default, 'relu'. I used the 'Adadelta' optimizer, as it gave the best results. batch_size was set to 128 to train the model faster, and I used 20 epochs for better accuracy. classes was set to 5, as we have 5 classes to classify.
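The corresponding Keras setup might look like this; weights=None trains the network from scratch, since the single-channel 140*140 input is incompatible with the pretrained ImageNet weights.

```python
import tensorflow as tf

model = tf.keras.applications.MobileNet(input_shape=(140, 140, 1),
                                        alpha=1.0, weights=None, classes=5)
model.compile(optimizer=tf.keras.optimizers.Adadelta(),
              loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(x_train, y_train, batch_size=128, epochs=20)
# x_train: doodles resized to 140*140; y_train: one-hot labels for the 5 classes
```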

Results

Final Results (Fingers crossed)

Above is the performance of all the machine learning techniques used: accuracy, recall, precision, and training time. It shocked me to see that MobileNet's training time was 46 minutes, as it is considered a light model. I am not sure why that is the case; if you know why, please let me know.

Limitations and Conclusion

  1. For this research, only black-and-white doodles of size 28*28 were used, whereas in the real world different colors can portray different meanings and image sizes may differ. The algorithms might therefore behave differently in those situations.
  2. In all the algorithms discussed, there are many hyper-parameters that can be varied, which might give different results.
  3. The training set was limited to 35,000 images; adding more images could improve the performance of these algorithms.

The results suggest that MobileNet achieved the highest accuracy, precision, and recall, making it the best algorithm on those three measures. However, its training time was also the highest. If we compare it with CNN, we see that CNN took much less time to train while giving similar accuracy, precision, and recall. Therefore, according to this research, I would conclude that CNN is the best algorithm.

After doing this research, I conclude that algorithms like MobileNet and CNN can be used for handwritten character recognition, number-plate detection, and similar tasks in banks around the world. Both achieved an accuracy of over 97%, which is better than the average human performance of 95%. Thus, these algorithms can be used in real life to automate hard or time-consuming processes.

You can find the code here:
