Full Deep Learning Portfolio Project Part 1

Systematically Find the Optimal Training Strategy for the Birds Species Image Classification Task

Photo by Thomas Lefebvre on Unsplash

Introduction

Want to work as a machine learning engineer but have never actually worked on machine learning projects? Then finding a job will not be easy. Most companies are looking for experienced machine learning engineers, and by experience I don't mean that you have attended a machine learning online course. Don't get me wrong: it is always great to take online courses to gain more knowledge about the topic! However, it is very hard for an employer to evaluate an online course correctly; some are very easy and some are very challenging. What an employer can evaluate well, however, are your own projects. There they can see directly how you approach certain topics and what experience you have already gained.

However, in these projects you should no longer focus only on the machine learning part, which mostly happens in Jupyter notebooks; you should also think about deployment. Nowadays, training a deep learning model is not really challenging anymore: libraries like Keras allow you to create and train one within just a few lines of code. If you are able to develop a full-fledged application, however, you are a much more interesting candidate. You can also convince a recruiter with a non-technical background more easily if you have built a web page where you can demonstrate your trained model, instead of walking them through a Jupyter notebook that shows only code and print outputs.

For this reason, I have created two articles in which a Deep Learning training strategy is systematically found and applied, and the trained model is deployed via a website.

In this first part of the series I want to show you a systematic approach to finding a good model for an image classification task. As the dataset, I decided to use the open-source bird species dataset from Kaggle, which you can find [here](https://github.com/patrickbrus/Birds_Classifier_API). The code and all the documentation can be found on my Github page here. The code is written in Python.

In the second part of the series I show you how to develop a full web application, including a front end and back end, using HTML and Flask. You can find the article here.

And if you are interested in how I deployed the full web application to AWS Elastic Beanstalk using Docker and Github Actions, then I can recommend reading this article of mine.

Content

In the first section, I introduce the dataset and already apply some preprocessing to the input data. In the second section, I show you how I created the input pipeline using Tensorflow and what the code looks like for it. The Methodology section describes the various evaluations I executed to find the optimal training strategy. These evaluations can be used whenever CNNs are to be trained. The Results section shows the training results of the final training strategy. The Conclusion section summarizes the results of this project, while the Outlook section provides a reference to the second article in this series.

Exploratory Data Analysis and Pre-Processing

As a first step, I loaded the provided "Bird_Species.csv" file into a pandas data frame (Figure 1).

Figure 1: Head of initially loaded data frame.

The creator of the dataset also provides a second csv file containing the classes and some metadata about the images. I loaded this csv file into a pandas data frame as well (Figure 2).

Figure 2: Head of data frame containing the classes and image shapes.

I then quickly checked the number of different classes in this dataset using the second data frame and compared it to the number of unique labels in the "Bird_Species.csv" file (Code 1).

Interestingly, according to the "class_dict.csv" file, there are 300 different classes of birds, but in the entire dataset there are only 285 different classes. So I checked which classes are not represented in the dataset and looked in the folders to see if they are really not represented or if they are just missing in the "Bird_Species.csv" file. And indeed there were images for the missing classes. This means that the existing csv file is incorrect. So I first created a clean csv file, saved it and used it for the rest of this project. The code for this step can be found in the "Make_Clean_Dataset.ipynb" notebook here.
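A minimal sketch of this check could look like the following, assuming the label column in "Bird_Species.csv" is called labels and the class column in "class_dict.csv" is called class:

```python
import pandas as pd

# Load the two provided csv files (file and column names are assumptions)
df_birds = pd.read_csv("Bird_Species.csv")
df_classes = pd.read_csv("class_dict.csv")

# Number of classes according to class_dict.csv vs. unique labels in Bird_Species.csv
print("Classes in class_dict.csv:", df_classes["class"].nunique())
print("Unique labels in Bird_Species.csv:", df_birds["labels"].nunique())

# Which classes are listed in class_dict.csv but never appear in Bird_Species.csv?
missing_classes = set(df_classes["class"]) - set(df_birds["labels"])
print("Missing classes:", sorted(missing_classes))
```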

Now that the dataset was clean, I investigated it further. I started by plotting some random example images in order to get a feeling for what the images look like. Figure 3 shows two example bird images.

Figure 3: Two example images of the Birds Species dataset. The left image shows an African Crowned Crane and the right image shows a Gambels Quail.

As a next step, I plotted the distributions of the training and test sets already created by the dataset author (Figure 4 and Figure 5). This is important in order to check whether the dataset is imbalanced and to ensure that the test distribution and the training distribution are approximately the same.

Figure 4: Distribution of the training set created by the dataset author.
Figure 5: Distribution of the test set created by the dataset author.

As one can see, the training set is imbalanced, while the test set is fully balanced. This clearly shows that the test set comes from a different distribution than the training set. This could result in a model that performs well on the test set but badly on real-world data, because the test distribution does not reflect the "real" distribution. Therefore, I decided to create my own training, validation and test sets using stratified shuffle split from Scikit-Learn. But before splitting the data, I also one-hot encoded the labels to match the desired input format for a CNN. Code 2 shows the code for these steps, including the initial loading of the cleaned dataset csv file.
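As a rough illustration of these steps, a minimal sketch could look like the one below, assuming the cleaned csv file contains a filepaths and a labels column and using a placeholder split ratio:

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Load the cleaned csv file (file and column names are assumptions)
df = pd.read_csv("Bird_Species_clean.csv")

# One-hot encode the string labels so they match the softmax output of the CNN
labels_one_hot = pd.get_dummies(df["labels"]).astype("float32")
df_full = pd.concat([df[["filepaths"]], labels_one_hot], axis=1)

# Split off a test set first, stratifying on the class labels so that the
# training and test distributions stay approximately the same
split_test = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
train_val_idx, test_idx = next(split_test.split(df_full, df["labels"]))
df_train_val, df_test = df_full.iloc[train_val_idx], df_full.iloc[test_idx]

# Then split the remainder into training and validation sets in the same way
split_val = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
train_idx, val_idx = next(split_val.split(df_train_val, df["labels"].iloc[train_val_idx]))
df_train, df_val = df_train_val.iloc[train_idx], df_train_val.iloc[val_idx]
```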

Figure 6 and Figure 7 show the distributions of the training and test sets after the stratified shuffle split. Both sets now come from the same distribution.

Figure 6: Distribution of training set after using stratified shuffle split.
Figure 7: Distribution of test set after using stratified shuffle split.

The training dataset is still a little bit imbalanced, which should not be a problem. But let's check later whether oversampling can improve the performance on this dataset.

The figure for the validation set is not added here, but it has the same distribution as the test set and the training set.

Create Input Pipeline

For training a Deep Learning model, it is always important to have a good input pipeline. Tensorflow supports you very well here with its ImageDataGenerator class. This class allows you to specify some pre-processing steps, lets you choose from several data augmentations and loads the images batch-wise during training. This is especially important because, with large datasets, it would take a lot of RAM if all images had to be loaded at once before training.

For the bird species classification, I created the following input pipeline steps:

  1. Apply some randomly chosen data augmentations, such that approximately 10% of the data is not augmented.
  2. Normalize the image pixel values, such that they are in the range of (0,1).

I used some basic data augmentations from the imgaug Python library in order to increase the variation in the dataset. This library is very useful because it implements almost every augmentation you can think of. The data augmentations used for this project were chosen manually by applying an augmentation to an example image and evaluating whether the image still makes sense. The evaluation results of the different augmentations can be seen in the Jupyter notebook called "Check Augmentations.ipynb" in the above-mentioned Github repository. In general, one could train a model with one additional augmentation at a time and compare its performance to a baseline model without any augmentations. For the sake of this project, I decided not to run this evaluation, because I assume that every augmentation leads to an improvement over the baseline model.

I used the following augmentations (Figure 8):

  1. Flip image left right.
  2. Multiply pixel values with offset.
  3. Salt and pepper.
  4. Gamma contrast change.
  5. Add offset to pixel values.
  6. Add additive gaussian noise.
  7. Apply motion blur.
  8. Apply an affine transformation.
  9. Rotate the image.
  10. Apply an elastic transformation.

The augmentations "Add" and "Gamma Contrast" are never applied at the same time, because this could lead to unrealistic images.
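As a rough illustration of how such a pipeline can be built with imgaug (the parameter ranges below are assumptions, not the exact values used in the project):

```python
import imgaug.augmenters as iaa

# iaa.Sometimes leaves roughly 10% of the images untouched, and iaa.OneOf
# ensures that "Add" and "GammaContrast" are never applied to the same image.
augmenter = iaa.Sometimes(
    0.9,
    iaa.SomeOf(
        (1, 3),
        [
            iaa.Fliplr(0.5),                                  # flip image left/right
            iaa.Multiply((0.8, 1.2)),                         # multiply pixel values with offset
            iaa.SaltAndPepper(0.03),                          # salt and pepper noise
            iaa.OneOf([
                iaa.GammaContrast((0.7, 1.5)),                # gamma contrast change
                iaa.Add((-25, 25)),                           # add offset to pixel values
            ]),
            iaa.AdditiveGaussianNoise(scale=(0, 0.05 * 255)), # additive gaussian noise
            iaa.MotionBlur(k=5),                              # motion blur
            iaa.Affine(scale=(0.9, 1.1), rotate=(-20, 20)),   # affine transform incl. rotation
            iaa.ElasticTransformation(alpha=10, sigma=3),     # elastic transformation
        ],
    ),
)

# Apply the pipeline to a batch of uint8 images with shape (N, H, W, 3):
# images_aug = augmenter(images=images)
```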

Figure 8: Overview of applied augmentations (Image by author).

Code 3 shows the full Python code for creating the input pipeline, using the "flow_from_dataframe" option of Tensorflow. This allows the images to be loaded batch-wise during training. You only need a data frame in which the first column contains the image file name and the other columns contain the labels for this image. In order to first find the optimal CNN architecture, I decided to use a batch size of 8 and fixed the images to a size of (224, 224).
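A minimal sketch of such a pipeline, reusing the augmenter and the split data frames from above and assuming a hypothetical image directory and column name:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

IMG_SIZE = (224, 224)
BATCH_SIZE = 8

def augment_image(image):
    # Apply the imgaug pipeline defined above; imgaug expects uint8 images
    return augmenter(image=image.astype("uint8")).astype("float32")

# preprocessing_function runs first, then rescale normalizes the pixels into (0, 1)
datagen_train = ImageDataGenerator(preprocessing_function=augment_image, rescale=1.0 / 255)
datagen_eval = ImageDataGenerator(rescale=1.0 / 255)  # no augmentation for validation/test

label_cols = [c for c in df_train.columns if c != "filepaths"]

train_gen = datagen_train.flow_from_dataframe(
    dataframe=df_train,
    directory="data/images",   # hypothetical image directory
    x_col="filepaths",         # column containing the image file names
    y_col=label_cols,          # one-hot encoded label columns
    class_mode="raw",
    target_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
    shuffle=True,
)

# The validation and test generators are built the same way from df_val and
# df_test, using datagen_eval and shuffle=False.
val_gen = datagen_eval.flow_from_dataframe(
    dataframe=df_val, directory="data/images", x_col="filepaths", y_col=label_cols,
    class_mode="raw", target_size=IMG_SIZE, batch_size=BATCH_SIZE, shuffle=False,
)
```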

Methodology

The input pipeline is created. Now it is time to find the best-suited training strategy. First, some state-of-the-art CNN architectures are compared and the best-performing one is used as the architecture for the remaining part of this project. Secondly, different image sizes are compared in order to find the optimal image size. As a third step, oversampling is tried in order to tackle the slightly imbalanced dataset. And as a last step, the optimal hyperparameters are searched for using Bayesian hyperparameter search.

The code for the methodology section is not included in this article, because it would be too long. Please feel free to check my Github Jupyter notebook to see and copy the code I have created. I only included the code for oversampling the data frame, because this is the more interesting part.

CNN Architecture Comparison

There are plenty of different CNN architectures available that could be used for the birds classifier. In this project, six different CNN architectures (Table 2) are chosen and compared to each other. Each architecture is used to train a bird classifier with the hyperparameters given in Table 1. In the end, the best validation f1-score is stored and used to determine the best CNN architecture. Only the best validation f1-score is considered here; it is ignored that each architecture has a different training complexity and that one architecture might reach an only slightly lower validation f1-score with far fewer parameters than the model that reaches the highest score. For a future optimization, the training complexity could also be included in the decision of which encoder architecture to use for the final birds classifier. Table 2 also shows the final results. As one can see, the DenseNet121 achieves the best validation f1-score and is therefore used as the final encoder architecture for the birds classifier and for the other evaluations of this project. Figure 5 shows the f1-scores during the training process for the different encoders, and Figure 4 shows the loss values.
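The comparison itself can be sketched as a simple loop over encoders from tf.keras.applications; the encoder list, the hyperparameters and the plain accuracy metric below are assumptions (the project tracks a validation f1-score instead):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

N_CLASSES = len(label_cols)  # 300 bird classes

# Candidate encoders; the exact six architectures compared in Table 2 may differ
encoders = {
    "DenseNet121": tf.keras.applications.DenseNet121,
    "ResNet50": tf.keras.applications.ResNet50,
    "MobileNetV2": tf.keras.applications.MobileNetV2,
    "InceptionV3": tf.keras.applications.InceptionV3,
    "Xception": tf.keras.applications.Xception,
    "EfficientNetB0": tf.keras.applications.EfficientNetB0,
}

def build_model(encoder_fn, input_shape=(224, 224, 3)):
    # ImageNet-pretrained encoder without its top, plus a new classification head
    base = encoder_fn(include_top=False, weights="imagenet", input_shape=input_shape)
    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(N_CLASSES, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # placeholder hyperparameters
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

best_scores = {}
for name, encoder_fn in encoders.items():
    model = build_model(encoder_fn)
    history = model.fit(train_gen, validation_data=val_gen, epochs=30)
    best_scores[name] = max(history.history["val_accuracy"])
```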

Table 1: Used hyperparameters for the CNN architecture comparison.
Figure 4: Loss values during the training process of the different encoder architectures. As loss function, the categorical cross-entropy is used.
Figure 5: F1-Scores during the training process of the different encoder architectures.
Table 2: Used CNN architectures and their best scores during the training process.

Image Size Comparison

The optimal image size is the next thing to find on the way to the optimal training strategy. Four different options are evaluated and compared: a DenseNet is trained for each of the four image sizes with the hyperparameters from Table 1. The results can be found in Table 3, where again the best validation f1-score is taken as the metric for the decision. Figures 6 and 7 show the f1-scores and loss values during the training process for the different image sizes.

As one might have guessed: the larger the images, the better the performance of the model. However, from image size 192×192 to 224×224, there is only a 0.4% increase in the best validation f1-score. Therefore, I decided to use 192×192 as the image size in order to decrease the number of trainable parameters and to speed up the training process a little bit.
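A sketch of this comparison, reusing the generators and the build_model helper from above (the two smaller candidate sizes are assumptions, since only 192×192 and 224×224 are mentioned here):

```python
# Train the same DenseNet121 at several input resolutions and compare the best
# validation score; only the target_size of the generators changes.
image_sizes = [(128, 128), (160, 160), (192, 192), (224, 224)]

size_scores = {}
for size in image_sizes:
    train_gen_s = datagen_train.flow_from_dataframe(
        dataframe=df_train, directory="data/images", x_col="filepaths", y_col=label_cols,
        class_mode="raw", target_size=size, batch_size=8, shuffle=True,
    )
    val_gen_s = datagen_eval.flow_from_dataframe(
        dataframe=df_val, directory="data/images", x_col="filepaths", y_col=label_cols,
        class_mode="raw", target_size=size, batch_size=8, shuffle=False,
    )
    model = build_model(tf.keras.applications.DenseNet121, input_shape=size + (3,))
    history = model.fit(train_gen_s, validation_data=val_gen_s, epochs=30)
    size_scores[size] = max(history.history["val_accuracy"])
```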

Figure 6: F1-Scores during the training process of training a DenseNet with different image sizes.
Figure 7: Loss values during the training process of training a DenseNet with different image sizes. As loss function, the categorical cross-entropy is used.
Table 3: Best validation f1-scores and accuracies of training the DenseNet with different image sizes.

Oversampling

As mentioned earlier, the dataset is slightly imbalanced. Therefore, oversampling can be used to better balance the dataset and to reduce the risk that the model becomes biased towards the majority classes and less accurate at predicting the minority classes (Figure 8, Code 4).
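A minimal sketch of how the data frame can be over-sampled, assuming the one-hot encoded df_train and label_cols from above:

```python
import pandas as pd

def oversample_dataframe(df, label_cols, random_state=42):
    """Copy rows of the minority classes until every class has as many samples
    as the largest class."""
    class_ids = df[label_cols].values.argmax(axis=1)
    counts = pd.Series(class_ids).value_counts()
    max_count = counts.max()

    parts = []
    for class_id, count in counts.items():
        df_class = df[class_ids == class_id]
        n_missing = max_count - count
        if n_missing > 0:
            # Sample existing rows with replacement to fill the gap to the majority class
            parts.append(df_class.sample(n=n_missing, replace=True, random_state=random_state))
        parts.append(df_class)

    # Shuffle the over-sampled data frame so duplicated rows are not grouped together
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)

df_train_oversampled = oversample_dataframe(df_train, label_cols)
```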

A DenseNet121 is then trained using the over-sampled dataset and the hyperparameters from Table 1, except that the number of epochs is reduced to 20. This is because, with oversampling, the model should learn faster, since the same images appear more than once within one training epoch. The best validation f1-score of the over-sampled model is 95.4%, which is almost the same as in the case where no oversampling is applied. Therefore, oversampling is not used for training the final birds classifier, because it also requires more training time and is in general more prone to overfitting.

Figure 8: Training dataset after applying oversampling. The training set is now perfectly balanced, which is achieved by copying images of the minority classes until each class has the same number of samples.

Bayesian Hyperparameter Search

As a last step towards finding the optimal training strategy, the hyperparameters are optimized using Bayesian hyperparameter search. Bayesian search has the advantage that it is more efficient at finding the optimal hyperparameters than a random search and requires fewer iterations than a grid search. A Gaussian process with a kappa of 3 is used as the optimization strategy. Four initial points are provided manually, which should help to direct the optimization process in the right direction. The Bayesian optimization is executed for 12 iterations. The acquisition function "Lower Confidence Bound" is used, and it gets the best validation f1-score as the metric to optimize. The Lower Confidence Bound tries to minimize its optimization metric; therefore, the negative best validation f1-score is used. Table 4 shows the search space for the hyperparameters, while Table 5 shows the best parameters. The decay rate specifies by how much the learning rate is decreased every decay steps epochs.
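This setup maps naturally onto scikit-optimize's gp_minimize. The sketch below assumes a hypothetical helper train_birds_classifier that trains a DenseNet121 with the given hyperparameters and returns its best validation f1-score; the search space bounds and starting points are placeholders rather than the values from Table 4:

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

# Search space (placeholder bounds, see Table 4 for the real ones)
search_space = [
    Real(1e-5, 1e-2, prior="log-uniform", name="learning_rate"),
    Real(0.3, 0.99, name="decay_rate"),
    Integer(1, 10, name="decay_steps"),
]

def objective(params):
    learning_rate, decay_rate, decay_steps = params
    # train_birds_classifier is a hypothetical helper, not part of any library
    best_val_f1 = train_birds_classifier(
        learning_rate=learning_rate, decay_rate=decay_rate, decay_steps=decay_steps
    )
    # gp_minimize minimizes its objective, so the negative f1-score is returned
    return -best_val_f1

# Four manually chosen starting points (placeholder values)
x0 = [
    [1e-3, 0.9, 5],
    [1e-4, 0.8, 3],
    [5e-4, 0.95, 8],
    [1e-3, 0.7, 2],
]

result = gp_minimize(
    objective,
    search_space,
    x0=x0,                 # manually provided initial points
    n_initial_points=0,    # no additional random points beyond x0
    n_calls=12,            # 12 iterations in total
    acq_func="LCB",        # Lower Confidence Bound acquisition function
    kappa=3,
    random_state=42,
)
print("Best hyperparameters:", result.x)
```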

Figure 9 shows the convergence plot of the Bayesian hyperparameter search. As one can see, the best parameters are found in the last iterations. This is due to the nature of Bayesian optimization: in the beginning there is a lot of uncertainty and the model samples more in non-optimal regions, but towards the end it samples more in the optimal region and therefore finds better and better parameters.

Table 4: Optimized hyperparameters and their search space for the Bayesian hyperparameter optimization.
Table 5: Best hyperparameters found by the Bayesian hyperparameter search.
Figure 9: The convergence plot of the Bayesian optimization. This plot shows the number of iterations and the achieved optimal f1-score.

Results

The birds classifier is trained with all the findings from the Methodology section. It is now trained for 30 epochs and the best model according to the validation f1-score is stored as a Tensorflow model (Figure 10). Afterwards, the best model, with a validation f1-score of 96.2%, is loaded and evaluated on the hold-out test set in order to check the performance of the final birds classifier on unseen data. The best model achieves an f1-score of almost 96% on the hold-out test set, which is almost the same as the best validation f1-score achieved during training. This shows that the model generalizes well and is not overfitting the training and validation data.
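A minimal sketch of this final step, assuming the generators are rebuilt at 192×192 and using accuracy as a stand-in for the f1-score that is actually monitored:

```python
import tensorflow as tf
from tensorflow.keras.callbacks import ModelCheckpoint

# Keep only the model with the best validation score during the 30-epoch run
checkpoint = ModelCheckpoint(
    "best_birds_classifier.h5",
    monitor="val_accuracy",   # the project monitors the validation f1-score instead
    save_best_only=True,
    mode="max",
)

final_model = build_model(tf.keras.applications.DenseNet121, input_shape=(192, 192, 3))
final_model.fit(train_gen, validation_data=val_gen, epochs=30, callbacks=[checkpoint])

# Reload the best checkpoint and evaluate it on the hold-out test set
# (test_gen is built analogously to val_gen from df_test)
best_model = tf.keras.models.load_model("best_birds_classifier.h5")
test_loss, test_acc = best_model.evaluate(test_gen)
print(f"Test accuracy: {test_acc:.3f}")
```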

Figure 10: The training metrics for training the final birds classifier.

Conclusion

The final birds classifier achieves an f1-score of almost 96% on the hold-out test set. The DenseNet121 is used as the underlying CNN architecture because of its good performance, and the target image size is 192×192 pixels. The training set is not over-sampled, because the model trained on the over-sampled training set achieved approximately the same validation f1-score as the model trained on the original training set. Some hyperparameters are optimized using Bayesian hyperparameter search.

Outlook

In the second part of this series, I develop an API with a front end written in HTML and a back-end application using the Python library Flask. Read the second article if you are interested in how this works; you can find it here. In general, I would always recommend embedding your final machine learning model in a deployable application, because that is what a machine learning engineer needs to think about at the end of the day.


Thank you for reading my article to the end! I hope you enjoyed this article and the project I worked on. If you want to read more articles like this in the future, follow me to stay updated.

