
A project manager once told me about the 80/20 rule. He complained that the last part of a project was taking too long. Implementing the last 20% of features was taking over 80% of the time.
This is known as the 80/20 rule, or the Pareto principle, named after the economist Vilfredo Pareto. It states that 20 percent of your efforts produce 80 percent of the results.
The 80/20 rule also holds for improving the accuracy of my Deep Learning model. It was straightforward to create a model with 88% accuracy, but I have the feeling that improving it by an extra 3 percent, to reach the top of the leaderboard, will take a lot more time.
If you do not know what I am talking about, I invite you to read my previous article. That article ends with five possible techniques to improve the model’s accuracy. I learned these five techniques from the Kaggle community.
- Use bigger pre-trained models
- Use K-Fold Cross-Validation
- Use CutMix to augment your images
- Use MixUp to augment your images
- Use Ensemble learning
I tried each of these techniques and combined them. This is what happened.
All of the source code is available in this GitHub repository.
1. Use bigger pre-trained models
Before, we used EfficientNet-B3. This model was a good trade-off between computational cost and accuracy. See below. But EfficientNet offers other models that provide even greater accuracy – for example, EfficientNet-B4.

These more complex models have more parameters. More parameters require more computing power and memory during training. I started with EfficientNet-B4, which gave an excellent result: the validation accuracy went up to 90%, and the validation loss came down to 0.32.
If you are interested in the implementation, see my previous article or this GitHub repository. The only change necessary was changing B3 into B4.
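For illustration, here is a minimal sketch of such a swap in tf.keras; the classifier head and input size below are my assumptions, not the exact code from the repository.

```python
import tensorflow as tf

def build_model(num_classes: int = 5) -> tf.keras.Model:
    # EfficientNet-B4 expects larger inputs than B3 (380x380 instead of 300x300).
    base = tf.keras.applications.EfficientNetB4(
        include_top=False, weights="imagenet", input_shape=(380, 380, 3))
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model(base.input, outputs)
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```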


After I submitted the model to Kaggle, it showed a small increase in the public score. It went from 88.9% to 89.1%. An improvement of 0.2%.

I also tried the other EfficientNet models, such as EfficientNet-B5 and EfficientNet-B6. The accuracy did not increase.
2. Use K-Fold Cross-Validation
Until now, we split the images into a training and a validation set. This means we never train on the entire training set, because part of it is held back for validation.
Another method for splitting your data into a training set and validation set is K-Fold Cross-Validation. This method was first described by Stone in 1974.
With K-Fold Cross-Validation, you divide the images into K parts of equal size. You then train your model K times, each time with a different training and validation set.
This way, you make optimal use of all your training data.

It is crucial to note that you will train many models, one for each fold. This means changing the way we make predictions. We have the following options.
- Use a single model, the one with the highest accuracy or lowest loss.
- Use all the models. Create a prediction with all the models and average the result. This is called an ensemble.
- Retrain an alternative model using the same settings as the one used for the cross-validation. But now use the entire dataset.
Implementing K-Fold Cross-Validation
The scikit-learn library contains two objects that help us divide our training data into folds: KFold and StratifiedKFold.
KFold
The KFold object splits our training data into k consecutive folds. When creating the object, you choose the number of folds. If you then call split on the object, it returns two arrays: the first contains the indices of our training data to use for training, the second the indices to use for validation.
In row five, we create the KFold object and instruct it to create five different folds and shuffle the data. Then, in row eight, we start a loop that runs five times. Each run returns the train_index and val_index arrays, which contain indexes into the train_data data frame. We then create two subsets, training_data and validation_data, that we use with the ImageDataGenerator.
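As a minimal sketch (assuming train_data is a Pandas data frame holding the image filenames and labels), the split looks like this:

```python
from sklearn.model_selection import KFold

# Five shuffled folds, as described above.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

for train_index, val_index in kfold.split(train_data):
    training_data = train_data.iloc[train_index]    # rows for training
    validation_data = train_data.iloc[val_index]    # rows for validation
    # Both subsets then feed flow_from_dataframe on an ImageDataGenerator.
```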
StratifiedKFold
StratifiedKFold differs from normal KFold in that it ensures each fold has the same percentage of samples of each class. This is especially useful if your training data is not uniformly distributed.

This is the case with our training data: the classes are not uniformly distributed. So, we use StratifiedKFold. The implementation is the same as with KFold.
We iterate over all folds and use each training and validation set to train a model. Each model gets a unique filename. We use that filename when saving the model with the lowest loss.
We then add the history object that TensorFlow returns from the fit method to an array. We use this array to create a graph of each fold at the end of the training. The pre-trained model that I used for each fold was EfficientNet-B3.
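The per-fold training loop could look like the sketch below; build_model stands in for the EfficientNet-B3 setup from the previous article, and the column names and hyperparameters are my assumptions.

```python
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.preprocessing.image import ImageDataGenerator

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
histories = []
datagen = ImageDataGenerator(rescale=1.0 / 255)

# StratifiedKFold also needs the labels, to keep the class balance per fold.
for fold, (train_index, val_index) in enumerate(
        skf.split(train_data, train_data["label"])):
    training_data = train_data.iloc[train_index]
    validation_data = train_data.iloc[val_index]

    # Labels are assumed to be strings for class_mode="categorical".
    train_iter = datagen.flow_from_dataframe(
        training_data, directory="train_images",
        x_col="image_id", y_col="label",
        target_size=(300, 300), class_mode="categorical")
    val_iter = datagen.flow_from_dataframe(
        validation_data, directory="train_images",
        x_col="image_id", y_col="label",
        target_size=(300, 300), class_mode="categorical")

    model = build_model()  # EfficientNet-B3-based model
    # Save only the weights with the lowest validation loss, one file per fold.
    checkpoint = ModelCheckpoint(f"model_fold_{fold}.h5",
                                 monitor="val_loss", save_best_only=True)
    history = model.fit(train_iter, validation_data=val_iter,
                        epochs=10, callbacks=[checkpoint])
    histories.append(history)  # used later to plot each fold
```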
The individual graphs did not show an increase in validation accuracy, as you can see in the charts of folds 1 and 2.


Creating and submitting predictions with K-Fold
To make a prediction, we have to calculate the average of all the individual predictions. To do this, we first load all the models and store them in a list.
Then we call the load_and_predict function. This function iterates over all the loaded models and adds each model's prediction to a list.
In row 26, we use NumPy to calculate the average of all predictions. Then, in row 27, we iterate over the averaged predictions to construct the submission array. We load this array into a Pandas data frame, which can be saved directly to a CSV file.
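A hedged sketch of this averaging step (the function name load_and_predict comes from the article; test_iterator and test_ids are my assumptions):

```python
import numpy as np
import pandas as pd
import tensorflow as tf

# Load the best model of each fold.
models = [tf.keras.models.load_model(f"model_fold_{i}.h5") for i in range(5)]

def load_and_predict(models, test_iterator):
    # Collect one prediction array per model, then average them.
    predictions = [model.predict(test_iterator) for model in models]
    return np.mean(predictions, axis=0)

avg_preds = load_and_predict(models, test_iterator)
labels = np.argmax(avg_preds, axis=1)  # highest-scoring class per image

# A Pandas data frame can be written straight to a submission CSV.
submission = pd.DataFrame({"image_id": test_ids, "label": labels})
submission.to_csv("submission.csv", index=False)
```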
The submission resulted in a 0.001 increase over our previous score.

We jumped a few places on the public leaderboard, to place 1908.

3. Use CutMix to augment your images
Sangdoo Yun et al. describe CutMix in the research paper CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features.
CutMix combines two random images from your training set. It cuts a part of one image and pastes it onto another image. CutMix also mixes the labels of the two images, proportional to the area of the pasted patch.
The paper concludes that CutMix improves the robustness and performance of the model.
When we apply the CutMix technique to the cassava training data, we get the following images. I added the red boxes to show the boundary between the original image and the pasted patch.

Implementing CutMix with TensorFlow Keras
Implementing CutMix with TensorFlow is simple, thanks to the CutMixImageDataGenerator class developed by Bruce Kim. It acts the same as an ImageDataGenerator, but adds CutMix.
We define an ImageDataGenerator as usual, but instead of one iterator, we create two; see lines 16 and 27. In row 38, we then create the CutMixImageDataGenerator, passing both iterators as arguments. Finally, we take the train_iterator it returns and pass it to the fit method.
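Sketched out, the wiring looks roughly like this; the CutMixImageDataGenerator arguments follow its README as I remember it, so treat them as assumptions:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255)

# Two independent iterators over the same training data; CutMix cuts a
# patch from a batch of one and pastes it onto a batch of the other.
iterator1 = datagen.flow_from_dataframe(
    training_data, directory="train_images", x_col="image_id",
    y_col="label", target_size=(456, 456), class_mode="categorical")
iterator2 = datagen.flow_from_dataframe(
    training_data, directory="train_images", x_col="image_id",
    y_col="label", target_size=(456, 456), class_mode="categorical")

# 456x456 matches EfficientNet-B5's native input size.
train_iterator = CutMixImageDataGenerator(
    generator1=iterator1, generator2=iterator2,
    img_size=456, batch_size=32)

model.fit(train_iterator,
          steps_per_epoch=train_iterator.get_steps_per_epoch(),
          epochs=10)
```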
The combination of EfficientNet-B5 with CutMix resulted in a maximum validation accuracy of 89.49% and a minimum validation loss of 0.32.


To report the accuracy and loss, I used TensorBoard, which can track and visualize loss and accuracy metrics. Google also developed the website TensorBoard.dev, which you can use to store and share your training results.
The great thing is that you can even watch these metrics during training.
You can view the accuracy and loss results of the CutMix training here.
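Hooking TensorBoard into training takes a single callback; the log directory name below is arbitrary:

```python
import tensorflow as tf

# Write accuracy and loss metrics for each epoch to the log directory.
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir="logs/cutmix_b5")
model.fit(train_iterator, validation_data=val_iterator,
          epochs=10, callbacks=[tensorboard_cb])

# View locally:             tensorboard --logdir logs
# Share on TensorBoard.dev: tensorboard dev upload --logdir logs
```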
Although the accuracy and loss were among the best so far, they did not result in a higher score on Kaggle. The model scored 0.889, which was not an improvement.

So training a single model using CutMix and EfficientNet-B5 did not improve our model.
4. Use MixUp to augment your images
MixUp is described in the research paper mixup: Beyond Empirical Risk Minimization by Zhang et al.
Like CutMix, MixUp combines two images from our training set. It makes one image transparent and places it on top of the other; the amount of transparency is adjustable. MixUp also mixes the labels of the two images with the same weight.
The research paper shows that MixUp improves the generalization of state-of-the-art neural network architectures.
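In code, the MixUp operation itself is just a weighted average. Here is a minimal sketch following Zhang et al., where the mixing weight is drawn from a Beta(alpha, alpha) distribution:

```python
import numpy as np

def mixup_batch(images1, labels1, images2, labels2, alpha=0.2):
    # A lam close to 0 or 1 keeps one image mostly opaque; alpha controls this.
    lam = np.random.beta(alpha, alpha)
    images = lam * images1 + (1.0 - lam) * images2  # transparent overlay
    labels = lam * labels1 + (1.0 - lam) * labels2  # mix one-hot labels too
    return images, labels
```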
When we apply the MixUp technique to the cassava training data, we get the following images. I used a transparency value of 0.2.

Implementing MixUp with TensorFlow Keras
Implementing MixUp is again straightforward when you use the MixUpImageGenerator mentioned on the dlology blog. You can find the source here on GitHub.
I made a small change to the MixUpImageGenerator so that I could use it with a data frame and the flow_from_dataframe method.
The implementation starts with creating an ImageDataGenerator and passing it to the constructor of the MixUpImageGenerator.
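A heavily hedged sketch of that wiring; the exact constructor signature of the modified MixUpImageGenerator is my assumption:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255)

# Assumed interface: the generator internally calls flow_from_dataframe
# and mixes each batch with a second, shuffled batch.
train_iterator = MixUpImageGenerator(
    generator=datagen, dataframe=training_data,
    directory="train_images", x_col="image_id", y_col="label",
    batch_size=32, img_size=(380, 380),  # EfficientNet-B4 input size
    alpha=0.2)                           # the transparency value used above
```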
The combination of EfficientNet-B4 with MixUp resulted in a maximum validation accuracy of 88.5% and a minimum validation loss of 0.35.


You can investigate these graphs yourself, as I created them using TensorBoard.
MixUp did not improve the accuracy or loss; the result was lower than with CutMix. It also did not result in a higher score on Kaggle. The model scored 0.887, which was not an improvement.

So training a single model using MixUp and EfficientNet-B4 does not result in an improvement in recognizing cassava diseases.
5. Use Ensemble learning
Ensemble learning is an approach to improve predictions by training and combining multiple models. What we previously did with K-Fold Cross-Validation was ensemble learning.
We trained multiple models and combined the predictions of these models. With K-Fold Cross-Validation, we used the same model architecture, EfficientNet-B3. It is also possible to combine different architectures.
We combined a model trained with EfficientNet-B7 and another trained with ResNext50v2. After training, the maximum validation accuracy of the ResNext50v2 model was 85%; that of the EfficientNet-B7 model was 89%.


Creating the prediction using ensemble learning
I combined the ensemble with four rounds of test-time augmentation. If you look at the source code, there are two loops – the first iterates over all models, the second over the augmented test images.
In the end, in line 33, we calculate the average of the predictions using np.mean. Then, on line 35, we select the category or label with the highest score using np.argmax.
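Put together, the two loops plus the averaging might look like this sketch (test_iterator is assumed to apply random augmentation and keep a fixed image order):

```python
import numpy as np

TTA_ROUNDS = 4
predictions = []

for model in models:                 # loop 1: over the ensemble members
    for _ in range(TTA_ROUNDS):      # loop 2: over augmented test passes
        test_iterator.reset()        # re-draws the random augmentations
        predictions.append(model.predict(test_iterator))

avg_preds = np.mean(predictions, axis=0)  # average every prediction made
labels = np.argmax(avg_preds, axis=1)     # pick the highest-scoring class
```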
The result of this ensemble, together with test-time augmentation, was the highest score so far: 89.3% on the private score and 89.35% on the public score.

The competition ended while I was still improving the model and writing this article. That is why you now see both the public and the private score; while the competition was running, Kaggle only showed the public score.
Conclusion and further optimization
As expected, some optimization techniques worked on the Cassava data while others did not. We were able to increase our accuracy from 88.9% to 89.35%.
Using a larger model, K-Fold Cross-Validation, and Ensemble learning increased the accuracy; CutMix and MixUp image augmentation did not.
You can find the source code in this GitHub repository.
I spent much more time trying to increase the accuracy than I spent on the initial creation of the model. Creating and training the models took more time and computing power.
My feeling about the 80/20 rule was correct.
The five techniques in this article can increase the accuracy of your CNN. Which ones work best depends on the amount and quality of your training images, so experiment to find out.
While reading through the Kaggle forums, I found even more optimization techniques. For example:
- Don't perform image augmentation on every epoch. For example, run the first three epochs without augmentation, and also skip augmentation on the final epochs.
- Create an ensemble that includes a Vision Transformer (ViT) model
- Use different loss functions
- Balance the dataset using over-sampling
I am going to use these techniques to further try to increase the accuracy of my model.
Thank you for reading and remember never to stop learning!
Used Resources
Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, Youngjoon Yoo. CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. arXiv:1905.04899
H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
Mingxing Tan and Quoc V Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of International Conference on Machine Learning (ICML), 2019.
M. Stone. Cross-validatory choice and assessment of statistical predictions. J. Royal Stat. Soc., 36(2):111–147, 1974.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition. arXiv:1512.03385