Balancing the Regularization Effect of Data Augmentation

A look into the need to balance overfitting and underfitting when using data augmentation, illustrated with an application of semantic segmentation on satellite images to identify water bodies.

Metika Sikka
Towards Data Science

--

The Effect of Data Augmentation

When training neural networks, data augmentation is one of the most commonly used pre-processing techniques. The word “augmentation”, which literally means “the action or process of making or becoming greater in size or amount”, summarizes the outcome of this technique. But another important effect is that it increases, or augments, the diversity of the data. This increased diversity means that at each training step the model sees a different version of the original data.

By Author

Why do we need this ‘increased diversity’ in data? The answer lies in a core tenet of machine learning: the bias-variance tradeoff. More complex models like deep neural networks have low bias but suffer from high variance. This means that these models overfit the training data and perform poorly on test data, i.e. data they haven’t seen before, which leads to higher prediction errors. The increased diversity from data augmentation reduces the variance of the model by making it better at generalizing.
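For a squared-error loss, this tradeoff can be written as the standard decomposition of the expected prediction error (a textbook identity, shown here only to make the argument concrete):

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\mathrm{Bias}\big[\hat{f}(x)\big]^{2}}_{\text{bias}}
  + \underbrace{\mathrm{Var}\big[\hat{f}(x)\big]}_{\text{variance}}
  + \underbrace{\sigma^{2}}_{\text{irreducible noise}}
```

Regularization techniques such as data augmentation aim to lower the variance term without raising the bias term too much.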

For images, some common methods of data augmentation are taking cropped portions, zooming in or out, rotating by a random angle, vertical or horizontal flips, and adjusting the brightness or shear intensity. Data augmentation for audio data involves adding noise or changing the speed and pitch.

While data augmentation prevents the model from overfitting, some augmentation combinations can actually lead to underfitting. This slows down training, which puts a huge strain on resources like available processing time, GPU quotas, etc. Moreover, the model isn’t able to learn enough information to give accurate predictions, which again leads to high prediction errors. In this blog post we take the example of semantic segmentation on satellite images to see the impact of different combinations of data augmentation on training.

About the Data set

This Kaggle data set contains satellite images from Sentinel-2 and their corresponding masks segmenting the water bodies. The masks were computed using the Normalized Difference Water Index (NDWI). Of the 2841 images in the data set, 2560 were used for the training set, 256 for the validation set and 25 for the test set. The entire analysis and modeling was done on Google Colab with GPU support.
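As a rough sketch of how such a split could be set up (the folder names and the shuffle seed below are assumptions for illustration, not the exact setup used):

```python
import os
import random

# Hypothetical paths; point these at the unpacked Kaggle data set.
image_dir = "data/Images"
mask_dir = "data/Masks"

filenames = sorted(os.listdir(image_dir))
random.seed(42)            # assumed seed, just to make the split reproducible
random.shuffle(filenames)

# 2560 / 256 / 25 split over the 2841 images, as described above
train_files = filenames[:2560]
val_files = filenames[2560:2816]
test_files = filenames[2816:2841]

# Assuming each image shares its file name with its mask
train_pairs = [(os.path.join(image_dir, f), os.path.join(mask_dir, f)) for f in train_files]
```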

Structure of U-NET

Simply put, a U-NET is an autoencoder with residual or skip connections from each convolutional block in the encoder to its counterpart in the decoder. This results in a symmetric ‘U’-like structure. This article gives a comprehensive, line-by-line explanation of the structure of a U-NET from the original paper.
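As a minimal sketch of the idea, here is one encoder block, the bottleneck and the matching decoder block with its skip connection, written with standard Keras layers (the filter counts and input size are illustrative, not the exact configuration used in this post):

```python
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # Two 3x3 convolutions, as in the original U-NET paper
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

inputs = layers.Input((128, 128, 3))

# Encoder: convolve, keep the feature map for the skip connection, then downsample
c1 = conv_block(inputs, 16)
p1 = layers.MaxPooling2D()(c1)

# Bottleneck
b = conv_block(p1, 32)

# Decoder: upsample, concatenate the skip connection, convolve again
u1 = layers.Conv2DTranspose(16, 2, strides=2, padding="same")(b)
u1 = layers.concatenate([u1, c1])
c2 = conv_block(u1, 16)

# One-channel sigmoid output for the binary water / no-water mask
outputs = layers.Conv2D(1, 1, activation="sigmoid")(c2)
model = Model(inputs, outputs)
```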

We use a slightly modified version of the U-NET as shown below.

Snapshot of a block of the UNET used (By Author)

A Look at Different Cases of Data Augmentation

We explore 5 different cases of data augmentation with the help of the Keras ImageDataGenerator. We want to see how augmentation can lead to overfitting or underfitting during training. For the comparison of the 5 cases, accuracy and loss during training and validation were used, with binary cross-entropy as the loss function.

When dealing with semantic segmentation, an important point to remember is to apply the same augmentations to the images and their corresponding masks!
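A common way to do this with ImageDataGenerator is to build two generators with identical arguments and pass the same seed to both, then zip them into a single generator of (image, mask) batches. The sketch below assumes the training images and masks are already loaded as NumPy arrays named train_images and train_masks:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Identical augmentation arguments for images and masks (Case 2 shown here)
aug_args = dict(rescale=1.0 / 255, horizontal_flip=True, vertical_flip=True)

image_datagen = ImageDataGenerator(**aug_args)
mask_datagen = ImageDataGenerator(**aug_args)

seed = 1  # same seed => same random transform applied to an image and its mask
image_gen = image_datagen.flow(train_images, batch_size=16, seed=seed)
mask_gen = mask_datagen.flow(train_masks, batch_size=16, seed=seed)

# Yields (image_batch, mask_batch) pairs that can be passed to model.fit
train_gen = zip(image_gen, mask_gen)
```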

In all 5 cases, the pixel values of the images and masks were rescaled by a factor of 1/255. All images and their masks in the validation and test set were also rescaled.

Case 1: This was the base case. Only the pixel values of images and their masks were rescaled. No augmentations were applied. This case produced a training set with the least variance.

Case 2: In addition to rescaling, the images and their masks were randomly flipped vertically or horizontally.

Case 3: For this case, rescaling, random vertical or horizontal flips and random rotations between [-20,20] degrees were applied to images and their corresponding masks.

Case 4: The images and their corresponding masks were randomly shifted along the width and height by a factor of 0.3.

Case 5: Shear transformations were randomly applied to the images and their corresponding masks using an intensity of 20. They were also randomly zoomed in over the range [0.2, 0.5].
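Translated into ImageDataGenerator arguments, the five cases look roughly as follows (a sketch based on the descriptions above; the values follow Keras conventions, e.g. shear_range is a shear intensity and zoom_range takes a [lower, upper] interval):

```python
common = dict(rescale=1.0 / 255)

case_args = {
    1: dict(**common),                                                  # base case: rescaling only
    2: dict(**common, horizontal_flip=True, vertical_flip=True),        # + random flips
    3: dict(**common, horizontal_flip=True, vertical_flip=True,
            rotation_range=20),                                         # + rotations in [-20, 20] degrees
    4: dict(**common, width_shift_range=0.3, height_shift_range=0.3),   # random shifts
    5: dict(**common, shear_range=20, zoom_range=[0.2, 0.5]),           # shear + zoom
}

# e.g. ImageDataGenerator(**case_args[3]) builds the generator for Case 3
```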

Different types of augmentations on the same image and its mask (By Author)

A Comparison of Results

Across all 5 cases, the model was trained for 250 epochs with a batch size of 16. The Adam optimizer was used with a learning rate of 0.00001, beta 1 of 0.99 and beta 2 of 0.99. In the graphs below, we see that each case of data augmentation gave a different performance for the same model, trained for the same number of epochs with the same initial state of the optimizer.
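In Keras terms, this training setup corresponds roughly to the following (the validation generator and the steps-per-epoch values are assumptions derived from the split sizes and batch size above):

```python
from tensorflow.keras.optimizers import Adam

model.compile(
    optimizer=Adam(learning_rate=1e-5, beta_1=0.99, beta_2=0.99),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

history = model.fit(
    train_gen,                   # paired (image, mask) generator from earlier
    steps_per_epoch=2560 // 16,  # 2560 training images, batch size 16
    validation_data=val_gen,     # assumed generator over the validation set, rescaling only
    validation_steps=256 // 16,
    epochs=250,
)
```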

Plots of Training history (By Author)

The base case gave the best performance in terms of both training accuracy and loss. However, this could imply overfitting. On the other hand, the performance of the third, fourth and fifth cases was worse: the training accuracy for these three cases did not go above 80% and the training loss did not fall below 0.1. This could mean underfitting. The second case, in which data augmentation consisted of randomly flipping the images and their corresponding masks, seems to show a balanced performance. Even though it performed a bit worse than the base case, it had a much higher accuracy and lower loss than the last three cases.

Plots of Validation history (By Author)

The validation accuracy and loss showed a similar trend. The last case performed the worst on both metrics. Looking closer, we see that cases 3 and 4 showed a validation loss similar to their training loss but a slightly higher validation accuracy than training accuracy, which signals underfitting. For the base case, the validation loss and accuracy were much worse than their training counterparts, which again points towards overfitting. While we see very high fluctuations in validation accuracy and loss for the second case, on average it seems to perform the best of the five.

From the four graphs we can also see that each case of data augmentation converged at a different point during training for the same number of epochs and same initial state of the optimizer. It is almost like each case followed a separate trajectory.

The maximum accuracy and the minimum loss for Training and Validation are shown in the table below. The table supports the conclusion from the graphs. The second case seems to balance overfitting and underfitting during training quite well.

By Author

Predictions on Test images for the 5 Cases

We see the predicted segmentation of water bodies for each of the five cases on 3 test images. It is again worth noting that the same model, trained with 5 different data augmentation combinations, produced 5 different predicted masks for each test image.

It should be noted that the test images were most similar in format to the base case's training data, since the base case used no augmentations. Hence, for all three images, the base case model was able to segment the overall shape of the water body. However, Case 2 was able to capture the minute edges as well. This supports our conclusion from the graphs that the base case model slightly overfit the training data (which had the least variance).

By Author

The performance of Cases 3 and 4 varied with each test image. Case 5 gave the worst prediction for all three images (especially the second and third). This could imply that, for training this model on such a data set, changing the shear intensity or zooming in too much can lead to poor prediction results. This can be attributed to the model underfitting the training data, since these augmentations introduced very high variance.

By Author
By Author

In the end, it again comes back to the core tenet: balancing bias and variance. While data augmentation does have an explicit regularization effect, overdoing it can lead to the model not learning enough, resulting in poor predictions. Thus, there is a need to try out different combinations of data augmentation to find the most appropriate one for the data set and problem at hand.

The entire code can be found on Github!
