
Introduction
Why do we need data augmentation?
Data augmentation is one of the critical elements of Deep Learning projects. It proves its usefulness in combating overfitting and making models generalize better. Besides regularization, transformations can artificially enlarge the dataset by adding slightly modified copies of existing images.
How to pick the right augmentations?
There are two ways to choose augmentations: manually or using an optimized policy. As you might expect, manual design is likely to yield only a sub-optimal solution without extensive background research into the dataset's domain.
On the other hand, automated policies are optimized to get the highest validation accuracy for specific tasks without human interference.
In this blog, we will go through both of these methods in detail, along with torchvision code. In the end, we will compare the performance of three setups (no augmentation, a manual policy, and an automatic policy) on the CIFAR-10 dataset.
Without further ado, let’s dive deeper into it!
Manual augmentations
There are over 30 different augmentations available in the torchvision.transforms
module. In this part we will focus on the top five most popular techniques used in computer vision tasks. To combine them together, we will use the transforms.Compose()
function. Before we apply any transformations, we need to normalize inputs using transforms.Normalize()
. This scheme reduces the instability of the model and speeds up the convergence.
Inputs are normalized using the mean and standard deviation of the whole dataset. These values are calculated separately for each channel(RGB). In this case, we used values specific to the CIFAR-10. If you want to know more about normalization, you should check out my article.
The transforms.ToTensor()
command converts the PIL image format to torch Tensor
so it can be passed to the PyTorch model.
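A minimal sketch of this base pipeline; the CIFAR-10 statistics below are the commonly quoted values (they vary slightly between sources):

```python
import torchvision.transforms as T

# Commonly quoted CIFAR-10 training-set statistics (assumed values;
# they differ slightly between sources)
CIFAR_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR_STD = (0.2470, 0.2435, 0.2616)

base_transform = T.Compose([
    T.ToTensor(),                        # PIL image -> float tensor in [0, 1]
    T.Normalize(CIFAR_MEAN, CIFAR_STD),  # per-channel standardization
])
```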
Now let’s see how to add the transformations:
1. Random Flipping
Horizontal and vertical flipping are among the most straightforward and powerful transformations. The parameter p denotes the probability of the flip occurring.
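A minimal sketch showing both variants; p=0.5 is also the torchvision default:

```python
import torchvision.transforms as T

flip = T.Compose([
    T.RandomHorizontalFlip(p=0.5),  # left-right flip with probability p
    T.RandomVerticalFlip(p=0.5),    # top-bottom flip with probability p
])
```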
2. Padding

The padding parameter defines the number of extra pixels added to the height and width of the output.
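For example, padding a 32×32 CIFAR-10 image by 4 pixels on each side yields a 40×40 output:

```python
import torchvision.transforms as T

pad = T.Pad(padding=4)  # 32x32 -> 40x40; fill defaults to 0 (black)
```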
3. Random Cropping

This function crops the image to the desired output size. Adding padding first helps to keep the same dimensions as the input.
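A sketch of the typical CIFAR-10 recipe: pad by 4 pixels, then crop a random 32×32 patch so the output matches the input size:

```python
import torchvision.transforms as T

crop = T.RandomCrop(size=32, padding=4)  # pad to 40x40, crop a random 32x32
```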
4. Color Jittering

This function randomly manipulates brightness, contrast, and saturation. In this way, we can simulate daylight and nighttime conditions, which aids generalization.
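A sketch with illustrative strengths (the 0.5 values are my own choices, not prescribed ones); each factor is sampled uniformly around the original image:

```python
import torchvision.transforms as T

jitter = T.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5)
```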
5. Random Erasing

This function randomly selects a rectangular region and ‘erases‘ its pixels with 0 values. Again, p denotes the probability of occurrence. Unlike the other transformations, RandomErasing() is applied directly on the tensor; hence it comes after ToTensor().
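A minimal sketch; note the ordering relative to ToTensor():

```python
import torchvision.transforms as T

erase = T.Compose([
    T.ToTensor(),            # RandomErasing expects a tensor...
    T.RandomErasing(p=0.5),  # ...so it must come after ToTensor()
])
```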
Manual Design
The transforms.Compose() function allows us to chain multiple augmentations and create a policy. One important thing to keep in mind: some of these techniques can be useless or even decrease performance.
The simplest example is horizontally flipping the number ‘6’, which becomes a ‘9’, while the label stays the same. So it’s worth picking only the augmentations that are relevant to your dataset.
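Putting it together, a manual policy could look like the sketch below; the parameter values are illustrative choices:

```python
import torchvision.transforms as T

CIFAR_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR_STD = (0.2470, 0.2435, 0.2616)

manual_policy = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomCrop(32, padding=4),
    T.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5),
    T.ToTensor(),
    T.Normalize(CIFAR_MEAN, CIFAR_STD),
    T.RandomErasing(p=0.5),  # tensor-level op, so it stays after ToTensor()
])
```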
Automatic augmentation
AutoAugment

In 2018, Google researchers developed the first algorithm to automatically search for improved data augmentation policies. The main problems of manual design were time-consuming background research and sub-optimal results. The new solution was built upon two components: a search algorithm and a search space.

At each step, a controller generates 5 sub-policies, each with 2 sampled operations. It does this sequentially: the transformation is picked first, then its magnitude and probability. The augmentations are applied to the dataset, which is then passed to a smaller (‘child‘) version of the original network. The resulting validation accuracy serves as the reward, and the controller’s weights are updated using the PPO algorithm. This process is repeated 15 000 times, and the best policies are chosen based on the results.

AutoAugment beat all the state-of-the-art results at the time, but there’s a caveat: computational cost. There are 16 possible augmentations (like rotation, equalize, etc.), 10 magnitude values, and 11 probabilities. With 5 sub-policies of 2 operations each (10 operation slots in total), this results in an enormous search space of roughly (16 × 10 × 11)¹⁰ possibilities. Finding the optimal ones is a non-trivial task at a great computational expense.
Fortunately, torchvision provides us with policies pre-trained on datasets like CIFAR-10, ImageNet, or SVHN. All of these are available through the AutoAugmentPolicy enum.
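A minimal sketch of applying the pre-trained CIFAR-10 policy; here it is applied to the PIL image, before ToTensor():

```python
import torchvision.transforms as T
from torchvision.transforms import AutoAugment, AutoAugmentPolicy

autoaugment_policy = T.Compose([
    AutoAugment(policy=AutoAugmentPolicy.CIFAR10),  # also IMAGENET, SVHN
    T.ToTensor(),
    T.Normalize(CIFAR_MEAN, CIFAR_STD),
])
```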
Torchvision only ships already-trained policies and does not support the learning procedure. Suppose you want to find the optimal augmentations for your own dataset. In that case, you need an external library with AutoAugment or have to re-implement the algorithm. But before you consider this option, check out the following method, which does not require any additional packages and is computationally lighter.
RandAugment

Apart from its computational burden, AutoAugment also performs worse on larger datasets. The child (proxy) network only approximates the performance of the original model; thus it can only generate approximate (sub-optimal) policies.
RandAugment removes any learning techniques and proxy tasks to find optimal augmentations.
Wait, so how do they do it?
First of all, RandAugment takes only two arguments, N and M. N is the number of augmentations applied, sampled from the 14 available operations. M is the magnitude of those operations on a scale of 1–10, which defines how strongly the image is rotated, translated, etc.
We can find those parameters with a simple grid search that depends on the dataset and the main model, rather than on the ‘child‘ network as in AutoAugment.

Essentially, we train the network on different combinations of N and M and pick the one with the best validation accuracy. As simple as it sounds, the results speak for themselves, producing validation accuracy better than or equal to AutoAugment.
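A sketch of such a grid search; train_and_evaluate is a hypothetical helper that trains the full model with the given transform and returns the validation accuracy, and the candidate values are illustrative:

```python
import torchvision.transforms as T

best_n, best_m, best_acc = None, None, 0.0
for n in (1, 2, 3):      # candidate values for N (illustrative)
    for m in (5, 7, 9):  # candidate values for M (illustrative)
        transform = T.Compose([
            T.RandAugment(num_ops=n, magnitude=m),
            T.ToTensor(),
        ])
        acc = train_and_evaluate(transform)  # hypothetical helper
        if acc > best_acc:
            best_n, best_m, best_acc = n, m, acc
print(f"Best N={best_n}, M={best_m}, val acc={best_acc:.3f}")
```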
This is the torchvision code for RandAugment.
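A minimal sketch; note that torchvision expresses the magnitude in bins (31 by default) rather than the paper’s 1–10 scale:

```python
import torchvision.transforms as T

randaugment_policy = T.Compose([
    T.RandAugment(num_ops=2, magnitude=9),  # N = 2 operations, M = 9
    T.ToTensor(),
    T.Normalize(CIFAR_MEAN, CIFAR_STD),
])
```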
Comparison
In this part, we will finally see the performance of three augmentation setups:
- Plain – only the Normalize() operation is applied.
- Baseline – a combination of RandomHorizontalFlip(), RandomCrop(), and RandomErasing().
- AutoAugment – a policy where AutoAugment is an additional transformation on top of the baseline configuration.

Dataset

The CIFAR-10 dataset consists of 60 000 32×32 colored images in 10 classes, with 6 000 images per class. It is split into 50 000 training images, 2 500 validation images, and 7 500 testing images.
*Only the training images are augmented.
Here is the code that applies the transformations to the dataset:
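A sketch of how the three setups could be wired up; the split sizes follow the article, but the loading code itself is my own reconstruction:

```python
import torch
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader, random_split

plain = T.Compose([T.ToTensor(), T.Normalize(CIFAR_MEAN, CIFAR_STD)])

# Pick manual_policy, autoaugment_policy, or plain for the training set
train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=manual_policy)

# Validation and test images are never augmented, only normalized
test_full = torchvision.datasets.CIFAR10(
    root="./data", train=False, download=True, transform=plain)
val_set, test_set = random_split(
    test_full, [2500, 7500], generator=torch.Generator().manual_seed(42))

train_loader = DataLoader(train_set, batch_size=256, shuffle=True)
val_loader = DataLoader(val_set, batch_size=256)
test_loader = DataLoader(test_set, batch_size=256)
```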
Model

ResNet-20 is a version of ResNet specially tailored to the CIFAR-10 dataset, presented in the Deep Residual Learning for Image Recognition paper.
The first layer, Conv1, is a convolution with a 3×3 kernel. It is followed by 3 stages containing {16, 32, 64} filters, respectively. The parameter n controls the depth of the network by setting the number of residual blocks at each stage. In our case, it is equal to 3, which produces 9 residual blocks and 18 convolution layers.
Feature maps between stages are downsampled using stride 2, which gives output sizes of {32, 16, 8}, respectively. Residual projections match the channels of the following stage using 1×1 convolutions.
The network ends with global average pooling and a fully connected layer resulting in 20 trainable layers.
Here is the implementation in PyTorch:
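Below is a minimal ResNet-20 sketch following the description above; details such as batch normalization placement follow common reconstructions of the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Two 3x3 convolutions with an identity (or 1x1 projection) shortcut."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            # 1x1 projection matches channels/resolution between stages
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))

class ResNet20(nn.Module):
    """CIFAR-10 ResNet with 6n + 2 layers; n = 3 gives ResNet-20."""
    def __init__(self, n=3, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1, bias=False)  # Conv1, 3x3
        self.bn1 = nn.BatchNorm2d(16)
        self.stage1 = self._make_stage(16, 16, n, stride=1)  # 32x32 maps
        self.stage2 = self._make_stage(16, 32, n, stride=2)  # 16x16
        self.stage3 = self._make_stage(32, 64, n, stride=2)  # 8x8
        self.fc = nn.Linear(64, num_classes)

    @staticmethod
    def _make_stage(in_ch, out_ch, blocks, stride):
        layers = [BasicBlock(in_ch, out_ch, stride)]
        layers += [BasicBlock(out_ch, out_ch) for _ in range(blocks - 1)]
        return nn.Sequential(*layers)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.stage3(self.stage2(self.stage1(out)))
        out = F.adaptive_avg_pool2d(out, 1).flatten(1)  # global average pooling
        return self.fc(out)
```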
If you want to know more about ResNets in general, check out my video on YouTube.
Hyperparameters
These are the set of parameters used for training:
learning_rate = 0.001
batch = 256
optimizer = Adam
loss = CrossEntropyLoss
epochs = 40
n = 3
weight initialization = Kaiming He
I also added a learning rate scheduler to the model. It reduces the learning rate when the validation loss plateaus, which helps to prevent overfitting.
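A sketch of the training setup under these hyperparameters; the scheduler’s factor and patience are my own illustrative choices:

```python
import torch.nn as nn
import torch.optim as optim

model = ResNet20(n=3)

# Kaiming He initialization for the convolution layers
for m in model.modules():
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, nonlinearity="relu")

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Reduce the learning rate when the validation loss plateaus
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5)  # assumed values

# After each epoch: scheduler.step(val_loss)
```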
I don’t include all the training, validation, and testing loops since they are a bit lengthy and aren’t crucial in understanding the topic. Nevertheless, you can easily find them on my Github with some intuitive explanations.
Results
Now it is time to compare the results after training the 3 different models for 40 epochs. The main metrics to evaluate are the accuracy and the loss, so let’s see what they look like:


In the ‘Plain’ model, the training accuracy goes towards 100%, while the validation accuracy saturates and then decreases. The validation loss also keeps increasing, which indicates an overfitting problem.
The Baseline configuration seems to handle overfitting well. Surprisingly, the learning rate has not been decreased even once. The validation accuracy climbs steadily towards higher values without significant drops. The model could possibly be trained longer with a lower learning rate to get better results.
Last but not least, AutoAugment interestingly keeps the validation accuracy higher than the training accuracy. This points to an underfitting problem: the model generalizes too much and can’t fit the data well. Again, the model could probably achieve a better outcome if trained longer with a learning rate schedule.
The final result might be surprising: the Baseline augmentation gives the best result with ~87% testing accuracy, while AutoAugment reaches ~84%. It shows how powerful random flipping, cropping, and erasing are.
In defense of AutoAugment, I personally think there were too many transformations for such a small network. Larger models like ResNet-50 would have the additional learning capacity to cover more complex cases and generalize better.
Conclusions
Congratulations if you made it this far! In this article, we went through plenty of different data augmentation techniques used in Deep Learning and compared the performance of three different setups.
The Plain configuration proved that data augmentations are essential for improving the accuracy of the model.
Manually picked transformations require much more intuition about the dataset and often yield sub-optimal results. Nevertheless, even a standard policy can still produce impressive results.
On the other hand, automated augmentations are arguably the future of transformations and, in theory, should perform better given an optimal solution. In our case, the learned policy did not actually improve on the manual design. Nevertheless, it’s an active area of research that has only recently started to get more attention and still has a lot of potential.
I hope this article gives you a solid introduction and encourages you to explore and experiment more with Torchvision and test some other techniques. Feel free to follow up with questions in the comments.
If you liked this post, you should check my Medium and Github to see other projects I’m working on.