Intro to PyTorch 2: Convolutional Neural Networks

Tony Flores
Towards Data Science
16 min read · Feb 13, 2023


Image adapted from Adobe Stock

Intro

In the previous iteration of this series, we worked with the CIFAR-10 dataset and introduced the basics of PyTorch:

  • The Tensor and some associated operations
  • Datasets and the DataLoader
  • Building a basic neural network
  • Basic model training and evaluation

The model we developed for classifying images in the CIFAR-10 dataset was only able to achieve 53% accuracy on the validation set, and it really struggled to correctly classify images of some classes, like birds and cats (~33–35%). This was expected, since we would normally use convolutional neural networks for image classification. In this part of the tutorial series, we will focus on CNNs and improve the performance of image classification on CIFAR-10.

CNN Basics

Before we dive into the code, let’s discuss the basics of convolutional neural networks so we can have a better understanding of what our code is doing. If you’re comfortable with how CNNs work, feel free to skip this section.

In comparison to feed-forward networks, like the one we developed in the previous part of the series, CNNs have a different architecture and are composed of different types of layers. In the figure below, we can see the general architecture of a typical CNN, including the different types of layers it can contain.

Image source: Author

The three types of layers usually present in a Convolutional Network are:

  • Convolutional Layers (red dashed outline)
  • Pooling Layers (blue dashed outline)
  • Fully Connected Layers (red and purple solid outlines)

Convolutional Layer

The defining component, and first layer, of a CNN is the convolutional layer, and it consists of the following:

  • Input data (in this case, an image)
  • Filters
  • Feature Maps

What really differentiates a convolutional layer from a densely connected layer is the convolution operation. We won’t get into the deep specifics of the mathematical definition of convolution, but if you want to get into the meat of it, this article does an excellent job of explaining the definition and gives some really nice concrete examples. I highly recommend it if you’re interested!

So why is convolution better than a densely/fully connected layer for image data? In essence, dense layers will learn global patterns in their inputs, while convolutional layers have the advantage of learning local and spatial patterns. That may sound kind of vague or abstract, so let’s check out an example of what this means.

Image Source: Author

On the left of the image, we can see how a basic 2-D, black-and-white image of a 4 would be represented in a convolutional layer. The red square would be the filter/feature detector/kernel, convolving over the image. On the right is how the same image would be input into a densely connected layer. You can see the same 9 image pixels that were framed by the kernel in red. Notice how on the left, pixels are grouped spatially, adjacent to other neighboring pixels. On the right, however, those same 9 pixels are no longer neighbors.

With this, we can see how the spatial/location-based information is lost when an image is flattened and represented in a fully-connected/linear layer. This is why convolutional neural networks are more powerful at working with image data. The spatial structure of the input data is maintained, and patterns (edges, textures, shapes, etc.) in the image can be learned.

This is essentially the why for using CNNs on images, but now let’s discuss the how. Let’s have a look at the structure of our input data, these things we keep talking about called ‘filters’, and what convolution looks like when we put it all together.

Input Data

The CIFAR-10 dataset contains 60,000 32x32 color images, and each image is represented as a 3-D tensor. Each image will be a (32,32,3) tensor, where the dimensions are 32 (height) x 32 (width) x 3 (RGB color channels). The figure below illustrates the 3 different color channels (RGB) separated out from the full-color image of a plane in the dataset.

Image Source: Author

Images are usually thought of as 2-dimensional, so it can be easy to forget that since they have 3 color channels, they will actually be represented in 3 dimensions!

Filters

The filter (also referred to as a kernel or feature detector) in a convolutional layer is an array of weights that essentially scans over the image in a sliding-window fashion, computing the dot product at each stop, and outputs this dot product into a new array called a feature map. The sliding-window scanning is called convolution. Let’s have a look at an illustration of this process to help make sense of what’s going on.

Illustration of a 3x3 filter (blue) convolving over an input (red) to create a feature map (purple). Image Source: Author
Illustration of the dot product computation at every step of the convolution. Image Source: Author

It’s important to note that the weights of the filter remain the same through each step. Just like the weights in a fully connected layer, these values are learned during training and adjusted after each training iteration through backpropagation. The illustrations don’t tell the whole story, though. When training a CNN, your model won’t have just 1 filter at a convolutional layer. It’s pretty common to have 32 or 64 filters in a single convolutional layer, and in fact, we will have up to 96 filters in a layer in the model we develop in this tutorial.

Finally, though the weights of the filters are the main parameters that are trained, there are also hyperparameters that can be tuned for CNNs:

  • number of filters in a layer
  • dimensions of filters
  • stride (number of pixels a filter moves each step)
  • padding (how the filter handles boundaries of images)

We won’t get into the details of these hyperparameters, since this isn’t intended to be a comprehensive CNN walk-through, but these are important factors to be aware of.

Pooling Layer

Pooling layers are similar to convolutional layers, in that a filter convolves over the input data (usually a feature map that was output from a convolutional layer). However, rather than feature detection, the function of pooling layers is dimensionality reduction or downsampling. The two most common types of pooling used are Max Pooling and Average Pooling. With Max Pooling, the filter slides across the input, and at each step will select the pixel with the largest value as the output. In Average Pooling, the filter will output the average value of the pixels that the filter is passing over.
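To make this concrete, here is a tiny sketch of both pooling types in PyTorch, run on a made-up 4x4 input (the values are just for illustration):

```python
import torch
import torch.nn as nn

# A dummy input of shape (batch, channels, height, width)
x = torch.tensor([[[[1., 2., 3., 4.],
                    [5., 6., 7., 8.],
                    [9., 10., 11., 12.],
                    [13., 14., 15., 16.]]]])

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

print(max_pool(x))  # keeps the largest value in each 2x2 window
print(avg_pool(x))  # averages the values in each 2x2 window
```

In both cases the 4x4 input is downsampled to a 2x2 output, which is exactly the dimensionality reduction described above.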

Fully Connected Layer

Finally, CNNs typically will have fully connected layers after convolutional and pooling layers, and these layers will perform the classification in image classification tasks such as the one in this tutorial.

Now that we’ve gotten to see how Convolutional Neural Nets are structured and how they operate, let’s get to the fun part and train our own CNN in PyTorch!

Setup

As with the first part of this tutorial, I recommend using Google Colab to follow along since you will have your Python environment set up already with PyTorch and other libraries installed, as well as a GPU to train your model.

So, if you are using Colab, to make sure you are utilizing a GPU go to Runtime and click Change runtime type.

Image Source: Author

In the dialog select GPU and save.

Image Source: Author

Now you have GPU access in Colab, and we can verify your device with PyTorch. So first, let’s get our imports taken care of:
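Something along these lines covers everything we’ll use in this tutorial (the exact import list in the original notebook may differ slightly):

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

import torchvision
import torchvision.transforms as transforms

import numpy as np
import matplotlib.pyplot as plt
```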

If you want to check what GPU you have access to, type and execute torch.cuda.get_device_name(0) and you should see your device in the output. Colab has a few different GPU options available, so your output will vary depending on what you are given access to, but as long as you don’t get RuntimeError: No CUDA GPUs are available when you run this code, you are using a GPU!

We can set our GPU as device so that, as we develop our model, we can assign it to the GPU by referencing device, and fall back to the CPU if we don’t have a CUDA device available.
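A one-liner handles this nicely:

```python
# Use the GPU if one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")
```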

Next, let’s set a random seed so that our results are reproducible, as well as download our training data and set a transform to convert the images to tensors and normalize the data.
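A sketch of this step is below; the seed value and the (0.5, 0.5, 0.5) normalization constants are common choices, not necessarily the exact ones from the original notebook:

```python
torch.manual_seed(42)  # seed value is illustrative

# Convert images to tensors and normalize each RGB channel
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

train_data = torchvision.datasets.CIFAR10(
    root="data", train=True, download=True, transform=transform)
test_data = torchvision.datasets.CIFAR10(
    root="data", train=False, download=True, transform=transform)
```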

Once that has finished downloading, let’s check out the classes in the dataset:
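The dataset object exposes these directly:

```python
class_names = train_data.classes
print(class_names)
# ['airplane', 'automobile', 'bird', 'cat', 'deer',
#  'dog', 'frog', 'horse', 'ship', 'truck']
```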

Finally, let’s set up our train and test dataloaders:
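Something like the following works; the batch size of 512 matches the batch_sz we use later during tuning, but you can pick a different value:

```python
batch_size = 512

train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=False)
```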

Now we’re ready to build our model!

Building the CNN

In PyTorch, nn.Conv2d is the convolutional layer that is used on image input data. The first argument for Conv2d is the number of channels in the input, so for our first convolutional layer, we will use 3 since a color image will have 3 color channels. After the first convolutional layer, this argument will depend on the number of channels output from the previous layer. The second argument is the number of channels that are output from the convolution operation in the layer. These channels are the feature maps that were discussed in the intro to the convolutional layer. Finally, the third argument will be the size of the kernel or filter. This can be an integer value like 3 for a 3x3 kernel, or a tuple such as (3,3). So our convolutional layers will take the form of nn.Conv2d(in_channels, out_channels, kernel_size). Additional optional parameters can be added, including (but not limited to): stride, padding, and dilation. We will use stride=2 in our convolutional layer conv4.

After our series of convolutional layers, we will want to use a flattening layer to flatten our feature maps to be able to feed into linear layers, and for that we will use nn.Flatten(). We can apply batch normalization with nn.BatchNorm1d() and will need to pass the number of features as an argument. Finally, our linear, fully-connected layers are built using nn.Linear(), which will also take the number of features as the first argument, as well as specifying the number of output features as the second argument.

So to begin defining the base architecture of our model, we will define a ConvNet class that inherits from the PyTorch nn.Module class. We can then define each of our layers as attributes for our class, and build them as we see fit. Once we’ve specified the layer architecture, we can define the flow of the model by creating a forward() method. We can wrap each layer with an activation function, and in our case we will be using relu. We can apply dropout between layers by passing the previous layer and p the probability of an element being dropped out (which defaults to 0.5). Finally, we create our model object and attach it to our device so that it can train on the GPU.
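Here is a sketch of what such a ConvNet class can look like. The exact channel counts and layer arrangement are my reconstruction, chosen to match the details mentioned in this article (48 filters in conv1, up to 96 filters per layer, and stride=2 in conv4):

```python
class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Convolutional layers: nn.Conv2d(in_channels, out_channels, kernel_size)
        self.conv1 = nn.Conv2d(3, 48, 3, padding=1)        # 3 color channels in, 48 feature maps out
        self.conv2 = nn.Conv2d(48, 96, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)                     # halves height and width
        self.conv3 = nn.Conv2d(96, 96, 3, padding=1)
        self.conv4 = nn.Conv2d(96, 96, 3, stride=2, padding=1)
        self.flatten = nn.Flatten()
        self.batchnorm = nn.BatchNorm1d(96 * 8 * 8)
        self.fc1 = nn.Linear(96 * 8 * 8, 256)
        self.fc2 = nn.Linear(256, 10)                      # 10 CIFAR-10 classes

    def forward(self, x):
        x = torch.relu(self.conv1(x))              # 48 x 32 x 32
        x = self.pool(torch.relu(self.conv2(x)))   # 96 x 16 x 16
        x = torch.relu(self.conv3(x))              # 96 x 16 x 16
        x = torch.relu(self.conv4(x))              # 96 x 8 x 8 (stride=2)
        x = self.flatten(x)
        x = self.batchnorm(x)
        x = torch.relu(self.fc1(x))
        x = nn.functional.dropout(x, p=0.5, training=self.training)
        return self.fc2(x)

model = ConvNet().to(device)
print(model)
```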

Train and Test Functions

If you went through the first part of this tutorial, our train and test functions will be identical to what we created then, except that we will return the loss in our train method, and the loss and number of correct predictions in our test method, to utilize when we are tuning hyperparameters.
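For reference, a sketch of the two functions with those return values added (the structure mirrors the first part of this series; the exact print formatting is my own):

```python
def train(dataloader, model, loss_fn, optimizer):
    model.train()
    running_loss = 0.0
    for X, y in dataloader:
        X, y = X.to(device), y.to(device)
        pred = model(X)
        loss = loss_fn(pred, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    return running_loss / len(dataloader)  # returned so we can track progress


def test(dataloader, model, loss_fn):
    model.eval()
    test_loss, correct = 0.0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).sum().item()
    test_loss /= len(dataloader)
    accuracy = correct / len(dataloader.dataset)
    print(f"Accuracy: {100 * accuracy:.1f}%, Avg loss: {test_loss:.4f}")
    return test_loss, correct  # loss and number of correct predictions
```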

Finally, we define the loss function and optimizer before the base model training.
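Cross-entropy loss is the standard choice for multi-class classification; the optimizer shown here (Adam with lr=0.001, matching the lrate we use later) is an assumption on my part:

```python
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)  # optimizer choice is illustrative
```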

Let’s train the model.
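A simple loop over 10 epochs, calling the train and test functions from above:

```python
epochs = 10
for epoch in range(epochs):
    print(f"Epoch {epoch + 1}\n-------------------------------")
    train(train_dataloader, model, loss_fn, optimizer)
    test(test_dataloader, model, loss_fn)
print("Done!")
```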

Image source: Author

After only 10 epochs, 61.7% is much better performance than the fully connected model we trained! It’s pretty clear that a CNN is much better suited for classifying images, but we can squeeze out even more performance by extending the training duration and tuning hyperparameters. Before we get to that, let’s take a quick peek under the hood and check out what the filters look like. Recall that the pixels of the filters are the trainable parameters in our model. This isn’t a necessary step for training a model for image classification, nor will we find much useful information, but it’s pretty neat to see what’s going on inside our model.

Visualizing Filters

We can write a function to plot the filters from a specified layer in the model. All we have to do is specify which layer we want to see and pass that into our function.

Let’s check out what the filters in the first convolutional layer (conv1) look like since these are applied directly to the images.
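A plotting function along these lines does the job (the function name, grid layout, and min-max scaling are my own choices; the idea is just to scale the 3-channel filter weights into a displayable color range):

```python
def show_filters(layer, figsize=(12, 9)):
    """Plot each filter in a convolutional layer as a small RGB image."""
    filters = layer.weight.detach().cpu()
    # Scale weights to [0, 1] so they can be displayed as colors
    f_min, f_max = filters.min(), filters.max()
    filters = (filters - f_min) / (f_max - f_min)

    n_filters = filters.shape[0]  # 48 for our conv1
    fig, axes = plt.subplots(6, 8, figsize=figsize)
    for i, ax in enumerate(axes.flat):
        if i < n_filters:
            ax.imshow(filters[i].permute(1, 2, 0))  # (C, H, W) -> (H, W, C)
        ax.axis("off")
    plt.show()

show_filters(model.conv1)
```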

Below is the output, containing the visualization of the 48 filters from our conv1 convolutional layer. We can see that each filter is a 3x3 tensor of different values or colors.

Image Source: Author

If our filters were 5x5 instead, we would see this difference in the plot. Recall that with nn.Conv2d we can change the size of the filter with the third argument, so if we wanted a 5x5, conv1 would look like this:
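Inside the model definition, that single change would look like this (padding=2 is added here only to keep the 32x32 output size from the sketch above; the article itself doesn’t specify padding):

```python
self.conv1 = nn.Conv2d(3, 48, 5, padding=2)  # 5x5 kernels instead of 3x3
```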

If we re-trained the model with the new 5x5 filters the output would now look like this:

Image Source: Author

Like I mentioned before, not too much useful information, but interesting to see nonetheless.

Hyperparameter Optimization

For this tutorial, the hyperparameters that we’ll be tuning are the number of filters in our convolutional layers, and the number of neurons in our linear layer. Right now these values are hard-coded into our model, so to make them tunable we will need to make our model configurable. We can use parameters (c1, c2, and l1) in our model’s __init__ method, and create the model’s layers with these values, which will be passed dynamically during the tuning process.
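A configurable version of the architecture, which I’ll call ConfigNet to match the name used later in this article, could look like this (again, the exact layer arrangement mirrors my earlier sketch):

```python
class ConfigNet(nn.Module):
    def __init__(self, c1=48, c2=96, l1=256):
        super().__init__()
        # c1/c2 control the number of filters, l1 the width of the first linear layer
        self.conv1 = nn.Conv2d(3, c1, 3, padding=1)
        self.conv2 = nn.Conv2d(c1, c2, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv3 = nn.Conv2d(c2, c2, 3, padding=1)
        self.conv4 = nn.Conv2d(c2, c2, 3, stride=2, padding=1)
        self.flatten = nn.Flatten()
        self.batchnorm = nn.BatchNorm1d(c2 * 8 * 8)
        self.fc1 = nn.Linear(c2 * 8 * 8, l1)
        self.fc2 = nn.Linear(l1, 10)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = self.pool(torch.relu(self.conv2(x)))
        x = torch.relu(self.conv3(x))
        x = torch.relu(self.conv4(x))
        x = self.batchnorm(self.flatten(x))
        x = torch.relu(self.fc1(x))
        x = nn.functional.dropout(x, p=0.5, training=self.training)
        return self.fc2(x)
```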

We certainly aren’t limited to tuning only these hyperparameters. In fact, learning rate and batch size are commonly included in the list of hyperparameters to tune, but since we will be using a grid search, we’ll have to greatly reduce the number of tunable variables to keep the training time reasonable.

Next let’s define a dictionary for our search space, as well as one to save the parameters that give us the best results. Since we’re using grid search for our optimization, every combination of each hyperparameter listed will be used. You can just as easily add more values to the lists for each hyperparameter, but each additional value will greatly increase the runtime, so it’s recommended to start with the following values to save time.
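The candidate values below are an illustrative starting point (they include c1=48, c2=96, and l1=256, which turn out to be the winning combination later on):

```python
search_space = {
    "c1": [48, 64],
    "c2": [96, 128],
    "l1": [256, 512],
}

best_params = {"c1": None, "c2": None, "l1": None}
```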

Early Stopping

One component that will be important in our optimization process is the usage of early stopping. Since we’ll have multiple training runs, each taking a significant amount of time to complete, we will want to cut a run short if training performance doesn’t show improvement. There’s no sense in continuing to train a model that isn’t improving.

In essence, we will keep track of the lowest loss the model has produced after each epoch. We then define a tolerance, which specifies the number of epochs the model has to attain a better loss. If it doesn’t achieve a lower loss within the specified tolerance, training is terminated for that run, and we move on to the next combination of hyperparameters. If you’re like me, and you like to check in on the training process, we can log updates to the console and see when the early stopping counter increases by setting self.verbose = True. You can hard-code that into the EarlyStopping class here, or you can change the verbose value when we instantiate an EarlyStopping object during our optimization process.
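Here is a minimal sketch of such an EarlyStopping class, built from the behavior described above (tracking the best loss, a counter against a tolerance, saving weights on improvement, and optional verbose logging); the exact attribute names are my own:

```python
class EarlyStopping:
    def __init__(self, tolerance=3, path="checkpoint.pth", verbose=False):
        self.tolerance = tolerance
        self.path = path
        self.verbose = verbose
        self.best_loss = float("inf")
        self.counter = 0
        self.early_stop = False

    def __call__(self, val_loss, model):
        if val_loss < self.best_loss:
            # Improvement: remember the loss, save the weights, reset the counter
            self.best_loss = val_loss
            self.counter = 0
            torch.save(model.state_dict(), self.path)
        else:
            self.counter += 1
            if self.verbose:
                print(f"Early stopping counter: {self.counter}/{self.tolerance}")
            if self.counter >= self.tolerance:
                self.early_stop = True
```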

Image Augmentation

We have one last thing to do before setting up our hyperparameter optimization method, and it will help us squeeze out some extra performance and curb overfitting on our training data. Image augmentation is a technique that applies random transforms to images, essentially creating “new” artificial data. These transforms can be things like:

  • rotating an image a few degrees
  • flipping an image horizontally/vertically
  • cropping
  • slight brightness/hue shifts
  • random zooming

Including these random transforms will improve the model’s ability to generalize, since augmented images will be similar to, but distinct from, the original images. The contents and patterns will remain, but the array representation will be different.

PyTorch makes image augmentation easy with the torchvision.transforms module. If we have several transforms we would like to apply, we can chain them together with Compose. One thing to keep in mind is that image augmentation requires a little bit of computation per transform, and this is applied to every image in the dataset. Applying a lot of different random transforms to our dataset will increase the time it takes to train. So for now, let’s limit the transforms so our training doesn’t take too long. If you would like to add a few more, check out the PyTorch docs on transforming and augmenting images, and just add those into the Compose list.

Once we have the augmentation transforms picked, we can apply them to the dataset just as we applied normalization and the conversion of images to tensors.
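For example, with a horizontal flip and a small random rotation (the specific transforms and their parameters here are just one reasonable choice):

```python
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),   # small random rotation, in degrees
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

# Re-create the training set with augmentation; the test set keeps the plain transform
train_data = torchvision.datasets.CIFAR10(
    root="data", train=True, download=True, transform=train_transform)
```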

Now that we have image augmentation set up on our training data, we’re ready to set up our hyperparameter optimization method.

Defining the Optimization Method

We can create a class (HyperSearch) with attributes for the hyperparameter value configuration, verbose reporting setting, a report list so we can see how each configuration performed after optimization completes, and a variable to store the config with the best performance.

Next, we can create a method (still in our HyperSearch class) to perform the grid search and do a training run with each combination of hyperparameters. First we’ll configure EarlyStopping with tolerance=3, and set it to save the weights for each hyperparameter combination. If we have self.verbose set to True we can see which hyperparameter combination is currently training in the console.

After that, we define our model with the ConfigNet model we designed and pass in the l1, c1, and c2 values, as well as picking the loss function and optimizer and setting up our train and validation DataLoaders. We will keep the number of epochs low, because we don’t have the time, nor the desire, to train every combination fully. The goal is to get an idea of which combination will work best at classifying the dataset; then we can take that model and train it fully to see how well it can perform from a full training cycle.

Now, we define our training loop, mostly the same as before, except now we’ll save the loss of the train and test methods so that early_stopping can keep track of training progress (or lack thereof). Finally after each epoch, the results are saved to a report, and the value for the best loss is updated.

We can output the results of the entire hyperparameter optimization cycle in a nice table, where we’ll be able to see the hyperparameter configuration for each run, and the respective loss and accuracy.

So putting all of this code together, our HyperSearch class should look like this:
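Since the original code isn’t reproduced here, the following is a condensed sketch of how the pieces described above could fit together; the private method names, checkpoint naming, and report formatting are my own choices, while the constructor arguments (search_space, lrate, batch_sz, verbose) and the optimize() method match how the class is used later in the article:

```python
from itertools import product

class HyperSearch:
    def __init__(self, search_space, lrate, batch_sz, epochs=10, verbose=False):
        self.search_space = search_space
        self.lrate = lrate
        self.batch_sz = batch_sz
        self.epochs = epochs
        self.verbose = verbose
        self.report = []
        self.best_loss = float("inf")
        self.best_config = None

    def optimize(self):
        # Grid search: try every combination of c1, c2, and l1
        keys = list(self.search_space.keys())
        for values in product(*self.search_space.values()):
            self._run(dict(zip(keys, values)))
        self._print_report()

    def _run(self, config):
        if self.verbose:
            print(f"Training with config: {config}")
        early_stopping = EarlyStopping(
            tolerance=3, verbose=self.verbose,
            path=f"c1_{config['c1']}_c2_{config['c2']}_l1_{config['l1']}.pth")

        model = ConfigNet(c1=config["c1"], c2=config["c2"], l1=config["l1"]).to(device)
        loss_fn = nn.CrossEntropyLoss()
        optimizer = optim.Adam(model.parameters(), lr=self.lrate)
        train_loader = DataLoader(train_data, batch_size=self.batch_sz, shuffle=True)
        test_loader = DataLoader(test_data, batch_size=self.batch_sz, shuffle=False)

        test_loss, correct = float("inf"), 0
        for epoch in range(self.epochs):
            train(train_loader, model, loss_fn, optimizer)
            test_loss, correct = test(test_loader, model, loss_fn)
            early_stopping(test_loss, model)
            if early_stopping.early_stop:
                break

        accuracy = 100 * correct / len(test_loader.dataset)
        self.report.append((config, round(test_loss, 4), round(accuracy, 2)))
        if test_loss < self.best_loss:
            self.best_loss = test_loss
            self.best_config = config

    def _print_report(self):
        print(f"{'Config':<40}{'Loss':<10}{'Accuracy':<10}")
        for config, loss, acc in self.report:
            print(f"{str(config):<40}{loss:<10}{acc:<10}")
        print(f"\nBest config: {self.best_config} (loss {self.best_loss:.4f})")
```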

Time to tune!

Now we can tune our hyperparameters! By using %%time, we can see exactly how long the entire tuning process took once it completes. Let’s keep our learning rate lrate=0.001 and the batch size batch_sz=512, instantiate HyperSearch with the search_space we defined earlier, set verbose equal to True or False (whichever you prefer), and call the optimize() method to start.
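Assuming the constructor signature from the sketch above, the cell looks like this:

```python
%%time
lrate = 0.001
batch_sz = 512

hyper_search = HyperSearch(search_space, lrate, batch_sz, verbose=True)
hyper_search.optimize()
```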

Note: This took about 50 minutes to complete on my machine with an NVIDIA RTX 3070, so expect this to take around that long to complete if you’re on Colab using the provided GPU.

Once the entire optimization cycle is complete, you should get a table like this:

Image Source: Author

Results

Looking at the table, the best results came from Run 00 which had c1=48, c2=96, and l1=256. A loss of 0.84 and accuracy of 71.24% is a nice improvement, especially considering it was only 10 epochs!

So, now that we have the hyperparameters with the best performance over 10 epochs, let’s fine-tune this model! We can train it over many more epochs, and lower the learning rate slightly to try and squeeze out a little more performance. So first, let’s define the model we’d like to use, and set the batch size and learning rate:
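Something like the following, using the winning configuration; the exact “slightly lower” learning rate of 0.0005 is my own choice:

```python
model = ConfigNet(c1=48, c2=96, l1=256).to(device)  # best config from the search

batch_size = 512
learning_rate = 0.0005   # slightly lower than the 0.001 used during tuning

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=False)
```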

Finally, we can set epochs to 50, and change the path that we want to save the weights to. Let the training cycle run, and early stopping will terminate training if progress halts.
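A sketch of that final training cycle, reusing the EarlyStopping class with a new save path (the file name is illustrative):

```python
epochs = 50
early_stopping = EarlyStopping(tolerance=3, path="best_model.pth", verbose=True)

for epoch in range(epochs):
    print(f"Epoch {epoch + 1}\n-------------------------------")
    train(train_dataloader, model, loss_fn, optimizer)
    test_loss, correct = test(test_dataloader, model, loss_fn)
    early_stopping(test_loss, model)
    if early_stopping.early_stop:
        print("Early stopping triggered")
        break
```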

Early stopping should terminate training before hitting 50 epochs, and should achieve an accuracy of about 77%.

Image source: Author

Now that we’ve tuned hyperparameters, found our best configuration, and fine-tuned that model, it’s time to evaluate the model’s performance a little more in-depth.

Model Evaluation

In this case, our test dataset is actually our validation data. We will be reusing our validation data to evaluate the model, but usually you will want to use your real test data for model evaluation after hyperparameter tuning. Let’s load in our optimized model, prepare the test_dataloader without any image augmentation applied, and run test() to evaluate.
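Putting that together (the checkpoint path matches the one assumed in the fine-tuning sketch above):

```python
model = ConfigNet(c1=48, c2=96, l1=256).to(device)
model.load_state_dict(torch.load("best_model.pth"))

# Plain transform only -- no augmentation for evaluation
test_data = torchvision.datasets.CIFAR10(
    root="data", train=False, download=True, transform=transform)
test_dataloader = DataLoader(test_data, batch_size=512, shuffle=False)

test(test_dataloader, model, loss_fn)
```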

This should output the accuracy and loss:

Image source: Author

The overall performance is nice, but the performance for each class will be more useful to us. The following code will output our model’s accuracy for each class in the dataset:
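One straightforward way to compute this is to tally correct and total predictions per class:

```python
correct_per_class = [0] * len(class_names)
total_per_class = [0] * len(class_names)

model.eval()
with torch.no_grad():
    for X, y in test_dataloader:
        X, y = X.to(device), y.to(device)
        preds = model(X).argmax(1)
        for label, pred in zip(y, preds):
            total_per_class[label] += 1
            if label == pred:
                correct_per_class[label] += 1

for i, name in enumerate(class_names):
    acc = 100 * correct_per_class[i] / total_per_class[i]
    print(f"Accuracy for {name}: {acc:.1f}%")
```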

Executing this block will give us the following output:

Image source: Author

Our model performed quite well on the airplane, automobile, frog, ship, and truck classes. It’s also interesting to note that the classes it struggled with most are dog and cat, which were also the toughest classes for the fully connected model in the previous part of this series.

Confusion Matrix

We can gain even more insight on performance with a confusion matrix. Let’s set one up, then get a nice visualization.
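Building the matrix is just a matter of counting (actual, predicted) pairs over the test set:

```python
num_classes = len(class_names)
confusion_matrix = torch.zeros(num_classes, num_classes, dtype=torch.int64)

model.eval()
with torch.no_grad():
    for X, y in test_dataloader:
        X, y = X.to(device), y.to(device)
        preds = model(X).argmax(1)
        for actual, predicted in zip(y, preds):
            # Rows are the actual class, columns are the predicted class
            confusion_matrix[actual.item(), predicted.item()] += 1
```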

With confusion_matrix defined, we can use the Seaborn library to help us visualize it.
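A heatmap makes the diagonal (and the off-diagonal mistakes) easy to read:

```python
import seaborn as sns

plt.figure(figsize=(10, 8))
sns.heatmap(confusion_matrix.numpy(), annot=True, fmt="d",
            xticklabels=class_names, yticklabels=class_names, cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
```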

Image source: Author

The two dimensions of this table are the “actual” and “predicted” values. We want most of our data to align in that center diagonal, where actual and predicted are the same class. From the incorrect predictions we can see the model often confused cats and dogs, which were the two classes with the lowest accuracy.

Totals are nice to see, but precision and recall for each class will give us much more meaningful data. Let’s have a look at the recall per class first.

Recall per Class

Image source: Author

Precision per Class
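Precision is the diagonal divided by each column total (all images predicted as the class):

```python
# Precision: of all images predicted as a class, how many actually were that class?
precision = confusion_matrix.diag().float() / confusion_matrix.sum(dim=0).float()
for name, p in zip(class_names, precision):
    print(f"Precision for {name}: {p.item():.2f}")
```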

Image source: Author

Sample Model Predictions

Finally, let’s feed our model a few images and check out the predictions it makes. Let’s make a function to get our image data ready to view:
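A small helper like this will do; it reverses the (0.5, 0.5) normalization assumed earlier so the images display with their original colors:

```python
def imshow(img):
    """Un-normalize a tensor image and display it."""
    img = img / 2 + 0.5                          # reverse Normalize((0.5,...), (0.5,...))
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))   # (C, H, W) -> (H, W, C)
    plt.axis("off")
    plt.show()
```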

Now, we can get our test data prepared, and make another function to get a sample of n predictions.
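A sketch of such a function (the name sample_predictions and the grid layout via make_grid are my own choices):

```python
def sample_predictions(n):
    """Show n test images with their ground-truth and predicted labels."""
    images, labels = next(iter(test_dataloader))
    images, labels = images[:n], labels[:n]

    model.eval()
    with torch.no_grad():
        preds = model(images.to(device)).argmax(1).cpu()

    imshow(torchvision.utils.make_grid(images))
    print("Ground truth:", " | ".join(class_names[l] for l in labels))
    print("Predicted:   ", " | ".join(class_names[p] for p in preds))
```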

Call the function, passing the number of images you want to sample. The output will give us the ground truth and predicted class for each image starting from left to right.
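For example, to sample five images (the count is arbitrary):

```python
sample_predictions(5)
```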

Image source: Author
Image source: Author

Utilizing a convolutional network with hyperparameter tuning and image augmentation really helped improve the performance on the CIFAR-10 dataset! As always, thanks for reading, and I really hope you’ve learned a bit about PyTorch and CNNs for image classification. The full Notebook with all of the code presented here is available on GitHub.
