
PyTorch Image Classification Tutorial for Beginners

Fine-tuning pre-trained Deep Learning models in Python

"Not sure if this is supposed to be a lion or a cheetah..."
"Not sure if this is supposed to be a lion or a cheetah…"

This practical tutorial shows you how to classify images using a pre-trained Deep Learning model with the PyTorch framework.

What sets this beginner-friendly image classification tutorial apart from others is that we are not building and training a deep neural network from scratch. In practice, only a few people train neural networks from scratch. Instead, most Deep Learning practitioners use a pre-trained model and fine-tune it to a new task.

In practice, only a few people train neural networks from scratch.

The specific problem setting is to build a binary image classification model to classify images of cheetahs and lions based on a small dataset. For this purpose, we will fine-tune a pre-trained image classification model using PyTorch.

Sample images from the dataset [1].

This tutorial follows a basic Machine Learning workflow:

  1. Prepare and explore data
  2. Build a baseline
  3. Run experiments
  4. Make predictions

You can follow along in my related Kaggle Notebook.

Prerequisites and Setup

Ideally, you should have some familiarity with Python.

As this is a practical tutorial, we will only cover how to build an image classification model at a high level. We will not cover a lot of theory, such as how convolutional layers or backpropagation work. Sections where you can dig deeper once you feel comfortable with the topic are marked with this sign: ⚒️

If you want to supplement this guide with some theoretical background information, I recommend the free Kaggle Learn courses on Deep Learning and Computer Vision.

Let’s begin by importing PyTorch and other relevant libraries:

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

import cv2

import albumentations as A
from albumentations.pytorch import ToTensorV2

import numpy as np # data processing
import matplotlib.pyplot as plt # Data visualization
from tqdm import tqdm # Progress bar

The essential libraries are PyTorch (version 1.13.0) for deep learning, OpenCV (version 4.5.4) for image processing, and Albumentations (version 1.3.0) for data augmentation.

Step 1: Prepare and Explore Data

The first step is to become familiar with the data. For this tutorial, we will keep the exploratory data analysis step short.

First, we will load the data. The example dataset [1] has two folders with images – one folder for each class.

Example dataset [1] for binary image classification.

The following code goes through all subfolders and creates a Pandas dataframe containing the file name and its label.

import os
import pandas as pd 

root_dir = ... # Insert your data here
sub_folders = ["Cheetahs", "Lions"] # Insert your classes here
labels = [0, 1]

data = []

for s, l in zip(sub_folders, labels):
    for r, d, f in os.walk(os.path.join(root_dir, s)):
        for file in f:
            if ".jpg" in file:
                data.append((os.path.join(s,file), l))

df = pd.DataFrame(data, columns=['file_name','label'])

Insert your data here! – To follow along in this article, your dataset should look something like this:

Example dataset [1] for binary image classification. Insert your data here.

We have about 170 photographs: roughly 85 lions and 85 cheetahs (see remark in [1]). This is a very small but balanced dataset. It’s perfect for fine-tuning!

import seaborn as sns
sns.countplot(data = df, x = 'label');
Class distribution of sample dataset for image classification plotted with seaborn

To get a feeling for the dataset, it is always a good idea to plot a few samples:

fig, ax = plt.subplots(2, 3, figsize=(10, 6))

idx = 0
for i in range(2):
    for j in range(3):

        label = df.label[idx]
        file_path = os.path.join(root_dir, df.file_name[idx])

        # Read an image with OpenCV
        image = cv2.imread(file_path)

        # Convert the image to RGB color space.
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

        # Resize image
        image = cv2.resize(image, (256, 256))

        ax[i,j].imshow(image)
        ax[i,j].set_title(f"Label: {label} ({'Lion' if label == 1 else 'Cheetah'})")
        ax[i,j].axis('off')
        idx = idx+1

plt.tight_layout()
plt.show()
Sample images from the dataset [1].

Exploring a dataset like this can already give you some insights. For example, as you can see here, the images are not limited to the live animals but also include statues.


Before we go any further, let’s split the dataset into training and testing data. The training data will be used to build our model, and the test data will be a hold-out dataset to evaluate the final model’s performance on unseen data. In this example, we will set 10% of the data aside for testing.

from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, 
                                      test_size = 0.1, 
                                      random_state = 42)
Splitting the data into training and testing datasets (Inspired by scikit-learn)

Step 2: Build a Baseline

Next, we will build a baseline. A baseline consists of three key components:

  • a data pipeline for loading the images,
  • a model together with a loss function and an optimizer, and
  • a training pipeline, including a cross-validation strategy.

In this section, we will go through each component and finally wrap it up nicely.

Because training a Deep Learning model includes a lot of experimentation, we want to be able to switch out specific parts of the code quickly. Thus, we will try to make the following code as modular as possible and work with a configuration for tuning:

from types import SimpleNamespace

cfg = SimpleNamespace(**{})

We will add the configurable parameters as we go along.

Build a data pipeline for loading images

First, you must build a pipeline to load, preprocess and feed your images to the neural network in batches (instead of all at once). PyTorch provides two core classes you can use for this purpose:

  • Dataset class: Loads and preprocesses the dataset. You will need to customize this class for your purpose.
  • Dataloader class: Loads batches of data samples to the neural network.

First, you need to customize the Dataset class. Its key components are:

  • Constructor (__init__()): to load the dataset, e.g., as a Pandas DataFrame
  • __len__(): to get the length of the dataset. This usually requires only minimal adjustments, depending on how you pass in the dataset.
  • __getitem__(): to get a sample from the dataset by index. This is usually the part where you modify most of the code depending on any preprocessing you want to do.

Below you can find a template to customize the Dataset class.

class CustomDataset(Dataset):
    def __init__(self, df):
        # Initialize anything you need later here ...
        self.df = df
        self.X = ...
        self.y = ...
        # ...

    # Get the number of rows in the dataset
    def __len__(self):
        return len(self.df)

    # Get a sample of the dataset
    def __getitem__(self, idx):
        return [self.X[idx], self.y[idx]]

When loading your dataset, you can also perform any required preprocessing, such as transforms or image standardization. This happens in __getitem__().

In this example, we first load the image from the root directory (cfg.root_dir) with OpenCV and convert it to the RGB color space. Then we will apply basic transforms: Resize the image (cfg.image_size) and convert the image from a NumPy array to a tensor. Finally, we will normalize the values of the image to be in the [0, 1] range by dividing the values by 255.

cfg.root_dir = ... # Insert your data here
cfg.image_size = 256

class CustomDataset(Dataset):
    def __init__(self, 
                 cfg, 
                 df, 
                 transform=None, 
                mode = "val"):
        self.root_dir = cfg.root_dir
        self.df = df
        self.file_names = df['file_name'].values
        self.labels = df['label'].values

        if transform:
          self.transform = transform
        else:
          self.transform = A.Compose([
                              A.Resize(cfg.image_size, cfg.image_size), 
                              ToTensorV2(),
                           ])

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # Get file_path and label for index
        label = self.labels[idx]
        file_path = os.path.join(self.root_dir, self.file_names[idx])

        # Read an image with OpenCV
        image = cv2.imread(file_path)

        # Convert the image to RGB color space.
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

        # Apply augmentations        
        augmented = self.transform(image=image)
        image = augmented['image']

        # Normalize because ToTensorV2() doesn't normalize the image
        image = image/255

        return image, label

Next, we need a Dataloader to feed the samples of the Dataset to the neural network in batches because we (probably) don’t have enough RAM to feed all the images to the model at once.

You need to provide the Dataloader the instance of the Dataset you want to navigate, the size of the batches (cfg.batch_size), and the information on whether to shuffle the data.

cfg.batch_size = 32

example_dataset = CustomDataset(cfg, df)

example_dataloader = DataLoader(example_dataset, 
                              batch_size = cfg.batch_size, 
                              shuffle = True, 
                              num_workers=0,
                             )

The batch size should be fixed throughout the training and not be tuned [2]. Because the training speed is related to the batch size, we want to use the biggest batch size possible. Start with a batch size of 32 and then increase it in powers of two (64, 128, etc.) until you get a memory error; then use the largest batch size that still worked.

When you iterate over the Dataloader, it will give you batches of samples from the customized Dataset. Let’s retrieve the first batch for a sanity check:

for (image_batch, label_batch) in example_dataloader:
    print(image_batch.shape)
    print(label_batch.shape)
    break
torch.Size([32, 3, 256, 256])
torch.Size([32])

The Dataloader returns the image batch and a label batch. The image_batch is a tensor of the shape (32, 3, 256, 256). This is a batch of 32 (batch_size) images with the shape (3, 256, 256) (color_channels, image_height, image_width). The label_batch is a tensor of the shape (32). These are the corresponding labels to the 32 images.

Example output of the Dataloader with customized Dataset
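If you want to look at the actual content of a batch, you can also display one of its images (a small sketch, not part of the original notebook, assuming the image_batch and label_batch from the loop above):

# Optional sanity check: show the first image of the batch with its label.
# The image tensor has the shape (C, H, W) and values in [0, 1],
# so we permute the axes for Matplotlib.
plt.imshow(image_batch[0].permute(1, 2, 0))
plt.title(f"Label: {label_batch[0].item()}")
plt.axis('off')
plt.show()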

This section explained how to build a data pipeline. In a later section (see Setup a training pipeline), we will use the Dataset and Dataloader to create separate pipelines for training, validation, and testing.


Before we train the model, we need to split the training data again into a training and a validation dataset. Training the model on a dataset and then evaluating the model on the same data is a methodological mistake because the model just needs to memorize the labels of the seen samples. Thus, instead of generalizing, the model will overfit to the training data.

To avoid overfitting, let’s randomly partition the training data into training and validation sets with the train_test_split() function for now. This section will later be replaced with a cross-validation strategy.

X = train_df
y = train_df.label

train_df, valid_df, y_train, y_valid = train_test_split(X, 
                                                         y, 
                                                         test_size = 0.2, 
                                                         random_state = 42)
Splitting the training data again into training and validation (Inspired by scikit-learn)

With this split, we can now create Datasets and Dataloaders for the training and validation data:

train_dataset = CustomDataset(cfg, train_df)
valid_dataset = CustomDataset(cfg, valid_df)

train_dataloader = DataLoader(train_dataset, 
                          batch_size = cfg.batch_size, 
                          shuffle = True)

valid_dataloader = DataLoader(valid_dataset, 
                          batch_size = cfg.batch_size, 
                          shuffle = False)

Prepare the model

This is the part where we would learn about building a neural network in PyTorch. When I started learning about Deep Learning, I thought building neural networks was an important part of training Deep Learning models. But the reality is that this is what researchers do for us. We, the practitioners, get to lean back and use the final models for our purposes.

The researchers try different model architectures, such as convolutional neural networks (CNNs), and usually train image classification models on large baseline datasets, such as ImageNet [3]. We call these models backbones.

Expectation vs. reality: In practice, only a few people train neural networks from scratch for image classification

Fine-tuning a pre-trained neural network works so well because the first few layers often learn general features (such as edge detection).

⚒️ Of course, you should understand how neural networks work in general, including backpropagation, and how different layers, such as convolutional layers, work. However, to follow along in this practical tutorial, you don’t need to understand these details right now. Once you have finished this tutorial, you can fill in some theoretical gaps with the free Kaggle Learn courses on Deep Learning and Computer Vision.


Fantastic backbones and where to find them – Now, which of these pre-trained models should you choose, and where do you get them?

In this tutorial, we will use [timm](https://timm.fast.ai/) – a Deep Learning library containing a collection of state-of-the-art computer vision models created by Ross Wightman – to get pre-trained models. (You can use torchvision.models for pre-trained models, but I personally find it easier to switch out backbones during experimentation with timm.)

import timm

cfg.n_classes = 2
cfg.backbone = 'resnet18'

model = timm.create_model(cfg.backbone, 
                          pretrained = True, 
                          num_classes = cfg.n_classes)

There is a lot to unpack in this little piece of code. Let’s go step-by-step:

cfg.backbone = 'resnet18' – In this example, we use a ResNet [5] with 18 layers. ResNet stands for Residual Network; it is a type of CNN that uses so-called residual blocks.

⚒️We will skip over the details of ResNet and residual blocks. If you are interested in the technical details, you can dig deeper into this post, for example.

There are many different models in the ResNet family, such as ResNet18, ResNet34, etc., where the number stands for how many layers the network has. As a (very rough) rule of thumb: The higher the number of layers, the better the performance. You can print timm.list_models('*resnet*') to see what other models are available.
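For example, a quick look at the available ResNet variants could look like this (output truncated; this snippet is illustrative and not part of the original notebook):

# List a few of the ResNet variants that timm provides
print(timm.list_models('*resnet*')[:5])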

⚒️ Learn about different backbones for computer vision/image classification like ResNet, DenseNet, and EfficientNet.

pretrained = True – This means we want the weights of the model trained on ImageNet [3]. If this is set to False, you will only get the model’s architecture without the weights [6].

num_classes = cfg.n_classes – Because the model was pre-trained on ImageNet [3], you will get a classifier with the 1000 classes that are in ImageNet. Thus, you need to remove the ImageNet classifier and define how many classes you have in your problem [6]. If you set num_classes = 0, you will get the model without a classifier [6].


To check the output size, you can pass a sample batch X of random values with 3 color channels and the configured image size through the model [6].

X = torch.randn(cfg.batch_size, 3, cfg.image_size, cfg.image_size)
y = model(X)
print(y.shape)

This will output torch.Size([cfg.batch_size, cfg.n_classes]) [6].

Model inputs and outputs

Prepare loss function and optimizer

Next, to train a model, there are two key ingredients and one optional component:

  • a loss function (criterion),
  • an optimization algorithm (optimizer), and
  • optionally a learning rate scheduler.

Loss function – Common loss functions are the following:

  • Binary cross-entropy (BCE) loss for binary classification.
  • Categorical cross-entropy loss for multi-class classification.
  • Mean squared error (MSE) loss for regression.

Although we have a binary classification problem, we can use the categorical cross-entropy loss (it works for two classes as well). If you like, you can switch out the loss function for BCE.

criterion = nn.CrossEntropyLoss()
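If you want to try the BCE route, a minimal sketch could look like the following (assumptions: the model is created with a single output logit via num_classes = 1, and the integer labels are cast to floats; this variant is not used in the rest of the tutorial):

# Sketch of the BCE alternative for binary classification
criterion_bce = nn.BCEWithLogitsLoss()

# A model with a single output logit (pretrained weights are not needed just to illustrate the loss)
model_bce = timm.create_model(cfg.backbone, pretrained = False, num_classes = 1)

# Example with one batch from the example_dataloader defined earlier
X, y = next(iter(example_dataloader))
logits = model_bce(X)                              # shape: (batch_size, 1)
loss = criterion_bce(logits.squeeze(1), y.float())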

Optimizer – The optimization algorithm minimizes the loss function (in our case, the cross-entropy loss). There are many different optimizers available. Let’s use a popular one: Adam.

cfg.learning_rate = 1e-4

optimizer = torch.optim.Adam(
  model.parameters(), 
  lr = cfg.learning_rate, 
  weight_decay = 0,
 )

Learning rate scheduler – A learning rate scheduler adapts the value of the learning rate during the training process. Although you don’t have to use a learning rate scheduler, using one can help the algorithm converge faster: with a constant learning rate, a value that is too large can prevent you from finding the optimum, while a value that is too small can make training take too long to converge.

There are many different learning rate schedulers available, but Kaggle Grandmasters recommend using cosine decay as a learning rate scheduler for fine-tuning [2].

cfg.lr_min = 1e-5
cfg.epochs = 5

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
  optimizer, 
  T_max = np.ceil(len(train_dataloader.dataset) / cfg.batch_size) * cfg.epochs,
  eta_min = cfg.lr_min
)

T_max defines the half period of the cosine and should be equal to the maximum number of iterations (np.ceil(len(train_dataloader.dataset) / cfg.batch_size) * cfg.epochs), because we step the scheduler after every batch.

The resulting learning rates will look as follows over the course of a training run:

Cosine decay learning rate scheduler
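If you want to reproduce this plot yourself, you can step a throwaway optimizer and scheduler pair once per training iteration and record the learning rate (a small sketch, not part of the original notebook):

# Sketch: record the learning rate for every training iteration
n_iters = int(np.ceil(len(train_dataloader.dataset) / cfg.batch_size) * cfg.epochs)

_optimizer = torch.optim.Adam(model.parameters(), lr = cfg.learning_rate)
_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(_optimizer, T_max = n_iters, eta_min = cfg.lr_min)

lrs = []
for _ in range(n_iters):
    lrs.append(_scheduler.get_last_lr()[0])
    _optimizer.step()   # dummy step so the scheduler can advance without a warning
    _scheduler.step()

plt.plot(lrs)
plt.xlabel('Iteration')
plt.ylabel('Learning rate')
plt.show()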

Metric – While we’re at it, let’s also define a metric to evaluate the model’s overall performance. Again, there are many different metrics. For this example, we will use accuracy as the metric:

from sklearn.metrics import accuracy_score

def calculate_metric(y, y_pred):
  metric = accuracy_score(y, y_pred)
  return metric

Don’t confuse the metric with the loss function. The loss function is used to optimize the model during training, while the metric measures the model’s performance after the training.

⚒️ Learn about different metrics and which ones are suited for which problems.

Setup a training pipeline

This is probably the most complex but also the most interesting part of this tutorial. Are you ready?

A model is typically trained over several passes through the training data. One full pass is called an epoch. Training from scratch usually requires many epochs, while fine-tuning requires only a few (roughly 5 to 10) epochs.

In each epoch, the model is trained on the full training data and then validated on the full validation data. We will now define two functions: one to train the model for an epoch (train_one_epoch()) and one to validate it for an epoch (validate_one_epoch()).

Below you can see the training function:

cfg.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def train_one_epoch(dataloader, model, optimizer, scheduler, cfg):
    # Training mode
    model.train()

    # Init lists to store y and y_pred
    final_y = []
    final_y_pred = []
    final_loss = []

    # Iterate over data
    for step, batch in tqdm(enumerate(dataloader), total=len(dataloader)):
        X = batch[0].to(cfg.device)
        y = batch[1].to(cfg.device)

        # Zero the parameter gradients
        optimizer.zero_grad()

        with torch.set_grad_enabled(True):
            # Forward: Get model outputs
            y_pred = model(X)

            # Forward: Calculate loss
            loss = criterion(y_pred, y)

            # Convert y and y_pred to lists
            y =  y.detach().cpu().numpy().tolist()
            y_pred =  y_pred.detach().cpu().numpy().tolist()

            # Extend original list
            final_y.extend(y)
            final_y_pred.extend(y_pred)
            final_loss.append(loss.item())

            # Backward: Optimize
            loss.backward()
            optimizer.step()

        scheduler.step()

    # Calculate statistics
    loss = np.mean(final_loss)
    final_y_pred = np.argmax(final_y_pred, axis=1)
    metric = calculate_metric(final_y, final_y_pred)

    return metric, loss

Let’s go through it step-by-step:

  1. Set the model to the training mode. The model can also be in evaluation mode. This mode affects the behavior of layers such as [Dropout](https://pytorch.org/docs/stable/_modules/torch/nn/modules/dropout.html) and [BatchNorm](https://pytorch.org/docs/stable/_modules/torch/nn/modules/batchnorm.html).
  2. Iterate over the training data in small batches. The samples and labels need to be moved to GPU if you use one for faster training (cfg.device).
  3. Clear the last error gradient of the optimizer.
  4. Do a forward pass of the input through the model.
  5. Calculate the loss for the model output.
  6. Backpropagate the error through the model.
  7. Update the model to reduce the loss.
  8. Step the learning rate scheduler.
  9. Calculate the loss and metric for statistics. Because the predictions are tensors on the GPU, just like the inputs, we need to detach them from the automatic differentiation graph, move them to the CPU, and convert them to NumPy arrays (here: lists).

Next, we define the validation function as shown below:

def validate_one_epoch(dataloader, model, cfg):
    # Validation mode
    model.eval()

    final_y = []
    final_y_pred = []
    final_loss = []

    # Iterate over data
    for step, batch in tqdm(enumerate(dataloader), total=len(dataloader)):
        X = batch[0].to(cfg.device)
        y = batch[1].to(cfg.device)

        with torch.no_grad():
            # Forward: Get model outputs
            y_pred = model(X)

            # Forward: Calculate loss
            loss = criterion(y_pred, y)  

            # Convert y and y_pred to lists
            y =  y.detach().cpu().numpy().tolist()
            y_pred =  y_pred.detach().cpu().numpy().tolist()

            # Extend original list
            final_y.extend(y)
            final_y_pred.extend(y_pred)
            final_loss.append(loss.item())

    # Calculate statistics
    loss = np.mean(final_loss)
    final_y_pred = np.argmax(final_y_pred, axis=1)
    metric = calculate_metric(final_y, final_y_pred)

    return metric, loss

Let’s go through it step-by-step again:

  1. Set the model to the evaluation mode.
  2. Iterate over the validation data in small batches. The samples and labels need to be moved to GPU if you use one for faster training.
  3. Do a forward pass of the input through the model.
  4. Calculate the loss and metric for statistics.

At first glance, training and validating an epoch looks similar. Let’s look at a code comparison to make the differences clearer:

Screenshot of side-by-side code comparison in BeyondCompare of training and validation code in PyTorch

You can see the following differences:

  • The model has to be in training or evaluation mode.
  • For training the model, we need an optimizer and an optional scheduler. For validation, we only need the model.
  • The gradient calculation is only active for training. For validation, we don’t need it.

Cross-validation strategy

Now, we are not yet done with the training pipeline. Earlier, we divided the training data into training and validation data. But partitioning the available data into two fixed sets limits the number of training samples.

Instead, we will use a cross-validation strategy: we split the training data into k folds and train the model in k separate iterations. In each iteration, the model is trained on k-1 folds and validated on the remaining fold, and the folds rotate at every iteration, as shown below:

Splitting the training data again into training and validation (Inspired by scikit-learn)

In this example, we are using [StratifiedKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) to create the splits. You could use KFold instead, but StratifiedKFold has the advantage that it preserves the class distribution in each fold.

from sklearn.model_selection import StratifiedKFold

cfg.n_folds = 5

# Create a new column for cross-validation folds
df["kfold"] = -1

# Initialize the kfold class
skf = StratifiedKFold(n_splits=cfg.n_folds)

# Fill the new column
for fold, (train_, val_) in enumerate(skf.split(X = df, y = df.label)):
    df.loc[val_, "kfold"] = fold

for fold in range(cfg.n_folds):
    train_df = df[df.kfold != fold].reset_index(drop=True)
    valid_df = df[df.kfold == fold].reset_index(drop=True)

Adding data augmentation

When the difference between the training and validation metric is significant, this indicates that the model is overfitting to the training data. Overfitting occurs when a model is trained on only a few examples and learns irrelevant details or noise from the training data. This negatively affects the model’s performance when it’s presented with new examples. As a result, the model struggles to generalize on new images.

To overcome overfitting during the training process, you can use data augmentation. Data augmentation generates additional training data by randomly transforming existing images. This technique exposes the model to more aspects of the data, helping it to generalize better.

We can use some prepared data augmentations from the albumentations package, such as:

  • Rotating images (A.Rotate())
  • Horizontal flipping (A.HorizontalFlip())
  • Cutout [4] (A.CoarseDropout())

Earlier, we defined a basic transform to resize the image and convert it to a tensor. We will continue to use it for the validation and testing datasets because they don’t need any augmentations. For the training dataset, we create a new transform, transform_soft, which applies the three augmentations above in addition to the resizing and the conversion to a tensor.

transform_soft = A.Compose([A.Resize(cfg.image_size, cfg.image_size),
                             A.Rotate(p=0.6, limit=[-45,45]),
                             A.HorizontalFlip(p = 0.6),
                             A.CoarseDropout(max_holes = 1, max_height = 64, max_width = 64, p=0.3),
                             ToTensorV2()])

You can control the percentage of images the augmentations are applied to with the parameter p.

If we visualize a few samples from the augmented dataset, we can see that the three augmentations are applied successfully:

  • Rotation in images 0, 1, 2, 4
  • Horizontal flip is difficult to detect if you don’t know the original image, but we can see that image 2 must be horizontally flipped
  • Cutout (coarse dropout) in images 1 and 4
Augmented training dataset
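To create a visualization like this yourself, you can plot a few samples from a Dataset that uses transform_soft (a short sketch based on the plotting code from Step 1; because the augmentations are random, your samples will look different):

# Sketch: plot a few augmented training samples
augmented_dataset = CustomDataset(cfg, train_df, transform = transform_soft)

fig, ax = plt.subplots(1, 5, figsize=(15, 3))
for i in range(5):
    image, label = augmented_dataset[i]
    # The Dataset returns a (C, H, W) tensor in [0, 1]; permute the axes for Matplotlib
    ax[i].imshow(image.permute(1, 2, 0))
    ax[i].set_title(f"Image {i} (label: {label})")
    ax[i].axis('off')
plt.tight_layout()
plt.show()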

⚒️ Next, you can review and add other image augmentation techniques, e.g., Mixup and Cutmix, to your pipeline.

Cutout, Mixup, and Cutmix: Implementing Modern Image Augmentations in PyTorch

Putting it all together

Now that we have discussed each component of the baseline from the data pipeline to the model with loss function and optimizer, to the training pipeline, including a cross-validation strategy, we can put it all together as shown in the image below:

Flowchart of the baseline code

We will iterate over each fold of our cross-validation strategy. Within each fold, we set up a data pipeline for the training and validation data and a model with loss function and optimizer. Then for each epoch, we will train and validate the model.


Before we touch anything, let’s set ourselves up for success and fix the random seeds to ensure reproducible results.

import random

def set_seed(seed=1234):
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)

    # In general seed PyTorch operations
    torch.manual_seed(seed)

    # If you are using CUDA on 1 GPU, seed it
    torch.cuda.manual_seed(seed)

    # If you are using CUDA on more than 1 GPU, seed them all
    torch.cuda.manual_seed_all(seed)

    # Certain operations in Cudnn are not deterministic, and this line will force them to behave!
    torch.backends.cudnn.deterministic = True 

    # Disable the inbuilt cudnn auto-tuner that finds the best algorithm to use for your hardware.
    torch.backends.cudnn.benchmark = False

Next, we will write a fit() function that fits the model for all epochs. The function iterates over the number of epochs, while the training and validation functions contain inner loops that iterate over the batches in the training and validation datasets, as discussed in the section about the training pipeline.

cfg.seed = 42

def fit(model, optimizer, scheduler, cfg, train_dataloader, valid_dataloader=None):
    acc_list = []
    loss_list = []
    val_acc_list = []
    val_loss_list = []

    for epoch in range(cfg.epochs):
        print(f"Epoch {epoch + 1}/{cfg.epochs}")

        set_seed(cfg.seed + epoch)

        acc, loss = train_one_epoch(train_dataloader, model, optimizer, scheduler, cfg)

        if valid_dataloader:
            val_acc, val_loss = validate_one_epoch(valid_dataloader, model, cfg)

        print(f'Loss: {loss:.4f} Acc: {acc:.4f}')
        acc_list.append(acc)
        loss_list.append(loss)

        if valid_dataloader:
            print(f'Val Loss: {val_loss:.4f} Val Acc: {val_acc:.4f}')
            val_acc_list.append(val_acc)
            val_loss_list.append(val_loss)

    return acc_list, loss_list, val_acc_list, val_loss_list, model
Log of the fit function

For visualization purposes, we will also create plots of the loss and accuracy on the training and validation sets:

def visualize_history(acc, loss, val_acc, val_loss):
    fig, ax = plt.subplots(1,2, figsize=(12,4))

    ax[0].plot(range(len(loss)), loss,  color='darkgrey', label = 'train')
    ax[0].plot(range(len(val_loss)), val_loss,  color='cornflowerblue', label = 'valid')
    ax[0].set_title('Loss')

    ax[1].plot(range(len(acc)), acc,  color='darkgrey', label = 'train')
    ax[1].plot(range(len(val_acc)), val_acc,  color='cornflowerblue', label = 'valid')
    ax[1].set_title('Metric (Accuracy)')

    for i in range(2):
        ax[i].set_xlabel('Epochs')
        ax[i].legend(loc="upper right")
    plt.show()
Plotted history of metric and loss over epochs

When we combine everything, it will look as follows:

for fold in range(cfg.n_folds):
    train_df = df[df.kfold != fold].reset_index(drop=True)
    valid_df = df[df.kfold == fold].reset_index(drop=True)

    train_dataset = CustomDataset(cfg, train_df, transform = transform_soft)
    valid_dataset = CustomDataset(cfg, valid_df)

    train_dataloader = DataLoader(train_dataset, 
                              batch_size = cfg.batch_size, 
                              shuffle = True, 
                              num_workers = 0,
                             )
    valid_dataloader = DataLoader(valid_dataset, 
                              batch_size = cfg.batch_size, 
                              shuffle = False, 
                              num_workers = 0,
                             )

    model = timm.create_model(cfg.backbone, 
                              pretrained = True, 
                              num_classes = cfg.n_classes)

    model = model.to(cfg.device)

    criterion = nn.CrossEntropyLoss()

    optimizer = torch.optim.Adam(model.parameters(), 
                                 lr = cfg.learning_rate, 
                                 weight_decay = 0,
                                )

    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, 
                                                           T_max= np.ceil(len(train_dataloader.dataset) / cfg.batch_size) * cfg.epochs,
                                                           eta_min=cfg.lr_min)

    acc, loss, val_acc, val_loss, model = fit(model, optimizer, scheduler, cfg, train_dataloader, valid_dataloader)

    visualize_history(acc, loss, val_acc, val_loss)

Step 3: Run Experiments

Data Science is an experimental science. Thus, the aim of this step is to find the configuration of hyperparameters, data augmentations, model backbone, and cross-validation strategy that achieves the best performance (or whatever your objective may be – e.g., the best trade-off between performance and inference time).

Setup experiment tracking

Before jumping into this step, take a minute to think about how you will track your experiments. Experiment tracking can be as simple as writing everything down with pen and paper. Alternatively, you can track everything in a spreadsheet or even use an experiment tracking system to automate the whole process.

Intro to MLOps: Experiment Tracking for Machine Learning

If you are an absolute beginner, I recommend starting simple and tracking your experiments manually in a spreadsheet at first. Open an empty spreadsheet and create columns for all inputs, such as:

  • backbone,
  • learning rate,
  • epochs,
  • augmentations, and
  • image size

and for all outputs you want to track, such as the loss and metrics for training and validation.

The resulting spreadsheet could look something like this:

Example spreadsheet to track experiments for beginners
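If you prefer to stay in Python, the same manual tracking can be done with a small Pandas DataFrame that you append a row to after every experiment and save to a CSV file (a minimal sketch with hypothetical column names; not part of the original notebook):

# Sketch: manual experiment tracking in a Pandas DataFrame
experiments = pd.DataFrame(columns=[
    "backbone", "learning_rate", "epochs", "image_size", "augmentations",
    "train_loss", "train_acc", "val_loss", "val_acc",
])

# Append one row per experiment (here: the last epoch of the last fold of the baseline run)
experiments.loc[len(experiments)] = [
    cfg.backbone, cfg.learning_rate, cfg.epochs, cfg.image_size, "rotate/hflip/cutout",
    loss[-1], acc[-1], val_loss[-1], val_acc[-1],
]

experiments.to_csv("experiments.csv", index=False)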

⚒️ Once you feel comfortable with the Deep Learning techniques, you can level up by implementing an experiment tracking system into your pipeline to automate experiment tracking, such as Weights & Biases, Neptune, or MLFlow.

Experimentation and hyperparameter tuning

Now that you have an experiment tracking system, let’s run some experiments. You can start by tweaking the following hyperparameters:

  • Number of epochs: range of 2 to 10
  • Learning rate: range of 0.0001 to 0.001
  • Image size: range of 128 to 1024
  • Backbone: Try different backbones. First, try deeper models from the ResNet family (print timm.list_models('*resnet*') to see what other models are available), then try a different backbone family like timm.list_models('*densenet*') or timm.list_models('*efficientnet*')
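As an illustration, a manual experiment loop over a few backbones could look like the following sketch (an assumption rather than the original notebook code; it reuses the data pipeline, fit() function, and configuration from above and logs the best validation accuracy per backbone):

# Sketch: compare a few backbones with the existing training setup
results = {}

for backbone in ['resnet18', 'resnet34', 'efficientnet_b0']:
    cfg.backbone = backbone

    model = timm.create_model(cfg.backbone, pretrained = True, num_classes = cfg.n_classes)
    model = model.to(cfg.device)

    optimizer = torch.optim.Adam(model.parameters(), lr = cfg.learning_rate, weight_decay = 0)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, 
                                                           T_max = np.ceil(len(train_dataloader.dataset) / cfg.batch_size) * cfg.epochs, 
                                                           eta_min = cfg.lr_min)

    acc, loss, val_acc, val_loss, model = fit(model, optimizer, scheduler, cfg, train_dataloader, valid_dataloader)
    results[backbone] = max(val_acc)

print(results)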

⚒️ Once you feel comfortable with the Deep Learning techniques, you can level up by automating this step with Optuna or Weights & Biases.


Now it’s your turn! – Tweak a few knobs and see how the model’s performance changes. Once you’re happy with the results, move on to the next step.

Example log of experiments

Step 4: Make Predictions (Inference)

Drum roll, please! Now that we have found the configuration that will give us the best model, we want to put it to good use.

First, let’s fine-tune the model with the optimal configuration on the full dataset to take advantage of every data sample. We don’t split the data into training and validation data in this step. Instead, we only have one big training dataset.

train_df = df.copy()

train_dataset = CustomDataset(cfg, train_df, transform = transform_soft)

train_dataloader = DataLoader(train_dataset, 
                          batch_size = cfg.batch_size, 
                          shuffle = True, 
                          num_workers = 0,
                         )

But the rest of the training pipeline stays the same.

model = timm.create_model(cfg.backbone, 
                          pretrained = True, 
                          num_classes = cfg.n_classes)

model = model.to(cfg.device)

criterion = nn.CrossEntropyLoss()

optimizer = torch.optim.Adam(model.parameters(), 
                             lr = cfg.learning_rate, 
                             weight_decay = 0,
                            )

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, 
                                                       T_max= np.ceil(len(train_dataloader.dataset) / cfg.batch_size) * cfg.epochs,
                                                       eta_min=cfg.lr_min)

acc, loss, val_acc, val_loss, model = fit(model, optimizer, scheduler, cfg, train_dataloader)

Inference – And finally, we will use the model to predict the hold-out test set.

test_dataset = CustomDataset(cfg, test_df)

test_dataloader = DataLoader(test_dataset, 
                          batch_size = cfg.batch_size, 
                          shuffle = False, 
                          num_workers = 0,
                         )

dataloader = test_dataloader

# Validation mode
model.eval()

final_y = []
final_y_pred = []

# Iterate over data
for step, batch in tqdm(enumerate(dataloader), total=len(dataloader)):
    X = batch[0].to(cfg.device)
    y = batch[1].to(cfg.device)

    with torch.no_grad():
        # Forward: Get model outputs
        y_pred = model(X)

        # Convert y and y_pred to lists
        y =  y.detach().cpu().numpy().tolist()
        y_pred =  y_pred.detach().cpu().numpy().tolist()

        # Extend original list
        final_y.extend(y)
        final_y_pred.extend(y_pred)

# Calculate statistics
final_y_pred_argmax = np.argmax(final_y_pred, axis=1)
metric = calculate_metric(final_y, final_y_pred_argmax)

test_df['prediction'] = final_y_pred_argmax

Below you can see the results of our model:

Predictions
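If you prefer numbers over a plot, you can also print the test accuracy and compare a few predictions with the true labels directly (a quick sketch):

# Sketch: inspect the test metric and a few individual predictions
print(f"Test accuracy: {metric:.4f}")
print(test_df[['file_name', 'label', 'prediction']].head())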

Summary and Next Steps

This tutorial showed you how to fine-tune a pre-trained image classification model for your specific task, evaluate it, and perform inference on unseen data using the PyTorch framework in Python.

Once you feel comfortable, you can review the sections marked with ⚒️ to level up to an intermediate level.

Intermediate Deep Learning with Transfer Learning

Enjoyed This Story?

Subscribe for free to get notified when I publish a new story.


Find me on LinkedIn, Twitter, and Kaggle!

References

Dataset

[1] MikołajFish99 (2023). Lions or Cheetahs – Image Classification in Kaggle Datasets.

License: According to the original image source (Open Images Dataset V6) the annotations are licensed by Google LLC under CC BY 4.0 license, and the images are listed as having a CC BY 2.0 license.

Note that the original dataset contains 200 images, with 100 images of each class. However, the dataset needed some cleaning, including removing images of other animals; thus, the final dataset is slightly smaller. To keep this tutorial short, we skip the data cleaning process here.

Images

If not otherwise stated, all images are created by the author.

Literature

[2] S. Bhutani with H20.ai (2023). Best Practises for Training ML Models | @ChaiTimeDataScience #160 presented on YouTube in January 2023.

[3] Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009, June). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 248–255). IEEE.

[4] DeVries, T., & Taylor, G. W. (2017). Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.

[5] K. He, X. Zhang, S. Ren, & J. Sun (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).

[6] timmdocs (2022). Pytorch Image Models (timm) (accessed April 10th, 2023).

