How to set up a deep learning project in PyTorch

The 3 steps to setting up a deep learning project in PyTorch

Photo by Graham Holtshausen on Unsplash

In this article, I will explain how to set up a deep learning project in PyTorch.

Any PyTorch deep learning project usually consists of 3 essential steps:

  1. Setting up the dataset
  2. Creating the data loaders
  3. Creating the training, validation, and testing loops

I won’t go through how to build the actual model because it is fairly simple, it varies from task to task, and there are already many online resources that cover it.

1. Setting up the dataset

In order to create a dataset that can work with the data loader to create the batches of data that are sent into the model, PyTorch requires that you create a specific class for your dataset and override two of its functions: the __getitem__() function and the __len__() function.

In most Machine Learning projects, there is often some preprocessing to be done on the data before it can be sent into the model. For example, in Natural Language Processing tasks, this may involve tokenizing the text and converting it into a numerical format. To handle this, I create functions in the same class that do all of these steps, like a build_dataset() or preprocess() function, and I store all my preprocessed data in a separate file.
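For example, a build_dataset() method for an NLP task might look something like this sketch (tokenize(), self.vocab, and self.raw_sentences are hypothetical helpers you would write yourself, not part of PyTorch):

class MyDataset:
    ...
    def build_dataset(self):
        ## tokenize each raw sentence and map every token to an
        ## integer id using a vocabulary you built beforehand
        self.input = [[self.vocab[token] for token in tokenize(sentence)]
                      for sentence in self.raw_sentences]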

The __getitem__ and __len__ functions are simply the functions the data loader uses to build the mini-batches of data. In the next section, you will see that the DataLoader requires, as a parameter, a dataset object whose class properly overrides __getitem__ and __len__, so it is very important that you implement these two functions properly.

The __getitem__ function is fairly straightforward. Imagine your dataset as being indexed. For example, if you were doing an image recognition task, each image would be indexed from 0 to the number of images in the dataset minus one. All __getitem__ has to do is, given an index, return the x, y pair (or input-output pair) of data that exists at that particular index. Because of this, you should store your data in an indexed data structure, like a list, so that you can easily access elements at a particular index. The __len__ function simply returns the size of the dataset. Taking the previous example of storing your data as a list, you would just return the length of the list.

from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, inputs, labels):
        self.input = inputs
        self.label = labels
    def __getitem__(self, index):
        ## return the (input, label) pair at this index
        return self.input[index], self.label[index]
    def __len__(self):
        return len(self.input)

It is important to note that the data returned from __getitem__ does not necessarily have to be a tuple. You can return your data in any type of data structure. If your model takes multiple inputs, you could return a dictionary; if it only takes one input, a tuple works fine. It really does not matter, because in the training loop you simply have to retrieve the inputs correctly from whatever structure you used.

class MyDataset(Dataset):
    def __init__(self, input1, input2, input3, labels):
        self.input1 = input1
        self.input2 = input2
        self.input3 = input3
        self.label = labels
    def __getitem__(self, index):
        return {"Input_1": self.input1[index],
                "Input_2": self.input2[index],
                "Input_3": self.input3[index],
                "Label": self.label[index]}
    def __len__(self):
        return len(self.input1)

2. Creating the DataLoaders

A DataLoader is a tool that PyTorch provides to make it easy to create batches of data and send them into the model. It uses the dataset class we created in the last part, along with a few other parameters, to build the batches. Here is how you code one:

from torch.utils.data import DataLoader

data = MyDataset(inputs, labels)  ## inputs and labels are your raw data
data.build_dataset()  ## a helper function to do the preprocessing
dataloader = DataLoader(dataset=data, batch_size=batchsize, shuffle=True)

One very important feature of DataLoaders is that they let us apply a function to each batch of data before the batch is sent to the model. They do this through the collate_fn argument, which we can implement as a collate class: a class that takes in one batch at a time and can modify it or perform any batch-specific operations on the data. Here is how it works:

class MyCollate:
    def __call__(self, batch):
        ## do whatever operations you want here,
        ## then return the new batch of data
        return batch

In order to correctly create a collate class, you have to override the __call__() function, which takes in a batch of data and must return a new, modified batch. The input batch is simply a list that the data loader builds by calling your dataset’s __getitem__() function once per item, so each element of the list is one x, y pair from your dataset. The output has to be structured a little differently. When you send a batch into the model, with a batch size of 16 for example, the input tensor should contain the 16 individual inputs stacked together, and the label tensor should contain the 16 individual labels. Because we are overriding the collate function, we have to do this restructuring manually; if we didn’t override it, the DataLoader would do it automatically through the default collate function.

The input to the collate function would be structured something like this:

(input1, label1), 
(input2, label2), 
(input3, label3), 
(input4, label4), 
(input5, label5)

if the batch size was 5. The output that we have to return from the collate function should be structured something like this:

(input1, input2, input3, input4, input5), 
(label1, label2, label3, label4, label5)

Suppose I was doing an NLP task and wanted to pad all the tokenized sentences in a batch to the same length (presumably the length of the longest sentence in the batch). I would do it this way:

import torch

class MyCollate:
    def __call__(self, batch):
        ## find the length of the longest sentence in the batch
        max_len = 0
        for item in batch:
            max_len = max(max_len, len(item[0]))
        new_input = []
        new_label = []
        for item in batch:
            ## pad every input in the batch to max_len
            ## (pad is a function you should create yourself)
            val = pad(item[0], max_len)
            new_input.append(val)
            new_label.append(item[1])
        new_input = torch.tensor(new_input)
        new_label = torch.tensor(new_label)
        return new_input, new_label

To integrate the collate function with your data loader, simply do this:

data = MyDataset(inputs, labels)
data.build_dataset()
dataloader = DataLoader(dataset=data, batch_size=batchsize, shuffle=True, collate_fn=MyCollate())

The shuffle=True parameter randomly shuffles the data in the dataset before creating the batches. It is common practice to shuffle the data for the train data loader, but not for the validation and test data loaders. One thing to remember about all PyTorch projects is that a model can only accept input in the form of tensors, which is why I converted both the input and the label lists to tensors before returning them in the code snippet above.

Something important to remember is that the data structure returned by __getitem__ and by the collate’s __call__ must be the same type and have the same number of fields. If you return a dictionary with 5 key-value pairs from __getitem__, you must return the same keys from the collate; only the values will differ, because you will have done some preprocessing on the batch. As I mentioned before, the structure of the returned batch is different. If you used a dictionary in __getitem__, the input to the __call__ function will be

{keys:values}, 
{keys:values}, 
{keys:values}, 
{keys:values}, 
{keys:values}

if the batch size was 5. The keys of every item in the list will be the same. The output you return should instead be a single dictionary in which each key maps to all the values for that key, gathered across the batch:

{key1: all values for key1,
 key2: all values for key2,
 key3: all values for key3,
 key4: all values for key4,
 key5: all values for key5}
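For example, here is a minimal sketch of a collate for dictionary batches (MyDictCollate is my own name, and I assume each gathered list of values can be converted with torch.tensor):

import torch

class MyDictCollate:
    def __call__(self, batch):
        ## batch is a list of dictionaries that all share the same keys;
        ## gather the values for each key into one batched tensor
        new_batch = {}
        for key in batch[0]:
            new_batch[key] = torch.tensor([item[key] for item in batch])
        return new_batch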

In most machine learning projects, you usually want three different data loaders: one for the training loop, one for the validation loop, and one for the testing loop. Implementing this in PyTorch is very simple:

from torch.utils.data import random_split

data = MyDataset(inputs, labels)
data.build_dataset()
train_len = int(0.7 * len(data))
test_len = len(data) - train_len
train_data, test_data = random_split(data, (train_len, test_len))
val_len = int(0.33 * len(test_data))
test_len = len(test_data) - val_len
test_data, val_data = random_split(test_data, (test_len, val_len))
train_loader = DataLoader(dataset=train_data, batch_size=batchsize, shuffle=True, collate_fn=MyCollate())
test_loader = DataLoader(dataset=test_data, batch_size=batchsize, shuffle=False, collate_fn=MyCollate())
val_loader = DataLoader(dataset=val_data, batch_size=batchsize, shuffle=False, collate_fn=MyCollate())

You simply split the original dataset into however many parts you want, with whatever percentages you want, using the random_split function that PyTorch provides. Above, I split the dataset into 70% train, 20% test, and 10% validation. After you create each respective dataset, you can simply pass each one to its own data loader and use it as usual.

3. Creating the training, validation, and testing loops

The training, validation, and testing loops in PyTorch are fairly simple to write and very similar to one another. Here is how you can create a training loop:

model = model.to(device)
model.train()
for index, batch in enumerate(train_loader):
    x = batch[0].to(device)
    y = batch[1].to(device)
    optimizer.zero_grad()        ## clear gradients from the previous step
    output = model(x)            ## forward pass
    loss = criterion(output, y)  ## compute the loss
    loss.backward()              ## backpropagate
    optimizer.step()             ## update the weights

Here is how you can create a validation/test loop (they are the same thing but with different data loaders).

model = model.to(device)
model.eval()
with torch.no_grad():
    for index, batch in enumerate(val_loader):  ## or test_loader
        x = batch[0].to(device)
        y = batch[1].to(device)
        output = model(x)
        loss = criterion(output, y)

The optimizer is whichever optimizer you choose (I usually pick Adam), and criterion is what I usually name my loss function. You have to initialize both beforehand; here is how:

import torch.optim as optim
import torch.nn as nn

learning_rate = 1e-3
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.BCEWithLogitsLoss()  ## you can use any loss function

We have to wrap the validation/testing loop in with torch.no_grad() so that PyTorch does not calculate gradients for backpropagation, because we are not doing backpropagation here. If you leave this out, your program will still work, but it will consume much more memory because PyTorch will compute and store gradients that we never use.

One important thing to point out about the training and testing loops: even though I wrote x = batch[0] and y = batch[1], you don’t need to structure your model input this way (as a tuple), especially if you used a different data structure. You just have to retrieve the data correctly from whatever structure you used in __getitem__ and in the collate’s __call__. Also note that certain loss functions require the output and the y/labels to have specific shapes/dimensions, so after you get the output from the model, make sure to reshape it before sending it into the loss function (the criterion, as I’ve named it above).
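For example, with nn.BCEWithLogitsLoss, the output and the labels must have the same shape and the labels must be floats, so a hypothetical fix could look like this:

output = model(x)                    ## shape: (batch_size, 1)
output = output.reshape(-1)          ## flatten to shape: (batch_size,)
loss = criterion(output, y.float())  ## BCEWithLogitsLoss needs float labels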

One more important thing to remember is that you need to call .to(device) on the model and on every tensor you feed into it (the inputs and the labels). This puts your data onto your specified device. If you don’t know how to create a device, here is how:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

If you have a GPU, then the device will automatically be a GPU. Otherwise, it will be a CPU.


The last step in setting up your PyTorch project is simply combining the previous three steps. Create a function that initializes everything you need, a function that trains the model, and a function that evaluates it. Then combine everything together, which is relatively straightforward.
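As a rough sketch (train_one_epoch and evaluate are hypothetical names wrapping the training and evaluation loops shown above, and num_epochs is whatever you choose):

num_epochs = 10  ## an assumption; pick what works for your task
for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer, criterion, device)
    val_loss = evaluate(model, val_loader, criterion, device)
    print(f"Epoch {epoch}: validation loss = {val_loss}")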

You should also remember to save your model weights to a file. Along with the model weights, I also save the optimizer state. You can save both like this:

model = Model(parameters here)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
save_dict = {'Optimizer_state_dict': optimizer.state_dict(),
             'Model_state_dict': model.state_dict()}
torch.save(save_dict, file_path)

Loading the model is also very easy.

load_dict = torch.load(file_path)
model.load_state_dict(load_dict['Model_state_dict'])
optimizer.load_state_dict(load_dict['Optimizer_state_dict'])

I hope you found this article easy to understand and informative. If you have any questions leave them in the comments below.

