Parallel Neural Networks and Transfer Learning

A closer look at transfer learning and an interesting use case

Himanshu Aswani
Towards Data Science


Introduction

Hey there! This is my first ever article on Medium, and one I have been planning to write for a while. My main motivation is to simplify, and maybe even provide a template for, the next step in building complex neural networks: adding parallel neurons alongside an architecture you may have already built. Although it may seem simple on paper, building, saving, loading and refining (training at a later time) neural nets can become a stumbling block once we actually begin to code them up. Hopefully, this article also helps you understand those issues a bit better and resolve the minor hiccups you face from time to time when getting started with deep learning. With that, let’s begin our journey.

For those who are beginning to learn about or are practicing neural nets, you may have observed that most of the architectures we work with are feed-forward, i.e. tensors flow from one layer to the next sequentially. Now suppose we want to explore what happens if we increase the number of neurons in one layer after we have already built and trained our architecture to some degree of satisfaction. This would usually mean starting again from scratch and possibly wasting a few hours of training. Just to see the effect of adding a few neurons to a particular layer, that seems unreasonable. This is where transfer learning comes in. In simple terms, it means transferring what you have learnt on one task to another task. In the paradigm of neural networks, what is learnt is represented by the weight values obtained after training.
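As a minimal sketch of that idea (the two models below are hypothetical stand-ins, not the networks we build later), transferring learnt weights in PyTorch amounts to saving the trained model’s state_dict and loading it into a compatible model:

import torch
import torch.nn as nn

# Two illustrative models with identical layer shapes; imagine the first one
# has already been trained on some task A.
task_a_model = nn.Linear(10, 2)
task_b_model = nn.Linear(10, 2)

# "What we have learnt" is just the weight values ...
torch.save(task_a_model.state_dict(), 'task_a_weights.pth')
# ... so transferring it means loading those values into another model.
task_b_model.load_state_dict(torch.load('task_a_weights.pth'))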

When we begin to learn more about how to utilize transfer learning, we find that most of the built-in functions assume fixed neural architectures and hide away the code used for reloading weights and updating them in a new context. Figuring out how to apply it to your own custom model will most probably take a few hours or days of head scratching and poring over the many examples available online. I have done that myself, and I believe that consolidating what I have learnt in one place will help anyone wanting to become more hands-on with transfer learning, as well as to explore another way of augmenting neural networks. With that, let’s begin our example.

I shall assume you are familiar with the PyTorch tutorial on building neural nets, available at https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html. To learn the basics of transfer learning with existing popular architectures, refer to https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html. I strongly encourage you to visit these links, and the others available there, to practice and to form a well-rounded idea of training neural network architectures.

Our Scenario

Let’s say we now want to create a copy of the network described in the CIFAR-10 tutorial and place it in parallel with our trained version, to create a bigger network for the same image classification problem. Conceptually, our current model looks like this:

The Simple Feed Forward Neural Network followed from the tutorial. Source: This is my own conceptual drawing in MS Paint.

A TensorBoard depiction of the graph reveals the following:

TensorBoard representation of the model on my computer.

Our goal now is to construct a neural network architecture that looks like this:

A Parallel Feed Forward Neural Network — Essentially the core of our model placed side-by-side. Source: This is my own conceptual drawing in MS Paint.

We also want the upper sub-part of this new structure to contain the same weights as those obtained by running through the tutorial.

Coding It Up

Let’s look at the code that helps us realize such a structure:

import torch
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=0)
testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=0)
classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

import matplotlib.pyplot as plt
import numpy as np

# function to show an image
def imshow(img):
    img = img / 2 + 0.5  # unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.show()

# get some random training images
dataiter = iter(trainloader)
images, labels = next(dataiter)

# show images
imshow(torchvision.utils.make_grid(images))
# print labels
print(' '.join('%5s' % classes[labels[j]] for j in range(4)))

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# recreate the tutorial's network and load the weights saved after training it
net = Net()
PATH = './cifar_net.pth'
net.load_state_dict(torch.load(PATH))

Till now, we have recreated the model from the tutorial that we had already trained, and loaded its weights so that we can copy them into our new network.

The following code constructs our required architecture.

class SideNet(nn.Module):
    def __init__(self):
        super(SideNet, self).__init__()
        self.pool = nn.MaxPool2d(2, 2)

        # upper sub-network: convolution layers copied from the trained net
        self.conv11 = nn.Conv2d(3, 6, 5)
        self.conv12 = nn.Conv2d(6, 16, 5)

        self.conv11.weight.data.copy_(net.conv1.weight.data)
        self.conv12.weight.data.copy_(net.conv2.weight.data)

        # lower sub-network: freshly initialized convolution layers
        self.conv21 = nn.Conv2d(3, 6, 5)
        self.conv22 = nn.Conv2d(6, 16, 5)

        # upper sub-network: fully connected layers copied from the trained net
        self.fc11 = nn.Linear(16 * 5 * 5, 120)
        self.fc12 = nn.Linear(120, 84)

        self.fc11.weight.data.copy_(net.fc1.weight.data)
        self.fc12.weight.data.copy_(net.fc2.weight.data)

        # lower sub-network: freshly initialized fully connected layers
        self.fc21 = nn.Linear(16 * 5 * 5, 120)
        self.fc22 = nn.Linear(120, 84)

        # final classifier over the concatenated 84 + 84 features
        self.fc3 = nn.Linear(168, 10)

    def forward(self, x):
        # upper (pre-trained) branch
        y = self.pool(F.relu(self.conv11(x)))
        y = self.pool(F.relu(self.conv12(y)))
        y = y.view(-1, 16 * 5 * 5)
        y = F.relu(self.fc11(y))
        y = F.relu(self.fc12(y))

        # lower (new) branch
        x = self.pool(F.relu(self.conv21(x)))
        x = self.pool(F.relu(self.conv22(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc21(x))
        x = F.relu(self.fc22(x))

        # concatenate both branches and classify
        out = self.fc3(torch.cat((x, y), dim=1))
        return out

# create a new model
net1 = SideNet()

We have now created our architecture and defined the flow of tensors through it in the forward function of class SideNet().

self.conv11.weight.data.copy_(net.conv1.weight.data)
self.conv12.weight.data.copy_(net.conv2.weight.data)
self.fc11.weight.data.copy_(net.fc1.weight.data)
self.fc12.weight.data.copy_(net.fc2.weight.data)

This code snippet is the key step. You may instantly recognize what is going on here, because this is what fundamentally happens in transfer learning: we have simply copied the weights of our trained network into the upper sub-part of our new structure. Voila!

We may now choose either to keep those weights as they are, by setting requires_grad = False (essentially freezing the layers whose weights were copied from the trained network), or to update them while training the new architecture. Since I trained the tutorial’s network for only a few epochs, I will train the weights of both sub-parts of our architecture. The reason for initially training for only a single epoch will become clear later on.
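For completeness, here is a minimal sketch (my own addition, not part of the tutorial) of the freezing route: switch off gradients on the layers of net1 whose weights were copied, and hand the optimizer only the parameters that should still learn.

import torch.optim as optim

# freeze the copied (upper) sub-network of net1
for layer in (net1.conv11, net1.conv12, net1.fc11, net1.fc12):
    for param in layer.parameters():
        param.requires_grad = False

# give the optimizer only the parameters that remain trainable
optimizer = optim.SGD([p for p in net1.parameters() if p.requires_grad], lr=0.01)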

# check weights
print(net.fc2.weight.data)
print(net1.fc12.weight.data)
print(net1.fc22.weight.data)

# uncomment to freeze the old network's weights
# for param in net.parameters():
#     param.requires_grad = False

from torch.utils.tensorboard import SummaryWriter

# default `log_dir` is "runs" - we'll be more specific here
writer = SummaryWriter('runs/temp')
# write model to tensorboard
writer.add_graph(net1, images)
writer.close()

# train the new model
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net1.parameters(), lr=0.01)

for epoch in range(1):  # loop over the dataset (a single epoch here)
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data
        # zero the parameter gradients
        optimizer.zero_grad()
        # forward + backward + optimize
        outputs = net1(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:  # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')

# evaluate on the test set
correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net1(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct / total))

# check weights again
print(net.fc2.weight.data)
print(net1.fc12.weight.data)
print(net1.fc22.weight.data)

Here, the optimizer’s parameters have been set as follows: learning rate = 0.01 and no momentum. These are the same settings I used when training the initial network. This matters because, when we compare the weights of the new network and the initial network side by side, we need to check three things: that the old network’s weights were copied, that the new network’s weights were properly initialized, and that while the new network’s weights are being trained, the initial network’s weights do not change. The last point is the key one; there are many ways to achieve the first two, but they usually end up modifying the initial network’s weights as well! One example is the following snippet (a short demonstration of the problem follows it):

self.conv11.weight.data = net.conv1.weight.data
self.conv12.weight.data = net.conv2.weight.data
self.fc11.weight.data = net.fc1.weight.data
self.fc12.weight.data = net.fc2.weight.data
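To see why this is a problem, here is a small self-contained sketch (my own, using throwaway nn.Linear layers rather than our models) showing that plain assignment of .data makes both layers share one underlying tensor, whereas copy_() writes the values into a separate tensor:

import torch
import torch.nn as nn

a = nn.Linear(4, 4)

# plain assignment: b's weight now points at the very same tensor as a's
b = nn.Linear(4, 4)
b.weight.data = a.weight.data
b.weight.data.add_(1.0)                            # in-place update...
print(torch.equal(a.weight.data, b.weight.data))   # True -- a changed too

# copy_(): c gets its own copy of the values
c = nn.Linear(4, 4)
c.weight.data.copy_(a.weight.data)
c.weight.data.add_(1.0)                            # in-place update...
print(torch.equal(a.weight.data, c.weight.data))   # False -- a is untouched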

Let’s have a look at the structure TensorBoard has inferred:

TensorBoard representation of the model on my computer.

It does match what we are trying to build. Great!

Verifying Our Understanding of What’s Actually Going On

As a verification step, let us print the weight values of layer fc2 (fully connected layer 2, for those of you wondering) for both of our networks:

print(net.fc2.weight.data) # before train
tensor([[-2.5432e-03, -6.9285e-02, 7.7019e-02, ..., 2.8243e-02,
4.9399e-02, -8.7909e-05],
[-7.2035e-02, -1.2313e-03, -8.9993e-02, ..., 1.8121e-02,
-6.1479e-02, -3.8699e-02],
[-6.3147e-02, 5.5815e-02, -6.0806e-02, ..., 3.3566e-02,
7.6486e-02, 7.3699e-02],
...,
[ 1.9772e-03, -1.8449e-02, 6.8946e-02, ..., -2.1011e-02,
7.5202e-02, 4.1823e-02],
[ 2.9912e-02, -7.9396e-02, -8.7561e-02, ..., 4.6011e-02,
-9.0685e-02, 4.1302e-02],
[-1.8297e-02, -7.3356e-02, 4.7250e-02, ..., -7.5147e-02,
-6.4722e-02, 6.0243e-02]])
print(net.fc2.weight.data) # after train
tensor([[-0.0151, -0.0470, 0.1057, ..., 0.0288, 0.0280, 0.0171],
[-0.0720, -0.0029, -0.0907, ..., 0.0181, -0.0630, -0.0408],
[-0.0417, 0.0548, -0.1226, ..., 0.0335, 0.0679, 0.0900],
...,
[ 0.0074, -0.0028, 0.0292, ..., -0.0218, 0.0754, 0.0473],
[ 0.0307, -0.0784, -0.0875, ..., 0.0460, -0.0903, 0.0510],
[-0.0252, -0.0824, 0.0380, ..., -0.0744, -0.0741, 0.1009]])
In our new model’s code, before training, we run:
print(net.fc2.weight.data)
print(net1.fc12.weight.data)
print(net1.fc22.weight.data)

tensor([[-0.0151, -0.0470, 0.1057, ..., 0.0288, 0.0280, 0.0171],
[-0.0720, -0.0029, -0.0907, ..., 0.0181, -0.0630, -0.0408],
[-0.0417, 0.0548, -0.1226, ..., 0.0335, 0.0679, 0.0900],
...,
[ 0.0074, -0.0028, 0.0292, ..., -0.0218, 0.0754, 0.0473],
[ 0.0307, -0.0784, -0.0875, ..., 0.0460, -0.0903, 0.0510],
[-0.0252, -0.0824, 0.0380, ..., -0.0744, -0.0741, 0.1009]])
tensor([[-0.0151, -0.0470, 0.1057, ..., 0.0288, 0.0280, 0.0171],
[-0.0720, -0.0029, -0.0907, ..., 0.0181, -0.0630, -0.0408],
[-0.0417, 0.0548, -0.1226, ..., 0.0335, 0.0679, 0.0900],
...,
[ 0.0074, -0.0028, 0.0292, ..., -0.0218, 0.0754, 0.0473],
[ 0.0307, -0.0784, -0.0875, ..., 0.0460, -0.0903, 0.0510],
[-0.0252, -0.0824, 0.0380, ..., -0.0744, -0.0741, 0.1009]])

tensor([[ 0.0864, 0.0843, 0.0060, ..., 0.0325, -0.0519, -0.0048],
[ 0.0394, -0.0486, -0.0258, ..., 0.0515, 0.0077, -0.0702],
[ 0.0570, -0.0178, 0.0411, ..., -0.0026, -0.0385, 0.0893],
...,
[-0.0760, 0.0237, 0.0782, ..., 0.0338, 0.0055, -0.0830],
[-0.0755, -0.0767, 0.0308, ..., -0.0234, -0.0403, 0.0812],
[ 0.0057, -0.0511, -0.0834, ..., 0.0028, 0.0834, -0.0340]])
After training,
print(net.fc2.weight.data)
print(net1.fc12.weight.data)
print(net1.fc22.weight.data)
tensor([[-0.0151, -0.0470, 0.1057, ..., 0.0288, 0.0280, 0.0171],
[-0.0720, -0.0029, -0.0907, ..., 0.0181, -0.0630, -0.0408],
[-0.0417, 0.0548, -0.1226, ..., 0.0335, 0.0679, 0.0900],
...,
[ 0.0074, -0.0028, 0.0292, ..., -0.0218, 0.0754, 0.0473],
[ 0.0307, -0.0784, -0.0875, ..., 0.0460, -0.0903, 0.0510],
[-0.0252, -0.0824, 0.0380, ..., -0.0744, -0.0741, 0.1009]])
tensor([[-0.0322, -0.0377, 0.0366, ..., 0.0290, 0.0322, 0.0069],
[-0.0749, -0.0033, -0.0902, ..., 0.0179, -0.0650, -0.0402],
[-0.0362, 0.0748, -0.1354, ..., 0.0352, 0.0715, 0.1009],
...,
[ 0.0244, -0.0192, -0.0326, ..., -0.0220, 0.0661, 0.0834],
[ 0.0304, -0.0785, -0.0976, ..., 0.0461, -0.0911, 0.0529],
[-0.0225, -0.0737, 0.0275, ..., -0.0747, -0.0805, 0.1130]])
tensor([[ 0.0864, 0.0843, 0.0060, ..., 0.0325, -0.0519, -0.0048],
[ 0.0390, -0.0469, -0.0283, ..., 0.0506, 0.0030, -0.0723],
[ 0.0571, -0.0178, 0.0411, ..., -0.0027, -0.0389, 0.0893],
...,
[-0.0763, 0.0230, 0.0792, ..., 0.0337, 0.0065, -0.0802],
[-0.0756, -0.0769, 0.0306, ..., -0.0235, -0.0413, 0.0810],
[ 0.0048, -0.0525, -0.0822, ..., 0.0019, 0.0785, -0.0313]])

Thus, we can observe that the initial model’s weights were copied but did not change while the new model was training. The layer that received those copied weights, on the other hand, did have its weights updated by training, which validates our sanity check.

We have now built a more complex model and are able to reuse our trained weights in a parallel branch. Of course, if you want parallelism somewhere in the middle of the network instead, you just need to change the flow of tensors in the forward function of class SideNet().

Another Example

For example, let’s say we want to keep our convolution layers but introduce two parallel routes after that. We want:

Introducing parallelism in between a Neural Network. Source: This is my own conceptual drawing in MS Paint.

The class SideNet() now looks as follows:

class SideNet(nn.Module):
    def __init__(self):
        super(SideNet, self).__init__()
        self.pool = nn.MaxPool2d(2, 2)

        # shared convolution layers, copied from the trained net
        self.conv11 = nn.Conv2d(3, 6, 5)
        self.conv12 = nn.Conv2d(6, 16, 5)

        self.conv11.weight.data.copy_(net.conv1.weight.data)
        self.conv12.weight.data.copy_(net.conv2.weight.data)

        # upper fully connected branch, copied from the trained net
        self.fc11 = nn.Linear(16 * 5 * 5, 120)
        self.fc12 = nn.Linear(120, 84)

        self.fc11.weight.data.copy_(net.fc1.weight.data)
        self.fc12.weight.data.copy_(net.fc2.weight.data)

        # lower fully connected branch, freshly initialized
        self.fc21 = nn.Linear(16 * 5 * 5, 120)
        self.fc22 = nn.Linear(120, 84)

        # final classifier over the concatenated 84 + 84 features
        self.fc3 = nn.Linear(168, 10)

    def forward(self, x):
        # shared convolutional trunk
        y = self.pool(F.relu(self.conv11(x)))
        y = self.pool(F.relu(self.conv12(y)))
        z = y.view(-1, 16 * 5 * 5)

        # two parallel fully connected branches fed by the same features
        y = F.relu(self.fc11(z))
        y = F.relu(self.fc12(y))

        x = F.relu(self.fc21(z))
        x = F.relu(self.fc22(x))

        out = self.fc3(torch.cat((x, y), dim=1))
        return out

# create a new model
net1 = SideNet()
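As a quick sanity check (my own addition, continuing from the code above), we can push a dummy CIFAR-10-sized batch through the modified model and confirm that it still produces one score per class:

dummy = torch.randn(4, 3, 32, 32)  # a fake batch of four 32x32 RGB images
out = net1(dummy)
print(out.shape)                   # expected: torch.Size([4, 10])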

The TensorBoard depiction confirms what we were aspiring to build:

TensorBoard depiction of our modified parallel neural network on my computer.

I hope these examples give you confidence in coding up complex neural networks and serve as a stepping stone in your machine learning journey. Thanks for reading :)
