Implementing Transfer Learning from RGB to Multi-channel Imagery

Semantic Segmentation using ResNet50 Backbone plus Pyramid Pooling

Sijuade Oguntayo
Towards Data Science


Introduction

I recently had the privilege of participating in a computer vision challenge in partnership with Omdena and WeedBot, an impact-driven startup developing laser-weeding machinery that helps farmers find and remove weeds with a laser beam.

We explored image segmentation techniques for crop vs. weed classification, covering both Semantic and Instance Segmentation approaches. In this article, we shall be exploring two distinct concepts implemented within the Semantic Segmentation part of the project —

  • Transfer Learning for Multi-Channel Input
  • The Pyramid Pooling Module

What is Transfer Learning?

Transfer learning is a machine learning technique in which a pre-trained model is reused on a new problem.

In transfer learning, a model leverages knowledge gained from one task to improve generalization on a new but related task, by extracting useful features from the new samples.

For example, the knowledge gained while training a classifier to predict a dog's breed could be used to distinguish animals in general.

Advantages

There are many advantages to using transfer learning: it saves time, often gives better performance, and requires much less data.

Deep learning models for Natural Language Processing and Computer Vision problems typically require huge amounts of data to learn. This can be both time-consuming and costly, and can be a huge barrier to machine learning adoption for individuals and small organizations.

Transfer learning reduces this barrier by allowing one to take an already-trained model and apply it to a different but related problem. Because the model is pre-trained, we are not training entirely from scratch and get to take advantage of what the model has already learned.

Provided for the challenge were 775 images with a resolution of 3008x3008. Given the small number of images, transfer learning seemed like a good path to explore.

Open-source models such as ResNet, AlexNet, and VGG, pre-trained on publicly available datasets, are widely available. Two such common datasets are ImageNet and COCO, which consist of over 14M and 330K images respectively.

From RGB to Multi-Channel

Our exploration suggested three methods we could employ to convert a model trained on 3 channels to one that accepts more channels. These methods vary in complexity. We shall briefly discuss them —

  • The first method is to simply expand the weight dimensions to account for the additional channels and randomly initialize the new values.
  • The second method is similar to the first, except that rather than filling the new values randomly, we fill them with the mean of the existing weights. We came across this method in a scientific paper (referenced below) that described it as working better than the first. This is the method we shall be exploring in this article.
  • The final method should, theoretically, offer the best performance, though it takes longer to train. Its premise is that the previous two methods are biased towards the first three channels, since these are what the pre-trained model was originally trained on. It proposes instead that we create a second, parallel network that performs feature extraction on the remaining channels, and concatenate its output with the output of the original pre-trained model. This way the second network learns representations specific to the additional channels while we still take advantage of the pre-trained model as-is. This method will be explored in a different article.

ResNet50 Backbone & 15-Channel Image

The backbone model chosen for this problem is ResNet50. ResNet50, from the "Residual Networks" family, is a 50-layer deep convolutional neural network that leverages residual learning.

(Figure: the ResNet50 architecture)

It bears mentioning that one characteristic of using pre-trained models is that the model expects the input dimensions of the new task to match the input dimensions of the task it was pre-trained on.

The resnet50 model was pre-trained on an input dimension of 224x224 for the height and width, with 3 channels for RGB.

For this segmentation task, we use a number of feature-generation techniques that add 12 channels to the original 3-channel RGB image. See this article for more information on how the additional channels were generated.

The challenge then was getting a pre-trained resnet50 model to take as input the new image dimensions of 480x400 with 15 channels in the third dimension.

We shall be doing a code walkthrough of how this was achieved. First, we download and import the resnet50 model using Keras —
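The original notebook cell isn't reproduced here; a minimal sketch of this step looks something like the following:

from tensorflow.keras.applications import ResNet50

# Download ResNet50 with pre-trained ImageNet weights, excluding the
# final classification layer so we can attach a segmentation head.
resnet50 = ResNet50(weights='imagenet', include_top=False)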

Here, we specify that we wish to download the weights pre-trained on ImageNet. Typically with transfer learning, we exclude the final layer and replace it with layers more specific to the new task; setting "include_top=False" allows us to exclude the final layer. This should be set to True if we are making inferences with the pre-trained model as-is, as opposed to implementing transfer learning.

At this point, we need to change both the resolution (height & width) from 224x224 to 480x400 and the number of channels from 3 to 15. Since changing the input height and width does not affect the dimensions of the weights, this is more straightforward to change.

Changing the number of input channels, on the other hand, does affect the dimensions of the weights. Let us look into this in more detail.

For comparison purposes, we shall be using a modified uNet architecture (a minimal stand-in is sketched below). We can compare the model summaries for a 224x224x3 input and a 480x400x15 input by first changing the height and width, and then the number of channels.
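The actual builder lives in the linked notebook; as a hypothetical minimal stand-in, an encoder-decoder along these lines behaves the same way (the layer sizes are illustrative, though a 16-filter 3x3 first convolution is consistent with the 448 and 2,176 first-layer counts reported below):

from tensorflow.keras import layers, Model

def semantic_segmentation(height, width, channels):
    # Hypothetical minimal stand-in for the project's modified uNet.
    inputs = layers.Input(shape=(height, width, channels))
    x = layers.Conv2D(16, 3, padding='same', activation='relu')(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(32, 3, padding='same', activation='relu')(x)
    x = layers.UpSampling2D()(x)
    # Single-channel output mask (e.g. weed vs. crop).
    outputs = layers.Conv2D(1, 1, activation='sigmoid')(x)
    return Model(inputs, outputs)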

semantic_segmentation(224, 224, 3).summary()
semantic_segmentation(400, 480, 3).summary()

We notice that the total number of parameters stays the same. This confirms that the input height and width do not affect the weight dimensions. Now, let us look into changing the number of channels —

semantic_segmentation(400, 480, 15).summary()

We notice the number of parameters increases from 18,515 to 20,243. We also notice this is solely a result of the parameters of the first convolutional layer increasing from 448 to 2,176; the number of parameters for subsequent layers remains the same.

Without having to try this out in Keras, this is confirmed theoretically by recalling that the weight dimensions of a convolutional layer are determined by the height and width of the filter, the input depth, and output depth. The height and width of the image play no part in this.
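To make this concrete: a convolutional layer has (filter height × filter width × input channels + 1 bias) parameters per filter. Assuming the first layer uses 16 filters of size 3x3, which is consistent with the counts above:

3 × 3 × 3 × 16 + 16 = 448 parameters for 3 input channels
3 × 3 × 15 × 16 + 16 = 2,176 parameters for 15 input channels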

The first step to changing the input dimensions involves copying the config information of the model. This gives us the composition of the model in dictionary format. We can edit this dictionary by changing the input dimensions, and create a new model with the edited config dictionary —
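The original cell isn't reproduced here; a sketch of this step, assuming a height-first input shape of 400x480x15, might look like:

from tensorflow.keras.models import Model

# Copy the configuration, point the input layer at the new shape,
# and rebuild the network from the edited config.
config = resnet50.get_config()
config['layers'][0]['config']['batch_input_shape'] = (None, 400, 480, 15)
new_model = Model.from_config(config)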

We have now created a new model with the same network structure as ResNet50. It is important to note that this does not automatically copy over the weights of the resnet50 model, which is the main point of the exercise.

To do this, we would need to loop through the layers of both the resnet50 model and the newly created model and copy over the weights.

We would run into a problem, however, as the dimensions won't match: we confirmed earlier that changing the number of channels affects the dimensions of the weights. To get around this, we expand the weight dimensions to account for the increase in channels and fill the new entries with the mean of the pre-trained weights. This is done as follows —
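A sketch of the copying loop, assuming the standard Keras kernel layout of (height, width, input channels, output channels):

import numpy as np

# Copy weights layer by layer; wherever shapes mismatch (only the
# first conv kernel here), fill the extra input channels with the
# mean of the pre-trained weights along the channel axis.
for old_layer, new_layer in zip(resnet50.layers, new_model.layers):
    old_weights = old_layer.get_weights()
    if not old_weights:
        continue  # e.g. input, activation, and pooling layers
    new_weights = []
    for old_w, new_w in zip(old_weights, new_layer.get_weights()):
        if old_w.shape != new_w.shape:
            mean_w = old_w.mean(axis=2, keepdims=True)
            extra = new_w.shape[2] - old_w.shape[2]
            old_w = np.concatenate(
                [old_w, np.tile(mean_w, (1, 1, extra, 1))], axis=2)
        new_weights.append(old_w)
    new_layer.set_weights(new_weights)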

This takes care of the ResNet50 backbone of the Semantic Segmentation model.

Pyramid Pooling Module

In a different article, I discussed how, while exploring Semantic Segmentation, we split into multiple teams to explore different segmentation models. While exploring PSPNet, we noticed that although the model wasn't completely accurate, it produced smooth segmentations.

We theorized that this might be a result of the Pyramid Pooling Module leveraged by the model. For the final model, we resolved to use it in combination with resnet50 as the backbone.

(Figure: the Pyramid Pooling Module)

Pyramid pooling works by observing the whole feature map alongside sub-regions at different locations. Pooling kernels covering the whole, half, a quarter, and an eighth of the feature map are fused as a global prior, which is then combined with the original feature map from the backbone.
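The full implementation is in the linked notebook; a minimal Keras sketch of the idea, assuming the feature map's height and width are divisible by each bin size, might look like this:

from tensorflow.keras import layers

def pyramid_pooling(feature_map, bin_sizes=(1, 2, 4, 8)):
    # Pool at scales covering the whole, half, quarter, and eighth of
    # the map, project each with a 1x1 conv, upsample back to the
    # input resolution, and concatenate with the original feature map.
    h, w, depth = feature_map.shape[1], feature_map.shape[2], feature_map.shape[3]
    outputs = [feature_map]
    for bins in bin_sizes:
        x = layers.AveragePooling2D(pool_size=(h // bins, w // bins))(feature_map)
        x = layers.Conv2D(depth // len(bin_sizes), 1, padding='same')(x)
        x = layers.UpSampling2D(size=(h // bins, w // bins),
                                interpolation='bilinear')(x)
        outputs.append(x)
    return layers.Concatenate()(outputs)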

Follow along on this Colab notebook.

References

Lin T, Maire M, Belongie S, Bourdev L, Girshick R, Hays J, Perona P, Ramanan D, Zitnick L, Dollár P 2014 Microsoft COCO: Common Objects in Context https://arxiv.org/abs/1405.0312

Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L 2009 ImageNet: A large-scale hierarchical image database https://ieeexplore.ieee.org/document/5206848

Donges N 2019 What is Transfer Learning? Exploring the Popular Deep Learning Approach https://builtin.com/data-science/transfer-learning

He K, Zhang X, Ren S, Sun J 2015 Deep Residual Learning for Image Recognition https://arxiv.org/pdf/1512.03385.pdf

Zhao H, Shi J, Qi X, Wang X, Jia J 2016 Pyramid Scene Parsing Network https://arxiv.org/abs/1612.01105v2
