Semantic hand segmentation using PyTorch

Saurabh Kumar
Towards Data Science
6 min read · Dec 1, 2020


Semantic segmentation is the task of predicting the class of each pixel in an image. This problem is more difficult than object detection, where you only have to predict a box around each object. It is slightly easier than instance segmentation, where you not only have to predict the class of each pixel but also differentiate between multiple instances of the same class. The picture below shows the result we are trying to get.

A sample of semantic hand segmentation. (images from HOF dataset[1])

Here we will get a quick and easy hand segmentation pipeline up and running, using PyTorch and its pre-defined models.

We will not design our own neural network; instead we will use DeepLabv3 with a ResNet-50 backbone from PyTorch’s model repository. Then we will train the model on a combined dataset made up of the EgoHands[2], GTEA[3] and Hand over Face[1] datasets. This gives roughly 28k images and their segmentation masks, about 2.1 GB of data. Finally, we will write some functions to use the model to segment hands in real time using OpenCV.

Model

The first step is to get the model from PyTorch’s repository. This is fairly simple.
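A minimal sketch of this step, using torchvision’s segmentation API (the article’s exact pretraining setup is not shown here, so this initializes the weights from scratch):

```python
import torch
from torchvision import models

# DeepLabv3 with a ResNet-50 backbone; two output channels,
# one for "hand" and one for "no hand"
model = models.segmentation.deeplabv3_resnet50(pretrained=False, num_classes=2)
```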

Here we use the models module from torchvision to get the deeplabv3_resnet50 model. We set num_classes to two because we will generate two grayscale maps, one predicting regions with hands and another predicting regions with no hands. These maps have the same size as the input image. We compare the two predictions to find out whether the model assigns a higher score to hands or no hands at each pixel of the image. The following image shows a sample prediction from the model with the two predicted masks and the final output after comparing them.

The outputs from the model. From left: 1. hand prediction mask, 2. no-hand prediction mask, 3. the mask generated after comparing the two.

Then we write a custom model to process the data.
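The original wrapper is not reproduced here, but its job is essentially to turn torchvision’s dictionary output into a plain tensor we can train on. A sketch (the class name HandSegModel is my own, not necessarily the article’s):

```python
import torch.nn as nn
from torchvision import models

class HandSegModel(nn.Module):
    """Thin wrapper around torchvision's DeepLabv3 so that the forward
    pass returns a plain tensor instead of an OrderedDict."""
    def __init__(self):
        super().__init__()
        self.dl = models.segmentation.deeplabv3_resnet50(pretrained=False, num_classes=2)

    def forward(self, x):
        # torchvision segmentation models return {'out': ..., 'aux': ...};
        # we only need the main output map of shape (N, 2, H, W)
        return self.dl(x)['out']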

This is all we have to do with the model. I told you it was going to be easy. 😜

Data, Dataset and DataLoader

Now we need to get the data. You can get the data from this link. The folder contains the three datasets. Place the folder for the dataset that you wish to use in the folder where you will run the Python code. Alternatively, you can use your own dataset. 😅

We will create a dataset class and then use PyTorch’s DataLoader to fetch batches from it for us. Our custom dataset is described below, with a sketch of the code after the description.

I assume that you have your masks and images in separate folders and that those folders are located in the same parent folder.
The constructor takes three arguments:

  • parentDir: The name of the parent folder where the image and mask folders are located.
  • imageDir: The folder where images are located.
  • maskDir: The folder where segmentation masks are located.

In the constructor of the SegDataset class we build lists of the image and mask filenames. In the __getitem__ function we load both the image and the mask. We resize and standardize the image to get X. For the mask, which is also our label, we do the same resizing. We then apply a bitwise NOT operation to the mask to get a second mask that is the exact negative of the original. Finally, we stack the two masks to get a two-channel image, where the first channel corresponds to the hand label and the second to the no-hand label.
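A sketch of such a dataset class under the assumptions above; the 256×256 size, the [0, 1] scaling and the file-listing logic are placeholders, not necessarily the article’s exact choices:

```python
import os
import cv2
import numpy as np
import torch
from torch.utils.data import Dataset

class SegDataset(Dataset):
    """Loads (image, two-channel mask) pairs from an imageDir/maskDir layout."""
    def __init__(self, parentDir, imageDir, maskDir):
        self.imagePaths = sorted(
            os.path.join(parentDir, imageDir, f)
            for f in os.listdir(os.path.join(parentDir, imageDir)))
        self.maskPaths = sorted(
            os.path.join(parentDir, maskDir, f)
            for f in os.listdir(os.path.join(parentDir, maskDir)))

    def __len__(self):
        return len(self.imagePaths)

    def __getitem__(self, idx):
        # Image: resize, scale to [0, 1] and move channels first -> X
        image = cv2.cvtColor(cv2.imread(self.imagePaths[idx]), cv2.COLOR_BGR2RGB)
        image = cv2.resize(image, (256, 256)).astype(np.float32) / 255.0
        X = torch.from_numpy(image).permute(2, 0, 1)

        # Mask: resize, binarise, then build the negative with bitwise NOT
        mask = cv2.imread(self.maskPaths[idx], cv2.IMREAD_GRAYSCALE)
        mask = cv2.resize(mask, (256, 256))
        _, mask = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)
        noHand = cv2.bitwise_not(mask)

        # Stack into a two-channel label: channel 0 = hand, channel 1 = no hand
        y = np.stack([mask, noHand]).astype(np.float32) / 255.0
        return X, torch.from_numpy(y)
```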

We need to combine multiple datasets into one. Here is how I did it.
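A sketch using ConcatDataset and random_split; the folder names, split ratio and batch size below are placeholders:

```python
from torch.utils.data import ConcatDataset, DataLoader, random_split

# One SegDataset per source; the folder names here are placeholders
egoDataset  = SegDataset('EgoHands',     'images', 'masks')
gteaDataset = SegDataset('GTEA',         'images', 'masks')
hofDataset  = SegDataset('HandOverFace', 'images', 'masks')

# Concatenate everything into one big dataset
megaDataset = ConcatDataset([egoDataset, gteaDataset, hofDataset])

# Roughly 90/10 train/validation split (the exact ratio is an assumption)
trainLen = int(0.9 * len(megaDataset))
trainSet, valSet = random_split(megaDataset, [trainLen, len(megaDataset) - trainLen])

trainLoader = DataLoader(trainSet, batch_size=16, shuffle=True,  num_workers=2)
valLoader   = DataLoader(valSet,   batch_size=16, shuffle=False, num_workers=2)
```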

We create a SegDataset for each source and combine them into a megaDataset 💪. Then we split it and create two dataloaders, one for the train set and one for the validation set.

This was a bit more involved, but it was the DATA part, so that was to be expected.

Performance Metrics

Now that we have our model and dataloader ready, we will write some performance metrics to keep track of our training process.

We will use Intersection over Union (IoU) and pixel accuracy as our metrics for this task. IoU is a standard metric for segmentation tasks: it is the area of the intersection between prediction and target divided by the area of their union. Intuitively, you can think of it as the correctly predicted region divided by the total relevant area. Numerically, an IoU value over 0.5 is good, and the higher the better.

Pixel accuracy is simply the number of correctly predicted pixels divided by the total number of pixels. Here is my implementation of both metrics.
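A sketch of both metrics, assuming the model output and label are (N, 2, H, W) tensors with the hand map in channel 0 (as set up in the dataset class):

```python
import torch

def meanIOU(pred, target):
    """IoU of the hand class, averaged over the batch."""
    predMask = pred.argmax(dim=1) == 0       # True where hand is predicted
    targetMask = target.argmax(dim=1) == 0   # True where hand is labelled
    intersection = (predMask & targetMask).float().sum(dim=(1, 2))
    union = (predMask | targetMask).float().sum(dim=(1, 2))
    return ((intersection + 1e-6) / (union + 1e-6)).mean().item()

def pixelAccuracy(pred, target):
    """Fraction of pixels whose predicted class matches the label."""
    predClass = pred.argmax(dim=1)
    targetClass = target.argmax(dim=1)
    return (predClass == targetClass).float().mean().item()
```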

Training

Now, the moment we have all been waiting for: the training.

First we create the model, optimizer, loss function and learning rate scheduler objects with appropriate hyperparameters.
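Something along these lines; the choice of loss (BCE over the two soft masks), the learning rate and the scheduler settings below are assumptions, not the article’s exact hyperparameters:

```python
import torch
import torch.nn as nn
import torch.optim as optim

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = HandSegModel().to(device)

# Placeholder hyperparameters
criterion = nn.BCEWithLogitsLoss()                      # two soft masks as targets
optimizer = optim.Adam(model.parameters(), lr=1e-4)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)
```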

Then we write a training loop function that takes all these objects as arguments and trains the model.

It displays the mean loss and the performance metrics during training. After every epoch it saves a checkpoint with all the losses and metrics calculated so far. This allows us to stop the training at any time and resume it without losing any data. The last argument to this function is the path to the checkpoint from which you want to resume training. The function returns a tuple of all the important data from the training, i.e. the losses and metrics on the train and validation datasets.
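A sketch of such a loop, reusing the names defined earlier; the argument names and the checkpoint layout are my own, not necessarily the article’s:

```python
import os
import torch

def trainLoop(model, criterion, optimizer, scheduler, trainLoader, valLoader,
              epochs, checkpointPath, resumeFrom=None):
    # Per-epoch losses and metrics for both phases
    history = {'trainLoss': [], 'valLoss': [], 'trainIOU': [], 'valIOU': [],
               'trainAcc': [], 'valAcc': []}
    startEpoch = 0

    # Resume from a previous checkpoint if a path was given
    if resumeFrom is not None and os.path.exists(resumeFrom):
        ckpt = torch.load(resumeFrom, map_location=device)
        model.load_state_dict(ckpt['model'])
        optimizer.load_state_dict(ckpt['optimizer'])
        history = ckpt['history']
        startEpoch = ckpt['epoch'] + 1

    for epoch in range(startEpoch, epochs):
        for phase, loader in [('train', trainLoader), ('val', valLoader)]:
            model.train(phase == 'train')
            losses, ious, accs = [], [], []
            for X, y in loader:
                X, y = X.to(device), y.to(device)
                with torch.set_grad_enabled(phase == 'train'):
                    out = model(X)
                    loss = criterion(out, y)
                    if phase == 'train':
                        optimizer.zero_grad()
                        loss.backward()
                        optimizer.step()
                losses.append(loss.item())
                ious.append(meanIOU(out, y))
                accs.append(pixelAccuracy(out, y))
            # Record and print the epoch means
            epochLoss = sum(losses) / len(losses)
            epochIOU = sum(ious) / len(ious)
            epochAcc = sum(accs) / len(accs)
            history[phase + 'Loss'].append(epochLoss)
            history[phase + 'IOU'].append(epochIOU)
            history[phase + 'Acc'].append(epochAcc)
            print(f'Epoch {epoch} [{phase}] loss {epochLoss:.4f} '
                  f'IoU {epochIOU:.4f} acc {epochAcc:.4f}')
        scheduler.step()

        # Save everything needed to resume, after every epoch
        torch.save({'model': model.state_dict(), 'optimizer': optimizer.state_dict(),
                    'history': history, 'epoch': epoch}, checkpointPath)

    return (history['trainLoss'], history['valLoss'], history['trainIOU'],
            history['valIOU'], history['trainAcc'], history['valAcc'])
```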

First, we run the training loop.
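For example (the epoch count and checkpoint path here are placeholders):

```python
# Kick off training; pass a checkpoint path as the last argument to resume a run
results = trainLoop(model, criterion, optimizer, scheduler, trainLoader, valLoader,
                    epochs=10, checkpointPath='checkpoint.pt', resumeFrom=None)
trainLoss, valLoss, trainIOU, valIOU, trainAcc, valAcc = results
```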

To plot the data we can use a running average of the lists. Here is how I did it.
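One simple way to do this is a convolution with a box filter of width N, for instance:

```python
import numpy as np
import matplotlib.pyplot as plt

def runningAverage(values, N=5):
    """Smooth a list of values with a moving-average window of size N."""
    return np.convolve(values, np.ones(N) / N, mode='valid')

plt.plot(runningAverage(trainLoss), label='train loss')
plt.plot(runningAverage(valLoss), label='val loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()
```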

You can try adjusting the N value a bit to get smoother or noisier plots.

Real-time hand segmentation

We are ready to see the results of our labor. We will use OpenCV to read frames from a camera and then predict the segmentation mask for the frames.

First we will write some helper functions to get predictions from the model.
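A sketch of such a helper, reusing the preprocessing from the dataset class; the kernel size and the function name predictMask are my own choices:

```python
import cv2
import numpy as np
import torch

def predictMask(image, model, size=(256, 256)):
    """Return a binary hand mask for an image given as a numpy array or a file path."""
    if isinstance(image, str):                       # accept a path as well
        image = cv2.cvtColor(cv2.imread(image), cv2.COLOR_BGR2RGB)
    h, w = image.shape[:2]

    # Same preprocessing as in the dataset
    inp = cv2.resize(image, size).astype(np.float32) / 255.0
    inp = torch.from_numpy(inp).permute(2, 0, 1).unsqueeze(0).to(device)

    model.eval()
    with torch.no_grad():
        out = model(inp)[0]                          # (2, H, W)
    mask = (out.argmax(dim=0) == 0).cpu().numpy().astype(np.uint8) * 255

    # Clean up the mask: closing + opening to remove noise, then a slight dilation
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.dilate(mask, kernel, iterations=1)
    return cv2.resize(mask, (w, h))
```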

This function outputs a binary mask. It takes as input either a numpy array containing the image or the path to the image. We apply closing and opening operations to the predicted mask to reduce noise, and then dilate the mask a bit.

Next, we have some functions to add a weighted green mask to the hand region of the image.
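For example, a simple blend with cv2.addWeighted; the function name and the blending weight are my own:

```python
import cv2
import numpy as np

def addGreenMask(frame, mask, alpha=0.6):
    """Blend a translucent green layer over the pixels where mask is set."""
    green = np.zeros_like(frame)
    green[:, :, 1] = 255                              # pure green image (BGR)
    overlay = cv2.addWeighted(frame, 1 - alpha, green, alpha, 0)
    # Keep the overlay only where the mask says "hand", the original frame elsewhere
    return np.where(mask[:, :, None] > 0, overlay, frame)
```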

Finally, we will use OpenCV to read images from the camera and predict the hand region in each frame.
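A sketch of the capture loop, reusing the helpers above:

```python
cap = cv2.VideoCapture(0)
frameCount = 0
mask = None

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Run the network only every fifth frame to keep things light
    if frameCount % 5 == 0:
        mask = predictMask(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB), model)
    frameCount += 1

    display = addGreenMask(frame, mask) if mask is not None else frame
    cv2.imshow('Hand segmentation', display)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
```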

We process the images only every fifth frame to keep the computation on the lighter side. Here is a sample output:

Sample hand segmentation

Conclusion

We trained a hand segmentation model and learned how to see the results in real-time.

Other things to try:

  • use data augmentation, e.g. random cropping, scaling, noise addition
  • use a different model architecture
  • use a bigger or tougher dataset
  • tune hyperparameters

References

[1] Urooj, Aisha, and Ali Borji. “Analysis of hand segmentation in the wild.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.

[2] Bambach, Sven, et al. “Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions.” Proceedings of the IEEE International Conference on Computer Vision. 2015.

[3] Li, Yin, Miao Liu, and James M. Rehg. “In the eye of beholder: Joint learning of gaze and actions in first person video.” Proceedings of the European Conference on Computer Vision (ECCV). 2018.



I am a Computer Science student at DTU, Delhi with a keen interest in AI and robotics.