
Gentle introduction to 2D Hand Pose Estimation: Let’s Code It!

Learn how to train a 2D hand pose estimator in PyTorch. This tutorial could also be your introduction to PyTorch.

Image by Author

Welcome back!

We are continuing our journey into Hand Pose Estimation. You are now reading the second part, which is about coding and PyTorch. I highly recommend reading the first part before diving deep into the code:

Gentle introduction to 2D Hand Pose Estimation: Approach Explained

It will help you a lot in understanding the dataset, preprocessing, model, training, and evaluation.

For this tutorial, I’ve created a GitHub repository, where you’ll find the complete code for training a hand pose estimator and running inference. You can go there now or later: clone it, read it, and run it. It is implemented in PyTorch, and if you haven’t worked with this library before, it’s a great opportunity to start! Experience with PyTorch is not required here; I am going to explain all the main concepts, so this tutorial could also be your introduction to PyTorch.

And now… Open your Jupyter Notebooks!

Contents

  • Data
  • Let’s Train
      • DataLoader
      • Model
      • Trainer
  • Let’s do Inference
      • Post-Processing
      • Evaluation
  • What’s Next

Data

Let’s start with data. We are going to use the FreiHAND dataset, which you can download here. It is really useful to spend some time reading the dataset description, exploring the archive structure, and opening images and files. Make yourself comfortable with the data we are going to use throughout this tutorial.

Image 1. FreiHAND dataset folder structure. In green - the files needed for this tutorial. Image by Author

For now, we need only the folder with RGB images and the 2D labels. 2D labels are calculated by projecting 3D labels onto the image plane using a camera matrix. Here is the function that I found in the FreiHAND dataset GitHub repo:

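import numpy as np

def projectPoints(xyz, K):
    # xyz: (N, 3) points in camera coordinates, K: (3, 3) camera matrix
    xyz = np.array(xyz)
    K = np.array(K)
    uv = np.matmul(K, xyz.T).T
    # divide by depth (last column) to get pixel coordinates
    return uv[:, :2] / uv[:, -1:]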
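To get a feeling for what the projection does, here is a small worked example. The camera matrix and the 3D points are made-up values for illustration (they are not taken from FreiHAND), and project_points is just a restatement of the function above under a hypothetical name. A point on the optical axis should land exactly on the principal point.

```python
import numpy as np

def project_points(xyz, K):
    """Project 3D camera-space points (N, 3) to pixels (N, 2) with intrinsics K (3, 3)."""
    xyz = np.asarray(xyz, dtype=float)
    K = np.asarray(K, dtype=float)
    uv = np.matmul(K, xyz.T).T
    return uv[:, :2] / uv[:, -1:]

# Hypothetical intrinsics: focal length 100 px, principal point (112, 112).
K = [[100.0, 0.0, 112.0],
     [0.0, 100.0, 112.0],
     [0.0, 0.0, 1.0]]

points_3d = [[0.0, 0.0, 1.0],    # on the optical axis, 1 unit away
             [0.1, -0.2, 0.5]]   # off-axis and closer to the camera
points_2d = project_points(points_3d, K)
print(points_2d)  # [[112. 112.] [132.  72.]]
```

The second point shows the effect of the depth division: u = (100 * 0.1 + 112 * 0.5) / 0.5 = 132 and v = (100 * -0.2 + 112 * 0.5) / 0.5 = 72.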

We are going to use only the first 32,560 images – the raw images. Other images in the dataset are exactly the same as raw ones but with background augmentation, so we’ll skip them for now.

Dataset Split. The FreiHAND dataset already looks like a shuffled set of hand images, so we can split it by image ids. Let the first 80% be the train part, the next 15% validation, and the last 5% test. Train images are used for training, validation images to monitor validation loss and decide when to stop training, and test images for the final model evaluation.
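A split by image ids can be sketched in a few lines. The function name split_ids and its arguments are my own choice for this illustration; the repository may organize this differently. Integer arithmetic is used so the part sizes are exact.

```python
def split_ids(n_images, train_pct=80, val_pct=15):
    """Split ids 0..n_images-1 into contiguous train/val/test ranges."""
    n_train = n_images * train_pct // 100
    n_val = n_images * val_pct // 100
    ids = list(range(n_images))
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],   # whatever is left, about 5%
    }

split = split_ids(32560)
sizes = {name: len(ids) for name, ids in split.items()}
print(sizes)  # {'train': 26048, 'val': 4884, 'test': 1628}
```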

Image 2. Dataset split into train, validation, and test parts. Image by Author

Let’s Train

To train a model, we need:

  • DataLoader Class. The FreiHAND dataset (like most image datasets) is too large to be fully stored in RAM, so we are going to train the model in batches. For this reason we need a DataLoader, which iterates through the dataset and loads only one batch of data at a time.
  • Model Class. We are going to create our own custom UNet-like model.
  • Trainer Class. This class does all the training: requests the batches of data, gets model predictions, calculates the loss, updates model weights, evaluates the model, and finishes training when the validation loss stops decreasing.

We will now discuss all these classes in detail. Also check the notebook that has the complete training flow.

Image 3. Training Pipeline and needed PyTorch classes. Image by Author

DataLoader

There are two PyTorch classes that we are going to use to load data.

DataLoader. This class is implemented in PyTorch, so you just call it, providing an instance of a Dataset class and some other arguments (check their meaning here):

train_dataloader = DataLoader( 
    dataset=train_dataset, 
    batch_size=48, 
    shuffle=True, 
    drop_last=True, 
    num_workers=2 
)

So now we can loop over the dataloader like this:

for data_batch in train_dataloader: 
    # do something

Dataset. PyTorch has a small list of implemented Datasets, where you may find what you need. However, be prepared: in most cases, you are going to write your own Dataset class. And today is that day.

A Dataset class is not hard to write when you follow the rules:

  • Your Dataset class inherits from (subclasses) torch.utils.data.Dataset.
  • You need to override the method __len__(), which returns the length of the dataset. Here you can use the length of the file with labels, or the number of images in the folder.
  • You also need to override the method __getitem__(), which takes a sample id and returns a list or dictionary with the image and its labels. Later, the outputs of __getitem__() are stacked into batches by the DataLoader.

So, your Dataset class for the FreiHAND dataset should look something like this. The full version is on GitHub.

from torch.utils.data import Dataset

class FreiHAND(Dataset):
    def __init__(self, config, set_type="train"):
        ## initialize path to image folders
        ## initialize paths to files with labels
        ## create train/test/val split
        ## define data augmentations

    def __len__(self):
        return len(self.anno)

    def __getitem__(self, idx):
        ## load image by id, use PIL library
        ## load its labels
        ## do augmentations if needed
        ## convert everything into PyTorch Tensors

        return {
            "image": image,
            "keypoints": keypoints,
            "heatmaps": heatmaps,
            "image_name": image_name,
            "image_raw": image_raw,
        }

My personal preference is to add all image information to the __getitem__() output: the raw image, the resized and normalized image, the image name, keypoints as numbers, and keypoints as heatmaps. It simplifies my life a lot when I debug, plot images, or evaluate model accuracy.
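To see how the DataLoader stacks dictionary outputs into batches, here is a minimal runnable sketch with a stand-in dataset. ToyHands is a hypothetical class that returns random tensors instead of real FreiHAND samples; the 21 keypoints, the 128×128 size, and the file name pattern are assumptions made only for this illustration.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyHands(Dataset):
    """Stand-in dataset that returns random tensors shaped like hand samples."""

    def __init__(self, n_samples=10, n_keypoints=21, image_size=128):
        self.n_samples = n_samples
        self.n_keypoints = n_keypoints
        self.image_size = image_size

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        size = self.image_size
        return {
            "image": torch.rand(3, size, size),
            "keypoints": torch.rand(self.n_keypoints, 2),
            "heatmaps": torch.rand(self.n_keypoints, size, size),
            "image_name": f"{idx:08d}.jpg",   # hypothetical naming scheme
        }

loader = DataLoader(ToyHands(), batch_size=4, shuffle=False)
batch = next(iter(loader))
print(batch["image"].shape)     # torch.Size([4, 3, 128, 128])
print(batch["heatmaps"].shape)  # torch.Size([4, 21, 128, 128])
print(batch["image_name"])      # a list with 4 file names
```

Tensors under the same key are stacked along a new first dimension, while strings are simply collected into a list, so the batch keeps the same keys as a single sample.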

Don’t forget to calculate the R, G, B channel means and standard deviations (just before initializing the datasets and dataloaders) using this function. Then add the values to the Normalize() transformation in the Dataset class; it goes before or after the Resize() transformation – check here how to do it. And only now initialize the datasets and dataloaders.
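If you are curious how such statistics can be computed without loading everything into memory, here is one possible sketch. It is not the function from the repository: channel_stats is a hypothetical helper that accumulates sums and squared sums batch by batch, and it assumes the batches are float image tensors of shape (B, 3, H, W) that have not been normalized yet.

```python
import torch

def channel_stats(batches):
    """Per-channel mean and std over an iterable of (B, 3, H, W) float tensors."""
    n_pixels = 0
    channel_sum = torch.zeros(3, dtype=torch.float64)
    channel_sq_sum = torch.zeros(3, dtype=torch.float64)
    for images in batches:
        images = images.double()   # accumulate in float64 to limit rounding error
        n_pixels += images.shape[0] * images.shape[2] * images.shape[3]
        channel_sum += images.sum(dim=(0, 2, 3))
        channel_sq_sum += (images ** 2).sum(dim=(0, 2, 3))
    mean = channel_sum / n_pixels
    # Var[x] = E[x^2] - E[x]^2, computed per channel
    std = (channel_sq_sum / n_pixels - mean ** 2).sqrt()
    return mean.float(), std.float()

# Toy check on random data; real batches would come from a dataloader over train images.
torch.manual_seed(0)
fake_batches = [torch.rand(8, 3, 32, 32) for _ in range(5)]
mean, std = channel_stats(fake_batches)
print(mean, std)
```

The statistics should be computed on the train part only, so that nothing from the validation or test images leaks into preprocessing.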

To summarize, there will be 3 instances of the Dataset class (train_dataset, val_dataset, and test_dataset) and 3 instances of the DataLoader class (train_dataloader, val_dataloader, and test_dataloader). That’s because the train, validation, and test sets are completely different sets of images.

You don’t need to write 3 different Dataset classes, though. One is enough. Just provide the set_type argument when creating a Dataset instance, like this:

train_dataset = FreiHAND(config=config, set_type="train") 
train_dataloader = DataLoader( 
    dataset=train_dataset, 
    batch_size=config["batch_size"], 
    shuffle=True, 
    drop_last=True, 
    num_workers=2 
)

And make sure you have a code snippet in your Dataset class that splits images into the train, validation, and test parts.

Model

The list of implemented and pre-trained models in PyTorch is huge – there are models for image classification, video classification, semantic and instance segmentation, object detection, and even one – for human keypoint detection. PRE-TRAINED. Isn’t it cool?!

But for study purposes, we will implement our own UNet-like architecture. This one:

Image 4. My custom UNet-like model for 2D Hand Pose Estimation. Image by Author

All custom models in PyTorch should subclass torch.nn.Module and override the forward() method. UNet is not a plain feedforward network: it has skip connections. So, during the forward pass, some layer outputs should be saved and later concatenated with outputs from the deeper layers. This is not a problem: you can write literally any type of forward pass, and PyTorch will figure out how to do the backward pass on its own.

So, our custom UNet model should look something like this. And here is a full version.

class ShallowUNet(nn.Module):     
    def __init__(self, in_channel, out_channel): 
        super().__init__()  
        # initialize layer - custom or from PyTorch list
    def forward(self, x):  
        # implement forward pass  
        # you can do literally anything here  
        return out

By the way, if your model has some repetitive blocks, you may implement the blocks as Modules, the same way as a model: by subclassing torch.nn.Module and overriding forward(). Look how I did it with the double convolution block in UNet.

class ConvBlock(nn.Module):     
    def __init__(self, in_depth, out_depth):
        super().__init__()
        self.double_conv = nn.Sequential(
            nn.BatchNorm2d(in_depth), 
            nn.Conv2d(in_depth, out_depth, kernel_size=3, padding=1, bias=False),
            nn.ReLU(inplace=True), 
            nn.BatchNorm2d(out_depth), 
            nn.Conv2d(out_depth, out_depth, kernel_size=3, padding=1, bias=False), 
            nn.ReLU(inplace=True), 
        ) 
    def forward(self, x): 
        return self.double_conv(x)
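To make the skip-connection idea concrete, here is a deliberately tiny sketch with a single skip connection. TinyUNet is not the ShallowUNet from the repository: the layer sizes, the bilinear upsampling, the 21 output channels, and the final sigmoid are all my assumptions for illustration. The ConvBlock is repeated so the snippet runs on its own.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3x3 convolutions that keep the spatial size."""
    def __init__(self, in_depth, out_depth):
        super().__init__()
        self.double_conv = nn.Sequential(
            nn.BatchNorm2d(in_depth),
            nn.Conv2d(in_depth, out_depth, kernel_size=3, padding=1, bias=False),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(out_depth),
            nn.Conv2d(out_depth, out_depth, kernel_size=3, padding=1, bias=False),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.double_conv(x)

class TinyUNet(nn.Module):
    """One-level encoder/decoder with a single skip connection (illustrative only)."""
    def __init__(self, in_channel=3, out_channel=21, depth=16):
        super().__init__()
        self.encoder = ConvBlock(in_channel, depth)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = ConvBlock(depth, depth * 2)
        self.upsample = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        # the decoder sees upsampled deep features concatenated with the saved encoder output
        self.decoder = ConvBlock(depth * 2 + depth, depth)
        self.head = nn.Conv2d(depth, out_channel, kernel_size=1)

    def forward(self, x):
        skip = self.encoder(x)                   # saved for later
        deep = self.bottleneck(self.pool(skip))  # half resolution, more channels
        up = self.upsample(deep)                 # back to the input resolution
        merged = torch.cat([up, skip], dim=1)    # concatenate along the channel axis
        return torch.sigmoid(self.head(self.decoder(merged)))

model = TinyUNet()
out = model(torch.rand(2, 3, 128, 128))
print(out.shape)  # torch.Size([2, 21, 128, 128])
```

Notice that nothing special is needed for the backward pass: saving skip in a local variable and calling torch.cat is enough for autograd to route gradients through both paths.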

Trainer

A Trainer class is something PyTorch doesn’t have at all, so you need to write everything from scratch, using code snippets from PyTorch tutorials.

Sometimes it freaks me out, but sometimes I see benefits: it makes me better understand what’s going on during training, and gives me full control over the training.

Here is the code of the Trainer class. I am not going to show the Trainer code in this post because the exact code is not as important as it is for the Dataset and Model classes. By the way, you don’t even have to write the Trainer as a class; it is, again, my personal preference. You can use functions, or just put all your training code into a single Jupyter Notebook cell. It’s up to you.

However, there are some things to keep in mind when writing code for training:

  • Each epoch should have train and evaluation stages. Before training, you need to explicitly put the model into training mode by calling model.train(). Do the same for evaluation with model.eval(). This is because some layers behave differently during training and evaluation/inference; examples are Dropout and BatchNorm.
  • During evaluation, you may additionally use the torch.no_grad() context manager. It tells PyTorch "Do not track gradients now", so the forward pass runs faster and uses less memory.
  • You don’t have to loop over the whole train (or validation) dataset in every epoch. A good idea is to limit the number of training and validation batches per epoch.
  • For those interested in training on GPUs: both your data and model should be on exactly the same device. By default, torch tensors and the model are created on the CPU, so you should explicitly transfer them to the GPU if it is available. Here is a code snippet showing how to do it:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 
model = model.to(device) 
inputs = data["image"].to(device) 
labels = data["heatmaps"].to(device)

Yeah, labels as well, otherwise you will not be able to calculate loss, as model output is on GPU.

And it is okay to train a model on a GPU and run predictions on a CPU, and vice versa. Just do this when loading your model weights:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 
model.load_state_dict(torch.load(model_path, map_location=device))
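Putting the points from the list above together, one epoch could be sketched like this. It is not the Trainer from the repository: train_one_epoch and evaluate are hypothetical functions, the batch limits are arbitrary, and the toy usage replaces the real model, dataloader, and loss with a single convolution, a list of random batches, and MSE.

```python
import torch
import torch.nn as nn

def train_one_epoch(model, dataloader, criterion, optimizer, device, max_batches=50):
    model.train()                      # Dropout/BatchNorm in training mode
    losses = []
    for i, data in enumerate(dataloader):
        if i >= max_batches:           # limit the number of batches per epoch
            break
        inputs = data["image"].to(device)
        labels = data["heatmaps"].to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return sum(losses) / max(len(losses), 1)

def evaluate(model, dataloader, criterion, device, max_batches=20):
    model.eval()                       # Dropout/BatchNorm in inference mode
    losses = []
    with torch.no_grad():              # no gradient tracking during evaluation
        for i, data in enumerate(dataloader):
            if i >= max_batches:
                break
            inputs = data["image"].to(device)
            labels = data["heatmaps"].to(device)
            losses.append(criterion(model(inputs), labels).item())
    return sum(losses) / max(len(losses), 1)

# Toy usage with stand-ins for the real model, data, and loss.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(0)
model = nn.Conv2d(3, 21, kernel_size=3, padding=1).to(device)
fake_loader = [
    {"image": torch.rand(4, 3, 16, 16), "heatmaps": torch.rand(4, 21, 16, 16)}
    for _ in range(3)
]
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
train_loss = train_one_epoch(model, fake_loader, criterion, optimizer, device)
val_loss = evaluate(model, fake_loader, criterion, device)
print(train_loss, val_loss)
```

A full training run would call these two functions in a loop over epochs and stop when the validation loss no longer decreases.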

Loss. PyTorch has standard losses implemented; however, if you’d like to use a custom loss, you need to write it on your own. Custom losses are implemented the same way as models: by subclassing torch.nn.Module and overriding forward(). A short template is below, and you can find the full version here.

class IoULoss(nn.Module): 
    def __init__(self): 
        super(IoULoss, self).__init__() 
        #initialize parameters 
    def forward(self, y_pred, y_true): 
        #calculate loss from labels and predictions 
        return loss
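For reference, here is one common way to fill in such a template: a "soft" IoU computed directly on heatmap values. This is a sketch under my own assumptions (the class name SoftIoULoss, the squared terms in the union, and the epsilon are choices made for this example), so the full version in the repository may differ in details.

```python
import torch
import torch.nn as nn

class SoftIoULoss(nn.Module):
    """1 - soft IoU between predicted and target heatmaps, averaged over batch and keypoints."""

    def __init__(self, eps=1e-6):
        super().__init__()
        self.eps = eps   # avoids division by zero for empty heatmaps

    def forward(self, y_pred, y_true):
        # y_pred, y_true: (batch, n_keypoints, H, W) with values in [0, 1]
        inter = (y_pred * y_true).sum(dim=(2, 3))
        union = (y_pred ** 2).sum(dim=(2, 3)) + (y_true ** 2).sum(dim=(2, 3)) - inter
        iou = (inter + self.eps) / (union + self.eps)
        return 1 - iou.mean()

criterion = SoftIoULoss()
torch.manual_seed(0)
heatmaps = torch.rand(2, 21, 8, 8)
print(criterion(heatmaps, heatmaps).item())  # close to 0 for identical heatmaps
```

With this formulation, identical heatmaps give a loss of about 0 and heatmaps that do not overlap at all give a loss of about 1.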

That’s it about training. I really recommend going through Train Notebook.ipynb if you haven’t done so yet.

And if you wish to retrain the model on your own, this should be a useful text snippet from the previous part of the tutorial:

"For this tutorial, I have trained the model with batch_size=48 and batches_per_epoch=50. I started with a learning rate of 0.1 and reduced it by half every time the training loss stopped decreasing. I finished training when the loss stopped decreasing on the validation set. It took about 200 epochs to converge (and 2 hours on GPU), and my final train and validation losses are 0.437 and 0.476, respectively.

These numbers are only landmarks. When training your model, you may come up with a different number of epochs to converge and slightly different final losses. Also, feel free to increase/decrease the batch size to fit into your machine memory, and increase/decrease the learning rate."
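The "reduce by half when the loss stops decreasing" rule can be sketched with PyTorch’s built-in ReduceLROnPlateau scheduler. I am not claiming this is how the original training was implemented: the SGD optimizer, the patience value, the stand-in model, and the list of losses below are all made up for the illustration.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 21, kernel_size=3, padding=1)   # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2
)

# Made-up training losses that stop improving after the third epoch.
fake_train_losses = [0.9, 0.7, 0.6, 0.6, 0.6, 0.6, 0.6]
lrs = []
for loss in fake_train_losses:
    scheduler.step(loss)   # the scheduler watches the loss passed to step()
    lrs.append(optimizer.param_groups[0]["lr"])
print(lrs)
```

Once the loss has failed to improve for more than patience epochs, the learning rate is multiplied by factor, which here means it is halved from 0.1 to 0.05.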

Good luck!

Let’s do Inference

The inference pipeline is much simpler compared to the training pipeline. We only need to:

  • Load the trained model;
  • Create an instance of the dataloader for the test set;
  • Write a post-processor that converts heatmaps to a vector of keypoint locations;
  • And, ideally, calculate prediction error on the test set to evaluate model performance.

For the full code refer to Inference Notebook.ipynb.

Loading the model, creating the test dataloader, and running predictions should be easy tasks once you understand the training part. So right now we are going to focus on post-processing and evaluation.

Image 4. Inference Pipeline in detail. Image by Author

Post-Processing

It is better to calculate the keypoint locations by averaging over all heatmap values. And here is the code.
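As a rough illustration of the averaging idea, here is a NumPy sketch that treats every heatmap as a set of weights over pixel coordinates. The function name heatmaps_to_keypoints and the choice to return pixel coordinates are my assumptions; the post-processing code in the repository may scale or order the outputs differently.

```python
import numpy as np

def heatmaps_to_keypoints(heatmaps):
    """Convert heatmaps (n_keypoints, H, W) to (x, y) pixel locations by weighted averaging."""
    n_keypoints, height, width = heatmaps.shape
    # normalize each heatmap so its values sum to 1 and can act as weights
    totals = heatmaps.sum(axis=(1, 2), keepdims=True)
    weights = heatmaps / np.maximum(totals, 1e-12)
    xs = np.arange(width)
    ys = np.arange(height)
    # sum out rows to get weights per column (for x), sum out columns for y
    x = (weights.sum(axis=1) * xs).sum(axis=1)
    y = (weights.sum(axis=2) * ys).sum(axis=1)
    return np.stack([x, y], axis=1)

# Toy heatmaps: one with a single hot pixel, one with two equally hot pixels.
toy = np.zeros((2, 8, 8))
toy[0, 3, 5] = 1.0                   # row 3, column 5
toy[1, 1, 2] = toy[1, 1, 4] = 0.5    # row 1, columns 2 and 4
keypoints = heatmaps_to_keypoints(toy)
print(keypoints)  # [[5. 3.] [3. 1.]]
```

Unlike taking the argmax, the weighted average can land between pixels, as the second toy heatmap shows.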

Image 5. How to calculate keypoint location from a heatmap by averaging. Image by Author

Evaluation

Calculating the average prediction error on the test set is a must-do. Yes, you can go further: report maximum and minimum image errors, show percentiles, calculate errors by fingers and joints, and visualize images with the largest and smallest errors. But for now, let’s keep it simple.

Here are the evaluation results of my hand pose estimator on the test set:

  • Average error per keypoint: 4.5% of the image size
  • Average error per keypoint: 6 pixels for a 128×128 image
  • Average error per keypoint: 10 pixels for a 224×224 image

Average error per keypoint means the error averaged over 1) all keypoints in the image and then 2) all images in the dataset. It can be reported as a percentage of the image size, or as a pixel error for raw or resized images. The keypoint error here is the Euclidean distance on the image plane between the actual and predicted keypoint locations.
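The metric itself fits in a few lines. The sketch below assumes predictions and labels are arrays of shape (n_images, n_keypoints, 2) in pixel coordinates; mean_keypoint_error is a hypothetical name, and the numbers are made up so the result is easy to check by hand.

```python
import numpy as np

def mean_keypoint_error(pred, true):
    """Euclidean error averaged over keypoints, then over images."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    per_keypoint = np.linalg.norm(pred - true, axis=-1)   # (n_images, n_keypoints)
    per_image = per_keypoint.mean(axis=1)                 # (n_images,)
    return per_image.mean()

# Toy data: every prediction is off by 3 px in x and 4 px in y, so the error is 5 px.
true_kp = np.zeros((2, 21, 2))
pred_kp = true_kp + np.array([3.0, 4.0])
error_px = mean_keypoint_error(pred_kp, true_kp)
error_pct = 100 * error_px / 128   # as a percentage of a 128-pixel image side
print(error_px, error_pct)  # 5.0 3.90625
```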

Image 6. Visualization of prediction errors for some keypoints. Image by Author

What’s Next

I hope that now, after reading the full tutorial and looking through the code, 2D hand pose estimation no longer seems like a complicated task to you.

However, the hand pose estimator that we’ve trained today is far from being production-ready. It fails for poses with occluded keypoints, and it works poorly for images from different datasets. There are a lot of improvements to be done.

It’s a good idea to begin with something simple for study purposes, and slowly, step-by-step introduce more advanced techniques into the algorithm. Maybe later I’ll write tutorials on these techniques. Let me know if you are interested : )


Originally published at https://notrocketscience.blog on April 30, 2021.

If you’d like to read more tutorials like this, subscribe to my blog "Not Rocket Science" – Telegram and Twitter.

