Make yourself comfortable, it’s a long read

In 2018 I spent 6 months working on my master’s thesis on Hand Pose Estimation. That was a challenging and insightful period of my life that resulted in a 40-page piece of research. To this day I receive job interview invites, as well as thank-you notes and questions, which makes me think that this research is still relevant, even though 3 years have passed, and for Deep Learning that is a long time.
My thesis was on 2D and 3D Hand Pose estimation from a single RGB image. The 3D part was… em… I wouldn’t say it was a complete failure, but close to it : ) However, the 2D part (surprisingly) was comparable to state-of-the-art approaches at that time.
In the coming months, I am planning to convert my thesis into a series of human-friendly tutorials on 2D Hand Pose Estimation: today we start with a gentle introduction to the approach, next time I will show how to implement it in code, and later we will go over several more advanced techniques to improve model performance.
Intrigued? Let’s go!
Contents
- What exactly we are estimating in the Hand Pose Estimation task
- Where to find a dataset (and how to understand it)
- How to preprocess data for training
- What model architecture to use
- Details on training
- Understand and visualize model predictions
- What’s next
What exactly we are estimating in the Hand Pose Estimation task
A good question to start with.
The pose of the hand is defined by the locations of its keypoints. So, in the hand pose estimation task we are looking for keypoint locations.
A hand has 21 keypoints: wrist + 5 fingers * (3 finger joints + 1 fingertip) = 21. Think about it: knowing the location of each keypoint, we can easily tell whether a person is showing a "palm down", "thumb up", "fist" or "peace" sign.
Order matters. We don’t need just a "cloud of locations", we need keypoint-location correspondences. Where is the wrist located? Where is the fingertip of the index finger? etc.

A typical 2D hand pose estimator looks something like this:
- Input: a hand image. This is an important assumption for many hand pose estimators: there is only one hand in the input image, and the image is cropped to exclude other body parts and background objects.
- Output: a list of (x, y) keypoint locations. A location may be represented as a pixel location, where x and y are in the range [0, image_size], or as a normalized location, in the range [0, 1]. A normalized location is simply the pixel location divided by the image size. Hopefully, the example below makes it clear.
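For example, converting between the two formats is just a division or multiplication by the image size. Here is a quick sketch with made-up keypoints:

import numpy as np

IMAGE_SIZE = 128  # assumed image size for this example

# Hypothetical pixel locations of 3 keypoints, shape (n_keypoints, 2), columns are (x, y)
keypoints_px = np.array([[64.0, 70.0], [90.0, 40.0], [30.0, 100.0]])

# Pixel -> normalized: divide by the image size, values end up in [0, 1]
keypoints_norm = keypoints_px / IMAGE_SIZE

# Normalized -> pixel: multiply back by the image size
keypoints_px_again = keypoints_norm * IMAGE_SIZE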

Couple more things to clarify.
What about the second hand? Usually, we train pose estimators to work only for right hands (or only for left hands). If we want to run inference on a left hand, we just give the model a mirrored image, which looks exactly like… a right hand.
Usually, in papers (or demos) the hand pose is visualized as a skeleton, sometimes with multicolored fingers. Don’t let it mislead you: the skeleton is drawn purely for visualization, by connecting keypoints with lines. We know the keypoint order, so we know which keypoint pairs to connect to draw a skeleton, right?
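For illustration, here is a minimal sketch of drawing such a skeleton with matplotlib. It assumes (as in Image 1) that the wrist comes first and is followed by four keypoints per finger, ordered from palm to fingertip; adapt the connection scheme if your dataset uses a different order.

import numpy as np
import matplotlib.pyplot as plt

# Assumed keypoint order: 0 = wrist, then 4 keypoints per finger
# (thumb, index, middle, ring, little), each ordered from palm to fingertip.
FINGERS = [list(range(1 + 4 * i, 5 + 4 * i)) for i in range(5)]

def draw_skeleton(keypoints, ax):
    """Draw a hand skeleton from a (21, 2) array of (x, y) locations."""
    for finger in FINGERS:
        chain = [0] + finger  # connect the wrist to the finger base, then joints along the finger
        ax.plot(keypoints[chain, 0], keypoints[chain, 1], marker="o")

# Usage with random keypoints, just to show the call:
keypoints = np.random.rand(21, 2) * 128
fig, ax = plt.subplots()
draw_skeleton(keypoints, ax)
plt.show()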

In this tutorial, we will learn how to estimate the 2D hand pose from a single RGB image. This means that we will train a model that takes a single RGB image as input and outputs keypoint locations on the image plane. However, Hand Pose Estimation is a large family of different tasks: you may estimate the 2D or even 3D hand pose from a single RGB, single depth, single RGBD (RGB + depth) image, or even multiple images. Maybe I will write a post showing the full variety of these tasks. But for now, let’s move on.
Where to find a dataset (and how to understand it)
Even though collecting and (more importantly) labeling hand pose data are not easy tasks, there are dozens of datasets available on the internet. Depending on your research or business goal you may need:
- Real or Synthetic data
- Photos or video sequences
- Image types: RGB, depth, RGBD, or stereo
- Hand showing signs or interacting with objects
- Egocentric or third-person point of view
- Labels: 2D locations, 3D locations, mesh, hand mask,…
- Number of keypoints labeled
- Whether occluded keypoints are labeled, whether all keypoints are present in the image, etc.
You can find a full list of open-source datasets in the "Awesome Hand Pose Estimation" GitHub repository [2]. For study and research purposes feel free to use any of these datasets; however, if you are planning to train a model for a business, make sure the dataset license allows that.
For this tutorial, I’ve selected the FreiHAND dataset [3]. It contains 33k real images of the right hand and labels for all 21 keypoints. That is all we need for now.
The FreiHAND dataset is clean and well-structured; however, pay attention to the following:
- The order of keypoint locations is the same as shown in Image 1. Labels are stored in a 2D numpy array, where the location of the wrist is in row 0 and the location of the little finger’s fingertip is in row 20.
- You need to calculate 2D locations on your own, using the 3D locations and a camera matrix. Use this function, which I found in the FreiHAND dataset GitHub repo [4]:
import numpy as np

def projectPoints(xyz, K):
    """Project 3D keypoints xyz (21, 3) onto the image plane using camera matrix K (3, 3)."""
    xyz = np.array(xyz)
    K = np.array(K)
    # Perspective projection: multiply by the camera matrix...
    uv = np.matmul(K, xyz.T).T
    # ...then divide by the depth coordinate to get (u, v) pixel locations
    return uv[:, :2] / uv[:, -1:]
- In the training part, there are 130,240 images but only 32,560 labels. These labels correspond to the first 32,560 raw images. If you want to train a model on all the images (raw + augmented), here is how to get the labels:
image_labels = labels_array[image_id % 32560]
That’s because image 32560 is an augmented version of image 0 (it shows the same hand pose, so it shares the same labels), and so on.
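Putting this together, a label-loading sketch could look like the one below. The file names training_xyz.json and training_K.json follow the FreiHAND release; adjust the paths to your local copy.

import json

N_UNIQUE = 32560  # number of unique (raw) training samples

with open("training_xyz.json") as f:
    xyz_all = json.load(f)  # 3D keypoints, one (21, 3) list per raw image
with open("training_K.json") as f:
    K_all = json.load(f)    # camera matrices, one (3, 3) list per raw image

def get_keypoints_2d(image_id):
    """Return 2D pixel locations (21, 2) for any image id, raw or augmented."""
    raw_id = image_id % N_UNIQUE  # augmented images reuse the labels of their raw image
    return projectPoints(xyz_all[raw_id], K_all[raw_id])

keypoints = get_keypoints_2d(40000)  # an augmented image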
How to preprocess data for training
Here is a step-by-step instruction:
- Split the dataset into train, validation, and test parts. As usual, the train set is used to train the model, the validation set is used to decide when to stop training, and model performance is evaluated on the test set.
- Resize images to 128×128 pixels. A hand is a simple object, so this size should be enough. If keypoint locations are in pixel format, make sure you also "resize" them by scaling them with the same factor as the image.

- Original image values are in the range [0, 255]; min-max scale them to the range [0, 1].
- Standard-normalize images using the train set means and standard deviations. Each channel (R, G, B) is normalized separately, so there are 3 means and 3 stds in total. A channel mean (and std) is calculated over all pixels of all train images within that color channel. A code sketch of these image transforms follows the summary at the end of this section.
- Create heatmaps from the array of keypoint locations. Estimating pose with heatmaps is a widely used approach in 2D Hand (and Human) Pose Estimation, and you’ll see it in literally every paper (with slight modifications). I believe research [5] was one of the earliest where heatmaps were used.
We need to create a separate heatmap for each keypoint, so there will be 21 heatmaps in total. Check the image below for details.

Heatmaps are blurred to prevent the model from overfitting and to make learning more stable and faster. The exact blur parameters do not matter much; the only rule is to make the keypoint "dot" neither too big nor too small. The final min-max scaling is needed because we are going to use a sigmoid activation in the last layer of the neural network, so heatmaps and model predictions are in the same range.
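Here is a minimal sketch of how such heatmaps can be generated. The blur size (sigma) is my assumption; tune it so the keypoint "dot" looks reasonable at your image size.

import numpy as np
from scipy.ndimage import gaussian_filter

def keypoints_to_heatmaps(keypoints, image_size=128, sigma=3):
    """Convert (21, 2) pixel keypoint locations into 21 blurred heatmaps in [0, 1]."""
    heatmaps = np.zeros((len(keypoints), image_size, image_size), dtype=np.float32)
    for k, (x, y) in enumerate(keypoints):
        col = int(np.clip(round(x), 0, image_size - 1))
        row = int(np.clip(round(y), 0, image_size - 1))
        heatmaps[k, row, col] = 1.0                              # single "hot" pixel
        heatmaps[k] = gaussian_filter(heatmaps[k], sigma=sigma)  # blur the dot
        heatmaps[k] /= heatmaps[k].max()                         # min-max scale back to [0, 1]
    return heatmaps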
To conclude:
- X is an image of size 3x128x128.
- Y is an array of size 21x128x128 that contains 21 stacked heatmaps, each the same size as the input image. Make sure the heatmap order is the same as the keypoint order in Image 1.
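To express the image side of this preprocessing in code, a sketch with torchvision transforms could look like this. The mean and std values below are placeholders; compute the real ones on your own train split.

from torchvision import transforms

# Placeholder statistics: replace with per-channel means/stds computed on your train set
TRAIN_MEANS = [0.5, 0.5, 0.5]
TRAIN_STDS = [0.25, 0.25, 0.25]

image_transform = transforms.Compose([
    transforms.Resize((128, 128)),                  # resize a PIL image to 128x128
    transforms.ToTensor(),                          # [0, 255] -> [0, 1], shape 3x128x128
    transforms.Normalize(TRAIN_MEANS, TRAIN_STDS),  # per-channel standardization
])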
What model architecture to use
We need some kind of encoder-decoder model because the output has the same size as the input, 128×128. My personal preference here is UNet [6]. I’ve worked with UNets a lot, used them for various segmentation tasks, and they always perform awesomely.
We don’t need a UNet as complex as the one in the original paper [6] because, again, the hand is a simple object. So let’s start with something like this:
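The exact architecture is covered in the next part of this tutorial. As a rough illustration only, a lightweight UNet-style model in PyTorch could look like the sketch below; the channel counts are my assumptions, not necessarily the ones from the figure.

import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions with BatchNorm and ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class SmallUNet(nn.Module):
    def __init__(self, n_keypoints=21):
        super().__init__()
        self.enc1 = conv_block(3, 32)
        self.enc2 = conv_block(32, 64)
        self.bottleneck = conv_block(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec2 = conv_block(128 + 64, 64)
        self.dec1 = conv_block(64 + 32, 32)
        self.head = nn.Conv2d(32, n_keypoints, kernel_size=1)

    def forward(self, x):
        e1 = self.enc1(x)                                     # 32 x 128 x 128
        e2 = self.enc2(self.pool(e1))                         # 64 x 64 x 64
        b = self.bottleneck(self.pool(e2))                    # 128 x 32 x 32
        d2 = self.dec2(torch.cat([self.up(b), e2], dim=1))    # 64 x 64 x 64
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1))   # 32 x 128 x 128
        return torch.sigmoid(self.head(d1))                   # 21 heatmaps in [0, 1]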

Details on training
Loss. Most papers use MSE loss for heatmaps, for instance these two popular papers on 2D Human Pose Estimation: [7], [8]. I spent some time playing around with MSE loss, but it just didn’t work.
Then I found paper [9], where the authors train a model for semantic segmentation using an Intersection over Union (IoU) loss. 2D Hand Pose Estimation is similar to segmentation, with the only difference that heatmaps contain continuous values in the range [0, 1] rather than the binary 0/1 labels of segmentation masks. Nevertheless, we may use the formulas from [9], slightly modified for heatmaps. And it works!
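As a sketch, one way to write such a soft IoU loss for heatmaps (adapted from [9]; the exact formula I used is shown in the image below) is:

import torch

def iou_loss(pred, target, eps=1e-6):
    """Soft IoU loss for heatmaps; pred and target have shape (batch, 21, H, W)."""
    intersection = (pred * target).sum(dim=(2, 3))           # soft intersection per heatmap
    union = (pred + target - pred * target).sum(dim=(2, 3))  # soft union per heatmap
    iou = (intersection + eps) / (union + eps)
    return 1.0 - iou.mean()                                  # minimize (1 - IoU)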

Training. For this tutorial, I trained the model with batch_size=48 and batches_per_epoch=50. I started with a learning rate of 0.1 and halved it every time the training loss stopped decreasing. I finished training when the loss stopped decreasing on the validation set. It took about 200 epochs to converge (and 2 hours on a GPU), and my final train and validation losses are 0.437 and 0.476 respectively.
These numbers are only landmarks. When training your own model, you may end up with a different number of epochs to converge and slightly different final losses. Also, feel free to increase or decrease the batch size to fit into your machine’s memory, and to adjust the learning rate accordingly.
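As a sketch, this schedule maps nicely onto PyTorch’s ReduceLROnPlateau scheduler. The optimizer choice and the patience value here are my assumptions; model, data loaders, and iou_loss are the ones sketched earlier.

import torch

# Assumes: model (e.g. the UNet-style sketch above), train_loader / val_loader
# yielding (X, Y) batches, and iou_loss as defined above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Halve the learning rate when the monitored loss stops decreasing
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=5)

for epoch in range(200):
    model.train()
    train_losses = []
    for X, Y in train_loader:
        optimizer.zero_grad()
        loss = iou_loss(model(X), Y)
        loss.backward()
        optimizer.step()
        train_losses.append(loss.item())
    scheduler.step(sum(train_losses) / len(train_losses))  # reduce LR on a train-loss plateau

    # Validation loss decides when to stop training
    model.eval()
    with torch.no_grad():
        val_loss = sum(iou_loss(model(X), Y).item() for X, Y in val_loader) / len(val_loader)
    print(f"epoch {epoch}: val_loss={val_loss:.3f}")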
Understand and visualize model predictions
So now we have a trained model that outputs heatmaps. However, heatmaps are not keypoint locations, so additional post-processing is needed. By looking at a heatmap we can easily tell where the model thinks a keypoint is located: it is the center of the "white" region, the region with the largest values. So let’s incorporate the same logic into our post-processing.

There are 2 options:
- The simplest way is to find the pixel with the largest value in the heatmap. The (x, y) location of this pixel is the keypoint location.
- A more robust way is to average pixel coordinates weighted by the heatmap values. See the image and the sketch below for details on how to do that.
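Here is a sketch of both options; the weighted-average version is one common way to implement the idea, and the exact computation I used is shown in the image below.

import numpy as np

def heatmaps_to_keypoints_argmax(heatmaps):
    """Option 1: take the pixel with the largest value in each heatmap (21, H, W)."""
    keypoints = []
    for hm in heatmaps:
        row, col = np.unravel_index(np.argmax(hm), hm.shape)
        keypoints.append((col, row))  # (x, y) = (column, row)
    return np.array(keypoints, dtype=np.float32)

def heatmaps_to_keypoints_average(heatmaps):
    """Option 2: weighted average of pixel coordinates, weights = heatmap values."""
    n, h, w = heatmaps.shape
    ys, xs = np.mgrid[0:h, 0:w]
    keypoints = []
    for hm in heatmaps:
        weights = hm / hm.sum()  # normalize so the weights sum to 1
        keypoints.append(((xs * weights).sum(), (ys * weights).sum()))
    return np.array(keypoints, dtype=np.float32)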

Now we are ready to evaluate the model and visualize predictions. The model works well for most poses; however, it fails on poses with severe keypoint occlusions. Well, labeling occluded keypoints is not an easy task even for a human annotator.

And here is the model accuracy on the test set. The average error is calculated by averaging errors over all keypoints in an image and then over all images (see the sketch after the list below). 4.5% – not bad!
- Average error: 4.5% of the image size
- Average error: 6 pixels for image 128×128
- Average error: 10 pixels for image 224×224
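For reference, here is a sketch of how such an average error can be computed, assuming predicted and ground-truth keypoints are stored as arrays of pixel locations.

import numpy as np

def average_error(pred_keypoints, true_keypoints, image_size=128):
    """Mean Euclidean keypoint error as a fraction of the image size.

    Both inputs have shape (n_images, 21, 2) with pixel (x, y) locations.
    """
    dists = np.linalg.norm(pred_keypoints - true_keypoints, axis=-1)  # (n_images, 21)
    per_image = dists.mean(axis=1)        # average over keypoints in each image
    return per_image.mean() / image_size  # then average over images, normalize by image size

# E.g. an error of 0.045 corresponds to ~6 px at 128x128 and ~10 px at 224x224.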
What’s next
I am glad that you’ve read this far. I hope 2D Hand Pose Estimation is now much clearer to you. But…
You know it when you can code it, right? : ) So in the next part of this tutorial, I’ll share and explain all the code, and you’ll learn how to train a 2D hand pose estimator on your own. Don’t shut down your Jupyter Notebook!
Update: The second part is already available here.
References
[1] In case you want to check my thesis, here is text and code.
[2] " Awesome Hand Pose Estimation ", a Github repository with a list of open-source datasets and papers.
[3] FreiHAND dataset, here you can download it.
[4] FreiHAND dataset Github repository.
[5] Tomas Pfister, James Charles, Andrew Zisserman. " Flowing ConvNets for Human Pose Estimation in Videos "
[6] Olaf Ronneberger, Philipp Fischer, Thomas Brox. " U-Net: Convolutional Networks for Biomedical Image Segmentation "
[7] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, Yaser Sheikh. " Convolution Pose Machines "
[8] Alejandro Newell, Kaiyu Yang, Jia Deng. " Stacked Hourglass Networks for Human Pose Estimation "
[9] Md Atiqur Rahman and Yang Wang. " Optimizing Intersection-Over-Union in Deep Neural Networks for Image Segmentation "
Originally published at https://notrocketscience.blog on April 21, 2021.
If you’d like to read more tutorials like this, subscribe to my blog "Not Rocket Science" – Telegram and Twitter.