Review: DeepPose — Cascade of CNN (Human Pose Estimation)

Using Cascade of Convolutional Neural Networks for Refinement, State-of-the-art Performance on Four Datasets

Sik-Ho Tsang
Towards Data Science


In this story, DeepPose, by Google, for Human Pose Estimation, is reviewed. Pose estimation is formulated as a Deep Neural Network (DNN)-based regression problem over body joints. With a cascade of DNNs, high-precision pose estimates are achieved. This is a 2014 CVPR paper with more than 900 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Pose Vector
  2. Convolutional Neural Network (CNN) As Pose Regressor
  3. Cascade of Pose Regressors
  4. Results

1. Pose Vector

Pose Vector (Miss You Paul Walker!)
  • To express a pose, we encode the locations of all k body joints in a pose vector defined as y = (…, y_i^T, …)^T, i ∈ {1, …, k}.
  • Each y_i holds the x and y coordinates of the i-th joint. These are absolute coordinates within the image.
  • A labeled image is denoted by (x, y), where x is the image data and y is the ground-truth pose vector as defined above. (I followed the notation in the paper, though it might be a bit confusing for y.)
  • We can normalize the coordinates y_i w.r.t. a box b bounding the human body or parts of it, where b = (b_c, b_w, b_h) with center b_c, width b_w, and height b_h:
    N(y_i; b) = diag(1/b_w, 1/b_h) (y_i − b_c)
  • As shown above, y_i is translated by the box center and scaled by the box size.
  • N(y; b) is the normalized pose vector, obtained by applying N to each y_i, and N(x; b) is a crop of the image x by the bounding box b.
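The normalization and its inverse can be sketched in NumPy. This is a toy sketch under my own conventions — joints as a k×2 array, the box as a (center, width, height) tuple; the function names are not from the paper:

```python
import numpy as np

def normalize_pose(y, b):
    """N(y; b): translate the joints (k x 2 array) by the box center,
    then scale them by the box width/height."""
    bc, bw, bh = b
    return (np.asarray(y) - np.asarray(bc)) / np.array([bw, bh])

def denormalize_pose(y_norm, b):
    """N^-1(y; b): map normalized joints back to absolute image coordinates."""
    bc, bw, bh = b
    return np.asarray(y_norm) * np.array([bw, bh]) + np.asarray(bc)

# Toy example: two joints inside a 100x200 box centered at (50, 100).
y = np.array([[50.0, 100.0], [75.0, 150.0]])
b = ((50.0, 100.0), 100.0, 200.0)
y_n = normalize_pose(y, b)  # the box center maps to the origin
```

Joints normalized this way lie roughly in [−0.5, 0.5] per axis when they fall inside the box, which keeps the regression targets well scaled.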

2. Convolutional Neural Network (CNN) As Pose Regressor

CNN As Pose Regressor
  • A CNN-based regressor ψ(x; θ) with trained parameters θ outputs the normalized prediction of the joints. The prediction in absolute image coordinates, y*, is obtained by denormalization: y* = N^{-1}(ψ(N(x); θ); b).
  • The architecture, shown above, is AlexNet.
  • The first layer takes as input an image of predefined size.
  • The last layer outputs the 2k joint coordinates.
  • C(55×55×96) — LRN — P — C(27×27×256) — LRN — P — C(13×13×384) — C(13×13×384) — C(13×13×256) — P — F(4096) — F(4096) where C is convolution, LRN is local response normalization, P is pooling, and F is fully connected layer.
  • Total number of parameters is 40M.
  • The loss is a linear regression loss on top of the last network layer: the pose vector is predicted by minimizing the L2 distance between the predicted and the ground-truth pose vectors.
  • With the normalized training set D_N, the L2 loss is:
    arg min_θ Σ_{(x,y)∈D_N} Σ_{i=1}^{k} ||y_i − ψ_i(x; θ)||_2^2
  • where k is the number of joints in the image.
  • Mini-batch size is 128. Data is augmented with random translation and left/right flip.
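The objective — a linear regression on top of the last layer's features minimizing the L2 loss — can be illustrated with a toy least-squares fit. The dimensions and random data below are placeholders I chose for illustration, not the real 4096-d AlexNet features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder dimensions: n samples, d features from the last FC layer,
# 2k outputs for k joints (k = 14 here, as in LSP).
n, d, k = 256, 64, 14
feats = rng.normal(size=(n, d))
poses = rng.normal(size=(n, 2 * k))  # normalized ground-truth pose vectors

# Least-squares fit of the final linear layer, minimizing
# sum_i ||y_i - psi_i(x; theta)||^2 over the training set.
W, *_ = np.linalg.lstsq(feats, poses, rcond=None)
pred = feats @ W
loss = np.mean(np.sum((pred - poses) ** 2, axis=1))
```

In the paper this minimization is carried out jointly with the convolutional layers by backpropagation, not in closed form as above.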

3. Cascade of Pose Regressors

Cascade of Pose Regressors: First Stage: (Left), Subsequent Stages (Right)
  • Simply increasing the input size for finer pose estimation is not easy, since it would further increase the already large number of parameters. Thus, a cascade of pose regressors is proposed.
  • With s denoting the cascade stage, the first stage is:
    y^1 = N^{-1}(ψ(N(x; b^0); θ_1); b^0)
  • where b^0 is the full image or a box obtained by a person detector.
  • Subsequent stages then refine each joint by regressing a displacement within a box around the previous estimate:
    y_i^s = y_i^(s−1) + N^{-1}(ψ_i(N(x; b); θ_s); b), for b = b_i^(s−1)
    b_i^s = (y_i^s, σ diam(y^s), σ diam(y^s))
  • where diam(y) is the distance between opposing joints, such as the left shoulder and right hip, scaled by σ to give the box side σ diam(y).
  • For subsequent stages, the training data is augmented by generating simulated predictions: a displacement δ is sampled from the 2-D normal distribution N_i^(s−1) of the displacements observed at the previous stage, giving
    D_A^s = {(N(x; b), N(y_i; b)) | (x, y_i) ~ D, δ ~ N_i^(s−1), b = (y_i + δ, σ diam(y))}
  • Training of the stage parameters θ_s is then based on this augmented training set:
    θ_s = arg min_θ Σ_{(x, y_i) ∈ D_A^s} ||y_i − ψ_i(x; θ)||_2^2
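The refinement loop can be sketched as follows. This is a minimal sketch: `crop` and the stand-in `regressor` are my own simplifications of N(x; b) and the stage-s CNN ψ, respectively:

```python
import numpy as np

def crop(image, center, side):
    """Toy stand-in for N(x; b): clip a square window around `center` (x, y)."""
    h, w = image.shape[:2]
    x0 = int(max(0.0, center[0] - side / 2))
    x1 = int(min(w, center[0] + side / 2))
    y0 = int(max(0.0, center[1] - side / 2))
    y1 = int(min(h, center[1] + side / 2))
    return image[y0:y1, x0:x1]

def cascade_refine(image, y_init, regressor, sigma=1.0, stages=3):
    """For stages s = 2..S, re-crop a box around each joint's current estimate
    and add the regressor's predicted (denormalized) displacement."""
    y = np.asarray(y_init, dtype=float).copy()
    for _ in range(stages - 1):
        diam = np.linalg.norm(y[0] - y[-1])  # pose diameter, e.g. shoulder to opposite hip
        side = sigma * diam
        for i in range(len(y)):
            patch = crop(image, y[i], side)              # N(x; b), box centered on joint i
            y[i] += np.asarray(regressor(patch)) * side  # denormalized displacement
    return y

# Toy run with a dummy regressor that always predicts zero displacement.
img = np.zeros((200, 200))
y0 = np.array([[60.0, 40.0], [120.0, 160.0]])
refined = cascade_refine(img, y0, lambda patch: np.zeros(2))
```

A real stage would feed the re-cropped patch through a trained AlexNet-style regressor; the dummy above only exercises the control flow.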

4. Results

4.1. Datasets

  • Frames Labeled In Cinema (FLIC): 4,000 training and 1,000 test images from Hollywood movies with diverse poses and diverse clothing. For each labeled person, 10 upper-body joints are labeled.
  • Leeds Sports Dataset (LSP): 11,000 training and 1,000 test images from sports activities, challenging in terms of appearance and especially articulation. The majority of people are about 150 pixels in height. For each person, the full body is labeled with a total of 14 joints.

4.2. Metrics

  • Percentage of Correct Parts (PCP): measures detection rate of limbs, where a limb is considered detected if the distance between the two predicted joint locations and the true limb joint locations is at most half of the limb length.
  • Percent of Detected Joints (PDJ): A joint is considered detected if the distance between the predicted and the true joint is within a certain fraction of the torso diameter. By varying this fraction, detection rates are obtained for varying degrees of localization precision.
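The PDJ metric can be written directly from the definition above. A short sketch; the array layout and the example numbers are mine:

```python
import numpy as np

def pdj(pred, gt, torso_diameter, fraction=0.2):
    """Percent of Detected Joints: a joint is detected if its distance to the
    ground truth is within `fraction` of the torso diameter."""
    dists = np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=-1)
    return float(np.mean(dists <= fraction * torso_diameter))

# Toy example: three joints, torso diameter 100 px, so the threshold is 20 px.
pred = np.array([[10.0, 10.0], [50.0, 50.0], [90.0, 10.0]])
gt = np.array([[12.0, 10.0], [50.0, 80.0], [90.0, 12.0]])
rate = pdj(pred, gt, torso_diameter=100.0)  # errors 2, 30, 2 -> 2 of 3 detected
```

Sweeping `fraction` over a range of values yields the detection-rate curves reported in the paper.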

4.3. Ablation Study

  • A small held-out set of 50 images is used for each of the FLIC and LSP datasets.
  • For FLIC, σ = 1.0 after exploring values {0.8, 1.0, 1.2}.
  • For LSP, σ = 2.0 after exploring values {1.5, 1.7, 2.0, 2.3}.
  • Improvement stops at S = 3 cascade stages for both datasets.
  • For each cascade stage starting at s = 2, 40 randomly translated crop boxes are generated for augmentation. For LSP with 14 joints, the number of training samples is 11,000 × 40 × 2 × 14 ≈ 12M.
  • Running time is approximately 0.1 s per image on a 12-core CPU.
  • The initial stage was trained within 3 days on approx. 100 workers, though most of the final performance was achieved after 12 hours.
  • Each refinement stage was trained for 7 days, since the amount of data was 40× larger than that of the initial stage due to the data augmentation.
PDJ on FLIC for the first three stages of the DNN cascade
  • Cascading CNN for refinement helps to improve the results.
Predicted Poses (Red) and Ground-Truth Poses (Green)
  • Again, refinement helps to improve the results.

4.4. Comparison with State-of-the-art Approaches

PDJ on FLIC for Two Joints: Elbows and Wrists
PDJ on LSP for Two Joints: Arms and Legs
  • DeepPose obtains the highest detection rate across different normalized distances to the true joint for both datasets.
PCP at 0.5 on LSP
  • DeepPose-st2 and DeepPose-st3 obtain state-of-the-art results.

4.5. Cross-Dataset Generalization

PDJ on Buffy Dataset for Two Joints: Elbows and Wrists
  • Further, the upper-body model trained on FLIC was applied on the whole Buffy dataset.
  • DeepPose obtains comparable results.
PCP at 0.5 on Image Parse Dataset
  • The full body model trained on LSP is tested on the test portion of the Image Parse dataset.

4.6. Example Poses

Visualization of LSP
Visualization of FLIC
