Review: DeepPose — Cascade of CNN (Human Pose Estimation)

Using Cascade of Convolutional Neural Networks for Refinement, State-of-the-art Performance on Four Datasets

Sik-Ho Tsang
Towards Data Science


In this story, DeepPose, by Google, for Human Pose Estimation, is reviewed. Pose estimation is formulated as a Deep Neural Network (DNN)-based regression problem over body joints. With a cascade of DNNs, high-precision pose estimates are achieved. This is a 2014 CVPR paper with more than 900 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Pose Vector
  2. Convolutional Neural Network (CNN) As Pose Regressor
  3. Cascade of Pose Regressors
  4. Results

1. Pose Vector

Pose Vector (Miss You Paul Walker!)
  • To express a pose, we encode the locations of all k body joints in a pose vector defined as y = (…, y_i^T, …)^T, i ∈ {1, …, k}.
  • Each y_i holds the x and y coordinates of the i-th joint. These are absolute coordinates within the image.
  • A labeled image is denoted by (x, y), where x is the image data and y is the ground-truth pose vector as defined above. (I followed the notation in the paper, though it might be a bit confusing for y.)
  • We can normalize the coordinates y_i w.r.t. a box b bounding the human body or parts of it, where b = (b_c, b_w, b_h) with center b_c, width b_w, and height b_h:
    N(y_i; b) = diag(1/b_w, 1/b_h) (y_i − b_c)
  • As shown above, y_i is translated by the box center and scaled by the box size.
  • N(y; b) is the normalized pose vector, obtained by applying N to each y_i, and N(x; b) is a crop of the image x by the bounding box b.
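The normalization and its inverse can be sketched in NumPy. This is a toy sketch under my own conventions — joints as a k×2 array, the box as a (center, width, height) tuple; the function names are not from the paper:

```python
import numpy as np

def normalize_pose(y, b):
    """N(y; b): translate the joints (k x 2 array) by the box center,
    then scale them by the box width/height."""
    bc, bw, bh = b
    return (np.asarray(y) - np.asarray(bc)) / np.array([bw, bh])

def denormalize_pose(y_norm, b):
    """N^-1(y; b): map normalized joints back to absolute image coordinates."""
    bc, bw, bh = b
    return np.asarray(y_norm) * np.array([bw, bh]) + np.asarray(bc)

# Toy example: two joints inside a 100x200 box centered at (50, 100).
y = np.array([[50.0, 100.0], [75.0, 150.0]])
b = ((50.0, 100.0), 100.0, 200.0)
y_n = normalize_pose(y, b)  # the box center maps to the origin
```

Joints normalized this way lie roughly in [−0.5, 0.5] per axis when they fall inside the box, which keeps the regression targets well scaled.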

2. Convolutional Neural Network (CNN) As Pose Regressor

CNN As Pose Regressor
  • A CNN-based regressor ψ(x; θ) with trained parameters θ outputs the normalized prediction of the joints. The prediction in absolute image coordinates, y*, is obtained by denormalization: y* = N^{-1}(ψ(N(x); θ); b).
  • The architecture, shown above, is AlexNet.
  • The first layer takes as input an image of predefined size.
  • The last layer outputs the 2k joint coordinates.
  • C(55×55×96) — LRN — P — C(27×27×256) — LRN — P — C(13×13×384) — C(13×13×384) — C(13×13×256) — P — F(4096) — F(4096) where C is convolution, LRN is local response normalization, P is pooling, and F is fully connected layer.
  • Total number of parameters is 40M.
  • The loss is a linear regression loss on top of the last network layer: the pose vector is predicted by minimizing the L2 distance between the predicted and the ground-truth pose vectors.
  • With the normalized training set D_N, the L2 loss is:
    arg min_θ Σ_{(x,y)∈D_N} Σ_{i=1}^{k} ||y_i − ψ_i(x; θ)||_2^2
  • where k is the number of joints in the image.
  • Mini-batch size is 128. Data is augmented with random translation and left/right flip.
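The objective — a linear regression on top of the last layer's features minimizing the L2 loss — can be illustrated with a toy least-squares fit. The dimensions and random data below are placeholders I chose for illustration, not the real 4096-d AlexNet features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder dimensions: n samples, d features from the last FC layer,
# 2k outputs for k joints (k = 14 here, as in LSP).
n, d, k = 256, 64, 14
feats = rng.normal(size=(n, d))
poses = rng.normal(size=(n, 2 * k))  # normalized ground-truth pose vectors

# Least-squares fit of the final linear layer, minimizing
# sum_i ||y_i - psi_i(x; theta)||^2 over the training set.
W, *_ = np.linalg.lstsq(feats, poses, rcond=None)
pred = feats @ W
loss = np.mean(np.sum((pred - poses) ** 2, axis=1))
```

In the paper this minimization is carried out jointly with the convolutional layers by backpropagation, not in closed form as above.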

3. Cascade of Pose Regressors

Cascade of Pose Regressors: First Stage: (Left), Subsequent Stages (Right)
  • Simply increasing the input size for finer pose estimation is not easy, since it would further increase the already large number of parameters. Thus, a cascade of pose regressors is proposed.
  • With s denoting the cascade stage, the first stage is:
    y^1 = N^{-1}(ψ(N(x; b^0); θ_1); b^0)
  • where b^0 is the full image or a box obtained by a person detector.
  • Subsequent stages then refine each joint by regressing a displacement within a box around the previous estimate:
    y_i^s = y_i^(s−1) + N^{-1}(ψ_i(N(x; b); θ_s); b), for b = b_i^(s−1)
    b_i^s = (y_i^s, σ diam(y^s), σ diam(y^s))
  • where diam(y) is the distance between opposing joints, such as the left shoulder and right hip, scaled by σ to give the box side σ diam(y).
  • For subsequent stages, the training data is augmented by generating simulated predictions: a displacement δ is sampled from the 2-D normal distribution N_i^(s−1) of the displacements observed at the previous stage, giving
    D_A^s = {(N(x; b), N(y_i; b)) | (x, y_i) ~ D, δ ~ N_i^(s−1), b = (y_i + δ, σ diam(y))}
  • Training of the stage parameters θ_s is then based on this augmented training set:
    θ_s = arg min_θ Σ_{(x, y_i) ∈ D_A^s} ||y_i − ψ_i(x; θ)||_2^2
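The refinement loop can be sketched as follows. This is a minimal sketch: `crop` and the stand-in `regressor` are my own simplifications of N(x; b) and the stage-s CNN ψ, respectively:

```python
import numpy as np

def crop(image, center, side):
    """Toy stand-in for N(x; b): clip a square window around `center` (x, y)."""
    h, w = image.shape[:2]
    x0 = int(max(0.0, center[0] - side / 2))
    x1 = int(min(w, center[0] + side / 2))
    y0 = int(max(0.0, center[1] - side / 2))
    y1 = int(min(h, center[1] + side / 2))
    return image[y0:y1, x0:x1]

def cascade_refine(image, y_init, regressor, sigma=1.0, stages=3):
    """For stages s = 2..S, re-crop a box around each joint's current estimate
    and add the regressor's predicted (denormalized) displacement."""
    y = np.asarray(y_init, dtype=float).copy()
    for _ in range(stages - 1):
        diam = np.linalg.norm(y[0] - y[-1])  # pose diameter, e.g. shoulder to opposite hip
        side = sigma * diam
        for i in range(len(y)):
            patch = crop(image, y[i], side)              # N(x; b), box centered on joint i
            y[i] += np.asarray(regressor(patch)) * side  # denormalized displacement
    return y

# Toy run with a dummy regressor that always predicts zero displacement.
img = np.zeros((200, 200))
y0 = np.array([[60.0, 40.0], [120.0, 160.0]])
refined = cascade_refine(img, y0, lambda patch: np.zeros(2))
```

A real stage would feed the re-cropped patch through a trained AlexNet-style regressor; the dummy above only exercises the control flow.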

4. Results

4.1. Datasets

  • Frames Labeled In Cinema (FLIC): 4,000 training and 1,000 test images from Hollywood movies with diverse poses and diverse clothing. For each labeled person, 10 upper-body joints are labeled.
  • Leeds Sports Dataset (LSP): 11,000 training and 1,000 test images from sports activities, challenging in terms of appearance and especially articulation. The majority of people are about 150 pixels in height. For each person, the full body is labeled with a total of 14 joints.

4.2. Metrics

  • Percentage of Correct Parts (PCP): measures detection rate of limbs, where a limb is considered detected if the distance between the two predicted joint locations and the true limb joint locations is at most half of the limb length.
  • Percent of Detected Joints (PDJ): A joint is considered detected if the distance between the predicted and the true joint is within a certain fraction of the torso diameter. By varying this fraction, detection rates are obtained for varying degrees of localization precision.
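The PDJ metric can be written directly from the definition above. A short sketch; the array layout and the example numbers are mine:

```python
import numpy as np

def pdj(pred, gt, torso_diameter, fraction=0.2):
    """Percent of Detected Joints: a joint is detected if its distance to the
    ground truth is within `fraction` of the torso diameter."""
    dists = np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=-1)
    return float(np.mean(dists <= fraction * torso_diameter))

# Toy example: three joints, torso diameter 100 px, so the threshold is 20 px.
pred = np.array([[10.0, 10.0], [50.0, 50.0], [90.0, 10.0]])
gt = np.array([[12.0, 10.0], [50.0, 80.0], [90.0, 12.0]])
rate = pdj(pred, gt, torso_diameter=100.0)  # errors 2, 30, 2 -> 2 of 3 detected
```

Sweeping `fraction` over a range of values yields the detection-rate curves reported in the paper.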

4.3. Ablation Study

  • A small held-out set of 50 images is used for each of the FLIC and LSP datasets.
  • For FLIC, σ = 1.0 after exploring values {0.8, 1.0, 1.2}.
  • For LSP, σ = 2.0 after exploring values {1.5, 1.7, 2.0, 2.3}.
  • Improvement stops at S = 3 cascade stages for both datasets.
  • For each cascade stage starting at s = 2, 40 randomly translated crop boxes are generated for augmentation. For LSP with 14 joints, the number of training samples is 11,000 × 40 × 2 × 14 ≈ 12M.
  • Running time is approximately 0.1 s per image on a 12-core CPU.
  • The initial stage was trained within 3 days on approx. 100 workers, though most of the final performance was achieved after 12 hours.
  • Each refinement stage was trained for 7 days, since the amount of data was 40× larger than that of the initial stage due to the data augmentation.
PDJ on FLIC for the first three stages of the DNN cascade
  • Cascading CNN for refinement helps to improve the results.
Predicted Poses (Red) and Ground-Truth Poses (Green)
  • Again, refinement helps to improve the results.

4.4. Comparison with State-of-the-art Approaches

PDJ on FLIC for Two Joints: Elbows and Wrists
PDJ on LSP for Two Joints: Arms and Legs
  • DeepPose obtains the highest detection rate across different normalized distances to the true joint for both datasets.
PCP at 0.5 on LSP
  • DeepPose-st2 and DeepPose-st3 obtain state-of-the-art results.

4.5. Cross-Dataset Generalization

PDJ on Buffy Dataset for Two Joints: Elbows and Wrists
  • Further, the upper-body model trained on FLIC was applied on the whole Buffy dataset.
  • DeepPose obtains comparable results.
PCP at 0.5 on Image Parse Dataset
  • The full body model trained on LSP is tested on the test portion of the Image Parse dataset.

4.6. Example Poses

Visualization of LSP
Visualization of FLIC
