OpenPose Research Paper Summary: Multi-Person 2D Pose Estimation with Deep Learning

Machines understanding human posture.

Chonyy

Published in

Towards Data Science

6 min read · Sep 13, 2020

AI Basketball Analysis. Image by Chonyy.

Introduction

This paper summary will give you a good understanding of the high-level concept of OpenPose. Since we will be focusing on their creative pipeline and structure, there will be no difficult math or theory included in this summary.

I would like to start by talking about why I want to share what I learned from this wonderful paper. I have implemented the OpenPose library in my AI Basketball Analysis project. At the time I was building the project, I only knew the basic concept of OpenPose. I spent most of the time working on the code implementation and trying to figure out the best way to combine OpenPose with my original basketball shot detection.

Source: https://github.com/CMU-Perceptual-Computing-Lab/openpose

Now, as you can see in the GIF, the project is almost completed. I have a full grasp of the implementation of OpenPose after building this project. In order to have a better understanding of what I have been dealing with, I think now it’s time for me to take a deeper look at the research paper.

Image taken from “Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields”.

Overview

In this work, we present a realtime approach to detect the 2D pose of multiple people in an image.

The proposed method uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. This bottom-up system achieves high accuracy and realtime performance, regardless of the number of people in the image.

Why is it difficult?

Let’s start by talking about what makes estimating the poses of multiple people in an image so difficult. Here are some of the difficulties.

Unknown number of people
People can appear in any pose or at any scale
People touch and overlap each other
Runtime complexity grows with the number of people

Common Approach

The OpenPose team is definitely not the first to face this challenge. So how did other teams try to tackle these problems?

Source: https://medium.com/syncedreview/now-you-see-me-now-you-dont-fooling-a-person-detector-aa100715e396

A common approach is to employ a person detector and perform single-person pose estimation for each detection.

This kind of top-down method sounds really intuitive and simple. However, there are some hidden pitfalls in this approach.

Early commitment: there is no recourse to recovery when the person detector fails
Runtime is proportional to the number of people
Pose estimation is still executed even if the person detector fails

Initial Bottom-Up Approach

If the top-down method doesn’t sound like the best approach, then why don’t we try bottom-up?

Not surprisingly, the OpenPose team is not the first to come up with a bottom-up method. Some other teams had also tried the bottom-up approach. However, those earlier methods still faced some problems.

They required costly global inference at the final parse
They didn’t retain the gains in efficiency
They took several minutes per image

OpenPose Pipeline

(a) Take the entire image as the input for a CNN
(b) Predict confidence maps for body parts detection
(c) Predict PAFs for part association
(d) Perform a set of bipartite matchings
(e) Assemble into a full body pose

Confidence Map

A confidence map is a 2D representation of the belief that a particular body part is located at any given pixel. Each body part type is represented on its own map, so the number of maps is the same as the total number of body part types.

The map on the right is only for detecting the left shoulder. Image taken from “Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields”.
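To make the idea concrete, here is a minimal sketch of how a ground-truth confidence map for one body part could be built. It follows the paper's description of placing a Gaussian-shaped peak at each person's keypoint and merging people with a max, so that nearby peaks stay distinct. The grid size, the keypoint positions, the `sigma` value, and the function name `confidence_map` are all my own assumptions for illustration, not values from the paper.

```python
import math

def confidence_map(width, height, keypoints, sigma=2.0):
    """Return a height x width grid of beliefs for ONE body part type.

    keypoints: list of (x, y) positions of that part, one per person.
    Each person contributes a Gaussian-shaped peak; overlapping peaks are
    merged with max (not average) so close peaks are not blurred together.
    """
    grid = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for kx, ky in keypoints:
                dist_sq = (x - kx) ** 2 + (y - ky) ** 2
                value = math.exp(-dist_sq / sigma ** 2)
                grid[y][x] = max(grid[y][x], value)
    return grid

# Two people in the image, so the "left shoulder" map has two peaks.
left_shoulder = confidence_map(12, 8, [(3, 4), (9, 2)])
print(left_shoulder[4][3], left_shoulder[2][9])  # belief at the two keypoints
```

At inference time the network predicts these maps, and body part candidates are read off as local peaks.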

PAF

Part Affinity Fields (PAFs) are a set of 2D vector fields that encode the location and orientation of limbs over the image domain.
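Here is a rough sketch of what a PAF for a single limb looks like, and how it can score a candidate connection. Every pixel on the limb stores the unit vector pointing from one joint to the other, and every other pixel stores zero. A candidate pair is then scored by sampling points along the segment between the two candidates and averaging how well the field aligns with the segment direction, which approximates the line integral described in the paper. The names `limb_paf` and `association_score`, the limb width, the sample count, and the rounding to the nearest pixel are simplifying assumptions of mine.

```python
import math

def limb_paf(width, height, joint_a, joint_b, limb_width=1.0):
    """Build a PAF for one limb: unit vector on the limb, (0, 0) elsewhere."""
    ax, ay = joint_a
    bx, by = joint_b
    length = math.hypot(bx - ax, by - ay)
    vx, vy = (bx - ax) / length, (by - ay) / length  # unit direction a -> b
    field = [[(0.0, 0.0)] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            along = vx * (x - ax) + vy * (y - ay)         # distance along the limb
            across = abs(-vy * (x - ax) + vx * (y - ay))  # distance from its axis
            if 0 <= along <= length and across <= limb_width:
                field[y][x] = (vx, vy)
    return field

def association_score(field, cand_a, cand_b, samples=10):
    """Average alignment between the PAF and the segment cand_a -> cand_b."""
    ax, ay = cand_a
    bx, by = cand_b
    length = math.hypot(bx - ax, by - ay)
    dx, dy = (bx - ax) / length, (by - ay) / length
    total = 0.0
    for i in range(samples):
        u = i / (samples - 1)
        px = round(ax + u * (bx - ax))  # nearest pixel instead of interpolation
        py = round(ay + u * (by - ay))
        fx, fy = field[py][px]
        total += fx * dx + fy * dy
    return total / samples

paf = limb_paf(10, 10, (2, 2), (7, 2))             # a horizontal limb
on_limb = association_score(paf, (2, 2), (7, 2))   # candidates on the true limb
wrong = association_score(paf, (2, 2), (2, 8))     # a perpendicular pairing
print(on_limb, wrong)
```

A pairing that follows the limb scores high, while a pairing that cuts across it scores close to zero. That is what lets PAFs tell which elbow belongs to which shoulder.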

Bipartite Matching

When it comes to finding the full body pose of multiple people, determining Z, the set of connections between detected body part candidates, is a K-dimensional matching problem. This problem is NP-Hard and many relaxations exist. In this work, we add two relaxations to the optimization, specialized to our domain.

Relaxation 1: Choose a minimal number of edges to obtain a spanning tree skeleton.
Relaxation 2: Further decompose the matching problem into a set of bipartite matching subproblems. Determine the matching in adjacent tree nodes independently.
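The two relaxations can be sketched as follows: keep only the limbs of a tree skeleton, then solve one small matching problem per limb, independently of the others. The paper mentions the Hungarian algorithm for the per-limb matching. To keep this sketch dependency-free, I swap in a brute-force search over permutations, which gives the same optimal pairing but only scales to a handful of people. The score values, part names, and function names below are made up for illustration.

```python
from itertools import permutations

def best_matching(scores):
    """scores[i][j]: PAF score between candidate i of part A and j of part B.

    Returns the one-to-one pairing with the highest total score.
    Brute force stand-in for the Hungarian algorithm.
    Assumes len(scores) <= len(scores[0]).
    """
    n_a, n_b = len(scores), len(scores[0])
    best_total, best_pairs = float("-inf"), []
    for perm in permutations(range(n_b), n_a):
        total = sum(scores[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(perm))
    return best_pairs

def match_limbs(tree_edges, scores_per_edge):
    # Relaxation 2: every limb of the tree is matched on its own.
    return {edge: best_matching(scores_per_edge[edge]) for edge in tree_edges}

# Relaxation 1: only the edges of a tree skeleton are considered.
edges = [("neck", "r_shoulder"), ("r_shoulder", "r_elbow")]
scores = {
    edges[0]: [[0.9, 0.1], [0.2, 0.8]],
    edges[1]: [[0.1, 0.7], [0.95, 0.3]],
}
limbs = match_limbs(edges, scores)
print(limbs)
```

Limbs that share a body part candidate can then be chained together into one person, which is step (e) of the pipeline.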

Structure

Original Structure

The original structure is split into two branches.

Beige branch: predicts the confidence map
Blue branch: predicts the PAF

Both branches are organized as an iterative prediction architecture. The predictions from the previous stage are concatenated with the original feature F to produce more refined predictions.
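The data flow of this two-branch design can be sketched with stub stages. Lists of channel labels stand in for real tensors, so `+` plays the role of channel-wise concatenation. The channel counts (8 feature channels, 2 confidence maps, 4 PAF channels) are toy numbers I chose, and `make_branch` and `run_two_branch` are hypothetical helpers, not part of OpenPose.

```python
def make_branch(name, out_channels, log):
    """Stub for one branch of one stage; records how many channels it receives."""
    def branch(stage_input):
        log.append((name, len(stage_input)))
        return [f"{name}{i}" for i in range(out_channels)]
    return branch

def run_two_branch(features, stages):
    """stages: list of (confidence_branch, paf_branch) pairs, one per stage."""
    conf_maps, pafs = None, None
    for conf_branch, paf_branch in stages:
        if conf_maps is None:
            stage_input = features                     # first stage sees only F
        else:
            stage_input = features + conf_maps + pafs  # F + previous predictions
        # Both branches read the same input within a stage.
        conf_maps, pafs = conf_branch(stage_input), paf_branch(stage_input)
    return conf_maps, pafs

log = []
features = [f"F{i}" for i in range(8)]
stages = [(make_branch("S", 2, log), make_branch("L", 4, log)) for _ in range(3)]
conf_maps, pafs = run_two_branch(features, stages)
print(log)
```

The log shows that the first stage only receives the 8 feature channels, while every later stage receives 8 + 2 + 4 = 14 channels.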

New Structure

The first set of stages predicts PAFs L^t, while the last set predicts confidence maps S^t. The predictions of each stage and their corresponding image features are concatenated for each subsequent stage.

Compared to their previous publication, they have made a big breakthrough and come up with a new structure. As you can see from the structure above, the confidence map prediction is run on top of the most refined PAF predictions.

Why? The reason is actually really simple.

Intuitively, if we look at the PAF channel output, the body part locations can be guessed.
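A small sketch of this PAF-first data flow, again with lists of channel labels standing in for tensors and `+` standing in for concatenation. The PAF stages refine only PAFs, and the confidence map stages then start from the most refined PAFs. The channel counts and the helper names `make_stage` and `run_paf_first` are toy assumptions of mine.

```python
def make_stage(name, out_channels, log):
    """Stub stage; records how many input channels it receives."""
    def stage(stage_input):
        log.append((name, len(stage_input)))
        return [f"{name}{i}" for i in range(out_channels)]
    return stage

def run_paf_first(features, paf_stages, conf_stages):
    pafs = None
    for stage in paf_stages:
        # Each PAF stage sees F plus the previous PAF prediction.
        stage_input = features if pafs is None else features + pafs
        pafs = stage(stage_input)
    conf_maps = None
    for stage in conf_stages:
        # Confidence stages always see F plus the final, most refined PAFs.
        stage_input = features + pafs
        if conf_maps is not None:
            stage_input = stage_input + conf_maps
        conf_maps = stage(stage_input)
    return pafs, conf_maps

log = []
features = [f"F{i}" for i in range(8)]
paf_stages = [make_stage("L", 4, log) for _ in range(3)]
conf_stages = [make_stage("S", 2, log) for _ in range(2)]
pafs, conf_maps = run_paf_first(features, paf_stages, conf_stages)
print(log)
```

Notice that no PAF stage ever receives confidence maps as input, which matches the intuition above: body part locations can be guessed from PAFs, but not the other way around.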

What about Other Libraries?

A growing number of computer vision and machine learning applications require 2D human pose estimation as an input for their systems. The OpenPose team is definitely not the only one doing this research. Then why do we always think of OpenPose when it comes to pose estimation, and not Alpha-Pose? Here are some of the problems with other libraries.

They require users to implement most of the pipeline
Users have to construct their own frame reader
Facial and body keypoint detectors are not combined

AI Basketball Analysis

I have implemented OpenPose in this AI Basketball Analysis project. In the beginning, I only had the idea that I wanted to analyze the shooting pose of the shooter, but I had no clue how to do it! Luckily, I came across OpenPose, which gave me everything I wanted.

Although the installation process is a little troublesome, the actual code implementation is fairly simple. Their function takes a frame as input and outputs the coordinates of the human keypoints. What’s even better, it can also draw the detection overlay on the frame! The simplicity of the code implementation is the main reason why their repo has obtained 18.6k+ stars on GitHub.
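Once you have keypoint coordinates, analyzing a pose can be as simple as a bit of geometry. As a hedged illustration, here is one way to compute the angle at a joint from three (x, y) keypoints. The coordinates below are made up and are not in OpenPose's actual output format, and `joint_angle` is a hypothetical helper, not a function from OpenPose or necessarily the one used in my project.

```python
import math

def joint_angle(a, b, c):
    """Angle at point b, in degrees, formed by the points a-b-c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cosine))

# Made-up pixel coordinates for a shooter's arm.
shoulder, elbow, wrist = (100, 100), (150, 100), (150, 50)
elbow_angle = joint_angle(shoulder, elbow, wrist)
print(f"elbow angle: {elbow_angle:.1f} degrees")
```

In a real pipeline you would also want to skip keypoints that were not detected before computing an angle.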

My project is below, feel free to check it out!

chonyy/AI-basketball-analysis

🏀 Analyze basketball shots and shooting pose with machine learning! This is an artificial intelligence application…

github.com

Reference

[1] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh, “OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.