Towards a more applicative Pose Estimation

Designing a stillness scorer for a person doing meditation, with an implementation in Python

Aakash Agrawal
Towards Data Science


Photo by @ericmuhr on Unsplash.

In this blog, I discuss how to make pose estimation algorithms more useful in practice by highlighting a key issue that shows up during inference, jitter, and ways to mitigate it. I also walk through an example application built on top of pose estimation, with an implementation in Python.

Keywords: human pose-estimation, jitter, low-pass filter, signal.

Human pose estimation is a challenging problem in computer vision, where the goal is to localize human body keypoints (e.g., hips, shoulders, wrists). It has many applications, including AR and VR-based games (e.g., Microsoft Kinect), interactive fitness, therapy, and motion capture. Frame-to-frame smoothness of the results is critical for these applications to be of any use.

src: https://arxiv.org/pdf/1902.09212.pdf

Jitter Problem

Almost every pose estimation algorithm suffers from jitter during inference. Jitter refers to high-frequency oscillations of the keypoints around their true position, which make the keypoint signal noisy.

Fig: Example of the jitter problem using the MoveNet model. The person is sitting still, but the pose estimation is jittery. GIF by author.

Jitter can be attributed to the fact that inference is performed independently on each frame of the video, and consecutive frames have varying occlusion (and a range of complex poses). Another reason can be inconsistent annotations in the training data, which result in uncertainty in the estimated pose. Jitter poses the following problems:

  1. Noisy, unreliable keypoints degrade the performance of any algorithm that consumes them.
  2. The keypoints are too noisy to build useful features and applications in a production environment.
  3. There is a high probability of getting false positive data points.
  4. Example: say you want to build a stillness scorer using pose estimation (for a person doing meditation); jitter can contribute significantly to the score, resulting in inaccurate results.

A solution to the Jitter Problem

Signal processing offers two basic filter types for attenuating unwanted frequencies in a signal. Low-pass filter (LPF): a filter that attenuates all frequencies in a signal above a specified cutoff frequency and passes the lower frequencies unchanged.

Fig: LPF (Image by author)

High-pass filter (HPF): a filter that attenuates all frequencies in a signal below a specified cutoff frequency and passes the higher frequencies unchanged.

Fig: HPF (Image by author)

Natural human movements are low-frequency signals, whereas jitter is a high-frequency signal. So, to address jitter, we can use a low-pass filter, which attenuates the high-frequency components while keeping the actual movement.
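
To make the idea concrete, here is a minimal sketch of the simplest low-pass filter, first-order exponential smoothing, applied to one synthetic keypoint coordinate. This is not the filter used in the accompanying repository; the frame rate, jitter frequency, amplitude, and alpha value are assumptions chosen only for illustration.

```python
import math


def exponential_low_pass(samples, alpha):
    """First-order low-pass filter (exponential smoothing).

    alpha is in (0, 1]; smaller values smooth more aggressively.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    previous = None
    for value in samples:
        if previous is None:
            previous = value  # initialise the filter with the first sample
        else:
            previous = alpha * value + (1.0 - alpha) * previous
        smoothed.append(previous)
    return smoothed


def peak_to_peak(values):
    """Spread of a signal: a rough measure of how much it wobbles."""
    return max(values) - min(values)


# Synthetic x-coordinate of a keypoint on a person sitting still at x = 100 px,
# corrupted by fast jitter. All numbers below are illustrative assumptions.
FPS = 30
JITTER_HZ = 10.0
raw = [100.0 + 3.0 * math.sin(2.0 * math.pi * JITTER_HZ * n / FPS)
       for n in range(90)]
smoothed = exponential_low_pass(raw, alpha=0.2)

print(f"raw wobble:      {peak_to_peak(raw[30:]):.2f} px")
print(f"smoothed wobble: {peak_to_peak(smoothed[30:]):.2f} px")
```

A fixed alpha trades jitter against lag: the stronger the smoothing, the more the filtered keypoint trails behind real movement. That trade-off is what adaptive filters try to ease.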

Other approaches to solving the jitter problem involve using neural networks for pose refinement; one such example is SmoothNet [1]. However, LPFs are far easier to implement and use. A variation of the LPF is the One Euro (1€) filter [3], which is also quite powerful at filtering noisy signals in real time.

MoveNet Pose Estimation

Let's start with some code and make the LPF work in Python. For illustration, I use TensorFlow's MoveNet pose estimation model [2], which is both fast and accurate.

Now, let us consider some simple functions that will be used for inference. The TFLite model can be downloaded from here. The Python API for running inference on TFLite models is provided in the tf.lite module (ref: load and run a model in Python using TFLite). The entire code can be found in my GitHub repository here.
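
As a rough sketch of what such an inference helper could look like (not necessarily the code in the repository), the function below runs one frame through an already-loaded tf.lite.Interpreter. It assumes a single-pose MoveNet model whose output has shape [1, 1, 17, 3], with each keypoint given as (y, x, score) normalized to [0, 1], and a frame that has already been resized to the model's input size. The function name and the model path in the comments are hypothetical.

```python
import numpy as np


def run_movenet(interpreter, frame):
    """Run a single-pose MoveNet TFLite model on one frame.

    interpreter: a tf.lite.Interpreter on which allocate_tensors() was called.
    frame: H x W x 3 image array, assumed to be already resized to the
           model's input size.
    Returns an array of shape (17, 3) with rows of (y, x, score).
    """
    input_details = interpreter.get_input_details()[0]
    output_details = interpreter.get_output_details()[0]

    expected_hw = tuple(int(v) for v in input_details["shape"][1:3])
    if tuple(frame.shape[:2]) != expected_hw:
        raise ValueError(f"frame must be resized to {expected_hw} first")

    # Add the batch dimension and cast to the dtype the model expects.
    batch = np.expand_dims(frame, axis=0).astype(input_details["dtype"])
    interpreter.set_tensor(input_details["index"], batch)
    interpreter.invoke()

    output = interpreter.get_tensor(output_details["index"])
    return output[0, 0]


# Typical setup (requires TensorFlow; the file name is a hypothetical path):
#   import tensorflow as tf
#   interpreter = tf.lite.Interpreter(model_path="movenet.tflite")
#   interpreter.allocate_tensors()
#   keypoints = run_movenet(interpreter, resized_frame)
```

Reading the dtype and shape from the interpreter, instead of hard-coding them, keeps the helper usable across model variants that expect different input types or resolutions.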

The entire Python script can be found here. Use the following command to run inference locally (after cloning the repository, first run “cd motion-detection”):

python -m inference.movenet_infer --path file.mp4 --lpf n

Let's have a look at a sample inference result using the MoveNet model:

Fig: Sample inference using the MoveNet model. The model looks accurate and fast. GIF by author.

Clearly, the inference looks quite accurate, and the latency is also small. Now, let us return to the jittery example we saw at the start and see how we can solve the jitter issue. For demonstration, we use the 1€ low-pass filter. We could also use SciPy, a popular scientific computing library for Python, which supports different types of low-pass filters (for example, via the scipy.signal.lfilter function). The 1€ filter is the one used in the accompanying script.
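
For readers who want to see the mechanics, here is a compact sketch of a 1€ filter for a single coordinate, following the commonly published formulation of the algorithm (as in reference [3]). It is a sketch, not necessarily identical to the repository code; the parameter values (min_cutoff, beta, d_cutoff) and the synthetic signal are assumptions for illustration.

```python
import math


def smoothing_factor(dt, cutoff):
    """Exponential-smoothing weight for a given time step and cutoff (Hz)."""
    r = 2.0 * math.pi * cutoff * dt
    return r / (r + 1.0)


class OneEuroFilter:
    """1€ filter for one scalar signal (e.g., the x of one keypoint)."""

    def __init__(self, t0, x0, min_cutoff=1.0, beta=0.0, d_cutoff=1.0):
        self.min_cutoff = min_cutoff  # lower -> less jitter, more lag
        self.beta = beta              # higher -> less lag on fast movement
        self.d_cutoff = d_cutoff      # cutoff used to smooth the derivative
        self.t_prev = t0
        self.x_prev = x0
        self.dx_prev = 0.0

    def __call__(self, t, x):
        dt = t - self.t_prev
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")

        # Smoothed estimate of how fast the signal is changing.
        a_d = smoothing_factor(dt, self.d_cutoff)
        dx = (x - self.x_prev) / dt
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev

        # Faster movement raises the cutoff, so the filter lags less.
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = smoothing_factor(dt, cutoff)
        x_hat = a * x + (1.0 - a) * self.x_prev

        self.t_prev, self.x_prev, self.dx_prev = t, x_hat, dx_hat
        return x_hat


def peak_to_peak(values):
    return max(values) - min(values)


# Synthetic jittery coordinate of a still keypoint (illustrative numbers).
FPS = 30
raw = [100.0 + 3.0 * math.sin(2.0 * math.pi * 10.0 * n / FPS)
       for n in range(90)]

one_euro = OneEuroFilter(t0=0.0, x0=raw[0], min_cutoff=1.0, beta=0.01)
smoothed = [raw[0]] + [one_euro(n / FPS, raw[n]) for n in range(1, len(raw))]

print(f"raw wobble:      {peak_to_peak(raw[30:]):.2f} px")
print(f"smoothed wobble: {peak_to_peak(smoothed[30:]):.2f} px")
```

In a pose pipeline, one such filter instance would be kept per coordinate of every keypoint (for example, 17 keypoints × 2 coordinates) and fed the timestamp of each frame.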

The entire Python script can be found here. Use the following command to run inference locally with the LPF enabled:

python -m motion_detection.inference.stillness_scorer --path file.mp4 --lpf y

Fig: Jittery pose estimation (center) compared with the LPF output (right), where the jitter is removed and the estimation is quite smooth. GIF by author.

Application Example

Now, let us look at a very simple example where pose estimation can be made more useful in practice using the concepts mentioned above. Consider the following problem statement: “Score a person doing meditation based solely on body stillness.”

Can you think of some other techniques besides pose estimation that can be used to solve this problem?

Vanilla Image Processing

Maybe we can use simple image processing methods for the problem. We can start by subtracting two consecutive frames in a video stream, and then we can apply binary thresholding to obtain the subtracted mask; here, the number of white pixels will be indicative of stillness.
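
The frame-differencing idea can be sketched with NumPy alone, assuming the frames are already grayscale uint8 arrays. The function names and both threshold values are hypothetical and would need tuning on real footage.

```python
import numpy as np


def motion_pixel_count(prev_gray, curr_gray, diff_threshold=25):
    """Count pixels whose intensity changed by more than diff_threshold.

    Both frames are 2-D uint8 grayscale arrays of the same shape.
    Returns (count, mask), where mask is the binary "subtracted mask".
    """
    # Cast to a signed type so the subtraction cannot wrap around.
    diff = np.abs(curr_gray.astype(np.int16) - prev_gray.astype(np.int16))
    mask = diff > diff_threshold
    return int(mask.sum()), mask


def is_still(prev_gray, curr_gray, diff_threshold=25, max_moving_pixels=50):
    """Treat the scene as still if few pixels changed between frames."""
    count, _ = motion_pixel_count(prev_gray, curr_gray, diff_threshold)
    return count < max_moving_pixels


# Toy frames: a bright 10 x 10 block that shifts 5 pixels to the right.
frame_a = np.zeros((64, 64), dtype=np.uint8)
frame_a[20:30, 20:30] = 200
frame_b = np.zeros((64, 64), dtype=np.uint8)
frame_b[20:30, 25:35] = 200

moving_pixels, _ = motion_pixel_count(frame_a, frame_b)
print("changed pixels:", moving_pixels)
print("still:", is_still(frame_a, frame_b))
```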

Fig: Subtracted mask. Using a simple image processing technique to measure stillness. If the person is still, the number of white pixels will be below a specified threshold. Image by Author.

This approach is simple, but it breaks down when something else is moving in the background, such as a fan or a cat, because that motion also becomes part of the subtracted mask. The goal is to come up with a method that responds only to the person's movement.

Image (Human) Segmentation

How about using human segmentation? We can segment out just the person, take the difference of two consecutive segmented frames, and count the white pixels. Limitation: this approach would not work when the movement happens inside the segmented region, since the mask itself stays nearly the same.
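
A small sketch of this idea and its limitation, assuming the segmentation model (not shown here) yields one boolean person mask per frame. The function name is hypothetical.

```python
import numpy as np


def silhouette_change(prev_mask, curr_mask):
    """Number of pixels where two boolean person masks disagree."""
    return int(np.logical_xor(prev_mask, curr_mask).sum())


# Toy person mask: a 20 x 10 rectangle standing in for the silhouette.
person = np.zeros((40, 40), dtype=bool)
person[10:30, 15:25] = True

# The whole body shifts 2 pixels to the right: the silhouette changes.
shifted = np.zeros((40, 40), dtype=bool)
shifted[10:30, 17:27] = True
print("body shift:", silhouette_change(person, shifted))

# A movement that stays inside the silhouette leaves the mask unchanged,
# so this measure reports no motion at all.
print("movement inside the mask:", silhouette_change(person, person))
```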

Pose Estimation

Here, for each (smoothed) body keypoint, we calculate the Euclidean distance between its positions in consecutive frames. The final score is a weighted sum of these distances. Clearly, if the person moves, the Euclidean distances for the keypoints will be higher, and vice versa.
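
The per-frame score described above can be sketched as follows. The article does not specify the weights, so the ones used here are placeholders, and the function name is hypothetical. Keypoints are assumed to be (x, y) pairs in pixels, already smoothed.

```python
import math


def motion_score(prev_keypoints, curr_keypoints, weights=None):
    """Weighted sum of per-keypoint Euclidean displacements between frames.

    prev_keypoints, curr_keypoints: sequences of (x, y) pairs in the same
    keypoint order. weights: one weight per keypoint (defaults to all 1.0).
    """
    if len(prev_keypoints) != len(curr_keypoints):
        raise ValueError("both frames must have the same number of keypoints")
    if weights is None:
        weights = [1.0] * len(prev_keypoints)
    if len(weights) != len(prev_keypoints):
        raise ValueError("one weight per keypoint is required")
    return sum(
        w * math.dist(p, c)
        for w, p, c in zip(weights, prev_keypoints, curr_keypoints)
    )


# Two keypoints: the first moves by a 3-4-5 triangle, the second stays put.
previous = [(0.0, 0.0), (10.0, 10.0)]
current = [(3.0, 4.0), (10.0, 10.0)]

print(motion_score(previous, current))              # unweighted
print(motion_score(previous, current, [0.5, 2.0]))  # placeholder weights
```

Weights give a way to emphasize body parts that matter more for the application, for instance the head and torso over the ankles for a seated person.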

Score: the score should be lower if there is no significant movement. A lower score implies better meditation (judged on body stillness only; many other factors contribute to a good meditation). Note that if we do not smooth the pose keypoints first, jitter will contribute to the score, leading to inaccurate results. The figures below plot the motion score on the y-axis against time on the x-axis. The accompanying code can be found here. First, let us see how the score behaves without smoothing.

Fig: Stillness Scorer without using the low pass filter. The plot is a motion score in the y-axis vs time in the x-axis. GIF by Author.

Clearly, the graph looks noisy because jitter also contributes to the score. Let’s see how the score behaves with the LPF.

Fig: Stillness Scorer using the low pass filter. The plot is a motion score in the y-axis vs time in the x-axis. GIF by Author.

This time, the graph looks smooth and clean. As we can infer, any motion contributes to the area under the curve. Hence, smoothing the keypoints becomes critical in such applications.
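
One possible way to turn the curve into a single session-level number is to integrate the motion score over time with the trapezoidal rule. This aggregation is an assumption for illustration, not something the accompanying code is known to do, and the score values below are made up.

```python
def area_under_curve(timestamps, scores):
    """Trapezoidal area under a motion-score curve.

    timestamps: increasing times in seconds; scores: score at each time.
    """
    if len(timestamps) != len(scores):
        raise ValueError("timestamps and scores must have the same length")
    area = 0.0
    for i in range(1, len(timestamps)):
        dt = timestamps[i] - timestamps[i - 1]
        area += 0.5 * (scores[i] + scores[i - 1]) * dt
    return area


FPS = 30
times = [n / FPS for n in range(5)]

# Made-up per-frame scores for a person who is actually sitting still.
jittery_scores = [0.4, 0.6, 0.5, 0.7, 0.5]  # unsmoothed keypoints
smooth_scores = [0.0, 0.0, 0.0, 0.0, 0.0]   # after low-pass filtering

print("area without LPF:", area_under_curve(times, jittery_scores))
print("area with LPF:   ", area_under_curve(times, smooth_scores))
```

With unsmoothed keypoints the area keeps growing even though nothing moves, which is exactly the error that filtering removes.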

Final Results

I also integrated a low-pass filter on Android and ran it on top of a custom pose estimation model. We got the following results:

Fig: Using the LPF on top of a custom model has a smoothing effect. GIF by Author.

References

[1]. Ailing Zeng, Lei Yang, Xuan Ju, Jiefeng Li, Jianyi Wang, Qiang Xu — SmoothNet: A Plug-and-Play Network for Refining Human Poses in Videos.

[2]. MoveNet: Ultrafast and accurate pose detection model.

[3]. Jaan Tollander de Balsch — Noise Filtering Using 1€ Filter.

[4]. The Python scripts used in the blog can be found here.

I hope you enjoyed seeing how low-pass filters can make pose estimation more useful in practice, and that the example was convincing enough to show that removing jitter is one of the most critical optimizations when building applications on top of pose estimation.

I would love to hear feedback from anyone reading this article, and I would be happy to answer questions on any of the concepts mentioned above. You can reach me via LinkedIn.

Thank you!

Special Thanks to @Apurva Gupta for reviewing the blog.
