
Optimizing Pose Estimation On The Coral Edge TPU

How to access low level features in PoseNet models and unlock the full potential of pose estimation on an Edge TPU device


In 2018 Google announced the TensorFlow release of PoseNet, a machine learning model capable of detecting persons in images and estimating the location of their body parts, a technique known as pose estimation.

Poses estimated by the PoseNet algorithm running on the Coral Edge TPU accelerator. On the right, an overlay of the keypoint heatmaps that are not normally accessible in the Edge TPU model. Original photo by Jaisingh rathore /CC BY 2.5.

The PoseNet detection framework is available for JavaScript (TensorFlow.js), Android/iOS mobile devices (TensorFlow Lite), and Edge TPU accelerator (Google Coral).

Lately, I have been mostly interested in the Edge TPU version of the framework, since I have been working on several projects that involve people detection and tracking "in the wild", and the USB version of the TPU makes it possible to perform real-time pose estimation on small embedded devices such as the Raspberry Pi or the STM32MP1 platform.

Limitations of PoseNet for Edge TPU

The Coral engineers have done a great job of packaging the code and models in a way that makes it easy for developers to get up and running with the PoseNet framework. To simplify usage, however, some of the model parameters have been hardcoded. For example, the number of people/poses that can be detected in an image is limited to 10. Other parameters of the decoding algorithm have also been hardcoded, and while I found the default values to work well in many situations, I think there is an opportunity to improve the pose estimation by fine-tuning some of the hidden parameters. Please read on if you are interested in the details.

Google Coral Edge TPU products, source https://coral.ai

Background: PoseNet Architecture

The PoseNet implementation is based on a two-stage architecture that includes **a convolutional neural network (CNN) and a decoding algorithm**.

The convolutional network is trained to generate heatmaps that predict the position of all the keypoints (i.e. body parts) in an image. It also generates short-range and mid-range offset vectors that help in "connecting the dots" when multiple persons are present in the same image.

The decoding algorithm takes the heatmaps and offset vectors produced by the CNN and creates the association between body parts and person instances, trying to make sure that all the keypoints from the same person are associated to the same instance. You can read all the details in two publications [1, 2] from the Google research team that developed the technology.

Relation between heatmaps, offset and displacement vector in PoseNet – source [2]

Parameters of the decoding algorithm – Ensuring a correct association between body parts and person instances is a rather challenging task in crowded scenes where many persons might appear in close contact with each other. To ensure maximum accuracy under a variety of conditions, the authors of PoseNet made provision for a handful of parameters that control how the decoding algorithm works.

Some of these parameters, which affect both speed and accuracy in computing **the final output**, are the following:

  • Maximum pose detections – The maximum number of poses to detect.
  • Pose confidence score threshold – 0.0 to 1.0. At a high level, this controls the minimum confidence score of poses that are returned.
  • Non-maximum suppression (NMS) radius – A number in pixels. At a high level, this controls the minimum distance between poses that are returned.

Please see the Medium post from the TensorFlow group for a reader-friendly description of these parameters. Additional insights can be gained by experimenting with the parameter values in the real-time multi-pose online demo.
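
As a reference for later, the same three knobs appear (under slightly different names) among the keyword arguments of the unofficial Python decoder we will use in the last section of this post. A minimal sketch of a configuration dictionary, with defaults taken from its decode_multiple_poses signature (the mapping of the "pose confidence score threshold" onto the score-related arguments is approximate and worth double-checking against decode_multi.py):

# Decoding parameters as they appear in the Python decoder used later in this post
DECODING_PARAMS = {
    "max_pose_detections": 10,  # maximum number of poses to detect
    "score_threshold": 0.5,     # minimum confidence score considered
    "nms_radius": 20,           # minimum distance between returned poses (see NMS bullet above)
}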

PoseNet Implementation on the Edge TPU

In both the JavaScript and mobile implementations (Android and iOS), the decoding algorithm is contained in the source repository and the parameters can be changed at runtime. For the Edge TPU version, however, the code maintainers opted for a different approach and decided to embed the decoding algorithm as a custom operator directly inside the TensorFlow Lite model.

[…] Note that unlike in the TensorflowJS version we have created a custom OP in Tensorflow Lite and appended it to the network graph itself. […] The advantage is that we don’t have to deal with the heatmaps directly and when we then call this network through the Coral Python API we simply get a series of keypoints from the network. (source: https://github.com/google-coral/project-posenet)

While on the one hand this approach makes it easier to get up and running with PoseNet on the Edge TPU, on the other hand it prevents us from tweaking the decoding parameters to achieve the best trade-off between accuracy and speed. For example, in the Edge TPU implementation the maximum number of poses is fixed at 10, so if we were trying to process images with more than ten people in them, we would not get the expected results. Also, if we were processing single-pose scenes, we would probably waste CPU cycles decoding poses that are not in the image.

Accessing Heatmaps and Offset Vectors in the Edge TPU Model

In the following sections we will explore some steps that can be taken to (re)gain access to the output of the convolutional network and see how it can be decoded "by hand" while retaining the possibility of tweaking the algorithm parameters. We will first take a look at how to inspect the PoseNet models available with the Edge TPU implementation and how to modify them to access the low-level features produced by the convolutional layer.

Step 1: Inspecting the original TFLite model (optional)

The Coral TPU code repository for PoseNet contains three models based on the MobileNet architecture and optimized for three different image resolutions:

1) posenet_mobilenet_v1_075_353_481_quant_decoder_edgetpu.tflite
2) posenet_mobilenet_v1_075_481_641_quant_decoder_edgetpu.tflite
3) posenet_mobilenet_v1_075_721_1281_quant_decoder_edgetpu.tflite

The models are saved in TFLite format and their content can be inspected using the visualize.py tool contained in the TensorFlow code repository.

We will take the smallest model (353 x 481) and convert it to an HTML file for inspection. Note that since the name of the model is quite long, it's easier to create a symlink to it.

# Create symlink to original model (353x481)
MODEL=/path/to/posenet_mobilenet_v1_075_353_481_quant_decoder_edgetpu.tflite
ln -s ${MODEL} /tmp/input_model.tflite
# Convert file from TFLITE format to HTML
cd ~/tensorflow/tensorflow/lite/tools
python visualize.py /tmp/input_model.tflite /tmp/input_model.html

The HTML file contains all the information about the input/output tensors and operators present in the model:

HTML representation of the TFLite model posenet_mobilenet_v1_075_353_481_quant_decoder_edgetpu.tflite generated with the TensorFlow utility visualize.py

From the HTML file we can see that there are two operators, Ops #0 and Ops #1. The first one, Ops #0, takes as input the original image, stored in RGB format in Tensor 3 with dimensions [1, 353, 481, 3]. The second operator, Ops #1, produces the output tensors with the results of the pose estimation process:

- Tensor 4, FLOAT32 [1, 10, 17, 2]...: Keypoint coordinates (y, x)
- Tensor 5, FLOAT32 [1, 10, 17]......: Keypoint scores
- Tensor 6, FLOAT32 [1, 10]..........: Poses scores
- Tensor 7, FLOAT32 []...............: Number of poses

The second dimension being equal to 10 in Tensors 4, 5, and 6 reflects the fact that, as mentioned in the previous section, the maximum number of poses is hard-coded to ten. The third dimension in Tensors 4 and 5 matches the 17 keypoints currently detected by PoseNet:

Seventeen pose keypoints detected by PoseNet – Image source TensorFlow

If we want to decode more than 10 poses or change some of the other input parameters, we will need to work with the outputs of the first operator, Ops #0, which are stored in the following tensors:

- Tensor 0, UINT8 [1, 23, 31, 17]....: Keypoint heatmap
- Tensor 1, UINT8 [1, 23, 31, 34]....: Keypoint offsets
- Tensor 2, UINT8 [1, 23, 31, 64]....: Forward and backward displacement vectors (mid-range offsets)

A detailed description of heatmaps, keypoint offsets, and displacement vectors is available in [2].

Note that the size of the heatmaps is [23, 31], since this particular model uses an OUTPUT_STRIDE = 16 and the relation between image size and heatmap size is given by the following equations:

heatmap_height = 1 + (img_height - 1) / OUTPUT_STRIDE
heatmap_width = 1 + (img_width - 1) / OUTPUT_STRIDE
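
For the 353 x 481 model these equations give exactly the [23, 31] grid reported above:

# Quick sanity check for the 353x481 model with OUTPUT_STRIDE = 16
OUTPUT_STRIDE = 16
img_height, img_width = 353, 481
heatmap_height = 1 + (img_height - 1) // OUTPUT_STRIDE  # 1 + 352/16 = 23
heatmap_width = 1 + (img_width - 1) // OUTPUT_STRIDE    # 1 + 480/16 = 31
print(heatmap_height, heatmap_width)                    # -> 23 31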

Step 2: Getting access to the heatmaps, offsets, and displacement vectors

The TFLite models are serialized to disk using the Google Flatbuffer protocol file format. In most cases, the toolchain needed to work with this format has to be installed from source (see instructions in this StackOverflow post for an example). After installation, you will have access to the flatc schema compiler that enables conversion from TFLite format to JSON and vice-versa. The schema file necessary for the conversion is available in the TensorFlow repo (schema.fbs) and should be utilized in a call with the following syntax:

# Converts from TFLITE to JSON
SCHEMA=~/tensorflow/tensorflow/lite/schema/schema.fbs
flatc -t --strict-json --defaults-json -o /tmp ${SCHEMA} -- /tmp/input_model.tflite

The JSON output contains an outline of the same data we already saw in the HTML representation.

The JSON file is pretty large (~25 MB) and not easy to edit "by hand", but we can manipulate it with some simple Python code, as shown below. The idea is to remove the custom decoding operator (Ops #1), drop any unused tensors, and make sure the model outputs are given by tensors [0, 1, 2] (heatmaps, offsets, and displacement vectors).
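
The exact script is a matter of preference; a minimal sketch of the idea, assuming the operator and tensor layout we saw in Step 1 (Ops #1 is the custom decoder, and its output tensors 4-7 sit at the end of the tensor list), could look like this:

import json

# Load the JSON produced by flatc from the original Edge TPU model
with open("/tmp/input_model.json") as f:
    model = json.load(f)

subgraph = model["subgraphs"][0]

# Keep only Ops #0 (the convolutional network compiled for the Edge TPU);
# Ops #1 is the custom decoding operator we want to remove.
subgraph["operators"] = subgraph["operators"][:1]

# Expose heatmaps (0), offsets (1), and displacement vectors (2) as model outputs
subgraph["outputs"] = [0, 1, 2]

# Drop the now-unused decoder output tensors (4-7). They sit at the end of the
# tensor list, so the indices of tensors 0-3 are unaffected.
subgraph["tensors"] = subgraph["tensors"][:4]

# Write the modified description back, ready to be converted with flatc
with open("/tmp/output_model.json", "w") as f:
    json.dump(model, f, indent=2)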

We finally convert the output from JSON format back to TFLite using the same flatc utility we used for the first step of the conversion:

# Converts from JSON back to TFLITE
SCHEMA=~/tensorflow/tensorflow/lite/schema/schema.fbs
flatc -c -b -o /tmp ${SCHEMA} /tmp/output_model.json

The result of the conversion should now be available in the flatbuffer file output_model.tflite (you probably want to give it a more specific name). We can verify that, when we load this model with TensorFlow Lite, it produces three output tensors with the expected sizes.
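
A quick check, assuming the file name used above, is to load the model with the TensorFlow Lite interpreter and print its output details (the Edge TPU delegate is attached so that the custom operator can be resolved):

import numpy as np
import tflite_runtime.interpreter as tflite

# Load the modified model; the Edge TPU delegate is needed so the custom
# edgetpu-custom-op in Ops #0 can be resolved (use libedgetpu.1.dylib on macOS).
interpreter = tflite.Interpreter(
    model_path="/tmp/output_model.tflite",
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")],
)
interpreter.allocate_tensors()

# Print the shape and element count of each output tensor
for i, detail in enumerate(interpreter.get_output_details()):
    shape = list(detail["shape"])
    print(f"Output #{i}, {detail['dtype'].__name__} {shape}: "
          f"{int(np.prod(shape))} elements")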

The size of the output tensors matches the expected dimensions:

Output #0, UINT8 [1, 23, 31, 17]....: 12121 elements (heatmap)
Output #1, UINT8 [1, 23, 31, 34]....: 24242 elements (offsets)
Output #2, UINT8 [1, 23, 31, 64]....: 45632 elements (disp. vectors)

So far, so good… 🙂

Decoding the Heatmaps and Offset Vectors

We are now ready for the final step of the process, which consists of taking the new outputs of the model (heatmaps, offsets, and displacement vectors) and extracting the poses.

As mentioned in the introduction, both the JavaScript and the Android/iOS versions of PoseNet contain an implementation of the decoding algorithm. Google didn't release an official Python implementation of the same algorithm, but luckily we can rely on an unofficial port available in the brilliant PoseNet-Python project on GitHub.

The code, which was ported from the JavaScript version of PoseNet, is located in the file [decode_multi.py](https://github.com/rwightman/posenet-python/blob/master/posenet/decode_multi.py), which contains a function with the following signature:

def decode_multiple_poses(
        scores, offsets, displacements_fwd, displacements_bwd,
        output_stride, max_pose_detections=10, score_threshold=0.5,
        nms_radius=20, min_pose_score=0.5):

In the function above, the first parameter scores represents the keypoint heatmaps, while the offsets and displacement vectors (forward and backward) match the tensors we previously discussed.

The only tricky part in making this function work is to properly pre-process the heatmaps and reshape the displacement vectors to match the format expected by the decode_multiple_poses function. The snippet of code below sketches a function, extract_outputs, that takes care of converting the output of the modified TFLite model into a format that can be passed directly to the decoding function.
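
This is only a sketch: it assumes the output ordering from Step 2 (heatmaps, offsets, displacement vectors) and an interpreter that has already run inference on a frame, and the sigmoid step and the split of the 64-channel displacement tensor should be double-checked against the conventions in decode_multi.py.

import numpy as np

def _dequantize(interpreter, output_detail):
    # Undo the UINT8 quantization: real_value = scale * (quantized - zero_point)
    scale, zero_point = output_detail["quantization"]
    raw = interpreter.get_tensor(output_detail["index"])
    return scale * (raw.astype(np.float32) - zero_point)

def extract_outputs(interpreter):
    # Order of the outputs in the modified model: heatmaps, offsets, displacements
    heatmaps_det, offsets_det, displacements_det = interpreter.get_output_details()

    # Dequantize and drop the batch dimension: (1, 23, 31, C) -> (23, 31, C)
    heatmaps = np.squeeze(_dequantize(interpreter, heatmaps_det), axis=0)
    offsets = np.squeeze(_dequantize(interpreter, offsets_det), axis=0)
    displacements = np.squeeze(_dequantize(interpreter, displacements_det), axis=0)

    # Turn the heatmap logits into keypoint scores in [0, 1]
    scores = 1.0 / (1.0 + np.exp(-heatmaps))

    # The Edge TPU model packs forward and backward displacement vectors into a
    # single 64-channel tensor; the decoder expects them as two separate arrays.
    displacements_fwd = displacements[:, :, :32]
    displacements_bwd = displacements[:, :, 32:]

    return scores, offsets, displacements_fwd, displacements_bwd

With this in place, the poses can be obtained with decode_multiple_poses(*extract_outputs(interpreter), output_stride=16, ...), this time with full control over max_pose_detections, score_threshold, nms_radius, and min_pose_score.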

Conclusions

This post shows how to inspect an Edge TPU TFLite model and change it to gain access to the output of the convolutional layers in a PoseNet model. It also shows how to take that output and extract the poses with full access to the parameters of the decoding algorithm. I am planning a follow-up post that explores the effects of changing these parameters in real-world computer vision applications based on pose estimation. If you have ideas or suggestions, please leave a comment or contact me at Stura.io.

Output of pose detection with overlaid keypoint heatmaps (right half)

References

[1] G. Papandreou et al., Towards Accurate Multi-person Pose Estimation in the Wild (2017), Proceedings of CVPR.

[2] G. Papandreou et al., PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model (2018).

