Tutorial
In 2018 Google announced the TensorFlow release of PoseNet, a machine learning model capable of detecting persons in images and estimating the location of their body parts, a technique known as pose estimation.

The PoseNet detection framework is available for JavaScript (TensorFlow.js), Android/iOS mobile devices (TensorFlow Lite), and Edge TPU accelerator (Google Coral).
Lately, I have been mostly interested in the Edge TPU version of the framework, since I have been working on several projects that involve people detection and tracking "in the wild". The USB version of the TPU makes it possible to perform real-time pose estimation on small embedded devices such as the Raspberry Pi or the STM32MP1 platform.
Limitations of PoseNet for Edge TPU
The Coral engineers have done a great job at packaging the code and models in a way that makes it easy for developers to get up and running with the PoseNet framework. To simplify usage, however, some of the model parameters have been hardcoded. For example, the number of people/poses that can be detected in an image has been limited to 10. Other parameters of the decoding algorithm have also been hardcoded, and while I found the default values to work well in many situations, I think there is an opportunity to improve the pose estimation by fine-tuning some of these hidden parameters. Please read on if you are interested in the details.

Background: PoseNet Architecture
The PoseNet implementation is based on a two-stage architecture that includes **a convolutional neural network (CNN) and a decoding algorithm**.
The convolutional network is trained to generate heatmaps that predict the position of all the keypoints (i.e. body parts) in an image. It also generates short-range and mid-range offset vectors that help in "connecting the dots" when multiple persons are present in the same image.
The decoding algorithm takes the heatmaps and offset vectors produced by the CNN and creates the association between body parts and person instances, trying to make sure that all the keypoints from the same person are associated with the same instance. You can read all the details in two publications [1, 2] from the Google research team that developed the technology.
![Relation between heatmaps, offset and displacement vector in PoseNet - source [2]](https://towardsdatascience.com/wp-content/uploads/2020/07/1Q9FAfaY_MSmxKArKrTN41w.png)
Parameters of the decoding algorithm – Ensuring a correct association between body parts and person instances is a rather challenging task in crowded scenes where many persons might appear in close contact with each other. To ensure maximum accuracy under a variety of conditions, the authors of PoseNet made provision for a handful of parameters that control how the decoding algorithm works.
Some of these parameters, which affect both speed and accuracy in computing **the final output**, are the following:
- Maximum pose detections – The maximum number of poses to detect.
- Pose confidence score threshold – 0.0 to 1.0. At a high level, this controls the minimum confidence score of poses that are returned.
- Non-maximum suppression (NMS) radius – A number in pixels. At a high level, this controls the minimum distance between poses that are returned.
Please see the Medium post from the TensorFlow group for a reader-friendly description of the parameters. Additional insights can be gained by experimenting with the parameter values in the real-time multi-pose online demo.
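For concreteness, the sketch below simply collects these knobs in one place, with the default values used later in this post by the Python decoder (the dictionary name decoder_params is mine, for illustration only):
# Decoding parameters and their defaults in the Python decoder used later in this post
decoder_params = dict(
    max_pose_detections=10,  # maximum number of poses to detect
    score_threshold=0.5,     # minimum confidence score for returned keypoints/poses
    nms_radius=20,           # minimum distance, in pixels, between returned poses
)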
PoseNet Implementation on the Edge TPU
In both the JavaScript and mobile implementations (Android and iOS) the decoding algorithm is contained in the source repository and the parameters can be changed at runtime. For the EdgeTPU version, however, the code maintainers opted for a different approach and decided to embed the decoding algorithm as a custom operator directly inside the TensorFlow Lite model.
[…] Note that unlike in the TensorflowJS version we have created a custom OP in Tensorflow Lite and appended it to the network graph itself. […] The advantage is that we don’t have to deal with the heatmaps directly and when we then call this network through the Coral Python API we simply get a series of keypoints from the network. (source: https://github.com/google-coral/project-posenet)
While on one hand this approach makes it easier to get up and running with PoseNet on the Edge TPU, on the other hand it prevents us from tweaking the decoding parameters to achieve the best trade-off between accuracy and speed. For example, in the Edge TPU implementation the maximum number of poses is fixed to 10, so if we were trying to process images with more than ten people in them, we would not be able to get the expected results. Also, if we were going to process single-pose scenes, we would probably waste some CPU cycles decoding poses that are not going to be in the image.
Accessing Heatmaps and Offset Vectors in the Edge TPU Model
In the following sections we will explore some steps that can be taken to (re)gain access to the output of the convolutional network and see how it can be decoded "by hand" while retaining the possibility of tweaking the algorithm parameters. We will first take a look at how to inspect the PoseNet models available with the Edge TPU implementation and how to modify them to access the low-level features produced by the convolutional layer.
Step 1: Inspecting the original TFLite model (optional)
The Coral TPU code repository for PoseNet contains three models based on the MobileNet architecture and optimized for three different image resolutions:
1) posenet_mobilenet_v1_075_353_481_quant_decoder_edgetpu.tflite
2) posenet_mobilenet_v1_075_481_641_quant_decoder_edgetpu.tflite
3) posenet_mobilenet_v1_075_721_1281_quant_decoder_edgetpu.tflite
The models are saved in TFLite format and their content can be inspected using the visualize.py tool contained in the TensorFlow code repository.
We will take the smallest model (353 x 481) and convert it to an HTML file for inspection. Note that since the name of the model is quite long, it’s easier to create a symlink to it.
# Create symlink to original model (353x481)
MODEL=/path/to/posenet_mobilenet_v1_075_353_481_quant_decoder_edgetpu.tflite
ln -s ${MODEL} /tmp/input_model.tflite
# Convert file from TFLITE format to HTML
cd ~/tensorflow/tensorflow/lite/tools
python visualize.py /tmp/input_model.tflite /tmp/input_model.html
The HTML file contains all the information about the input/output tensors and operators present in the model:

From the HTML file we can see there are two operators, Ops #0 and Ops #1. The first one, Ops #0, takes as input the original image, stored in RGB format inside Tensor 3 with dimensions [1, 353, 481, 3]. The second operator, Ops #1, produces the output tensors with the results of the pose estimation process:
- Tensor 4, FLOAT32 [1, 10, 17, 2]...: Keypoint coordinates (y, x)
- Tensor 5, FLOAT32 [1, 10, 17]......: Keypoint scores
- Tensor 6, FLOAT32 [1, 10]..........: Poses scores
- Tensor 7, FLOAT32 []...............: Number of poses
The second dimension of Tensors 4, 5, and 6 is equal to 10 because, as mentioned in the previous section, the maximum number of poses parameter is hard-coded to ten. The third dimension of Tensors 4 and 5 matches the 17 keypoints currently detected by PoseNet:

If we want to decode more than 10 poses or change some of the other input parameters, we will need to work with the outputs of the first operator, Ops #0, which are stored in the following tensors:
- Tensor 0, UINT8 [1, 23, 31, 17]....: Keypoint heatmap
- Tensor 1, UINT8 [1, 23, 31, 34]....: Keypoint offsets
- Tensor 2, UINT8 [1, 23, 31, 64]....: Forward and backward displacement vectors (mid-range offsets)
A detailed description of heatmaps, keypoint offsets, and displacement vectors is available in [2].
Note that the size of the heatmaps is [23, 31], since this particular model uses an OUTPUT_STRIDE = 16 and the relation between image size and heatmap size is given by the following equations:
heatmap_height = 1 + (img_height - 1) / OUTPUT_STRIDE
heatmap_width = 1 + (img_width - 1) / OUTPUT_STRIDE
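As a quick sanity check, plugging in the dimensions of the model used in this post (and using integer division, since the grid size must be a whole number) gives exactly the [23, 31] grid reported above:
# Heatmap grid size for the 353 x 481 model (OUTPUT_STRIDE = 16)
OUTPUT_STRIDE = 16
heatmap_height = 1 + (353 - 1) // OUTPUT_STRIDE  # 23
heatmap_width = 1 + (481 - 1) // OUTPUT_STRIDE   # 31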
Step 2: Getting access to the heatmaps, offsets, and displacement vectors
The TFLite models are serialized to disk using the Google FlatBuffers file format. In most cases, the toolchain needed to work with this format has to be installed from source (see the instructions in this StackOverflow post for an example). After installation, you will have access to the flatc schema compiler, which enables conversion from the TFLite format to JSON and vice versa. The schema file necessary for the conversion is available in the TensorFlow repo (schema.fbs) and should be used in a call with the following syntax:
# Converts from TFLITE to JSON
SCHEMA=~/tensorflow/tensorflow/lite/schema/schema.fbs
flatc -t --strict-json --defaults-json -o /tmp ${SCHEMA} -- /tmp/input_model.tflite
The image below shows an outline of the data contained in the JSON output, which matches what we already saw in the HTML representation:

The JSON file is pretty large (~25MB) and not easy to edit "by hand", but we can manipulate it with some simple Python code, as shown below. The idea is to remove the custom decoding operator (Ops #1) and any unused tensors, and to make sure the model output is given by tensors [0, 1, 2] (heatmaps, offsets, and displacement vectors).
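The original snippet is not reproduced here, but a minimal sketch of the manipulation could look like the code below. It assumes the key names produced by flatc for the TFLite schema (subgraphs, operators, outputs), and it simply truncates the operator list rather than re-indexing the tensor table, which is the least invasive way to drop the decoder:
import json

# Load the JSON produced by flatc from the original model
with open("/tmp/input_model.json") as f:
    model = json.load(f)

graph = model["subgraphs"][0]

# Keep only the first operator (the CNN), dropping the custom decoder (Ops #1)
graph["operators"] = graph["operators"][:1]

# Expose heatmaps, offsets and displacement vectors as the model outputs
graph["outputs"] = [0, 1, 2]

# Tensors 4-7 (the old decoder outputs) are now unused; removing them would
# require re-indexing every tensor reference, so they are left in place here.

# Save the modified model description for the JSON-to-TFLite conversion below
with open("/tmp/output_model.json", "w") as f:
    json.dump(model, f)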
We finally convert the output from JSON format back to TFLite using the same flatc utility we used for the first step of the conversion:
# Converts from JSON back to TFLITE
SCHEMA=~/tensorflow/tensorflow/lite/schema/schema.fbs
flatc -c -b -o /tmp ${SCHEMA} /tmp/output_model.json
The result of the conversion should now be available in the output_model.tflite flatbuffer file (you probably want to give it a more specific name). We can verify that when we load this model in TensorFlow Lite, it produces three output tensors with the expected sizes.
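A minimal verification sketch is shown below. It assumes the tflite_runtime package and the standard name of the Edge TPU runtime library on Linux (libedgetpu.so.1); the remaining operator is the Edge TPU custom op, so the delegate is still needed when allocating the tensors:
import tflite_runtime.interpreter as tflite

# Load the modified model with the Edge TPU delegate
interpreter = tflite.Interpreter(
    model_path="/tmp/output_model.tflite",
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")])
interpreter.allocate_tensors()

# Print index, type and shape of each output tensor
for detail in interpreter.get_output_details():
    print(detail["index"], detail["dtype"], detail["shape"])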
The size of the output tensors matches the expected dimensions:
Output #0, UINT8 [1, 23, 31, 17]....: 12121 elements (heatmap)
Output #1, UINT8 [1, 23, 31, 34]....: 24242 elements (offsets)
Output #2, UINT8 [1, 23, 31, 64]....: 45632 elements (disp. vectors)
So far, so good… 🙂
Decoding the Heatmaps and Offset Vectors
We are now ready for the final step of the process, which consists in taking the new output from the model (heatmaps, offset and displacement vectors) and extracting the poses.
As mentioned in the introduction, both the JavaScript and the Android/iOS versions of PoseNet contain an implementation of the decoding algorithm. Google didn’t release an official Python implementation of the same algorithm, but luckily we can rely on an unofficial port available in the brilliant PoseNet-Python project on GitHub.
The code, which was ported from the JavaScript version of PoseNet, is located in the file [decode_multi.py](https://github.com/rwightman/posenet-python/blob/master/posenet/decode_multi.py), which contains a function with the following signature:
def decode_multiple_poses(
        scores, offsets, displacements_fwd, displacements_bwd,
        output_stride, max_pose_detections=10, score_threshold=0.5,
        nms_radius=20, min_pose_score=0.5):
In the function above, the first parameter scores represents the keypoint heatmaps, while the offsets and displacement vectors (forward and backward) match the tensors we previously discussed.
The only tricky part in making this function work is to properly pre-process the heatmaps and reshape the displacement vectors to match the format expected by the decode_multiple_poses function. The snippet of code below contains the function extract_outputs, which takes care of converting the output of the modified TFLite model into a format that can be passed directly to the decoding function.
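The original gist is not reproduced here; the sketch below is my own approximation of what extract_outputs needs to do. It assumes that the interpreter returns the outputs in tensor order (heatmaps, offsets, displacements) and that the forward displacement vectors occupy the first 32 channels of the third tensor, with the backward vectors in the last 32 (worth double-checking against [2]):
import numpy as np

def extract_outputs(interpreter):
    # Dequantize a UINT8 output tensor using its quantization parameters
    def dequantize(detail):
        raw = interpreter.get_tensor(detail["index"])
        scale, zero_point = detail["quantization"]
        return scale * (raw.astype(np.float32) - zero_point)

    heatmaps, offsets, displacements = [
        dequantize(d) for d in interpreter.get_output_details()]

    # The heatmaps are logits: apply a sigmoid to obtain keypoint scores
    scores = 1.0 / (1.0 + np.exp(-heatmaps))

    # Drop the batch dimension: (1, H, W, C) -> (H, W, C)
    scores = np.squeeze(scores, axis=0)
    offsets = np.squeeze(offsets, axis=0)
    displacements = np.squeeze(displacements, axis=0)

    # Split the 64 displacement channels into forward and backward halves
    displacements_fwd = displacements[:, :, :32]
    displacements_bwd = displacements[:, :, 32:]

    return scores, offsets, displacements_fwd, displacements_bwd
With the outputs in this shape, the poses can be extracted by calling the decoder directly, now with full control over its parameters (the parameter values below are arbitrary illustrations):
scores, offsets, disp_fwd, disp_bwd = extract_outputs(interpreter)
pose_scores, keypoint_scores, keypoint_coords = decode_multiple_poses(
    scores, offsets, disp_fwd, disp_bwd,
    output_stride=16, max_pose_detections=20, min_pose_score=0.25)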
Conclusions
This post shows how to inspect an Edge TPU TFLite model and change it to gain access to the output of the convolutional network in a PoseNet model. It also shows how to take that output and extract the poses with full access to the parameters of the decoding algorithm. I am planning to create a follow-up post that explores the effects of changing these parameters in real-world computer vision applications based on pose estimation. If you have ideas or suggestions, please leave a comment or contact me at Stura.io.

References
[1] G. Papandreou et al., Towards Accurate Multi-person Pose Estimation in the Wild (2017), Proceedings of CVPR.
[2] G. Papandreou et al., PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model (2018), Proceedings of ECCV.