
The Machine Learning Web – Pose and actions estimator

Real-time pose estimation web application

I think we can all agree that 2020 was an insane year. To keep my sanity intact, I decided to revive an old project on pose estimation using PoseNet that I worked on a while back with Omer Mintz.

While reviving this project, it became clear what I wanted to achieve: a pose and action estimation web application that relies on machine learning capabilities for "learning" new actions without compromising on performance.

The results? Well, you can see for yourself.

The code is also shared on this Git Repository.

TL;DR

We took the data output of the pre-trained PoseNet model and applied some data engineering. With the help of some exploratory data analysis, we found that a KNN algorithm can classify the results very well. The end result – a system that estimates which exercise a participant is doing.


The Aim

  • A web application that estimates which pose a participant is in (stand, squat, push-up).
  • Count how many repetitions a participant has done.
  • High performance – minimal delay between rendering cycles, so application interactivity is not affected.

  • Easily extendable – can learn new actions with minimal code changes
  • Text to speech – bonus

The stack

  • Python with TensorFlow and NumPy – to apply EDA and train the model.
  • React – for rendering an interactive web application.
  • TensorFlow.js – to run the trained models and ML algorithms in the browser.
  • Canvas – image rendering and modification.
  • Web Workers – to keep heavy computation off the main thread, as illustrated below.
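As an illustration of the Web Workers item, heavy inference work can be pushed off the main thread roughly like this (a minimal sketch: the worker file name and the drawPose helper are assumptions for illustration, not the app's exact code):

// main thread (sketch): offload pose estimation to a worker
// 'pose-worker.js' and the message shape are assumptions
const worker = new Worker('pose-worker.js');

function onFrame(imageData: ImageData): void {
  worker.postMessage({ imageData }); // hand the frame off the main thread
}

worker.onmessage = (event: MessageEvent) => {
  // the worker replies with an estimated pose; render it on the main thread
  drawPose(event.data.pose); // drawPose is a hypothetical rendering helper
};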

For the sake of pose detection, I used the pre-trained PoseNet model based on the ResNet50 architecture.

This pre-trained model allows us to capture the human body parts from an image, which, later on, will be used to estimate the actions.

PoseNet

PoseNet is a pre-trained model for pose estimation, a computer vision task. The PoseNet model detects human figures in images and videos and provides the ability to locate the different body parts of the human(s) found in a frame.

The PoseNet library handles the following:

  • Data pre-processing (crop & resize, scale the pixel values)
  • Applying the model to the given data using TensorFlow
  • Decoding the key points from the result
  • Calculating the confidence score for each part and for the entire pose

The Input

The PoseNet model takes a processed camera image as its input. For better performance, we will work with frames of 224 × 224 pixels; this allows us to handle and process less data.

As a reminder, the PoseNet library will apply another resizing of its own (as mentioned in the previous section). A minimal sketch of the downscaling step follows.
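Here is a minimal sketch of how a camera frame can be downscaled to 224 × 224 with a canvas before it is handed to PoseNet (the function name is my own assumption):

// a minimal sketch: downscale the current camera frame to 224 x 224 pixels
function frameToCanvas(video: HTMLVideoElement): HTMLCanvasElement {
  const canvas = document.createElement('canvas');
  canvas.width = 224;
  canvas.height = 224;
  const ctx = canvas.getContext('2d')!;
  // drawImage scales the full frame down to the 224 x 224 canvas
  ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
  return canvas; // the canvas can then be passed to PoseNet as input
}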

The Output

An object with:

  1. score – An overall confidence score of the pose
  2. keypoints – A list of 17 elements, each describing the result for a different keypoint (part): its x & y position, part name, and score
{
  score: float;
  keypoints: Array<{ // Array of the 17 keypoints identified
    position: {x: float, y: float};
    part: EBodyParts; // the keys of the enum
    score: float;
  }>
}
enum EBodyParts {
  nose,
  leftEye,
  rightEye,
  leftEar,
  rightEar,
  leftShoulder,
  rightShoulder,
  leftElbow,
  rightElbow,
  leftWrist,
  rightWrist,
  leftHip,
  rightHip,
  leftKnee,
  rightKnee,
  leftAnkle,
  rightAnkle
}
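To work with these keypoints comfortably, it helps to index them by part name. A small sketch (the type mirrors the structure above; the names are my own assumptions):

interface IKeypoint {
  position: { x: number; y: number };
  part: string;
  score: number;
}

// index the 17 keypoints by part name for easy lookup
function keypointsByPart(keypoints: IKeypoint[]): Record<string, IKeypoint> {
  const byPart: Record<string, IKeypoint> = {};
  for (const kp of keypoints) {
    byPart[kp.part] = kp;
  }
  return byPart;
}

// e.g. keypointsByPart(pose.keypoints)['leftShoulder'].position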

The Configuration

The configuration I used for PoseNet was:

architecture: 'ResNet50'
outputStride: 16
quantBytes: 4
inputResolution: {width: 224, height: 224}
  • Architecture – ResNet50 or MobileNet v1
  • Output stride – The output stride determines how much we scale down the output relative to the input image size. It affects the size of the layers and of the model outputs: the higher the output stride, the smaller the resolution of the layers in the network and of the outputs, and correspondingly the lower their accuracy. In this implementation, the output stride can have values of 8, 16, or 32. In other words, an output stride of 32 results in the fastest performance but the lowest accuracy, while 8 results in the highest accuracy but the slowest performance.

Resolution = ((InputImageSize - 1) / OutputStride) + 1

In my configuration: Resolution = ((224 - 1) / 16) + 1 = 14.9375
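Put together, loading the model with this configuration and running an estimation looks roughly like the following sketch (using the @tensorflow-models/posenet package; the video element is assumed to already be streaming the camera):

import * as posenet from '@tensorflow-models/posenet';

// a minimal sketch: load PoseNet with the configuration above and
// estimate a single pose from a video element
async function estimate(video: HTMLVideoElement) {
  const net = await posenet.load({
    architecture: 'ResNet50',
    outputStride: 16,
    quantBytes: 4,
    inputResolution: { width: 224, height: 224 },
  });
  const pose = await net.estimateSinglePose(video, { flipHorizontal: false });
  console.log(pose.score, pose.keypoints);
}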

ResNet50

PoseNet allows us to use one of two model architectures:

  1. MobileNet v1
  2. ResNet50

The official PoseNet documentation mentions that MobileNet v1 is smaller and faster, with lower accuracy, while the ResNet50 architecture is larger and slower but more accurate.

To better understand the difference between the two architectures, I highly recommend reviewing these two articles:

From Pose to Action

To transform those key points (X and Y coordinates) into an action, we will need to apply some more statistical power. For that, I decided to proceed with a classification algorithm; more precisely, the KNN algorithm.

The data

"Information is the oil of the 21st century, and analytics is the combustion engine." Said Peter Sondergaard, SVP Gartner, in 2011.

We are surrounded by data platforms. Data is just lying there, waiting for us to pick it up, clean it, and use it.

Of course, this task of "picking it up" and "cleaning it" is not that simple; engineers and data scientists strive for good data to train their models on. I like to compare it to a beach treasure hunt with a metal detector: there are many metal objects around, but only in rare cases will you find a real treasure.

My beach was YouTube. More specifically, personal training videos on YouTube, where you can train along with the trainers in the same poses. So many poses; now all that was needed was to break the videos down into frames and categorize them into the correct pose (stand, squat, push-up, push-down, etc.).

In order to break the videos down into frames, I used the following simple Python code:

import os
import cv2

def video_to_frames(video_path: str, destination: str):
    # make sure the destination directory exists
    if not os.path.exists(destination):
        os.makedirs(destination)
    vid = cv2.VideoCapture(video_path)
    success, image = vid.read()  # read the first frame
    count = 0
    while success:  # as long as frames remain - proceed
        # write the frame to the destination directory
        cv2.imwrite(os.path.join(destination, f'frame{count}.jpg'), image)
        success, image = vid.read()  # read the next frame
        count += 1
    vid.release()  # free the video handle

After extracting the frames, we can get our hands dirty with some categorization work. This effort mainly requires moving files into the directory of the correct pose – is it a "squat" or a "stand" position?

Now that the images are categorized, our training set is ready and we can proceed to the model training phase. But first, we need to think about how to handle the data.

We know we have a classification problem: we have a set of features that we want to map to a single class.

The options are:

  1. Deep learning classification: Using deep learning for classification is the trend now; we could set up training and test sets to identify the pose. Something like the YOLO model could help us identify whether an image shows a squat, stand, push-up, etc. The main problem is that this requires tons of images to train on and very high compute power, and it would probably leave us with low prediction quality (in both F1 and accuracy scores).

  2. A machine learning classification algorithm on top of the PoseNet output: We already have a very solid model for identifying the different body parts of a participant. In that case, we can take an image and convert it into a tabular record. The raw X and Y positions of body parts are not that helpful on their own, but they are something to begin with.

We will proceed with option number 2. Now we need to prepare our features for the classification algorithm. That means that instead of working with the X and Y positions of different body parts, we need angles. This required reviving basic trigonometry formulas from the back of my mind to:

  • Convert x & y points to lines
  • Calculate the vertex angles of those lines for:

      • Left armpit angle – using the left shoulder, left elbow, and left hip
      • Right armpit angle – using the right shoulder, right elbow, and right hip
      • Left shoulder angle – using the left shoulder, right shoulder, and left hip
      • Right shoulder angle – using the right shoulder, left shoulder, and right hip
      • Left elbow angle – using the left elbow, left shoulder, and left wrist
      • Right elbow angle – using the right elbow, right shoulder, and right wrist
      • Left hip angle – using the left hip, right hip, and left shoulder
      • Right hip angle – using the right hip, left hip, and right shoulder
      • Left groin angle – using the left hip, left knee, and left ankle
      • Right groin angle – using the right hip, right knee, and right ankle
      • Left knee angle – using the left knee, left ankle, and left hip
      • Right knee angle – using the right knee, right ankle, and right hip

  • Calculate the slope of the person’s pose in radians; this helps us identify whether the person is in a vertical or horizontal position. The min(slope1, slope2) is taken to identify the real state of the person – the slope of the entire body, not just part of it. A sketch of these calculations follows below.
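As a rough sketch of this feature extraction (the names and shapes here are my own assumptions, not the exact code from the repository), the vertex angle at a joint and the slope between two keypoints can be computed like this:

interface IPoint { x: number; y: number; }

// the angle (in degrees) at vertex b, formed by the lines b-a and b-c
function vertexAngle(a: IPoint, b: IPoint, c: IPoint): number {
  const raw = Math.abs(
    Math.atan2(c.y - b.y, c.x - b.x) - Math.atan2(a.y - b.y, a.x - b.x)
  );
  const deg = (raw * 180) / Math.PI;
  return deg > 180 ? 360 - deg : deg; // normalize to [0, 180]
}

// slope of the line between two keypoints, in radians
function slope(p1: IPoint, p2: IPoint): number {
  return Math.atan2(p2.y - p1.y, p2.x - p1.x);
}

// e.g. the left elbow angle - vertex at the left elbow:
// vertexAngle(leftShoulder, leftElbow, leftWrist)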

This allowed us to get this data set.

The training

After we have the dataset ready, we can do some analysis using PCA to get a visualization of the principal components. This helps us be more confident about the expected success of the classification procedure and identify which algorithm will fit best.

Here is the Google Colab project of the PCA; thanks to Ethel V for helping to set it up and fine-tune the features.

As we can see, the clusters are fairly easy to separate (except squats and standing, where there is some work to be done). I decided to go with KNN for the classification.

KNN – k-nearest neighbors, a supervised statistical algorithm used in machine learning that fits both classification and regression problems.

KNN Flow

  1. Load the data and initialize K, the number of neighbors.
  2. For each example in the data:
  • Calculate the distance between the current record from the dataset and the query example.
  • Add the distance and the index of the example to a collection.
  3. Sort the collection of distances and indices from smallest to largest by distance.
  4. Pick the first K entries from the sorted collection.
  5. Get the labels of the selected K entries.
  6. Result:
  • In the case of regression – return the mean of the K labels.
  • In the case of classification – return the mode of the K labels.
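A minimal TypeScript sketch of this flow (a from-scratch illustration of the algorithm above, not the exact implementation used in the app) could look like this:

interface ILabeledExample { features: number[]; label: string; }

// Euclidean distance between two feature vectors
function distance(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((sum, v, i) => sum + (v - b[i]) ** 2, 0));
}

// classify a query by the mode of the labels of its K nearest neighbors
function knnClassify(data: ILabeledExample[], query: number[], k: number): string {
  const neighbors = data
    .map((example, index) => ({ index, dist: distance(example.features, query) }))
    .sort((a, b) => a.dist - b.dist) // smallest distance first
    .slice(0, k);                    // keep the K nearest
  const votes = new Map<string, number>();
  for (const { index } of neighbors) {
    const label = data[index].label;
    votes.set(label, (votes.get(label) ?? 0) + 1);
  }
  // return the most common label among the K neighbors (the mode)
  return [...votes.entries()].sort((a, b) => b[1] - a[1])[0][0];
}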

KNN fits our use case thanks to:

  • Clear grouping of classes – In most cases, we can identify the group a record belongs to very easily.
  • Some outliers require more complex handling – KNN handles complex, especially non-linear, data structures better than other classification algorithms (such as SVM).
  • A small number of records in our dataset – a neural-network-based solution would require far more data to train well.
  • Computational complexity – KNN requires less computational power at training/evaluation time than a neural network. Also, since we have a small number of classes to pick from and only a few hundred records in the dataset, we should not suffer a major decrease in performance.

The use case

Using KNN, we classify the right action from the angles and the slope. Later on, in the web application, we use combinations of the different actions to identify which exercise the participant is performing. A small sketch of the idea follows.
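For example, counting squat repetitions can be done with a tiny state machine over the classified actions. The sketch below is a hypothetical illustration of the idea, not the app's exact logic:

// hypothetical sketch: count a repetition on each stand -> squat -> stand cycle
let lastAction = 'stand';
let repetitions = 0;

function onActionClassified(action: string): void {
  if (lastAction === 'squat' && action === 'stand') {
    repetitions += 1; // completed one full squat
  }
  if (action !== lastAction) {
    lastAction = action;
  }
}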

We will review this and more in the next chapter.

Hope you enjoyed this article, stay tuned😊
