Generating digital signatures with the gait of people

An innovative attempt to improve cybersecurity with machine learning

Prasad Pai
Towards Data Science


Photo by Matt Quinn, Unsplash

Back in early 2018, we had foreseen the usefulness of detecting landmarks at various points of the human hand’s palm and fingers, and explained how we can build on top of landmark detection to comprehend various hand signals with machine learning. Since then, we have witnessed multiple products leveraging this idea to understand gestures made with hand postures, mostly through static snapshots. In today’s article, we would like to take a step ahead and concentrate on the bigger problem of recognizing people through gait detection, making use of more landmarks on the human body across a short video. We believe such applications will take center stage with the continuous advancements in human landmark detection and faster processing.

Why is gait detection important?

When the Covid-19 pandemic hit the world, people had to wear masks covering their faces. All of a sudden, many facial recognition models had to make decisions with only the limited set of landmark points coming from the forehead and eyes. Overnight, the digital signature generated through facial recognition was rendered ineffective. We have also seen that, with the advent of 3D printers, criminals are generating faces to impersonate their targets. To make matters worse, today’s easily available deepfake models are good enough to convince an audience that the target victim actually said certain statements.

All these loopholes and shortcomings in today’s most sought-after technology necessitate the exploration of newer techniques. Generating digital signatures from the manner of one’s walking could be a promising start.

Human landmark detection model

While developing a gait detection model from scratch can be a very interesting and fun-filled exercise, it requires a huge amount of data, a lot of computing power, and time that many of us will not have access to. Hence, we will build on top of some good work that has already taken place. As machine learning application developers, it is very important to be aware of the developments taking place in the research field. The work done in recognizing various landmarks on the human body through deep learning models will form a fundamental component of our architecture.

TensorFlow Hub provides us with many pre-trained models that help in detecting the landmarks of the human body. As we will not be tweaking these models and are interested only in performing inference with them, we can even opt to improve our latency by switching to the TensorFlow Lite version, trading in a little bit of model accuracy. In our experiment, we will be choosing the TensorFlow Lite version of the Thunder variant of the MoveNet model from TensorFlow Hub. MoveNet gives us 17 keypoints, each with an X-Y coordinate and a confidence score.
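To make this concrete, here is a minimal sketch of running inference with the MoveNet Thunder TFLite model. The model file name is an assumption (download the file from TensorFlow Hub first), and the expected input size and dtype should be verified against the exact variant you pick:

```python
import tensorflow as tf

# Path is an assumption: download the Thunder TFLite file from TensorFlow Hub first.
interpreter = tf.lite.Interpreter(model_path="movenet_thunder.tflite")
interpreter.allocate_tensors()

def detect_keypoints(frame):
    """Run MoveNet on one RGB frame; returns 17 rows of (y, x, confidence)."""
    # Thunder expects a 256x256 input; the quantized variants take uint8.
    image = tf.image.resize_with_pad(tf.expand_dims(frame, axis=0), 256, 256)
    image = tf.cast(image, dtype=tf.uint8)
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    interpreter.set_tensor(input_details[0]["index"], image.numpy())
    interpreter.invoke()
    # Output shape is [1, 1, 17, 3]: 17 keypoints, each (y, x, score).
    return interpreter.get_tensor(output_details[0]["index"])[0, 0]
```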

Landmark keypoints (Image source: TensorFlow)

Type of data

Now that our human landmark detection model has been finalized, the next step is to focus on data. As we are targeting an application that creates a kind of digital signature of an individual’s walk, we are unlikely to have a lot of data per person. In practical scenarios, we may get at most 6–10 seconds of a video recording to work with. Our constraint is very similar to that of financial institutions, which can collect at most one or two specimen signatures from a customer when he/she opens a new account.

Going deeper into our problem setting, we will run our pose estimation model on every frame of the video to generate time-series data of landmark locations in the image, as shown below.

Left: Input video, Middle: Detection, Right: Keypoints (GIF by Author)
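A sketch of this step, reusing the hypothetical detect_keypoints helper above and OpenCV for frame grabbing:

```python
import cv2
import numpy as np

def video_to_series(video_path):
    """Run pose estimation on every frame; returns a (num_frames, 17, 3) array."""
    capture = cv2.VideoCapture(video_path)
    series = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV decodes frames as BGR
        series.append(detect_keypoints(rgb))
    capture.release()
    return np.stack(series)
```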

Now, a natural question arises: how are we going to profile the walking pattern of an individual from just 10 seconds of recording? This is the crucial gap we have to fill, as good data matters far more than the choice of machine learning or deep learning model. We will try to mimic as many plausible walking patterns as we can think of.

Data augmentation

Data augmentation is a technique of creating an artificial dataset to supplement the dearth of available data. The newly generated data simulates the missing patterns in the existing data, thereby easing the learning procedure of ML models during training. We had covered this topic in depth in a previous article specifically targeting data augmentation of images.

In the problem that we are trying to solve, besides the small amount of data, we also have to be conscious of the fact that the individual may walk differently from how he/she did in our reference video; it is illogical to expect that an individual will always maintain the same style. Despite changes in walking style, by and large, the overall dynamics in the movement and coordination of his/her body parts should remain fairly similar over a while (say, the past few years). The missing piece, walking differently in different situations and thereby exhibiting a different style, is what we will supplement through data augmentation.

a) Fast walking:
While the individual may not have been in a hurry during the shooting of the reference video, in real life he/she surely may be. When a person is walking fast, the point-to-point transition between frames will not be smooth and continuous. To simulate such a situation in the video, we can skip a few frames at regular intervals.

Fast walking (GIF by Author)
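One way to sketch this, assuming the (num_frames, 17, 3) keypoint series from above:

```python
def fast_walk(series, skip=2):
    """Simulate fast walking: keep one frame in every `skip`, dropping the rest."""
    return series[::skip]
```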

b) Slow walking:
Similar to walking fast, a person may have gotten hurt, or may have all the luxury of time, and hence may be walking slowly. To simulate slow-walking conditions, we need to insert frames between the existing frames. The landmarks in these inserted frames should be interpolated from the frames near them so that the person’s motion appears continuous; the increased number of frames at a constant fps makes the person appear to walk slowly.

Slow walking (GIF by Author)
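A minimal sketch of the interpolation, again on the keypoint series:

```python
import numpy as np

def slow_walk(series, factor=2):
    """Simulate slow walking by linearly interpolating frames between existing ones."""
    frames = [series[0]]
    for prev, nxt in zip(series[:-1], series[1:]):
        for step in range(1, factor + 1):
            # Interpolates coordinates (and scores) between consecutive frames.
            frames.append(prev + (nxt - prev) * step / factor)
    return np.stack(frames)
```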

c) Stagnant hand(s):
Many times, while walking, one may be holding a bag in one or both hands. Under such circumstances, the hand(s) holding the bag(s) will not move much and will only vibrate within a small range. We simulate such conditions by keeping the landmarks corresponding to the chosen hand’s wrist, elbow, and shoulder moving within a small range.

Left: Left hand stagnant, Middle: Right hand stagnant, Right: Both hands stagnant (GIF by Author)
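A sketch of this augmentation, using MoveNet’s standard keypoint ordering (the jitter amount is an arbitrary assumption):

```python
import numpy as np

LEFT_ARM, RIGHT_ARM = [5, 7, 9], [6, 8, 10]  # shoulder, elbow, wrist keypoints

def stagnant_hand(series, arm=LEFT_ARM, jitter=0.01):
    """Freeze one arm at its first-frame pose, plus a small random vibration."""
    out = series.copy()
    base = series[0, arm, :2]  # (y, x) of the arm keypoints in frame 0
    noise = np.random.uniform(-jitter, jitter, size=(len(series), len(arm), 2))
    out[:, arm, :2] = base + noise
    return out
```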

d) Only top posture:
Sometimes, the camera recording the target’s movement may have been installed at a high altitude or angled at an awkward position, such that it captures only the top posture of the subject. We simulate such conditions by shifting the person’s landmarks down so that it naturally appears that only the top posture is visible.

Top posture visible (GIF by Author)
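A sketch, assuming coordinates normalized to [0, 1] with y growing downward:

```python
def top_posture(series, shift=0.4):
    """Shift the skeleton down so only the upper body stays inside the frame."""
    out = series.copy()
    out[:, :, 0] += shift            # y grows downward in normalized image coords
    off_frame = out[:, :, 0] > 1.0   # keypoints pushed out of the frame
    out[:, :, 2][off_frame] = 0.0    # zero out their confidence
    return out
```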

e) Random noise:
Another issue with the camera recording the target’s movement could be that it has grown old and its frames glitch. It could also happen that the lighting on the subject isn’t proper. Under such circumstances, some landmarks may randomly disappear, and/or the pose estimation model may make very low confidence predictions. This type of augmentation also adds a little more of a regularizing effect than the other augmentations, preventing our model from overfitting.

Glitching keypoints (GIF by Author)
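A sketch of such random dropout on the keypoint series (the drop probability is an assumed value):

```python
import numpy as np

def glitch(series, drop_prob=0.1):
    """Randomly zero keypoint confidences to mimic glitchy frames or poor light."""
    out = series.copy()
    mask = np.random.rand(*series.shape[:2]) < drop_prob  # per frame, per keypoint
    out[:, :, 2][mask] = 0.0
    return out
```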

f) Walking with pause:
A person may have to pause in between to take a look at his/her cellphone or do some other activity. Under such circumstances, the legs remain stagnant, but other parts of the body, like the hands and the head, may move a little.

Walking with stop in between (GIF by Author)
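One rough way to approximate this is to hold a single frame for a while, adding slight movement only to the upper body; the keypoint groupings and jitter are assumptions:

```python
import numpy as np

UPPER = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # head, shoulders, arms

def walk_with_pause(series, start, length, jitter=0.005):
    """Insert a pause: hold one frame, with slight upper-body movement only."""
    held = np.repeat(series[start:start + 1], length, axis=0)
    held[:, UPPER, :2] += np.random.uniform(-jitter, jitter, (length, len(UPPER), 2))
    return np.concatenate([series[:start], held, series[start:]])
```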

g) Walking left and right:
So far, in all the videos, we have assumed that the subject walks right in front of the camera in a straight line. But the subject may also drift randomly to the left or right.

Left: Moving right, Right: Moving left (GIF by Author)
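A sketch of such sideways drift, shifting the x coordinate with a smooth random walk:

```python
import numpy as np

def drift(series, max_shift=0.2):
    """Drift the subject sideways with a smooth, clipped random walk."""
    offsets = np.cumsum(np.random.uniform(-0.01, 0.01, len(series)))
    offsets = np.clip(offsets, -max_shift, max_shift)
    out = series.copy()
    out[:, :, 1] += offsets[:, None]  # x is column 1 in the (y, x, score) layout
    return out
```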

h) Mixed combinations:
The subject need not deviate from his/her regular style in only one manner, however short the recording may be. As a result, we have to combine two or more of the above conditions, either serially or in parallel, to simulate mixed styles, as sketched below.
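A minimal composition helper over the earlier sketches (only those that take a single series argument):

```python
import random

# Augmentations from the earlier sketches that take a single series argument.
AUGMENTATIONS = [fast_walk, slow_walk, stagnant_hand, glitch, drift]

def mixed(series, k=2):
    """Apply k randomly chosen augmentations one after another (serially)."""
    for augment in random.sample(AUGMENTATIONS, k):
        series = augment(series)
    return series
```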

i) Think more alternatives:
Our model will only be as good as the diversity of data it gets to see during training. While the above set of alternative walking styles adds good variety to our dataset, there is no hard limit to the types of scenarios we can think of to provide more data augmentations.

Building model

Building a classification model might sound appealing: directly classify the subject, with each subject as one class. But there is a large downside to this approach. We intend to scale this solution to several thousand people, and a classifier approach would result in a huge final dense layer. More importantly, despite all the data augmentation we have done, we still don’t have enough data to learn representations that draw good boundaries for each class.

To counter the above issues, we will make use of one-shot learning. With one-shot learning, we will be able to evaluate new people’s videos on the fly without even including them as part of the training. One-shot learning makes use of a similarity function to measure how similar two entities are to each other.

Our architecture will include a Siamese network. Since our data is time-series based, we will employ recurrent neural networks (GRU or LSTM) inside the Siamese network. We will pass the time series of landmark locations from a reference video and from a test video into the Siamese network. The individual latent representations will then be passed through dense layers whose output score is maximized or minimized depending on whether the two videos belong to the same person or not.

High level architecture (Image by Author)
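A minimal Keras sketch of such a network, assuming each frame’s 17 keypoints are flattened into a 51-value vector; the layer sizes are illustrative assumptions, not the exact ones we trained:

```python
from tensorflow import keras
from tensorflow.keras import layers

FEATURES = 17 * 3  # (y, x, score) per keypoint, flattened per frame

def make_encoder():
    """Shared GRU encoder mapping a keypoint time series to a latent vector."""
    return keras.Sequential([
        keras.Input(shape=(None, FEATURES)),  # variable-length series
        layers.GRU(128, return_sequences=True),
        layers.GRU(64),
        layers.Dense(32),
    ])

encoder = make_encoder()  # one instance => shared weights (the Siamese part)
ref_in = keras.Input(shape=(None, FEATURES))
test_in = keras.Input(shape=(None, FEATURES))
merged = layers.concatenate([encoder(ref_in), encoder(test_in)])
hidden = layers.Dense(32, activation="relu")(merged)
score = layers.Dense(1, activation="sigmoid")(hidden)  # ~1 if same person

siamese = keras.Model([ref_in, test_in], score)
siamese.compile(optimizer="adam", loss="binary_crossentropy")
```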

Result

In our experiment, we collected data by shooting videos of people walking. We also trimmed freely available videos of people walking from open-source datasets, and we collected several videos whose owners permitted us to use their resources for our research.

Here is the inference result on one of the test videos fed to our trained model. The similarity score with the reference video of the subject is displayed inside brackets. The model gave similarity scores of less than 0.5 with all the incorrect reference videos.

Left: Test input video, Middle: Detection, Right: Results with similarity score (GIF by Author)

Is this approach the best?

No, not at all. Our whole aim in this experiment was to generate a new type of digital signature. With every advancement in innovative techniques aimed at strengthening security, counterfeiters will always find a way to circumvent it. Presently, with the help of vid2vid, a fake video can be generated copying the gait of a victim. Nevertheless, gait analysis combined with facial recognition would be a good mechanism for generating a person’s digital signature. We believe the simple method showcased in this experiment can form a good baseline model to take things forward.

Conclusion

In this article, we learned how to build on top of existing solutions. We leveraged pose estimation solutions and tailored them to gait detection by making use of Siamese networks and data augmentation to simulate many possible walking scenarios. We have demonstrated how machine learning solutions can be re-engineered to solve other use cases.

Code

If you would like to take a look at the code used in this experiment, you may check it out in this GitHub repository.

You may also look into our previous article explaining Gaussian Mixture Models and the Expectation-Maximization algorithm.
