Using Computer Vision and Machine Learning to Monitor Activity While Working From Home

An introduction to building vision-based health monitoring software on embedded systems

Raymond Lo, PhD
Towards Data Science

--

One of the biggest challenges of staying home during the lockdown that most of the world is facing is the sudden restriction of physical activity, especially when confined to a small space. To help address this, I thought it would be interesting to see whether I could use computer vision to motivate myself to be more active.

In this post, I will share my approach, along with full sample source code, to illustrate its potential to help monitor and improve the quality of life of people who stay at home. The idea is to have a low-cost, GPU-enabled, vision-based system (a $99 NVIDIA Jetson Nano) that performs most of the computer vision processing on the edge (e.g., face detection, emotion classification, people detection, and pose estimation) in one easy-to-set-up package. The system can then send the locally processed data to a remote cloud for post-processing and redistribution to servers that provide a health dashboard for one or many users.

To address privacy concerns, I discuss ways to protect a user's privacy by uploading only the post-processed data (the classified emotion data, facial features, and body pose data) instead of the actual camera feed. While there is more you can do to obscure the data, this example will point you in the right direction.

This proof of concept is a simple application that shows a dashboard of my activity based on my body pose and emotion. The entire system runs on a $99 NVIDIA Jetson Nano developer board along with a webcam that costs approximately $40. This setup can help people focus on their work while tracking their activity.

Final result with Privacy Mode turned on, so only the processed data is shown.

Let’s take a look at how you can build the same system that I did.

Set up the webcam server

The first step is to establish a webcam server so that I can add as many cameras to the system as needed, and process the frames in a distributed and scalable way.

In Linux, we can easily set up automatic webcam streams with a program called 'Motion'. The program lets you create a smart security camera that detects motion and, more importantly, serves the video as a Motion JPEG (MJPEG) stream. This is very useful because it allows us to run the processing applications asynchronously without bogging down any particular machine with a camera stream.

To install Motion on the NVIDIA Jetson Nano, simply run this command:

sudo apt-get install motion

Then, we edit the Motion configuration file to set the resolution and the frame rate. By default, Motion is configured to detect motion and capture frames, which is not what we want in our use case.

sudo nano /etc/motion/motion.conf

Here you will need to change the resolution to 640x320 and set the frame rate to 15 fps for both the camera feed (framerate) and the streaming feed (stream_maxrate).

If you have a more powerful machine, such as the Jetson Xavier, you can set the resolution and frame rate higher for better tracking of distant objects.

Then, we turn off the automatic snapshot and video capture features (i.e., set output_pictures to off). This is important; otherwise, your hard drive will quickly fill up with new captures.
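
For reference, the relevant lines in /etc/motion/motion.conf end up looking roughly like the snippet below. The option names come from a stock Motion install and may differ slightly between Motion versions, so check the comments in your own config file.

width 640
height 320
framerate 15
stream_maxrate 15
output_pictures off
ffmpeg_output_movies off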

After configuring Motion, you’ll need to allow it to run as a service at bootup.

sudo nano /etc/default/motion

Then edit this line to read:

start_motion_daemon=yes

Now you'll need to edit the main config file, located at /etc/motion/motion.conf, and set the daemon option to on. After restarting your computer, Motion should be running automatically.

The following commands control the Motion service:

  • Start the Motion service:
sudo service motion start
  • Stop the Motion service:
sudo service motion stop
  • Restart the Motion service:
sudo service motion restart

To preview the feed, simply open your browser and go to this link:

http://localhost:8080

At this point, you will be able to see the preview of the camera feed. Now it’s time for us to actually use it.
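
If you want to consume the stream programmatically rather than just preview it in a browser, a minimal sketch with OpenCV looks like the following. The URL is an assumption; match the host and port to the stream settings in your motion.conf.

import cv2

# Assumed MJPEG stream URL; adjust host and port to your Motion configuration.
cap = cv2.VideoCapture("http://localhost:8080")
while True:
    ret, frame = cap.read()
    if not ret:
        break  # stream dropped or unreachable
    cv2.imshow("Motion stream", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()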

Machine Learning Tools for Pose Estimation

One of the main goals of this system is to gain better insight into one's activity at home. To do this, we will create a pose estimation program that tracks a person's pose. More importantly, pose estimation lets you add a layer of privacy by omitting the camera feed and showing only the processed data.

To get started, you will have to install the TensorFlow and OpenPose libraries. GitHub user karaage0703 did an amazing job putting the setup scripts together for the NVIDIA Jetson Nano. You can follow the instructions in the repository to set up the tools; here is the link to clone it:

git clone https://github.com/karaage0703/jetson-nano-tools

In particular, you want to run the 'install-tensorflow.sh' and 'install-pose-estimation.sh' scripts to install TensorFlow and the pose estimation library on your machine.

$ cd ~/jetson-nano-tools
$ ./install-tensorflow.sh
$ ./install-pose-estimation.sh

This process can take up to 30 minutes, so take a break and stretch after executing the commands. Once you have the tools in place, let's look at the Python script I wrote that allows you to capture, process, and visualize the data.

Pose Estimation (OpenPose)

I created a Python script that uses the toolset we just installed; here is the result, with the skeleton overlay drawn on top of the camera feed.

Example output of the OpenPose library running on the NVIDIA Jetson Nano

You can check out the scripts I wrote here:

git clone https://github.com/raymondlo84/nvidia-jetson-ai-monitor

The ‘run_webcam_ip.py’ script has two key functions that are based on the CMU Perceptual Computing Lab’s OpenPose project.

“OpenPose represents the first real-time multi-person system to jointly detect human body, hand, facial, and foot keypoints (in total 135 keypoints) on single images.”

This library is a very powerful tool: from a single camera feed, it can detect full human skeletons with GPU acceleration. With that, imagine how you could track a person's body pose to provide useful feedback. For example, you could build a neck angle detector that helps you fix your posture while working, as sketched a little further below.

from tf_pose.estimator import TfPoseEstimator
from tf_pose.networks import get_graph_path, model_wh

The script's main loop performs the inference on each frame captured from the IP camera. These two lines perform the inference and draw the results on the frame:

humans = e.inference(image, resize_to_default=(w > 0 and h > 0), upsample_size=args.resize_out_ratio)
image = TfPoseEstimator.draw_humans(image, humans, imgcopy=False)
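
Building on the humans list returned above, here is a minimal sketch of the neck-angle idea mentioned earlier. It assumes the COCO keypoint layout used by tf_pose (index 0 is the nose, index 1 is the neck), and the 20-degree threshold is just a placeholder to tune.

import math

def neck_angle_degrees(human):
    # Angle between the neck-to-nose segment and the vertical axis.
    # Returns None if either keypoint was not detected in this frame.
    parts = human.body_parts
    if 0 not in parts or 1 not in parts:  # 0 = nose, 1 = neck (COCO layout)
        return None
    nose, neck = parts[0], parts[1]
    dx = nose.x - neck.x
    dy = neck.y - nose.y  # image y grows downward, so flip the sign
    return abs(math.degrees(math.atan2(dx, dy)))

for human in humans:
    angle = neck_angle_degrees(human)
    if angle is not None and angle > 20:  # placeholder threshold
        print("Posture check: head is %.0f degrees off vertical" % angle)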

The NVIDIA Jetson Nano's GPU can perform pose estimation at approximately 7-8 fps at 320x160 resolution using the mobilenet_thin model. That is very impressive for a machine that draws no more than 10 W at full load. With the body pose estimated, one can now predict how active a person is, similar to a Fitbit, but using the video feed from our camera instead of an accelerometer or gyroscope.
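
As a rough sketch of that idea (my own simplification, not the exact metric used in the repository), one could accumulate the frame-to-frame displacement of the detected keypoints into a simple activity score, assuming a single person in view:

import math

prev_parts = {}      # keypoint index -> (x, y) from the previous frame
activity_score = 0.0

def update_activity(humans):
    # Sum how far each keypoint moved since the previous frame (in normalized
    # image coordinates); a crude stand-in for an accelerometer reading.
    global prev_parts, activity_score
    current = {}
    for human in humans:
        for idx, part in human.body_parts.items():
            current[idx] = (part.x, part.y)
            if idx in prev_parts:
                px, py = prev_parts[idx]
                activity_score += math.hypot(part.x - px, part.y - py)
    prev_parts = current

Calling update_activity(humans) once per frame inside the main loop gives a running total that can then be logged per hour or per day.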

Running OpenPose on the NVIDIA Jetson Nano

Face Detection and Emotion Analysis

At this point, everything we set out to do has been achieved, but why not do even more with the captured video? Let's see how to detect emotions from facial expressions in order to build a happiness meter, all on the GPU. For this to work, we need to add code that detects a person's face and classifies its emotion. The good news is that both of these steps run on TensorFlow on the GPU, so they share the memory that is already allocated and add only minimal overhead. There are also further optimizations available, such as using the body pose to smartly select the bounding box for face detection, to gain an additional speed-up.

Currently, the processing rate is approximately 7 frames per second with face detection and emotion analysis at 320x160 resolution. However, with the distributed setup, I can easily double or triple that rate by offloading work onto other nodes in a GPU cluster. Even at 3 updates per second, the system collects approximately 259,200 samples per day. This is not a trivial amount of data.

To perform the emotion analysis, I bring in the Keras library and the face_recognition package this time.

import face_recognition
import keras
from keras.models import load_model
from keras.preprocessing.image import img_to_array

Then, this line extracts the positions of the detected faces from the image.

face_locations = face_recognition.face_locations(small_image, model='cnn')

The parameter model='cnn' selects the CUDA GPU-accelerated detector, which also gives better accuracy at different viewing angles and under occlusion.

Once a face is detected in the image, the cropped face is run through the predict function of our pre-trained model (_mini_XCEPTION.106-0.65.hdf5), which classifies it into one of seven categories: angry, disgust, scared, happy, sad, surprised, or neutral.

emotion_dict = ["Angry", "Disgust", "Scared", "Happy", "Sad", "Surprised", "Neutral"]
model = load_model("emotion_detector_models/_mini_XCEPTION.106-0.65.hdf5", compile=False)
model_result = model.predict(face_image)

Basically, the actual work is approximately five lines of code, and I also provide additional sample code (webcam_face_detect.py) so you can run this part separately for your own testing.
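
For completeness, here is a minimal sketch of the glue between the two steps above: cropping the detected face and preparing it for the classifier. The preprocessing (grayscale, scaled to [0, 1]) follows common mini_XCEPTION usage but is my assumption here; read the target size from model.input_shape rather than hard-coding it.

import cv2
import numpy as np

# face_recognition returns (top, right, bottom, left) boxes on the RGB frame;
# this assumes at least one face was found.
(top, right, bottom, left) = face_locations[0]
face = small_image[top:bottom, left:right]
face = cv2.cvtColor(face, cv2.COLOR_RGB2GRAY)

# Resize to whatever the pre-trained model expects, e.g. (None, 48, 48, 1).
target_h, target_w = model.input_shape[1:3]
face = cv2.resize(face, (target_w, target_h)).astype("float32") / 255.0
face_image = np.expand_dims(img_to_array(face), axis=0)

probabilities = model.predict(face_image)[0]
label = emotion_dict[int(np.argmax(probabilities))]
print("Detected emotion: %s (%.2f)" % (label, probabilities.max()))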

Running Pose Estimation, Face Detection, and Emotion Analysis, all together on GPUs.

CPU/GPU Performance

One key takeaway from my experiment is how well these CV + ML algorithms stack together on the GPU. By default, TensorFlow's runtime reserves a significant amount of GPU memory, so it is wise to make the best use of that resource anyway. And with the CPU almost completely freed from the processing, we have over 300% of CPU capacity (more than three of the four cores) available for other tasks, such as bookkeeping of the processed data.

Notice the very low CPU usage even when running all the algorithms simultaneously.

Final Thoughts

Here we have a low-cost monitoring system that tracks body pose, performs face detection, and returns a rudimentary emotion analysis, all on the GPU. These three features form the foundation of a novel vision-based health dashboard. The best part is that it can be built with a $99 machine. Imagine the possibilities!

Enable Privacy Mode

With pose estimation, face detection, and emotion analysis, I can extract motion parameters such as average motion per day and track my own emotional data without sharing any images with the cloud. The data is nowhere near perfect, but this kind of feedback can prove very useful for improving posture during long sessions in front of the computer (e.g., the neck issue). Also, with some minor modifications, this could quickly turn into a sleep monitor, and it has other potential fitness uses that can be expanded upon.
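
As an illustration, the only thing that ever needs to leave the device is a small summary of the processed data. The endpoint URL and field names below are hypothetical placeholders, and the values come from the earlier sketches.

import time
import requests

summary = {
    "timestamp": time.time(),
    "activity_score": float(activity_score),  # from the pose sketch above
    "emotion": label,                          # from the emotion sketch above
    # Note: no image or video data is ever included in the payload.
}
requests.post("https://example.com/api/health-dashboard", json=summary, timeout=5)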

About Me

Currently, I live in Silicon Valley and serve as the OpenVINO Edge AI Software Evangelist at Intel. Previously, I was the co-founder and CTO of Meta, where I shipped two Augmented Reality Developer Kits. As the Head of R&D at Meta, I worked with hundreds of amazing engineers creating state-of-the-art technology and CV algorithms such as SLAM, 3D hand tracking, and new UX for AR/XR. During my Ph.D. studies, I published and developed real-time GPGPU applications for HDR video processing and created augmented reality prototypes with 3D depth-sensing cameras. I have always believed that accessibility and simplicity of technology are needed to create disruptive change. Right now, I see the rapid growth of edge AI computing bringing many smart applications at a very affordable cost, and this post is just the beginning of this ever-growing trend.
