End goal and steps:
The idea behind this project is to measure how I perform while working from home and get insights on how to improve my workstation setup, using Computer Vision at the edge with an IoT device, the AWS Cloud, and a simple web app.
Additionally, I would like to share the setbacks I found during development and the workarounds I used to solve them.
So, based on the requirements, my first idea was something like this:

Once the initial setup was defined, I started determining the minimum tools required to get the data, in this case the camera; since in the LPR Recognition Article I used the low-cost SV3C camera, I decided to use the same one because the integration over HTTP was already done.
These are the initial images:




I used two scripts to get those images, one in Bash and one in Python (Note: all the information regarding this project is available in the following Github repo). The first one, create-folder-structure.sh, creates the folder structure, and the second one, collect_images.py, collects the data.
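As a rough illustration of how the collection step could look, here is a minimal sketch that grabs an HTTP snapshot from the camera and saves it into a per-label folder; the snapshot URL, credentials, and folder names are assumptions for illustration, not the exact values from the repo.
import time
import requests

SNAPSHOT_URL = "http://192.168.0.50/web/tmpfs/snap.jpg"  # assumed camera snapshot endpoint
LABEL = "looking_at_screen"                               # assumed position/label name
NUM_IMAGES = 200

for i in range(NUM_IMAGES):
    # Request a single JPEG snapshot over HTTP (no video stream needed)
    response = requests.get(SNAPSHOT_URL, auth=("admin", "password"), timeout=5)
    with open(f"data/{LABEL}/{LABEL}_{i:03d}.jpg", "wb") as f:
        f.write(response.content)
    time.sleep(1)  # wait between captures so images are not near-duplicates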
Getting the Input Data:
At this point, I had the setup and 200 images of each position, and I started to think about how I would use these images, so I ended up using transfer learning with EfficientNetB0 as a feature extractor and training just the last layer. Here is the Colab Link
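A minimal Keras sketch of this kind of transfer learning, assuming a 224x224 input and a hypothetical number of position classes (the exact preprocessing and class names live in the Colab):
import tensorflow as tf

NUM_CLASSES = 3  # assumed number of positions

# EfficientNetB0 pretrained on ImageNet, used as a frozen feature extractor
base = tf.keras.applications.EfficientNetB0(include_top=False, weights="imagenet", pooling="avg")
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # only this layer is trained
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)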
And after just a few epochs, I was able to achieve more than 80% accuracy. Nevertheless, I found some setbacks in real-time testing: my idea was to train based on my head position, but the network actually learned my shoulder and arm positions. I had a few options, such as getting more data, getting different images, data augmentation, or a more complex network. Still, I wanted to keep it simple, so I researched a little more and found a new approach.
Based on my initial thought, I wanted to track my face, and after some research I found a neural network named MoveNet, released by Google in 2021, which is able to detect 17 keypoints in the body (nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, right ankle). With the first five keypoints I had some information about my face, or any face, so I decided to try this new option.
Additionally, I changed the camera position. This is the difference between doing a real project and testing and learning with fixed datasets: in a real project you can change the input data and validate whether the new data provides more information than the previous one, and I could do this because this is an end-to-end project.



With this information, the approach now is to detect these 5 keypoints and select an ML classifier to determine the position of the head.
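A minimal sketch of the keypoint-extraction step, assuming the single-pose MoveNet Lightning model from TF Hub (192x192 input; the output is shaped [1, 1, 17, 3] with normalized y, x and a confidence score per keypoint):
import tensorflow as tf
import tensorflow_hub as hub

# Load single-pose MoveNet Lightning from TF Hub
module = hub.load("https://tfhub.dev/google/movenet/singlepose/lightning/4")
movenet = module.signatures["serving_default"]

def extract_face_keypoints(image):
    # image: HxWx3 uint8 array; MoveNet Lightning expects a 192x192 int32 tensor
    inp = tf.image.resize_with_pad(tf.expand_dims(image, axis=0), 192, 192)
    inp = tf.cast(inp, dtype=tf.int32)
    outputs = movenet(inp)
    keypoints = outputs["output_0"].numpy()[0, 0]  # shape (17, 3): y, x, score
    # The first five keypoints are nose, left eye, right eye, left ear, right ear
    return keypoints[:5]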
Selecting the classifier:
Among the technologies trending for ML beginners, there are a couple of AutoML solutions (FLAML and MLJAR), so I decided to test FLAML to get an initial direction on which classifier to use,
and with just a few lines of code:
from flaml import AutoML

# Let FLAML search classifiers and hyperparameters within a 120-second budget
automl = AutoML()
automl.fit(X_train, y_train, task="classification", time_budget=120)
I got 84% accuracy on the validation set

But 93% on the test set.

Here is the Colab for MoveNet feature extraction and the AutoML classifier.
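To see which model FLAML actually picked (which is what later pointed me to LGBM), the fitted AutoML object exposes the winning estimator and its configuration; a short example of how that inspection could look:
print(automl.best_estimator)   # e.g. "lgbm"
print(automl.best_config)      # hyperparameters of the winning model
y_pred = automl.predict(X_test)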
Integration with AWS IoT:
Once I had the setup to get images, extract the keypoints with MoveNet, and classify the image using FLAML, I wanted to send the information to AWS. An interesting service to use is AWS IoT Core, so this is the basic architecture I ended up with:

Basic Architecture description:
The SV3C camera is ONVIF compliant. ONVIF defines standard ways to get image snapshots over HTTP, allowing you to grab an image whenever you want without having to deal with video streams. You can see the request in Github.
For communication between the end device and AWS IoT Core I used MQTT, a lightweight messaging protocol that makes it easy to talk to and act on a remote device, which would also allow me to add a smart switch in the future to turn off the light once I leave the office.
This part was based on the following AWS blog, which uses the Python awsiotsdk.
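A minimal sketch of publishing a prediction to IoT Core with the awsiotsdk; the endpoint, certificate paths, client ID, and topic name are placeholders for the real values from the AWS console:
import json
from awscrt import mqtt
from awsiot import mqtt_connection_builder

# Mutual-TLS connection using the device certificate created in IoT Core (paths are placeholders)
connection = mqtt_connection_builder.mtls_from_path(
    endpoint="xxxxxxxx-ats.iot.us-east-1.amazonaws.com",
    cert_filepath="certs/device.pem.crt",
    pri_key_filepath="certs/private.pem.key",
    ca_filepath="certs/AmazonRootCA1.pem",
    client_id="jetson-workstation",
)
connection.connect().result()

# Publish the predicted head position so an IoT rule can forward it to DynamoDB
payload = {"device": "jetson-workstation", "prediction": "looking_at_screen", "timestamp": 1650000000}
connection.publish(topic="workstation/position", payload=json.dumps(payload), qos=mqtt.QoS.AT_LEAST_ONCE)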
Regarding the AWS services I'm using, both are serverless, can scale to billions of devices, and I only pay for what I use (pay as you go). For DynamoDB, the integration was based on the following doc. I'm also uploading the images to an S3 bucket, stored in a folder named after the predicted class, which helps me validate how the model is behaving.
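A minimal sketch of that S3 upload with boto3, assuming a hypothetical bucket name and using the predicted class as the key prefix:
import time
import boto3

s3 = boto3.client("s3")
BUCKET = "workstation-monitor-images"  # assumed bucket name

def upload_frame(image_path, prediction):
    # Store the frame under a prefix named after the predicted class,
    # e.g. looking_at_screen/1650000000.jpg, to review model behavior later
    key = f"{prediction}/{int(time.time())}.jpg"
    s3.upload_file(image_path, BUCKET, key)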
Then we have Streamlit, which I’m using locally now just to get an overview.
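A minimal sketch of what the local Streamlit overview could look like, assuming the predictions land in a DynamoDB table; the table and field names here are illustrative, not the real ones:
import boto3
import pandas as pd
import streamlit as st

st.title("Work-from-home position overview")

# Pull the stored predictions from DynamoDB (table/field names are assumptions)
table = boto3.resource("dynamodb").Table("workstation-positions")
items = table.scan()["Items"]
df = pd.DataFrame(items)

st.dataframe(df)
st.bar_chart(df["prediction"].value_counts())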
Running it on the Jetson Nano:
The real challenge started once I decided to move my model to the Jetson Nano after having it ready and working on my computer. The main reasons were that the Jetson was running out of memory and that the AutoML library (FLAML) had problems installing.
After some more research, I found the MoveNet model as a TFLite model, which solved the memory problem, but I still had the problem with the FLAML library. The solution was mentioned in this blog: I had proposed AutoML to get an initial direction, and based on the output of the FLAML library I switched to the specific library of the model that AutoML proposed, in this case LGBM, as shown in the (Best ML solution*) image.
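A minimal sketch of running the TFLite MoveNet model on the device, assuming the tflite_runtime package and a locally downloaded model file; the exact file name and input dtype depend on which MoveNet variant is used:
import numpy as np
from tflite_runtime.interpreter import Interpreter

# Load the MoveNet Lightning TFLite model (file name is a placeholder)
interpreter = Interpreter(model_path="movenet_singlepose_lightning.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

def keypoints_from_frame(frame_192x192):
    # frame_192x192: (1, 192, 192, 3) array cast to the model's expected input dtype
    interpreter.set_tensor(input_details[0]["index"], frame_192x192.astype(input_details[0]["dtype"]))
    interpreter.invoke()
    return interpreter.get_tensor(output_details[0]["index"])[0, 0]  # (17, 3): y, x, score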
See the TFLite implementation in realtimepredictionmovenetflaml.py
and the initial test in this Colab
See the LGBM training in this Colab
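A minimal sketch of training that final classifier directly with LightGBM on the five face keypoints, assuming X holds the flattened keypoint features and y the position labels:
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# X: (n_samples, 15) array of the 5 face keypoints (y, x, score); y: position labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = lgb.LGBMClassifier(n_estimators=100, learning_rate=0.1)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))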



Scripts to run the project:
- To do the real-time inference and send the data to AWS:
python realtimepredictionmovenetflaml.py
- To run the Streamlit server locally:
streamlit run app.py
- To collect images from the camera:
python collect_images.py
- To train the NN, go to the COLAB

Improvements and Comments from Prototype to Product:
There are a lot of points to be improved in this project, such as:
- Change from Streamlit to an SPA, possibly something like React.js
- Train with more images to improve accuracy
- Add data augmentation
- Add infrastructure as code to configure all the AWS services
- Get more metrics
- Add visibility (Logs)