Personal Note: Won the Grand Prize at the "Deep Learning Superhero Challenge" with Intel, organized by Hackster.io, for this project. It gives solace to finally hit the bull's-eye after a chain of near misses, not to mention the bounties – US$ 1000 & Intel Movidius NCS 2 sticks. Sure, there is light at the end of the tunnel 🙂
Contest Link: https://www.hackster.io/contests/DLSuperheroes

Interactive public kiosks are now widely used across domains: banking (ATMs), airports (check-in), government (e-governance), retail (product catalogues), healthcare (appointments), schools (attendance), corporate offices (registration), events (info) and the list goes on. While businesses move towards kiosks for better service delivery, touch-free interaction with all public devices has become imperative to mitigate the spread of the coronavirus.
Gesture or speech navigation might seem to address the above, but such devices are too resource-constrained to analyze these inputs. Have you noticed that your mobile voice assistant, be it Siri or GAssist, gives up when the phone goes offline? Your voice-enabled car infotainment system fails to respond while you drive on remote roads. Even a conventional computer won't be able to run multiple AI models concurrently.
Ain’t it nice to do it all on your device itself? Imagine a bedside assistant device which can take visual or voice cues from bedridden patients. This is possible with the advent of Intel OpenVINO. It enables and accelerates deep learning inference at the edge by doing hardware-conscious optimizations. OpenVINO supports CPU, iGPU, VPU, FPGA and GNA. If you wanna get your hands dirty, a Raspberry Pi along with an Intel Movidius NCS 2 would be your best bet to toy with.
In this blog, we will build a Human-Computer Interaction (HCI) module which intelligently orchestrates 5 concurrently-run AI models, one feeding into another. AI models for face detection, head pose estimation, facial landmark computation and gaze angle estimation identify gesture control inputs and trigger mapped actions. A child thread runs offline speech recognition and communicates with the parent process to give parallel control commands based on user utterances, to assist and augment gesture control.
If you like the project, kindly give a thumbs up here
The solution source code can be found here.
Architecture Diagram
Each component in the architecture diagram is explained below.

Control Modes
There are 4 control modes defined in the system, to determine the mode of user input. We can switch between control modes using gestures.
- Control Mode 0: No Control – gesture and sound navigation are turned off
- Control Mode 1: Gaze Angle Control – the mouse moves along with the angle of eye gaze (faster)
- Control Mode 2: Head Pose Control – the mouse moves with changing head orientation (slower)
- Control Mode 3: Sound Control – the mouse slides in 4 directions and text is typed based on user utterances
Calibration Step
To translate the 3D gaze orientation angles to the 2D screen dimensions, the system has to know the yaw and pitch angles corresponding to opposite corners of the screen. Given the (yaw, pitch) angles at the two opposite corners, we can interpolate the (x, y) location on the screen for any intermediate (yaw, pitch) angle.
Therefore, the user will be prompted to look at opposite corners of the screen, when the application is initiated. Such a calibration step is needed to map the variation in gaze angles to the size and shape of the screen, in order for the "gaze mode" to function properly.
The system can also function without calibration, albeit at the expense of generality. To demonstrate this, the relative change in head orientation is taken as the metric to move the mouse pointer when the system is in "head pose" mode.
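For illustration, here is a minimal sketch of the interpolation described above; the function name, calibration tuple and default screen size are assumptions, not the exact code from the repository.

```python
def gaze_to_screen(yaw, pitch, calib, screen_w=1920, screen_h=1080):
    """Linearly interpolate a (yaw, pitch) gaze angle to an (x, y) screen position.

    `calib` holds the angles recorded when the user looked at the top-left and
    bottom-right corners during the calibration step: ((yaw_tl, pitch_tl), (yaw_br, pitch_br)).
    """
    (yaw_tl, pitch_tl), (yaw_br, pitch_br) = calib
    x = (yaw - yaw_tl) / (yaw_br - yaw_tl) * screen_w
    y = (pitch - pitch_tl) / (pitch_br - pitch_tl) * screen_h
    # clamp to the screen boundaries
    return int(min(max(x, 0), screen_w - 1)), int(min(max(y, 0), screen_h - 1))
```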
Gesture Detection Pipeline
Four pre-trained OpenVINO models are executed on the input video stream, one feeding into another, to detect a) face location, b) head pose, c) facial landmarks and d) gaze angles.
a) Face Detection: A pruned MobileNet backbone with efficient depth-wise convolutions is used. The model outputs the (x, y) coordinates of the face in the image, which are fed as input to steps (b) and (c).
b) Head Pose Estimation: The model outputs the yaw, pitch and roll angles of the head, taking the face image from step (a) as input.

c) Facial Landmarks: A custom CNN is used to estimate 35 facial landmarks.

This model takes the cropped face image from step (a) as input and computes the facial landmarks, as above. Such a detailed map is required to identify facial gestures, though it is twice as heavy in compute demand (0.042 vs 0.021 GFLOPs) compared to the Landmark Regression model, which gives just 5 facial landmarks.
d) Gaze Estimation: A custom VGG-like CNN is used for gaze direction estimation.
The network takes 3 inputs: the left eye image, the right eye image, and the three head pose angles (yaw, pitch and roll), and outputs a 3-D gaze vector in the Cartesian coordinate system.
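To make the pipeline concrete, below is a hedged sketch of how the four models could be chained with the OpenVINO Inference Engine Python API. The model file names follow the Open Model Zoo, but the layer names, the `crop_face`/`crop_eyes` helpers and the target device are assumptions to be adapted to your own setup.

```python
import cv2
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()

def load(model_xml, device="MYRIAD"):
    # model_xml paths are placeholders; point them at your downloaded IR files
    net = ie.read_network(model=model_xml, weights=model_xml.replace(".xml", ".bin"))
    return ie.load_network(network=net, device_name=device)

face_net = load("face-detection-adas-0001.xml")
pose_net = load("head-pose-estimation-adas-0001.xml")
lmk_net  = load("facial-landmarks-35-adas-0002.xml")
gaze_net = load("gaze-estimation-adas-0002.xml")

def preprocess(image, size):
    # resize to the model's expected W x H, then HWC -> NCHW with a batch dim
    blob = cv2.resize(image, size).transpose((2, 0, 1))
    return np.expand_dims(blob, axis=0)

def infer_frame(frame):
    # Per frame: face -> (head pose, landmarks) -> gaze; layer names should be
    # verified against your IR version.
    face_out = face_net.infer({"data": preprocess(frame, (672, 384))})
    face = crop_face(frame, face_out["detection_out"])              # hypothetical helper, step (a)
    pose = pose_net.infer({"data": preprocess(face, (60, 60))})      # step (b)
    lmks = lmk_net.infer({"data": preprocess(face, (60, 60))})       # step (c)
    left_eye, right_eye = crop_eyes(face, lmks)                      # hypothetical helper
    angles = np.array([[pose["angle_y_fc"].item(),
                        pose["angle_p_fc"].item(),
                        pose["angle_r_fc"].item()]])
    gaze = gaze_net.infer({"left_eye_image": preprocess(left_eye, (60, 60)),
                           "right_eye_image": preprocess(right_eye, (60, 60)),
                           "head_pose_angles": angles})              # step (d)
    return gaze["gaze_vector"]
```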

Post-Processing Model Outputs
To feed one model's output as input to another, the return values of each model need to be decoded and post-processed.
For instance, to determine the gaze angle, the head orientation needs to be numerically combined with the vector output of the gaze model, as below.
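One common way to do this combination (a sketch, not necessarily the exact formula used in the repository) is to rotate the (x, y) components of the gaze vector by the head roll angle, so that a tilted head does not skew the pointer direction:

```python
import math

def gaze_to_mouse_vector(gaze_vector, roll_deg):
    # Rotate the (x, y) components of the 3-D gaze vector by the head roll angle.
    # The names and sign convention here are illustrative; adapt to your pipeline.
    roll = math.radians(roll_deg)
    gx, gy, _ = gaze_vector
    mouse_x = gx * math.cos(roll) + gy * math.sin(roll)
    mouse_y = -gx * math.sin(roll) + gy * math.cos(roll)
    return mouse_x, mouse_y
```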
Similarly, the facial landmarks model returns coordinates as ratios of the input image size. Hence, we need to multiply the output by the image width and height to compute the (x, y) coordinates of the 35 landmarks.
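A minimal sketch of this scaling, assuming the model returns 70 normalized values in the order (x0, y0, ..., x34, y34):

```python
import numpy as np

def landmarks_to_pixels(raw_output, face_w, face_h):
    # raw_output: 70 values in [0, 1], expressed as ratios of the face crop size
    pts = np.array(raw_output).reshape(-1, 2)      # -> 35 rows of (x, y)
    return (pts * [face_w, face_h]).astype(int)    # scale to pixel coordinates
```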
While the outputs of the facial landmark and gaze estimation models can easily be post-processed as above, the conversion of the head pose estimation model's output is a bit more involved.
Euler Angles to Rotation Matrices
Note the "Head Pose Estimation" model outputs only the attitude, i.e. Yaw, Pitch and Roll angles of the head. To obtain the corresponding direction vector, we need to compute the rotation matrix, using attitude.
i) Yaw is a counterclockwise rotation of α about the z-axis. The rotation matrix is given by

$$R_z(\alpha) = \begin{bmatrix} \cos\alpha & -\sin\alpha & 0 \\ \sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

ii) Pitch is a counterclockwise rotation of β about the y-axis. The rotation matrix is given by

$$R_y(\beta) = \begin{bmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{bmatrix}$$

iii) Roll is a counterclockwise rotation of γ about the x-axis. The rotation matrix is given by

$$R_x(\gamma) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\gamma & -\sin\gamma \\ 0 & \sin\gamma & \cos\gamma \end{bmatrix}$$
We can place a 3D body in any orientation by rotating it about the 3 axes, one after the other. Hence, to compute the direction vector, you need to multiply the above 3 matrices.
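A small numpy sketch of this composition; the reference axis and the multiplication order depend on the camera convention you adopt:

```python
import numpy as np

def head_direction(yaw, pitch, roll):
    """Direction vector of the head computed from its attitude (angles in degrees)."""
    a, b, g = np.radians([yaw, pitch, roll])
    Rz = np.array([[np.cos(a), -np.sin(a), 0],
                   [np.sin(a),  np.cos(a), 0],
                   [0,          0,         1]])
    Ry = np.array([[ np.cos(b), 0, np.sin(b)],
                   [ 0,         1, 0        ],
                   [-np.sin(b), 0, np.cos(b)]])
    Rx = np.array([[1, 0,          0         ],
                   [0, np.cos(g), -np.sin(g)],
                   [0, np.sin(g),  np.cos(g)]])
    R = Rz @ Ry @ Rx                         # compose the three rotations
    return R @ np.array([1.0, 0.0, 0.0])     # rotate a reference unit vector
```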

Eye Wink Detection
So far, we have controlled the mouse pointer using the head and gaze. But to use a kiosk, you also need to trigger events such as ‘Left Click’, ‘Right Click’, ‘Scroll’, ‘Drag’, etc.
In order to do so, a set of pre-defined gestures needs to be mapped to each event and recognized from the visual input. Two events can be mapped to winks of the left and right eye, but the winks first need to be identified.
You can easily notice that the number of white pixels suddenly increases when the eye is open and decreases when it is closed. We could just count the white pixels to differentiate an open eye from a closed one.

But in the real world, the above logic is not reliable because the white pixel value itself can vary. We can always use deep learning or ML techniques to classify, but it is advisable to use a numerical solution in the interest of efficiency, especially when you code for edge devices.
Let’s see how to numerically detect winks using signal processing, in 4 steps!
1. Calculate the frequency of pixels in the range 0–255 (histogram).
2. Compute the spread of non-zero pixels in the histogram. When an eye is closed, the spread takes a sudden dip, and vice-versa.
3. Try to fit an inverse sigmoid curve to the tail end of the above signal.
4. If a successful fit is found, confirm the ‘step down’ shape of the fitted curve and declare it a ‘wink’ event (no curve fit = the eye is not winking).
Algorithm Explanation:
If the above steps are not clear, see how the histogram spread graph falls when an open eye is closed.

Given the above signal, you can imagine that the curve would take the shape of an ‘S’ when the eye is opened for a few seconds. This can be mathematically parameterized using a sigmoid function.

But since we need to detect the ‘wink’ event shown above, the curve takes the form of an inverse sigmoid. To flip the sigmoid function about the y-axis, find f(-x).

Take any online function visualizer to plot the above function and change its parameters to see how the reverse ‘S’ shape changes, to fit the histogram spread curve above.

Thus, if a similar shape is found by the parametric curve fitting algo at the tail end of the histogram spread curve, then we can call it a ‘wink’. The curve fit algo tries to solve a nonlinear least-squares problem.
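Here is an illustrative sketch of such a fit with scipy.optimize.curve_fit; the parameterization and the shape check are assumptions, not the repository's exact code:

```python
import numpy as np
from scipy.optimize import curve_fit

def inverse_sigmoid(x, a, b, c, d):
    # a: height of the step, b: steepness, c: position of the drop, d: baseline
    return a / (1.0 + np.exp(b * (x - c))) + d

def is_wink(spread_tail):
    """Try to fit a step-down (inverse sigmoid) to the recent histogram-spread values."""
    x = np.arange(len(spread_tail))
    y = np.asarray(spread_tail, dtype=float)
    try:
        p0 = [y.max() - y.min(), 1.0, len(y) / 2.0, y.min()]   # rough initial guess
        popt, _ = curve_fit(inverse_sigmoid, x, y, p0=p0, maxfev=2000)
    except RuntimeError:
        return False                # no curve fit => the eye is not winking
    a, b, _, _ = popt
    return a > 0 and b > 0          # confirm the 'step down' shape
```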

Note: An efficient way to compute the above can be:
- Consider a strip of the ‘n’ most recent values in the non-zero histogram spread.
- Compute the median & standard deviation of the ‘k’ values at the front and tail end of the strip.
- If the difference in medians > threshold and both standard deviations < threshold, then detect an eye wink event, as the strip most likely has an inverse sigmoid shape.
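A sketch of this shortcut; all threshold values are illustrative and need tuning for your camera and lighting:

```python
import numpy as np

def is_wink_fast(spread, n=20, k=5, drop_thresh=15, flat_thresh=3):
    """Cheap inverse-sigmoid test on the last `n` non-zero histogram-spread values."""
    if len(spread) < n:
        return False
    strip = np.asarray(spread[-n:], dtype=float)
    head, tail = strip[:k], strip[-k:]
    drop = np.median(head) - np.median(tail)      # step height: front high, tail low
    flat = max(np.std(head), np.std(tail))        # both ends should be stable
    return drop > drop_thresh and flat < flat_thresh
```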
Alternatively, we can also use the below algo to find eye winks.
- Take the first differential of the histogram spread values.
- Find a peak in the first differential to detect a sudden spike.
- Take the reflection of the signal and find a peak to detect a sudden dip.
- If peaks are found in both of the above steps, then it is just a blink.
- If a peak is found only in the reflection, then it is a wink event.
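An illustrative version of this alternative using scipy.signal.find_peaks (the prominence value is an assumption):

```python
import numpy as np
from scipy.signal import find_peaks

def classify_eye_event(spread, prominence=10):
    """Peak-based blink/wink check on the histogram-spread signal."""
    diff = np.diff(np.asarray(spread, dtype=float))      # first differential
    rise, _ = find_peaks(diff, prominence=prominence)    # sudden spike: eye re-opens
    fall, _ = find_peaks(-diff, prominence=prominence)   # reflection: sudden dip, eye closes
    if len(rise) and len(fall):
        return "blink"
    if len(fall):
        return "wink"
    return "none"
```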
The above method is more efficient than curve fitting, but it can lead to many false positives, as peak detection is not always reliable, especially in low light. A middle-of-the-road approach is to use the median and standard deviation to estimate the shape of the curve, as described above.
Mouth Aspect Ratio (MAR)
The Eye Aspect Ratio (EAR) is computed in this classic facial landmark paper [3] to determine eye blinks.


We cannot use the above formula to determine eye gestures, as our model does not estimate such a dense landmark map. However, inspired by EAR, we can compute the MAR based on the 4 mouth landmarks obtained from the OpenVINO model, as below.

Two gesture events can be identified using MAR:
- if MAR > smile threshold, then the person is smiling
- if MAR < yawn threshold, then the person is yawning
We have the liberty to attach two commands corresponding to these two gestures.
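A sketch of the MAR computation and the two gesture checks; the landmark arguments and threshold values are illustrative:

```python
import numpy as np

def mouth_aspect_ratio(left_corner, right_corner, upper_lip, lower_lip):
    # MAR = horizontal mouth width / vertical mouth opening, from 4 (x, y) landmarks.
    # With this definition a smile stretches the mouth (large MAR) and a yawn opens it (small MAR).
    width = np.linalg.norm(np.subtract(right_corner, left_corner))
    height = np.linalg.norm(np.subtract(lower_lip, upper_lip))
    return width / max(height, 1e-6)

SMILE_MAR, YAWN_MAR = 3.5, 1.5    # illustrative thresholds; tune for your camera

def mouth_gesture(mar):
    if mar > SMILE_MAR:
        return "smile"
    if mar < YAWN_MAR:
        return "yawn"
    return "neutral"
```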
Threading and Process-Thread Communication
To enhance control, we can enable sound-based navigation as well, along with gesture control. However, the system then needs to continuously monitor user utterances to identify commands while it analyzes image frames from the input video stream.
Therefore, it is prudent to run the speech recognition model in a different thread and let the child thread communicate with the parent process. The child thread recognizes vocal commands to move the mouse or to write on the screen and passes them on to the parent using a shared Queue data structure in Python (as shown below).
The parent process runs all the above AI models and the computation required for gesture recognition, to enable the head and gaze control modes. Thus, it is possible to take gesture and sound control commands in parallel, but for the sake of usability, in this project we chose to take sound commands separately in Control Mode 3.
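A minimal sketch of this process-thread communication; recognize_from_mic() is a placeholder for the speech pipeline described in the next section:

```python
import threading
import queue

commands = queue.Queue()      # shared between the speech thread and the main loop

def speech_worker(q):
    # Placeholder loop: recognize_from_mic() stands in for the OpenVINO
    # feature-extraction + decoder pipeline described below.
    while True:
        utterance = recognize_from_mic()
        if utterance:
            q.put(utterance)

threading.Thread(target=speech_worker, args=(commands,), daemon=True).start()

# In the main frame-processing loop:
# while True:
#     ...run the vision models on the current frame...
#     try:
#         command = commands.get_nowait()   # non-blocking check for a voice command
#     except queue.Empty:
#         command = None
```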
Speech Recognition
To decode sound waves, we use the OpenVINO feature extraction & decoder library, which takes in and transcribes the audio coming from the microphone. We have used the speech library mentioned here to run speech recognition on the edge, without going online.
As the recognition model is optimized at the expense of accuracy, some tweaks are required to identify spoken commands. Firstly, we limit the command vocabulary to, say, ‘up’, ‘down’, ‘left’ & ‘right’ only. Secondly, similar-sounding synonyms of command words are stored in a dictionary to find the best match. For instance, the ‘right’ command could be recognized as ‘write’.
The function is written such that commands and synonyms can easily be extended. To enable user entry, a speech-to-write function is also provided, which lets the user type in alphabets and numbers, e.g. a PNR number.
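A sketch of the synonym lookup; apart from ‘write’, the synonym lists shown are made-up examples to illustrate how the dictionary can be extended:

```python
# Command vocabulary with similar-sounding synonyms; easy to extend.
COMMAND_SYNONYMS = {
    "up":    ["up", "app", "hop"],
    "down":  ["down", "don", "dawn"],
    "left":  ["left", "lift", "laughed"],
    "right": ["right", "write", "wright", "rite"],
}

def match_command(utterance):
    """Map a recognized word to a navigation command, tolerating misrecognitions."""
    word = utterance.strip().lower()
    for command, synonyms in COMMAND_SYNONYMS.items():
        if word in synonyms:
            return command
    return None   # fall through to speech-to-write mode, e.g. typing a PNR number
```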
Gesture Controls and Mouse Navigation
The gesture control commands are configured as below. However, you can easily change the gesture-command mapping.

The mouse pointer is controlled using the pyautogui library. Functions such as move(), moveTo(), click(), drag(), scroll(), write(), etc. are used to trigger events corresponding to the above gestures.
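A hedged sketch of such a mapping with pyautogui (the action names and scroll amounts are illustrative):

```python
import pyautogui

pyautogui.FAILSAFE = False   # optional: disable the corner abort for a full-screen kiosk

def trigger(action, dx=0, dy=0):
    # Gesture/voice events are mapped to pyautogui calls; the mapping is configurable.
    if action == "move":
        pyautogui.move(dx, dy)              # relative move from head pose / gaze delta
    elif action == "left_click":
        pyautogui.click()
    elif action == "right_click":
        pyautogui.click(button="right")
    elif action == "scroll_up":
        pyautogui.scroll(40)
    elif action == "scroll_down":
        pyautogui.scroll(-40)
    elif action == "drag":
        pyautogui.drag(dx, dy, button="left")
```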
Stickiness Feature and Optimization
The gaze of an eye or the pose of a head will continuously change at least a little, even if unintended. Such natural motions should not be considered as commands, otherwise the mouse pointer becomes jittery. Hence, we introduced a ‘stickiness’ parameter within which the motion is ignored. This has greatly increased the stability and usability of gesture control.
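A minimal sketch of the stickiness check (the threshold value is illustrative):

```python
def apply_stickiness(dx, dy, sticky=0.05):
    # Ignore small, unintended movements: only react when the change in
    # gaze/head angle exceeds the 'stickiness' threshold.
    if abs(dx) < sticky and abs(dy) < sticky:
        return 0.0, 0.0
    return dx, dy
```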
Finally, the Intel VTune profiler is used to find hotspots and optimize the application code. A shell script, vtune_script.sh, is fed into the VTune GUI, which initiates the project with suitable arguments.
Conclusion
The project demonstrates the capability of Intel OpenVINO to handle multiple edge AI models in sequence and in parallel. Multiple control inputs are also sourced to demonstrate flexibility, but to deploy a custom solution you can choose the controls you deem fit.
For instance, gaze control may be ideal for a big screen, while head pose control suits a laptop screen. Either way, sound control can help to accept custom form entries or vocal commands. The gesture-action mapping can also be modified. The point to drive home is the possibility of chaining multiple hardware-optimized AI models on the edge, coupled with efficient numerical computing, to solve interesting problems.
The solution source code can be found here.
If you have any query or suggestion, you can reach me here
If you like the project, kindly give a thumbs up here
References
[1] Intel OpenVINO Official Docs: https://docs.openvinotoolkit.org
[2] Intel® Edge AI for IoT Nanodegree by Udacity. Idea inspired by the final course project. https://classroom.udacity.com/nanodegrees/nd131
[3] Real-Time Eye Blink Detection using Facial Landmarks by Tereza Soukupova and Jan Cech, Faculty of E.E., Czech Technical University in Prague.