Why Lip Reading?
Professional lip reading is not a recent concept; it has been around for centuries. One of the biggest motivations behind lip reading has always been to give people with hearing impairments a way to understand what is being said to them.
Nevertheless, with advances in Computer Vision and Deep Learning, automated lip reading (ALR) by machines has now become a real possibility. Notice the growth of this field, shown by the cumulative number of papers on ALR published per year.
![Cumulative number of papers on ALR systems published between 2007 and 2017 [1]](https://towardsdatascience.com/wp-content/uploads/2019/06/1GIX4qaUpviwSZ-IEJ1yssA.jpeg)
Such advancements open up new avenues of discussion regarding the applications of ALR, the ethics of snooping on private conversations and, most importantly, its implications for data privacy.
Automated Lip Reading May Threaten Data Privacy (But Not for a While)
However, I am not here to discuss that today. This blog is for the curious few who would like to gain a deeper understanding of how these ALR systems work. Anyone with no previous experience in Deep Learning can follow it at a high level, and even a rudimentary understanding of Deep Learning is enough to fully appreciate the details.
Is Lip Reading difficult?
Just take a look at this video of bad lip reading of a few short clips from The Walking Dead (keep the sound off). Watch it again, with sound, just for fun 😛
Funny, right? The dialogues seem to match the video brilliantly, yet clearly something doesn’t feel right. What exactly is wrong? Well, for starters, those are obviously not the actual dialogues. But then why do they seem to fit so perfectly?
That is because there is no direct one-to-one correspondence between lip movements and phonemes (the smallest units of sound in a language). For example, /p/ and /b/ are visually indistinguishable. So the same lip movements can be the result of a multitude of different sentences.

But how do professional lip readers do it then? Professional lip reading combines an understanding of lip movements, body language, hand movements and context to interpret what the speaker is trying to say. Sounds complicated, right? Well, let’s see how machines do it…
What is the difference between ALR, ASR and AV-ASR?
For starters, let’s understand the difference between three seemingly similar terms: ALR, ASR and AV-ASR.
- Automated Lip Reading (ALR): trying to understand what is being spoken based solely on the video (visual signal).
- Automated Speech Recognition (ASR): trying to understand what is being spoken based solely on the audio. Commonly called speech-to-text systems.
- Audio-Visual Automated Speech Recognition (AV-ASR): using both audio and visual cues to understand what is being spoken.
Alphabet and Digit Recognition
Early work in ALR focused on simple tasks such as alphabet or digit recognition. The datasets for these tasks contain small clips of various speakers, with varying spatial and temporal resolutions, speaking a single letter (or phoneme) or digit. These tasks were popular in the early stages of ALR because they allowed researchers to work in a controlled setting with a constrained vocabulary.

Word and Sentence Recognition
While the controlled settings of alphabet and digit recognition are useful for analyzing the effectiveness of algorithms at early design stages, the resulting models cannot run in the wild. The aim of ALR systems is to understand natural speech, which is mainly structured in terms of sentences. This has made it necessary to acquire databases containing words, phrases and phonetically balanced sentences, and to build models that can work efficiently on them.

A Superficial view of the pipeline
A typical ALR system consists of three main blocks:
- Lip localization,
- Extraction of visual features,
- and Classification into sequences

The first block, focused on face and lip detection, is essentially a Computer Vision problem. The goal of the second block is to extract feature values (numerical descriptors) from the visual information observable in every frame, again a Computer Vision problem. Finally, the classification block aims to map these features onto speech units while making sure that the complete decoded message is coherent; this lies in the domain of Natural Language Processing (NLP). This final block helps disambiguate visually similar speech units by using context.
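To make the first block a bit more concrete, here is a minimal sketch of lip localization using dlib's 68-point facial landmark model (the mouth corresponds to landmarks 48–67) together with OpenCV. The crop size, margin and predictor file path are illustrative assumptions, not part of any specific ALR system.

```python
# Minimal sketch: locate the lips in a single frame and return a fixed-size crop.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Assumed to be downloaded separately and placed next to the script.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_lip_region(frame, margin=10, size=(64, 64)):
    """Detect the first face, take the mouth landmarks (48-67),
    and return a resized crop around the lips, or None if no face is found."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if len(faces) == 0:
        return None
    shape = predictor(gray, faces[0])
    mouth = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)],
                     dtype=np.int32)
    x, y, w, h = cv2.boundingRect(mouth)
    crop = frame[max(y - margin, 0):y + h + margin,
                 max(x - margin, 0):x + w + margin]
    return cv2.resize(crop, size)
```

In a full system, this kind of function would be applied to every frame of a clip, producing the sequence of lip crops that the feature extraction and classification blocks consume.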
Deep Learning based ALR systems
There have been significant improvements in the performance of ALR systems in the last few years, thanks to the increasing use of Deep Learning based architectures in the pipeline.
The first two blocks, namely lip localization and feature extraction, are typically handled by CNNs. Other DL based feature extraction architectures include 3D-CNNs and feed-forward networks. The last block consists of LSTMs that perform the final classification by taking all the individual frame outputs into account. Other DL based sequence classification architectures include Bi-LSTMs, GRUs and LSTMs with attention.
![An example of a DL based baseline for ALR systems [1]](https://towardsdatascience.com/wp-content/uploads/2019/06/1w-W4NZ6GBjT3bFnJgGEz1Q.jpeg)
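As a rough illustration of such a baseline, here is a minimal PyTorch sketch in which a small CNN encodes each lip-crop frame into a feature vector and an LSTM classifies the resulting sequence. The layer sizes, frame resolution and number of output classes are illustrative assumptions, not values taken from any particular paper.

```python
# Minimal sketch of a CNN + LSTM baseline for ALR (illustrative sizes only).
import torch
import torch.nn as nn

class CnnLstmLipReader(nn.Module):
    def __init__(self, num_classes=10, feat_dim=256, hidden_dim=256):
        super().__init__()
        # Per-frame visual feature extractor (second block of the pipeline)
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # Sequence classifier (third block): aggregates frame features over time
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, clips):
        # clips: (batch, time, channels, height, width) lip crops
        b, t, c, h, w = clips.shape
        feats = self.cnn(clips.reshape(b * t, c, h, w)).reshape(b, t, -1)
        _, (h_n, _) = self.lstm(feats)   # last hidden state summarizes the clip
        return self.fc(h_n[-1])          # one score per class (e.g. per word or digit)

# Example: a batch of 2 clips, 25 frames each, of 64x64 RGB lip crops
logits = CnnLstmLipReader()(torch.randn(2, 25, 3, 64, 64))
```

Swapping the LSTM for a Bi-LSTM, GRU or attention-based decoder, or replacing the 2D CNN with a 3D-CNN that looks at short stacks of frames, gives the other common variants mentioned above.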
What’s next?
In the last few years, a clear technology shift can be seen from traditional architectures to end-to-end DNN architectures, currently dominated by CNN features in combination with LSTMs.
However, in most of these models the output of the system is restricted to a pre-defined number of possible classes (alphabets, words or even sentences), in contrast to continuous lip reading, where the target is natural speech. Recent attempts to produce continuous lip-reading systems have focused on elementary language structures such as characters or phonemes. Thus, it is not surprising that the main challenge in ALR currently is to model continuous lip reading.
This blog is a part of an effort to create simplified introductions to the field of Machine Learning. Follow the complete series here
Or simply read the next blog in the series
References
[1] Fernandez-Lopez, Adriana, and Federico Sukno. "Survey on automatic lip-reading in the era of deep learning." Image and Vision Computing (2018).
[2] Chung, Joon Son, and Andrew Zisserman. "Learning to lip read words by watching videos." Computer Vision and Image Understanding 173 (2018): 76–85.
[3] Moll, K.L., and R.G. Daniloff. "Investigation of the timing of velar movements during speech." The Journal of the Acoustical Society of America 50.2B (1971): 678–684.