Sign Language Recognition with Advanced Computer Vision

Detecting Sign Language Characters in Real Time Using MediaPipe and Keras

Mihir Garimella
Towards Data Science

Image of Sign Language ‘F’ from Pexels

Sign Language is a form of communication used primarily by people who are deaf or hard of hearing. This gesture-based language allows people to convey ideas and thoughts easily, overcoming the barriers caused by hearing difficulties.

A major issue with this convenient form of communication is that the vast majority of the global population does not know the language. As with any other language, learning Sign Language takes considerable time and effort, which discourages most people from ever picking it up.

However, machine learning and image detection offer a clear solution to this problem. A predictive model that automatically classifies Sign Language symbols could provide a form of real-time captioning for virtual conferences such as Zoom meetings. This would greatly increase access to such services for those with hearing impairments, as it would go hand-in-hand with voice-based captioning, creating a two-way online communication system for people with hearing loss.

Many large training datasets for Sign Language are available on Kaggle, a popular resource for data science. The one used in this model is called Sign Language MNIST; it is a public-domain, free-to-use dataset with pixel information for around 1,000 images of each of 24 ASL letters, excluding J and Z because they are motion-based signs.

“Cropped image montage panel of various users and backgrounds for American Sign Language letters” from Sign Language MNIST

The first step in preparing the data for training is to convert and reshape all of the pixel data from the dataset into images so that the algorithm can read them.
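As a rough sketch, that preparation step might look like the following, assuming the standard Sign Language MNIST CSV layout (a label column followed by 784 pixel columns); the file and variable names here are placeholders:

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

train_df = pd.read_csv("sign_mnist_train.csv")
test_df = pd.read_csv("sign_mnist_test.csv")

y_train = train_df.pop("label").values
y_test = test_df.pop("label").values

# One-hot encode the 24 class labels
label_binarizer = LabelBinarizer()
y_train = label_binarizer.fit_transform(y_train)
y_test = label_binarizer.transform(y_test)

# Scale pixel values to [0, 1] and reshape each row into a 28x28 grayscale image
x_train = train_df.values.reshape(-1, 28, 28, 1) / 255.0
x_test = test_df.values.reshape(-1, 28, 28, 1) / 255.0
```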

The code above starts by reshaping all of the MNIST pixel rows into 28×28 images so the model understands the input. Along with this, LabelBinarizer() takes the class labels in the dataset and converts them into one-hot binary vectors, the format expected by the model's output layer.

The next step is to create a data generator that randomly applies changes to the data, increasing the number of training examples and making the images more realistic by adding noise and transformations to different instances.
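A minimal sketch of that generator, using Keras's ImageDataGenerator; the exact transformation ranges are assumptions:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Randomly rotate, zoom, and shift the training images to augment the dataset
datagen = ImageDataGenerator(
    rotation_range=10,
    zoom_range=0.1,
    width_shift_range=0.1,
    height_shift_range=0.1,
)
datagen.fit(x_train)
```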

After processing the images, the CNN model must be built to recognize all of the classes of information in the data, namely the 24 different groups of images. Normalization layers are also added to the network to keep the feature values on a consistent scale, which helps the model learn evenly across classes with fewer images.
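A sketch of such a network is shown below; the specific layer sizes are assumptions rather than the exact architecture used:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.BatchNormalization(),               # keep activations on a consistent scale
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),                       # reduce overfitting
    layers.Dense(24, activation="softmax"),    # one output per ASL letter class
])
```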

Notice how the network is built up from Conv2D layers and then condensed down to 24 output features, one for each class. We also process the data in batches so the CNN can handle it more efficiently.

Finally, defining the loss function and metrics, then fitting the model to the data, completes our Sign Language Recognition system. Note the model.save() command at the end: because training can take hours, saving the model means we do not have to re-train it for every use.
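A sketch of that training block is shown below, laid out so that model.compile() sits on line 1, model.fit() on line 4, and model.save() on line 5 to match the commentary that follows; the optimizer, batch size, epoch count, and file name are assumptions:

```python
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit(datagen.flow(x_train, y_train, batch_size=128), epochs=20, validation_data=(x_test, y_test))
model.save("smnist.h5")
```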

This code has a lot to unpack. Let’s look at it in sections.

Line 1:

The model.compile() function takes many parameters, of which three are displayed in the code. The optimizer and loss parameters work together with the number of epochs set in the model.fit() call to efficiently reduce the model's error by incrementally adjusting its weights.

Along with this, the metric of choice is accuracy, which reports how well the model classifies the images so we can see the accuracy achieved after the set number of epochs.

Line 4:

The model.fit() call fits the designed model to the image data prepared in the first block of code. It also defines the number of epochs, or iterations, the model gets to improve its image detection. The validation set is passed here as well, introducing a testing aspect: the model reports its accuracy on this held-out data.

Line 5:

Of all the statements in this block of code, the model.save() function may be the most important, as it can potentially save hours of time when the model is later put to use.

Image of Sign Language ‘X’ from Pexels

The model developed accurately detects and classifies Sign Language symbols with about 95% training accuracy.

Now, using two popular live video processing libraries, MediaPipe and OpenCV, we can take webcam input and run our previously developed model on a real-time video stream.

Image of Woman Showing Sign Language from Pexels

To start, we need to import the required packages for the program.
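A sketch of those imports; the exact warning-suppression level is an assumption:

```python
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"   # hide TensorFlow's informational log output

import cv2
import mediapipe as mp
import numpy as np
import pandas as pd
from tensorflow import keras
```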

The os environment variable set at the beginning simply suppresses unnecessary warnings from the TensorFlow library, making the program's later output much easier to read.

Before we start the main while loop of the code, we first need to define some variables, such as the saved model and the camera information for OpenCV.
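A sketch of that setup, with assumed file and variable names (the letter list skips J and Z, matching the dataset):

```python
# Model trained in the first part of the article
model = keras.models.load_model("smnist.h5")

# MediaPipe hand-tracking setup
mp_hands = mp.solutions.hands
mp_drawing = mp.solutions.drawing_utils
hands = mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.7)

# OpenCV camera setup
cap = cv2.VideoCapture(0)
_, frame = cap.read()
h, w, _ = frame.shape        # frame dimensions, used for the bounding box

# Classes the model can predict, in the same order as the one-hot labels
letterpred = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M",
              "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y"]
```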

Each of the variables set here falls into one of four categories. The first pertains directly to the model we trained in the first part of this article. The second and third sections define the variables required to start MediaPipe and OpenCV. The final section defines the letter classes used later to analyze a detected frame and cross-reference the model's output.

The next part of this program is the main while True loop, in which most of the program runs.
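A sketch of that loop, with assumed bounding-box margins and key bindings:

```python
while True:
    _, frame = cap.read()
    cleanframe = frame.copy()                            # unannotated copy, used later for analysis
    framergb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)    # MediaPipe expects RGB input
    result = hands.process(framergb)

    if result.multi_hand_landmarks:
        for handLMs in result.multi_hand_landmarks:
            # Build a bounding box from the normalized landmark coordinates
            xs = [lm.x * w for lm in handLMs.landmark]
            ys = [lm.y * h for lm in handLMs.landmark]
            x_min, x_max = int(min(xs)) - 20, int(max(xs)) + 20
            y_min, y_max = int(min(ys)) - 20, int(max(ys)) + 20
            cv2.rectangle(frame, (x_min, y_min), (x_max, y_max), (0, 255, 0), 2)
            mp_drawing.draw_landmarks(frame, handLMs, mp_hands.HAND_CONNECTIONS)

    cv2.imshow("Frame", frame)
    key = cv2.waitKey(1) & 0xFF
    if key == 27:                                        # press Esc to quit
        break
```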

This section of the program takes the input from your camera and uses OpenCV to display it back in a new window. Then, using the MediaPipe library, we detect the major landmarks of the hand, such as the fingers and palm, and draw a bounding box around the hand.

Image of Hand Annotations from Mediapipe, by Author

The idea of a bounding box is a crucial component of all forms of image classification and analysis. The box lets the model focus directly on the portion of the image relevant to the task. Without it, the algorithm finds patterns in the wrong places, which can lead to incorrect results.

For instance, during training, the lack of a bounding box can lead the model to correlate background features of an image, such as a clock or a chair, with a label. The program might then notice the clock in the frame and decide which Sign Language character is being shown based solely on the fact that a clock is present.

Previous Image with a highlighted clock, by Author

Almost done! The second-to-last part of the program captures a single frame on cue and crops it to the dimensions of the bounding box.
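A sketch of that capture-and-crop step, continuing inside the loop above; the space-bar trigger and crop margins are assumptions:

```python
# Still inside the main loop: press the space bar to analyze the current frame
if key == ord(" ") and result.multi_hand_landmarks:
    analysisframe = cv2.cvtColor(cleanframe, cv2.COLOR_BGR2GRAY)
    analysisframe = analysisframe[max(y_min, 0):y_max, max(x_min, 0):x_max]   # crop to the hand bounding box
    analysisframe = cv2.resize(analysisframe, (28, 28))                       # match the 28x28 training images

    # Flatten to one row, normalize like the training data, and reshape for the model
    datapixels = pd.DataFrame(analysisframe.reshape(1, -1)) / 255.0
    pixeldata = datapixels.values.reshape(-1, 28, 28, 1)
```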

This code looks very similar to the previous portion of the program because it relies on the same bounding box. However, in this analysis section, instead of just drawing a rectangle around the hand, we crop the frame to the box and use OpenCV's resize feature to scale it to the model's input dimensions. Along with this, we use NumPy and OpenCV to modify the image so it has the same characteristics as the images the model was trained on, and we use pandas to build a dataframe from the captured pixel data so it can be normalized the same way we did during model creation.

Modified Image of Hand, by Author

Towards the top of the code, you may notice the frame being copied into a separate variable. This is due to how OpenCV handles images: when a frame is processed and drawn on by OpenCV, the changes are made in place on that frame, essentially saving them onto the image. Keeping a separate copy ensures that the frame displayed on screen stays independent of the picture the model is run on.

Finally, we need to run the trained model on the processed image and handle the information it outputs.
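A sketch of that final step, reusing the assumed variable names from above:

```python
# Run the trained CNN on the preprocessed frame
prediction = model.predict(pixeldata)        # shape (1, 24): one probability per class
predarray = np.array(prediction[0])

# Pair each probability with its letter, then find the three most likely classes
letter_prediction_dict = {letterpred[i]: predarray[i] for i in range(len(letterpred))}
predarrayordered = sorted(predarray, reverse=True)
high1, high2, high3 = predarrayordered[0], predarrayordered[1], predarrayordered[2]

# Print the top three candidate characters with their confidence
for letter, prob in letter_prediction_dict.items():
    if prob == high1:
        print("Predicted Character 1:", letter, "Confidence:", 100 * prob)
    elif prob == high2:
        print("Predicted Character 2:", letter, "Confidence:", 100 * prob)
    elif prob == high3:
        print("Predicted Character 3:", letter, "Confidence:", 100 * prob)
```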

There is a lot going on in this section of the code, so let's dissect it piece by piece.

The first two lines compute the model's predicted probability that the hand image belongs to each of the classes. model.predict() returns a two-dimensional array with one row per input image; since we pass in a single frame, the first (and only) row holds the 24 class probabilities. Converting that row into a NumPy array makes it easy to parse the information in a more Pythonic form.

From here, we use the previously created list of classes stored in letterpred to build a dictionary, matching the values from the prediction to their keys. This allows us to match each character's probability with the class it corresponds to.

Following this step, we sort the probabilities from highest to lowest, which lets us take the top three entries and designate them as the three characters that most closely correspond to the Sign Language image shown.

Finally, we use a for loop to cycle through all of the key:value pairs in the dictionary, match the highest values back to their corresponding keys, and print the output with each character's probability.

Sign Language ‘A’, by Author

As shown, the model accurately predicts the character being shown from the camera. Along with the Predicted Character, the program also displays the confidence of the classification from the CNN Keras model.

The model developed here can be implemented in various ways, the main use being a captioning device for video calls such as FaceTime. To create such an application, the model would have to run frame by frame, predicting what sign is being shown at all times. Using other systems, we could also recognize when an individual is showing no sign, or is transitioning between signs, to more accurately judge the words being shown through ASL. This could be used to string together the letters being shown, eventually recognizing words and even sentences and creating a fully functioning Sign Language-to-text translator. Such a device would greatly increase ease of access to the benefits of virtual communication for those with hearing impairments.

This program allows for simple and easy communication from Sign Language to English through the use of Keras image analysis models. The code for this project can be found on my GitHub profile, linked below:

mg343/Sign-Language-Detection (github.com)
