TL;DR: This post presents a dual-cam first-person vision sign language translation system using convolutional neural networks. A prototype was developed to recognize 24 gestures. The vision system is composed of a head-mounted camera and a chest-mounted camera, and the machine learning model is composed of two convolutional neural networks, one for each camera.
Introduction
Sign language recognition is a problem that has been addressed in research for years; however, we are still far from a complete solution being available in our society.
Most of the work developed to address this problem has been based on two approaches: contact-based systems, such as sensor gloves, or vision-based systems, using only cameras. The latter is far cheaper, and the rise of deep learning makes it even more appealing.
This post presents a prototype of a dual-cam first-person vision translation system for sign language using convolutional neural networks. The post is divided into three main parts: the system design, the dataset, and the deep learning model training and evaluation.
VISION SYSTEM
Vision is a key factor in sign language: every sign language is meant to be understood by a person standing in front of the signer, because from that perspective a gesture is fully observable. Viewing a gesture from another perspective makes it difficult or almost impossible to understand, since not every finger position and movement is observable.
Trying to understand sign language from a first-person perspective has the same limitation: some gestures end up looking the same. This ambiguity can be resolved by placing additional cameras at different positions, so that what one camera cannot see is clearly observable by another.
The vision system is composed of two cameras: a head-mounted camera and a chest-mounted camera. These two cameras provide two different views of a sign, a top view and a bottom view, that work together to identify signs.
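As a rough illustration of this capture setup, here is a minimal sketch of how frames could be grabbed from both cameras with OpenCV. The device indices are assumptions, and the original prototype may have used different capture code.

```python
# Minimal sketch: grab frames from the two cameras of the vision system.
# Device indices 0 and 1 are assumptions; they depend on how the cameras enumerate.
import cv2

head_cam = cv2.VideoCapture(0)   # head-mounted camera (top view)
chest_cam = cv2.VideoCapture(1)  # chest-mounted camera (bottom view)

while True:
    ok_top, top_frame = head_cam.read()
    ok_bottom, bottom_frame = chest_cam.read()
    if not (ok_top and ok_bottom):
        break
    # Each frame would later be fed to its own convolutional neural network.
    cv2.imshow("top view", top_frame)
    cv2.imshow("bottom view", bottom_frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

head_cam.release()
chest_cam.release()
cv2.destroyAllWindows()
```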

Another benefit of this design is that the user gains autonomy, something that is not achieved in classical approaches, where the user is not the person with the disability but a third person who has to hold up a camera and point it at the signer while the sign is being performed.
DATASET
To develop the first prototype of this system, a dataset of 24 static signs from the Panamanian Manual Alphabet was used.
To model this problem as an image recognition problem, dynamic gestures such as the letters J, Z, RR, and Ñ were discarded because of the extra complexity they add to the solution.
Data collection and preprocessing
To collect the dataset, four users were asked to wear the vision system and perform every gesture for 10 seconds while both cameras recorded at a 640×480 pixel resolution.
The users were asked to repeat this process in three different scenarios: indoors, outdoors, and against a green background. For the indoor and outdoor scenarios, the users were asked to move around while performing the gestures in order to obtain images with different backgrounds, light sources, and positions. The green background scenario was intended for a data augmentation process described later.
After recording the videos, the frames were extracted and resized to a 125×125 pixel resolution.
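The extraction step could look like the following sketch, assuming OpenCV; the folder layout and file names are illustrative, not the ones used in the project.

```python
# Sketch of the frame extraction step: read a recorded video, grab its frames,
# and resize them to 125x125 pixels. Paths and naming are illustrative.
import os
import cv2

def extract_frames(video_path, output_dir, size=(125, 125)):
    os.makedirs(output_dir, exist_ok=True)
    capture = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        small = cv2.resize(frame, size)
        cv2.imwrite(os.path.join(output_dir, f"frame_{index:05d}.png"), small)
        index += 1
    capture.release()

# Hypothetical paths for one user, one sign, one camera view.
extract_frames("recordings/user1_sign_A_top.avi", "dataset/top/A")
```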

Data augmentation
Since the preprocessing before the convolutional neural networks is reduced to simple rescaling, the background always gets passed to the model. The model therefore needs to recognize a sign regardless of the background it appears against.
To improve the generalization capability of the model, additional images were generated artificially by replacing the green backgrounds with different backgrounds. This way more data is obtained without investing too much time.
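A minimal sketch of this background replacement, assuming OpenCV and a simple chroma-key mask in HSV space; the thresholds and file paths are assumptions and would need tuning for the actual footage.

```python
# Sketch of the green-screen replacement: mask the green pixels in HSV space
# and paste a random background behind the hand. Thresholds are rough guesses.
import cv2
import numpy as np

def replace_green_background(frame, background):
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    lower = np.array([35, 60, 60])
    upper = np.array([85, 255, 255])
    green_mask = cv2.inRange(hsv, lower, upper)
    background = cv2.resize(background, (frame.shape[1], frame.shape[0]))
    result = frame.copy()
    result[green_mask > 0] = background[green_mask > 0]
    return result

# Illustrative usage with hypothetical paths.
frame = cv2.imread("dataset/green/A/frame_00001.png")
background = cv2.imread("backgrounds/street_01.jpg")
augmented = replace_green_background(frame, background)
cv2.imwrite("dataset/augmented/A/frame_00001.png", augmented)
```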

During training, another data augmentation process was added, consisting of transformations such as rotations, changes in light intensity, and rescaling.
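This kind of on-the-fly augmentation could be set up as in the sketch below, assuming Keras; the exact ranges used in the project are not documented here, so these values are illustrative.

```python
# Sketch of training-time augmentation with Keras: random rotations,
# brightness changes, and zoom/rescaling. Ranges are illustrative.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,             # rescale pixel values to [0, 1]
    rotation_range=15,             # small random rotations
    brightness_range=(0.7, 1.3),   # changes in light intensity
    zoom_range=0.1,                # rescaling / zoom
)

train_generator = train_datagen.flow_from_directory(
    "dataset/bottom/train",        # hypothetical directory layout
    target_size=(125, 125),
    batch_size=32,
    class_mode="categorical",
)
```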

These two data augmentation processes were chosen to help improve the generalization capability of the models.
Top view and bottom view datasets
This problem was modeled as a multiclass classification problem with 24 classes, and it was divided into two smaller multiclass classification problems.
The approach for deciding which gestures would be classified by the top view model and which by the bottom view model was to assign all the gestures that looked too similar from the bottom view perspective to the top view model, and to leave the rest to the bottom view model. In short, the top view model was used to resolve ambiguities.
As a result, the dataset was divided into two parts, one for each model as shown in the following table.

DEEP LEARNING MODEL
As the state-of-the-art technology for image recognition, convolutional neural networks were the option chosen for this problem. Two models were trained: one for the top view and one for the bottom view.
Architecture
The same convolutional neural network architecture was used for both the top view and the bottom view models; the only difference is the number of output units.
The architecture of the convolutional neural networks is shown in the following figure.

To improve the generalization capability of the models, dropout was applied between the fully connected layers.
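As a generic sketch of this kind of network, assuming Keras: only the 125×125 input size, the dropout in the fully connected block, and the per-model number of output units come from the post; the layer sizes and class counts below are illustrative, not the exact ones from the architecture figure.

```python
# Generic sketch of a small CNN with dropout in the fully connected block.
# Layer sizes and class counts are illustrative.
from tensorflow.keras import layers, models

def build_model(n_classes):
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=(125, 125, 3)),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),          # dropout between fully connected layers
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

top_view_model = build_model(n_classes=10)     # illustrative class count
bottom_view_model = build_model(n_classes=14)  # illustrative class count
```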
Evaluation
The models were evaluated on a test set with data corresponding to normal indoor use of the system; in other words, a person acting as the observer appears in the background, similar to the input image in the figure above (convolutional neural network architecture). The results are shown below.
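For reference, a per-class evaluation of this kind could be computed as in the following sketch, assuming a saved Keras model and scikit-learn for the report; the file and directory names are illustrative.

```python
# Sketch of the test-set evaluation: predict on the held-out images and print
# a per-class report. Model file name and directory layout are hypothetical.
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from sklearn.metrics import classification_report

bottom_view_model = load_model("bottom_view_model.h5")

test_generator = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    "dataset/bottom/test",
    target_size=(125, 125),
    batch_size=32,
    class_mode="categorical",
    shuffle=False,
)

predictions = bottom_view_model.predict(test_generator)
y_pred = np.argmax(predictions, axis=1)
y_true = test_generator.classes

print(classification_report(y_true, y_pred,
                            target_names=list(test_generator.class_indices)))
```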

Although the models learned to classify some signs, such as Q, R, and H, the overall results are rather discouraging; it seems the generalization capability of the models was not very good. However, the models were also tested with real-time data, showing the potential of the system.
The bottom view model was tested with real-time video against a uniform green background. I wore the chest-mounted camera, which captured video at 5 frames per second, while running the bottom view model on my laptop and trying to fingerspell the word fútbol (my favorite sport, in Spanish). The entry of each letter was triggered by a click. The results are shown in the following video.
Note: due to the model's performance, I had to repeat the attempt several times until I ended up with a good demo video.
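The real-time demo loop could look roughly like the sketch below, assuming OpenCV and a saved Keras model; the model file name and class list are assumptions, and a key press stands in for the mouse click that confirmed each letter in the actual demo.

```python
# Sketch of the real-time demo: classify each chest-camera frame and append
# the predicted letter when the user confirms it. Names are hypothetical.
import cv2
import numpy as np
from tensorflow.keras.models import load_model

# Must match the class order the model was trained on; the full static
# alphabet here is only illustrative (the bottom view model used a subset).
CLASS_NAMES = list("ABCDEFGHIKLMNOPQRSTUVWXY")

model = load_model("bottom_view_model.h5")
camera = cv2.VideoCapture(0)
word = ""

while True:
    ok, frame = camera.read()
    if not ok:
        break
    small = cv2.resize(frame, (125, 125)).astype("float32") / 255.0
    probabilities = model.predict(np.expand_dims(small, axis=0), verbose=0)[0]
    letter = CLASS_NAMES[int(np.argmax(probabilities))]

    cv2.putText(frame, f"{letter}  ->  {word}", (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    cv2.imshow("bottom view", frame)

    key = cv2.waitKey(200) & 0xFF   # roughly 5 frames per second
    if key == ord(" "):             # confirm the current letter
        word += letter
    elif key == ord("q"):
        break

camera.release()
cv2.destroyAllWindows()
```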
Conclusions
Sign language recognition is a hard problem if we consider all the possible combinations of gestures that a system of this kind needs to understand and translate. That being said, probably the best way to solve this problem is to divide it into simpler problems, and the system presented here would correspond to a possible solution to one of them.
The system didn’t perform too well but it was demonstrated that it can be built a first-person sign language translation system using only cameras and convolutional neural networks.
It was observed that the model tends to confuse several signs with each other, such as U and W. But thinking a bit about it, maybe it doesn’t need to have a perfect performance since using an orthography corrector or a word predictor would increase the translation accuracy.
The next step is to analyze the solution and study ways to improve the system. Some improvements could be carrying by collecting more quality data, trying more convolutional neural network architectures, or redesigning the vision system.
End words
I developed this project as part of my thesis work at university, and I was motivated by the feeling of working on something new. Although the results weren't great, I think this can be a good starting point for building a better and bigger system.
If you are interested in this work, here is the link to my thesis (it is written in Spanish).
Thanks for reading!