Tutorial: Using Deep Learning and CNNs to make a Hand Gesture recognition model

Filipe Borba
Towards Data Science
7 min read · May 6, 2019


First, here’s the GitHub repository with the code. The project is a Jupyter Notebook, which can be uploaded to Google Colaboratory so you can work without environment issues.

Machine Learning is very useful for a variety of real-life problems. It is commonly used for tasks such as classification, recognition, detection and prediction. It is also an efficient way to automate processes that rely on data. The basic idea is to use data to produce a model capable of returning an output. That output may be the correct answer for a new input, or a prediction that follows the patterns in the known data.

The goal of this project is to train a Machine Learning algorithm capable of classifying images of different hand gestures, such as a fist, a palm, a thumbs up, and others. This particular classification problem can be useful for Gesture Navigation, for example. The method I’ll be using is Deep Learning with the help of Convolutional Neural Networks based on TensorFlow and Keras.

Deep Learning is part of a broader family of machine learning methods. It is based on the use of layers that process the input data, extracting features from it and producing a mathematical model. How this ‘model’ is created will become clearer in the next section. In this specific project, we’ll be aiming to classify different images of hand gestures, which means that the computer will have to “learn” the features of each gesture and classify them correctly. For example, if it is given an image of a hand doing a thumbs up gesture, the output of the model needs to be “the hand is doing a thumbs up gesture”. Let’s begin.

Loading Data

This project uses the Hand Gesture Recognition Database (citation below) available on Kaggle. It contains 20,000 images of different hands and hand gestures. The data set covers a total of 10 hand gestures performed by 10 different people: 5 female subjects and 5 male subjects. The images were captured using the Leap Motion hand tracking device.

Table 1 — Classification used for every hand gesture.

With that, we have to prepare the images to train the algorithm. We have to load all the images into an array that we will call X and all the labels into another array called y.

import cv2
import numpy as np

X = [] # Image data
y = [] # Labels
# Loops through imagepaths (the list of image file paths) to load images and labels into arrays
for path in imagepaths:
    img = cv2.imread(path) # Reads image and returns np.array
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) # Converts into the correct colorspace (GRAY)
    img = cv2.resize(img, (320, 120)) # Reduce image size so training can be faster
    X.append(img)

    # Processing label in image path
    category = path.split("/")[3]
    label = int(category.split("_")[0][1]) # Keeps only the second digit, so 10_down becomes label 0, or else it crashes
    y.append(label)

# Turn X and y into np.array to speed up train_test_split
X = np.array(X, dtype="uint8")
X = X.reshape(len(imagepaths), 120, 320, 1) # Needed to reshape so CNN knows it's different images
y = np.array(y)
print("Images loaded: ", len(X))
print("Labels loaded: ", len(y))
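To make the label parsing concrete, here is a small standalone sketch. The example paths and the helper name label_from_path are hypothetical; they assume a folder layout of ./leapGestRecog/<subject>/<NN_gesture>/<file>.png, which is what would make index 3 of the split path the gesture folder.

```python
# Hypothetical paths, assuming the layout ./leapGestRecog/<subject>/<NN_gesture>/<file>.png
example_paths = [
    "./leapGestRecog/00/01_palm/frame_00_01_0001.png",
    "./leapGestRecog/03/05_thumb/frame_03_05_0042.png",
    "./leapGestRecog/09/10_down/frame_09_10_0200.png",
]

def label_from_path(path):
    """Return the integer class label encoded in the gesture folder name."""
    category = path.split("/")[3]      # e.g. "10_down"
    prefix = category.split("_")[0]    # e.g. "10"
    return int(prefix[1])              # second digit only, so "10" becomes 0

labels = [label_from_path(p) for p in example_paths]
print(labels)  # [1, 5, 0]
```

Taking only the second digit keeps every label in the range 0 to 9, which is what a 10-unit softmax output trained with sparse_categorical_crossentropy expects.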

Scikit-learn’s train_test_split allows us to split our data into a training set and a test set. The training set will be used to build our model. Then, the test data will be used to check if our predictions are correct. A random_state seed is used so that the random split can be reproduced. The function also shuffles the images before splitting, so both sets end up with a mix of all gestures rather than following the order in which the files were loaded.

from sklearn.model_selection import train_test_split

# Fraction of images that we want to use for testing.
# The rest is used for training.
ts = 0.3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=ts, random_state=42)
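To see what shuffling and splitting with a fixed seed means, here is a rough NumPy-only sketch of the idea. It is not scikit-learn's implementation (details such as rounding and stratification differ), and the helper name shuffle_split and the toy arrays are made up for illustration.

```python
import numpy as np

def shuffle_split(X, y, test_size, seed):
    """Shuffle the sample indices with a fixed seed, then cut off a test portion."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy stand-ins for the image and label arrays
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = shuffle_split(X_toy, y_toy, test_size=0.3, seed=42)
print(X_tr.shape, X_te.shape)  # (7, 2) (3, 2)

# The same seed reproduces the same split
_, _, _, y_te_again = shuffle_split(X_toy, y_toy, test_size=0.3, seed=42)
print(np.array_equal(y_te, y_te_again))  # True
```

Because images and labels are indexed with the same shuffled indices, each image stays paired with its label.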

Creating Model

To simplify the idea of the model being constructed here, we’re going to use the concept of Linear Regression. By using linear regression, we can create a simple model and represent it using the equation y = ax + b.
a and b (slope and intercept, respectively) are the parameters that we’re trying to find. By finding the best parameters, for any given value of x, we can predict y. This is the same idea here, but much more complex, with the use of Convolutional Neural Networks.
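As a quick illustration of "finding the best parameters", the sketch below fits y = ax + b to a handful of toy points with NumPy's least-squares polynomial fit. The data values are made up for the example and were generated from a = 2 and b = 1.

```python
import numpy as np

# Toy data generated from y = 2x + 1 (values chosen only for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Least-squares fit of a degree-1 polynomial returns [slope, intercept]
a, b = np.polyfit(x, y, deg=1)
print(round(float(a), 3), round(float(b), 3))

# With the parameters known, we can predict y for a new x
x_new = 10.0
y_new = a * x_new + b
print(round(float(y_new), 3))
```

A CNN follows the same pattern of learning parameters from data, only with a very large number of parameters instead of two.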

A Convolutional Neural Network (ConvNet/CNN) is a Deep Learning algorithm which can take in an input image, assign importance (learnable weights and biases) to various aspects/objects in the image and be able to differentiate one from the other. The pre-processing required in a ConvNet is much lower as compared to other classification algorithms. While in primitive methods filters are hand-engineered, with enough training, CNNs have the ability to learn these filters/characteristics.

Figure 1 — Example of Convolutional Neural Network.

From Figure 1 and imagining the Linear Regression model equation that we talked about, we can imagine that the input layer is x and the output layer is y. The hidden layers vary from model to model, but they are used to “learn” the parameters for our model. Each one has a different function, but they work towards getting the best “slope and intercept”.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Construction of model
model = Sequential()
model.add(Conv2D(32, (5, 5), activation='relu', input_shape=(120, 320, 1)))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Flatten()) # Turns the 3-D feature maps into a 1-D vector for the dense layers
model.add(Dense(128, activation='relu'))
model.add(Dense(10, activation='softmax'))
# Configures the model for training
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Trains the model for a given number of epochs (iterations on a dataset) and validates it.
model.fit(X_train, y_train, epochs=5, batch_size=64, verbose=2, validation_data=(X_test, y_test))
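It can help to follow how the image shrinks as it passes through the convolution and pooling layers. The sketch below does that arithmetic by hand, assuming the Keras defaults (no padding and stride 1 for Conv2D, stride equal to the pool size for MaxPooling2D). The helper names are made up for this example.

```python
def conv_out(size, kernel):
    """Output length of a 'valid' convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool):
    """Output length of non-overlapping pooling (partial windows are dropped)."""
    return size // pool

h, w = 120, 320                      # input height and width
for kernel in (5, 3, 3):             # kernel sizes of the three Conv2D layers
    h, w = conv_out(h, kernel), conv_out(w, kernel)
    h, w = pool_out(h, 2), pool_out(w, 2)
    print(h, w)                      # 58 158, then 28 78, then 13 38

channels = 64                        # filters in the last Conv2D layer
flat_features = h * w * channels     # values left once the feature maps are flattened
print(flat_features)                 # 31616
```

So a 120x320 image is reduced to 64 feature maps of 13x38 values each before the dense layers make the final decision.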

CNNs apply a series of filters to the raw pixel data of an image to extract and learn higher-level features, which the model can then use for classification. CNNs contain three kinds of components:

  • Convolutional layers, which apply a specified number of convolution filters to the image. For each subregion, the layer performs a set of mathematical operations to produce a single value in the output feature map. Convolutional layers then typically apply a ReLU activation function to the output to introduce nonlinearities into the model.
  • Pooling layers, which downsample the image data extracted by the convolutional layers to reduce the dimensionality of the feature map in order to decrease processing time. A commonly used pooling algorithm is max pooling, which extracts subregions of the feature map (e.g., 2x2-pixel tiles), keeps their maximum value, and discards all other values.
  • Dense (fully connected) layers, which perform classification on the features extracted by the convolutional layers and downsampled by the pooling layers. In a dense layer, every node in the layer is connected to every node in the preceding layer.
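The first two components can be sketched in plain NumPy on a tiny image. This is only an illustration of the operations, not how Keras implements them: the 6x6 image and the vertical-edge filter are hand-picked for the example, whereas a real CNN learns its filter values during training. As in deep learning libraries, the kernel is slid over the image without being flipped.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Replace negative values with zero."""
    return np.maximum(x, 0)

def max_pool(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size tile."""
    h = feature_map.shape[0] // size
    w = feature_map.shape[1] // size
    trimmed = feature_map[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

# Toy 6x6 image: dark left half, bright right half
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-picked vertical-edge filter (a CNN would learn such values)
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

features = relu(conv2d_valid(image, kernel))   # shape (4, 4), each row is [0, 3, 3, 0]
pooled = max_pool(features)                    # shape (2, 2), every value is 3
print(features)
print(pooled)
```

The filter responds only where dark pixels meet bright ones, and pooling keeps that strong response while halving each dimension of the feature map.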

Testing Model

Now that we have the model compiled and trained, we need to check if it’s good. First, we run ‘model.evaluate’ to test the accuracy. Then, we make predictions and plot the images along with the predicted labels and true labels to check everything. With that, we can see how our algorithm is working.
Finally, we produce a confusion matrix, which is a table layout that shows how often each actual class was predicted as each possible class.

test_loss, test_acc = model.evaluate(X_test, y_test)
print('Test accuracy: {:2.2f}%'.format(test_acc*100))

6000/6000 [=====================] — 39s 6ms/step

Test accuracy: 99.98%

import pandas as pd
from sklearn.metrics import confusion_matrix

predictions = model.predict(X_test) # Make predictions towards the test set
y_pred = np.argmax(predictions, axis=1) # Transform predictions into 1-D array with label number
# H = Horizontal
# V = Vertical
pd.DataFrame(confusion_matrix(y_test, y_pred),
             columns=["Predicted Thumb Down", "Predicted Palm (H)", "Predicted L", "Predicted Fist (H)", "Predicted Fist (V)", "Predicted Thumbs up", "Predicted Index", "Predicted OK", "Predicted Palm (V)", "Predicted C"],
             index=["Actual Thumb Down", "Actual Palm (H)", "Actual L", "Actual Fist (H)", "Actual Fist (V)", "Actual Thumbs up", "Actual Index", "Actual OK", "Actual Palm (V)", "Actual C"])
Figure 3 — Confusion matrix showing the predicted outcomes and the actual image label.
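For readers who want to see what happens between the raw predictions and the confusion matrix, here is a small NumPy sketch with made-up numbers: 5 images and 3 classes instead of 6000 images and 10 gestures. The probabilities and true labels are invented for illustration only.

```python
import numpy as np

# Made-up softmax outputs for 5 images and 3 classes (each row sums to 1)
probs = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.30, 0.50],
    [0.60, 0.30, 0.10],
    [0.05, 0.15, 0.80],
])
y_true = np.array([0, 1, 2, 1, 2])

# The index of the largest probability in each row is the predicted label
y_hat = np.argmax(probs, axis=1)

# Rows are actual classes, columns are predicted classes
n_classes = probs.shape[1]
cm = np.zeros((n_classes, n_classes), dtype=int)
for actual, predicted in zip(y_true, y_hat):
    cm[actual, predicted] += 1

# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()
print(y_hat)     # [0 1 2 0 2]
print(cm)
print(accuracy)  # 0.8
```

Here the fourth image (actual class 1) is predicted as class 0, so it shows up off the diagonal in row 1, column 0.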


Conclusion

Based on the results presented in the previous section, we can conclude that our Deep Learning model successfully classifies images of different hand gestures with high accuracy (>95%) on the test set.

The accuracy of our model is directly influenced by a few aspects of our problem. The gestures presented are reasonably distinct, and the images are clear and have no background. Also, there is a reasonable quantity of images, which makes our model more robust. The drawback is that for different problems, we would probably need more data to steer the parameters of our model in a better direction. Moreover, a deep learning model is very hard to interpret, given its abstractions.
However, by using this approach it becomes much easier to start working on the actual problem, since we don’t have to account for feature engineering. This means that we don’t need to pre-process the images with edge or blob detectors to extract the important features; the CNN does it for us. Also, it can be adapted to new problems relatively easily, with generally good performance.

As mentioned, another approach to this problem would be to use feature engineering, such as binary thresholding (check area of the hand), circle detection and others to detect unique characteristics on the images. However, with our CNN approach, we don’t have to worry about any of these.
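To give a feel for what such a hand-crafted feature looks like, here is a minimal sketch of binary thresholding used to estimate the area of the hand. The threshold value of 128, the helper name and the toy image are assumptions made for illustration; they are not part of the project.

```python
import numpy as np

def hand_area_fraction(gray_image, threshold=128):
    """Fraction of pixels brighter than the threshold (a crude 'hand area' feature)."""
    mask = gray_image > threshold
    return mask.mean()

# Toy 4x4 grayscale image: a bright 2x2 "hand" on a dark background
toy = np.zeros((4, 4), dtype=np.uint8)
toy[1:3, 1:3] = 200

print(hand_area_fraction(toy))  # 0.25
```

With feature engineering, we would have to design, tune and combine many features like this one by hand.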

Any questions? Feel free to open an issue on the GitHub repository!


T. Mantecón, C.R. del Blanco, F. Jaureguizar, N. García, “Hand Gesture Recognition using Infrared Imagery Provided by Leap Motion Controller”, Int. Conf. on Advanced Concepts for Intelligent Vision Systems, ACIVS 2016, Lecce, Italy, pp. 47–57, 24–27 Oct. 2016. (doi: 10.1007/978-3-319-48680-2_5)

