
Human Emotion and Gesture Detector Using Deep Learning: Part-2

Diving Deeper into Human Emotion and Gesture Recognition

Emotion Gesture Detection

Hello everyone! Welcome back to Part 2 of the human emotion and gesture detector using deep learning. In case you haven't already, check out Part 1 here. In this article, we will cover the training of our gestures model and also look at a way to achieve higher accuracy on the emotions model. Finally, we will create a final pipeline using computer vision through which we can access our webcam and get a vocal response from the models we have trained. Without further ado, let's start coding and understanding the concepts.

Source: Franck-v-unsplash

For training the gestures model, we will use transfer learning. We will use the VGG-16 architecture, excluding its top layer, and then add our own custom layers to improve the accuracy and reduce the loss. We will aim for an overall high accuracy of about 95% on our gestures model: we have a fairly balanced dataset, and with image data augmentation and the VGG-16 transfer learning model this can be achieved easily, and in fewer epochs than our emotions model required. In a future article, we will cover how exactly the VGG-16 architecture works, but for now let us analyze the data at hand and perform an exploratory data analysis on the gestures dataset, similar to the one we performed on the emotions dataset after extracting the images.

EXPLORATORY DATA ANALYSIS (EDA):

In this next code block, we will look at the contents of the train folder and figure out the total number of classes, i.e., the gesture categories, that we have in the train folder.
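A minimal sketch of how such a check might look, assuming the training images live in a train1 directory with one sub-folder per gesture class (as described later in this section):

```python
import os

# Path to the training images: one sub-folder per gesture class (train1 as described below).
train_dir = "train1"

# List the class sub-folders and count the images inside each one.
classes = sorted(os.listdir(train_dir))
print("Classes found:", classes)

for label in classes:
    num_images = len(os.listdir(os.path.join(train_dir, label)))
    print(f"{label}: {num_images} images")
```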

Train:

We can see the four sub-folders present in the train1 folder. Let us visually look at the number of images in these directories.
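A short sketch of how such a bar graph can be plotted with matplotlib; the styling is purely illustrative, and the train1 layout is the one assumed above:

```python
import os
import matplotlib.pyplot as plt

train_dir = "train1"  # one sub-folder per gesture class, as above

# Count the images in each class sub-folder.
counts = {label: len(os.listdir(os.path.join(train_dir, label)))
          for label in sorted(os.listdir(train_dir))}

# Bar chart of images per class to check that the dataset is balanced.
plt.figure(figsize=(8, 5))
plt.bar(list(counts.keys()), list(counts.values()))
plt.xlabel("Gesture class")
plt.ylabel("Number of images")
plt.title("Images per class in the train directory")
plt.show()
```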

Bar Graph:

We can notice from the bar graph that each of the directories contains 2,400 images, so this is a completely balanced dataset. Now, let us proceed to visualize the images in the train directory. We will look at the first image in each of the sub-directories and then check the dimensions and number of channels of the images present in these folders.
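A sketch of how the first image in each sub-directory can be displayed and its shape inspected, using OpenCV and matplotlib (the train1 path and the file ordering are assumptions):

```python
import os
import cv2
import matplotlib.pyplot as plt

train_dir = "train1"  # one sub-folder per gesture class
classes = sorted(os.listdir(train_dir))

plt.figure(figsize=(12, 4))
for i, label in enumerate(classes):
    # Read the first image of this class and report its (height, width, channels).
    first_file = sorted(os.listdir(os.path.join(train_dir, label)))[0]
    image = cv2.imread(os.path.join(train_dir, label, first_file))
    print(label, "image shape:", image.shape)

    # OpenCV loads images in BGR order; convert to RGB for matplotlib.
    plt.subplot(1, len(classes), i + 1)
    plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    plt.title(label)
    plt.axis("off")
plt.show()
```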

The dimensions of the images are as follows:

  1. Height of the image = 200 pixels
  2. Width of the image = 200 pixels
  3. Number of channels = 3

Similarly, we can perform the same analysis on the validation1 directory and see what our validation dataset and the validation images look like.

Validation:

Bar Graph:

We can notice from the bar graph that each of the directories contains 600 images, so this is also a completely balanced dataset. Now, let us proceed to visualize the images in the validation directory. We will look at the first image in each of the sub-directories. The dimensions and number of channels of the images present in these folders are the same as in the train directory.

With this, our exploratory data analysis (EDA) of the gestures dataset is complete. We can proceed to build the gestures training model for appropriate gesture prediction.

Gestures Train Model:

Let us look at the code block below to understand the libraries we are importing, as well as how we set the number of classes, the image dimensions, and the respective directories.

Import all the important Deep Learning libraries required to train the gestures model. Keras is an Application Programming Interface (API) that runs on top of TensorFlow. TensorFlow will be the main deep learning module we use to build our model. From TensorFlow, we will load a pre-trained model called VGG-16. We will use VGG-16 with custom convolutional neural network (CNN) layers, i.e., we will use the VGG-16 transfer learning model alongside our own custom model to train an overall accurate model. The VGG-16 model in Keras comes pre-trained with the ImageNet weights.

The ImageDataGenerator is used for data augmentation, through which the model sees more variations of the training images. Data augmentation creates transformed replications of the original images and uses those transformations in each epoch. The layers that will be used for training are as follows:

  1. Input = The input layer in which we pass the input shape.
  2. Conv2D = The convolutional layer, combined with the Input layer, to provide an output of tensors.
  3. MaxPool2D = Downsamples the output from the convolutional layer.
  4. Batch normalization = It is a technique for training very deep neural networks that standardizes the inputs to a layer for each mini-batch. This has the effect of stabilizing the learning process and dramatically reducing the number of training epochs required to train deep networks.
  5. Dropout = Dropout is a technique where randomly selected neurons are ignored during training. They are "dropped-out" randomly and this prevents over-fitting.
  6. Dense = Fully Connected layers.
  7. Flatten = Flatten the entire structure to a 1-D array.

Models can be built in a functional, model-like structure, as shown for this particular model, or in a sequential manner. Here, we will use a functional API structure, unlike our emotions model, which was a Sequential model. We can use L2 regularization for fine-tuning. The optimizer used will be Adam, as it performs better than the other optimizers on this model. We are also importing the os module to make the paths compatible with the Windows environment.

We have 4 classes of gestures, namely Punch, Victory, Super, and Loser. Each image has a height and width of 200 pixels and is an RGB image, i.e., it has 3 channels. We will use a batch_size of 128 for the image data augmentation.

We will also specify the train and the validation directory for the stored images. train_dir is the directory that will contain the set of images for training. validation_dir is the directory that will contain the set of validation images.
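A sketch of how these imports and settings might look; the variable names are assumptions, but the values follow the description above (4 classes, 200 x 200 RGB images, a batch size of 128, and the train1/validation1 directories):

```python
import os

from tensorflow.keras.applications import VGG16
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.layers import (Input, Conv2D, MaxPool2D, BatchNormalization,
                                     Dropout, Dense, Flatten)
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import Adam

# Number of gesture classes: Punch, Victory, Super, and Loser.
num_classes = 4

# Each image is 200 x 200 pixels with 3 channels (RGB).
img_height, img_width, num_channels = 200, 200, 3

# Batch size used by the image data augmentation generators.
batch_size = 128

# Directories holding the training and validation images.
train_dir = os.path.join(os.getcwd(), "train1")
validation_dir = os.path.join(os.getcwd(), "validation1")
```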

DATA AUGMENTATION:

We will look at the image data augmentation for the gestures dataset which is similar to the emotions data.

The ImageDataGenerator is used for data augmentation of images. It generates randomly transformed copies of the original images, and the Keras data generator feeds these copies, rather than the originals, to the model. This is useful for training at each epoch.

We will be rescaling the image and updating all the parameters to suit our model:

  1. rescale = Rescaling by 1./255 to normalize each of the pixel values
  2. rotation_range = specifies the random range of rotation
  3. shear_range = Specifies the shear intensity (the shear angle in the counter-clockwise direction).
  4. zoom_range = Specifies the zoom range.
  5. width_shift_range = Specifies the fraction of the total width for random horizontal shifts.
  6. height_shift_range = Specifies the fraction of the total height for random vertical shifts.
  7. horizontal_flip = Flip the images horizontally.
  8. fill_mode = Fill according to the closest boundaries.

train_datagen.flow_from_directory takes the path to a directory and generates batches of augmented data (a sketch combining the generator and this call follows the list below). Its main arguments are as follows:

  1. train_dir = Specifies the directory where we have stored the image data.
  2. color_mode = Specifies whether our images are read in grayscale or RGB format. The default is RGB.
  3. target_size = The Dimensions of the image.
  4. batch_size = The number of batches of data for the flow operation.
  5. class_mode = Determines the type of label arrays that are returned. "categorical" will be 2D one-hot encoded labels.
  6. shuffle = Whether to shuffle the data (default: True). If set to False, the data is sorted in alphanumeric order.
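Here is a sketch combining the data generator and the flow_from_directory call, reusing the directories and batch size defined earlier; the specific augmentation values are illustrative rather than the exact ones used in the original notebook:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augment the training images on the fly; the validation images are only rescaled.
train_datagen = ImageDataGenerator(rescale=1./255,
                                   rotation_range=30,
                                   shear_range=0.3,
                                   zoom_range=0.3,
                                   width_shift_range=0.2,
                                   height_shift_range=0.2,
                                   horizontal_flip=True,
                                   fill_mode="nearest")

validation_datagen = ImageDataGenerator(rescale=1./255)

# Flow batches of 200 x 200 RGB images with one-hot (categorical) labels.
train_generator = train_datagen.flow_from_directory(train_dir,
                                                    color_mode="rgb",
                                                    target_size=(200, 200),
                                                    batch_size=batch_size,
                                                    class_mode="categorical",
                                                    shuffle=True)

validation_generator = validation_datagen.flow_from_directory(validation_dir,
                                                              color_mode="rgb",
                                                              target_size=(200, 200),
                                                              batch_size=batch_size,
                                                              class_mode="categorical",
                                                              shuffle=True)
```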

In the next code block, we are importing the VGG-16 model into the variable VGG16_MODEL and making sure we load the model without the top layer. Using the VGG-16 architecture without the top layer, we can now add our custom layers. To avoid training the VGG-16 layers, we set layer.trainable = False for each of them. We will also print out these layers and make sure their trainable attribute is set to False.
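A sketch of loading VGG-16 without its top layer and freezing its layers, as described above:

```python
from tensorflow.keras.applications import VGG16

# Load VGG-16 pre-trained on ImageNet, without the fully connected top layers.
VGG16_MODEL = VGG16(input_shape=(200, 200, 3), include_top=False, weights="imagenet")

# Freeze every VGG-16 layer so that only the custom layers added on top are trained.
for layer in VGG16_MODEL.layers:
    layer.trainable = False

# Print the layers and confirm that each one is frozen.
for layer in VGG16_MODEL.layers:
    print(layer.name, layer.trainable)
```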

FINGERS GESTURE MODEL:

Below is the complete code for the custom layers of the fingers gesture model we are building –

The finger gesture model we are building will be trained using transfer learning. We will use the VGG-16 model with no top layer and add custom layers on top of it, then use this transfer learning model for the prediction of the finger gestures. The custom layers take as input the output of the VGG-16 model. We add a convolutional layer with 32 filters, a kernel_size of (3,3), and default strides of (1,1), using relu activation with he_normal as the initializer. We use a pooling layer to downsample the output of the convolutional layer. After the sample is passed through a flatten layer, 2 fully connected (Dense) layers with relu activation follow. The output layer has a softmax activation with num_classes = 4, which predicts the probabilities for the classes, namely Punch, Super, Victory, and Loser. The final model takes the start of the VGG-16 model as its input and the custom softmax layer as its output.
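A minimal sketch of a custom head matching that description; the dense layer sizes and the dropout are assumptions, while the 32-filter (3,3) convolution, the pooling, the flatten, the two relu dense layers, and the 4-way softmax follow the description above:

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Conv2D, MaxPool2D, Flatten, Dense, Dropout
from tensorflow.keras.models import Model

num_classes = 4  # Punch, Super, Victory, and Loser

# Frozen VGG-16 base without the top layer (see the previous code block).
VGG16_MODEL = VGG16(input_shape=(200, 200, 3), include_top=False, weights="imagenet")
VGG16_MODEL.trainable = False

# The input to the custom layers is the output of the VGG-16 base.
x = VGG16_MODEL.output

# Convolutional layer: 32 filters, (3, 3) kernel, default (1, 1) strides,
# relu activation with the he_normal initializer.
x = Conv2D(32, (3, 3), activation="relu", kernel_initializer="he_normal")(x)

# Downsample the output of the convolutional layer.
x = MaxPool2D(pool_size=(2, 2))(x)

# Flatten before the fully connected layers.
x = Flatten()(x)

# Two fully connected layers with relu activation (sizes assumed).
x = Dense(256, activation="relu")(x)
x = Dropout(0.5)(x)
x = Dense(128, activation="relu")(x)

# Output layer: softmax over the 4 gesture classes.
output = Dense(num_classes, activation="softmax")(x)

# The final model runs from the VGG-16 input to the custom softmax output.
model = Model(inputs=VGG16_MODEL.input, outputs=output)
model.summary()
```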

The callbacks are similar to those of the previous emotions model, so let us directly move on to the compilation and training of the gestures model.

Compile and fit the model:

We are compiling and fitting our model in the final step. Here, we are training the model and saving the best weights to gesturenew.h5 so that we don't have to re-train the model repeatedly and can load the saved model when required. We train on the training data and validate on the validation data. The loss we have used is categorical_crossentropy, which computes the cross-entropy loss between the labels and the predictions. The optimizer we will be using is Adam with a learning rate of 0.001, and we will compile our model on the accuracy metric. We will fit the model on the augmented training and validation images. After the fitting step, these are the results we were able to achieve for the train and validation loss and accuracy.
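A sketch of this compile-and-fit step, reusing the model and generators from the earlier sketches; the epoch count is illustrative:

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint

# Compile with categorical cross-entropy, Adam (learning rate 0.001), and accuracy.
model.compile(loss="categorical_crossentropy",
              optimizer=Adam(learning_rate=0.001),
              metrics=["accuracy"])

# Save the best weights so that the model does not have to be re-trained every time.
checkpoint = ModelCheckpoint("gesturenew.h5",
                             monitor="val_loss",
                             save_best_only=True,
                             verbose=1)

# Fit on the augmented training images and validate on the validation generator.
history = model.fit(train_generator,
                    epochs=10,  # illustrative epoch count
                    validation_data=validation_generator,
                    callbacks=[checkpoint])
```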

Graph:

Observation:

The model is able to perform extremely well. We can notice that the train and validation losses decrease steadily while the train and validation accuracies increase steadily. There is no over-fitting in the deep learning model, and we are able to achieve a validation accuracy of over 95%.

BONUS:

EMOTIONS MODEL-2:

This is an additional model that we will be looking at. With this method, we can achieve higher accuracy with the exact same model. After some research and experimentation, I found that we could achieve higher accuracy by loading the pixels as numpy arrays and then training on them directly. There is a wonderful article where the author has used a similar approach, and I would highly recommend checking out that article as well. Here, we will use this approach with the custom sequential model and see what accuracy we are able to achieve. Import the libraries in the same way as for the previous emotions model; refer to the GitHub repository at the end of the post for additional information. Below is the code block for the complete preparation of data for the model.

num_classes = Defines the number of classes we have to predict, namely Angry, Fear, Happy, Neutral, Sad, Surprise, and Disgust. From the exploratory data analysis, we know the dimensions of the images: Image height = 48 pixels, image width = 48 pixels, and number of channels = 1, because the images are grayscale. We will use a batch size of 64 for this model.

In this method, we convert the pixel strings into arrays. We split each pixel string on spaces, convert it into an array, and reshape it to 48 x 48. We then expand the dimensions to add the channel axis and convert the labels into a categorical matrix.
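A sketch of that preparation step, assuming the commonly used fer2013.csv layout with an 'emotion' label column and a space-separated 'pixels' column (the file name and column names are assumptions):

```python
import numpy as np
import pandas as pd
from tensorflow.keras.utils import to_categorical

num_classes = 7
img_height, img_width = 48, 48

# Assumed CSV layout: an integer 'emotion' label column and a 'pixels' column
# containing 48 * 48 space-separated grayscale values per row.
data = pd.read_csv("fer2013.csv")

# Split each pixel string on spaces, convert it to an array, and reshape to 48 x 48.
faces = np.array([np.array(p.split(), dtype="float32").reshape(img_height, img_width)
                  for p in data["pixels"]])

# Expand the dimensions to add the single grayscale channel and normalize the pixels.
faces = np.expand_dims(faces, -1) / 255.0

# Convert the integer labels into a one-hot categorical matrix.
labels = to_categorical(data["emotion"].values, num_classes)

print(faces.shape, labels.shape)
```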

Finally, we split the data into train, test, and validation sets. This approach is slightly different from our previous model's approach, where we only used train and validation sets, dividing the data in an 80:20 ratio. Here, we divide the data in an 80:10:10 format. We will be using the same sequential model as in the previous part. Let us have a look at the model once again and see how it performs after training.
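One way to obtain the 80:10:10 split is to call scikit-learn's train_test_split twice, using the faces and labels arrays prepared in the sketch above:

```python
from sklearn.model_selection import train_test_split

# First hold out 20% of the data, then split that hold-out set half-and-half
# into validation and test sets, which gives an 80:10:10 split overall.
X_train, X_temp, y_train, y_temp = train_test_split(faces, labels,
                                                    test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp,
                                                test_size=0.5, random_state=42)

print(X_train.shape, X_val.shape, X_test.shape)
```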

The final accuracy, validation accuracy, loss, and validation loss we were able to achieve on all 7 emotions were as follows:

Graph:

Observation:

The model is able to perform quite well. We can notice that the train and validation losses decrease steadily while the train and validation accuracies increase steadily. There is no over-fitting in the deep learning model, and we are able to achieve a validation accuracy of over 65% and a training accuracy of almost 70%, while reducing the overall losses as well.

Recordings:

In this section, we will be creating the recordings required for the vocal response from the models. We can create custom recordings for each of the models and for each emotion or gesture. In the below code block, I will be showing an example for the recordings for one emotion and one gesture respectively.

Understanding the imported libraries:

  1. gTTS = Google Text-to-Speech is a Python library that we can use to convert text into a spoken audio response.
  2. playsound = This module is useful for playing sound directly from a specified path with a .mp3 format.
  3. shutil = This module offers several high-level operations on files and collections of files. In particular, functions are provided which support file copying, moving, and removal.

In this Python file, we will create all the required voice recordings for all the emotions as well as all the gestures, and we will store them in the reactions directory. I have shown an example below of how to create a custom voice recording for an emotion or gesture. The entire code for the recordings will be posted in the GitHub repository at the end of this post.
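A sketch of how one such recording could be created and stored in the reactions directory; the phrases and file names here are placeholders, not the exact ones used in the repository:

```python
import os
from gtts import gTTS

# Directory where all the vocal responses are stored.
os.makedirs("reactions", exist_ok=True)

# Example recording for one emotion.
happy_message = gTTS(text="You look happy today! Keep smiling.", lang="en")
happy_message.save(os.path.join("reactions", "happy.mp3"))

# Example recording for one gesture.
victory_message = gTTS(text="Victory! Nice gesture.", lang="en")
victory_message.save(os.path.join("reactions", "victory.mp3"))
```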

Final Pipeline:

Our final pipeline will consist of loading both our saved models and then using them accordingly to predict emotions and gestures. I will be including 2 Python files in the GitHub repository. The final_run.py takes the choice from the user and runs either the emotions or the gestures model. The final_run1.py runs both the emotions and gestures models simultaneously. Feel free to use whichever is more convenient for you guys. I will be using the saved models from the first emotions model and the trained gestures model. We will be using an additional XML file called haarcascade_frontalface_default.xml for the detection of faces. Let us try to understand the code for the final pipeline from the code block below.

In this particular code block, we are importing all the required libraries which we will use to obtain a vocal response for the label predicted by the model. cv2 is the computer vision (OpenCV) module which we will use to access our webcam in real time. We are importing the time module to make sure we get a prediction only after 10 seconds of analysis. We load the saved pre-trained weights of both the emotions and gestures models. We then specify the classifier that we will use for the detection of faces. Finally, we assign all the emotion and gesture labels which can be predicted by our models.
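A sketch of that setup; the emotions model file name and the label ordering are assumptions, while gesturenew.h5 and the Haar cascade file come from the earlier steps:

```python
import time

import cv2
import numpy as np
from playsound import playsound
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.image import img_to_array

# Load the saved pre-trained weights of both models (the emotions file name is assumed).
emotion_model = load_model("emotions.h5")
gesture_model = load_model("gesturenew.h5")

# Haar cascade classifier used for the detection of faces.
face_classifier = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")

# Labels the two models can predict (ordering assumed to match the training data).
emotion_labels = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]
gesture_labels = ["Loser", "Punch", "Super", "Victory"]
```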

In the next code block, we will look at a code snippet for the emotions model. For the entire code, refer to the GitHub repository at the end of the article.

In this choice, we will run the emotions model. While the webcam is active, we read the frames and draw a rectangle (similar to a bounding box) wherever the Haar cascade classifier detects a face. We convert the facial region into a 48 x 48 grayscale image, similar to the trained images, for better predictions. A prediction is only made when np.sum confirms that at least one face is present. The Keras utility img_to_array converts the image into an array, and we then expand the dimensions so the array matches the input shape the model expects. The predictions are made according to the labels, and the corresponding recordings are played.
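A condensed sketch of such a loop rather than the exact code from final_run.py, reusing the models, classifier, and labels loaded above; the 10-second interval and the recording paths are assumptions:

```python
cap = cv2.VideoCapture(0)      # access the webcam
last_prediction_time = 0.0

while True:
    ret, frame = cap.read()
    if not ret:
        break

    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_classifier.detectMultiScale(gray, 1.3, 5)

    for (x, y, w, h) in faces:
        # Draw a bounding box around the detected face.
        cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)

        # Crop the face and resize it to 48 x 48 grayscale, like the training images.
        roi = cv2.resize(gray[y:y + h, x:x + w], (48, 48))

        # Predict only when the region actually contains a face and only every 10 seconds.
        if np.sum(roi) != 0 and time.time() - last_prediction_time > 10:
            roi = img_to_array(roi.astype("float") / 255.0)
            roi = np.expand_dims(roi, axis=0)      # add the batch dimension

            label = emotion_labels[np.argmax(emotion_model.predict(roi))]
            cv2.putText(frame, label, (x, y - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 255, 0), 2)
            playsound("reactions/" + label.lower() + ".mp3")  # assumed recording path
            last_prediction_time = time.time()

    cv2.imshow("Emotion Detector", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
```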

Let us look at the code snippet for running the gestures model.

In this choice, we will run the gestures model. While the webcam is active, we read the frames and draw a rectangular box in the middle of the screen, unlike the emotions model. The user has to place their fingers inside this box for the detection to work. A prediction is only made when np.sum confirms that something is present in the box. The Keras utility img_to_array converts the image into an array, and we then expand the dimensions so the array matches the input shape the model expects. The predictions are made according to the labels, and the corresponding recordings are played. With this, our final pipeline is completed, and we have analyzed all the code required for building the human emotion and gesture detector models. We can now proceed to release the video capture and destroy all windows, which means we quit the frame loop being run by the computer vision module.
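Again a condensed sketch rather than the exact script, with an assumed fixed box in the middle of the frame for the hand; it ends by releasing the capture and destroying the windows as described above:

```python
cap = cv2.VideoCapture(0)
last_prediction_time = 0.0

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Fixed region of interest in the middle of the frame for the hand gesture.
    x1, y1, x2, y2 = 220, 140, 420, 340
    cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
    roi = frame[y1:y2, x1:x2]

    # Predict only when something is inside the box and only once every 10 seconds.
    if np.sum(roi) != 0 and time.time() - last_prediction_time > 10:
        roi = cv2.cvtColor(roi, cv2.COLOR_BGR2RGB)     # match the RGB training images
        roi = cv2.resize(roi, (200, 200))              # match the training image size
        roi = img_to_array(roi.astype("float") / 255.0)
        roi = np.expand_dims(roi, axis=0)              # add the batch dimension

        label = gesture_labels[np.argmax(gesture_model.predict(roi))]
        cv2.putText(frame, label, (x1, y1 - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 255, 0), 2)
        playsound("reactions/" + label.lower() + ".mp3")  # assumed recording path
        last_prediction_time = time.time()

    cv2.imshow("Gesture Detector", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

# Release the webcam and close the OpenCV windows when we quit.
cap.release()
cv2.destroyAllWindows()
```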

Conclusion:

We have finally completed going through the entire human emotion and gesture detector. The GitHub repository for the entire code can be found here. I would highly recommend experimenting with the various parameters as well as the layers in all three models we have built and trying to achieve better results. The various recordings can also be modified as desired by the user. It is also possible to try out other transfer learning models or build your own custom architectures to achieve an overall better performance. Have fun experimenting and trying out different and unique things with the models!

Final Thoughts:

I had great fun in writing this 2-part series and it was an absolute blast. I hope all of you enjoyed reading this as much as I did writing this. I look forward to posting more articles in the future as I find it extremely enjoyable. So, any ideas for future articles or any topic you guys want me to cover would be highly appreciated. Thank you everyone for sticking on till the end and I wish you all a wonderful day!

