From raw images to real-time predictions with Deep Learning

Face expression recognition using Keras, Flask and OpenCV

Jonathan Oheix
Towards Data Science


Photo by Peter Lloyd on Unsplash

In my opinion, one of the most exciting fields in Artificial Intelligence is computer vision. I find it very interesting how we can now automatically extract knowledge from complex raw data structures such as images.

The goal of this article is to explore a complete example of a computer vision application: building a face expression recognition system with Deep Learning. We will see how to:

  • design a Convolutional Neural Network
  • train it from scratch by feeding batches of images
  • export it to reuse it with real-time image data

Tools

Keras is a high-level Neural Network API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. We will use it to build, train and export our Neural Network.

Flask is a micro web framework written in Python that will allow us to serve our model directly through a web interface.

OpenCV is a computer vision library with C++, Python and Java interfaces. We will use this library to automatically detect faces in images.

Data source

The data comes from the past Kaggle competition “Challenges in Representation Learning: Facial Expression Recognition Challenge”:

https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge

The data consists of 48x48 pixel grayscale images of faces. The faces have been automatically registered so that the face is more or less centered and occupies about the same amount of space in each image. Each image corresponds to a facial expression in one of seven categories (0=Angry, 1=Disgust, 2=Fear, 3=Happy, 4=Sad, 5=Surprise, 6=Neutral). The dataset contains approximately 36K images.

The original data consisted of arrays with a greyscale value for each pixel. We converted this data into raw images and split them into multiple folders:

images/
    train/
        angry/
        disgust/
        fear/
        happy/
        neutral/
        sad/
        surprise/
    validation/
        angry/
        disgust/
        fear/
        happy/
        neutral/
        sad/
        surprise/

80% of our images are contained inside the train folder, and the remaining 20% are inside the validation folder.

Quick data visualization

First, let’s see what our images look like:

Sample of the training images

Can you guess which expressions these images correspond to?

This task is quite easy for a human, but it may be a bit challenging for a predictive algorithm because:

  • the images have a low resolution
  • the faces are not in the same position
  • some images have text written on them
  • some people hide part of their faces with their hands

However, all this diversity of images will help us build a more generalizable model.
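As a quick check, we can count how many training images we have in each category. Here is a minimal sketch, assuming the folder layout shown above:

import os

train_dir = "images/train/"
for expression in os.listdir(train_dir):
    n_images = len(os.listdir(os.path.join(train_dir, expression)))
    print(str(n_images) + " " + expression + " images")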

4103 fear images
436 disgust images
4982 neutral images
7164 happy images
3993 angry images
3205 surprise images
4938 sad images

The face expressions in our training dataset are pretty balanced, except for the ‘disgust’ category.

Setup the data generators

Deep learning models are trained by being fed with batches of data. Keras has a very useful class to automatically feed data from a directory: ImageDataGenerator.
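Here is a minimal sketch of such generators for our dataset. The image size comes from the data itself, while the batch size of 128 is an assumption:

from keras.preprocessing.image import ImageDataGenerator

img_size = 48       # the dataset images are 48x48 pixels
batch_size = 128    # assumption, not necessarily the value used for the results below

train_datagen = ImageDataGenerator()
validation_datagen = ImageDataGenerator()

train_generator = train_datagen.flow_from_directory(
    "images/train/",
    target_size=(img_size, img_size),
    color_mode="grayscale",
    batch_size=batch_size,
    class_mode="categorical",
    shuffle=True)

validation_generator = validation_datagen.flow_from_directory(
    "images/validation/",
    target_size=(img_size, img_size),
    color_mode="grayscale",
    batch_size=batch_size,
    class_mode="categorical",
    shuffle=False)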

Found 28821 images belonging to 7 classes.
Found 7066 images belonging to 7 classes.

It can also perform data augmentation while loading the images (random rotations, zooms, etc.). This method is often used to artificially get more data when the dataset is small.

The function flow_from_directory() specifies how the generator should import the images (path, image size, colors, etc.).

Setup our Convolutional Neural Network (CNN)

We chose to use a Convolutional Neural Network to tackle this face expression recognition problem. Indeed, this type of Neural Network (NN) is good at extracting the features of images and is widely used for image analysis tasks such as image classification.

Quick reminder of what a NN is:

A Neural Network is a learning framework that consists of multiple layers of artificial neurons (nodes). Each node gets weighted input data, passes it into an activation function and outputs the result of the function:

A node

A NN is composed of several layers of nodes:

A classic NN architecture
  • An input layer that will get the data. The size of the input layer depends on the shape of the input data.
  • Some hidden layers that will allow the NN to learn complex interactions within the data. A Neural Network with a lot of hidden layers is called a Deep Neural Network.
  • An output layer that will give the final result, for instance a class prediction. The size of this layer depends on the type of output we want to produce (e.g. how many classes do we want to predict?)

Classic NNs are usually composed of several fully connected layers. This means that every node of one layer is connected to all the nodes of the next layer.

Convolutional Neural Networks also have Convolutional layers that apply sliding functions to groups of neighbouring pixels. These structures are therefore better at capturing the patterns we can observe in images. We will explain this in more detail below.

Now let’s define the architecture of our CNN:
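Here is a minimal sketch of what such a model can look like with the Keras Sequential API. The filter counts, dense layer sizes and dropout rates are assumptions, not necessarily the exact values used for the results below:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, BatchNormalization
from keras.layers import Activation, Dense, Dropout, Flatten

n_classes = 7

model = Sequential()

# 1st convolutional layer (the input images are 48x48 greyscale)
model.add(Conv2D(64, (3, 3), padding="same", input_shape=(48, 48, 1)))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

# 2nd, 3rd and 4th convolutional layers (assumed filter counts)
for n_filters in [128, 256, 512]:
    model.add(Conv2D(n_filters, (3, 3), padding="same"))
    model.add(BatchNormalization())
    model.add(Activation("relu"))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))

model.add(Flatten())

# 2 fully connected layers (assumed sizes)
for n_units in [256, 512]:
    model.add(Dense(n_units))
    model.add(BatchNormalization())
    model.add(Activation("relu"))
    model.add(Dropout(0.25))

# output layer: one probability per expression
model.add(Dense(n_classes, activation="softmax"))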

We define our CNN with the following global architecture:

  • 4 convolutional layers
  • 2 fully connected layers

The convolutional layers will extract relevant features from the images and the fully connected layers will use these features to classify our images well. This architecture was inspired by the following work on the subject: https://github.com/jrishabh96/Facial-Expression-Recognition

Now let’s focus on how these convolutional layers work. Each of them contains the following operations:

  • A convolution operator: extracts features from the input image using sliding matrices to preserve the spatial relations between the pixels. The following image summarizes how it works:
A convolution operator

The green matrix corresponds to the raw image values. The orange sliding matrix is called a ‘filter’ or ‘kernel’. This filter slides over the image by one pixel at each step (stride). During each step, we multiply the filter with the corresponding elements of the base matrix and sum the results. There are different types of filters and each one will be able to retrieve different image features:

Different filter results
  • We apply the ReLU function to introduce non-linearity in our CNN. Other functions like tanh or sigmoid could also be used, but ReLU has been found to perform better in most situations.
  • Pooling is used to reduce the dimensionality of each feature map while retaining the most important information. As for the convolutional step, we apply a sliding function to our data. Different functions can be applied: max, sum, mean… The max function usually performs better. A small numerical example of both sliding operations is given below.
Max pooling operation
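To make these two sliding operations concrete, here is a tiny NumPy example on an arbitrary 4x4 matrix:

import numpy as np

image = np.array([[1, 1, 1, 0],
                  [0, 1, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]])

kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

# one convolution step: multiply the filter with the 3x3 patch in the
# top-left corner of the image and sum the results
patch = image[0:3, 0:3]
print(np.sum(patch * kernel))   # -> 4

# 2x2 max pooling with a stride of 2: keep the maximum of each 2x2 block
pooled = image.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # -> [[1 1]
                #     [0 1]]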

We also use some common techniques for each layer:

  • Batch normalization: improves the performance and stability of NNs by providing inputs with zero mean and unit variance.
  • Dropout: reduces overfitting by randomly ignoring (dropping) some nodes during training. This prevents the NN from relying too much on any single node of a layer.

We chose softmax as our last activation function as it is commonly used for multi-class classification.

Now that our CNN is defined, we can compile it with a few more parameters. We chose the Adam optimizer as it is one of the most computationally efficient. We chose categorical cross-entropy as our loss function as it is well suited to classification tasks. Our metric will be the accuracy, which is also quite informative for classification tasks on balanced datasets.
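In Keras, this compilation step is a one-liner (the learning rate is an assumption):

from keras.optimizers import Adam

model.compile(optimizer=Adam(lr=0.0001),   # assumed learning rate
              loss="categorical_crossentropy",
              metrics=["accuracy"])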

Here we define and train our CNN from scratch, but you may want to apply transfer learning methods for problems that require more computational resources. Keras has several pre-trained models ready to use:
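For instance, a convolutional base pre-trained on ImageNet can be loaded in one line. This is only an illustration and is not used in this article; note also that these pre-trained models expect 3-channel inputs, so our greyscale images would first need to be converted:

from keras.applications import VGG16

# load the convolutional layers pre-trained on ImageNet, without the classifier on top
conv_base = VGG16(weights="imagenet", include_top=False, input_shape=(48, 48, 3))
conv_base.trainable = False   # freeze the pre-trained weights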

Train the model

Everything is set up, so let’s train our model now!
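Here is a sketch of the training step. The ModelCheckpoint callback saves the weights only when the validation accuracy improves, which produces the messages shown below:

from keras.callbacks import ModelCheckpoint

epochs = 50

checkpoint = ModelCheckpoint("model_weights.h5",
                             monitor="val_acc",
                             save_best_only=True,
                             mode="max",
                             verbose=1)

history = model.fit_generator(
    train_generator,
    steps_per_epoch=train_generator.n // train_generator.batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=validation_generator.n // validation_generator.batch_size,
    callbacks=[checkpoint])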

Epoch 1/50
225/225 [==============================] - 36s 161ms/step - loss: 2.0174 - acc: 0.2333 - val_loss: 1.7391 - val_acc: 0.2966

Epoch 00001: val_acc improved from -inf to 0.29659, saving model to model_weights.h5
Epoch 2/50
225/225 [==============================] - 31s 138ms/step - loss: 1.8401 - acc: 0.2873 - val_loss: 1.7091 - val_acc: 0.3311

Epoch 00002: val_acc improved from 0.29659 to 0.33108, saving model to model_weights.h5
...
Epoch 50/50
225/225 [==============================] - 30s 132ms/step - loss: 0.6723 - acc: 0.7499 - val_loss: 1.1159 - val_acc: 0.6384

Epoch 00050: val_acc did not improve from 0.65221

Our best model managed to obtain a validation accuracy of approximately 65%, which is quite good given the fact that our target class has 7 possible values!

At each epoch, Keras checks if our model performed better than the models of the previous epochs. If so, the new best model weights are saved to a file. This will allow us to load the weights directly, without having to re-train the model, if we want to use it in another situation.

We also have to save the structure of our CNN (layers etc.) into a file:
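A minimal way of doing it is with the to_json() method:

# save the architecture of the network (the best weights are already in model_weights.h5)
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)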

Analyze the results

We got outputs at each step of the training phase. All those outputs were saved into the ‘history’ variable. We can use it to plot the evolution of the loss and accuracy on both the train and validation datasets:
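Here is a sketch of how these curves can be plotted from the history object (in this version of Keras the accuracy metric is stored under the key 'acc'):

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 4))

# loss curves
plt.subplot(1, 2, 1)
plt.plot(history.history["loss"], label="train")
plt.plot(history.history["val_loss"], label="validation")
plt.title("Loss")
plt.xlabel("epoch")
plt.legend()

# accuracy curves
plt.subplot(1, 2, 2)
plt.plot(history.history["acc"], label="train")
plt.plot(history.history["val_acc"], label="validation")
plt.title("Accuracy")
plt.xlabel("epoch")
plt.legend()

plt.show()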

Evolution of loss and accuracy with the number of training epochs

The validation accuracy starts to stabilize between 60% and 65% at the end of the 50 epochs.

The training loss is slightly higher than the validation loss during the first epochs, which can be surprising. Indeed, we are used to seeing higher validation losses than training losses in machine learning. Here, this is simply due to dropout, which is only applied during the training phase and not during the validation phase.

We can see that the training loss becomes much smaller than the validation loss after the 20th iteration. This means that our model starts to overfit our training dataset after too many epochs. That is why the validation loss does not decrease much afterwards. One solution is to stop the training earlier (early stopping).

We could also try different dropout values and perform data augmentation. These methods were tested on this dataset: they reduced the overfitting effect but did not significantly increase the validation accuracy, and they slightly increased the training duration of the model.

Finally we can plot the confusion matrix in order to see how our model classified the images:
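One possible way of computing it with scikit-learn, reusing the validation generator defined earlier (it was created with shuffle=False, so the predictions stay aligned with the true labels):

import numpy as np
from sklearn.metrics import confusion_matrix

# predicted class index for each validation image
steps = int(np.ceil(validation_generator.n / validation_generator.batch_size))
predictions = model.predict_generator(validation_generator, steps=steps)
y_pred = np.argmax(predictions, axis=1)

# true labels come directly from the generator
y_true = validation_generator.classes

print(confusion_matrix(y_true, y_pred))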

Our model is very good at predicting happy and surprised faces. However, it predicts fearful faces quite poorly because it confuses them with sad faces.

With more research and more resources this model could certainly be improved, but the primary goal of this study was to obtain a fairly good model compared to what has been done in this field.

Now it’s time to try our model in a real situation! We will use Flask to serve our model in order to perform real-time predictions with a webcam input.

Real-time predictions

For this part I re-used some code from the following repositories:

First let’s create a class that will give us the predictions of our previously trained model:
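Here is a minimal sketch of such a class. The class and method names are illustrative; the file names are the ones used when saving the model above, and the expression list follows the alphabetical order of the training folders:

import numpy as np
from keras.models import model_from_json

class FacialExpressionModel(object):

    EXPRESSIONS = ["Angry", "Disgust", "Fear", "Happy",
                   "Neutral", "Sad", "Surprise"]

    def __init__(self, model_json_file, model_weights_file):
        # load the model structure, then the trained weights
        with open(model_json_file, "r") as json_file:
            self.model = model_from_json(json_file.read())
        self.model.load_weights(model_weights_file)

    def predict_expression(self, img):
        # img is expected to be a (1, 48, 48, 1) greyscale array
        predictions = self.model.predict(img)
        return FacialExpressionModel.EXPRESSIONS[np.argmax(predictions)]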

Next we implement a camera class that will do the following operations:

  • get the image stream from our webcam
  • detect faces with OpenCV and add bounding boxes
  • convert the faces to greyscale, rescale them and send them to our pre-trained Neural Network
  • get the predictions back from our Neural Network and add the label to the webcam image
  • return the final image stream
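Here is a sketch of that camera class, using one of OpenCV's pre-trained Haar cascade classifiers for the face detection and the FacialExpressionModel class sketched above (the class name VideoCamera is illustrative):

import cv2
import numpy as np

# OpenCV ships with pre-trained Haar cascade files for frontal face detection
face_detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
model = FacialExpressionModel("model.json", "model_weights.h5")
font = cv2.FONT_HERSHEY_SIMPLEX

class VideoCamera(object):

    def __init__(self):
        self.video = cv2.VideoCapture(0)   # 0 = default webcam

    def __del__(self):
        self.video.release()

    def get_frame(self):
        _, frame = self.video.read()
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_detector.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)

        for (x, y, w, h) in faces:
            # crop the face, rescale it to 48x48 and send it to the network
            face = cv2.resize(gray[y:y + h, x:x + w], (48, 48))
            prediction = model.predict_expression(face[np.newaxis, :, :, np.newaxis])

            # draw the bounding box and the predicted expression on the frame
            cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)
            cv2.putText(frame, prediction, (x, y - 10), font, 1, (255, 255, 0), 2)

        # encode the annotated frame so it can be streamed by Flask
        _, jpeg = cv2.imencode(".jpg", frame)
        return jpeg.tobytes()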

Finally our main script will create a Flask app that will render our image predictions into a web page.
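A minimal sketch of that script, streaming the annotated frames as a multipart JPEG response (the VideoCamera class is assumed to live in a camera.py module):

from flask import Flask, render_template, Response
from camera import VideoCamera   # the class sketched above

app = Flask(__name__)

def generate_frames(camera):
    # yield the annotated webcam frames one by one as a multipart stream
    while True:
        frame = camera.get_frame()
        yield (b"--frame\r\n"
               b"Content-Type: image/jpeg\r\n\r\n" + frame + b"\r\n")

@app.route("/")
def index():
    # the web page embedding the video stream
    return render_template("index.html")

@app.route("/video_feed")
def video_feed():
    return Response(generate_frames(VideoCamera()),
                    mimetype="multipart/x-mixed-replace; boundary=frame")

if __name__ == "__main__":
    app.run(host="0.0.0.0", debug=True)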

And here are the results!

Our face expression recognition app

It works! Our application is able to detect the face location and predict the right expression.

However, the model seems to perform poorly in difficult conditions (low light, person not facing the camera, person moving…), but it is still a good start!

Thanks for reading this article, I hope you enjoyed it!

You can find the full code here:

And find me on LinkedIn here:
