Drowsiness Detection with Machine Learning

How our team built a drowsiness detection system in Python.

Grant Zhong

Published in Towards Data Science · 15 min read · Dec 13, 2019

Team Members: Grant Zhong, Rui Ying, He Wang, Aurangzaib Siddiqui, Gaurav Choudhary

Introduction

“1 in 25 adult drivers report that they have fallen asleep at the wheel in the past 30 days”

If you have driven before, you’ve been drowsy at the wheel at some point. It’s not something we like to admit but it’s an important problem with serious consequences that needs to be addressed. 1 in 4 vehicle accidents are caused by drowsy driving and 1 in 25 adult drivers report that they have fallen asleep at the wheel in the past 30 days. The scariest part is that drowsy driving isn’t just falling asleep while driving. Drowsy driving can be as small as a brief state of unconsciousness when the driver is not paying full attention to the road. Drowsy driving results in over 71,000 injuries, 1,500 deaths, and $12.5 billion in monetary losses per year. Due to the relevance of this problem, we believe it is important to develop a solution for drowsiness detection, especially in the early stages to prevent accidents.

Additionally, we believe that drowsiness can negatively impact people in working and classroom environments as well. Sleep deprivation and college may go hand in hand, but drowsiness in the workplace, especially while operating heavy machinery, may result in serious injuries similar to those caused by drowsy driving.

Our solution to this problem is to build a detection system that identifies key attributes of drowsiness and triggers an alert when someone is drowsy before it is too late.

Data Source and Preprocessing

For our training and test data, we used the Real-Life Drowsiness Dataset created by a research team from the University of Texas at Arlington specifically for detecting multi-stage drowsiness. The end goal is to detect not only extreme and visible cases of drowsiness but allow our system to detect softer signals of drowsiness as well. The dataset consists of around 30 hours of videos of 60 unique participants. From the dataset, we were able to extract facial landmarks from 44 videos of 22 participants. This allowed us to obtain a sufficient amount of data for both the alert and drowsy state.

For each video, we used OpenCV to extract 1 frame per second starting at the 3-minute mark until the end of the video.

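import cv2
from mlxtend.image import extract_face_landmarks

START_MS = 180000  # skip the first 3 minutes of each video


def getFrame(vidcap, sec):
    # Assumed helper (its body is not shown in this post): seek to
    # START_MS plus `sec` seconds and read a single frame.
    vidcap.set(cv2.CAP_PROP_POS_MSEC, START_MS + sec * 1000)
    return vidcap.read()


data = []
labels = []
for j in [60]:      # subfolder name
    for i in [10]:  # video file name, also stored as the label
        vidcap = cv2.VideoCapture(f'drive/My Drive/Fold5_part2/{j}/{i}.mp4')
        sec = 0
        frameRate = 1  # seconds between sampled frames
        success, image = getFrame(vidcap, sec)
        count = 0
        while success and count < 240:
            landmarks = extract_face_landmarks(image)
            # Depending on the mlxtend version, a missed detection comes back
            # as None or as an all-zero array, so check for both.
            if landmarks is not None and landmarks.sum() != 0:
                count += 1
                data.append(landmarks)
                labels.append([i])
                print(count)
            else:
                print("not detected")
            # Advance to the next second whether or not a face was found.
            sec = round(sec + frameRate, 2)
            success, image = getFrame(vidcap, sec)
        vidcap.release()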

Each video was approximately 10 minutes long, and we extracted up to 240 frames per video, resulting in 44 × 240 = 10,560 frames for the entire dataset.

There were 68 total landmarks per frame but we decided to keep the landmarks for the eyes and mouth only (Points 37–68). These were the important data points we used to extract the features for our model.
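As a minimal sketch of that selection step, assuming each frame's landmarks arrive as a sequence of 68 (x, y) points in the usual 68-point ordering, the 1-based points 37–68 correspond to 0-based indices 36–67. The helper name `keep_eyes_and_mouth` and the dummy frame are hypothetical and only there to make the indexing easy to check.

```python
# Assumption: 68 (x, y) points per frame, so 1-based points 37-68
# are 0-based indices 36-67.
EYE_MOUTH = slice(36, 68)


def keep_eyes_and_mouth(landmarks):
    """Return the 32 eye and mouth points from a 68-point landmark sequence."""
    if len(landmarks) != 68:
        raise ValueError("expected 68 landmarks per frame")
    return list(landmarks[EYE_MOUTH])


# Dummy frame: 1-based point k is encoded as (k, k) so the result is easy to read.
frame = [(k, k) for k in range(1, 69)]
subset = keep_eyes_and_mouth(frame)
print(len(subset), subset[0], subset[-1])
```

Slicing once up front keeps every later feature function working on the same 32-point layout.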

Feature Extraction

As briefly alluded to earlier, we used the facial landmarks extracted from the video frames to develop suitable features for our classification model. While we hypothesized and tested several features, the four core features that we settled on for our final models were eye aspect ratio, mouth aspect ratio, pupil circularity, and finally, mouth aspect ratio over eye aspect ratio.

Eye Aspect Ratio (EAR)

EAR, as the name suggests, is the ratio of the height of the eye to the width of the eye. The height is calculated by averaging over two distinct vertical lines across the eye as illustrated in the figure below.

Our hypothesis was that when an individual is drowsy, their eyes are likely to get smaller and they are likely to blink more. Based on this hypothesis, we expected our model to predict the class as drowsy if the eye aspect ratio for an individual over successive frames started to decline i.e. their eyes started to be more closed or they were blinking faster.
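The EAR described above can be sketched as follows. This assumes the six points of one eye are ordered p1..p6 with p1 and p4 at the horizontal corners and (p2, p6), (p3, p5) as the two upper and lower lid pairs, which is the common convention for 68-point landmarks; the coordinates are made up for illustration.

```python
import math


def eye_aspect_ratio(eye):
    """EAR for one eye given six (x, y) points ordered p1..p6."""
    p1, p2, p3, p4, p5, p6 = eye
    # Average of the two vertical lines across the eye.
    height = (math.dist(p2, p6) + math.dist(p3, p5)) / 2.0
    # Horizontal corner-to-corner distance.
    width = math.dist(p1, p4)
    return height / width


# Illustrative coordinates only, not taken from the dataset.
open_eye = [(0, 0), (1, 1), (2, 1), (3, 0), (2, -1), (1, -1)]
closing_eye = [(0, 0), (1, 0.25), (2, 0.25), (3, 0), (2, -0.25), (1, -0.25)]

ear_open = eye_aspect_ratio(open_eye)
ear_closing = eye_aspect_ratio(closing_eye)
print(ear_open, ear_closing)
```

The ratio drops as the lids close while the eye width stays roughly constant, which is the behavior the hypothesis relies on.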

Mouth Aspect Ratio (MAR)

Computationally similar to the EAR, the MAR, as you would expect, measures the ratio of the height of the mouth opening to the width of the mouth. Our hypothesis was that as an individual becomes drowsy, they are likely to yawn and lose control over their mouth, making their MAR higher than usual in this state.
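A rough sketch of the MAR, under the assumption that it uses one vertical distance (top to bottom of the mouth opening) over one horizontal distance (corner to corner). Which landmark indices feed each argument is not stated here, so the four points are passed in explicitly and the coordinates are illustrative.

```python
import math


def mouth_aspect_ratio(top, bottom, left, right):
    """Mouth opening height divided by mouth width (points are (x, y) pairs)."""
    return math.dist(top, bottom) / math.dist(left, right)


# Illustrative coordinates only.
mar_closed = mouth_aspect_ratio(top=(2, 0.2), bottom=(2, -0.2), left=(0, 0), right=(4, 0))
mar_yawn = mouth_aspect_ratio(top=(2, 1.5), bottom=(2, -1.5), left=(0, 0), right=(4, 0))
print(mar_closed, mar_yawn)
```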

Pupil Circularity (PUC)

PUC is a measure complementary to EAR, but it places a greater emphasis on the pupil instead of the entire eye.

For example, someone who has their eyes half-open or almost closed will have a much lower pupil circularity value versus someone who has their eyes fully open due to the squared term in the denominator. Similar to the EAR, the expectation was that when an individual is drowsy, their pupil circularity is likely to decline.
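One way to make the "squared term in the denominator" concrete is the usual circularity measure, 4π · area / perimeter². The sketch below assumes the area is that of a circle whose diameter is the distance between eye points p2 and p5, and the perimeter is the closed path through the six eye points; both choices are assumptions for illustration, as are the coordinates.

```python
import math


def pupil_circularity(eye):
    """Circularity 4*pi*area / perimeter**2 for six eye points p1..p6."""
    pts = list(eye)
    p2, p5 = pts[1], pts[4]
    # Closed perimeter p1 -> p2 -> ... -> p6 -> p1.
    perimeter = sum(math.dist(a, b) for a, b in zip(pts, pts[1:] + pts[:1]))
    # Assumed area: circle whose diameter is the p2-p5 distance.
    area = math.pi * (math.dist(p2, p5) / 2.0) ** 2
    return 4.0 * math.pi * area / perimeter ** 2


# Illustrative coordinates only.
open_eye = [(0, 0), (1, 1), (2, 1), (3, 0), (2, -1), (1, -1)]
closing_eye = [(0, 0), (1, 0.25), (2, 0.25), (3, 0), (2, -0.25), (1, -0.25)]

puc_open = pupil_circularity(open_eye)
puc_closing = pupil_circularity(closing_eye)
print(puc_open, puc_closing)
```

Because the perimeter shrinks only a little when the eye closes while the area shrinks a lot, the value falls sharply for half-closed eyes.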

Mouth aspect ratio over Eye aspect ratio (MOE)

Finally, we decided to add MOE as another feature. MOE is simply the ratio of the MAR to the EAR.

The benefit of using this feature is that EAR and MAR are expected to move in opposite directions if the state of the individual changes. As opposed to both EAR and MAR, MOE as a measure will be more responsive to these changes as it will capture the subtle changes in both EAR and MAR and will exaggerate the changes as the denominator and numerator move in opposite directions. Because the MOE takes MAR as the numerator and EAR as the denominator, our theory was that as the individual gets drowsy, the MOE will increase.
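A small worked example shows the amplification. The EAR and MAR values below are made up for illustration; the point is only that when the numerator rises and the denominator falls, the relative change in MOE exceeds the relative change in either input.

```python
def mouth_over_eye(mar, ear):
    """MOE = MAR / EAR (assumes ear > 0)."""
    return mar / ear


def relative_change(before, after):
    return (after - before) / before


# Illustrative values only, not measurements from the dataset.
alert = {"ear": 0.30, "mar": 0.40}
drowsy = {"ear": 0.20, "mar": 0.50}

ear_change = relative_change(alert["ear"], drowsy["ear"])  # EAR falls by about a third
mar_change = relative_change(alert["mar"], drowsy["mar"])  # MAR rises by a quarter
moe_change = relative_change(
    mouth_over_eye(alert["mar"], alert["ear"]),
    mouth_over_eye(drowsy["mar"], drowsy["ear"]),
)
print(ear_change, mar_change, moe_change)
```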

While all these features made intuitive sense, when tested with our classification models, they yielded poor results in the range of 55% to 60% accuracy which is only a minor improvement over the baseline accuracy of 50% for a binary balanced classification problem. Nonetheless, this disappointment led us to our most important discovery: the features weren’t wrong, we just weren’t looking at them correctly.

Feature Normalization

When we were testing our models with the four core features discussed above, we noticed an alarming pattern. Whenever we randomly split the frames into training and test sets, our model would yield accuracy as high as 70%. However, whenever we split the frames by individual (i.e. an individual who is in the test set will not be in the training set), our model performance would be poor, as alluded to earlier.

This led us to the realization that our model was struggling with new faces and the primary reason for this struggle was the fact that each individual has different core features in their default alert state. That is, person A may naturally have much smaller eyes than person B. If a model is trained on person B, the model, when tested on person A, will always predict the state as drowsy because it will detect a fall in EAR and PUC and a rise in MOE even though person A was alert. Based on this discovery, we hypothesized that normalizing the features for each individual is likely to yield better results and as it turned out, we were correct.

To normalize the features of each individual, we took the first three frames for each individual’s alert video and used them as the baseline for normalization. The mean and standard deviation of each feature for these three frames were calculated and used to normalize each feature individually for each participant. Mathematically, this is what the normalization equation looked like:

Now that we had normalized each of the four core features, our feature set had eight features, each core feature complemented by its normalized version. We tested all eight features in our models and our results improved significantly.
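Read as a standard z-score, the normalization described above is (feature − μ) / σ, with μ and σ computed per participant and per feature from the first three alert frames. The sketch below assumes that reading and uses the population standard deviation; the feature values, the `_N` suffix, and the function names are illustrative.

```python
import statistics


def fit_baseline(alert_frames, n_baseline=3):
    """Per-feature (mean, std) over the first n_baseline alert frames."""
    baseline = alert_frames[:n_baseline]
    stats = {}
    for name in baseline[0]:
        values = [frame[name] for frame in baseline]
        # Assumption: population std; a zero std would need special handling.
        stats[name] = (statistics.fmean(values), statistics.pstdev(values))
    return stats


def normalize(frame, stats):
    """Keep each core feature and add its normalized version."""
    out = dict(frame)
    for name, (mu, sigma) in stats.items():
        out[name + "_N"] = (frame[name] - mu) / sigma
    return out


# Illustrative values for one participant; only two features shown.
alert_frames = [
    {"EAR": 0.30, "MAR": 0.40},
    {"EAR": 0.32, "MAR": 0.42},
    {"EAR": 0.34, "MAR": 0.41},
    {"EAR": 0.22, "MAR": 0.60},
]
stats = fit_baseline(alert_frames)
normalized = [normalize(f, stats) for f in alert_frames]
print(normalized[3])
```

The same `stats` would be applied to that participant's drowsy video as well, so every frame is expressed relative to the person's own alert baseline.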

Basic Classification Methods and Results

After we extracted and normalized our features, we wanted to try a series of modeling techniques, starting with the most basic classification models like logistic regression and Naive Bayes and moving on to more complex models such as neural networks and other deep learning approaches. It’s important to note the performance-interpretability tradeoff here. Although we prioritize top-performing models, interpretability is also important to us if we were to commercialize this solution and present its business implications to stakeholders who are not familiar with machine learning lingo. To train and test our models, we split our dataset by participant: 17 participants for training and 5 for testing, with two videos of 240 frames each per participant. As a result, our training dataset contains 8,160 rows (17 × 2 × 240) and our test dataset contains 2,400 rows (5 × 2 × 240).
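The row counts work out if the 17/5 split is taken over participants, each contributing two videos of 240 frames (an inference from the 8160 and 2400 rows). A minimal sketch of such a split, with hypothetical row dictionaries and a random choice of test participants, could look like this:

```python
import random


def split_by_participant(rows, n_test, seed=0):
    """Hold out n_test whole participants so no person appears in both sets."""
    ids = sorted({row["participant"] for row in rows})
    test_ids = set(random.Random(seed).sample(ids, n_test))
    train = [row for row in rows if row["participant"] not in test_ids]
    test = [row for row in rows if row["participant"] in test_ids]
    return train, test


# Hypothetical layout: 22 participants x 2 videos x 240 frames.
rows = [
    {"participant": p, "video": v, "frame": t}
    for p in range(22)
    for v in ("alert", "drowsy")
    for t in range(240)
]
train_rows, test_rows = split_by_participant(rows, n_test=5)
print(len(train_rows), len(test_rows))
```

Splitting on the participant ID rather than on individual frames is what keeps the test accuracy honest about performance on unseen faces.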

How do we introduce sequence to basic classification methods?

One challenge we faced during this project was that we were trying to predict the label for each frame in the sequence. While complex models like LSTM and RNN can account for sequential data, basic classification models cannot.

The way we dealt with this problem was to average the original prediction results with the prediction results from the previous two frames. Since our dataset was divided into training and test based on the individual participants and the data points are all in the order of time sequence, averaging makes sense in this case and allowed us to deliver more accurate predictions.
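A sketch of that averaging step, assuming it is applied to the predicted drowsy probabilities of one video in time order, and that the first two frames simply average over whatever history exists. The probabilities are made up.

```python
def smooth_predictions(probs, window=3):
    """Average each prediction with up to (window - 1) preceding predictions."""
    smoothed = []
    for t in range(len(probs)):
        recent = probs[max(0, t - window + 1): t + 1]
        smoothed.append(sum(recent) / len(recent))
    return smoothed


# Illustrative per-frame probabilities for a single video.
raw = [0.2, 0.9, 0.1, 0.8, 0.9, 0.7]
smoothed = smooth_predictions(raw)
print(smoothed)
```

The function should be called once per video so that frames from different recordings are never averaged together.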

Introducing Sequence to Basic Classification Models

Of the different classification methods we tried, K-Nearest Neighbor (kNN, k = 25) had the highest out-of-sample accuracy at 77.21%. Naive Bayes performed the worst at 57.75%, and we concluded that this was because the model has a harder time dealing with numerical data. Although kNN yielded the highest accuracy, its false-negative rate was quite high at 0.42, which means that there is a 42% probability that someone who is actually drowsy would be detected as alert by our system. To decrease the false-negative rate, we lowered the classification threshold from 0.5 to 0.4, which led our models to classify more cases as drowsy. Although the accuracies of some of the other models increased, kNN still reported the highest accuracy at 76.63% (k = 18) despite a decline in its own accuracy.
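The effect of lowering the threshold can be illustrated with a toy example. The labels and probabilities below are invented, with 1 meaning drowsy and 0 meaning alert; the false-negative rate is the share of truly drowsy frames that are classified as alert.

```python
def classify(probs, threshold=0.5):
    """Label a frame drowsy (1) when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]


def false_negative_rate(y_true, y_pred):
    """Fraction of drowsy frames (label 1) predicted as alert (label 0)."""
    false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return false_negatives / sum(y_true)


# Invented data for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.9, 0.45, 0.42, 0.3, 0.41, 0.2, 0.1, 0.35]

fnr_default = false_negative_rate(y_true, classify(probs, threshold=0.5))
fnr_lowered = false_negative_rate(y_true, classify(probs, threshold=0.4))
print(fnr_default, fnr_lowered)
```

In this toy data the lower threshold catches two more drowsy frames at the cost of one extra false alarm, which is the same tradeoff described above.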

Left: Original Results | Right: Results after lowering threshold from 0.5 to 0.4

Feature Importance

We wanted to get a sense of feature importance so we visualized the results from our Random Forest model.

Mouth Aspect Ratio after normalization turned out to be the most important feature out of our 8 features. This makes sense because when we are drowsy, we tend to yawn more frequently. Normalizing our features exaggerated this effect and made it a better indicator of drowsiness in different participants.

Convolutional Neural Network (CNN)

Convolutional Neural Networks (CNNs) are typically used to analyze image data and map images to output variables. However, we decided to build a 1-D CNN and send in numerical features as sequential input data to try to understand the spatial relationship between each feature for the two states. Our CNN model has 5 layers before the output layer: 1 convolutional layer, 1 flatten layer, 2 fully connected dense layers, and 1 dropout layer. The flatten layer flattens the output from the convolutional layer into a one-dimensional vector before passing it into the first dense layer. The dropout layer randomly drops 20% of the output nodes from the second dense layer in order to prevent our model from overfitting to the training data. The final dense layer has a single output node that outputs 0 for alert and 1 for drowsy.

Long Short-Term Memory (LSTM) Networks

Another method to deal with sequential data is using an LSTM model. LSTM networks are a special kind of Recurrent Neural Networks (RNN), capable of learning long-term dependencies in the data. Recurrent Neural Networks are feedback neural networks that have internal memory that allows information to persist.

How can RNNs have an internal memory space while processing new data?

The answer is that when making a decision, RNNs consider not only the current input but also what they have learned from the previous inputs. This is also the main difference between RNNs and other neural networks. In other neural networks, the inputs are independent of each other. In RNNs, the inputs are related to each other. The formula is as below:
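In one common textbook notation (the symbols here are illustrative and may differ from other write-ups), a simple RNN computes its hidden state $h_t$ from the current input $x_t$ and the previous hidden state $h_{t-1}$, and then maps that state to an output $y_t$:

```latex
h_t = \tanh\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t + b_h\right), \qquad
y_t = W_{hy}\, h_t + b_y
```

The term $W_{hh}\, h_{t-1}$ is what carries information from earlier frames into the current prediction.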

We chose to use an LSTM network because it allows us to study long sequences without having to worry about the vanishing gradient problem faced by traditional RNNs. Within the LSTM network, there are three gates for each time step: Forget Gate, Input Gate, and Output Gate.

Forget Gate: as its name suggests, the gate tries to “forget” part of the memory from the previous output.

Input Gate: the gate decides what should be kept from the input in order to modify the memory.

Output Gate: the gate decides what the output is by combining the input and memory.
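For reference, the three gates are usually written as follows in the standard LSTM formulation, where $\sigma$ is the sigmoid function, $\odot$ is element-wise multiplication, $c_t$ is the cell memory, and $[h_{t-1}, x_t]$ is the previous hidden state concatenated with the current input. This is the textbook form and not necessarily the exact variant implemented by a given library.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) && \text{(candidate memory)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(memory update)} \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh\left(c_t\right) && \text{(output)}
\end{aligned}
```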

First, we converted our videos into batches of data. Then, each batch was sent through a fully connected layer with 1024 hidden units using the sigmoid activation function. The next layer is our LSTM layer with 512 hidden units followed by 3 more FC layers until the final output layer as displayed below.

After hyperparameter tuning, our optimized LSTM model achieved an overall accuracy of 77.08% with a much lower false-negative rate of 0.3 compared to the false-negative rate of our kNN model (0.42).

Transfer Learning

Transfer learning focuses on using the knowledge gained while solving one problem and applying it to solve a different but related problem. It is a useful set of techniques, especially when we have limited time to train the model or limited data to fully train a neural network. Since the data we were working with had very few unique samples, we believed this problem would be a good candidate for transfer learning. The model we decided to use is VGG16 pre-trained on the ImageNet dataset.

VGG16 is a convolutional neural network model proposed by K. Simonyan and A. Zisserman from the University of Oxford in their paper “Very Deep Convolutional Networks for Large-Scale Image Recognition”. The model achieved 92.7% top-5 test accuracy on ImageNet, using the 1000-class subset described below.

ImageNet is a dataset with over 15 million labeled high-resolution images belonging to about 22,000 different categories. The images were collected from the internet and labeled by human labelers using Amazon’s crowd-sourcing tool, Mechanical Turk. Since 2010, as part of the Pascal Visual Object Challenge, a competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held annually. ILSVRC uses a smaller set of ImageNet with roughly 1000 images in each of 1000 categories. There are approximately 1.2 million training images, 50,000 validation images, and 150,000 testing images. ImageNet consists of images with different resolutions, so the images are brought to a fixed resolution of 256×256: each image is rescaled, and the central 256×256 patch is cropped out to form the resulting image.

The input to the conv1 layer is a 224×224 RGB image. The image is passed through a stack of convolutional layers, where the filters are used with a very small receptive field: 3×3. In one of the configurations, the model also utilizes 1×1 convolution filters, which can be seen as a linear transformation of the input channels followed by a non-linearity. The convolution stride is fixed to 1 pixel; the spatial padding of the convolutional layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3×3 convolutional layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the convolutional layers. Not all the conv. layers are followed by max-pooling. Max-pooling is performed over a 2×2 pixel window, with a stride of 2.
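The size bookkeeping in this description can be checked with the usual output-size formula, (size + 2·padding − kernel) / stride + 1. The sketch below applies one 3×3 convolution per stage for brevity (the real network stacks several, but each preserves the resolution), followed by the 2×2 max-pool.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


size = 224
sizes = [size]
for _ in range(5):  # five max-pooling stages
    # 3x3 convolution, stride 1, padding 1: resolution is preserved.
    size = conv_output_size(size, kernel=3, stride=1, padding=1)
    # 2x2 max-pooling, stride 2: resolution is halved.
    size = conv_output_size(size, kernel=2, stride=2)
    sizes.append(size)
print(sizes)
```

The 224×224 input therefore reaches the fully connected layers as a 7×7 feature map.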

Three Fully-Connected (FC) layers follow a stack of convolutional layers: the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and therefore contains 1000 channels. The final layer is a soft-max layer. The configuration of the fully connected layers is the same in all networks.

All hidden layers are equipped with the rectification (ReLU) non-linearity. It is also noted that, barring one, none of the networks contain Local Response Normalisation (LRN), because such normalization does not improve the performance of the model but leads to increased computation time.

We split the training videos into 34,000 images, which were screenshots taken every 10 frames, and fed these images to the VGG16 model. We believed that this number of images was sufficient to train the pre-trained model. The accuracy scores we obtained after training the model for 50 epochs are shown below.

It was clear that the model was overfitting. A possible explanation for this is that images that we passed through the model were of 22 respondents sitting virtually motionless in front of a camera with undisturbed backgrounds. So despite taking a large number of frames (34,000) into our model, the model was essentially trying to learn from 22 sets of virtually identical images. Hence the model didn’t really have enough training data in a true sense.

Conclusion

We learned quite a few things throughout this project. First, simpler models can be just as effective at completing tasks as more complex models. In our case, the K-Nearest Neighbor model gave an accuracy similar to the LSTM model. However, because we do not want to misclassify people who are drowsy as alert, ultimately it is better to use the more complex model with a lower false-negative rate than a simpler model that may be cheaper to deploy. Second, normalization was crucial to our performance. We recognized that everybody has a different baseline for eye and mouth aspect ratios, and normalizing for each participant was necessary. Outside of runtime for our models, data pre-processing and feature extraction/normalization took up the bulk of our time. It will be interesting to update our project and look into how we can decrease the false-negative rate for kNN and other simpler models.

Future Scope

Moving forward, there are a few things we can do to further improve our results and fine-tune the models. First, we need to incorporate distance between the facial landmarks to account for any movement by the subject in the video. Realistically the participants will not be static on the screen and we believe sudden movements by the participant may signal drowsiness or waking up from micro-sleep. Second, we want to update parameters with our more complex models (NNs, ensembles, etc.) in order to achieve better results. Third and finally, we would like to collect our own training data from a larger sample of participants (more data!!!) while including new distinct signals of drowsiness like sudden head movement, hand movement, or even tracking eye movements.

Product Preview

We wanted to include a few screenshots of our system in action!

First, we need to calibrate the system to the participant as shown below.

Now, the system should automatically detect whether the participant is drowsy or alert. Examples are shown below.

Thank you so much for reading through our entire blog! Feel free to reach out to any of us on LinkedIn with any questions or suggestions on how we can improve our system.

Full project and code can be viewed on GitHub!

Acknowledgment

We would like to give a special “Thank You” to Dr. Joydeep Ghosh who was able to provide incredibly valuable guidance throughout this project.

References

UTA-RLDD: The University of Texas at Arlington Real-Life Drowsiness Dataset (UTA-RLDD) was created for the task of multi-stage… (sites.google.com)

https://arxiv.org/abs/1904.07312

https://pypi.org/project/opencv-python/

https://towardsdatascience.com/understanding-rnn-and-lstm-f7cdf6dfc14e

https://neurohive.io/en/popular-networks/vgg16/