
Face Detection with Haar Cascade

Exploring an older algorithm that still holds its own even in the deep learning era


Image by Girija Shankar Behera

Face detection is a widely popular subject with a huge range of applications. Modern smartphones and laptops come with built-in face detection software, which can authenticate the identity of the user. There are numerous apps that can capture, detect, and process a face in real time, identify the age and gender of the user, and apply some really cool filters. The list is not limited to mobile apps, as face detection also has a wide range of applications in surveillance, security, and biometrics. But the origin of its success stories dates back to 2001, when Viola and Jones proposed the first object detection framework for real-time face detection in video footage.

This article takes a gentle look at the Viola-Jones face detection technique, popularly known as Haar Cascades, and explores some of the interesting concepts it proposed. The work was done long before the deep learning era had even started, yet it holds up well against the powerful models that can be built with modern deep learning techniques. The algorithm is still in use almost everywhere. It has fully trained models available on GitHub. It’s fast. It’s pretty accurate (at least when I try it).


According to Wikipedia… Woody Bledsoe, Helen Chan Wolf, and Charles Bisson were the first to perform face detection on a computer, back in the 1960s. A person had to manually pinpoint the coordinates of facial features such as the pupil centers, the inside and outside corners of the eyes, and the widow’s peak in the hairline. The coordinates were used to calculate 20 distances, including the width of the mouth and of the eyes. A human could process about 40 pictures an hour in this manner, building a database of the computed distances. A computer would then automatically compare the distances for each photograph, calculate the differences, and return the closest records as possible matches.


So what is a Haar Cascade? It is an object detection algorithm used to identify faces in an image or a real-time video. The algorithm uses the edge and line detection features proposed by Viola and Jones in their research paper "Rapid Object Detection using a Boosted Cascade of Simple Features", published in 2001. The algorithm is trained on a lot of positive images containing faces and a lot of negative images containing no faces. The models created from this training are available in the OpenCV GitHub repository: https://github.com/opencv/opencv/tree/master/data/haarcascades.

The repository stores the models as XML files, which can be read with OpenCV methods. These include models for face detection, eye detection, upper-body and lower-body detection, license plate detection, etc. Below we look at some of the concepts proposed by Viola and Jones in their research.
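As a quick taste of how these pretrained models are used, here is a minimal sketch of loading one of the XML files with OpenCV’s CascadeClassifier and running it on an image (the filename test.jpg is a placeholder, not a file from this article):

```python
import cv2

# Load the pretrained frontal-face model shipped with opencv-python
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

img = cv2.imread("test.jpg")                     # placeholder input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)     # detector works on grayscale

# Returns a list of (x, y, w, h) rectangles, one per detected face
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces.jpg", img)
```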


Features

Fig. A sample of Haar features used in the Original Research Paper published by Viola and Jones.

The first contribution of the research was the introduction of the Haar features shown above. These features make it easy to find edges or lines in an image, or to pick out areas where there is a sudden change in pixel intensities.

Fig. The rectangle on the left is a sample representation of an image with pixel values from 0.0 to 1.0. The rectangle at the center is a Haar kernel with all the light pixels on the left and all the dark pixels on the right. The Haar calculation is done by taking the difference between the average of the pixel values in the darker region and the average of the pixel values in the lighter region. If the difference is close to 1, the Haar feature has detected an edge.

A sample calculation of the Haar value for a rectangular image section is shown here. The darker areas in the Haar feature are pixels with value 1, and the lighter areas are pixels with value 0. Each feature is responsible for finding one particular structure in the image: an edge, a line, or any other region with a sudden change of intensities. For example, the Haar feature in the image above can detect a vertical edge with darker pixels on its right and lighter pixels on its left.

The objective is to find the sum of all image pixels lying under the darker area of the Haar feature and the sum of all image pixels lying under the lighter area, and then take their difference. If the image has an edge separating dark pixels on the right from light pixels on the left, the Haar value will be close to 1; that is, we say an edge is detected when the Haar value is close to 1. In the example above, there is no edge, as the Haar value is far from 1.
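To make the calculation concrete, here is a toy version of this difference-of-averages computation with NumPy (the window size and pixel values are illustrative choices, not taken from the paper):

```python
import numpy as np

def haar_value(window):
    """Two-rectangle vertical-edge feature: mean of the darker (right) half
    minus mean of the lighter (left) half. A value near 1 suggests an edge
    with dark pixels on the right and light pixels on the left."""
    h, w = window.shape
    left = window[:, : w // 2]    # region the feature expects to be light
    right = window[:, w // 2 :]   # region the feature expects to be dark
    return right.mean() - left.mean()

# A 6x6 patch with a clean vertical edge: light left half, dark right half
patch = np.hstack([np.zeros((6, 3)), np.ones((6, 3))])
print(haar_value(patch))  # 1.0 -> edge detected
```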

This is just one representation of a particular Haar feature, one that detects a vertical edge. There are other Haar features as well, which detect edges in other directions and other image structures. To detect an edge anywhere in the image, the Haar feature needs to traverse the whole image.

Fig. The GIF shows how a Haar feature traverses an image from left to right.

The Haar feature traverses the image from the top left to the bottom right, searching for its particular structure. This is only a representation of the overall concept; in actual operation, the Haar feature traverses the image pixel by pixel, and all possible sizes of the Haar feature are applied.
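A rough sketch of this exhaustive traversal, assuming a toy 24×24 image and the vertical-edge feature from the previous snippet:

```python
import numpy as np

image = np.random.rand(24, 24)   # stand-in for a grayscale image in [0, 1]

responses = []
for size in range(2, 25, 2):                        # every even feature size
    for y in range(image.shape[0] - size + 1):      # every row position
        for x in range(image.shape[1] - size + 1):  # every column position
            window = image[y : y + size, x : x + size]
            left = window[:, : size // 2]           # expected-light half
            right = window[:, size // 2 :]          # expected-dark half
            responses.append((y, x, size, right.mean() - left.mean()))
```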

Depending on the structure each one looks for, the features are broadly classified into three categories. The first set, two-rectangle features, finds edges in a horizontal or vertical direction (as shown above). The second set, three-rectangle features, finds a lighter region surrounded by darker regions on either side, or vice versa. The third set, four-rectangle features, finds changes of pixel intensity across diagonals.
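For reference, the three families can be written down as tiny masks (these particular shapes and sizes are illustrative renderings of mine, not the paper’s exact kernels):

```python
import numpy as np

# 0 = light region, 1 = dark region
two_rect = np.hstack([np.zeros((4, 2)), np.ones((4, 2))])        # vertical edge
three_rect = np.hstack([np.ones((4, 2)), np.zeros((4, 2)),
                        np.ones((4, 2))])                        # line / band
four_rect = np.block([[np.zeros((2, 2)), np.ones((2, 2))],
                      [np.ones((2, 2)), np.zeros((2, 2))]])      # diagonal
```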

Now, traversing an image with Haar features involves a lot of arithmetic. As we saw, a single rectangle on either side involves 18 pixel-value additions (for a rectangle enclosing 18 pixels). Imagine doing this over the whole image with all sizes of Haar features; it would be a heavy operation even for a high-performance machine.

Fig. The GIF shows the making of an Integral Image. Each pixel in an Integral Image is the sum of all the pixels to its left and above.

To tackle this, Viola and Jones introduced another concept, known as the Integral Image, to perform the same operation. An Integral Image is calculated from the original image in such a way that each of its pixels is the sum of all the pixels lying to the left of and above the corresponding pixel in the original image (inclusive). The calculation of a pixel in the Integral Image can be seen in the GIF above. The last pixel, at the bottom-right corner of the Integral Image, is the sum of all the pixels in the original image.
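In NumPy this construction reduces to two cumulative sums; a minimal sketch:

```python
import numpy as np

def integral_image(img):
    """Each output pixel holds the sum of all input pixels above it and to
    its left (inclusive): a cumulative sum down the rows, then across the
    columns."""
    return img.cumsum(axis=0).cumsum(axis=1)

img = np.arange(1.0, 10.0).reshape(3, 3)   # tiny 3x3 example image
ii = integral_image(img)
print(ii[-1, -1] == img.sum())   # True: bottom-right pixel is the total sum
```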

Fig. Integral Image is used here to calculate the haar value.

With the Integral Image, only 4 array references are needed each time, for any feature size (compared with the 18 additions earlier). This reduces the cost of each rectangle sum to a constant, as the number of operations no longer depends on the number of pixels enclosed.
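Here is a sketch of the four-reference lookup, using the standard integral-image identity (sum = D − B − C + A, where the four letters are the corner values): the rectangle sums feed straight into the same difference-of-averages Haar value as before.

```python
import numpy as np

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1], recovered from the integral
    image ii with at most four array references: D - B - C + A."""
    total = ii[bottom, right]                 # D
    if top > 0:
        total -= ii[top - 1, right]           # B
    if left > 0:
        total -= ii[bottom, left - 1]         # C
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]        # A
    return total

img = np.random.rand(6, 6)
ii = img.cumsum(axis=0).cumsum(axis=1)        # integral image
# Haar value of a vertical-edge feature over the whole 6x6 window:
light = rect_sum(ii, 0, 0, 5, 2) / 18         # left 6x3 rectangle, 18 pixels
dark = rect_sum(ii, 0, 3, 5, 5) / 18          # right 6x3 rectangle, 18 pixels
print(dark - light)
```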

In the figure above, there is no edge in the vertical direction, as the Haar value is -0.02, which is very far from 1. Let’s see one more example, where an edge might be present in the image.

Fig. Haar calculation from the Integral Image. This is a case where there is a sudden change of pixel intensities across a vertical edge, moving from left to right in the image.

Here we repeat the same calculation as above, but this time to see what Haar value results when there is a sudden change of intensities moving from left to right across a vertical edge. The Haar value here is 0.54, which is much closer to 1 than in the earlier case.


AdaBoost

Okay, so this was all about the features and the representation of the image used in the original Haar Cascade research. Now, it’s time to explore some of the implementation details.

So basically, what we have seen so far is a set of features that capture certain facial structures, like the eyebrows, the bridge between the eyes, or the lips. But the original feature set was not limited to these: it contained approximately 180,000 features, which was eventually reduced to about 6,000. We discuss how below.

A majority of these features won’t work well or will be irrelevant to facial structures, as they are too random to find anything. So a feature selection technique was needed, one that would not only select the features performing better than the others, but also eliminate the irrelevant ones. Viola and Jones used a boosting technique called AdaBoost, in which each of the 180,000 features was applied to the images separately to create weak learners. Some produced low error rates, separating the positive images from the negative ones better than the others, while some did not. A weak learner is designed to misclassify only a minimal number of images; it only needs to perform better than a random guess. With this technique, the final set of features was reduced to about 6,000.
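To make the idea concrete, here is a minimal sketch of one round of AdaBoost-style feature selection. The zero-threshold decision stumps are a simplification of mine (real Viola-Jones weak learners also learn an optimal threshold and polarity per feature):

```python
import numpy as np

def adaboost_select(X, y, n_rounds):
    """Pick the best weak learner per round. X[i, j] is the value of Haar
    feature j on training image i; y[i] is +1 (face) or -1 (non-face)."""
    n_samples, n_features = X.shape
    w = np.full(n_samples, 1.0 / n_samples)        # uniform sample weights
    preds = np.where(X >= 0.0, 1, -1)              # zero-threshold stumps
    chosen = []
    for _ in range(n_rounds):
        # Weighted error of every stump under the current sample weights
        errs = w @ (preds != y[:, None]).astype(float)
        best = int(errs.argmin())                  # best feature this round
        err = min(max(errs[best], 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1.0 - err) / err)    # its vote in the final sum
        # Re-weight: misclassified images gain weight, correct ones lose it
        w *= np.exp(-alpha * y * preds[:, best])
        w /= w.sum()
        chosen.append((best, alpha))
    return chosen
```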


Attentional Cascade

Now comes the cascading part. The selected 6,000 features must again be run over the images to detect whether a facial structure is present. The authors used a standard window size of 24×24, within which feature detection runs. Evaluating all the features on every window is, once more, a tiresome task.

To simplify this, they proposed another technique called the Attentional Cascade. The idea is that not all features need to run on each and every window. If a window fails a feature test, we can conclude that no facial features are present there, and move on to the next window, where facial features may be present.

  • Features are applied to the images in stages. The early stages contain simpler features, while the later stages contain complex features, complex enough to find the nitty-gritty details of a face. If the initial stage detects nothing in a window, the window is discarded from the remaining process, and we move on to the next window. This saves a lot of processing time, as irrelevant windows are not processed by the majority of the stages.
  • The second stage runs only when the features of the first stage are detected in the window. The process continues in this way: if one stage passes, the window is passed on to the next stage; if it fails, the window is discarded.
Fig. A sample 2-stage feature detection, where the Haar features are applied to the image in a 4×4 window. The first stage has 2 simpler features, and the second stage has only 1 complex feature. The first stage is applied to the 4×4 windows in the image; only if a window passes is the second stage applied.

I have tried to show a visual representation of this with just 2 stages. The first stage consists of two simpler features, and the second consists of a single complex feature. This may not depict the approach exactly, but with a huge set of features spread over many stages, the technique reduces the workload of the later stages, as most windows are rejected in the initial stages.
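A minimal sketch of this early-rejection logic (the stages list of scorer/threshold pairs is hypothetical, standing in for the trained stage classifiers):

```python
def cascade_passes(window, stages):
    """Run one window through the cascade: any stage failure rejects it
    immediately, so non-face windows rarely reach the expensive stages."""
    for classifier, threshold in stages:
        if classifier(window) < threshold:
            return False            # rejected early; later stages never run
    return True                     # survived every stage -> report a face
```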

In the Viola-Jones research, there were a total of 38 stages covering around 6,000 features; the numbers of features in the first five stages are 1, 10, 25, 25, and 50, increasing in the subsequent stages. The initial stages, with fewer and simpler features, reject most of the windows containing no facial features while letting nearly every face through, keeping the false negative rate very low, whereas the later stages, with more numerous and complex features, focus on driving down the false positive rate.

Fig. Feature Detection on an Image containing a face

So this is how the detection of features takes place in stages. Notice that when the window is over a non-face region, only the first stage with its two rectangle features runs, and the window is discarded before the second stage starts. Only the window that actually contains a face runs both stages and detects the face.


Conclusion

Haar Cascade detection is one of the oldest yet most powerful face detection algorithms invented. It has been around since long before deep learning became famous. Haar features are used not only to detect faces, but also eyes, lips, license plates, etc. The models are stored on GitHub, and we can access them with OpenCV methods.

I’m writing another post on the same topic that will mostly be in code, showing how these models work in practice, unlike this one, which was purely about visualizing the features and the implementation. The original work and its concepts are complex and not easy to picture, which makes it all the more impressive that the method proved so powerful and accurate. In this article, I have simply tried to understand and explore the methodologies in a simple way.

Feel free to leave a comment below or any questions/suggestions for improvement.


References

  1. P. Viola and M. Jones, "Rapid Object Detection using a Boosted Cascade of Simple Features", CVPR 2001
  2. Cascade Classifier
  3. Face Detection using Haar Cascades
  4. Cascade Haar Explained
