
Creating a Snapchat-Style Filter with Python

A Practical Demonstration of Haar Cascades and Homography

Photo by Ethan Robertson on Unsplash

The introduction of augmented reality (AR) on smartphones ushered in a novel approach to entertainment. From playing games like Pokémon Go to making funny faces on Snapchat, AR has become a commonplace phenomenon.

While these technologies seem advanced enough to be borrowed from science fiction, creating a fun, Snapchat-style filter in Python is surprisingly straightforward. In this instance, I’ll create a filter that places a pair of sunglasses on a face.

I drew this pair of sunglasses for this project. I won’t quit my day job to pursue art.

This filter, like any AR application, relies on two fundamental steps. First, it must determine where in the video frame to project an image. In this example, whatever algorithm I use must correctly identify the location of my eyes. Second, it must transform the projected image so that it’s proportional to the rest of the video frame. In this case, the image is a pair of sunglasses which must fit on a pair of eyes when projected.

While both of these challenges sound daunting, Python’s OpenCV bindings make them fairly easy to overcome.

Basic Overview

The first challenge is to detect a pair of eyes. Face detection and facial landmark detection are both huge topics within computer vision with many unique approaches, but this method will use Haar Cascades.

First introduced in a 2001 paper by Paul Viola and Michael Jones, Haar Cascades are a supervised machine learning technique designed for rapid face detection. Rather than requiring training from scratch, OpenCV provides a number of pre-trained models, including models for face and eye detection.

The second challenge is to transform the projected image to ensure that it’s proportional to the face. Homography, which concerns itself with isomorphisms of projective spaces, offers a solution. While the concept itself sounds scary, we’ll use it to project the sunglasses onto the video frame so that they appear natural.

Obviously, both Haar Cascades and homography are deep concepts whose details go beyond the scope of this article, but a basic understanding of what they are and what they do will help in understanding the code implementation.

Detecting the Eyes

import numpy as np
import cv2
# Import the pre-trained models for face and eye detection
face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier("haarcascade_eye.xml")

First, NumPy is imported for efficient numerical calculation and OpenCV is imported for image processing.

Next, OpenCV’s built-in cascade classifier method is called. Nothing is classified yet, but this initializes the models that will be used. The XML files passed as arguments are pre-trained models that specialize in detecting a frontal view of a face and eyes, respectively.

These pre-trained models come built-in with OpenCV but can also be downloaded separately here. Please note, however, that these pre-trained models carry an Intel license and may have restrictions on use.
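
As an aside, if OpenCV was installed through pip (the opencv-python package), the same XML files ship with the library, and their folder can be located programmatically instead of hard-coding a local path. A minimal sketch, assuming a standard pip installation:

import cv2

# The pip package exposes the folder containing the bundled cascade files
cascade_dir = cv2.data.haarcascades
face_cascade = cv2.CascadeClassifier(cascade_dir + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(cascade_dir + "haarcascade_eye.xml")

# empty() returns True if an XML file failed to load, a useful sanity check
assert not face_cascade.empty() and not eye_cascade.empty()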

# Global variable to cache the location of eyes
eye_cache = None

A global variable is declared to create a cache. This is done for two reasons:

  1. The Haar Cascades classifier will not be able to identify the eyes in every single frame, which creates a flickering effect. By using a cache, however, the program can refer to the location of the eyes from the last successful identification and place the glasses accordingly. This removes the flickering and makes the program run more smoothly.
  2. Sometimes the Haar Cascades will falsely identify more than two eyes, which throws off the placement of the glasses. As before, the cache corrects this by referring to a previous location of the eyes.

While this makes the program run more smoothly, it comes at a cost. By continually referring to the location of the eyes in previous frames, the position of the glasses can lag. For somebody sitting calmly or even swaying slowly, the effect isn’t very noticeable, but a quickly moving face will see the impact.

# Capture video from the local camera
cap = cv2.VideoCapture(0)
while True:

    # Read the frame
    ret, frame = cap.read()
    # Check to make sure camera loaded the image correctly
    if not ret:
        break

The first line captures video. Note that passing 0 will use the computer’s first camera. If multiple cameras are connected, passing the integer n will use the nth camera.

Additionally, if a pre-recorded video needs to be used instead, passing a string with the path to the video file will also work.
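
For example (the file name here is hypothetical):

# Read from a pre-recorded video instead of a camera
cap = cv2.VideoCapture("my_recording.mp4")
# Or explicitly select a second connected camera
cap = cv2.VideoCapture(1)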

Next, an infinite loop is initialized and the data from the camera is read. The variable "ret" is simply a boolean that denotes whether a frame was actually captured from the camera, and "frame" is the current frame itself. If "ret" is false, the loop breaks.

    # Convert to grayscale
    gray_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Detect faces
    faces = face_cascade.detectMultiScale(gray_frame, 1.3, 5)

The frame is converted to grayscale and then passed into the Haar Cascades classifier to detect any faces. The face is detected before the eyes because any detected eye should be located within a face. If an eye is located outside a face, then either something is wrong, such as a false positive, or something is very wrong.

The method called to detect faces returns an array of bounding boxes, each holding the (x, y)-coordinates of a face’s top-left corner along with its width and height. The arguments 1.3 and 5 are the scale factor and minimum-neighbors parameters, which control how finely the image is scanned and how many overlapping detections are required to confirm a face.
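
A quick, standalone way to sanity-check the detector is to draw each returned bounding box onto the frame; this sketch is for debugging and isn’t part of the filter itself:

# Draw a green rectangle around each detected face
for (x, y, w, h) in faces:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)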

    for (x, y, w, h) in faces:
        # Extract the grayscale region of interest for the face
        roi_gray = gray_frame[y:y+h, x:x+w]
        # Detect eyes within the grayscale region of interest
        eyes = eye_cascade.detectMultiScale(roi_gray)

For every face found, a subsection of the video frame, called the region of interest, is taken. This region of interest is then passed through the Haar Cascades classifier that specializes in eyes. With the eyes detected, the next steps can proceed.

        # Only update the cache when exactly 2 eyes are detected
        if len(eyes) == 2:
            # Store the position of the eyes in the cache
            eye_cache = eyes
        # If 2 eyes aren't detected, fall back to the cache
        elif eye_cache is not None:
            eyes = eye_cache

As mentioned previously, if two eyes aren’t detected, whether too few or too many, the cache is used instead. If exactly two eyes are detected as expected, the cache is updated accordingly.

Projecting an Image onto Frame

To project the image of the sunglasses onto the video frame so that they’re proportional, two things are needed: the (x, y)-coordinates of the sunglasses and the (x, y)-coordinates of where they’ll be projected onto the video frame. Both will be organized into simple matrices. The former will be the source matrix, because it is the source of the image, and the latter will be the destination matrix, because it’s the destination of the image.

Once both matrices are built, a third, the homography matrix, will be calculated; it encodes how to "stretch" the image of the sunglasses so that it fits around a pair of eyes.
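
For intuition, the homography is a 3×3 matrix that maps source points to destination points in homogeneous coordinates. A minimal sketch, using a made-up matrix, of what applying it to one corner looks like:

import numpy as np

# A made-up homography matrix, for illustration only
H = np.array([[1.2, 0.1, 30.0],
              [0.0, 1.2, 45.0],
              [0.0, 0.0, 1.0]])

# The top-left corner of the source image in homogeneous coordinates
corner = np.array([0.0, 0.0, 1.0])
u, v, scale = H @ corner
# Dividing by the scale term gives the corner's position in the frame
print(u / scale, v / scale)

Conceptually, warping an image applies this same mapping across every pixel.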

    # Read the image of the sunglasses and get its dimensions
    img = cv2.imread("glasses_transparent.png", -1)
    img_h = img.shape[0]
    img_w = img.shape[1]
    # Create the source matrix: top-left, top-right, bottom-right, bottom-left
    src_mat = np.array([[0, 0], [img_w, 0], [img_w, img_h], [0, img_h]])

To start, the image of the sunglasses is read into memory. Normally, images have three color channels: blue, green, and red (OpenCV’s default channel order); however, some images have a fourth channel, called the alpha channel, which denotes transparency.

When I drew the pair of sunglasses, I ensured it had a transparent background. Normally, OpenCV would discard this, but passing -1 (the named constant cv2.IMREAD_UNCHANGED) to the imread method keeps the fourth channel.
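
A quick check, assuming the same file name, confirms the fourth channel was actually loaded:

# cv2.IMREAD_UNCHANGED is the named constant equal to -1
img = cv2.imread("glasses_transparent.png", cv2.IMREAD_UNCHANGED)
print(img.shape)   # e.g. (240, 600, 4) -- height, width, and 4 channels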

After reading the image, the dimensions are noted and the source matrix is composed of the coordinates of the top-left corner, top-right corner, bottom-right corner, and the bottom-left corner. The matrix must be in this order!

        # Define the destination matrix based on eye detection order.
        # Order of points must be top-left, top-right, bottom-right,
        # and bottom-left
        if eyes[0][0] < eyes[1][0]:
            dst_mat = np.array([
                [x + eyes[0][0], y + eyes[0][1]],
                [x + eyes[1][0] + eyes[1][2], y + eyes[1][1]],
                [x + eyes[1][0] + eyes[1][2], y + eyes[1][1] + eyes[1][3]],
                [x + eyes[0][0], y + eyes[0][1] + eyes[0][3]]
            ])
        else:
            dst_mat = np.array([
                [x + eyes[1][0], y + eyes[1][1]],
                [x + eyes[0][0] + eyes[0][2], y + eyes[0][1]],
                [x + eyes[0][0] + eyes[0][2], y + eyes[0][1] + eyes[0][3]],
                [x + eyes[1][0], y + eyes[1][1] + eyes[1][3]]
            ])

This is where things get a little complicated. While the eyes are detected, there’s no way of knowing their order in advance. Sometimes the right eye is detected first and sometimes the left. To determine which is which, the x-coordinates of the two eyes are compared, and then the proper destination matrix can be composed.

The destination matrix must contain the coordinates of the corners of the eyes in the same order as the source matrix. In other words, the (x, y)-coordinates of the top-left corner, top-right corner, bottom-right corner, and bottom-left corner.

Failing to do so will give unexpected results. In my first attempt, the glasses folded over themselves and looked like a graphical glitch.

    # Get the dimensions of the frame
    face_h = frame.shape[0]
    face_w = frame.shape[1]
    # Find the homography matrix
    hom = cv2.findHomography(src_mat, dst_mat)[0]
    # Warp the image to fit the homography matrix
    warped = cv2.warpPerspective(img, hom, (face_w, face_h))

After quickly getting the dimensions of the video frame, the homography matrix is found using the built-in OpenCV method.

Next, the source image is warped so that the glasses are proportional to the face wearing them. The only thing left to do is project the warped image onto the video frame and display it.

    # Grab the alpha channel of the warped image and create a mask
    mask = warped[:,:,3]
    # Copy and convert the mask to a float and give it 3 channels
    mask_scale = mask.copy() / 255.0
    mask_scale = np.dstack([mask_scale] * 3)
    # Remove the alpha channel from the warped image
    warped = cv2.cvtColor(warped, cv2.COLOR_BGRA2BGR)

The mask for the image will be the alpha channel of the original image. A mask could also be created from a non-transparent image by using thresholding, but using a transparent image makes the process quicker and easier.
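
For reference, a hedged sketch of that thresholding alternative, assuming a glasses image drawn on a plain white background with no alpha channel (img_bgr is a hypothetical 3-channel image):

# Convert to grayscale, then treat any pixel darker than 240 as glasses
gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
_, mask = cv2.threshold(gray, 240, 255, cv2.THRESH_BINARY_INV)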

The mask is then normalized to the range 0–1 and stacked into three channels so that it can be multiplied against the other matrices.

At this point, the alpha channel has served its purpose and keeping it will actually cause more problems than it solves. As a result, the final line converts the image back into a normal three channel image.

    warped_multiplied = cv2.multiply(mask_scale, warped.astype("float"))
    image_multiplied = cv2.multiply(frame.astype("float"), 1.0 - mask_scale)
    output = cv2.add(warped_multiplied, image_multiplied)
    output = output.astype("uint8")

The mask is multiplied against the warped image, which isolates the glasses on a black, effectively "transparent," background. The video frame is then multiplied by the inverted mask, which creates a glasses-shaped gap in the frame. These two images are added together, and the result is a face wearing a pair of sunglasses.

    cv2.imshow("SnapTalk", output)

    if cv2.waitKey(60) & 0xff == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()

Finally, the image is displayed. An exit condition is written so that pressing the "q" key breaks the loop.

Once the loop is broken, the camera is turned off and any open windows are closed.

The Results

With any luck, the final result will look something like this:

I wore my finest pleather jacket to maximize the cool factor for this demonstration

The program automatically and accurately detects my eyes and projects the glasses onto my face in real time. Even as I change my expression and move my head across the screen, the glasses follow without much issue.

Because I use a cache to retain the position of the glasses even when my eyes aren’t detected, the glasses remain on the screen. Consequently, I can do the trick of pretending to hold my glasses while I wipe my eyes.

The glasses themselves tend to shift in size, which is a result of the Haar Cascades classifier. While it manages to find my eyes, the size of its detections constantly changes. Paired with the flat image of the sunglasses, the composite looks comically cartoonish. Additional work, such as using a rolling average for the size of the glasses (sketched below), may compensate for this, though it would add even more lag.
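
A hedged sketch of that rolling-average idea, which is not part of the program above; the names window, size_history, and smoothed_size are introduced here purely for illustration:

from collections import deque
import numpy as np

window = 10                        # number of recent frames to average over
size_history = deque(maxlen=window)

def smoothed_size(w, h):
    # Average the detected eye-box dimensions over the last few frames
    size_history.append((w, h))
    return tuple(np.mean(size_history, axis=0).astype(int))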

The lag in the program becomes quite noticeable in quick movements.

As the program runs, there is a noticeable lag when the subject moves quickly. As I shift around the screen, the glasses momentarily hover before the program finds my eyes again.

Compared to the previous demonstration, however, my movements here are more rapid than what a person would naturally do. Ideally, the program would find my eyes even at this speed, but it ultimately serves its purpose well enough.

Conclusions

While there are many tutorials and articles about face detection, using it for AR demonstrates the broader possibilities of these technologies. In this article, Haar Cascades were used to detect eyes, but various other pre-trained models exist for different parts of the body, which expands the range of possible applications. Likewise, as a general machine learning technique, a custom Haar Cascade may be trained to detect something no pre-trained model covers yet.

Paired with the novelty of AR, a simple Snapchat-style filter can be created. Homography provides a wonderful tool for projecting 2D images into video frames, producing a fun effect.

