Vehicle Detection and Tracking From a Front-Facing Camera

Alberto Escarlate
Towards Data Science
Oct 13, 2017 · 6 min read


Udacity Self-Driving Car Engineer Nanodegree, term 1, assignment 5.

Goal

This is the report created for the fifth and final assignment of the first term of the Udacity Self-Driving Car Engineer Nanodegree. The challenge was to create an algorithm that detects other vehicles on the road, using video captured by a front-facing camera.

This is the GitHub repository.

Feature Extraction

In order to detect vehicles — or any other objects — we need to know what differentiates them from the rest of the image captured by the camera. Colors and gradients are good differentiators, but the most useful features depend on the appearance of the objects we want to detect.

Color alone as a feature can be problematic. Relying on the distribution of color values (or color histograms) may end up finding matches in unwanted regions of the image. Gradients can offer a more robust representation: their presence in specific directions around the center can translate into a notion of shape. However, a problem with using raw gradients is that they make the signature too sensitive to small variations in shape and position.

HOG — Histogram of oriented gradients

HOG is a Computer Vision technique that counts occurrences of gradient orientation in localized portions of an image. If we compute the gradient magnitudes and directions at each pixel and then group them into small cells, we can use this “star” histogram to establish the dominant gradient direction for each cell.

“Star” histogram of gradient directions

With that, even small shape variations will keep the signature unique enough for matching. Parameters such as the number of orientation bins, the size of the cell grid, the size of the cells, and the overlap between cells can be used to fine-tune the algorithm.
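To make this concrete, here is a minimal sketch of computing a HOG descriptor for a single image channel with skimage; the parameter values match the ones used later in this project, and the input image is just a random placeholder.

import numpy as np
from skimage.feature import hog

# Placeholder for a single channel of a 64x64 training patch
gray = np.random.randint(0, 255, (64, 64)).astype(np.uint8)

# 9 orientation bins, 8x8-pixel cells, 2x2-cell blocks
features = hog(gray,
               orientations=9,
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2),
               feature_vector=True)
print(features.shape)  # 1764 values: 7x7 blocks x 2x2 cells x 9 bins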

HOG, color space, spatial and color binning parameters

In the final feature extraction function we applied a color transformation to YCrCb. We then concatenated the spatially binned color features and the color histogram features with the HOG features. The HOG computation was done on all three color channels.

color_space = 'YCrCb'     # Color space used for all features
spatial_size = (16, 16)   # Spatial binning dimensions
hist_bins = 32            # Number of histogram bins
orient = 9                # HOG orientations
pix_per_cell = 8          # HOG pixels per cell
cell_per_block = 2        # HOG cells per block
hog_channel = "ALL"       # Can be 0, 1, 2, or "ALL"
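For reference, the sketch below shows one plausible way to combine these pieces into a single feature vector per training image. It is not the exact project code: the helper names bin_spatial and color_hist are illustrative, and it reuses the parameters defined above.

import cv2
import numpy as np
from skimage.feature import hog

def bin_spatial(img, size=(16, 16)):
    # Down-sample the image and flatten it into a feature vector
    return cv2.resize(img, size).ravel()

def color_hist(img, nbins=32):
    # Histogram each channel separately and concatenate the results
    hists = [np.histogram(img[:, :, ch], bins=nbins, range=(0, 256))[0]
             for ch in range(3)]
    return np.concatenate(hists)

def extract_features(rgb_img):
    # Convert to the YCrCb color space used for all features
    feature_img = cv2.cvtColor(rgb_img, cv2.COLOR_RGB2YCrCb)

    spatial_features = bin_spatial(feature_img, size=spatial_size)
    hist_features = color_hist(feature_img, nbins=hist_bins)

    # HOG on all three channels, concatenated
    hog_features = np.hstack([
        hog(feature_img[:, :, ch], orientations=orient,
            pixels_per_cell=(pix_per_cell, pix_per_cell),
            cells_per_block=(cell_per_block, cell_per_block),
            feature_vector=True)
        for ch in range(3)])

    return np.concatenate([spatial_features, hist_features, hog_features])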

Training the classifier

Once the features were extracted it was time to build and train the classifier. The approach, as introduced in the lessons, is to build a classifier that can distinguish between car and non-car images. This classifier is then run across the entire picture by sampling small patches. Each patch is then classified as car or non-car.

Datasets

I used labeled data for vehicle and non-vehicle examples in order to train the classifier. These example images come from a combination of the GTI vehicle image database, the KITTI vision benchmark suite, and examples extracted from the project video itself.

The data was split into a training set and a test set after being randomly shuffled to avoid possible ordering effects in the data.

Training basically means extracting the feature vectors for every image in the training set. These vectors and their respective labels (car or non-car) feed the training algorithm, which iteratively updates the model until the error between predicted and actual labels is small enough (or stops decreasing after a number of iterations).

Normalizing magnitude of feature vectors

Before starting to train the classifier we normalized the feature vectors to zero mean and unit variance. This is necessary because there is a difference in magnitude between the color-based and gradient-based features, and the features with larger magnitudes would otherwise dominate the classifier.

from sklearn.preprocessing import StandardScaler

# X is the stacked array of feature vectors, one row per image
# Fit a per-column scaler
X_scaler = StandardScaler().fit(X)
# Apply the scaler to X
scaled_X = X_scaler.transform(X)

Support Vector Machine classifier

As suggested in the lessons, we used a support vector machine to classify the data. We also implemented a Decision Tree classifier, but its accuracy was not promising, so we focused on fine-tuning the SVM.

3.26 Seconds to train SVC...
Test Accuracy of SVC = 0.9926
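A simplified version of the training step might look like the sketch below, assuming car_features and notcar_features are lists of feature vectors extracted from the labeled datasets. This is illustrative, not the exact project code.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Stack car and non-car feature vectors and build the label vector
X = np.vstack((car_features, notcar_features)).astype(np.float64)
y = np.hstack((np.ones(len(car_features)), np.zeros(len(notcar_features))))

# Normalize as shown above
X_scaler = StandardScaler().fit(X)
scaled_X = X_scaler.transform(X)

# Shuffle and split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    scaled_X, y, test_size=0.2, random_state=42)

# Linear SVM classifier
svc = LinearSVC()
svc.fit(X_train, y_train)
print('Test Accuracy of SVC =', round(svc.score(X_test, y_test), 4))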

The Sliding Window approach

Now that the classifier is trained, we have to search for cars in each frame. The premise is that we define patches in the image and run the classifier against each patch. The classifier then decides whether that patch contains a car or not.

In the sliding window technique, we define a grid over the image and move across it, extracting the trained features at each step. The classifier gives a prediction for every position and tells whether that grid element contains car features.

Given the perspective of the images, objects far from the camera appear smaller and cars close by appear larger. So it makes sense that the grid subdivision uses different window sizes depending on the position in the image. We also disregard any area above the horizon, ignoring regions with sky, mountains and trees.
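A rough sketch of such a multi-scale window search is shown below. The window sizes, overlap and vertical limits are illustrative values, and it reuses the hypothetical extract_features helper along with the trained svc and X_scaler from above; img stands for a single RGB video frame.

import cv2

def slide_window(img, y_start, y_stop, window_size, overlap=0.5):
    # Generate window corner coordinates over the search region
    step = int(window_size * (1 - overlap))
    windows = []
    for y in range(y_start, y_stop - window_size + 1, step):
        for x in range(0, img.shape[1] - window_size + 1, step):
            windows.append(((x, y), (x + window_size, y + window_size)))
    return windows

def search_windows(img, windows, clf, scaler):
    hot_windows = []
    for (x1, y1), (x2, y2) in windows:
        # Resize the patch to the training size and extract the same features
        patch = cv2.resize(img[y1:y2, x1:x2], (64, 64))
        features = scaler.transform(extract_features(patch).reshape(1, -1))
        if clf.predict(features)[0] == 1:
            hot_windows.append(((x1, y1), (x2, y2)))
    return hot_windows

# Smaller windows near the horizon, larger ones close to the camera
windows = (slide_window(img, 400, 500, 64) +
           slide_window(img, 400, 560, 96) +
           slide_window(img, 400, 660, 128))
hot_windows = search_windows(img, windows, svc, X_scaler)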

Tracking Issues

It is not surprising that the classifier returns a good number of false positives: regions with no cars, where lighting or texture fools the classifier into calling them a car. This is obviously a big issue for a self-driving car application. False positives can cause the car to change direction or activate the brakes, a potential cause of accidents.

To filter out the false positives we record the positions of all detections in each frame and compare them with the detections found in subsequent frames. Clusters of detections are likely an actual car. On the other hand, detections that appear in one frame but not in the next are likely false positives.

Heatmap and bounding boxes

The search function find_cars returns an array of “hot” boxes that the classifier predicts contain a vehicle. We created a heatmap to identify the clusters of overlapping boxes, and with apply_threshold we then remove assumed false positives.

import collections
import numpy as np
from scipy.ndimage.measurements import label

# Keep the heatmaps from the last 10 frames
heatmaps = collections.deque(maxlen=10)

def process_frame(source_img):
    # Boxes the classifier predicts contain a vehicle
    out_img, boxes = find_cars(source_img, ystart, ystop, scale, svc,
                               X_scaler, orient, pix_per_cell,
                               cell_per_block, spatial_size,
                               hist_bins, False)

    # Build the heatmap for the current frame and accumulate it
    current_heatmap = np.zeros_like(source_img[:, :, 0]).astype(np.float64)
    current_heatmap = add_heat(current_heatmap, boxes)
    heatmaps.append(current_heatmap)
    heatmap_sum = sum(heatmaps)

    # Threshold the accumulated heatmap to reject false positives
    heat_map = apply_threshold(heatmap_sum, 2)
    heatmap = np.clip(heat_map, 0, 255)
    labels = label(heatmap)
    labeled_box_img = draw_labeled_bboxes(source_img, labels)
    return labeled_box_img

The examples above are for individual frames. As mentioned before, we want to take the time sequence into account so that one-off detections are discarded. We used a deque data structure to always keep the heatmaps from the last 10 frames.
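The helper functions referenced above are not shown in this post. A minimal sketch of what add_heat, apply_threshold and draw_labeled_bboxes could look like follows; only the names come from the pipeline code, the implementations here are illustrative.

import cv2
import numpy as np

def add_heat(heatmap, boxes):
    # Add +1 heat inside every detected box
    for (x1, y1), (x2, y2) in boxes:
        heatmap[y1:y2, x1:x2] += 1
    return heatmap

def apply_threshold(heatmap, threshold):
    # Zero out pixels with too few overlapping detections
    heatmap[heatmap <= threshold] = 0
    return heatmap

def draw_labeled_bboxes(img, labels):
    # Draw one bounding box around each labeled cluster
    for car_number in range(1, labels[1] + 1):
        ys, xs = (labels[0] == car_number).nonzero()
        cv2.rectangle(img, (int(xs.min()), int(ys.min())),
                      (int(xs.max()), int(ys.max())), (0, 0, 255), 6)
    return img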

Pipeline and video

Summarizing our overall tracking pipeline:

  • Work on labeled datasets to define training and testing sets.
  • Extract features from dataset images.
  • Train the classifier.
  • For each video frame: run a search using a sliding window technique and filter out false positives.
Generated video
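The annotated video itself can be produced by mapping process_frame over every frame of the input clip. A short sketch using moviepy, with placeholder file names:

from moviepy.editor import VideoFileClip

# Apply the frame pipeline to every frame of the project video
clip = VideoFileClip('project_video.mp4')
out_clip = clip.fl_image(process_frame)
out_clip.write_videofile('project_video_output.mp4', audio=False)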

Considerations

It’s noticeable that some detections were of cars traveling in the opposite direction on the highway. While this isn’t necessarily a bad thing, we could avoid them by narrowing the area of consideration.

Instances where a car enters the video frame and is detected with a little delay can be addressed by tweaking the number of frames kept in the heatmap deque.

More recent techniques using deep neural networks can improve feature detection by increasing accuracy, reducing the occurrence of false positives and boosting performance. Tiny YOLO seems to be a common approach.

Janai et al. (2017) published an assessment of how techniques have improved and challenges still in need of a solution.

This was a great project to wrap up term 1. I can see how much I have learned since the very first assignment of finding lanes using OpenCV, and I’m looking forward to more learning in term 2. Thanks to the awesome Udacity team for making this possible.
