Object Detection using Google AI Open Images

Learn to build your own self-driving car!!! …just kidding.


By Atindra Bandi, Alyson Brown, Sagar Chadha, Amy Dang, Jason Su


When was the last time you logged into your phone using nothing but your face? Or took a selfie with some friends and used a Snapchat filter that put fancy dog ears on your face? Did you know that these cool features are enabled by a neural network that not only recognizes that there is a face in the photo but also detects where the ears should go? Your phone, in a sense, can ‘see’ you, and it even knows what you look like!

The technology that helps computers ‘see’ is called “computer vision”. In recent years, computer vision applications have become increasingly commonplace thanks to an explosion in computing power that has made deep learning models faster and more feasible. Many companies such as Amazon, Google, Tesla, Facebook, and Microsoft are investing heavily in this technology and its applications.

Computer Vision Tasks

We focus on two main computer vision tasks — image classification and object detection.

  1. Image Classification focuses on grouping an image into a predefined category. To achieve this, we need many labeled images of the class of interest, and we train a computer to essentially convert pixel values into a label. Put simply, the computer sees a photo of a cat and says there is a cat in it.
  2. Object detection builds on an image classifier to figure out both what is present in an image and where. These tasks have been made easier by Convolutional Neural Networks (CNNs), which make it possible to detect multiple classes in a single pass over the image.
For more details on the difference between these tasks, please refer to the following article.

Computer vision is cool!

Recognizing that many interesting data science applications in the future will involve working with images, my team of budding data scientists and I decided to try our hand at the Google AI Open Images challenge hosted on Kaggle. We thought of this as the perfect opportunity to get our hands dirty with neural networks and convolutions, and potentially impress our professors and classmates. The challenge provided 1.7 million images with 12 million bounding box annotations (X and Y coordinates relative to the image) for 500 object classes. You can find the data here.

We highly recommend Andrew Ng’s Coursera course on Convolutional Neural Networks to anyone who wants to learn more about CNNs.

Getting Our Hands Dirty!

Exploratory Data Analysis — As with any data analysis, we began by exploring the images we had and the types of objects we needed to detect.

Frequency of Classes in the Training Dataset

A quick look at the training images revealed that certain classes appeared far more often than others. The chart above shows the distribution of the top 43 classes. There is a huge disparity, and it needed to be addressed somehow. In the interest of time and money (GPU costs are high :( ), we restricted ourselves to these 43 object classes and a subset of ~300K images containing them. We had about 400 images for each object class in the training data.

Choosing the Object Detection Algorithm

We considered various architectures such as VGG and Inception, but ultimately chose the YOLO algorithm because of its speed, its relatively modest computational requirements, and the abundance of online articles that could guide us through the process. Faced with computational and time constraints, we made two key decisions:

  1. Use a YOLO v2 model that had already been trained to identify certain objects.
  2. Leverage transfer learning to retrain its last convolutional layer to recognize previously unseen objects such as guitar, house, man/woman, bird, etc.

Inputs for YOLO

The YOLO algorithm requires some specific inputs -

  1. Input image size — The YOLO network is designed to work with specific input image sizes. We used images of size 608 × 608.
  2. Number of classes — 43. This is required to define the dimensions of the YOLO output.
  3. Anchor boxes — The number and dimensions of the anchor boxes to be used.
  4. Confidence and IoU thresholds — Thresholds that define which anchor boxes to keep and how to choose between overlapping boxes.
  5. Image names with bounding box information — For each image, we need to tell YOLO what is in it, in a specific format as shown below.
Sample input for YOLO

Below is the code snippet for YOLO inputs

Inputs into YOLO
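A simplified sketch of these inputs is shown below. Only the image size, grid size, class count, and anchor count match the numbers discussed above; the anchor dimensions, thresholds, and the annotation record are illustrative placeholders rather than our exact values.

```python
# Simplified sketch of the YOLO inputs -- anchor dimensions, thresholds, and
# the annotation record are illustrative placeholders, not our exact values.
IMAGE_H, IMAGE_W = 608, 608          # input image size fed to the network
GRID_H, GRID_W   = 19, 19            # output grid (608 / 32 = 19)
N_CLASSES        = 43                # number of object classes we trained on
N_ANCHORS        = 5                 # anchor boxes per grid cell

# Anchor widths and heights in grid-cell units (illustrative values; in
# practice these come from clustering the training bounding boxes)
ANCHORS = [(0.57, 0.68), (1.87, 2.06), (3.34, 5.47), (7.88, 3.53), (9.77, 9.17)]

CONF_THRESHOLD = 0.3                 # keep boxes only above this confidence
IOU_THRESHOLD  = 0.45                # overlap threshold for dropping duplicates

# Each training image is paired with its bounding boxes, e.g.
# (coordinates are pixels in the original image):
sample_annotation = {
    "filename": "0001.jpg",
    "objects": [
        {"class": "Guitar", "xmin": 120, "ymin": 80, "xmax": 430, "ymax": 540},
        {"class": "Man",    "xmin": 300, "ymin": 40, "xmax": 600, "ymax": 600},
    ],
}
```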

YOLO v2 Architecture

The architecture is shown below — it has 23 convolutional layers, each with its own batch normalization and Leaky ReLU activation, interspersed with max pooling layers.

Representation of the actual YOLO v2 architecture.
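As an illustration, one of these convolution stages can be written in Keras roughly as follows (the filter counts are illustrative, and the real network also mixes in 1 × 1 bottleneck convolutions and a passthrough connection):

```python
from tensorflow.keras.layers import (Input, Conv2D, BatchNormalization,
                                     LeakyReLU, MaxPooling2D)

def conv_block(x, filters, pool=True):
    """One YOLO v2-style stage: 3x3 convolution -> batch norm -> Leaky ReLU,
    optionally followed by 2x2 max pooling to halve the spatial resolution."""
    x = Conv2D(filters, (3, 3), padding="same", use_bias=False)(x)
    x = BatchNormalization()(x)
    x = LeakyReLU(alpha=0.1)(x)
    if pool:
        x = MaxPooling2D(pool_size=(2, 2))(x)
    return x

# e.g. the first two stages take a 608 x 608 x 3 image down to 152 x 152 x 64
inputs = Input(shape=(608, 608, 3))
x = conv_block(inputs, 32)
x = conv_block(x, 64)
```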

These layers extract important features from the image so that the various classes can be detected. For object detection, the YOLO algorithm divides the input image into a 19 × 19 grid, with 5 anchor boxes per grid cell. It then tries to detect classes within each grid cell and assigns each detected object to one of the cell’s 5 anchor boxes. The anchor boxes differ in shape and are intended to capture differently shaped objects within a grid cell.

For each anchor box, the YOLO algorithm outputs a vector containing a confidence score that an object is present (p_c), the bounding box coordinates (b_x, b_y, b_h, b_w), and one conditional class probability per class (c_1 through c_43).

Given that we had to train the algorithm for 43 classes with 5 anchor boxes on a 19 × 19 grid, the output dimensions were 19 × 19 × 5 × (5 + 43), i.e. 19 × 19 × 240.

These outputs give us, for each anchor box, the probability that it contains an object and the probabilities of which class that object belongs to. To filter out anchor boxes that don’t contain any object, or that capture the same object as another box, we use two thresholds: a confidence threshold to discard boxes that don’t contain any class with high confidence, and an IoU threshold to discard boxes that capture the same object as a higher-scoring box.
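A simplified sketch of that two-step filtering, assuming the network’s predictions have already been flattened into per-box arrays (the threshold values are illustrative):

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def filter_predictions(boxes, p_c, class_probs,
                       conf_threshold=0.3, iou_threshold=0.45):
    """boxes: (N, 4) corner coordinates; p_c: (N,) objectness scores;
    class_probs: (N, 43) class probabilities for each predicted box."""
    scores = p_c[:, None] * class_probs        # score of every (box, class) pair
    classes = scores.argmax(axis=1)            # best class per box
    best = scores.max(axis=1)

    keep = best >= conf_threshold              # step 1: confidence threshold
    boxes, classes, best = boxes[keep], classes[keep], best[keep]

    # Step 2: greedy non-max suppression -- keep the highest-scoring box and
    # drop any same-class box that overlaps it beyond the IoU threshold.
    kept = []
    for i in best.argsort()[::-1]:
        if all(classes[i] != classes[j] or iou(boxes[i], boxes[j]) < iou_threshold
               for j in kept):
            kept.append(i)
    return boxes[kept], classes[kept], best[kept]
```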

Below is the illustration of the last few layers of the YOLO v2 architecture:

Last few layers of YOLO v2 architecture (Only for illustration purposes)

Transfer Learning

Transfer learning is the idea of taking a neural network that has already been trained to classify images and adapting it for our specific purpose. This saves computation time since we don’t need to train all the weights from scratch — for instance, the YOLO v2 model we used has about 50 million weights, and training all of them could easily have taken 4–5 days on the Google Cloud instance we were using.

To successfully implement transfer learning, we had to make a few updates to our model:

  • Input image size — The model that we downloaded used input images of size 416 × 416. Since some of the objects we were training on were very small — birds, footwear — we didn’t want to squish the input image that much. For this reason, we used input images of size 608 × 608.
  • Grid size — We changed the grid dimensions so that the image is divided into 19 × 19 grid cells instead of the 13 × 13 that was the default for the model we downloaded.
  • Output layer — Since we were training on a different number of classes (43, versus the 80 the original model was trained on), the output layer was changed to produce the matrix dimensions discussed above.

We re-initialized the weights of YOLO’s last convolutional layer and trained it on our dataset, which eventually allowed the model to identify our new classes. Below is the code snippet for doing so:

Re-initializing the last convolution layer of YOLO
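In Keras, this can be sketched roughly as follows. Here `model` stands for the downloaded YOLO v2 Keras model with its output layer already resized for 43 classes; the layer name and the scaling of the random values are illustrative assumptions, not our exact code.

```python
import numpy as np

GRID_H, GRID_W = 19, 19

# `model` is assumed to be the pre-trained YOLO v2 Keras model; the layer
# name below is a placeholder for its final convolution layer.
last_conv = model.get_layer("conv_23")
kernel, bias = last_conv.get_weights()

# Overwrite the pre-trained weights of this layer with small random values so
# it can be retrained on our 43 classes; all earlier layers keep their
# pre-trained weights.
new_kernel = np.random.normal(size=kernel.shape) / (GRID_H * GRID_W)
new_bias   = np.random.normal(size=bias.shape) / (GRID_H * GRID_W)
last_conv.set_weights([new_kernel, new_bias])
```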

Cost Function

In any object detection problem, we want to identify the right object at the right place in an image with high confidence. There are 3 major components to the cost function:

  1. Classification Loss: the squared error of the class conditional probabilities when an object is detected. The loss therefore penalizes classification error only if an object is present in a grid cell.
  2. Localization Loss: the squared error between the predicted bounding box’s location and size and those of the ground-truth box, for the boxes responsible for detecting an object. To weight the loss from the bounding box coordinate predictions, a parameter λcoord is used. Furthermore, to make sure that small deviations matter less for large boxes than for small ones, the algorithm uses the square root of the bounding box width and height.
  3. Confidence Loss: the squared error of the bounding box’s confidence score. Most boxes are not responsible for detecting an object, so the equation is split into two parts: one for the boxes detecting an object and one for the remaining boxes. A weighting term λnoobj (default: 0.5) is applied to the latter part to down-weight the boxes not detecting an object.
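Putting the three components together, the combined loss as written in the original YOLO paper looks like this (S² is the number of grid cells, B the number of boxes per cell, and the indicator terms select the boxes that are, or are not, responsible for an object; the first two lines are the localization loss, the next two the confidence loss, and the last line the classification loss):

$$
\begin{aligned}
\mathcal{L} ={} & \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
& + \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
& + \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2
  + \lambda_{noobj} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c\,\in\,classes}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
$$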

Please feel free to refer to the original YOLO paper for a detailed look at the cost function.

The beauty of YOLO is that it uses errors that are easy to optimize with standard optimizers such as Stochastic Gradient Descent (SGD), SGD with momentum, or Adam. The code snippet below shows the parameters we used for optimizing the cost function.

Training algorithm for YOLO (Adam optimizer)
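A simplified sketch of that training setup; `custom_yolo_loss` and the data generators stand in for the corresponding pieces of our pipeline, and the hyperparameter values shown are illustrative rather than the exact ones we used.

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Illustrative training setup: `model`, `custom_yolo_loss`, and the data
# generators are assumed to be defined elsewhere in the training pipeline.
optimizer = Adam(learning_rate=0.5e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
model.compile(loss=custom_yolo_loss, optimizer=optimizer)

model.fit(
    train_generator,
    validation_data=valid_generator,
    epochs=50,
    callbacks=[
        EarlyStopping(monitor="val_loss", patience=3),                # stop when validation loss stalls
        ModelCheckpoint("yolo_openimages.h5", save_best_only=True),   # keep the best weights
    ],
)
```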

Output Accuracy — mean Average Precision (mAP Score):

There are many metrics for evaluating object detection models; for our project we decided to use the mAP score, which is the average of the maximum precision at different recall values, averaged over all IoU thresholds. To understand mAP, we’ll do a quick review of precision, recall, and IoU (intersection over union).

Precision & Recall

Precision measures the percentage of positive predictions that are correct. Recall measures the proportion of actual positives (ground-truth objects) that the model correctly identifies. These two values are inversely related and also depend on the score threshold you set for the model (in our case, the confidence score). Their mathematical definitions are presented below:

Precision = TP / (TP + FP) and Recall = TP / (TP + FN), where TP, FP, and FN are the counts of true positives, false positives, and false negatives, respectively.

Intersection over Union (IoU)

IoU measures the overlap between two regions: the area of their intersection divided by the area of their union. It tells us how well a predicted box (from the object detector) matches the ground truth (the true object boundary). To summarize, the mAP score is the mean AP over all IoU thresholds.
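In code, the measure boils down to a few lines; here is a small standalone version with a worked example (the box coordinates are illustrative):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Area of the overlapping rectangle (zero if the boxes don't intersect)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A prediction covering the right half of a 10 x 10 ground-truth box:
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))   # 50 / 150 = 0.33
```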

Results

Conclusion

Object detection is different from other computer vision tasks: you can start from a pre-trained model and adapt it to meet your needs. You’ll probably need GCP or another platform that provides more computing power. The math is hard, so read others’ articles and fail fast.

Lessons Learned

In the beginning, we found that the model was not able to predict many of the classes because many of them had only a few training images, which made the training dataset imbalanced. We therefore decided to use only the 43 most popular classes, which is not a perfect approach, but each of these classes had at least 500 images. Even so, our predictions’ confidence scores were still pretty low. To address this, we restricted training to images that contained our target classes.

Object detection is a very challenging topic, but don’t be scared: try to learn as much as possible from the many open resources online, like Coursera, YouTube instructional videos, GitHub, and Medium. All this free wisdom can help you succeed in this amazing field!

Future Work — Continuations or Improvements

  1. Train the model on more classes to detect a greater variety of objects. To reach this goal, we first need to solve the problem of imbalanced data. One potential solution is to collect more images of the rarer classes. Other options include:

a. Data Augmentation — Change existing images slightly to create new images.

b. Image duplication — Use the same image multiple times to train the algorithm on a specific rare class.

c. Ensemble — Train one model on the popular classes and another on the rare classes, and use predictions from both.

2. In addition, we can try an ensemble of different models, such as MobileNet, VGG, etc., which are convolutional neural network architectures that are also used for object detection.

If you’d like to take a detailed look into our team’s code, here’s the GitHub link. Please feel free to provide any feedback or comments!
