In-Depth Analysis

How I won the Flipkart ML challenge

Aman Deep
Towards Data Science
10 min read · Mar 21, 2019


Flipkart recently hosted its month-long annual machine learning challenge for students of Indian engineering colleges, with a total prize pool of ₹5,20,000 (~7,500 USD). My team won, out of roughly 6,700 participants. Before I tell you how we did it, keep in mind that I'd probably struggle to stay within the top 10% had this challenge not been confined to such a small target demographic. As an aside, this post will be mostly theoretical.

The problem statement

Given an image, localize the predominant object in the image. In other words, it’s object detection without the classification part. We just had to predict the bounding boxes.

Examples from the training set

There was one catch, however: pre-trained models were prohibited, but standard architectures were allowed. There were 24,000 training images and 24,045 test images.

The metric for scoring was mean intersection over union (IOU).

The IOU function
Custom metric to track mean IOU

I didn’t know tf.metrics.mean_iou existed at the time, so I created my own function, which is probably a lot slower.
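A minimal sketch of such a metric, assuming boxes are encoded as (x1, y1, x2, y2); this is illustrative rather than the exact function from my repository:

```python
import tensorflow as tf

def mean_iou(y_true, y_pred):
    """Mean IOU for boxes encoded as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    xi1 = tf.maximum(y_true[:, 0], y_pred[:, 0])
    yi1 = tf.maximum(y_true[:, 1], y_pred[:, 1])
    xi2 = tf.minimum(y_true[:, 2], y_pred[:, 2])
    yi2 = tf.minimum(y_true[:, 3], y_pred[:, 3])
    inter = tf.maximum(xi2 - xi1, 0.0) * tf.maximum(yi2 - yi1, 0.0)

    # Union = sum of the two box areas minus the intersection
    area_true = (y_true[:, 2] - y_true[:, 0]) * (y_true[:, 3] - y_true[:, 1])
    area_pred = (y_pred[:, 2] - y_pred[:, 0]) * (y_pred[:, 3] - y_pred[:, 1])
    union = area_true + area_pred - inter

    return tf.reduce_mean(inter / (union + tf.keras.backend.epsilon()))
```

Passing this to `model.compile(..., metrics=[mean_iou])` is enough to track it during training.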

Rudimentary analysis

At first glance, the dataset had a lot of images with plain backgrounds, so I tried Otsu thresholding, watershed segmentation, contour detection and the like with OpenCV. These didn’t give great results, obviously, so I’ll spare you the details.
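For context, that kind of classical baseline looks roughly like the sketch below: threshold the image, take the largest contour, and call its bounding rectangle the prediction. The function name and the light-background assumption are mine, not from the original pipeline.

```python
import cv2

def naive_box(image_bgr):
    """Classical baseline: Otsu threshold + largest contour -> bounding box.
    Only works for near-plain, light backgrounds; a sketch of the idea, not the contest solution."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (5, 5), 0)
    # Otsu picks the threshold automatically; THRESH_BINARY_INV assumes a light background
    _, mask = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # OpenCV 4.x returns (contours, hierarchy)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        h, w = gray.shape
        return 0, 0, w, h  # fall back to the whole image
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    return x, y, x + w, y + h
```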

Before trying out standard object detection architectures like YOLO, SSD, RetinaNet and R-CNNs, let’s remind ourselves that they have complex ways of balancing the localization and classification losses and are overkill for the problem at hand. Most importantly, I wasn’t confident I could fine-tune them well, so I didn’t spend a lot of time on them. All of them, however, use standard architectures like VGG or ResNet as feature extractors.

This problem is very similar to a classification problem. For a network to predict exact bounding boxes, it has to recognize what the objects look like, since we don’t have ImageNet weights to help us out. Ideally, if we have a good localization network, we should be able to turn it into a good classification network by replacing and training just the last layer.

So I took standard architectures, replaced the last layer with a dense layer with four outputs (x1, y1, x2, y2: the coordinates of the top-left and bottom-right corners), removed the activation and used a classical regression loss.
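In Keras, the setup looks roughly like this. It’s a sketch, not our exact training script: the choice of ResNet50 and the image size here are just for illustration, and `weights=None` reflects the no-pre-training rule.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

IMG_SIZE = 224  # final input size; smaller sizes were used earlier in training

# weights=None: pre-trained models were prohibited, so we start from random weights
backbone = ResNet50(weights=None, include_top=False,
                    input_shape=(IMG_SIZE, IMG_SIZE, 3))

x = layers.Flatten()(backbone.output)        # flatten the final feature map (no GAP here)
boxes = layers.Dense(4, activation=None)(x)  # (x1, y1, x2, y2), no activation
model = models.Model(backbone.input, boxes)

model.compile(optimizer="adam", loss="mae")  # a classical regression loss to start with
```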

Data augmentation

24,000 images wouldn’t cut it, so I applied brightness and contrast augmentations, flips along both axes, lateral shifts and so on, to end up with about 480,000 images. I didn’t use in-place methods like generators that spit out different versions of the images each epoch, because I wanted to make sure the network sees all of the data on every epoch. In retrospect, a generator would have worked just fine too.
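The transforms themselves were simple. Here is a minimal sketch of the kind of augmentations involved; the exact parameters and the helper below are illustrative, not our actual pipeline:

```python
import cv2
import numpy as np

def augment(image, box):
    """Yield augmented (image, box) pairs; box is (x1, y1, x2, y2)."""
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box

    # Horizontal flip: mirror the x coordinates
    yield cv2.flip(image, 1), (w - x2, y1, w - x1, y2)
    # Vertical flip: mirror the y coordinates
    yield cv2.flip(image, 0), (x1, h - y2, x2, h - y1)

    # Brightness / contrast jitter (box unchanged)
    for alpha, beta in [(1.2, 10), (0.8, -10)]:
        yield cv2.convertScaleAbs(image, alpha=alpha, beta=beta), box

    # Small lateral shift; a fixed 10-pixel shift to the right as an illustration
    M = np.float32([[1, 0, 10], [0, 1, 0]])
    shifted = cv2.warpAffine(image, M, (w, h))
    yield shifted, (min(x1 + 10, w), y1, min(x2 + 10, w), y2)
```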

You might think that half a million images is still nothing next to ImageNet’s roughly 14 million, but the images in the training set fell into only about 10 broad classes, nowhere near the number of classes in ImageNet, so we’ll be fine.

Training techniques

Before discussing my best architecture, here are the training techniques we used.

  • Progressive size increments: Popularized by Jeremy Howard, this is a technique where we pass in downsized images first and progressively increase the size as learning begins to stagnate. It keeps the network from overfitting on the noise present in larger images and helps it converge faster. We used images resized to 64, 96, 128 and finally 224 pixels, in that order. This is particularly helpful when starting from random weights: by the time we reach 224, we already have a network that works well on smaller images, almost like transfer learning.
  • Tempering: Here we start training on the original dataset with a large batch size. Once the network starts overfitting, we swap the smaller dataset for the augmented one and reduce the batch size, introducing regularization by virtue of gradient noise. In a way, we get the network acquainted with the kind of data it should expect before overwhelming it.
  • Stochastic Gradient Descent with Restarts (SGDR): This is a learning rate scheduler (not an optimizer), which we used along with Adam. It gradually decreases the learning rate for a fixed number of epochs before resetting it, so that the network can pop out of a local minimum if it gets stuck. The cycle length can be increased over time for a gradual descent into wider, more generalizable minima (a minimal callback sketch appears right after this list). This blog explains it well.
SGDR with a cycle multiplier of 1
  • Snapshot ensembling: As a consequence of using SGDR, the network converges to several different minima over the course of training. Saving a checkpoint of the weights at each minimum and ensembling them together gives great results, as explored in the paper Snapshot Ensembles: Train 1, Get M for Free.
Save a checkpoint at the flagged points and ensemble them together for better results
  • Learning rate finder: Finding a good learning rate is, obviously, crucial for convergence. Use a standard learning rate finder and set the upper limit of SGDR to the value at which the loss declines most steeply. Set the lower limit a little earlier, at a smaller rate where the loss is still decreasing, just more slowly.
  • Getting rid of the GAP layer: Not really a training technique, but we observed that networks with a Global Average Pooling layer before the last layer performed about 3% worse, on average, than those without it, mainly because with GAP the final layer has far fewer inputs to learn from.
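Here is a minimal Keras sketch of SGDR combined with snapshot saving as a single callback. The learning-rate bounds, cycle length and filenames are placeholders; our actual values came from the learning rate finder.

```python
import math
import tensorflow as tf

class SGDRSnapshot(tf.keras.callbacks.Callback):
    """Cosine-annealed learning rate with warm restarts; saves a snapshot at each restart.
    A sketch: the bounds and cycle length below are assumed, not the values we actually used."""

    def __init__(self, lr_max=1e-3, lr_min=1e-5, cycle_len=10, mult=1):
        super().__init__()
        self.lr_max, self.lr_min = lr_max, lr_min
        self.cycle_len, self.mult = cycle_len, mult
        self.epoch_in_cycle, self.snapshot_id = 0, 0

    def on_epoch_begin(self, epoch, logs=None):
        # Cosine decay from lr_max down to lr_min over the current cycle
        t = self.epoch_in_cycle / max(self.cycle_len - 1, 1)
        lr = self.lr_min + 0.5 * (self.lr_max - self.lr_min) * (1 + math.cos(math.pi * t))
        tf.keras.backend.set_value(self.model.optimizer.learning_rate, lr)

    def on_epoch_end(self, epoch, logs=None):
        self.epoch_in_cycle += 1
        if self.epoch_in_cycle >= self.cycle_len:
            # End of a cycle: we are near a minimum, so save a snapshot for ensembling
            self.model.save_weights(f"snapshot_{self.snapshot_id}.weights.h5")
            self.snapshot_id += 1
            self.epoch_in_cycle = 0
            self.cycle_len *= self.mult  # a multiplier > 1 lengthens later cycles

# Usage: model.fit(..., callbacks=[SGDRSnapshot()]); average the snapshots' predictions at inference.
```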

Keras callbacks discussed here can be found in this repository. PyTorch users can use a higher level library like fastai which has all these already implemented.

Network architectures

A skip connection
  • ResNet: The first thing I try for any vision-related task is to train a small ResNet as a benchmark. ResNets solve the vanishing gradient problem of deep CNNs by introducing the skip connection, which makes it easier for a network to learn an identity function. Using all the techniques mentioned above, we trained a ResNet34 to 93.41% mean IOU on the public leaderboard.
  • ResNeXt: This is a later-generation ResNet with some improvements and a new hyperparameter. A layer in a ResNeXt works much like a neuron: the activations from the previous layer are received through C different paths, each carrying out the same operations, and the results are then added together. Here, C is called the cardinality of the network. The paths are mutually disconnected, which increases sparsity somewhat and helps reduce overfitting.
Residual block in ResNet vs residual block in ResNeXt with C = 32

Higher cardinalities are observed to outperform lower ones, and a ResNeXt with a large C beats simply making a ResNet wider. Since there is just one new hyperparameter, this network is a lot easier to train and fine-tune on a new dataset than any network from the Inception family. A ResNeXt-101 with a cardinality of 32 was our best-performing model. This blog explains ResNeXts in detail.
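To make the split-transform-merge idea concrete, here is a toy Keras block in the ResNeXt style. It is illustrative only: the widths and the explicit per-path loop (rather than grouped convolutions) are my simplifications, and the actual model was a standard ResNeXt-101.

```python
from tensorflow.keras import layers

def resnext_block(x, cardinality=32, bottleneck_width=4, out_channels=256, stride=1):
    """Split-transform-merge residual block in the style of ResNeXt."""
    shortcut = x
    paths = []
    for _ in range(cardinality):
        # Each path applies the same 1x1 -> 3x3 -> 1x1 bottleneck transform independently
        p = layers.Conv2D(bottleneck_width, 1, strides=stride, padding="same", activation="relu")(x)
        p = layers.Conv2D(bottleneck_width, 3, padding="same", activation="relu")(p)
        p = layers.Conv2D(out_channels, 1, padding="same")(p)
        paths.append(p)
    merged = layers.Add()(paths)  # the C paths are summed together

    # Match the shortcut's shape if needed, then add it back in (the skip connection)
    if stride != 1 or shortcut.shape[-1] != out_channels:
        shortcut = layers.Conv2D(out_channels, 1, strides=stride, padding="same")(shortcut)
    return layers.Activation("relu")(layers.Add()([merged, shortcut]))
```

Stacking blocks like this (with the usual stage-wise widths and strides) is what gives you a full ResNeXt.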

Activation function

On our first try, the ResNeXt101 performed better than the ResNet34, but not by a great margin. Learning had stagnated even though our validation loss was somewhat lower than our training loss. We saved a checkpoint and quickly hooked up a new classification layer to one of the intermediate layers in the network and tried training it. Ideally, the loss should still decrease even though it won’t go as low as the last layer loss. But this wasn’t the case. Our new classifier stagnated as well. This is quite indicative of the dying ReLU problem.

When using ReLU activations, a large negative gradient update can push a neuron into a regime where its pre-activation is negative for every input, and since ReLU outputs 0 for negative inputs, the neuron dies: it stops responding to gradient changes entirely. Dying ReLUs aren’t necessarily a problem, because dead neurons can still act as useful constants (biases) even though they have lost their ability to learn. However, if you have a lot of dead neurons, learning stagnates. In our case, this probably happened because there isn’t enough variation in the data to fit a model with so many parameters.

To solve this, we could use modifications of ReLU like leaky ReLU or SELU. These functions also have the mean of the output activations closer to zero, self-regularizing the network in a way.

Versions of leaky ReLU
The SELU function, where α = 1.6732 and λ = 1.0507 for standard scaled inputs.

Due to lack of time, we risked it with SELU, retrained the whole thing, and it seemed to work well. This boosted our final mean IOU by about 0.5%.
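Swapping the activation is a small change in code. Below is a minimal sketch, on a generic convolution block rather than our exact ResNeXt layers; note that SELU is meant to be paired with lecun_normal initialization and standardized inputs.

```python
from tensorflow.keras import layers

def conv_block(x, filters, kernel_size=3):
    """Convolution block using SELU instead of ReLU.
    SELU is self-normalizing when paired with lecun_normal initialization and scaled inputs."""
    x = layers.Conv2D(filters, kernel_size, padding="same",
                      kernel_initializer="lecun_normal")(x)
    return layers.Activation("selu")(x)
```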

Loss function

Up to this point I had been trying well-known methods, but understanding why standard loss functions are flawed for the metric we’re trying to optimize was key to squeezing out the final decimal places of accuracy.

Consider a case where the target object is 100 pixels wide and 100 pixels in height. Assume we have a network capable of predicting the dimensions correctly but offsets the predicted bounding box one pixel to the right and one pixel above.

Case 1: 100 x 100 bounding boxes. Target box marked in green, prediction marked in red

In this case, the total absolute error over the box corners is 4 × 1.414 and the total squared error is 4 × 2, while the IOU is 96.097%.

For this case, the IOU is 0.96097, while the L1 error is 1.414 and the squared error is 2 per corner

Consider another case where the target object is 400 pixels wide and 400 pixels in height. Let’s assume the same network predicts the dimensions accurately, but offsets the coordinates by one pixel in each direction.

Case 2: 400 x 400 bounding boxes. Target box marked in green, prediction marked in red

In this case, the total absolute and squared errors are the same as before, but the IOU is higher, at 99.006%.

For this case, the IOU is 0.99006, while L1 and L2 losses are the same as before

These inconsistencies show us that for the same MAE or MSE, the IOU can be different. Hence, we want a loss function that penalizes the network more for errors on bounding boxes with smaller target areas.

To further justify the discrepancy, let’s consider a more extreme case.

For the same L1 loss and L2 loss, the IOU is very different this time

Though a pixel distance of 70.71 isn’t possible, hopefully you get the point. For the exact same MAE and MSE, we can find vastly different IOU scores. This is a problem.

To perform well on this metric, we need to redefine our loss function. Let’s see why IOU based functions don’t work very well either.

  • 1 − IOU: Ranges between 0 and 1. There isn’t enough variation between values and the gradients are always very small, causing very slow convergence.
  • 1 / IOU: Plenty of variation between values and large gradients, but the function is unbounded: for boxes with no intersection, the loss explodes to infinity. Adding a small value to the denominator is just a hack and doesn’t really solve the problem.

This is the loss function that I finally came up with:

Loss = (U − I) / At, where U is the union, I is the intersection and At is the area of the target bounding box

U − I gives the extra (non-overlapping) area that we want to minimize, and At is the factor we scale it by, so that the network is penalized more for errors on smaller objects. Let’s call this function the scaled loss, for lack of a better term.

The scaled loss function

This new loss function gives a value of 0.0398 for the 100 × 100 pixel case and 0.0099 for the 400 × 400 pixel case, about four times less. It has a large gradient and will never explode to infinity (assuming the area of the target bounding box is never zero).
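A direct TensorFlow translation of this scaled loss might look like the sketch below, with boxes again assumed to be (x1, y1, x2, y2):

```python
import tensorflow as tf

def scaled_loss(y_true, y_pred):
    """(U - I) / At: the extra (non-overlapping) area scaled by the target box area.
    Assumes valid boxes with x1 < x2 and y1 < y2."""
    xi1 = tf.maximum(y_true[:, 0], y_pred[:, 0])
    yi1 = tf.maximum(y_true[:, 1], y_pred[:, 1])
    xi2 = tf.minimum(y_true[:, 2], y_pred[:, 2])
    yi2 = tf.minimum(y_true[:, 3], y_pred[:, 3])
    inter = tf.maximum(xi2 - xi1, 0.0) * tf.maximum(yi2 - yi1, 0.0)

    area_true = (y_true[:, 2] - y_true[:, 0]) * (y_true[:, 3] - y_true[:, 1])
    area_pred = (y_pred[:, 2] - y_pred[:, 0]) * (y_pred[:, 3] - y_pred[:, 1])
    union = area_true + area_pred - inter

    # Never explodes as long as the target area is non-zero
    return tf.reduce_mean((union - inter) / area_true)
```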

Sometimes during training, the loss would suddenly explode to inf and the IOU would become nan. This can only happen when the target area is zero. Analyzing the training labels, I found that x1 was greater than x2 for a few images. Correcting these mislabeled examples got rid of the problem.
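A quick sanity check like the one below catches such labels; the CSV filename and column names here are hypothetical, since the actual label format isn’t shown in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical label format: one row per image with columns x1, y1, x2, y2
labels = pd.read_csv("training_labels.csv")

# Flag rows where the "top-left" corner is not actually top-left
bad = (labels["x1"] > labels["x2"]) | (labels["y1"] > labels["y2"])
print(f"{bad.sum()} mislabeled boxes")

# Fix by swapping coordinates so that x1 <= x2 and y1 <= y2 (keeps the target area positive)
x_lo, x_hi = np.minimum(labels["x1"], labels["x2"]), np.maximum(labels["x1"], labels["x2"])
y_lo, y_hi = np.minimum(labels["y1"], labels["y2"]), np.maximum(labels["y1"], labels["y2"])
labels["x1"], labels["x2"] = x_lo, x_hi
labels["y1"], labels["y2"] = y_lo, y_hi
```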

There’s one drawback to this function: when there is no intersection, it gives the same loss no matter how far apart the two boxes lie, because it is purely area-based. The loss landscape it forms also isn’t very smooth.

To overcome both these problems, we used this composite loss function after the network reached about 90% mean IOU on MAE alone.

The composite loss function
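The exact composite formula appeared only as an image in the original post, so the sketch below simply assumes an unweighted sum of the scaled loss (from the previous sketch) and the per-coordinate MAE; the real weighting may differ.

```python
import tensorflow as tf

def composite_loss(y_true, y_pred):
    """Scaled area loss plus a coordinate-distance term (assumed combination).
    The MAE term restores a distance signal for non-intersecting boxes
    and smooths the loss landscape."""
    mae = tf.reduce_mean(tf.abs(y_true - y_pred))
    return scaled_loss(y_true, y_pred) + mae
```

In practice we would recompile the model with this loss once it had reached roughly 90% mean IOU on MAE alone.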

Conclusion

That’s about it. Using all these techniques, we ended up with a network that predicts 23 images per second when fed one by one and 73 images per second when fed mini-batches. We achieved rank 12 on the public leaderboard with a mean IOU of 94.8813% and rank 1 on the private leaderboard with a mean IOU of 94.7079%, which I believe reflects the least amount of overfitting.

All custom metrics, losses and callbacks are uploaded to this repository for anyone to use.

Thank you for your time.

Photo by Rob Bates on Unsplash
