ConvNets Series. Actual Project Prototyping with Mask R-CNN

Kirill Danilyuk
Towards Data Science
8 min read · Apr 23, 2018


Intro

This ConvNets series progresses from a toy dataset of German traffic signs to a more practical, real-life problem I was asked to solve: “Is it possible to implement a piece of deep learning magic that distinguishes good quality dishes from those of bad quality using only photos as a single input?”. In a nutshell, the business wanted this:

When a business looks at ML through rose-colored glasses, they imagine this

This is an example of an ill-posed problem: it is impossible to figure out whether a solution exists, is unique, and is stable, as the very definition of done is extremely vague (let alone the implementation). While this post is not about efficient communication or project management, one remark is necessary: you should never commit to badly scoped projects. One proven way to cope with this ambiguity is to build a well-defined prototype first and structure the rest of the task afterwards. That was the strategy we took.

Problem Definition

In my prototype, I focused on a single item in the menu — an omelette—and constructed an extensible data pipeline that outputs the perceived “quality” of the omelette. It can be summarized as this:

  • Problem type: multiclass classification, 6 discrete classes of quality: [good, broken_yolk, overroasted, two_eggs, four_eggs, misplaced_pieces].
  • Dataset: 351 manually collected DSLR camera photos of various omelettes. Train/val/test: 139/32/180 shuffled photos.
  • Labels: a subjective quality class is assigned to each of the photos.
  • Metric: categorical cross-entropy.
  • Necessary domain knowledge: a “good” omelette is defined as an omelette of three eggs with unbroken yolks, some bacon, no burnt pieces, a single piece of parsley in the center. Also, its composition should be visually correct, e.g., there should be no scattered pieces.
  • Definition of done: best possible cross-entropy on the test set after two weeks of prototyping.
  • Results visualization: t-SNE of the test set’s low-dimensional data representation.
Input images as captured by the camera

The main goal was to combine the extracted signals in a neural network classifier and let the classifier make its softmax predictions over the quality classes of the items in the test set. Such a goal would render the prototype viable and applicable for later use. Here are the signals we extracted and found useful:

  • Key ingredient masks (Mask R-CNN): Signal #1.
  • Key ingredient counts grouped by ingredient (basically a vector of per-ingredient counts for each image): Signal #2.
  • RGB crops of the plates with omelettes, background removed. For simplicity, I decided not to add them to the model for now. This might have been the most obvious signal: just train a ConvNet classifier on these images with some fancy loss function and take an L2 distance from a chosen paragon image to the current one in a low-dimensional embedding. Unfortunately, I had no chance to test this hypothesis, as I was limited to only 139 samples in the training set.

General 50K Pipeline Overview

I am omitting several important stages, such as data discovery and exploratory analysis, baseline solutions, and the active labeling pipeline for Mask R-CNN (my own fancy name for semi-supervised instance annotation, inspired by the Polygon-RNN demo video); more on these in later posts. To embrace the overall pipeline, here is its 50K-foot view:

We are mostly interested in the Mask R-CNN and classification stages of the pipeline

For the rest of this post, I’m focusing on three stages: [1] Mask R-CNN for ingredient masks inference, [2] Keras-based ConvNet classifier, [3] results visualization with t-SNE.

Stage 1: Mask R-CNN and Masks Inference

Mask R-CNN (MRCNN) has gotten a lot of coverage and hype recently. Starting with the original paper from Facebook AI Research and moving on to the Data Science Bowl 2018 on Kaggle, Mask R-CNN has proved itself a powerful architecture for instance segmentation (object-aware segmentation). Matterport's Keras-based implementation of MRCNN, which I used, is an absolute pleasure to work with. The code is well-structured, nicely documented, and works right out of the box, though slower than I expected.

MRCNN in one paragraph:

MRCNN consists of two definitive parts, the backbone network and the network head, thus inheriting the Faster R-CNN architecture. The convolutional backbone network, a ResNet101 optionally extended with a Feature Pyramid Network (FPN), works as a feature extractor over the whole image. On top of it lies the Region Proposal Network (RPN), which proposes multiscale RoIs (regions of interest) for the head. The network head performs bounding box recognition and mask prediction on each RoI. In between, the RoIAlign layer finely aligns the RPN-extracted multiscale features with the input.

MRCNN framework as presented in the original paper

For practical applications, and especially for prototyping, having a pretrained ConvNet is critical. In many real-life scenarios, a data scientist has a very limited annotated dataset, or even no annotations whatsoever. In contrast, ConvNets require large labeled datasets to converge (e.g., the ImageNet dataset contains 1.2M labeled images). This is where transfer learning helps: one strategy is to freeze the weights of the convolutional layers and retrain only the classifier. Freezing the convolutional weights is important with a small dataset, as it saves the model from overfitting.
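As a minimal sketch of this strategy, here is roughly how heads-only training looks with Matterport's implementation; the dataset objects, paths, and config values are my assumptions for illustration:

```python
import mrcnn.model as modellib
from mrcnn.config import Config

class OmeletteConfig(Config):
    NAME = "omelette"
    NUM_CLASSES = 1 + 8   # background + 8 ingredient mask classes
    IMAGES_PER_GPU = 1

config = OmeletteConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="./logs")

# Start from COCO weights, skipping the head layers whose shapes
# depend on the number of classes.
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])

# layers="heads" freezes the backbone and retrains only the network head;
# dataset_train and dataset_val are assumed mrcnn.utils.Dataset subclasses.
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=1, layers="heads")
```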

Here is a sample of what I got after a single epoch of training:

The result of instance segmentation: all key ingredients are detected

The next stage (Process Inferenced Data for Classifier in my 50K pipeline view) is to crop the image part that contains the plate and to extract 2D binary masks for each ingredient from that crop:

Cropped image with the target dish and its key ingredients as binary masks
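A minimal sketch of this step using Matterport's inference API; approximating the plate crop by the union of the detected boxes is my assumption here, and the real pipeline may locate the plate differently:

```python
import numpy as np

# model is the same MaskRCNN object, rebuilt in "inference" mode
results = model.detect([image], verbose=0)
r = results[0]   # dict with "rois", "masks", "class_ids", "scores"

# Crop window that contains all detections (a proxy for the plate region);
# each RoI is (y1, x1, y2, x2).
y1, x1 = r["rois"][:, :2].min(axis=0)
y2, x2 = r["rois"][:, 2:].max(axis=0)
crop = image[y1:y2, x1:x2]

# Per-instance binary masks, cut to the same window.
# r["masks"] has shape (H, W, num_instances) and dtype bool.
instance_masks = r["masks"][y1:y2, x1:x2, :]
```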

These binary masks are then combined into an 8-channel image (as I defined 8 mask classes for MRCNN); that is my Signal #1:

Signal #1: 8-channel image composed of binary masks. Colors are just for better visualization
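Continuing the snippet above, building this stack amounts to merging the instance masks into one binary channel per ingredient class; the 1-based class-id convention (0 is background) follows Matterport's:

```python
NUM_INGREDIENTS = 8
h, w = instance_masks.shape[:2]
signal_1 = np.zeros((h, w, NUM_INGREDIENTS), dtype=np.float32)

for i, class_id in enumerate(r["class_ids"]):
    channel = class_id - 1   # class ids are 1-based; 0 is the background
    signal_1[..., channel] = np.maximum(
        signal_1[..., channel],
        instance_masks[..., i].astype(np.float32))
```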

For Signal #2, I calculated the counts of each ingredient from the MRCNN inference and packed them into a feature vector for each crop, as in the sketch below.
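Again continuing the snippet above, a one-liner does the grouping:

```python
# Count detections per class id, then drop the background slot (index 0).
signal_2 = np.bincount(r["class_ids"], minlength=NUM_INGREDIENTS + 1)[1:]
```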

Stage 2: Keras-Based ConvNet Classifier

The CNN classifier was implemented from scratch in Keras. The goal I had in mind was to fuse several signals (Signal #1 and Signal #2, with room to add more data in the future) and let the network make its predictions regarding the quality class of the dish. The following architecture is experimental and far from ideal:

Several observations and comments about the classifier architecture:

  • Multiscale convolutions module: initially I selected a 5x5 kernel for the convolutional layers, but that decision got me only to a satisfactory score. A better result was achieved by applying AveragePooling2D to several parallel convolutional layers with various kernels: 3x3, 5x5, 7x7, 11x11. An additional 1x1 convolutional layer was added in front of each of them to reduce dimensionality. This component slightly resembles the Inception module, though I restrained myself from building a deep network (see the sketch after this list).
  • Larger kernels: I used larger kernel sizes because large-scale features can easily be extracted from the input image (which itself can be viewed as an activation layer with 8 filters: each ingredient's binary mask is basically a filter).
  • Signals fusion: my naive implementation used a single layer of non-linearity to merge the two feature sets: processed binary masks (Signal #1) and ingredient counts (Signal #2). Despite its naivety, adding Signal #2 provided a nice boost to the score (it improved cross-entropy from 0.8 to the 0.70–0.72 range).
  • Logits: in terms of TensorFlow, this is the layer on which tf.nn.softmax_cross_entropy_with_logits is applied to calculate the batch loss.
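Here is a hedged reconstruction of that architecture in the Keras functional API; the input sizes, filter counts, and dense widths are my assumptions, and only the overall structure (1x1 reductions, parallel 3/5/7/11 kernels, fusion with the counts vector, softmax over the 6 quality classes) follows the description above:

```python
from keras.layers import (Input, Conv2D, AveragePooling2D, Flatten,
                          Dense, Concatenate, Activation)
from keras.models import Model

masks_in = Input(shape=(128, 128, 8), name="signal_1")  # stacked binary masks
counts_in = Input(shape=(8,), name="signal_2")          # ingredient counts

# Multiscale convolutions module: a 1x1 reduction in front of each
# larger-kernel convolution, average pooling on every branch.
branches = []
for k in (3, 5, 7, 11):
    b = Conv2D(8, 1, activation="relu")(masks_in)
    b = Conv2D(16, k, padding="same", activation="relu")(b)
    branches.append(AveragePooling2D(pool_size=4)(b))

x = Flatten()(Concatenate()(branches))
x = Concatenate()([x, counts_in])       # naive signal fusion
x = Dense(64, activation="relu")(x)     # the single non-linear merge layer
logits = Dense(6, name="logits")(x)     # 6 quality classes
probs = Activation("softmax", name="probs")(logits)

model = Model(inputs=[masks_in, counts_in], outputs=probs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```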

Stage 3: Results Visualization with t-SNE

For test set results visualization I used t-SNE, a manifold learning technique for data visualization. t-SNE minimizes the KL-divergence between the joint probabilities of the low-dimensional embedded data points and those of the original high-dimensional data, using a notably non-convex loss function. You should definitely read the original paper; it is extremely informative and well-written.

To visualize the test set classification results, I ran inference on the test set images, extracted the logits layer of the classifier, and applied t-SNE to this dataset. Though I should have played with different perplexity values, the results look pretty nice anyway. Animated GIF:

t-SNE of the test set predictions by the classifier
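This step takes a few lines with scikit-learn; test_logits and test_labels are assumed to be the extracted (n_samples, 6) logits and the ground-truth classes, and the perplexity here is just the default, worth sweeping:

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

embedding = TSNE(n_components=2, perplexity=30,
                 random_state=0).fit_transform(test_logits)
plt.scatter(embedding[:, 0], embedding[:, 1], c=test_labels, cmap="tab10")
plt.show()
```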

While not perfect, such an approach does work. There are, though, lots of improvements to be made:

  • More data. ConvNets require lots of data, while I had only 139 samples for training. Tricks like data augmentation work great (I used a D4, or dihedral, symmetry group augmentation that resulted in 2K+ augmented images; see the sketch after this list), but more real data is critical for good performance.
  • Suitable loss function. For simplicity I used categorical cross-entropy loss that works out-of-box. I would switch to a more suitable loss function, the one that better leverages intra-class variance. A good option to start with might have been a triplet loss (see FaceNet paper for details).
  • Better overall classifier architecture. The current classifier is basically a prototype whose goal is to interpret input binary masks and combine multiple feature sets into a single inference pipeline.
  • Better labeling. I was quite sloppy with the manual image labeling (6 classes of quality): the classifier outperformed me on a dozen test set images!
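For reference, here is a minimal sketch of the D4 augmentation mentioned above: the 8 symmetries of a square, i.e., 4 rotations with and without a flip:

```python
import numpy as np

def d4_augment(image):
    """Return the 8 dihedral-group (D4) transforms of an HxWxC image."""
    variants = []
    for k in range(4):
        rotated = np.rot90(image, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))
    return variants
```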

Outro and introspection. It is very common in practice (and we should stop denying this fact) that businesses have no data, no annotations, and no clear, well-articulated technical task to be accomplished. This is a good thing (otherwise, why would they need you?): it is your job to have the tools, enough multi-GPU hardware, a combination of business and technical expertise, pretrained models, and everything else needed to bring value to the business.

Start small: a working prototype that can be built from LEGO blocks of code can boost the productivity of further conversations—and it is your job as a data scientist to suggest such an approach to business.
