Hands-on Tutorials

Practical Guide to Semantic Segmentation

A baseline approach to detecting documents in images for further processing — Optical Character Recognition, Document Type Detection, Named Entity Extraction and similar tasks.

Alex Simkiv
Towards Data Science
7 min read · Dec 29, 2020


Photo by Daniil Silantev.

Introduction

Object detection and extraction, especially with semantic segmentation, is a well-studied problem with an almost standard set of solutions. So if you're a seasoned data scientist, most likely you won't find anything new or special in this article. But if you're relatively new to data science, and to image processing in particular, this article may serve as a practical tutorial on building your first semantic segmentation model. Please note that we're not going to discuss any theoretical basis for semantic segmentation. This is only an example of a code/solution that was developed as a baseline approach during a proof of concept.

Our journey began with a simple customer request: she needed to detect documents of a particular kind in images. To deliver this we needed a dataset of sample documents. Most datasets containing photos of documents do not provide the ground truth data we needed to train and evaluate our PoC. Fortunately, we found a dataset suitable for this particular problem: MIDV-500: A Dataset for Identity Documents Analysis and Recognition on Mobile Devices in Video Stream. All source document images used in MIDV-500 are either in the public domain or distributed under public copyright licenses.

OpenCV Approach

Before we start showing and explaining the code, we would like to point out that it usually takes a few futile attempts and trying out things that don't work before you get to a working result; it is a necessary learning curve. This time we started our efforts by performing some straightforward operations with OpenCV (a rough sketch of such a pipeline follows the list), like:

  • filtering
  • edge detection
  • brightness/contrast adjustment, etc.
  • thresholding the image
  • extracting contours
  • selecting the contour that fits our given criteria.
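As a rough illustration of this kind of pipeline (the blur kernel, Canny thresholds, and area criterion below are assumptions for the sketch, not the exact values we tried):

import cv2
import numpy as np

def find_document_contour(image, min_area_ratio=0.2):
    # filter noise and convert to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    # detect edges and close small gaps
    edges = cv2.Canny(blurred, 50, 150)
    edges = cv2.dilate(edges, np.ones((3, 3), np.uint8))
    # extract contours and keep the largest quadrilateral above the area threshold
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    best, best_area = None, 0
    for contour in contours:
        approx = cv2.approxPolyDP(contour, 0.02 * cv2.arcLength(contour, True), True)
        area = cv2.contourArea(approx)
        if len(approx) == 4 and area > min_area_ratio * gray.size and area > best_area:
            best, best_area = approx, area
    return best  # None when no suitable contour is found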

There were, of course, a few images where this relatively simple approach worked well (like the one in the bottom right corner of the bottom image). But for most of the images, we had to conclude that their colors, lighting, shadows, blur, and other conditions were so varied that one simple image processing algorithm couldn't cover them all.

Although we believed that there might exist an algorithm that can perform well under these conditions, we saw no point in continuing to work on this approach. So we switched to a little more complex approach: neural networks.

Semantic Segmentation Approach

Specifically, we decided to try semantic segmentation. That's mostly because we had already created a few such models, so developing a new one took only a few hours to write the generators and train the model. We didn't even tune hyperparameters, since we achieved our purpose on the very first try.

To create a model we need to prepare the data and implement the model itself. Generally, most NN models can be split into three main parts: the generator, the predictor, and the neural network itself. Additionally, there might be some trainers.

Data Preparation

We converted the data in our dataset (mentioned above) into the following format:

Here:

  • path — the path to the image inside the dataset
  • x0, y0, x1, y1, x2, y2, x3, y3 — the ground truth quadrangle coordinates
  • part — a number specifying the part of the dataset (the number that goes first in the name of each folder)
  • group — the background used in the image (the name of the folder containing the image)
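As an illustration, such a table can be assembled with pandas. The snippet below assumes the MIDV-500 folder layout with per-frame JSON ground-truth files that store the document quadrangle under a quad key; paths and file extensions may need adjusting for your copy of the dataset:

import glob
import json
import os
import pandas as pd

data_path = "path/to/midv500"          # root of the unpacked dataset (assumption)
records = []
for gt_file in glob.glob(os.path.join(data_path, "*", "ground_truth", "*", "*.json")):
    with open(gt_file) as f:
        quad = json.load(f)["quad"]    # four [x, y] corners of the document
    parts = gt_file.split(os.sep)
    doc_folder = parts[-4]             # e.g. "01_alb_id"
    record = {
        "path": gt_file.replace("ground_truth", "images").replace(".json", ".tif"),
        "part": int(doc_folder.split("_")[0]),
        "group": parts[-2],            # folder containing the image (the background)
    }
    for i, (x, y) in enumerate(quad):
        record[f"x{i}"] = x
        record[f"y{i}"] = y
    records.append(record)
df = pd.DataFrame(records)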

The Generator

The purpose of a generator is to provide batches for training. Therefore, it should provide both input and output images. An example of the initialization of such a generator may look like this:
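The exact code may differ, but a minimal sketch based on Keras' Sequence API and the dataframe format above could look like this (batch size and image size are placeholder values):

import os
import cv2
import numpy as np
from tensorflow.keras.utils import Sequence

class BatchGenerator(Sequence):
    def __init__(self, samples, data_path, batch_size=8, img_size=(256, 256)):
        self.samples = samples.reset_index(drop=True)   # part of the prepared dataframe
        self.data_path = data_path
        self.batch_size = batch_size
        self.img_size = img_size                        # (width, height)

    def __len__(self):
        return int(np.ceil(len(self.samples) / self.batch_size))

    def __getitem__(self, idx):
        rows = self.samples.iloc[idx * self.batch_size:(idx + 1) * self.batch_size]
        images, masks = [], []
        for _, row in rows.iterrows():
            image = cv2.imread(os.path.join(self.data_path, row["path"]))
            h, w = image.shape[:2]
            image = cv2.resize(image, self.img_size)
            # scale the ground-truth quadrangle to the resized image
            quad = np.array([[row[f"x{i}"], row[f"y{i}"]] for i in range(4)], dtype=np.float32)
            quad *= [self.img_size[0] / w, self.img_size[1] / h]
            # draw the quadrangle on a blank single-channel image
            mask = np.zeros((self.img_size[1], self.img_size[0]), dtype=np.float32)
            cv2.fillPoly(mask, [quad.astype(np.int32)], 1.0)
            images.append(image.astype(np.float32) / 255.0)
            masks.append(mask[..., None])
        return np.array(images), np.array(masks)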

Here the samples parameter is simply a part of the dataframe we prepared beforehand, and data_path is a path to the dataset. The idea is simple: iterate through all the images, resize them, draw the ground truth on a blank image, and store them in the corresponding variables. We believe the code above is self-explanatory (just read it sequentially and don't jump into the middle). Feel free to contact me if you have any doubts :-). Then we can return a batch using the __getitem__ method, as simple as that:

# create your generator
your_sample_generator = BatchGenerator(your_df,
                                       your_data_path,
                                       batch_size)
# return a batch
next(iter(your_sample_generator))
# return batch number i
your_sample_generator[i]

Neural Network Architecture

Now let’s create our neural network architecture. The provided architecture has proved to work well enough for a great variety of similar tasks.

The idea behind this model is as follows: we downsample the input image to a size of 8x8 while learning features about most of the regions. Then we pass those features to a few dense layers that, in effect, decide whether there is an ID card in the image and, if so, where it is located. Finally, we use that decision as well as the features computed during the downsampling part (all those Concatenate layers implement so-called skip connections in a CNN) to refine the exact shape of the prediction. Despite this decision block inside the model, it is still a semantic segmentation network that produces a probability map defining whether each pixel belongs to an ID card or not.
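The exact architecture is not reproduced here, but a minimal sketch of this downsample, decide, then upsample-with-skip-connections pattern could look like the following (the input size, filter counts, and dense layer sizes are illustrative assumptions):

from tensorflow.keras import layers, models

def build_model(input_shape=(256, 256, 3)):
    inputs = layers.Input(shape=input_shape)

    # encoder: downsample to 8x8 while collecting skip features
    skips = []
    x = inputs
    for filters in (8, 16, 32, 32, 64):             # 256 -> 8 after five poolings
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)

    # dense "decision" block operating on the 8x8 feature map
    d = layers.Flatten()(x)
    d = layers.Dense(64, activation="relu")(d)
    d = layers.Dense(8 * 8 * 8, activation="relu")(d)
    x = layers.Reshape((8, 8, 8))(d)

    # decoder: upsample and concatenate the corresponding encoder features
    for filters, skip in zip((64, 32, 32, 16, 8), reversed(skips)):
        x = layers.UpSampling2D(2)(x)
        x = layers.Concatenate()([x, skip])
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

    # per-pixel probability of belonging to the document
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return models.Model(inputs, outputs)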

Training The Model

Now we proceed to the training of our model. First, let’s split the data into training and testing sets:
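For instance, with a plain random split (the exact split strategy may differ; a smarter split is listed below among the possible improvements):

from sklearn.model_selection import train_test_split

# df is the dataframe prepared above; an 80/20 random split as a baseline
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)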

and initialize the generators:
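Using the BatchGenerator sketched above (the batch size is a placeholder):

train_generator = BatchGenerator(train_df, data_path, batch_size=8)
test_generator = BatchGenerator(test_df, data_path, batch_size=8)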

Build the model and compile it:
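For example, using the build_model sketch from above (the optimizer, loss, and metrics are assumptions; binary cross-entropy is a natural fit for a per-pixel probability map):

model = build_model()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()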

We get the following model architecture:

Well, the model appears to be not too big (832 KB). So, perhaps, it could even fit on a phone :-)

Finally, we can train our model:
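For example (the number of epochs and the callback below are placeholders):

from tensorflow.keras.callbacks import ModelCheckpoint

callbacks = [ModelCheckpoint("best_model.h5", save_best_only=True)]
history = model.fit(train_generator,
                    validation_data=test_generator,
                    epochs=50,
                    callbacks=callbacks)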

We used a few simple callbacks to monitor the training process, producing easy-to-interpret plots and storing training results for further use. If you are interested in the actual code, please let me know.

To analyze the model's results, we plot accuracy against the IoU threshold.
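One way to compute such a curve (a sketch; here we define accuracy as the fraction of test images whose predicted mask reaches a given IoU with the ground truth, which is our assumption about the metric):

import numpy as np
import matplotlib.pyplot as plt

def iou(pred_mask, true_mask, prob_threshold=0.5):
    pred = pred_mask > prob_threshold
    true = true_mask > 0.5
    union = np.logical_or(pred, true).sum()
    return np.logical_and(pred, true).sum() / union if union > 0 else 1.0

# collect one IoU score per test image
iou_scores = []
for batch_idx in range(len(test_generator)):
    images, true_masks = test_generator[batch_idx]
    pred_masks = model.predict(images)
    iou_scores.extend(iou(p, t) for p, t in zip(pred_masks, true_masks))

# accuracy at a given IoU threshold = share of images whose IoU reaches it
thresholds = np.linspace(0.0, 1.0, 21)
accuracy = [np.mean(np.array(iou_scores) >= t) for t in thresholds]
plt.plot(thresholds, accuracy)
plt.xlabel("IoU threshold")
plt.ylabel("accuracy")
plt.show()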

We can see that for an IoU threshold of 0.8 we reach ~71% accuracy. It's not perfect, but we didn't even perform any optimizations for this model, so there is room for improvement. Please find below an example of the model input, the ground truth, and the prediction:

Looks promising, right?

The Predictor

With all that said and done, we can finally extract ID cards from images. For this let’s prepare a few helper functions.

And use them in our final model:

Explanation: after the NN makes its prediction on a resized version of the given image, we threshold the result at 0.5 and search it for contours. After smoothing each contour, we check whether it has four edges and occupies at least the minimum allowed area. If so, we check whether it is the largest among such contours. The selected contour is then rescaled to the input image size, and the rectangle is extracted using OpenCV tools. The result looks like this:
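A sketch of this post-processing might look like the code below (the threshold values, minimum area ratio, and the single-function structure are illustrative assumptions, not the exact helpers we used):

import cv2
import numpy as np

def extract_document(image, model, input_size=(256, 256), min_area_ratio=0.1):
    h, w = image.shape[:2]
    # predict on the resized image and threshold the probability map at 0.5
    resized = cv2.resize(image, input_size).astype(np.float32) / 255.0
    prob_map = model.predict(resized[None, ...])[0, ..., 0]
    mask = (prob_map > 0.5).astype(np.uint8)

    # find contours and keep the largest quadrilateral above the area threshold
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    best_quad, best_area = None, 0
    for contour in contours:
        # smooth the contour with polygon approximation
        epsilon = 0.02 * cv2.arcLength(contour, True)
        approx = cv2.approxPolyDP(contour, epsilon, True)
        area = cv2.contourArea(approx)
        if len(approx) == 4 and area > min_area_ratio * mask.size and area > best_area:
            best_quad, best_area = approx, area
    if best_quad is None:
        return None

    # rescale the quadrangle back to the original image size
    quad = best_quad.reshape(4, 2).astype(np.float32)
    quad[:, 0] *= w / input_size[0]
    quad[:, 1] *= h / input_size[1]

    # warp the quadrangle to an upright rectangle (a robust version would
    # first reorder the corners into a consistent top-left-first order)
    rect = cv2.boundingRect(quad.astype(np.int32))
    dst = np.array([[0, 0], [rect[2], 0], [rect[2], rect[3]], [0, rect[3]]], dtype=np.float32)
    transform = cv2.getPerspectiveTransform(quad, dst)
    return cv2.warpPerspective(image, transform, (rect[2], rect[3]))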

That is, of course, not even close to what can be called a solution. There are still many things that can be improved and tested. Here are a few of them:

  1. Trying object detection instead of semantic segmentation (YOLO, perhaps)
  2. Tuning the hyperparameters, as well as trying a potentially better data split strategy
  3. Analyzing the nature of mistakes
  4. Improving the prediction processing algorithm to better select hardcoded values

This turns into an even greater list when you need to generalize this solution to process arbitrary documents. But still, this may be a good starting point for the task.

Acknowledgments

I want to thank my colleagues Andy Bosyi, Mykola Kozlenko, Volodymyr Sendetskyi, Viach Bosyi and Nazar Savchenko for fruitful discussions, cooperation, and helpful tips as well as the entire MindCraft.ai team for their constant support.

Alex Simkiv,

Data Scientist, MindCraft.ai

Information Technology & Data Science
