Borderless Tables Detection with Deep Learning and OpenCV

How to build your own object detector and turn semi-structured blocks of data in an image into machine-readable text

Volodymyr Holomb
Towards Data Science


Image by author

Document parsing

Document parsing is an initial step in transforming information into valuable business data. That information is often stored within commercial documents in tabular form, or incidentally in data blocks without distinctive graphical borders. A borderless table may simplify the visual perception of semi-structured data for us humans. From a machine-reading point of view, however, presenting information this way on a page has quite a few shortcomings that make it difficult to separate the data belonging to a presumptive table structure from the surrounding textual context.

Tabular data extraction as a business challenge may have several ad-hoc or heuristic rules-based solutions, which will definitely fail with a table of a slightly different layout or style. At a larger scale, one should use a more general approach for identifying table-like structures in an image, more specifically a deep learning-based object detection approach.

Scope of this tutorial:

  • Deep learning-based object detection
  • Installation and setup of TF2 Object Detection API
  • Data preparation
  • Model configuration
  • Model training and saving
  • Table detection and cell recognition in a real-life image

Deep learning-based object detection

Adrian Rosebrock, a well-known CV researcher, states in his “Gentle guide to deep learning object detection” that “object detection, regardless of whether performed via deep learning or other computer vision techniques, builds on image classification and seeks to localize precisely an area where an object appears”. One approach to building a custom object detector, as he suggests, is to choose any classifier and precede it with an algorithm that selects and provides regions of an image that may contain an object. Within this method, you are free to decide whether to use a traditional ML algorithm for image classification (utilising a CNN as a feature extractor or not) or to train a simple neural network to handle arbitrarily large datasets. Despite its proven efficiency, this two-stage object detection paradigm, known as R-CNN, still relies on heavy computations and is not suitable for real-time applications.

It is further said in the above-mentioned post that “another approach is to treat a pre-trained classification network as a base (backbone) network in a multi-component deep learning object detection framework (such as Faster R-CNN, SSD, or YOLO)”. This way, you benefit from a complete, end-to-end trainable architecture.

Whichever you choose, it will eventually lead you to the issue of overlapping bounding boxes. Later on we will touch upon performing non-maximum suppression to deal with it.

Meanwhile, please refer to the transfer-learning flow-chart of an object detector for an arbitrary new class:

Image by author

Because it is generally quicker, less tedious, and more accurate, the second approach has become widely adopted for recognising table-like structures in commercial and scientific papers. As an example, you can easily find implementations using YOLO, RetinaNet, Cascade R-CNN, and other frameworks for tabular data extraction from PDF documents.

Moving forward with this tutorial, you’ll learn how to use tools like the TensorFlow 2 (TF2) Object Detection API to build your own custom object detector with ease, starting from pre-trained state-of-the-art models.

Before you start

Be aware that this will not be an exhaustive introduction to deep learning object detection, but rather a phase-by-phase description of interacting with the TF2 Object Detection API (and other tools) to solve a distinct business problem (borderless table detection) within a specific development environment (Anaconda/Win10). Throughout the rest of this post, we will cover some aspects and results of our modelling process in greater detail than others. Nonetheless, you’ll find the essential code examples to follow our experiment. To proceed, you should have Anaconda and Tesseract installed and protobuf downloaded and added to PATH.

Installation and setup of TF2 Object Detection API

Under a path of your choice, create a new folder, which we will refer to hereinafter as the project’s root folder. From your terminal window, run the following commands one by one:

# from <project's root folder>
conda create -n <new environment name> ^
python=3.7 ^
tensorflow=2.3 ^
numpy=1.17.4 ^
tf_slim ^
cython ^
git
conda activate <new environment name>
git clone https://github.com/tensorflow/models.git
pip install git+https://github.com/philferriere/cocoapi.git#subdirectory=PythonAPI
cd models\research

# from <project's root folder>\models\research
protoc object_detection\protos\*.proto --python_out=.
copy object_detection\packages\tf2\setup.py .
python setup.py install
python object_detection\builders\model_builder_tf2_test.py
conda install imutils pdf2image beautifulsoup4 typeguard
pip install tf-image
copy object_detection\model_main_tf2.py ..\..\workspace\.
copy object_detection\exporter_main_v2.py ..\..\workspace\.
cd ..\..

This will install the core and helper libraries needed to use the TF2 Object Detection API and to take care of your training dataset in your local environment. From this step on, you should be able to download a pretrained model from the TF2 Model Garden and get inferences from it for the respective pretrained classes.
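
As a quick smoke test, you can load the saved_model from a downloaded checkpoint and run it on an arbitrary image. The sketch below assumes the EfficientDet D1 archive from the TF2 Model Garden has already been unpacked under workspace\pretrained_models (mirroring the project tree shown later); adjust the path to your own layout:

import numpy as np
import tensorflow as tf

# assumed location of the unpacked EfficientDet D1 checkpoint from the TF2 Model Garden
MODEL_DIR = r"workspace\pretrained_models\datasets\efficientdet_d1_coco17_tpu-32\saved_model"

detect_fn = tf.saved_model.load(MODEL_DIR)

# any RGB image will do for a smoke test; here we feed random noise
image = np.random.randint(0, 255, (640, 640, 3), dtype=np.uint8)
input_tensor = tf.convert_to_tensor(image)[tf.newaxis, ...]

detections = detect_fn(input_tensor)
print(detections["detection_classes"][0][:5])  # COCO class ids of the top detections
print(detections["detection_scores"][0][:5])   # and their confidence scores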

Data preparation

I hope you’ve succeeded so far! Please bear in mind that our final goal is to perform transfer learning with a pretrained model so that it detects a single ‘borderless’ class, which the model knew nothing about during its initial training. If you have studied our transfer-learning flow-chart, you should have noticed that the starting point for the whole process is a dataset, whether annotated or not. If you need annotation, there are tons of solutions available. Pick one that gives you annotations in XML, which is what our example is compatible with.

The more annotated data we have, the better (important: all table images for this post were selected from open data sources like this one and annotated/re-annotated by the author). But as soon as you try your hand at manual data labelling, you’ll understand how tedious this work is. Unfortunately, none of the popular Python libraries for image augmentation takes care of the selected bounding boxes. It is in our interest to multiply the initial dataset without the high cost of collecting and annotating new data. That is where the tf-image package comes in handy.
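
The original augmentation gist is not reproduced here; as a rough stand-in, the sketch below illustrates the core idea with a single transform (a random horizontal flip) using plain Pillow and ElementTree instead of tf-image’s own API. The file-path arguments and the PASCAL-VOC-like XML layout (a bndbox element per object) are assumptions based on how our annotations are structured:

import random
import xml.etree.ElementTree as ET
from PIL import Image

def random_flip_with_boxes(img_path, xml_path, out_img_path, out_xml_path, p=0.5):
    """Randomly flip an image horizontally and update its bounding boxes to match."""
    image = Image.open(img_path)
    tree = ET.parse(xml_path)
    if random.random() < p:
        width = image.width
        image = image.transpose(Image.FLIP_LEFT_RIGHT)
        for box in tree.getroot().iter("bndbox"):
            xmin = int(box.find("xmin").text)
            xmax = int(box.find("xmax").text)
            box.find("xmin").text = str(width - xmax)  # mirror the box around the vertical axis
            box.find("xmax").text = str(width - xmin)
    image.save(out_img_path)
    tree.write(out_xml_path)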

The above script will randomly transform the original image along with the object’s bounding boxes and save both the new image and corresponding XML file to disk. That is how our dataset looks after three-fold expansion:

Image by author

The next steps will include splitting the data into train and test sets. Models based on the TF2 Object Detection API need a special format for all input data, called TFRecord. You’ll find corresponding scripts to split and convert your data in the Github repository.
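
For reference, here is a minimal sketch of how a single annotated image could be packed into the TFRecord layout the TF2 Object Detection API expects (the feature keys are the standard ones used by the API’s dataset tools). The helper name and the boxes/label_map arguments are purely illustrative; the scripts in the repository are what we actually used:

import tensorflow as tf

def make_tf_example(encoded_jpg, filename, width, height, boxes, label_map):
    """Pack one image and its boxes into a tf.train.Example.
    `boxes` is a list of (xmin, ymin, xmax, ymax, class_name) in pixels;
    `label_map` maps class names to the integer ids from label_map.pbtxt."""
    def _bytes(values): return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))
    def _floats(values): return tf.train.Feature(float_list=tf.train.FloatList(value=values))
    def _ints(values): return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

    return tf.train.Example(features=tf.train.Features(feature={
        "image/height": _ints([height]),
        "image/width": _ints([width]),
        "image/filename": _bytes([filename.encode("utf8")]),
        "image/source_id": _bytes([filename.encode("utf8")]),
        "image/encoded": _bytes([encoded_jpg]),
        "image/format": _bytes([b"jpg"]),
        "image/object/bbox/xmin": _floats([b[0] / width for b in boxes]),
        "image/object/bbox/ymin": _floats([b[1] / height for b in boxes]),
        "image/object/bbox/xmax": _floats([b[2] / width for b in boxes]),
        "image/object/bbox/ymax": _floats([b[3] / height for b in boxes]),
        "image/object/class/text": _bytes([b[4].encode("utf8") for b in boxes]),
        "image/object/class/label": _ints([label_map[b[4]] for b in boxes]),
    }))

# each example is then serialized and written out with tf.io.TFRecordWriter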

Model configuration

At this step, we’ll create a Label Map file (.pbtxt) to link our class label (‘borderless’) to some integer value. The TF2 Object Detection API needs this file for training and detection purposes:

item {
  id: 1
  name: 'borderless'
}

The actual model configuration happens in the corresponding pipeline.config file. You can read an intro to model configuration and decide whether to configure the file manually or by running a script from the Github repository.
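
Whichever way you choose, the fields to pay attention to are mostly the same. Below is a trimmed, illustrative excerpt of what our pipeline.config might contain (paths and batch size are assumptions; the field names follow the standard TF2 Object Detection API schema, where EfficientDet is configured via the ssd block):

model {
  ssd {
    num_classes: 1  # only the 'borderless' class
    ...
  }
}
train_config {
  batch_size: 4
  fine_tune_checkpoint: "workspace/pretrained_models/datasets/efficientdet_d1_coco17_tpu-32/checkpoint/ckpt-0"
  fine_tune_checkpoint_type: "detection"
  ...
}
train_input_reader {
  label_map_path: "workspace/data/label_map.pbtxt"
  tf_record_input_reader { input_path: "workspace/data/train.record" }
}
eval_input_reader {
  label_map_path: "workspace/data/label_map.pbtxt"
  tf_record_input_reader { input_path: "workspace/data/val.record" }
}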

By now your project’s root folder might look something like this:

📦borderless_tbls_detection
┣ 📂images
┃ ┣ 📂processed
┃ ┃ ┣ 📂all_annots
┃ ┃ ┃ ┗ 📜…XML
┃ ┃ ┗ 📂all_images
┃ ┃ ┃ ┗ 📜…jpg
┃ ┣ 📂splitted
┃ ┃ ┣ 📂test_set
┃ ┃ ┃ ┣ 📜…jpg
┃ ┃ ┃ ┗ 📜…XML
┃ ┃ ┣ 📂train_set
┃ ┃ ┃ ┣ 📜…jpg
┃ ┃ ┃ ┗ 📜…XML
┃ ┃ ┗ 📂val_set
┃ ┗ 📜xml_style.XML
┣ 📂models
┃ ┗ 📂…
┣ 📂scripts
┃ ┣ 📜…py
┣ 📂train_logs
┣ 📂workspace
┃ ┣ 📂data
┃ ┃ ┣ 📜label_map.pbtxt
┃ ┃ ┣ 📜test.csv
┃ ┃ ┣ 📜test.record
┃ ┃ ┣ 📜train.csv
┃ ┃ ┣ 📜train.record
┃ ┃ ┣ 📜val.csv
┃ ┃ ┗ 📜val.record
┃ ┣ 📂models
┃ ┃ ┗ 📂efficientdet_d1_coco17_tpu-32
┃ ┃ ┃ ┗ 📂v1
┃ ┃ ┃ ┃ ┗ 📜pipeline.config
┃ ┣ 📂pretrained_models
┃ ┃ ┗ 📂datasets
┃ ┃ ┃ ┣ 📂efficientdet_d1_coco17_tpu-32
┃ ┃ ┃ ┃ ┣ 📂checkpoint
┃ ┃ ┃ ┃ ┃ ┣ 📜checkpoint
┃ ┃ ┃ ┃ ┃ ┣ 📜ckpt-0.data-00000-of-00001
┃ ┃ ┃ ┃ ┃ ┗ 📜ckpt-0.index
┃ ┃ ┃ ┃ ┣ 📂saved_model
┃ ┃ ┃ ┃ ┃ ┣ 📂assets
┃ ┃ ┃ ┃ ┃ ┣ 📂variables
┃ ┃ ┃ ┃ ┃ ┃ ┣ 📜variables.data-00000-of-00001
┃ ┃ ┃ ┃ ┃ ┃ ┗ 📜variables.index
┃ ┃ ┃ ┃ ┃ ┗ 📜saved_model.pb
┃ ┃ ┃ ┃ ┗ 📜pipeline.config
┃ ┃ ┃ ┗ 📜efficientdet_d1_coco17_tpu-32.tar.gz
┃ ┣ 📜exporter_main_v2.py
┃ ┗ 📜model_main_tf2.py
┣ 📜config.py
┗ 📜setup.py

Model training and saving

We’ve done a lot of work to get here, and now everything is ready to start training. Here is how to do that:

# from <project's root folder>
tensorboard --logdir=<logs folder>
set NUM_TRAIN_STEPS=1000
set CHECKPOINT_EVERY_N=1000
set PIPELINE_CONFIG_PATH=<path to model's pipeline.config>
set MODEL_DIR=<logs folder>
set SAMPLE_1_OF_N_EVAL_EXAMPLES=1
set NUM_WORKERS=1
python workspace\model_main_tf2.py ^
--pipeline_config_path=%PIPELINE_CONFIG_PATH% ^
--model_dir=%MODEL_DIR% ^
--checkpoint_every_n=%CHECKPOINT_EVERY_N% ^
--num_workers=%NUM_WORKERS% ^
--num_train_steps=%NUM_TRAIN_STEPS% ^
--sample_1_of_n_eval_examples=%SAMPLE_1_OF_N_EVAL_EXAMPLES% ^
--alsologtostderr

# (optionally, in a parallel terminal window)
python workspace\model_main_tf2.py ^
--pipeline_config_path=%PIPELINE_CONFIG_PATH% ^
--model_dir=%MODEL_DIR% ^
--checkpoint_dir=%MODEL_DIR%

Now you can monitor the training process in your browser at http://localhost:6006:

Image by author

To export your model after training is done just run the following command:

# from <project's root folder>
python workspace\exporter_main_v2.py ^
--input_type=image_tensor ^
--pipeline_config_path=%PIPELINE_CONFIG_PATH% ^
--trained_checkpoint_dir=%MODEL_DIR% ^
--output_directory=saved_models\efficientdet_d1_coco17_tpu-32

Table detection and cell recognition in an image

NMS and IoU

With our newly fine-tuned model saved, we can begin detecting tables in documents. Earlier we mentioned an unavoidable issue with any object detection system: overlapping bounding boxes. Considering the over-segmented nature of the borderless tables we are dealing with, our model will occasionally output more bounding boxes for a single object than you would expect. After all, it’s a sign that our object detector is firing correctly. To remove overlapping bounding boxes that refer to the same object, we can use non-maximum suppression (NMS).
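
TensorFlow ships a ready-made implementation of NMS, so a minimal sketch of this step could look as follows (the helper name and thresholds are illustrative; boxes and scores are assumed to come straight from the detector’s output dictionary):

import tensorflow as tf

def suppress_overlaps(boxes, scores, max_boxes=10, iou_threshold=0.5):
    """Keep only the highest-scoring box from every group of heavily overlapping detections."""
    keep = tf.image.non_max_suppression(
        boxes, scores, max_output_size=max_boxes, iou_threshold=iou_threshold)
    return tf.gather(boxes, keep).numpy(), tf.gather(scores, keep).numpy()

# e.g. boxes = detections["detection_boxes"][0], scores = detections["detection_scores"][0]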

Here is how the inferences of our detector look originally and after performing non-maximum suppression:

It seems we have successfully resolved the issue of overlapping rectangles predicted around a single object, but our detections still fall short of the ground-truth bounding boxes. That is bound to happen, as no model is perfect. We can measure the accuracy of our detector with the Intersection over Union (IoU) ratio. As the numerator, we compute the area of overlap between the predicted bounding box and the ground-truth bounding box. As the denominator, we compute the area encompassed by both the predicted and the ground-truth bounding boxes. An IoU score > 0.5 is normally considered a ‘good’ prediction [Rosebrock, 2016].
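
In code, this ratio boils down to a few lines; the sketch below assumes boxes given as (xmin, ymin, xmax, ymax) in pixel coordinates:

def intersection_over_union(box_a, box_b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / float(area_a + area_b - intersection)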

For some images from our test set we have the following metrics:

Cell recognition and OCR

These are the final steps of our three-part algorithm: after (1) the table is detected, we will (2) recognize its cells with OpenCV (since the table is borderless) and carefully allocate them to the proper rows and columns, and then proceed with (3) extracting the text from each allocated cell through Optical Character Recognition (OCR) with pytesseract.

Most cell recognition algorithms are based on the line structure of a table: clear, detectable lines are necessary for proper identification of cells. As our table has none, we will manually reconstruct the table grid by searching for white vertical and horizontal gaps on a thresholded and resized image. This approach is somewhat similar to the one utilised here.
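
A minimal sketch of that gap search is shown below; the function names and the min_gap parameter are illustrative, and the resizing step is omitted for brevity:

import cv2
import numpy as np

def find_separators(table_img, min_gap=5):
    """Find fully white horizontal and vertical bands on a thresholded table crop;
    their centres serve as the reconstructed row and column separators."""
    gray = cv2.cvtColor(table_img, cv2.COLOR_BGR2GRAY)
    _, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    row_is_blank = np.all(thresh == 255, axis=1)  # pixel rows containing no text
    col_is_blank = np.all(thresh == 255, axis=0)  # pixel columns containing no text
    return _run_centres(row_is_blank, min_gap), _run_centres(col_is_blank, min_gap)

def _run_centres(mask, min_len):
    """Centres of every run of True values that is at least min_len long."""
    centres, start = [], None
    for i, val in enumerate(list(mask) + [False]):  # the sentinel closes a trailing run
        if val and start is None:
            start = i
        elif not val and start is not None:
            if i - start >= min_len:
                centres.append((start + i) // 2)
            start = None
    return centres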

Once this step is done, we can find contours with OpenCV (i.e. our cell borders), then sort and allocate them into a table-like structure:
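
The original gist is not included here, but a rough sketch of the idea follows (using OpenCV 4’s findContours signature); grid_mask is assumed to be a binary image of the reconstructed grid, and the row_tolerance value is illustrative:

import cv2

def boxes_by_row(grid_mask, row_tolerance=10):
    """Turn cell contours into bounding boxes and group them top-to-bottom into rows,
    sorting each row left-to-right."""
    contours, _ = cv2.findContours(grid_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = sorted((cv2.boundingRect(c) for c in contours), key=lambda b: b[1])  # (x, y, w, h)
    rows, current, last_y = [], [], None
    for box in boxes:
        if last_y is not None and abs(box[1] - last_y) > row_tolerance:
            rows.append(sorted(current, key=lambda b: b[0]))  # close the previous row
            current = []
        current.append(box)
        last_y = box[1]
    if current:
        rows.append(sorted(current, key=lambda b: b[0]))
    return rows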

The whole workflow is shown in the chart below:

Image by author

At this point, we have all our boxes and their values sorted in the right order. It only remains to take every image-based box, prepare it for OCR by dilating and eroding, and let pytesseract recognize the strings it contains:
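
Again, only a hedged sketch is given here; the padding, kernel size and the --psm 6 page-segmentation mode are illustrative choices rather than the exact values used in our experiment:

import cv2
import pytesseract

def ocr_cell(page_img, box, pad=2):
    """Crop a single cell, clean it up with erosion/dilation and read its text with Tesseract."""
    x, y, w, h = box
    cell = page_img[max(y - pad, 0):y + h + pad, max(x - pad, 0):x + w + pad]
    gray = cv2.cvtColor(cell, cv2.COLOR_BGR2GRAY)
    _, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    cleaned = cv2.dilate(cv2.erode(thresh, kernel, iterations=1), kernel, iterations=1)
    return pytesseract.image_to_string(cleaned, config="--psm 6").strip()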

Final thoughts

Whew, it’s been a long walk! Our custom object detector can recognize semi-structured blocks of information (aka borderless tables) in a document and transform them further into machine-readable text. Though the model turns out to be less accurate than we might have expected, there is still plenty of room for improvement.


As an ML Engineer at RBC Group, I transform raw data with passion and creativity to unlock valuable insights and empower businesses to make informed decisions