Digitizing handwritten vaccination records in Nigeria with AI and ML

Piotr Krosniak
Towards Data Science
7 min read · May 17, 2019


Introduction

Working as a Polio Data Scientist at UNICEF Nigeria, I ran into two problems: errors in the vaccination cards delivered by 20,000 polio volunteers, and the even bigger issue of the sheer number of cards to check. Simply digitizing the process by giving everyone a tablet was not an option. After some research, I decided to use AI/ML and computer vision to “read” the information from the cards, provide a feedback mechanism about the most common errors, and predict the correct information.

In this tutorial, you will see how to achieve this, what the results look like, and recommendations for future optimization. I will mainly be using the Python libraries TensorFlow and OpenCV, along with a few supporting libraries.

Installation

Installing TensorFlow varies with the OS and hardware you are going to use. Refer to the general installation instructions here.

For this tutorial, I will be using the following setup:

OS: Linux x64 (Arch Linux)
Python package manager: Anaconda or Miniconda (installation instructions here)
CUDA 10.1.105
cuDNN 7.5.0
TensorFlow Python API v1
opencv-python

Using Miniconda (or Anaconda), follow these steps to install the required Python libraries.

Creating conda environment

conda create -n pyocr
conda activate pyocr

Installing required packages

conda install tensorflow
conda install opencv
conda install -c lightsource2-tag pyzbar
pip install editdistance
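
Before going further, a quick sanity check that the environment imports cleanly can save time. The snippet below is only a convenience check of my own, not part of the tutorial code; the GPU test uses the TensorFlow v1 API this tutorial targets.

import tensorflow as tf
import cv2
import editdistance
from pyzbar import pyzbar

# confirm the libraries installed above are importable
print('TensorFlow:', tf.__version__)
print('OpenCV:', cv2.__version__)
# optional: training simply runs faster when a GPU is visible
print('GPU available:', tf.test.is_gpu_available())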

Preserving library versions for future replication

conda env export > <environment-name>.yml

Recreate the environment on another machine

To recreate the environment on another machine, copy the .yml file over and run:

conda env create -f <environment-name>.yml

Recognizing text using Tensorflow

The first thing to understand is that the accuracy of this model depends on the samples you use for training: more samples give better accuracy. This also means that if you need to recognize text written by multiple people, you have to include sufficient text samples written by each of them.

The entire tutorial code is uploaded to the GitHub repository. Clone it with git clone if you want the final code:

git clone git@github.com:PiotrKrosniak/ocrbot.git pyocr

Inputs

Check out the Input folder in the repository you just cloned. Keep the images you want to run the script on here (for better organization).

Fig 1: Github folder structure for input folder

Get Training Data

  1. Get the IAM dataset.
  2. Register at http://www.fki.inf.unibe.ch/databases/iam-handwriting-database
  3. Download ascii/words.txt.
  4. Put words.txt into the data/ directory.
  5. Download words/words.tgz.
  6. Create the directory data/words/.
  7. Put the content (directories a01, a02, …) of words.tgz into data/words/. On a Linux terminal, run tar xvf words.tgz -C words from inside the data folder.
  8. Run checkDirs.py for a rough check on the files.

Check if dir structure looks like this:

data/
  test.png
  words.txt
  words/
    a01/
      a01-000u/
        a01-000u-00-00.png
        …
      …
    a02/
    …
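
checkDirs.py ships with the repository and performs this check for you. Purely as an illustration of what it verifies, a sketch along these lines (my own code, not the repository's) walks the layout above:

import os

# illustrative sanity check of the data/ layout (not the repository's checkDirs.py)
def check_dirs(data_dir='data'):
    for required in ('words.txt', 'test.png'):
        assert os.path.isfile(os.path.join(data_dir, required)), required + ' is missing'
    words_dir = os.path.join(data_dir, 'words')
    assert os.path.isdir(words_dir), 'data/words/ is missing'
    num_images = sum(f.endswith('.png') for _, _, files in os.walk(words_dir) for f in files)
    print('Found', num_images, 'word images under', words_dir)

check_dirs()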

Training the model

Extract the model first: unzip the model.zip file into the same folder (<root>/model). Then run the training from the src directory. The script will build upon the previously trained model and improve its accuracy on your data.

python main.py --train

Training may take a long while: roughly 16–18 hours without a GPU. The script keeps running training epochs until there is no appreciable increase in text recognition accuracy between consecutive epochs. After completion, you will see files generated under the model folder.

Fig 2: The Model folder with the TensorFlow trained models

The snapshots and checkpoints will be generated as above

Running the OCR script

Now that the model files are in place, let us run the code to extract text from our images. Make sure your input files are in the Input folder.

Fig 3: Input folder with your images

Run the code from the src folder (inside a terminal):

python Demo.py

The code will run on the input images. You will see the output in the terminal as below

Fig 4: Sample Terminal output on running inference

Once the code has completed running, outputs will be present in the Output folder:

Fig 5: Output folder after running OCR script

The folders will contain the table cells, with each cell saved as a separate image. We will use these generated images to further improve our accuracy in the next section.

In the meantime, the text recognized with your current model is saved in CSV files with the same names as the input images. These CSV files can be opened in spreadsheet software such as Microsoft Excel or Google Sheets.
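
If you prefer to inspect the results programmatically rather than in a spreadsheet, a few lines of Python are enough; the file name below is a placeholder that mirrors whatever your input image was called.

import csv

# print the recognized text, row by row, from one of the generated CSV files
# ('Output/card_01.csv' is a placeholder name)
with open('Output/card_01.csv', newline='') as f:
    for row in csv.reader(f):
        print(row)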

Improving the Accuracy

The individual table cells from your images are saved as separate images in the Output folder. These images can help the model learn the handwriting-to-text mapping for your own dataset. Typically, this is necessary if your images contain many uncommon English words, such as names, or if the handwriting style differs significantly from the IAM dataset the model was trained on by default.

To use these table cell images to train the model on your dataset, follow the steps below:

  1. Preprocess the images to make them IAM dataset compliant. This is absolutely necessary for the script to train properly on your images. At a high level, the following steps are performed:

a. Thickening faint lines in the text

b. Removing extra spaces around the word with word segmentation (refer to this code)

c. Improving contrast through contrast stretching/thresholding

  2. Renaming and copying the images into the data folder in the format used by the DataLoader.py module (see the sketch after the preprocessing code below):

For example, a file c01-009-00-00.png should be saved in the following folder hierarchy:

words/
  c01/
    c01-009/
      c01-009-00-00.png

However, you can change this folder hierarchy and file naming convention by editing the DataLoader.py module.

  3. Edit the words.txt file in the data folder to include these images.

The following code performs operations 1a and 1c:

import numpy as np
import cv2
# read
img = cv2.imread('in.png', cv2.IMREAD_GRAYSCALE)
# increase contrast
pxmin = np.min(img)
pxmax = np.max(img)
imgContrast = (img - pxmin) / (pxmax - pxmin) * 255
# increase line width
kernel = np.ones((3, 3), np.uint8)
imgMorph = cv2.erode(imgContrast, kernel, iterations=1)
# write
cv2.imwrite('out.png', imgMorph)
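
Step 2 (renaming and copying) can be scripted as well. Below is a minimal sketch under the assumption that you pick the word id yourself and that your cell images sit in the Output folder; the helper name and paths are hypothetical, not part of the repository.

import os
import shutil

# hypothetical helper: copy a preprocessed cell image into the IAM-style
# hierarchy that DataLoader.py expects (word_id e.g. 'c01-009-00-00')
def copy_to_iam_layout(src_path, word_id, data_dir='data'):
    prefix = word_id.split('-')[0]            # e.g. 'c01'
    form = '-'.join(word_id.split('-')[:2])   # e.g. 'c01-009'
    dst_dir = os.path.join(data_dir, 'words', prefix, form)
    os.makedirs(dst_dir, exist_ok=True)
    shutil.copy(src_path, os.path.join(dst_dir, word_id + '.png'))

# example usage with an assumed Output-folder path
copy_to_iam_layout('Output/card_01/cell-00-00.png', 'c01-009-00-00')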

To write the words.txt entries, follow the format below, adapted to your images:

Sample line: a01-000u-00-00 ok 154 1 408 768 27 51 AT A

  • a01-000u-00-00 -> word id for line 00 in form a01-000u. This is also the file name of the image you are mapping
  • ok -> result of word segmentation
    • ok: word was correctly segmented
    • er: segmentation of word can be bad
  • 154 -> gray level used to binarize the line containing this word. This is the contrast stretching/thresholding step
  • 1 -> number of components for this word
  • 408 768 27 51 -> bounding box around this word in x, y, w, h format
  • AT -> the grammatical tag for this word; see the file tagset.txt for an explanation
  • A -> the transcription for this word, describing the text content of the image
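
For illustration, a small helper like the one below builds such a line for a new cell image; the function is my own sketch, and the gray level, bounding box and tag defaults are placeholders you would replace with values measured from your image.

# hypothetical helper: build one words.txt entry in the format described above
def words_txt_line(word_id, transcription, graylevel=154, components=1,
                   bbox=(408, 768, 27, 51), tag='AT'):
    x, y, w, h = bbox
    return f'{word_id} ok {graylevel} {components} {x} {y} {w} {h} {tag} {transcription}'

# append an entry for the cell image copied earlier
with open('data/words.txt', 'a') as f:
    f.write(words_txt_line('c01-009-00-00', 'A') + '\n')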

The above will tailor the model to your images. To improve the accuracy of the model itself, refer to the improving-accuracy section of this page.

Explanation of the approach

The code performs three major steps:

  1. Match template and rotate image
  2. Recognize rows in the table and crop
  3. Recognize text using python-tensorflow

The recognition algorithm is based on a simplified version of an HTR (handwritten text recognition) system. If you are interested in the mechanism, you can refer to this paper.

It consists of 5 CNN layers, 2 RNN (LSTM) layers, and a CTC loss and decoding layer.

  • The input image is a gray-value image and has a size of 128x32
  • 5 CNN layers map the input image to a feature sequence of size 32x256
  • 2 LSTM layers with 256 units propagate information through the sequence and map the sequence to a matrix of size 32x80. Each matrix-element represents a score for one of the 80 characters at one of the 32 time-steps
  • The CTC layer either calculates the loss value given the matrix and the ground-truth text (when training), or it decodes the matrix to the final text with best path decoding or beam search decoding (when inferring)
  • Batch size is set to 50

Fig 6: Mechanisms involved in the OCR step using TensorFlow
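
To make those shapes concrete, here is a minimal sketch of the same architecture written with the Keras API; it is not the repository's TensorFlow v1 code, and the kernel sizes and pooling steps are assumptions chosen only to reproduce the 128x32 -> 32x256 -> 32x80 shapes listed above.

import tensorflow as tf
from tensorflow.keras import layers

num_chars = 80  # character scores per time-step (includes the CTC blank)

inputs = tf.keras.Input(shape=(128, 32, 1))  # gray-value image, fed as width x height
x = inputs
# 5 CNN layers; filters/kernels/pooling are assumptions that shrink 128x32 down to 32x1x256
for filters, kernel, pool in [(32, 5, (2, 2)), (64, 5, (2, 2)),
                              (128, 3, (1, 2)), (128, 3, (1, 2)), (256, 3, (1, 2))]:
    x = layers.Conv2D(filters, kernel, padding='same', activation='relu')(x)
    x = layers.MaxPooling2D(pool_size=pool)(x)
x = layers.Reshape((32, 256))(x)  # feature sequence of size 32x256
# 2 LSTM layers with 256 units propagating information through the sequence
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
outputs = layers.Dense(num_chars)(x)  # 32x80 matrix of character scores for the CTC layer
# during training, a CTC loss (e.g. tf.nn.ctc_loss) compares these scores with the ground-truth text
model = tf.keras.Model(inputs, outputs)
model.summary()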

Conclusion

Following this tutorial, you now have a way to automate the digitization of handwritten text in tabular form. Countless hours can be saved once you train the model to recognize your handwriting and customize it to your needs. However, be careful: the recognition is not 100% accurate, so a round of high-level proofreading after the spreadsheet is generated may be needed before you share the final result.

References:

  1. Code: https://github.com/PiotrKrosniak/ocrbot
  2. Handwritten text recognition using TensorFlow: https://towardsdatascience.com/build-a-handwritten-text-recognition-system-using-tensorflow-2326a3487cd5
  3. Handling edge cases: https://towardsdatascience.com/faq-build-a-handwritten-text-recognition-system-using-tensorflow-27648fb18519
  4. Dataset to start with: http://www.fki.inf.unibe.ch/databases/iam-handwriting-database
  5. SimpleHTR repository: https://github.com/githubharald/SimpleHTR
