Introduction
While working at UNICEF Nigeria as a Polio Data Scientist, I faced the problem of errors in vaccination cards filled in by 20,000 polio volunteers, and the even bigger problem of the sheer number of cards to check. Simply digitizing the process and giving everyone a tablet was not an option. After some research, I decided to use AI/ML and computer vision to "read" the information from the cards, provide a feedback mechanism about the most common errors, and predict the correct information.
In this tutorial, you will see how to achieve this, what the results look like, and what I recommend for future optimization. I will mainly be using the Python libraries TensorFlow and OpenCV, along with a few supporting libraries.
Installation
How you install TensorFlow varies with the OS and hardware you are going to use. Refer to the general instructions here.
For this tutorial, I will be using the following packages:
- OS: Linux x64 (Arch Linux)
- Python package manager: Anaconda or Miniconda (installation instructions here)
- CUDA 10.1.105
- CuDNN 7.5.0
- Python TensorFlow API v1
- opencv-python
Using Miniconda (or Anaconda), follow these steps to install the required Python libraries.
Creating conda environment
conda create -n pyocr
conda activate pyocr
Installing required packages
conda install tensorflow
conda install opencv
conda install -c lightsource2-tag pyzbar
pip install editdistance
Preserving library version for future replication
conda env export > <environment-name>.yml
Recreate the environment on another machine
To recreate the environment on another machine, copy the .yml file over and run:
conda env create -f <environment-name>.yml
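Optionally, you can run a quick import check to confirm the environment works (a minimal sketch; version numbers will vary with your setup):
# Optional sanity check that the core libraries import correctly
import tensorflow as tf
import cv2
import editdistance
from pyzbar import pyzbar

print("TensorFlow:", tf.__version__)
print("OpenCV:", cv2.__version__)
print("editdistance('ocr', 'ocr') =", editdistance.eval("ocr", "ocr"))  # expect 0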
Recognizing text using Tensorflow
The first thing to understand is that the accuracy of this model depends on the samples you use for training: more samples mean better accuracy. This also means that if you need to recognize text written by multiple people, you have to include sufficient text samples written by each of them.
The entire tutorial code is uploaded to the GitHub repository. Clone it with git clone if you want the final code:
git clone [email protected]:PiotrKrosniak/ocrbot.git pyocr
Inputs
Check out the Inputs folder in the cloned repository. Keep the images you want to run the script on here (for better organization).

Get Training Data
- Get the IAM dataset:
  - Register at http://www.fki.inf.unibe.ch/databases/iam-handwriting-database
  - Download ascii/words.txt and put words.txt into the data/ directory
  - Download words/words.tgz and create the directory data/words/
  - Put the contents of words.tgz (directories a01, a02, …) into data/words/; from a Linux terminal, run tar xvf words.tgz -C words inside the data folder
- Run checkDirs.py for a rough check on the files (a minimal stand-in is sketched after the directory tree below)
Check if the directory structure looks like this:
data
|-- test.png
|-- words.txt
|-- words
|   |-- a01
|   |   |-- a01-000u
|   |   |   |-- a01-000u-00-00.png
|   |   |   |-- ...
|   |   |-- ...
|   |-- a02
|   |-- ...
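For reference, a minimal stand-in for what checkDirs.py verifies might look like this (the paths are assumptions based on the tree above):
# Hypothetical stand-in for checkDirs.py: verify the expected paths exist.
# Adjust the list if your layout differs from the tree above.
import os

required = ["data/test.png", "data/words.txt", "data/words/a01/a01-000u"]
for path in required:
    status = "OK" if os.path.exists(path) else "MISSING"
    print(f"[{status}] {path}")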
Training the model
Extract the model first: unzip the model.zip file into the same folder (/model). Then run the training from the src directory. The script will build upon the previously trained model and improve its accuracy on your data:
python main.py --train
This may take a long time: around 16–18 hours without a GPU. The script runs training passes called epochs until there is no appreciable increase in text recognition accuracy between consecutive epochs. After completion, you will see files generated under the model folder.

The snapshots and checkpoints will be generated as above
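The stopping rule described above can be sketched as a simple patience loop; train_epoch and validate below are hypothetical stand-ins for the real logic in src/main.py:
# Sketch of the stopping rule; train_epoch and validate are hypothetical
# stand-ins for the real training and validation code in src/main.py
def train_epoch():
    pass  # one pass over all training batches

def validate():
    return 0.0  # recognition accuracy on held-out samples

best_accuracy = 0.0
no_improvement = 0
PATIENCE = 5  # give up after 5 epochs without any gain

while no_improvement < PATIENCE:
    train_epoch()
    accuracy = validate()
    if accuracy > best_accuracy:  # keep only improving snapshots
        best_accuracy, no_improvement = accuracy, 0
    else:
        no_improvement += 1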
Running the OCR script
Now that the model has been generated in the code folder, let us run the code to extract text from our images. Make sure your input files are in the Input folder.

Run the code in the src folder (inside a terminal):
python Demo.py
The code will run on the input images. You will see the output in the terminal as below

Once the code has completed running, outputs will be present in the Output folder:

The folders will contain the table cells, with each cell saved as a separate image. We will use these generated images to further improve our accuracy in the next section.
However, based on your current model, the recognized text will be saved in CSV files with the same names as the input images. These CSV files can be opened in spreadsheet software like Microsoft Excel or Google Sheets.
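If you prefer to inspect a result programmatically, a minimal sketch using Python's csv module could look like this (the file name card1.csv is a placeholder for your input image's name):
# Minimal sketch for inspecting one generated CSV;
# "card1.csv" is a placeholder for whatever your input image was called
import csv

with open("Output/card1.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        print(row)  # one list per table row, one string per recognized cell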
Improving the Accuracy
The individual table cells from your images are saved as separate images in the Output folder. These images can help the model learn the handwriting-to-text mapping for your own data set. Typically, this is necessary if you have a lot of uncommon English words, like names, or if the handwriting style in your images differs significantly from the default IAM dataset the model was trained on.
To use these table cell images to train your dataset, follow the steps below:
- Preprocess the images to make them IAM-compliant. This is absolutely necessary for the script to train properly on your images. At a high level, the following steps are performed:
  a. Thickening faint lines in the text
  b. Removing extra space around the word with word segmentation (refer to this code; a rough stand-in is also sketched after the code block below)
  c. Improving contrast through thresholding
- Renaming and copying the images into the data folder in the format used by the DataLoader.py module:
For example, a file c01-009-00-00.png should be saved in the following folder hierarchy:
| words
| - c01
| - - c01-009
| - - - c01-009-00-00.png
However, you can change this folder hierarchy and these file naming conventions by editing the DataLoader.py module
- Edit the words.txt file in the data folder to include these images (an example entry is sketched after the format description below)
The following code performs operations 1a and 1c:
import numpy as np
import cv2
# read the word image as grayscale
img = cv2.imread('in.png', cv2.IMREAD_GRAYSCALE)
# increase contrast by stretching the pixel range to 0..255
pxmin = np.min(img)
pxmax = np.max(img)
imgContrast = ((img - pxmin) / (pxmax - pxmin) * 255).astype(np.uint8)
# increase line width: eroding the light background thickens the dark strokes
kernel = np.ones((3, 3), np.uint8)
imgMorph = cv2.erode(imgContrast, kernel, iterations=1)
# write the preprocessed image
cv2.imwrite('out.png', imgMorph)
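Step 1b (removing extra space around the word) is handled by the word-segmentation code linked above; as a rough stand-in, a simple Otsu threshold plus bounding-box crop looks like this:
import cv2

# Rough stand-in for step 1b: crop tightly around the ink, then pad slightly
img = cv2.imread('in.png', cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
x, y, w, h = cv2.boundingRect(cv2.findNonZero(binary))  # box around dark strokes
pad = 5
cropped = img[max(y - pad, 0):y + h + pad, max(x - pad, 0):x + w + pad]
cv2.imwrite('out_cropped.png', cropped)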
When writing the words.txt file, follow the conventions in the format below, as applicable to your images:
Sample line: a01-000u-00-00 ok 154 1 408 768 27 51 AT A
- a01-000u-00-00 -> word id for line 00 in form a01-000u. This is also the file name of the image you are mapping
- ok -> result of word segmentation
  - ok: word was correctly segmented
  - er: segmentation of the word can be bad
- 154 -> gray level used to binarize the line containing this word. This is the contrast stretching/thresholding step
- 1 -> number of components for this word
- 408 768 27 51 -> bounding box around this word in x, y, w, h format
- AT -> the grammatical tag for this word; see the file tagset.txt for an explanation
- A -> the transcription for this word, describing the text content of the image
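As a hypothetical example of such an entry, the snippet below appends one line for a renamed cell image; the gray level (154), tag (XX), and transcription ("Abuja") are placeholders you would replace with real values:
# Hypothetical example: append a words.txt entry for one renamed cell image.
# Gray level, tag, and transcription are placeholders; fill in real values.
import cv2

word_id = "c01-009-00-00"
img = cv2.imread(f"data/words/c01/c01-009/{word_id}.png", cv2.IMREAD_GRAYSCALE)
h, w = img.shape  # use the full image as the bounding box
with open("data/words.txt", "a", encoding="utf-8") as f:
    f.write(f"{word_id} ok 154 1 0 0 {w} {h} XX Abuja\n")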
These steps will tailor the model to your images. To improve the accuracy of the model itself, refer to the improving-accuracy section of this page.
Explanation of the approach
The code performs three major steps:
- Match template and rotate image
- Recognize rows in the table and crop
- Recognize text using python-tensorflow
The recognition algorithm is based on a simplified version of an HTR (handwritten text recognition) system. If you are interested in the mechanism, you can refer to this paper.
It consists of 5 CNN layers, 2 RNN (LSTM) layers, and a final CTC loss and decoding layer:
- The input image is a gray-value image and has a size of 128×32
- 5 CNN layers map the input image to a feature sequence of size 32×256
- 2 LSTM layers with 256 units propagate information through the sequence and map it to a matrix of size 32×80. Each matrix element represents a score for one of the 80 characters at one of the 32 time-steps
- The CTC layer either calculates the loss value given the matrix and the ground-truth text (when training), or it decodes the matrix to the final text with best path decoding or beam search decoding (when inferring)
- Batch size is set to 50
Fig 5: Mechanisms involved in the OCR step using TensorFlow
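To make the layer stack concrete, here is a compact Keras sketch with the sizes from the list above; it is an illustration only, not the repository's exact TF v1 graph:
import tensorflow as tf
from tensorflow.keras import layers

# Compact sketch of the CNN -> LSTM -> CTC stack described above. Sizes follow
# the text (128x32 input, 32x256 feature sequence, 80 character scores).
inputs = tf.keras.Input(shape=(32, 128, 1))       # height, width, channels

x = inputs
filters = [32, 64, 128, 128, 256]
pools = [(2, 2), (2, 2), (2, 1), (2, 1), (2, 1)]  # height 32 -> 1, width 128 -> 32
for f, p in zip(filters, pools):
    x = layers.Conv2D(f, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(p)(x)

x = layers.Reshape((32, 256))(x)  # sequence of 32 time-steps x 256 features

# two LSTM layers propagate context along the sequence (bidirectional here)
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)

# per-time-step scores over 80 characters (incl. the CTC blank): a 32x80 matrix;
# during training a CTC loss would be applied to these scores
outputs = layers.Dense(80)(x)

model = tf.keras.Model(inputs, outputs)
model.summary()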
Conclusion
Following this tutorial, you now have a way to automate the digitization of handwritten text in tabular format. Countless hours can be saved once you train the model to recognize your handwriting and customize it to your needs. However, be careful: the recognition is not 100% accurate, so a round of high-level proofreading after the spreadsheet generation might be needed before you are ready to share the final spreadsheet.
Reference:
- Code reference: https://github.com/PiotrKrosniak/ocrbot
- Handwriting recognition using Google TensorFlow: https://towardsdatascience.com/build-a-handwritten-text-recognition-system-using-tensorflow-2326a3487cd5
- Handling edge cases: https://towardsdatascience.com/faq-build-a-handwritten-text-recognition-system-using-tensorflow-27648fb18519
- Dataset to start with: http://www.fki.inf.unibe.ch/databases/iam-handwriting-database
- SimpleHTR model: https://github.com/githubharald/SimpleHTR