Simple OCR with Tesseract

How to train Tesseract to read your unique font

Andreas M M
Towards Data Science


Photo by Angel-Kun on Pixabay

In this article, I want to share with you how to build a simple OCR using Tesseract, “an optical character recognition engine for various operating systems”. Tesseract itself is free software, originally developed by Hewlett-Packard until 2006, when Google took over development. It is arguably the best out-of-the-box OCR engine available today, with support for more than 100 languages, and it is one of the most popular OCR engines because it is easy to install and use.

Now, suppose your boss gave you the task of converting the picture below into machine-readable text, or in simpler words: “Build an OCR model that can read the font in these pictures!”. “Alright, no problemo,” you said, but then your supervisor added, “And I want it done today, in the next 3–5 hours.”

For example, we name this picture file_0.png

“Welp, how am I supposed to build an OCR model that fast?” you say to yourself. But don’t you worry, that is what Tesseract is for! First things first, you have to install it on your computer.

  • If you are using Windows, go to https://github.com/UB-Mannheim/tesseract/wiki and install Tesseract using the installer (you can choose the latest stable version; in my case, I used Tesseract 4.0.0). Follow the instructions. Then, go to “Edit environment variables” and add a new path entry pointing to your Tesseract installation folder, just as in the picture below

Then click “OK”

  • If you use Ubuntu OS, then open the terminal and run
    sudo apt-get install tesseract-ocr

After you have successfully installed Tesseract on your computer, open Command Prompt on Windows or a terminal on Ubuntu, and then run:

tesseract file_0.png stdout

where file_0.png is the filename of the picture above. We want Tesseract to read any words it finds in the image. You should see this output in your terminal:

Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 147
40293847 S565647386e2e91L0

Oh no! It seems Tesseract cannot read the words in the picture perfectly. It misread some characters, probably because the font in the image is unique and strange. Luckily, you can train your Tesseract so it can read your font easily. Just follow my steps!

Disclaimer: as stated in Tesseract’s wiki, it is recommended to use the default “languages”, which were already trained on a lot of data, and to train your own language only as a last resort (meaning you should first try preprocessing the image: thresholding and other image preprocessing methods, before jumping into training). This is because Tesseract itself is quite accurate on generally clean images, and it is quite difficult to make a self-trained model more accurate, EXCEPT if your font is quite different and unique (like in our case) or if you try to read some demonic language.
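As a concrete example of the preprocessing the wiki recommends, here is a minimal global-thresholding sketch in pure Python (the tiny pixel grid and the 128 cutoff are illustrative assumptions; for real images you would use OpenCV or scikit-image, e.g. Otsu’s method, on the actual pixel array):

```python
# Minimal global-threshold sketch: map grayscale pixels (0-255) to pure
# black/white so Tesseract sees a clean, binarized image.
# The 128 cutoff is an arbitrary assumption; tune it for real images.

def threshold(pixels, cutoff=128):
    """Binarize a 2D grid of grayscale values: dark -> 0, light -> 255."""
    return [[0 if p < cutoff else 255 for p in row] for row in pixels]

# A tiny fake 2x3 "image": dark ink on a light, slightly noisy background.
image = [[30, 200, 180],
         [25, 190, 120]]
print(threshold(image))  # [[0, 255, 255], [0, 255, 0]]
```

Often this binarization step alone is enough to turn a misread image into a clean one, without any retraining.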

Installation and Data Preparation

To train your font, first, you need to :

  1. Install Tesseract (you don’t say)
  2. Install jTessBoxEditor
    This tool is used for creating and editing the ground truth used to train Tesseract. Note that you need the Java Runtime to open it, which you can download from https://www.java.com/en/download/. After installing Java, install jTessBoxEditor (not the FX one) from https://sourceforge.net/projects/vietocr/files/jTessBoxEditor/
    You can open jTessBoxEditor by extracting the zip file and running train.bat if you use Windows, or
    java -jar jTessBoxEditor.jar if you use Ubuntu
  3. [optional] A working Word Office (Windows) or LibreOffice (Ubuntu) and the .ttf file of your font. For example, in the case above, I was using the OCR-A Extended font. You can easily download your font from Google (just search font_name .ttf download) and install it (just double-click the .ttf file). Or, better, use a collection of images that you want to predict later as training data.

After you have prepared all the installation steps above, you are ready to train your Tesseract. Tesseract uses a “language” as its model for OCR. There are many default languages, like eng (English), ind (Indonesian), and so on. We will create a new language so Tesseract can recognize our font, by creating training data consisting of random numbers rendered in our font. There are two ways to do that. First, if you have a collection of images containing only your font, you can use those. Second, type any numbers (or characters) you want in Word using your font, and use the Snipping Tool (Windows) or Shift + PrintScreen (Ubuntu) to capture and save them to a folder.
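Since the training numbers are arbitrary anyway, you can generate balanced random digit strings to type out with a short script like this (a sketch; the sample count, line length, and seed are assumptions, not values from the article):

```python
import random

def make_training_strings(n_samples=15, digits_per_line=12, seed=42):
    """Generate random digit strings to type out in your font and screenshot.
    Sampling digits uniformly keeps the per-digit class balance roughly even."""
    rng = random.Random(seed)  # fixed seed so the data set is reproducible
    return ["".join(rng.choice("0123456789") for _ in range(digits_per_line))
            for _ in range(n_samples)]

for line in make_training_strings(n_samples=3):
    print(line)
```

Paste the printed lines into Word in your font, then screenshot each line as its own training image.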

Training data example
Training data example for multiple lines

In my experience, 10–15 images were enough to produce a model that is (subjectively) sufficiently accurate for both clean and somewhat noisy images. Note that you should create data that is as balanced and as close to the real case as possible: if you want to predict images with a blue background and red font, then your training data should also have a blue background and red font.

Training the Tesseract

In general, the training steps for Tesseract are:

  1. Merge the training data into a .tiff file using jTessBoxEditor
  2. Create the training labels, by generating a .box file containing Tesseract’s predictions for the .tiff file and fixing each inaccurate prediction
  3. Train Tesseract

Step 1. Merge training data

After you are done creating some data, open jTessBoxEditor. In the top bar, go to “Tools” → “Merge Tiff” (or just use the shortcut Ctrl + M). Go to the folder where you saved your training images, change the filter to PNG (or whatever extension your images have), select all images, and click “Ok”. Then, in the save dialog, type in font_name.font.exp0, where font_name is any name you want (this will be the name of your new Tesseract “language”).

Step 2. Create a Training Label

Open a terminal and navigate to the folder where you saved your training images and the .tiff file. Now that we have the training data, how do we get the training labels? Fear not, you do not have to label each image manually; we can use Tesseract and jTessBoxEditor to help us. In the terminal, run the command below:

tesseract --psm 6 --oem 3 font_name.font.exp0.tif font_name.font.exp0 makebox

Wait, why are there suddenly psm and oem options? What happens when you type the command above? If you run:

tesseract --help-psm
#or
tesseract --help-oem

you will see that psm stands for Page Segmentation Mode, i.e., how Tesseract treats the layout of the image. If you want Tesseract to treat the whole image as a single word, you can choose psm 8. In our case, since the images in the .tiff file are a collection of single-line texts, we choose psm 6. As for oem, it stands for OCR Engine Mode: Tesseract has a legacy engine that works by recognizing character patterns, and a newer neural-net LSTM engine (if you want to use the LSTM engine, install Tesseract version ≥ 4.0.0).

With the above command, we want Tesseract to produce the bounding box and the predicted character for everything it finds in the .tiff file, and save them to a text file named font_name.font.exp0.box. In case you didn’t know, the .tiff file we produced earlier contains your training images segmented by “page”.
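If you prefer to drive this step from a script, the makebox call can be assembled from Python (a sketch: it only builds the argument list, and it assumes tesseract is on your PATH if you actually run it):

```python
import subprocess  # only needed if you uncomment the run() call below

def makebox_cmd(base, psm=6, oem=3):
    """Build the tesseract command that writes <base>.box from <base>.tif."""
    return ["tesseract", "--psm", str(psm), "--oem", str(oem),
            f"{base}.tif", base, "makebox"]

cmd = makebox_cmd("font_name.font.exp0")
print(" ".join(cmd))
# To actually run it (requires tesseract on PATH):
# subprocess.run(cmd, check=True)
```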

Now open jTessBoxEditor, navigate to the Box Editor tab, click “Open”, and select the .tiff file. You should see that each image on each page has its bounding boxes and predictions. Your job now is to fix each bounding box and its character prediction in the .box file (yes, this is the dullest part).

Step 3. Training the tesseract

After you have fixed the .box file, create a new text document containing

font 0 0 0 0 0

Save it as font_properties in the same folder as the .tiff and .box files. Each line of font_properties follows the format <fontname> <italic> <bold> <fixed> <serif> <fraktur>; here, the font name font matches the middle field of font_name.font.exp0, and the five zeros mark it as a plain font. Now you are ready to begin the training process! (finally) Inside the folder, you should now have the .tiff file, the .box file, and font_properties.
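If you are scripting the pipeline, the file can be written like this (a sketch; the font-name field and the flag format are from Tesseract’s training docs):

```python
# Write the font_properties file the legacy Tesseract trainer expects.
# Line format: <fontname> <italic> <bold> <fixed> <serif> <fraktur>
# "font" must match the middle field of font_name.font.exp0, and the
# five zeros mark it as a plain (non-italic, non-bold, ...) font.
with open("font_properties", "w") as f:
    f.write("font 0 0 0 0 0\n")
```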

Now run this command on the terminal :

# Create a .tr file (training file)
tesseract font_name.font.exp0.tif font_name.font.exp0 nobatch box.train
# Create a unicharset file
unicharset_extractor font_name.font.exp0.box
# Create a shapetable file
shapeclustering -F font_properties -U unicharset -O font_name.unicharset font_name.font.exp0.tr
# Create the pffmtable and inttemp files
mftraining -F font_properties -U unicharset -O font_name.unicharset font_name.font.exp0.tr
# Create a normproto file
cntraining font_name.font.exp0.tr

If you hit an error, you might need to use tesseract.exe, unicharset_extractor.exe, and cntraining.exe instead (for Windows users). You will see some output in your terminal, most importantly in the shapeclustering part. If your training images contain all the necessary characters, you will see that the number of shapes equals the number of classes you want. For example, if I want to train Tesseract to read digits correctly, then the number of shapes equals 10 (which is 0, 1, 2, 3, …, 9).

Master shape_table:Number of shapes = 10 max unichars = 1 number with multiple unichars = 0

If your number of shapes does not equal the number of classes you want, you should go back to creating training data, and try to create cleaner data.
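If you retrain often, the five commands above can be assembled from one script. This is only a sketch: it builds and prints the argument lists in order, and assumes the training tools are on your PATH if you choose to actually run them (e.g. with subprocess.run):

```python
def training_pipeline(lang="font_name", font="font"):
    """Assemble the Tesseract 3/4 legacy-training commands, in order.
    A sketch: the commands are printed, not executed."""
    base = f"{lang}.{font}.exp0"
    return [
        # 1. Create the .tr training file from the .tiff and the fixed .box
        ["tesseract", f"{base}.tif", base, "nobatch", "box.train"],
        # 2. Extract the character set into a unicharset file
        ["unicharset_extractor", f"{base}.box"],
        # 3. Cluster shapes -> shapetable
        ["shapeclustering", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", f"{base}.tr"],
        # 4. Feature training -> inttemp, pffmtable
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", f"{base}.tr"],
        # 5. Character normalization -> normproto
        ["cntraining", f"{base}.tr"],
    ]

for cmd in training_pipeline():
    print(" ".join(cmd))
```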

If you have done everything correctly, you will see 4 major files in your folder: shapetable, normproto, inttemp, and pffmtable. Rename those files to font_name.shapetable, font_name.normproto, font_name.inttemp, and font_name.pffmtable.

Then run:

combine_tessdata font_name.

After you run all the commands above, you will see these files in your folder.

Now copy font_name.traineddata to :

C:\Program Files (x86)\Tesseract-OCR\tessdata                          # Windows
sudo cp font_name.traineddata /usr/share/tesseract-ocr/4.00/tessdata   # Ubuntu

And you are done! Yep, because we use a small amount of data, the training itself doesn’t take hours, just seconds or maybe minutes. Compared to training a deep learning model (probably an object detection model) from scratch, it’s much, much faster. The next time you run Tesseract, you can specify your newly trained language with:

tesseract file_0.png stdout -l font_name

Remember that with the default language, the result for the picture above was 40293847 S565647386e2e91L0. Using our newly trained language, the result is:

Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 147
10293847 565647382910

As you can see, the result is much more accurate. Yay! With just a little training data and a relatively short amount of time, you have created an OCR model capable of reading a unique and strange font!

To further check the model’s results, you can create another .tiff file from other images, or reuse the previous .tiff file. Open a terminal and, again, run:

tesseract -l font_name --psm 6 --oem 3 font_name.font.exp0.tif font_name.font.exp0 makebox

Now open jTessBoxEditor → Box Editor → Open and select your .tiff file. Check whether your model gives more accurate predictions than before. You should see that the predictions have improved a lot.

Final Thoughts

“But is Tesseract the only way to go if you want an out-of-the-box and fast OCR engine?” you may ask. Of course not; there are a ton of OCR API providers out there if you are willing to spend some cash. In my honest opinion, Tesseract is good if your images are really, really clean (for example, a Word document, a cashier’s bill, and so on). If your image data contains a lot of noise, you can use thresholding to separate the background and the noise from the font itself. In my experience, using as few as 10–20 images, Tesseract was able to compete even with a state-of-the-art object detection model like Faster R-CNN trained on a lot more data (with a lot of augmentation as well). BUT if your images have noise (random dots, dirty marks) in the same color as your font, Tesseract will not be able to predict them correctly. I say you should use Tesseract if you want to build an OCR model as fast as possible, or if you have a limited amount of training data.

One of the main weaknesses of Tesseract (well, I think) is that it is quite unstable: I could get a different result just by using a larger crop of the same image. Also, for some reason, when I used more than 50 training images, Tesseract performed worse. Well, I’m still learning myself. If you find any mistakes or misconceptions in this article, feel free to contact me. Thank you for reading, and happy learning!
