Google’s Tesseract OCR: How Good Is It on Documents?

Arvind Rajan
Towards Data Science
4 min read · Feb 19, 2021


Photo by Finn Mund on Unsplash

The Tesseract Optical Character Recognition (OCR) engine by Google is arguably the most popular out-of-the-box solution for OCR. Recently, I was tasked with building an OCR tool for documents. I was aware of its robustness; however, out of curiosity, I wanted to investigate its performance on documents specifically.

As always, the starting point was sourcing a reliable ground truth before thinking about synthesising one of my own. Luckily, I found one: the DDI-100 dataset by the Machine Intelligence Team from the Moscow Institute of Physics and Technology. It contains about 30 GB of data with character-level ground truth, which is sweet! However, keep in mind that some of the books are in Russian. For this study, only Books 33 and 34 were used. These books combined have over 300 pages and, with augmentation, precisely 5,145 pages.

It is Friday night at the time of writing, so I am going to keep the discussion as succinct as I possibly can. The code to generate the results can be found in my repo here.

Design of experiment

Language data. Tesseract 4.0.0 offers three sets of language models, namely tessdata, tessdata_best, and tessdata_fast. All three models will be used in this study.
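For illustration, here is a minimal sketch of how one could switch between the three sets from Python. It assumes pytesseract is installed and that the three model repositories have been downloaded into the (hypothetical) directories below, and it uses page-segmentation mode 7 since the ground truth is per text crop; the code in my repo may differ.

```python
# Minimal sketch: switching between the three language-data sets with pytesseract.
# The paths below are hypothetical placeholders -- point them to wherever the
# tessdata, tessdata_best and tessdata_fast repositories are checked out.
import pytesseract
from PIL import Image

TESSDATA_DIRS = {
    "tessdata": "/opt/tessdata",
    "tessdata_best": "/opt/tessdata_best",
    "tessdata_fast": "/opt/tessdata_fast",
}

def ocr_with_model(image_path: str, model: str) -> str:
    """Run Tesseract on a single text image using the chosen language-data set."""
    config = f'--tessdata-dir "{TESSDATA_DIRS[model]}" --psm 7'  # psm 7: single text line
    return pytesseract.image_to_string(Image.open(image_path), lang="eng", config=config)
```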

Pre-processing. Each text image from the dataset is put through a pre-processing step, which does the following in sequence (a rough code sketch follows the list):
1. Pads with 5 pixels around the text.
2. Resizes to a target height of 30 pixels.
3. Performs Otsu binarisation.
4. Inverts the image (bitwise) if the background is dark. Tesseract gives optimum results for text with a dark foreground and a light background.
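As a rough sketch (not the exact code in the repo), the four steps above could look something like this with OpenCV, assuming a grayscale crop as input:

```python
# Sketch of the pre-processing pipeline described above; parameter choices may differ.
import cv2
import numpy as np

def preprocess(img: np.ndarray, pad: int = 5, target_height: int = 30) -> np.ndarray:
    """Pad, resize to a fixed height, binarise (Otsu) and ensure a light background."""
    # 1. Pad with 5 pixels around the text (replicating the border pixels).
    img = cv2.copyMakeBorder(img, pad, pad, pad, pad, cv2.BORDER_REPLICATE)
    # 2. Resize to a target height of 30 pixels, keeping the aspect ratio.
    scale = target_height / img.shape[0]
    img = cv2.resize(img, (max(1, int(img.shape[1] * scale)), target_height))
    # 3. Otsu binarisation.
    _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # 4. Invert if the background is dark (mostly dark pixels), so the text ends up
    #    as a dark foreground on a light background.
    if np.mean(img) < 127:
        img = cv2.bitwise_not(img)
    return img
```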

Performance metrics. All the above-mentioned models will be assessed based on the following criteria (a sketch of the computation follows the list):
1. Direct match (lowercased).
2. Levenshtein distance.
3. Processing time.
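Roughly, the three metrics could be computed per text crop as in the sketch below; it uses the python-Levenshtein package, and ocr_with_model is the hypothetical helper from the earlier snippet, so treat it as an illustration rather than the exact evaluation code.

```python
# Sketch: evaluating one prediction against its ground truth.
import time
import Levenshtein  # pip install python-Levenshtein

def evaluate(predicted: str, ground_truth: str, seconds: float) -> dict:
    return {
        # 1. Direct match after lowercasing both strings.
        "direct_match": predicted.strip().lower() == ground_truth.strip().lower(),
        # 2. Normalised Levenshtein similarity: 1.0 means a perfect match.
        "levenshtein_score": Levenshtein.ratio(predicted, ground_truth),
        # 3. Processing time of the Tesseract call, in seconds.
        "time_s": seconds,
    }

start = time.perf_counter()
prediction = ocr_with_model("crop.png", "tessdata_fast")  # helper from the earlier sketch
elapsed = time.perf_counter() - start
print(evaluate(prediction, "expected text", elapsed))
```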

Hardware. My laptop runs an Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz with 16 GB RAM. I ran the code on Linux (WSL) restricted to 2 cores (to avoid any throttling, just in case).

Other information. The tesseract and leptonica versions used were tesseract 4.0.0-beta.1 and leptonica-1.75.3.

Results and discussion

Accuracy based on direct match

The result above would be of interest to you if recognising every piece of text is of paramount importance to your use case. To my surprise, I can’t tell the models apart. In fact, tessdata_fast appears to be more accurate.

This does not agree with the description of the models given in Tesseract’s documentation here. This led me to think that, perhaps, tessdata_best may exhibit better performance in terms of Levenshtein distance.

Levenshtein distance after removing texts with a score of 1.0

The result above is a boxplot of Levenshtein distance, excluding all texts with a score of 1.0 as well as outliers. This was done to verify my hypothesis that tessdata_best is the better performer when assessed on edit distance; however, that is not the case either.

Processing time per text

The figure above shows that tessdata_best can be up to 4 times slower than tessdata, which comes with the tesseract-ocr package on Linux. tessdata_fast, as the name suggests, is faster than both tessdata and tessdata_best.

Conclusion

Google’s Tesseract OCR engine is highly popular in the open-source community. Here, I ran a quick experiment to assess its performance on documents. All three of its language models were compared.

tessdata and tessdata_best appear to exhibit comparable performance in terms of recognition accuracy. tessdata_fast, on the other hand, is marginally better than the former two models. And as expected, this model is also the fastest.

Given the performance of all three Tesseract models, the next natural question is: are the “superior” models worth employing? Based on the results above, it is a No from me. However, I have to stress that this decision is only applicable to my use case. The results may differ significantly for other languages and/or other image types.
