
Modern OCR systems
OCR (Optical Character Recognition) systems transform an image containing valuable information (presumably in text form) into machine-readable data. In most cases, performing OCR is the first step in extracting data from paper documents or scanned PDFs.
While a short search on the web turns up plenty of links to various open-source and commercial tools, two OCR engines, Google Vision and Tesseract, have built up a long head start over their competitors, especially in recent years.
Tesseract is an offline, open-source text recognition engine with a fully featured API that can easily be integrated into any business project via Python wrapper modules such as pytesseract.
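For instance, reading text from an image takes just a couple of lines (a minimal sketch, assuming the Tesseract binary itself is installed and document.png is a hypothetical sample file):

```python
import pytesseract
from PIL import Image

# Tesseract runs fully offline once the binary is installed
# (e.g. via `apt install tesseract-ocr`).
image = Image.open("document.png")  # hypothetical input file

# Plain text extraction in a single call
text = pytesseract.image_to_string(image)
print(text)
```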
Google Vision, by contrast, does not run locally but on Google’s remote servers. To start using the Google Vision API in your project, you have to complete a few setup steps, including providing valid credentials according to the official guide. Moreover, text recognition requests above the free limit are charged as stated in Google’s pricing policy.
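A minimal text detection request with the official google-cloud-vision client (version 2+; the file name here is just a placeholder) looks roughly like this:

```python
from google.cloud import vision

# Requires GOOGLE_APPLICATION_CREDENTIALS pointing at a service-account key
client = vision.ImageAnnotatorClient()

with open("document.png", "rb") as f:  # hypothetical input file
    content = f.read()

image = vision.Image(content=content)
response = client.text_detection(image=image)

# The first annotation aggregates the full detected text
if response.text_annotations:
    print(response.text_annotations[0].description)
```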
Despite fundamental differences in usage and available options, both tools attract virtually the same interest from web users, judging by Google Trends:
Moving forward, we will perform OCR in Python with both engines, comparing their performance along the way on real-life images (either recreated or scanned by the author to imitate documents of varying initial quality).
Approach for OCR comparison: an overview
To make the results as comparable as possible, we will take a ‘reversal’ approach: first we perform OCR on a text image without any preprocessing, then we repeatedly apply different degrading filters to the same image and try to machine-read its characters again. At each step, we assess OCR performance as the fraction of properly read characters relative to the number successfully and identically read by both tools on the initial step.
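As a rough sketch of that loop (count_proper_chars is a hypothetical helper returning how many characters a given engine reads correctly, and Gaussian blur stands in for the various degrading filters):

```python
from PIL import Image, ImageFilter

def degradation_scores(path, count_proper_chars, radii=(0, 1, 2, 3)):
    """Score OCR quality while the image degrades step by step."""
    image = Image.open(path)
    # Baseline: characters read correctly from the clean image
    baseline = count_proper_chars(image)
    scores = []
    for radius in radii:
        degraded = image.filter(ImageFilter.GaussianBlur(radius))
        # Fraction of baseline characters still recovered at this step
        scores.append(count_proper_chars(degraded) / baseline)
    return scores
```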
But as a starting point, let’s look at some quirks of each tool you should be warned about when performing OCR in Python with Google Vision and Tesseract.
Tesseract: joins what you might expect to be split
Things look quite natural when you apply pytesseract’s image_to_data method to images of adequate quality (i.e. sufficiently sharp and high-contrast). But with blurred images, the tool tends to determine text bounding boxes improperly. Here is exactly what happens.
And more specifically (note how some bounding boxes overlap each other):
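You can inspect these boxes yourself by parsing image_to_data’s dictionary output, along these lines:

```python
import pytesseract
from pytesseract import Output
from PIL import Image

image = Image.open("blurred_document.png")  # hypothetical input file
data = pytesseract.image_to_data(image, output_type=Output.DICT)

# On blurred input, neighbouring boxes printed here may overlap
for text, left, top, width, height in zip(
        data["text"], data["left"], data["top"],
        data["width"], data["height"]):
    if text.strip():  # skip empty structural entries
        print(f"{text!r}: x={left}, y={top}, w={width}, h={height}")
```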
To remedy such odd behaviour, we have to manually iterate over the bounding boxes, find those with significant horizontal intersection areas, and correct their right boundaries. In the overwhelming majority of cases, this makes the width of a bounding box precisely match the length of the contained string. The algorithm is already incorporated into the _get_tes_ocrdata method of the DemoOCR class and allows us to get adjusted data at each step of performing OCR on a modified image.
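A simplified sketch of such a correction (not the exact code from DemoOCR) might look like this:

```python
def fix_overlaps(boxes, min_overlap=5):
    """boxes: list of dicts with 'text', 'left', 'top', 'width', 'height' (pixels)."""
    boxes = sorted(boxes, key=lambda b: (b["top"], b["left"]))
    for cur, nxt in zip(boxes, boxes[1:]):
        # Treat boxes as being on the same line if their tops nearly coincide
        same_line = abs(cur["top"] - nxt["top"]) < cur["height"] // 2
        overlap = cur["left"] + cur["width"] - nxt["left"]
        if same_line and overlap > min_overlap:
            # Pull the right boundary back so the box ends where the next begins
            cur["width"] = max(nxt["left"] - cur["left"], 1)
    return boxes
```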
Google Vision: splits what you might expect to be joined
As opposed to Tesseract, Google Vision provides far more fragmented bounding boxes for recognised text entities. Note how it silently separates characters read as punctuation marks from the preceding words. This might be considered undesirable behaviour in some contexts, though! In our particular case, pulling a monetary unit out of the bounding box of the subsequent number makes it difficult to directly compare the engines’ performance in terms of the number of characters within a given area.
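You can see this fragmentation by iterating over the word-level annotations the API returns:

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("invoice.png", "rb") as f:  # hypothetical input file
    response = client.text_detection(image=vision.Image(content=f.read()))

# Skip index 0, which aggregates the whole detected text;
# each remaining entry is a separate token with its own polygon.
for annotation in response.text_annotations[1:]:
    vertices = [(v.x, v.y) for v in annotation.bounding_poly.vertices]
    print(annotation.description, vertices)
# A currency symbol typically shows up here as its own annotation,
# detached from the number that follows it.
```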
Approach for OCR comparison: finer points
It follows from the above that we should take into account not only the number of characters recognised at each step of the image’s modification but also their proper places, i.e. the bounding boxes determined at the initial step. Since the two OCR engines produce significantly different outputs, we need to manually correct the bounding boxes once more, this time by merging all overlapping polygons in the following way.
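One straightforward way to merge them (an assumption on my part; the repository code may differ) is shapely’s unary_union:

```python
from shapely.geometry import box
from shapely.ops import unary_union

def merge_boxes(rects):
    """rects: iterable of (left, top, right, bottom) tuples from both engines."""
    union = unary_union([box(l, t, r, b) for l, t, r, b in rects])
    # unary_union yields a single Polygon or a MultiPolygon
    polygons = getattr(union, "geoms", [union])
    # Replace each merged cluster with its axis-aligned bounding rectangle
    return [p.bounds for p in polygons]
```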
From then on, we count the number of recognised characters strictly within these baseline borders. For illustrative purposes, the number of properly read characters is shown over the corresponding area of the image as an opaque coloured horizontal bar: the shorter the stacked bars, the fewer characters were successfully read within the given area by the corresponding OCR engine.
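A possible counting scheme (again only a sketch: a word is attributed to a baseline region when its box centre falls inside it):

```python
def chars_per_region(words, regions):
    """words: (text, left, top, right, bottom) tuples;
    regions: (left, top, right, bottom) baseline rectangles."""
    counts = [0] * len(regions)
    for text, left, top, right, bottom in words:
        cx, cy = (left + right) / 2, (top + bottom) / 2
        for i, (l, t, r, b) in enumerate(regions):
            if l <= cx <= r and t <= cy <= b:
                counts[i] += len(text)  # credit the chars to this region
                break
    return counts
```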
Putting it all together
By comparing the OCR results of both tools at each step of an image’s modification with filters of varying intensity, we should get a sense of Google Vision’s and Tesseract’s specific features. Here is an example of an intermediate step and its OCR results from both tools.
Here’s our comparison approach in action:

And another example:

Yet one more.

Last one for now.

Take-home notes
As stated above, both Google Vision and Tesseract are mature tools that have historically found their way into many business projects. Both perform sufficiently accurate OCR on text images of passable quality even without preprocessing, although preprocessing with OpenCV or Pillow seems to significantly improve Tesseract’s results. You should also have noticed how erratically both tools perform on images with a textual background. This finding of our analysis broadly matches the insights of another comparison, though neither is by any means exhaustive. All in all, both engines can be considered easy-to-set-up-and-use OCR tools, and the right choice for your project depends heavily on external requirements and budget.
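For reference, a typical preprocessing recipe of this kind (a sketch with OpenCV, not the exact pipeline from the repository) could be:

```python
import cv2

def preprocess(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    img = cv2.medianBlur(img, 3)  # suppress salt-and-pepper noise
    # Otsu's thresholding binarises the page and strips faint background texture
    _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return img
```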
See the full code, with a bunch of handy methods for OCR and image preprocessing, on GitHub.