The Future Of Text Recognition Is Artificial Intelligence

A gentle introduction to Google’s Tesseract-OCR engine

Safae Mellouk
Towards Data Science


Photo by Kurt Cagle

The 21st century is digitally connected. We rarely send handwritten letters or use printed text anymore, for the simple reason that we have come to rely on computers to process data and make life easier. This created the need to digitize physical documents so they can be electronically edited, manipulated, searched, managed, stored, and, above all, interpreted by machines.

Optical Character Recognition made it possible to convert text captured in images of typed, handwritten, or printed text into digitized and usable machine-encoded text.

Optical Character Recognition is a field of research in Artificial Intelligence and Computer Vision that consists in extracting text from images.

Today, OCR is undergoing an unprecedented revolution thanks to Artificial Intelligence tools. OCR is no longer just a traditional image-to-text conversion process but also a checker for human mistakes.

An OCR tool can convert text captured in images into text usable by computers.

The tool we will be talking about in this article is Tesseract.

Tesseract’s logo

Tesseract: The research project that changed the way we perceive optical character recognition

Tesseract is an open-source OCR engine written in C and C++ that was originally developed at HP between 1984 and 1994. It began as a Ph.D. research project in HP Labs, Bristol, as a possible software/hardware add-on for HP’s scanners. Commercial OCR engines of the day were weak and failed to achieve satisfying accuracy. By 1995, Tesseract was among the top three OCR engines in terms of character accuracy: the engine was sent to UNLV for the 1995 Annual Test of OCR Accuracy, where it proved its worth against the commercial engines of the time. In late 2005, HP released Tesseract as open source. It is now available at http://code.google.com/p/tesseract-ocr.

In 2006, Google began sponsoring and maintaining the tool and has since released updated versions of Tesseract with different architectures and support for over 100 languages.

Version 4.0 of Tesseract uses a neural network system based on LSTMs, which improved its accuracy astonishingly. It is trained on 123 languages, which makes Tesseract a great tool and a wonderful example of how Artificial Intelligence can advance text recognition research.

Now let’s have some practice and learn how to use the Tesseract system within Python to perform Optical Character Recognition!

To access Tesseract via Python programming language we use the Python-tesseract package. Python-tesseract is a wrapper for Google’s Tesseract-OCR engine and provides an interface to the tesseract system binary.

Let’s begin by importing the Python-tesseract package.

import pytesseract
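Note that Python-tesseract is only a wrapper: the Tesseract engine itself must be installed separately and be discoverable on your PATH. If it is not, you can point the wrapper at the binary explicitly. This is a configuration sketch; the path shown is just an example Windows location, not a path from this article:

```python
import pytesseract

# Only needed when the tesseract binary is not on your PATH.
# Example Windows install location; adjust for your own system:
# pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
```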

We import the Python Imaging Library (PIL) to process the image we want to apply OCR on.

The Image module is required so that we can load our image in PIL format; pytesseract requires this format.

import PIL
from PIL import Image

We load our image using the PIL.Image.open function:

image = Image.open('tesseract.JPG')

Let’s display our image

display(image)
Screenshot from a google search on Tesseract

Now we can apply OCR to our image using the pytesseract.image_to_string function, which returns the result of running Tesseract OCR on the image as a string.

text=pytesseract.image_to_string(image)

Let’s print the text variable to see the result of the OCR

print(text)
Result of extracting text from image using Tesseract

We can see that we obtained perfect results. The pytesseract.image_to_string function extracted exactly the text captured in the image.

The image we applied OCR on was clear, but unfortunately real-life situations are far from perfect: images are usually noisy, which makes a clean separation of the foreground text from the background a real challenge.

Let’s try running Tesseract on a noisy image.

We will extract text captured in this image as an example:

We proceed in the same way as before: we load our image as a PIL.Image object and pass it to the pytesseract.image_to_string function to extract text from it.

You can see that the function returned an empty string because it didn’t succeed in extracting the text from the image.

We need then to preprocess our image!

Here, I will suggest 3 methods you can use to improve the performance of Tesseract on noisy images. These tricks aren’t trivial, and they depend highly on the content of your image.

First method: Converting image to grayscale

A grayscale image is one in which the value of each pixel is a single sample representing only an amount of light. Grayscale images can then be the result of measuring the intensity of light at each pixel.

A pixel in a grayscale image can take any integer value from 0 to 255, so the image is composed of up to 256 shades of gray.
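For color inputs, each grayscale value is computed as a weighted sum of the red, green, and blue channels; Pillow’s mode-“L” conversion uses the ITU-R 601 luma transform. A minimal pure-Python sketch of that formula (ignoring rounding details):

```python
def rgb_to_gray(r, g, b):
    # ITU-R 601 luma weights, the transform Pillow documents for mode "L":
    # L = R * 299/1000 + G * 587/1000 + B * 114/1000
    return (r * 299 + g * 587 + b * 114) // 1000

print(rgb_to_gray(255, 0, 0))  # → 76: pure red maps to a fairly dark gray
```

Green contributes the most to perceived brightness, which is why its weight (587) is the largest.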

To convert a PIL image to grayscale we use:

gray = image.convert('L')

PIL.Image.convert method returns a converted copy of our image.

Let’s save this copy:

gray.save('gray.jpg')

Let’s now load it and display it to see what a grayscale image looks like:

grayscale_image = Image.open('gray.jpg')
display(grayscale_image)

Grayscale images look like black-and-white images, but they are different: one-bit bi-tonal black-and-white images contain only two colors, black and white, while grayscale images also have many shades of gray in between.

Now let’s run Tesseract on the grayscale version of our image.

text=pytesseract.image_to_string(grayscale_image)
print(text)

We obtain the result:

We can see that converting the image to grayscale improves Tesseract’s performance on it.

Second method: Image Binarization

In this method, we will extract text from our image by binarizing it.

Image binarization is the process of converting an image to black-and-white; it reduces the information contained in the grayscale version of the image from 256 shades of gray to 2 (black and white), producing a binary image.

We can binarize our image using:

binary = image2.convert('1')

This makes a binary copy of the image and stores it in a PIL.Image object named binary.

We save the copy, load it, and display it:

binary.save('binary.jpg')
binary_image = Image.open('binary.jpg')
display(binary_image)
Output image

Now, let’s run Tesseract on it:

Hmm, the results don’t look like what we expected. Why, and how can we improve our binarization?

The process of binarization works by choosing a threshold value. In the grayscale version of the image, pixels less than the chosen threshold value are set to 0 (black), and pixels greater than the threshold value are set to 255 (white).

The results we obtained may be due to a bad threshold value, since we used the default conversion for binarization. Let’s build our own function: one that takes an image and a threshold value and returns the binarized result using that threshold.

def binarize(image, threshold):
    binary_image = image.convert("L")
    for x in range(binary_image.width):
        for y in range(binary_image.height):
            if binary_image.getpixel((x, y)) < threshold:
                binary_image.putpixel((x, y), 0)
            else:
                binary_image.putpixel((x, y), 255)
    return binary_image

This function:

1. Converts the image to grayscale

2. Loops through the image’s pixels,

3. Compares each pixel’s value to the threshold: if the pixel value is less than the threshold, it sets it to 0 (black); otherwise, to 255 (white).

Now, let’s run our function with different threshold values and see how they affect the performance of Tesseract:

Threshold = 0: We obtained a white image, because no grayscale pixel value is less than 0, so every pixel was set to 255 (white). Tesseract then couldn’t detect any text.

Threshold = 64: We obtained a clear image where only the text is visible. Our function removed the noisy background from the image, which made it easy for Tesseract to extract the text.

Threshold = 192: Our function blacked out part of the text, and the OCR run on the image returned an empty string.

Threshold = 256: We obtained a black image, because every grayscale pixel value is less than 256, so every pixel was set to 0 (black). Tesseract then couldn’t detect any text.
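Rather than trying thresholds by hand, a common automatic alternative (not used above, but worth knowing) is Otsu’s method, which picks the cutoff that maximizes the between-class variance of the grayscale histogram. A minimal pure-Python sketch:

```python
def otsu_threshold(pixels):
    """Pick the threshold maximizing between-class variance
    for a list of grayscale values in the range 0-255."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * hist[i] for i in range(256))
    sum_b = 0.0   # running sum of values in the dark class
    w_b = 0       # running count of dark-class pixels
    best_t, best_var = 0, -1.0
    for t in range(256):
        w_b += hist[t]
        if w_b == 0:
            continue
        w_f = total - w_b
        if w_f == 0:
            break
        sum_b += t * hist[t]
        mean_b = sum_b / w_b
        mean_f = (sum_all - sum_b) / w_f
        var_between = w_b * w_f * (mean_b - mean_f) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Two well-separated clusters: the cutoff lands at the dark cluster
print(otsu_threshold([10] * 50 + [200] * 50))  # → 10
```

The returned value can then be passed to the binarize function defined earlier instead of guessing by hand (note that binarize uses a strict less-than, so you may want to pass threshold + 1 to push pixels equal to the cutoff into the dark class).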

Third method: Resizing your image

Sometimes, poor performance from Tesseract can be due to the size of the image: detecting text in a very large or very tiny image can be difficult. PIL has a function to help you resize your images, with a multitude of options and filters:

PIL.Image.resize

This function returns a resized copy of the image.

It takes two parameters:

  • size: The requested size in pixels, as a 2-tuple: (width, height).
  • resample: An optional resampling filter. This can be one of PIL.Image.NEAREST (use nearest neighbor), PIL.Image.BILINEAR (linear interpolation), PIL.Image.BICUBIC (cubic spline interpolation), or PIL.Image.LANCZOS (a high-quality downsampling filter). If omitted, or if the image has mode “1” or “P”, it is set to PIL.Image.NEAREST.
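As a small illustration of building the size argument, here is a hypothetical helper (scaled_size is my own name, not part of PIL) that scales both dimensions by a factor:

```python
def scaled_size(width, height, factor):
    # Build the (width, height) 2-tuple that Image.resize expects,
    # scaling both dimensions by the same factor.
    return (int(width * factor), int(height * factor))

print(scaled_size(640, 480, 2))  # → (1280, 960)

# With Pillow, assuming `image` is a PIL.Image loaded as above:
# bigger = image.resize(scaled_size(*image.size, 2), Image.BICUBIC)
```

Scaling both dimensions by the same factor preserves the aspect ratio, which matters for OCR: stretching text in only one direction distorts the character shapes Tesseract relies on.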

Wrap-Up

In this article, you learned about Google’s Tesseract-OCR engine: how it began and how it contributed to making text recognition more accurate than ever. You also learned how to access Tesseract from Python using the Python-tesseract package, how to apply OCR to images, and how to improve Tesseract’s performance on noisy images. Now that you have the basic skills, why not build a scanner mobile app that takes an image as input, runs OCR on it, and writes the results to a text file? That would be your biggest step toward learning more about Tesseract!

Check out the Github repository I made for the tutorial here!

