
Top 5 Python OCR Libraries for Extracting Text from Images

Understand and master OCR tools for text localization and recognition

Photo by Anna Sullivan on Unsplash

Optical Character Recognition is an old but still challenging problem that involves detecting and recognizing text in unstructured data, such as images and PDF documents. It has cool applications in banking, e-commerce, and content moderation on social media.

But as with every topic in Data Science, there is a huge amount of resources to sift through when trying to learn how to solve the OCR task. This is why I am writing this tutorial, which can help you get started.

In this article, I am going to show some Python libraries that allow you to quickly extract text from images without struggling too much. The explanation of each library is followed by a practical example. The dataset used is taken from Kaggle. To simplify the concepts, I am just using an image from the film Rush.

Let’s get started!

Image from textOCR dataset. Source.

Table of contents:

  1. pytesseract
  2. EasyOCR
  3. Keras-OCR
  4. TrOCR
  5. docTR

1. pytesseract

It is one of the most popular Python libraries for optical character recognition. It uses Google’s Tesseract-OCR Engine to extract text from images and supports multiple languages: check here to see whether yours is covered. You just need a few lines of code to convert an image into text:

# installation
!sudo apt install tesseract-ocr
!pip install pytesseract

import pytesseract
from pytesseract import Output
from PIL import Image
import cv2

img_path1 = '00b5b88720f35a22.jpg'

# extract the text from the image
text = pytesseract.image_to_string(img_path1, lang='eng')
print(text)
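pytesseract works with any language Tesseract has a trained model for. As a hypothetical example, this is how you would process Italian text, assuming the corresponding language pack has been installed first:

# hypothetical example: Italian text, assuming the 'ita' language pack
# is installed (e.g. !sudo apt install tesseract-ocr-ita)
text_ita = pytesseract.image_to_string(img_path1, lang='ita')
print(text_ita)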

This is the output:

We can also retrieve the bounding box coordinates for each item detected in the image.

# boxes around character
print(pytesseract.image_to_boxes(img_path1))

This is the result:

~ 532 48 880 50 0
...
A 158 220 171 232 0
F 160 220 187 232 0
I 178 220 192 232 0
L 193 220 203 232 0
M 204 220 220 232 0
B 228 220 239 232 0
Y 240 220 252 232 0
R 259 220 273 232 0
O 274 219 289 233 0
N 291 220 305 232 0
H 314 220 328 232 0
O 329 219 345 233 0
W 346 220 365 232 0
A 364 220 379 232 0
R 380 220 394 232 0
D 395 220 410 232 0
...

As you can notice, it estimates a bounding box for each character, not for each word! If we want to extract the box for each word, we should use image_to_data instead of image_to_boxes:

# boxes around words
print(pytesseract.image_to_data(img_path1))
Illustration by Author

The returned result is not perfect. For example, it interpreted "AFILM" as a single word. Moreover, it didn’t detect and recognize all the words in the input image.
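If you prefer a structured result over the raw string, pytesseract can also return the same data as a dictionary. A minimal sketch, relying on the standard keys of the Tesseract output:

# a minimal sketch: parse image_to_data into word-level boxes
# 'left', 'top', 'width', 'height', 'text' and 'conf' are standard
# keys of the Tesseract output
data = pytesseract.image_to_data(img_path1, output_type=Output.DICT)
for i in range(len(data['text'])):
    # skip empty entries and non-word rows (conf is -1 for those)
    if data['text'][i].strip() and float(data['conf'][i]) > 0:
        x, y = data['left'][i], data['top'][i]
        w, h = data['width'][i], data['height'][i]
        print(data['text'][i], (x, y, w, h))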

2. EasyOCR

Screenshot from web demo

It’s the turn of another open-source Python library: EasyOCR. Similarly to pytesseract, it supports 80+ languages. You can try it quickly and easily, without writing any code, through a web demo. It uses the CRAFT algorithm to detect the text and a CRNN as the recognition model. Moreover, these models are implemented in PyTorch.

If you work on Google Colab, I recommend enabling the GPU, which speeds this framework up considerably.

These are the code lines needed to use this tool:

# installation
!pip install easyocr

import easyocr

# build the reader for English
reader = easyocr.Reader(['en'])

# each detection is a (bounding box, text, confidence) tuple
extract_info = reader.readtext(img_path1)

for el in extract_info:
    print(el)
Illustration by Author

Without any effort, we have detected and recognized the text using EasyOCR. The results are much better compared to pytesseract. For each text detected, we also have the bounding box and the confidence level.
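Since each detection comes with a confidence score, we can, for instance, keep only the confident ones and draw their boxes with OpenCV. A minimal sketch, where the 0.5 threshold is an arbitrary choice to tune on your data:

import cv2
import numpy as np

img = cv2.imread(img_path1)
for bbox, text, conf in extract_info:
    if conf > 0.5:  # arbitrary threshold, tune it for your data
        points = np.array(bbox, dtype=np.int32)
        cv2.polylines(img, [points], isClosed=True, color=(0, 255, 0), thickness=2)
cv2.imwrite('easyocr_boxes.jpg', img)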

3. Keras-OCR

Keras-OCR is another open-source library specialized in optical character recognition. Like EasyOCR, it uses the CRAFT detection model and a CRNN recognition model to solve the task. The difference from EasyOCR is that it’s implemented in Keras instead of PyTorch. The main drawback of Keras-OCR is that its pretrained models only cover English.

# installation
!pip install keras-ocr -q

import keras_ocr

# the pipeline bundles the CRAFT detector and the CRNN recognizer
pipeline = keras_ocr.pipeline.Pipeline()
extract_info = pipeline.recognize([img_path1])
print(extract_info[0][0])

This is the output of the first word extracted:

('from',
 array([[761.,  16.],
        [813.,  16.],
        [813.,  30.],
        [761.,  30.]], dtype=float32))

To visualize all the results, we convert the output into a pandas DataFrame:

import pandas as pd

diz_cols = {'word': [], 'box': []}
for el in extract_info[0]:
    diz_cols['word'].append(el[0])
    diz_cols['box'].append(el[1])
kerasocr_res = pd.DataFrame.from_dict(diz_cols)
kerasocr_res
Illustration by Author

Magically, we can see that we obtain much clearer and more precise results.
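Keras-OCR also ships a small plotting helper, so we can draw the predictions directly on the image; a minimal sketch:

import matplotlib.pyplot as plt

# read the image and draw the (word, box) predictions on it
image = keras_ocr.tools.read(img_path1)
fig, ax = plt.subplots(figsize=(10, 10))
keras_ocr.tools.drawAnnotations(image=image, predictions=extract_info[0], ax=ax)
plt.show()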

4. TrOCR

TrOCR is a generative, transformer-based model that recognizes the text in images. It is composed of an encoder and a decoder: TrOCR uses a pre-trained image transformer as the encoder and a pre-trained text transformer as the decoder. For additional details, take a look at the paper. There is also good documentation of the library on the Hugging Face platform.

First, we load the pre-trained models:

# installation
!pip install transformers

from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

model_version = "microsoft/trocr-base-printed"
processor = TrOCRProcessor.from_pretrained(model_version)
model = VisionEncoderDecoderModel.from_pretrained(model_version)
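As a side note, Microsoft also publishes a handwritten variant of this model on the Hugging Face Hub; if your images contain handwriting, it can be loaded the same way:

# alternative checkpoint for handwritten text, loaded the same way
processor_hw = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model_hw = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")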

Before passing the image to the model, it needs to be resized and normalized; the processor takes care of these transformations. Once the image has been transformed, we can extract the text using the .generate() method.

image = Image.open(img_path1).convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
extract_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print('output: ',extract_text)
# output: 2.50

Unlike the previous libraries, it returns a meaningless number. Why? TrOCR only includes the recognition model, not the detection model. To solve the full OCR task, we first need to detect the text regions within the image and then recognize the text in each of them. Since TrOCR only covers the last step, it doesn’t perform well on full images.

To make it work well, it is better to crop specific portions of the image using a bounding box, like this:

crp_image = image.crop((750, 3.4, 970, 33.94))
display(crp_image)
Illustration by Author

Then, we apply the model again:

pixel_values = processor(crp_image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
extract_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(extract_text)
Illustration by Author

This is much better! This operation can be repeated for every word/phrase contained within the image.
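Putting the pieces together, we can pair any text detector with TrOCR’s recognizer. A minimal sketch that reuses EasyOCR (from section 2) for detection — an assumption on my side, not the official TrOCR pipeline:

# a minimal sketch: EasyOCR detects the boxes, TrOCR recognizes each crop
detections = easyocr.Reader(['en']).readtext(img_path1)
for bbox, _, _ in detections:
    xs = [point[0] for point in bbox]
    ys = [point[1] for point in bbox]
    crop = image.crop((min(xs), min(ys), max(xs), max(ys)))
    pixel_values = processor(crop, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])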

5. docTR

Finally, we are covering the last Python package for text detection and recognition from documents: docTR. It can read a document as a PDF or an image and pass it to a two-stage approach: a text detection model (DBNet or LinkNet) followed by a CRNN model for text recognition. The library is implemented in both TensorFlow and PyTorch, so you need at least one of these two deep learning frameworks installed.

! pip install python-doctr
# for TensorFlow
! pip install "python-doctr[tf]"
# for PyTorch
! pip install "python-doctr[torch]"

Afterwards, we import the relevant modules and load the model. Since docTR follows a two-step approach, we specify DBNet with a ResNet-50 backbone for text detection and a CRNN with a VGG-16 backbone for text recognition:

from doctr.io import DocumentFile
from doctr.models import ocr_predictor
model = ocr_predictor(det_arch = 'db_resnet50',
                      reco_arch = 'crnn_vgg16_bn',
                      pretrained = True
                     )

Then, we can finally read the file, use the pre-trained model and export the output as a nested dictionary:

# read file
img = DocumentFile.from_images(img_path1)

# use pre-trained model
result = model(img)

# export the result as a nested dict
extract_info = result.export()

This is the very long output:

{'pages': [{'page_idx': 0, 'dimensions': (678, 1024), 'orientation': {'value': None, 'confidence': None},...

For better visualization, we can use nested for loops to keep only the information we are interested in:

for obj1 in extract_info['pages'][0]["blocks"]:
    for obj2 in obj1["lines"]:
        for obj3 in obj2["words"]:
            print("{}: {}".format(obj3["geometry"],obj3["value"]))
Illustration by Author
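Note that docTR exports geometries as relative coordinates between 0 and 1. A minimal sketch to rescale them to pixels, assuming the exported dimensions are (height, width) as shown in the output above:

# rescale the relative geometries to pixel coordinates
h, w = extract_info['pages'][0]['dimensions']
for block in extract_info['pages'][0]['blocks']:
    for line in block['lines']:
        for word in line['words']:
            (xmin, ymin), (xmax, ymax) = word['geometry']
            box_px = (int(xmin * w), int(ymin * h), int(xmax * w), int(ymax * h))
            print(box_px, word['value'])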

That’s great! docTR is another good option to extract valuable information from images or PDFs.


Final thoughts:

These are five Python libraries that can be useful for your OCR project. Each of these tools has different advantages and disadvantages. Surely, the first thing to consider when choosing one of these packages is the language of the data you are analyzing. If you work with non-English languages, EasyOCR is probably the best choice for language coverage and performance. If you have other suggestions, comment here. I hope you have found this article useful for getting started with OCR. If you want to look at the complete output returned by these OCR models, the GitHub code is here. Have a nice day!

Disclaimer: This dataset is licensed under Attribution 4.0 International (CC BY 4.0)



