COMPUTER VISION

How to extract text from memes with Python, OpenCV and Tesseract OCR

A starting point pipeline to tackle the extraction of text from memes

Egon Ferri
Towards Data Science
6 min read · Dec 9, 2020


Written with Lorenzo Baiocco.

Photo by Md Mahdi on Unsplash

Introduction

Extracting text from an image can serve different purposes.
In our case, we needed the text to improve the performance of our multi-modal sentiment classification model, which works on tweets accompanied by images. Since we found that the most common reaction pics on social media are formatted as memes, we developed a pipeline to extract text from images formatted like that, and in this article we'll present it.

Environment set up

Currently (Nov 2020), the state of the art in text extraction through OCR methods is represented by Google's Tesseract OCR, the most widely used open-source software for this task.

Tesseract is easy to install (following this link) and to use in a Python environment through the pytesseract library.

The environment used for this article is the following:

  • Python 3.7.9
  • Tesseract 5.0.0
  • Pytesseract
  • Pillow
  • Matplotlib
  • OpenCV
  • Numpy

The first thing to do is to import all the packages:

from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import cv2
import pytesseract
# change this path if you installed Tesseract in another folder:
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
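
Before moving on, it is worth checking that pytesseract can actually reach the Tesseract binary. A minimal sanity check, assuming a standard installation:

# raises if the Tesseract executable cannot be found
print(pytesseract.get_tesseract_version())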

We are ready to extract some text! Let's load a dummy meme to avoid breaking copyright (anyway, this meme was made with Imgflip).

im = np.array(Image.open('..\\immagini_test_small\\meme3.jpg'))
plt.figure(figsize=(10,10))
plt.title('PLAIN IMAGE')
plt.imshow(im); plt.xticks([]); plt.yticks([])
plt.savefig('img1.png')
Image by author

Searching for text is as easy as this:

text = pytesseract.image_to_string(im)
print(text.replace('\n', ' '))

And the result is… " a a TTY SHE aT aN Sa ithe Pet". Woooah, something is going on: Tesseract is not working. The problem is that Tesseract is optimized to recognize text in typical documents, so it struggles to recognize text inside an image without preprocessing. The Tesseract documentation suggests many ways to improve the quality of the input image. After some fine-tuning, we found a very clean method that worked perfectly on the subset of data we used. Before diving into it, keep in mind that, although these operations worked very well for our data, a couple of parameters were tuned empirically; if your dataset differs from ours, it could be helpful to tune them a little.

Image cleaning

The first function we applied to our image is a bilateral filter. If you want to understand in depth how it works, there is a nice tutorial on the OpenCV site, and you can find the description of the parameters here.

In a nutshell, this filter helps remove noise but, in contrast with other filters, preserves edges instead of blurring them. It does so by excluding from the blurring of a point those neighbors that do not have similar intensities. With the chosen parameters, the difference from the original image is barely perceptible; however, it led to better final performance.

im = cv2.bilateralFilter(im, 5, 55, 60)
plt.figure(figsize=(10,10))
plt.title('BILATERAL FILTER')
plt.imshow(im); plt.xticks([]); plt.yticks([])
plt.savefig('img2.png',bbox_inches='tight')
Image by author

The second operation is pretty clear: we project our RGB image to grayscale.

im = cv2.cvtColor(im, cv2.COLOR_RGB2GRAY)  # PIL loads images as RGB, not BGR
plt.figure(figsize=(10,10))
plt.title('GRAYSCALE IMAGE')
plt.imshow(im, cmap='gray'); plt.xticks([]); plt.yticks([])
plt.savefig('img3.png',bbox_inches='tight')
Image by author

The last transformation is binarization: the same threshold value is applied to every pixel. If the pixel value is smaller than the threshold, it is set to 0; otherwise, it is set to 255. Since we have white text, we want to black out everything that is not almost perfectly white (almost, because the text is usually not exactly "255-white"). We found that 240 was a threshold that did the job. Since Tesseract is trained to recognize black text, we also need to invert the colors. OpenCV's threshold function can do both operations at once by selecting inverted binarization (cv2.THRESH_BINARY_INV).

_, im = cv2.threshold(im, 240, 255, cv2.THRESH_BINARY_INV)
plt.figure(figsize=(10,10))
plt.title('BINARY IMAGE')
plt.imshow(im, cmap='gray'); plt.xticks([]); plt.yticks([])
plt.savefig('img4.png',bbox_inches='tight')
Image by author

This is the input that we want! Let’s put everything into a function:

def preprocess_final(im):
    im = cv2.bilateralFilter(im, 5, 55, 60)
    im = cv2.cvtColor(im, cv2.COLOR_RGB2GRAY)
    _, im = cv2.threshold(im, 240, 255, cv2.THRESH_BINARY_INV)
    return im

Tesseract configuration

Before running Tesseract on the final image, we can tune its configuration a little to optimize the extraction. (These lists come directly from the documentation.)

There are four OEM (OCR Engine mode) options:

  • 0 Legacy engine only.
  • 1 Neural nets LSTM engine only.
  • 2 Legacy + LSTM engines.
  • 3 Default, based on what is available.

And fourteen PSM (Page segmentation mode) options; a quick way to compare them empirically is sketched right after the list:

  • 0 Orientation and script detection (OSD) only.
  • 1 Automatic page segmentation with OSD.
  • 2 Automatic page segmentation, but no OSD, or OCR. (not implemented)
  • 3 Fully automatic page segmentation, but no OSD. (Default)
  • 4 Assume a single column of text of variable sizes.
  • 5 Assume a single uniform block of vertically aligned text.
  • 6 Assume a single uniform block of text.
  • 7 Treat the image as a single text line.
  • 8 Treat the image as a single word.
  • 9 Treat the image as a single word in a circle.
  • 10 Treat the image as a single character.
  • 11 Sparse text. Find as much text as possible in no particular order.
  • 12 Sparse text with OSD.
  • 13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
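
If you are unsure which combination fits your data, a quick empirical comparison can help. Below is a minimal sketch, assuming im already holds a preprocessed image as above; the candidate PSM values are just examples:

# exploratory sketch: print the output of a few candidate PSM modes
for psm in (3, 6, 11, 12):
    config = f'--oem 3 --psm {psm}'
    out = pytesseract.image_to_string(im, lang='eng', config=config)
    print(psm, repr(out.replace('\n', ' ')))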

We found OEM 3 and PSM 11 to work best. Furthermore, we decided to give Tesseract a whitelist of acceptable characters, since we preferred to keep only capital letters, in order to avoid the small text and strange characters that Tesseract sometimes picks up.

custom_config = r"--oem 3 --psm 11 -c tessedit_char_whitelist='ABCDEFGHIJKLMNOPQRSTUVWXYZ '"

Now we can check if everything is finally working:

img=np.array(Image.open('..\\immagini_test_small\\meme3.jpg'))
im=preprocess_final(img)
text = pytesseract.image_to_string(im, lang='eng', config=custom_config)
print(text.replace('\n', ' '))

And the answer is… "WHEN TOWARDS DATA SCIENCE REFUSES YOUR ARTICLE".
Et voilà! Now everything works perfectly.
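
If you also want to inspect the result word by word, pytesseract exposes image_to_data, which returns bounding boxes and confidence scores. A small sketch; the 60-point confidence cut-off is just an illustrative assumption:

from pytesseract import Output

data = pytesseract.image_to_data(im, lang='eng', config=custom_config,
                                 output_type=Output.DICT)
# keep only words with reasonable confidence (-1 marks non-text blocks)
words = [w for w, c in zip(data['text'], data['conf'])
         if float(c) > 60 and w.strip()]
print(' '.join(words))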

Conclusion

In this brief article, we defined a simple pipeline. Its design was driven by our precise scope, so if you need something similar but different, you should consider modifying it. For example, if you also need the black text that sometimes sits in the upper part of memes, you could consider a double stream and a final join of the results; a possible sketch follows. Anyway, we hope this simple procedure helps you get started!
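
As a rough illustration of that idea, here is a hypothetical double-stream sketch; the black-text threshold of 60 is an assumption to be tuned, not a value we validated:

def extract_two_streams(img):
    # double-stream sketch: one pass for white text, one for black text
    im = cv2.bilateralFilter(img, 5, 55, 60)
    im = cv2.cvtColor(im, cv2.COLOR_RGB2GRAY)
    # stream 1: near-white pixels become black text on a white background
    _, white = cv2.threshold(im, 240, 255, cv2.THRESH_BINARY_INV)
    # stream 2: near-black pixels stay black, everything else turns white
    _, black = cv2.threshold(im, 60, 255, cv2.THRESH_BINARY)
    texts = [pytesseract.image_to_string(s, lang='eng', config=custom_config)
             for s in (white, black)]
    return ' '.join(t.replace('\n', ' ').strip() for t in texts)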


Data scientist and computer vision engineer at @Immobiliare.it. Interests: data, AI, ML, geopolitics, sports, biology. www.linkedin.com/in/egon-ferri