
Motivation
Large language models have taken the internet by storm, and in the excitement many people overlook the most important part of using these models: quality data!
This article presents a few techniques for efficiently extracting text from PDF and image documents. After completing this tutorial, you will have a clear idea of which tool to use depending on your use case.
The Python Libraries
This article focuses on the Pytesseract, easyOCR, PyPDF2, and LangChain libraries. The experimentation data is a one-page PDF file and is freely available on my GitHub.
Both Pytesseract and easyOCR work on images, so the PDF files must be converted into images before the content can be extracted.
The conversion can be done using pypdfium2, a powerful library for PDF file processing. Its installation and implementation are given below:
pip install pypdfium2
The function below takes the path to a PDF and returns a list with one dictionary per page, each mapping the page index to that page rendered as JPEG bytes.
from io import BytesIO
import pypdfium2 as pdfium

def convert_pdf_to_images(file_path, scale=300/72):
    pdf_file = pdfium.PdfDocument(file_path)
    page_indices = list(range(len(pdf_file)))
    # Render each page to a PIL image (this is the PdfDocument.render API the
    # article was written against; newer pypdfium2 releases may differ)
    renderer = pdf_file.render(
        pdfium.PdfBitmap.to_pil,
        page_indices=page_indices,
        scale=scale,
    )
    final_images = []
    for i, image in zip(page_indices, renderer):
        # Store each page as JPEG bytes keyed by its page index
        image_byte_array = BytesIO()
        image.save(image_byte_array, format='jpeg', optimize=True)
        final_images.append({i: image_byte_array.getvalue()})
    return final_images
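A quick word on the default scale=300/72. PDF page sizes are expressed in points of 1/72 inch, so this factor corresponds to rendering at roughly 300 DPI. The sketch below only does the arithmetic. The rendered_size helper is a hypothetical name of mine, not part of pypdfium2, and the real renderer may round a little differently.

```python
POINTS_PER_INCH = 72  # PDF page sizes are given in points of 1/72 inch


def rendered_size(width_pt, height_pt, dpi=300):
    """Estimate the pixel size of a page rendered at the given DPI."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


# US Letter page: 612 x 792 points
print(rendered_size(612, 792))  # (2550, 3300)
```

Lowering the DPI gives smaller images and faster OCR, usually at the cost of recognition quality on small fonts.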
Now, we can use the display_images function to visualize all the pages of the PDF file.
from io import BytesIO
import matplotlib.pyplot as plt
from PIL import Image

def display_images(list_dict_final_images):
    all_images = [list(data.values())[0] for data in list_dict_final_images]
    for index, image_bytes in enumerate(all_images):
        image = Image.open(BytesIO(image_bytes))
        # Size the figure to the image, assuming 100 pixels per inch
        plt.figure(figsize=(image.width / 100, image.height / 100))
        plt.title(f"----- Page Number {index+1} -----")
        plt.imshow(image)
        plt.axis("off")
        plt.show()
By combining the above two functions, we can get the following result:
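# Note: this rebinds the name convert_pdf_to_images to the list of page images,
# so the function itself can no longer be called afterwards in the same session
convert_pdf_to_images = convert_pdf_to_images('Experimentation_file.pdf')
display_images(convert_pdf_to_images)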

Now it is time to dive deep into the text extraction process!
Pytesseract
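Pytesseract (Python-tesseract) is an OCR tool for Python used to extract textual information from images. It is a wrapper around the Tesseract OCR engine, which must be installed separately on your system. The Python package itself is installed using the pip command: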
pip install pytesseract
The following helper function uses the image_to_string() function from Pytesseract to extract the text from each input image.
from io import BytesIO
from PIL import Image
from pytesseract import image_to_string

def extract_text_with_pytesseract(list_dict_final_images):
    image_list = [list(data.values())[0] for data in list_dict_final_images]
    image_content = []
    for image_bytes in image_list:
        image = Image.open(BytesIO(image_bytes))
        raw_text = str(image_to_string(image))
        image_content.append(raw_text)
    # Separate the text of consecutive pages with a newline
    return "\n".join(image_content)
The text can be extracted using the extract_text_with_pytesseract function as follows:
text_with_pytesseract = extract_text_with_pytesseract(convert_pdf_to_images)
print(text_with_pytesseract)
Successful execution of the above code generates the following result:
This document provides a quick summary of some of Zoumana's article on Medium.
It can be considered as the compilation of his 80+ articles about Data Science, Machine Learning and
Machine Learning Operations.
Whether you are just getting started or you're an experienced professional looking to upskill, these
materials can be helpful.
Data Science section covers basic to advanced
concepts such as statistics, model evaluation
metrics, SQL queries, NoSQL courses, data
visualization using Tableau and #powerbi, and
many more.
Link: httos://Inkd.in/g8zcS_vE
MLOps chapter explains how to build and
deploy models using different strategies such as
Docker containers, and GitHub actions on AWS
EC2 instances, Azure. Also, it covers how to build
REST APIs to serve your models.
Link: httos://Inkd.in/gyiUsdgz
Natural Language Processing Covers simple NLP
concepts to more advanced ones such as
Transformers and their applications in Finance,
Science, etc.
Link: httos://Inkd.in/gBdZbHty
Computer Vision section covers SOTA models
(e.g. YOLO) and different technics to mitigate
overfitting when training computer vision
models.
Link: httos://Inkd.in/gDY8ZqVs
Python section showcases multiple libraries to
facilitate one's daily life, especially when dealing
with PDF, and Word files when scraping data
from the web, and even benchmarking analysis
to help choose the right data processing tool.
Link: https://Inkd.in/gH HUMM9
Pandas & Python Tricks Covers my daily tips and
tricks on LinkedIn. And, there are plenty of those,
especially on my
website https://Inkd.in/gPbichB5
https://Inkd.in/QgUs8inuZ
Machine Learning part is about Fexplainable Al,
clustering, classification tasks, etc.
Link: httos://Inkd.in/gJdSvQns
Pytesseract was able to extract the content of the image, apart from a few character-level errors such as "httos" for "https". Here is how it managed to do it!
Pytesseract starts by identifying rectangular blocks of text within the input image, from top-left to bottom-right. It then extracts the content of each block, and the final result is the concatenation of the extracted contents. This approach works well for column-based PDFs and image documents.
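To see this block structure for yourself, Pytesseract's image_to_data() function reports a block number for every recognized word. Here is a minimal sketch that groups words by block. The helper names group_words_by_block and blocks_from_image are mine, and the sample dictionary is hand-made for illustration and only mimics the relevant keys of the real output.

```python
from collections import defaultdict


def group_words_by_block(ocr_data):
    """Group recognized words by Tesseract block number, in order of appearance."""
    blocks = defaultdict(list)
    for block_num, text in zip(ocr_data["block_num"], ocr_data["text"]):
        if text.strip():  # skip the empty entries emitted for layout-only rows
            blocks[block_num].append(text)
    return {num: " ".join(words) for num, words in blocks.items()}


def blocks_from_image(image):
    """Run Tesseract on a PIL image and return its text grouped by block."""
    # Imported lazily so the grouping helper works without Tesseract installed
    import pytesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(image, output_type=Output.DICT)
    return group_words_by_block(data)


# Hand-made sample with the same keys as the real dictionary (illustration only)
sample = {
    "block_num": [1, 1, 1, 2, 2],
    "text": ["", "Data", "Science", "MLOps", "chapter"],
}
print(group_words_by_block(sample))  # {1: 'Data Science', 2: 'MLOps chapter'}
```

Calling blocks_from_image() on one of the page images should make it easier to check which parts of the page were treated as a single block.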
easyOCR
This is also an open-source Python library for Optical Character Recognition, and it currently supports the extraction of text written in over 80 languages. easyocr requires both PyTorch and OpenCV. OpenCV can be installed using the instruction below.
!pip install opencv-python-headless==4.1.2.30
Depending on your OS, the installation of PyTorch might differ, but all the instructions can be found on the official page.
Now comes the installation of the easyocr library.
!pip install easyocr
It is important to specify the language of the document when using easyocr because of its multilingual nature. The languages are set through its Reader class, which takes a list of language codes, for instance fr for French and en for English. The exhaustive list of languages is available here.
With all this in mind, let’s get into the process!
from easyocr import Reader
# Load model for the English language
language_reader = Reader(["en"])
The text extraction process is implemented in the extract_text_with_easyocr function:
def extract_text_with_easyocr(list_dict_final_images):
    image_list = [list(data.values())[0] for data in list_dict_final_images]
    image_content = []
    for image_bytes in image_list:
        # readtext accepts raw image bytes and returns
        # (bounding box, text, confidence) tuples
        detections = language_reader.readtext(image_bytes)
        raw_text = " ".join(res[1] for res in detections)
        image_content.append(raw_text)
    # Separate the text of consecutive pages with a newline
    return "\n".join(image_content)
We can execute the above function as follows:
text_with_easy_ocr = extract_text_with_easyocr(convert_pdf_to_images)
print(text_with_easy_ocr)

The result of easyocr seems less accurate than that of Pytesseract. It read the first two paragraphs correctly. However, instead of treating each block of text separately, it reads row by row across the page. For instance, the string Data Science section covers basic to advanced from the first block has been combined with overfitting when training computer vision from the second block. This kind of combination disorganizes the structure of the text and distorts the end result.
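One possible workaround for the row-based reading is to reorder the detections yourself, using the bounding boxes that readtext() returns alongside the text. The sketch below is only an illustration under strong assumptions: the page region has exactly two columns, and you already know the x coordinate that separates them. The function name read_in_column_order and the sample coordinates are made up.

```python
def read_in_column_order(detections, column_split_x):
    """Reorder easyOCR-style detections so each column is read top to bottom.

    detections: list of (bbox, text, confidence), where bbox holds four [x, y] corners.
    column_split_x: x coordinate separating the left and right columns (assumed known).
    """
    def center(bbox):
        xs = [point[0] for point in bbox]
        ys = [point[1] for point in bbox]
        return sum(xs) / len(xs), sum(ys) / len(ys)

    def sort_key(detection):
        x_center, y_center = center(detection[0])
        column = 0 if x_center < column_split_x else 1
        # Left column first, then top to bottom, then left to right
        return (column, y_center, x_center)

    ordered = sorted(detections, key=sort_key)
    return " ".join(text for _, text, _ in ordered)


# Made-up detections: two rows spread over two columns
detections = [
    ([[0, 0], [100, 0], [100, 20], [0, 20]], "Data Science section", 0.9),
    ([[200, 0], [300, 0], [300, 20], [200, 20]], "overfitting when training", 0.9),
    ([[0, 30], [100, 30], [100, 50], [0, 50]], "concepts such as statistics", 0.9),
    ([[200, 30], [300, 30], [300, 50], [200, 50]], "models.", 0.9),
]
print(read_in_column_order(detections, column_split_x=150))
```

Full-width paragraphs above the columns would need to be handled separately, so this is a starting point rather than a general layout solution.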
PyPDF
PyPDF2 is a Python library dedicated to PDF processing tasks such as text and metadata retrieval, merging, cropping, etc.
!pip install PyPDF2
The extraction logic is implemented in the extract_text_with_pyPDF function:
from PyPDF2 import PdfReader

def extract_text_with_pyPDF(PDF_File):
    pdf_reader = PdfReader(PDF_File)
    raw_text = ''
    for page in pdf_reader.pages:
        # extract_text() can come back empty for pages without a text layer
        text = page.extract_text()
        if text:
            raw_text += text
    return raw_text
text_with_pyPDF = extract_text_with_pyPDF("Experimentation_file.pdf")
print(text_with_pyPDF)

The extraction process is fast and accurate, and it even keeps the original font size. The main limitation of PyPDF2 is that it cannot extract text from images, such as scanned pages, because it reads the text embedded in the PDF rather than running OCR.
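Given that limitation, a common pattern is to try the text layer first and fall back to OCR only for pages that come back empty. Below is a minimal sketch of that idea. The needs_ocr heuristic, its min_chars threshold, and the ocr_page callback are all assumptions of mine; the callback could be built from the image conversion and Pytesseract helpers shown earlier.

```python
def needs_ocr(page_text, min_chars=20):
    """Heuristic: treat a page as image-only when its text layer is (almost) empty."""
    return len((page_text or "").strip()) < min_chars


def extract_page_texts(pdf_path, ocr_page, min_chars=20):
    """Return one string per page, calling ocr_page(index) when the text layer is empty.

    ocr_page is a caller-supplied function (hypothetical) that returns the
    OCR text for the page at the given zero-based index.
    """
    from PyPDF2 import PdfReader  # imported lazily; only needed for real PDFs

    texts = []
    for index, page in enumerate(PdfReader(pdf_path).pages):
        text = page.extract_text() or ""
        texts.append(ocr_page(index) if needs_ocr(text, min_chars) else text)
    return texts


print(needs_ocr(""))  # True
print(needs_ocr("This document provides a quick summary."))  # False
```

The threshold is arbitrary and would need tuning, since some scanned PDFs carry a few stray characters in their text layer.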
LangChain
The UnstructuredImageLoader and UnstructuredFileLoader classes from langchain can be used to extract text from images and text/PDF files respectively, and both options will be explored in this section.
But first, we need to install the langchain library as follows (both loaders also rely on the unstructured package, which may need to be installed separately):
!pip install langchain
Text extraction from an image
from langchain.document_loaders.image import UnstructuredImageLoader
The text extraction function is given below:
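import os
import tempfile

def extract_text_with_langchain_image(list_dict_final_images):
    image_list = [list(data.values())[0] for data in list_dict_final_images]
    image_content = []
    for image_bytes in image_list:
        # The loader expects a file path, so write the JPEG bytes to a temporary file
        with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
            tmp.write(image_bytes)
        try:
            loader = UnstructuredImageLoader(tmp.name)
            data = loader.load()
        finally:
            os.remove(tmp.name)
        # Each image is loaded on its own, so join all documents returned for it
        raw_text = "\n".join(doc.page_content for doc in data)
        image_content.append(raw_text)
    return "\n".join(image_content)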
Now, we can extract the content:
text_with_langchain_image = extract_text_with_langchain_image(convert_pdf_to_images)
print(text_with_langchain_image)

The library managed to efficiently extract the content of the image.
Text extraction from a PDF
Below is the implementation for content extraction from PDF.
from langchain.document_loaders import UnstructuredFileLoader

def extract_text_with_langchain_pdf(pdf_file):
    loader = UnstructuredFileLoader(pdf_file)
    documents = loader.load()
    # Join the content of all returned documents with newlines
    pdf_pages_content = '\n'.join(doc.page_content for doc in documents)
    return pdf_pages_content
text_with_langchain_files = extract_text_with_langchain_pdf("Experimentation_file.pdf")
print(text_with_langchain_files)
Like PyPDF2, the langchain loader is capable of generating accurate results while keeping the original font size.
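Since several extractors now produce text for the same file, a rough way to compare their outputs is a character-level similarity score. The sketch below uses Python's built-in difflib. The normalize and similarity helpers are my own naming, and the two sample strings are illustrative only, with the "clean" one being a hypothetical corrected version of an OCR line.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so layout differences do not dominate."""
    return " ".join(text.lower().split())


def similarity(text_a, text_b):
    """Return a ratio between 0 and 1; 1.0 means the normalized texts are identical."""
    return SequenceMatcher(None, normalize(text_a), normalize(text_b)).ratio()


# Illustrative strings: an OCR line and a hypothetical clean version of it
ocr_line = "Machine Learning part is about Fexplainable Al,"
clean_line = "Machine Learning part is about explainable AI,"
print(round(similarity(ocr_line, clean_line), 2))
```

In practice you could pass any two of the extracted strings from this article, for example the Pytesseract and PyPDF2 outputs, to get a quick sense of how far apart they are.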

Conclusion
This short tutorial provided a brief overview of some well-known libraries. Each has its own strengths and weaknesses and should be chosen according to the use case. The complete code is available on my GitHub.
I hope this short tutorial helped you acquire new skill sets.