
How to Extract Text from Any PDF and Image for Large Language Models

Use these text extraction techniques to get quality data for your LLMs

Image by Patrick Tomasso on Unsplash

Motivation

Large language models have taken the internet by storm, and in the excitement many people overlook the most important part of using these models: quality data!

This article aims to provide a few techniques to efficiently extract text from any type of document. After completing this tutorial, you will have a clear idea of which tool to use depending on your use case.


The Python Libraries

This article focuses on the Pytesseract, easyOCR, PyPDF2, and LangChain libraries. The experimentation data is a one-page PDF file and is freely available on my GitHub.

Both Pytesseract and easyOCR work with images, so the PDF files must be converted into images before the content can be extracted.

The conversion can be done with pypdfium2, a powerful library for PDF file processing. Its installation and implementation are given below:

pip install pypdfium2

This function takes the path to a PDF as input and returns a list with one entry per page, where each entry is a dictionary mapping the page index to the page rendered as JPEG bytes.

from io import BytesIO

import pypdfium2 as pdfium

def convert_pdf_to_images(file_path, scale=300/72):

    pdf_file = pdfium.PdfDocument(file_path)

    page_indices = [i for i in range(len(pdf_file))]

    renderer = pdf_file.render(
        pdfium.PdfBitmap.to_pil,
        page_indices = page_indices, 
        scale = scale,
    )

    final_images = [] 

    for i, image in zip(page_indices, renderer):

        # Encode each rendered page as JPEG bytes, keyed by its page index
        image_byte_array = BytesIO()
        image.save(image_byte_array, format='jpeg', optimize=True)
        image_byte_array = image_byte_array.getvalue()
        final_images.append({i: image_byte_array})

    return final_images
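The default scale=300/72 deserves a short note. PDF page sizes are expressed in points (1/72 of an inch), so a scale of 1 corresponds to 72 DPI and a scale of 300/72 to roughly 300 DPI, a common resolution for OCR. The sketch below only illustrates that arithmetic; the helper names scale_for_dpi and rendered_size are hypothetical and not part of pypdfium2, and the pixel sizes are approximate because the renderer may round differently.

```python
# PDF page sizes are expressed in points, and one point is 1/72 of an inch.
POINTS_PER_INCH = 72


def scale_for_dpi(dpi):
    """Return the render scale that produces the requested dots per inch."""
    return dpi / POINTS_PER_INCH


def rendered_size(width_pt, height_pt, scale):
    """Approximate pixel size of a page rendered at the given scale."""
    return round(width_pt * scale), round(height_pt * scale)


# A US Letter page is 612 x 792 points (8.5 x 11 inches).
letter_at_300_dpi = rendered_size(612, 792, scale_for_dpi(300))
print(letter_at_300_dpi)
```

A larger scale generally gives the OCR engine more pixels per character, at the cost of bigger images and slower processing.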

Now, we can use the display_images function to visualize all the pages of the PDF file.

from io import BytesIO

import matplotlib.pyplot as plt
from PIL import Image

def display_images(list_dict_final_images):

    all_images = [list(data.values())[0] for data in list_dict_final_images]

    for index, image_bytes in enumerate(all_images):

        image = Image.open(BytesIO(image_bytes))
        figure = plt.figure(figsize = (image.width / 100, image.height / 100))

        plt.title(f"----- Page Number {index+1} -----")
        plt.imshow(image)
        plt.axis("off")
        plt.show()

By combining the above two functions, we can get the following result:

# Note: this assignment rebinds the name convert_pdf_to_images from the
# function to its result (the list of page images used in the rest of the article)
convert_pdf_to_images = convert_pdf_to_images('Experimentation_file.pdf')
display_images(convert_pdf_to_images)

Visualization of the PDF in image format (Image by Author)

Now it is time to dive deep into the text extraction process!

Pytesseract

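Pytesseract (Python-tesseract) is an OCR tool for Python used to extract textual information from images. It is a wrapper around the Tesseract OCR engine, which must be installed on the system separately. The Python package is installed using the pip command: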

pip install pytesseract

The following helper function uses the image_to_string() function from Pytesseract to extract the text from the input image.

from io import BytesIO

from PIL import Image
from pytesseract import image_to_string

def extract_text_with_pytesseract(list_dict_final_images):

    image_list = [list(data.values())[0] for data in list_dict_final_images]
    image_content = []

    for index, image_bytes in enumerate(image_list):

        image = Image.open(BytesIO(image_bytes))
        raw_text = str(image_to_string(image))
        image_content.append(raw_text)

    return "\n".join(image_content)

The text can be extracted using the extract_text_with_pytesseract function as follows:

text_with_pytesseract = extract_text_with_pytesseract(convert_pdf_to_images)
print(text_with_pytesseract)

Successful execution of the above code generates the following result:

This document provides a quick summary of some of Zoumana's article on Medium.
It can be considered as the compilation of his 80+ articles about Data Science, Machine Learning and

Machine Learning Operations.

Whether you are just getting started or you're an experienced professional looking to upskill, these

materials can be helpful.

Data Science section covers basic to advanced
concepts such as statistics, model evaluation
metrics, SQL queries, NoSQL courses, data
visualization using Tableau and #powerbi, and
many more.

Link: httos://Inkd.in/g8zcS_vE

MLOps chapter explains how to build and
deploy models using different strategies such as
Docker containers, and GitHub actions on AWS
EC2 instances, Azure. Also, it covers how to build
REST APIs to serve your models.

Link: httos://Inkd.in/gyiUsdgz

Natural Language Processing Covers simple NLP
concepts to more advanced ones such as
Transformers and their applications in Finance,
Science, etc.

Link: httos://Inkd.in/gBdZbHty

Computer Vision section covers SOTA models
(e.g. YOLO) and different technics to mitigate

overfitting when training computer vision
models.

Link: httos://Inkd.in/gDY8ZqVs

Python section showcases multiple libraries to
facilitate one's daily life, especially when dealing
with PDF, and Word files when scraping data
from the web, and even benchmarking analysis
to help choose the right data processing tool.
Link: https://Inkd.in/gH HUMM9

Pandas & Python Tricks Covers my daily tips and
tricks on LinkedIn. And, there are plenty of those,
especially on my

website https://Inkd.in/gPbichB5
https://Inkd.in/QgUs8inuZ

Machine Learning part is about Fexplainable Al,
clustering, classification tasks, etc.

Link: httos://Inkd.in/gJdSvQns

Pytesseract was able to extract the content of the image, with a few character-level errors (for example, "httos" instead of "https" in several links).

Here is how it managed to do it!

Pytesseract starts by identifying rectangular blocks within the input image, from top-left to bottom-right. It then extracts the content of each block, and the final result is the concatenation of the extracted contents. This approach works well for column-based PDFs and image documents, as the output above shows.
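One way to inspect these blocks is pytesseract's image_to_data function, which returns a block number for every recognized word when called with output_type=pytesseract.Output.DICT. The sketch below groups words by block. The helper name group_words_by_block is hypothetical, and the sample dictionary is hand-written to mimic the shape of that output so the example runs without Tesseract installed; it is not real OCR output.

```python
from collections import defaultdict


def group_words_by_block(ocr_data):
    """Group recognized words by Tesseract block number.

    `ocr_data` is assumed to have the shape returned by
    pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT),
    i.e. parallel lists under the keys "block_num" and "text".
    """
    blocks = defaultdict(list)
    for block_num, word in zip(ocr_data["block_num"], ocr_data["text"]):
        if word.strip():  # skip the empty entries emitted for non-word levels
            blocks[block_num].append(word)
    return {num: " ".join(words) for num, words in sorted(blocks.items())}


# Hand-written sample that mimics the output shape (not real OCR output).
sample = {
    "block_num": [1, 1, 1, 2, 2, 2],
    "text": ["", "Data", "Science", "", "Computer", "Vision"],
}
print(group_words_by_block(sample))
```

With a real page image, the same helper could be fed the dictionary returned by image_to_data to see which words Tesseract assigned to which block.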

easyOCR

easyOCR is also an open-source Python library for Optical Character Recognition, and it currently supports text written in over 80 languages. It requires both PyTorch and OpenCV; OpenCV can be installed using the instruction below.

!pip install opencv-python-headless==4.1.2.30

Depending on your OS, the installation of PyTorch might be different, but all the instructions can be found on the official page.

Now comes the installation of the easyocr library.

!pip install easyocr

Because easyocr is multilingual, it is important to specify the language of the document we are working with. This is done through its Reader class, which takes a list of language codes: for instance, fr for French and en for English. The exhaustive list of languages is available here.

With all this in mind, let’s get into the process!

from easyocr import Reader

# Load model for the English language
language_reader = Reader(["en"])

The text extraction process is implemented in the extract_text_with_easyocr function:

def extract_text_with_easyocr(list_dict_final_images):

    image_list = [list(data.values())[0] for data in list_dict_final_images]
    image_content = []

    for index, image_bytes in enumerate(image_list):

        image = Image.open(BytesIO(image_bytes))
        raw_text = language_reader.readtext(image)
        raw_text = " ".join([res[1] for res in raw_text])

        image_content.append(raw_text)

    return "\n".join(image_content)

We can execute the above function as follows:

text_with_easy_ocr = extract_text_with_easyocr(convert_pdf_to_images)
print(text_with_easy_ocr)

EasyOCR result (Image by Author)

The easyocr result is less accurate than the Pytesseract one. It reads the first two paragraphs correctly. However, instead of treating each block of text separately, it reads the page row by row. For instance, the string Data Science section covers basic to advanced from the first block has been combined with overfitting when training computer vision from the second block. This kind of combination disorganizes the structure of the text and distorts the end result.
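To make the row-based versus block-based difference concrete, here is a minimal sketch of one possible workaround for a two-column page. It relies on the fact that readtext returns a list of (bounding box, text, confidence) tuples. The function reorder_by_columns and the column_split_x parameter are hypothetical, the split position is assumed to be chosen by hand, and the sample data is made up to resemble the page above rather than taken from a real run.

```python
def reorder_by_columns(results, column_split_x):
    """Reorder EasyOCR-style results so the left column is read before the right one.

    `results` is assumed to be a list of (bbox, text, confidence) tuples, where
    bbox holds four [x, y] corner points starting at the top-left corner.
    `column_split_x` is a manually chosen x coordinate separating the columns.
    """
    def top_left(item):
        x, y = item[0][0]
        return x, y

    left = [r for r in results if top_left(r)[0] < column_split_x]
    right = [r for r in results if top_left(r)[0] >= column_split_x]

    # Read each column top to bottom, left column first.
    ordered = sorted(left, key=lambda r: top_left(r)[1])
    ordered += sorted(right, key=lambda r: top_left(r)[1])
    return " ".join(text for _, text, _ in ordered)


# Made-up detections laid out as two columns, listed in row-based order.
sample = [
    ([[10, 10], [200, 10], [200, 30], [10, 30]], "Data Science section covers", 0.9),
    ([[300, 10], [500, 10], [500, 30], [300, 30]], "overfitting when training", 0.9),
    ([[10, 40], [200, 40], [200, 60], [10, 60]], "concepts such as statistics", 0.9),
    ([[300, 40], [500, 40], [500, 60], [300, 60]], "computer vision models", 0.9),
]

row_based = " ".join(text for _, text, _ in sample)
column_based = reorder_by_columns(sample, column_split_x=250)
print(row_based)
print(column_based)
```

This heuristic only suits simple layouts with a known column boundary; documents with irregular layouts would need proper layout analysis.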

PyPDF2

PyPDF2 is also a Python library specifically for PDF processing tasks such as text and metadata retrieval, merging, cropping, etc.

!pip install PyPDF2

The extraction logic is implemented in the extract_text_with_pyPDF function:

from PyPDF2 import PdfReader

def extract_text_with_pyPDF(PDF_File):

    pdf_reader = PdfReader(PDF_File)

    raw_text = ''

    for i, page in enumerate(pdf_reader.pages):

        text = page.extract_text()
        if text:
            raw_text += text

    return raw_text

text_with_pyPDF = extract_text_with_pyPDF("Experimentation_file.pdf")
print(text_with_pyPDF)

Text extraction with PyPDF library (Image by Author)

The extraction process is fast and accurate, and it even keeps the original font size. The main issue with PyPDF2 is that it cannot extract text from images: it reads the text embedded in the PDF and does not perform OCR.
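Since PyPDF2 only sees embedded text, one common pattern is to try the text layer first and fall back to OCR when it comes back empty. The sketch below shows that decision logic only. The function extract_with_fallback, the min_chars threshold, and the two stand-in extractors are all hypothetical and exist so the example runs without any PDF on disk.

```python
def extract_with_fallback(pdf_path, text_layer_extractor, ocr_extractor, min_chars=20):
    """Use the embedded text layer when present, otherwise fall back to OCR.

    Both extractors are callables that take a file path and return a string.
    `min_chars` is an arbitrary threshold for "the file has usable text".
    """
    text = text_layer_extractor(pdf_path) or ""
    if len(text.strip()) >= min_chars:
        return text, "text-layer"
    return ocr_extractor(pdf_path), "ocr"


# Stand-in extractors so the sketch runs without real files or OCR engines.
def fake_text_layer(path):
    return "" if "scanned" in path else "This PDF has an embedded text layer."


def fake_ocr(path):
    return "Text recovered by OCR."


print(extract_with_fallback("report.pdf", fake_text_layer, fake_ocr))
print(extract_with_fallback("scanned_report.pdf", fake_text_layer, fake_ocr))
```

In the context of this article, the text-layer extractor could be extract_text_with_pyPDF, and the OCR extractor could be a small wrapper that renders the pages to images and passes them to extract_text_with_pytesseract.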

LangChain

The UnstructuredImageLoader and UnstructuredFileLoader modules from langchain can be used to extract text from images and text/pdf files respectively, and both options will be explored in this section.

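But first, we need to install the langchain library as follows. Note that these loaders rely on the unstructured package under the hood, which may need to be installed separately.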

!pip install langchain

Text extraction from an image

from langchain.document_loaders.image import UnstructuredImageLoader

The text extraction function is given below:

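import os
import tempfile

def extract_text_with_langchain_image(list_dict_final_images):

    image_list = [list(data.values())[0] for data in list_dict_final_images]
    image_content = []

    for index, image_bytes in enumerate(image_list):

        # The loader is documented to take a file path, so the JPEG bytes
        # are written to a temporary file first
        with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
            tmp.write(image_bytes)

        try:
            loader = UnstructuredImageLoader(tmp.name)
            data = loader.load()
        finally:
            os.remove(tmp.name)

        # Join every document returned for this page (one in the default mode)
        raw_text = "\n".join(doc.page_content for doc in data)

        image_content.append(raw_text)

    return "\n".join(image_content)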

Now, we can extract the content:

text_with_langchain_image = extract_text_with_langchain_image(convert_pdf_to_images)
print(text_with_langchain_image)

Text extraction from langchain UnstructuredImageLoader (Image by Author)

The library managed to efficiently extract the content of the image.

Text extraction from a PDF

Below is the implementation for content extraction from PDF.

from langchain.document_loaders import UnstructuredFileLoader

def extract_text_with_langchain_pdf(pdf_file):

    loader = UnstructuredFileLoader(pdf_file)
    documents = loader.load()
    pdf_pages_content = '\n'.join(doc.page_content for doc in documents)

    return pdf_pages_content

text_with_langchain_files = extract_text_with_langchain_pdf("Experimentation_file.pdf")
print(text_with_langchain_files)

Like PyPDF2, the langchain loader generates accurate results while keeping the original font size.

Text extraction from langchain UnstructuredFileLoader (Image by Author)

Conclusion

This short tutorial provided a brief overview of some well-known libraries. Each has its own strengths and weaknesses and should be chosen according to the use case. The complete code is available on my GitHub.

I hope this short tutorial helped you acquire new skill sets.


