Get Underlined Text from Any PDF with Python

A step-by-step guide to get underlined text as an array from PDF files.

Sasha Korovkina
Towards Data Science

💡 If you want to see the code for this project, check out my repository: https://github.com/sasha-korovkina/pdfUnderlinedExtractor

PDF data extraction can be a real headache, and it gets even trickier when you’re trying to snag underlined text — believe it or not, there aren’t any go-to solutions or libraries that handle this out of the box. But don’t worry, I’m here to show you how to tackle this.

Photo by dlxmedia.hu on Unsplash

The Theory

Extracting underlined text from PDFs can take a few different paths. You might consider using OCR to detect text components with bottom lines or delve into PyMuPDF’s markup capabilities. However, I’ve found that OCR tends to falter, suffering from inconsistency and low accuracy. PyMuPDF isn’t my favorite either — it demands finicky parameter tuning, which is time-consuming. Plus, one wrong setting and you could lose a bunch of data.

It is important to remember two limitations of PDFs:

  • Non-Structured Data: PDF elements often lack grouping or categorization, which makes it hard to search through the content systematically.
  • Poor Text Formatting Recognition: Detecting specific text formats such as bold or underlined is notoriously difficult in PDFs, as most Python libraries do not support this capability effectively.

But fear not, as we have a strategy to resolve this.

The Strategy

  • Convert the PDF to Structured XML: Start by transforming the PDF document into a structured XML format to facilitate easier data manipulation.
  • Extract Desired Components: Identify and isolate the specific components from the XML that are relevant to our needs.
  • Run OCR on the Extracted Coordinates: Apply OCR (Optical Character Recognition) to the page regions at those coordinates to read the underlined text.
  • Output the Underlined Text: Finally, collect the recognized text into an array and display or print the results.

The Code

1. PDF to XML

We will use the pdfquery library, the most comprehensive PDF-to-XML converter I have come across.
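As a minimal sketch of this conversion step, assuming pdfquery is installed (`pip install pdfquery`): the helper names `default_xml_path` and `pdf_to_xml` are my own for illustration, not part of the library.

```python
from pathlib import Path
from typing import Optional


def default_xml_path(pdf_path: str) -> str:
    """Hypothetical helper: put the XML next to the PDF ('doc.pdf' -> 'doc.xml')."""
    return str(Path(pdf_path).with_suffix(".xml"))


def pdf_to_xml(pdf_path: str, xml_path: Optional[str] = None) -> str:
    """Parse a PDF with pdfquery and write its element tree to an XML file."""
    # Imported here so the path helper above works even without pdfquery installed.
    import pdfquery

    xml_path = xml_path or default_xml_path(pdf_path)
    pdf = pdfquery.PDFQuery(pdf_path)
    pdf.load()  # parse the document into an lxml element tree
    pdf.tree.write(xml_path, pretty_print=True, encoding="utf-8")
    return xml_path
```

Calling `pdf_to_xml("my_file.pdf")` should leave a `my_file.xml` next to the PDF, which you can open in any text editor to inspect the components.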

2. Studying the XML

The XML has a few key components which we are interested in:

  • LTRect: sometimes the library parses the underline as a very thin rectangle beneath the text.
  • LTLine: other times it recognises the underline as a separate line component.

This is what your output XML will look like. Image created by author.

LTRect component example:

<LTRect y0="563.787" y1="629.964" x0="367.942" x1="473.826" width="105.884" height="66.178" bbox="[367.942, 563.787, 473.826, 629.964]" linewidth="0" pts="[[367.942, 629.964], [473.826, 629.964], [473.826, 563.787], [367.942, 563.787]]">

Therefore, by converting the whole document into XML format, we can replicate its structure as XML components. Let's do just that!
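To make this concrete, here is a small sketch that pulls the LTRect and LTLine components out of the XML using only the standard library. The sample string is a simplified stand-in for real pdfquery output: it reuses the LTRect attributes from the example above, while the `pdfxml` and `LTPage` wrapper and the self-closing tag are my assumptions. The function name `find_shapes` is hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Simplified stand-in for the XML written by pdfquery (wrapper tags are assumed).
SAMPLE_XML = """
<pdfxml>
  <LTPage>
    <LTRect y0="563.787" y1="629.964" x0="367.942" x1="473.826"
            width="105.884" height="66.178"
            bbox="[367.942, 563.787, 473.826, 629.964]" linewidth="0" />
  </LTPage>
</pdfxml>
"""


def find_shapes(xml_text):
    """Return (tag, bbox) pairs for every LTRect and LTLine in the XML."""
    root = ET.fromstring(xml_text)
    shapes = []
    for tag in ("LTRect", "LTLine"):
        for element in root.iter(tag):
            # The bbox attribute is a JSON-style list: [x0, y0, x1, y1].
            shapes.append((tag, json.loads(element.get("bbox"))))
    return shapes


for tag, bbox in find_shapes(SAMPLE_XML):
    print(tag, bbox)
```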

Structure Replication

Now, we will re-create the structure of our document as bounding box coordinates. To do this, we will parse the XML to define the page, component boxes, lines and rectangles, and then draw them all on our canvas in 3 different colors.

PDF object visualization.
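A rough sketch of the drawing step, assuming the canvas comes from reportlab (which is what the `can.drawString` call used later in this article suggests). The function names and the two-list interface are hypothetical; the boxes are `(x0, y0, x1, y1)` tuples read from the XML.

```python
def bbox_to_rect(bbox):
    """Convert (x0, y0, x1, y1) into the (x, y, width, height) a canvas rect needs."""
    x0, y0, x1, y1 = bbox
    return x0, y0, x1 - x0, y1 - y0


def draw_structure(out_path, page_size, component_boxes, underline_boxes):
    """Draw all components in black and the underline candidates in blue."""
    # Imported here so bbox_to_rect can be used without reportlab installed.
    from reportlab.pdfgen import canvas

    can = canvas.Canvas(out_path, pagesize=page_size)

    can.setStrokeColorRGB(0, 0, 0)  # black: every component on the page
    for bbox in component_boxes:
        can.rect(*bbox_to_rect(bbox), stroke=1, fill=0)

    can.setStrokeColorRGB(0, 0, 1)  # blue: LTRect / LTLine elements
    for bbox in underline_boxes:
        can.rect(*bbox_to_rect(bbox), stroke=1, fill=0)

    can.save()
```

Both pdfquery and reportlab measure coordinates in points from the bottom-left corner of the page, so the boxes can be drawn without flipping the y-axis.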

Here is our initial PDF. It was generated in Microsoft Word by exporting a document with some underlines to the PDF file format:

Initial document with sample text. Image created by author.

After applying the algorithm above, here is the visual representation we get:

The box outline of the document. Black: all document components; blue: underlined text. Image created by author.

This image represents the structure of our document: the black boxes outline all components on the page, and the blue boxes outline the LTRect elements, that is, the underlined text.

Text Overlay

Now, let’s draw all of the text within the PDF in its respective position, with the following line of code:

can.drawString(text_x, text_y, text)

Here is the output:

PDF re-creation based on text location and underlines. Image created by author.

Note that the text is not exactly where it was in the original document, because the canvas draws it in a different font and size from the ones used in the original.
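The overlay step can be sketched as follows. I am assuming here that the text sits in `LTTextLineHorizontal` elements with `x0` and `y0` attributes, which is how it appears in the XML I have seen; the function names are hypothetical, and `can` is any canvas object with a `drawString(x, y, text)` method.

```python
import xml.etree.ElementTree as ET


def collect_text_positions(xml_text, tag="LTTextLineHorizontal"):
    """Return (text_x, text_y, text) for every non-empty text line in the XML."""
    root = ET.fromstring(xml_text)
    placed = []
    for element in root.iter(tag):
        text = (element.text or "").strip()
        if text:
            # The bottom-left corner of the line is used as the drawing origin.
            placed.append((float(element.get("x0")), float(element.get("y0")), text))
    return placed


def overlay_text(can, placed):
    """Draw each piece of text at its original position on the canvas."""
    for text_x, text_y, text in placed:
        can.drawString(text_x, text_y, text)
```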

Coordinate Extraction

From the XML, we obtain an array of coordinates of the underlined regions. In my case, I have called it underline_text.

A piece of code which forms an array of coordinates of underlined text within the PDF file.
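A minimal sketch of how such an array could be built with the standard library, assuming every LTRect and LTLine carries `x0`, `y0`, `x1` and `y1` attributes. The function name and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample: one text line, one thin rectangle and one line beneath text.
SAMPLE_XML = """
<pdfxml>
  <LTPage>
    <LTTextLineHorizontal x0="72" y0="700" x1="300" y1="712">Lorem ipsum dolor</LTTextLineHorizontal>
    <LTRect x0="105.5" y0="698.2" x1="140.0" y1="699.0" />
    <LTLine x0="200.0" y0="650.0" x1="260.0" y1="650.0" />
  </LTPage>
</pdfxml>
"""


def collect_underline_boxes(xml_text, tags=("LTRect", "LTLine")):
    """Return (x0, y0, x1, y1) tuples for every rectangle or line, in document order."""
    root = ET.fromstring(xml_text)
    underline_text = []
    for element in root.iter():
        if element.tag in tags:
            box = tuple(float(element.get(key)) for key in ("x0", "y0", "x1", "y1"))
            underline_text.append(box)
    return underline_text


underline_text = collect_underline_boxes(SAMPLE_XML)
print(underline_text)
```

Keep in mind that not every rectangle or line in a PDF is an underline (table borders and page frames show up as the same components), so on busier documents you may want to filter the boxes by size or position.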

Text Extraction

Here’s the process:

  1. We identify the coordinate rectangles as previously determined.
  2. We extract these sections from the PDF.
  3. We apply Tesseract OCR to extract text from each extracted section.

This method of extracting text from PDFs using coordinate rectangles and Tesseract OCR is effective for several reasons:

  1. Precision in Text Extraction: By identifying specific coordinate rectangles, the process targets only relevant areas of the PDF. This focused approach avoids unnecessary processing of the entire document and reduces errors related to extracting unwanted text.
  2. Efficiency: Extracting predefined sections directly from the PDF is much faster than processing the entire document. This method saves computational resources and time, particularly useful when dealing with large documents.
  3. Accuracy with OCR: Tesseract OCR is a robust optical character recognition tool that can convert images of text into machine-readable text. By feeding it precise sections of text, it can perform more accurately as it deals with less background noise and formatting issues that might confuse the OCR process in larger, unsegmented documents.

And this is the code:

Code to extract underlined text from the PDF sections.
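Here is a hedged sketch of this step, assuming the page has already been rendered to a PIL image at a known DPI and that pytesseract is installed. PDF coordinates are in points (72 per inch) with the origin at the bottom-left, while image pixels start at the top-left, so each box has to be scaled and flipped before cropping. The function names and the `lift_pt` parameter (extra height added above a thin underline so the crop contains the word itself) are my own assumptions.

```python
def pdf_bbox_to_pixel_box(bbox, page_height_pt, dpi=300, lift_pt=0.0):
    """Convert a PDF-space (x0, y0, x1, y1) box into a PIL crop box in pixels."""
    x0, y0, x1, y1 = bbox
    scale = dpi / 72.0  # PDF points per inch -> image pixels per inch
    left = round(x0 * scale)
    right = round(x1 * scale)
    # Flip the y-axis: PDF y grows upwards, image y grows downwards.
    top = round((page_height_pt - (y1 + lift_pt)) * scale)
    bottom = round((page_height_pt - y0) * scale)
    return left, max(top, 0), right, bottom


def ocr_regions(page_image, boxes, page_height_pt, dpi=300, lift_pt=0.0):
    """Crop each box out of the rendered page and run Tesseract on it."""
    # Imported here so the coordinate helper works without pytesseract installed.
    import pytesseract

    words = []
    for bbox in boxes:
        crop_box = pdf_bbox_to_pixel_box(bbox, page_height_pt, dpi, lift_pt)
        text = pytesseract.image_to_string(page_image.crop(crop_box)).strip()
        if text:
            words.append(text)
    return words
```

One way to get `page_image` is `pdf2image.convert_from_path(pdf_path, dpi=300)[0]`, with `page_height_pt` taken from the page's bounding box in the XML. The same `dpi` value must be used for rendering and for the coordinate conversion.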

Make sure that you have tesseract installed on your system before running this function. For in-depth instructions, check out their official installation guide here: https://github.com/tesseract-ocr/tessdoc/blob/main/Installation.md or in my GitHub repository here: https://github.com/sasha-korovkina/pdfUnderlinedExtractor.

Putting It All Together…

Now, if we take any PDF file, like this example file:

The whole text of the test file. Image created by author.

We have some underlined words in this file:

ipsum and laboris are underlined here. Image created by author.

After running the code described above, here is what we get:

An array of all underlined words in the document. Image created by author.

After getting this array, you can use these words for further processing!

Enjoy using this script! I’d love to hear about any creative applications you come up with or if you’d like to contribute. Let me know! ❤️
