
How to Extract PDF data in Python

Adobe makes it difficult to do this without a subscription – but this should help.

Photo by iMattSmart on Unsplash

PDFs, for some reason, are still used all the time in industry, and they’re really annoying – especially if you don’t pay for certain subscriptions to help you manage them. This article is for people in that situation: people who need to get text data from PDFs without paying for it.

First of all, if you’re looking to analyse handwritten text, this is the wrong article – but one on that is coming soon.

The process will consist of converting the PDF to .txt and then extracting the data through regex and other simple methods.

If you haven’t read my article on automating your keyboard to convert PDFs to .txt en masse, then I recommend you do that first. It will save you a lot of time. And if you don’t want to click away, then here’s all the code to do it.

Convert to .txt and then read between the lines

Now that you’ve converted to .txt files, all you have to do is write some code that pulls out the answers that you want.

When translated to .txt files, outputs can come out a bit funny. Sometimes the text surrounding a question can be above the response box, and sometimes it can be below. I’m not sure if there is a technical reason for this or if it’s simply to make doing something like this more difficult.

The trick is to look for constants in the text and isolate them.

Either way, there’s a solution. We only want the answers and care little for the text surrounding them. Luckily, when converted to .txt files, all of our input sections begin on a new line. And, as we know, if there is a constant factor surrounding everything we are trying to extract, that makes our lives a lot easier.

Therefore, we can read our .txt file into Python with open() and read(), and then use splitlines() on it. This will provide a list of strings, with a new instance starting every time there was a newline character (\n) in the original string.

import os

os.chdir(r"path/to/your/file/here")

# Read the whole file in, then split it into one string per line
with open("filename.txt", "r") as f:
    text = f.read()

sentences = text.splitlines()

As promised, this will give you a list of strings.

But, as mentioned, it’s only the user inputs we are interested in here. Luckily, there is also another defining factor to help us isolate inputs. All inputs, as well as starting on a new line, also start with a pair of brackets. What’s inside these brackets defines the type of input. For example, a text section would be

(text)James Asher

and a checkbox would be

(checkbox)unchecked

Other examples include "(radiobutton)" and "(combobox)"; the majority of your PDF inputs will be one of these four types.
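If it helps to see it concretely, a line like this can be split into its type tag and its value with a small regex (a sketch – `parse_input` is just an illustrative name, not something from a library):

```python
import re

def parse_input(line):
    # Match "(type)value": the type sits inside the brackets,
    # the value is everything after the closing bracket.
    match = re.match(r"^\(([^()]*)\)(.*)$", line)
    return (match.group(1), match.group(2)) if match else None

print(parse_input("(text)James Asher"))    # -> ('text', 'James Asher')
print(parse_input("(checkbox)unchecked"))  # -> ('checkbox', 'unchecked')
```

Lines that don’t start with a bracketed tag simply return None, which is handy for filtering.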

Occasionally, however, random sections or sentences will also begin with brackets, so you can use set(sentences) to double-check. In my example, there were only five input types I wanted to keep, so I used the following list comprehension to remove everything else.

# Keep only the lines that begin with one of the known input prefixes
questions = ["(text", "(button", "(:", "(combobox", "(radiobutton"]
sentences = [x for x in sentences if x.startswith(tuple(questions))]

You will now have a list of all inputs/answers to your questions. As long as you use the same PDF, the structure of this list will stay constant.

We can now simply transfer it to a pandas dataframe, do some manipulation and then output it to whatever format we want.

Not all .txt files output like this from PDFs, but the majority do. If yours don’t, then you’ll have to use regex and look for the constants in your specific document. But once you write the code to extract the data from one document, it will work for all of your documents, as long as they’re homogeneous.
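For instance, if your converted text keeps a constant label next to each answer, a regex like this can pull the value out. This is a sketch under an assumption: the "Name:" and "Date:" labels (and the `value_after` helper) are hypothetical stand-ins for whatever constants your document actually has.

```python
import re

# Hypothetical converted text where each answer follows a constant label
page = "Name: James Asher\nsome surrounding noise\nDate: 01/02/2023"

def value_after(label, text):
    # Capture everything after the label up to the end of that line
    match = re.search(rf"{re.escape(label)}\s*(.*)", text)
    return match.group(1).strip() if match else None

print(value_after("Name:", page))  # -> James Asher
print(value_after("Date:", page))  # -> 01/02/2023
```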

Extracting the data from a list of strings

Extracting the text is easy. In this case, all I needed to do was remove the preceding brackets. That can be done easily with a list comprehension and some regex.

import re
import pandas as pd

# Strip the leading "(type)" tag from each answer
list_strings = [re.sub(r"^\([^()]*\)", "", x) for x in list_strings]
df = pd.DataFrame(list_strings)
df.to_excel("output.xlsx")

And the output is as below.

Output from extracting PDF data with Python

You can then simply run a loop over all your .txt files and merge them together with Pandas. You can then pivot or clean as desired.
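That loop might look something like this – a sketch, assuming your converted files sit in one folder (the "converted_txts" folder name and the `merge_folder` helper are illustrative, and the prefixes are the ones from earlier):

```python
import re
from pathlib import Path
import pandas as pd

QUESTION_PREFIXES = ("(text", "(button", "(:", "(combobox", "(radiobutton")

def extract_answers(raw_text):
    # Keep only the lines that hold form inputs, then strip the "(type)" tag
    lines = [l for l in raw_text.splitlines() if l.startswith(QUESTION_PREFIXES)]
    return [re.sub(r"^\([^()]*\)", "", l) for l in lines]

def merge_folder(folder):
    # One row per file, one column per answer;
    # this assumes every PDF shares the same layout
    rows = {p.stem: extract_answers(p.read_text())
            for p in sorted(Path(folder).glob("*.txt"))}
    return pd.DataFrame.from_dict(rows, orient="index")

# df = merge_folder("converted_txts")  # hypothetical folder name
# df.to_excel("all_answers.xlsx")
```

Because every file becomes one row, pivoting or cleaning afterwards is a normal pandas job.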

You now have a usable Excel (or CSV) file that stores all the data from all of your PDFs. Almost all of this code is reusable; you just have to make sure that any new batch of PDFs converts to a similar layout when turned into .txt files.

Hope this helps.

If I've inspired you to join Medium, I would be really grateful if you did it through this link – it will help support me to write better content in the future.
If you want to learn more about data science, become a certified data scientist, or land a job in data science, then check out 365 Data Science through my affiliate link.

If you enjoyed this then please check out some of my other articles.

How to Easily Run Python Scripts on Website Inputs

How to easily show your Matplotlib plots and Pandas dataframes dynamically on your website.

How to Easily Automate Your Keyboard to do Tasks in Python

Cheers,

James

