
How to Scrape and Extract Data from PDFs Using Python and tabula-py

You want to make friends with tabula-py and Pandas

Image by Author

Background

Data science professionals deal with data in all shapes and forms. Data could be stored in popular SQL databases, such as PostgreSQL or MySQL, or in an old-fashioned Excel spreadsheet. Sometimes, data might also be saved in an unconventional format, such as PDF. In this article, I am going to talk about how to scrape data from PDFs using the Python library tabula-py.

Required Libraries

  • tabula-py: to scrape text from PDF files
  • re: to extract data using regular expressions
  • Pandas: to construct and manipulate our panel data

Install Libraries

pip install tabula-py
pip install pandas
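
Note that tabula-py is a Python wrapper around tabula-java, so you will also need a Java runtime installed on your machine for it to work.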

Import Libraries

import tabula as tb
import pandas as pd
import re

Scrape PDF Data in Structured Form

First, let’s talk about scraping PDF data in a structured format. In the following example, we want to scrape the table in the bottom left corner. It is nicely structured tabular data, in which rows and columns are well defined.

Sample Structured Data (Source)

Scraping PDF data in structured form is straightforward with tabula-py. We just need to tell it where the tabular data sits on the PDF page by specifying the (top, left, bottom, right) coordinates of the area, measured in points. In practice, you will find the right values by trial and error. If the PDF page contains only the target table, we don’t even need to specify the area; tabula-py should be able to detect the rows and columns automatically.

file = 'state_population.pdf'
# read_pdf returns a list of DataFrames, one per table it finds
data = tb.read_pdf(file, area = (300, 0, 600, 800), pages = '1')
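
If the table is the only content on the page, the area argument can be dropped entirely and tabula-py will locate the table on its own. A minimal sketch of that case:

# let tabula-py auto-detect the table boundaries on page 1
data = tb.read_pdf('state_population.pdf', pages = '1')
print(data[0].head())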

Scrape PDF Data in Unstructured Form

Next, we will explore something more interesting: PDF data in an unstructured format.

To perform statistical analysis, data visualization, and machine learning, we need the data in tabular form (panel data). However, a lot of data is only available in an unstructured format. For example, HR staff are likely to keep historical payroll data, which might not be created in tabular form. In the following picture, we have an example of payroll data with mixed data structures. The left section holds data in long format, including employee name, net amount, pay date, and pay period. The right section holds pay category, pay rate, hours, and pay amount.

(Created by Author)

There are a few steps we need to take to transform the data into panel format.

  • Step 1: Import PDF data as a DataFrame

As with data in a structured format, we use tb.read_pdf to import the unstructured data. This time, we need to specify extra options to import the data properly.

file = 'payroll_sample.pdf'
# read_pdf returns a list of DataFrames; [0] selects the first table
df = tb.read_pdf(file, pages = '1', area = (0, 0, 300, 400), columns = [200, 265, 300, 320], pandas_options={'header': None}, stream=True)[0]
  • Area and Columns: I’ve talked about area above. Here we also need to use columns to identify the x-coordinates of the boundaries between the relevant columns. Like area, the values of columns are found by trial and error.
  • Stream and Lattice: If there are grid lines separating each cell, we can use lattice = True to identify each cell automatically. If not, we can use stream = True together with columns to specify each cell manually; stream mode looks for whitespace between columns. These options can make a huge difference, so experiment with lattice or stream and see which one improves the overall scraping, as in the sketch after the figure below.
(Created by Author)
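
To see the difference between the two modes in practice, here is a minimal sketch, reusing the file and column coordinates from the example above:

# lattice mode: relies on ruling lines drawn between cells
tables_lattice = tb.read_pdf(file, pages = '1', lattice = True)

# stream mode: infers cells from whitespace, guided by explicit column boundaries
tables_stream = tb.read_pdf(file, pages = '1', stream = True, columns = [200, 265, 300, 320])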
  • Step 2: Create a Row Identifier

Now that we have some data to work with, we will use the Python library Pandas to manipulate the DataFrame.

First, we need to create a new column that can identify unique rows. We notice that the employee names (Superman and Batman) mark the border between different records. Each employee name follows a distinctive pattern: it starts with a capital letter and ends with a lower-case letter. We can use the regular expression '^[A-Z].*[a-z]$' to identify employee names, then use the Pandas function cumsum (cumulative sum) to create a row identifier.

# flag rows whose first column matches the employee-name pattern
df['border'] = df.apply(lambda x: 1 if re.findall('^[A-Z].*[a-z]$', str(x[0])) else 0, axis = 1)
# the cumulative sum of the flags assigns one id per employee record
df['row'] = df['border'].transform('cumsum')
(Created by Author)
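
To convince yourself that the pattern only fires on name-like strings, here is a quick toy check with made-up sample values:

for s in ['Superman', 'Net 1,000.00', '01/01/2021']:
    print(s, bool(re.findall('^[A-Z].*[a-z]$', s)))
# only 'Superman' starts with a capital and ends with a lower-case letter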
  • Step 3: Reshape the data (convert data from long form to wide form)

Next, we will reshape the data in both the left section and the right section. For the left section, we create a new DataFrame, employee, that includes employee_name, net_amount, pay_date, and pay_period. For the right section, we create another DataFrame, payment, that includes OT_Rate, Regular_Rate, OT_Hours, Regular_Hours, OT_Amt, and Regular_Amt. To convert the data into wide form, we can use the Pandas function pivot.

# reshape left section
employee = df[[0, 'row']]
employee = employee[employee[0].notnull()]
# number the lines within each record so pivot can spread them into columns
employee['index'] = employee.groupby('row').cumcount()+1
employee = employee.pivot(index = ['row'], columns = ['index'], values = 0).reset_index()
employee = employee.rename(columns = {1: 'employee_name', 2: 'net_amount', 3: 'pay_date', 4: 'pay_period'})
# strip the 'Net' prefix from the amount strings
employee['net_amount'] = employee.apply(lambda x: x['net_amount'].replace('Net', '').strip(), axis = 1)
# reshape right section
payment = df[[1, 2, 3, 4, 'row']]
payment = payment[payment[1].notnull()]
payment = payment[payment['row']!=0]
payment = payment.pivot(index = ['row'], columns = 1, values = [2, 3, 4]).reset_index()
# flatten the MultiIndex columns produced by pivoting multiple value columns
payment.columns = [str(col[0])+col[1] for col in payment.columns.values]
for i in ['Regular', 'OT']:
    payment = payment.rename(columns = {f'2{i}': f'{i}_Rate', f'3{i}': f'{i}_Hours', f'4{i}': f'{i}_Amt'})
(Created by Author)
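
If pivot is unfamiliar, here is a toy long-to-wide example with made-up values, independent of the payroll data:

long = pd.DataFrame({'row': [1, 1], 'category': ['Regular', 'OT'], 'rate': [20.0, 30.0]})
wide = long.pivot(index = 'row', columns = 'category', values = 'rate')
# category    OT  Regular
# row
# 1         30.0     20.0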
  • Step 4: Join the data in the left section with the data in the right section

Lastly, we use the merge function to join the employee and payment DataFrames based on the row identifier.

df_clean = employee.merge(payment, on = ['row'], how = 'inner')
(Created by Author)
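
Since each row identifier should appear exactly once in each DataFrame, you can optionally have Pandas verify that assumption; this check is a sketch on top of the original pipeline, not part of it:

# raise an error if the merge keys are not one-to-one on both sides
df_clean = employee.merge(payment, on = ['row'], how = 'inner', validate = 'one_to_one')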

Final Note

Even today, many companies still process PDF data manually. With the help of Python libraries, we can save time and money by automating the scraping of data from PDF files and converting the unstructured output into panel data.

Please keep in mind that when scraping data from PDF files, you should always carefully read the terms and conditions posted by the author and make sure you have permission to do so.


If you would like to explore more PDF automation tools, please check out my other articles.

Thank you for reading!

If you enjoyed this article, please click the Clap icon. If you would like to see more articles from me and thousands of other writers on Medium, you can:

  • Subscribe to my newsletter to get an email notification whenever I post a new article.
  • Sign up for a membership to unlock full access to everything on Medium.
