
Background
Data science professionals deal with data in all shapes and forms. Data could be stored in popular SQL databases, such as PostgreSQL and MySQL, or in an old-fashioned Excel spreadsheet. Sometimes, data might also be saved in an unconventional format, such as PDF. In this article, I am going to talk about how to scrape data from PDF files using the Python library tabula-py.
Required Libraries
- tabula-py: to scrape text from PDF files
- re: to extract data using regular expressions
- Pandas: to construct and manipulate our panel data
Install Libraries
pip install tabula-py
pip install pandas
Note that re is part of the Python standard library, so it does not need to be installed. Also, tabula-py is a wrapper around tabula-java, so a Java runtime must be installed on your machine as well.
Import Libraries
import tabula as tb
import pandas as pd
import re
Scrape PDF Data in Structured Form
First, let’s talk about scraping PDF data in a structured format. In the following example, we want to scrape the table in the bottom left corner of the page. It is nicely structured tabular data, in which rows and columns are well defined.

Scraping PDF data in structured form is straightforward using tabula-py. We just need to input the location of the tabular data in the PDF page by specifying the (top, left, bottom, right) coordinates of the area. In practice, you will learn what values to use by trial and error. If the PDF page only includes the target table, then we don’t even need to specify the area; tabula-py should be able to detect the rows and columns automatically.
file = 'state_population.pdf'
# read_pdf returns a list of DataFrames, one per table detected in the area
data = tb.read_pdf(file, area = (300, 0, 600, 800), pages = '1')
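Since tb.read_pdf returns a list of DataFrames, a quick sanity check is to count how many tables were detected and preview the first one. This is a minimal sketch using the variables defined above:
print(len(data))          # number of tables detected in the specified area
print(data[0].head())     # preview the first table as a pandas DataFrame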
Scrape PDF Data in Unstructured Form
Next, we will explore something more interesting: PDF data in an unstructured format.
To perform statistical analysis, data visualization, and machine learning modeling, we need the data in tabular form (panel data). However, a lot of data is only available in an unstructured format. For example, HR staff are likely to keep historical payroll data, which might not have been created in tabular form. In the following picture, we have an example of payroll data with mixed data structures. The left section has data in long format, including employee name, net amount, pay date, and pay period. The right section has pay category, pay rate, hours, and pay amount.

There are a few steps we need to take to transform the data into panel format.
- Step 1: Import PDF data as a DataFrame
As with data in a structured format, we use tb.read_pdf to import the unstructured data. This time, we need to specify extra options to import the data properly.
file = 'payroll_sample.pdf'
# stream=True splits on whitespace at the given column boundaries;
# read_pdf returns a list, so [0] selects the first table on the page
df = tb.read_pdf(file, pages = '1', area = (0, 0, 300, 400), columns = [200, 265, 300, 320], pandas_options={'header': None}, stream=True)[0]
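Because the right area and columns values come from trial and error, it helps to inspect the raw parse after each attempt. A minimal sketch, again using the df defined above:
print(df.shape)           # how many rows and columns the current settings produce
print(df.to_string())     # the full raw parse, useful for adjusting the coordinates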
- Area and Columns: I’ve talked about area above. Here we also need to use columns to identify the locations of all relevant columns. Like area, the values of columns are determined by trial and error.
- Stream and Lattice: If there are grid lines separating the cells, we can use lattice = True to identify each cell automatically. If not, we can use stream = True together with columns to specify each cell manually. Stream mode looks for whitespace between columns. These options can make a huge impact on the result, so it is worth experimenting with both and seeing which one improves the overall scraping (a short comparison sketch follows this list).
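For illustration, here is a minimal side-by-side sketch of the two modes; the file name and column boundaries are placeholders, not values from the payroll example:
# lattice mode: relies on ruling lines drawn between cells
df_lattice = tb.read_pdf('sample.pdf', pages = '1', lattice = True)[0]
# stream mode: relies on whitespace, with column boundaries supplied manually
df_stream = tb.read_pdf('sample.pdf', pages = '1', stream = True, columns = [100, 200, 300])[0]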

- Step 2: Create a Row Identifier
Now that we have some data to work with, we will use the Python library Pandas to manipulate the DataFrame.
First, we need to create a new column that identifies unique rows. We notice that the employee names (Superman and Batman) are useful for marking the border between different records. Each employee name follows a distinctive pattern: it starts with a capital letter and ends with a lower-case letter. We can use the regular expression '^[A-Z].*[a-z]$' to identify an employee name, then use the Pandas function cumsum (cumulative sum) to create a row identifier.
# flag rows whose first column matches the employee-name pattern
df['border'] = df.apply(lambda x: 1 if re.findall('^[A-Z].*[a-z]$', str(x[0])) else 0, axis = 1)
# the running total of the flags gives every line of a record the same row id
df['row'] = df['border'].transform('cumsum')
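As a side note, the same flag can be computed without apply by using Pandas’ vectorized string matching; this alternative is my own sketch, not part of the original workflow:
# str.match applies the same regular expression to the whole column at once
df['border'] = df[0].astype(str).str.match('^[A-Z].*[a-z]$').astype(int)
df['row'] = df['border'].cumsum()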

- Step 3: Reshape the data (convert data from long form to wide form)
Next, we will reshape the data in both the left and right sections. For the left section, we create a new DataFrame, employee, which includes employee_name, net_amount, pay_date, and pay_period. For the right section, we create another DataFrame, payment, which includes OT_Rate, Regular_Rate, OT_Hours, Regular_Hours, OT_Amt, and Regular_Amt. To convert the data to wide form, we can use the Pandas function pivot.
# reshape left section
employee = df[[0, 'row']]
employee = employee[employee[0].notnull()]
# number the lines within each record so pivot can spread them into columns
employee['index'] = employee.groupby('row').cumcount()+1
employee = employee.pivot(index = ['row'], columns = ['index'], values = 0).reset_index()
employee = employee.rename(columns = {1: 'employee_name', 2: 'net_amount', 3: 'pay_date', 4: 'pay_period'})
# strip the 'Net' label so only the amount remains
employee['net_amount'] = employee.apply(lambda x: x['net_amount'].replace('Net', '').strip(), axis = 1)
# reshape right section
payment = df[[1, 2, 3, 4, 'row']]
payment = payment[payment[1].notnull()]
payment = payment[payment['row']!=0]
# pivot on the pay category so each record collapses into a single row
payment = payment.pivot(index = ['row'], columns = 1, values = [2, 3, 4]).reset_index()
# flatten the resulting MultiIndex column names, e.g. (2, 'Regular') -> '2Regular'
payment.columns = [str(col[0])+col[1] for col in payment.columns.values]
for i in ['Regular', 'OT']:
    payment = payment.rename(columns = {f'2{i}': f'{i}_Rate', f'3{i}': f'{i}_Hours', f'4{i}': f'{i}_Amt'})
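Before joining the two sections, it is worth confirming that each pivot produced one row per record. A small check using the DataFrames built above:
print(employee.head())    # one row per employee record from the left section
print(payment.head())     # one row per employee record from the right section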


- Step 4: Join the data in the left section with the data in the right section
Lastly, we use the Pandas function merge to join the employee and payment DataFrames on the row identifier.
# an inner join keeps only the records that appear in both sections
df_clean = employee.merge(payment, on = ['row'], how = 'inner')
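Once the join succeeds, the panel data can be saved like any other DataFrame. A minimal sketch; payroll_clean.csv is a hypothetical output name:
# drop the helper column and persist the panel data
df_clean = df_clean.drop(columns = ['row'])
df_clean.to_csv('payroll_clean.csv', index = False)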

Final Note
As of today, many companies still process PDF data manually. With the help of Python libraries, we can save time and money by automating the process of scraping data from PDF files and converting unstructured data into panel data.
Please keep in mind that when scraping data from PDF files, you should always carefully read the terms and conditions posted by the author and make sure you have permission to do so.
If you would like to explore more PDF automation tools, please check out my articles:
- Scrape Data from PDF Files Using Python and PDFQuery
- Scrape Data from PDF Files Using Python and tabula-py
- How to Convert Scanned Files to Searchable PDF Using Python and Pytesseract
- Extract PDF Text While Preserving Whitespaces Using Python and Pytesseract
- How to Edit PDF Hyperlinks using Python and pdfrw
- How to Rotate PDF Pages using Python and pdfrw
Thank you for reading!