How to Extract Tables From PDF Files and Convert Them to Pandas Data Frames

So you have some PDF files with tables in them and want to read those tables into pandas data frames. Let me show you how.

Photo by Johannes Groll on Unsplash

Setup

For the purposes of this article, we will extract tables from the housing statistics document published by Homes England on the 2nd of December. A copy of the PDF file can be found here.

We will be using the tabula-py library to extract our tables from the PDF file. Install it by running: pip install tabula-py

Make sure you have Java installed on your system. Refer to the library's docs if you run into any installation errors.
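
If you are unsure whether Java is available, one quick way to check is to look for a `java` executable on your PATH. This is only a minimal sketch: the helper name `java_on_path` is made up for illustration, and it checks the PATH only, so Java installed in a location that is not on the PATH will not be detected.

```python
import shutil


def java_on_path() -> bool:
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


if java_on_path():
    print("Found a java executable on the PATH.")
else:
    print("No java executable on the PATH; install Java before using tabula-py.")
```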

OK, we are all set for extraction! 😎


Tabula: Extract PDF Tables to Data Frames

Assuming the PDF file of interest is in the current working directory, let’s extract the tables from it. All we have to do is the following:

Python code to read the tables from the pdf file using Tabula. (source: author)

As you can see, the code is minimal and self-explanatory. It returns a list of pandas data frames, one for each table extracted.
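
For readers who cannot see the original snippet, here is a minimal sketch of what such a call might look like, using `tabula.read_pdf` with `pages="all"`. The wrapper function `extract_tables`, its `reader` parameter, and the file name in the usage comment are assumptions made for illustration, not the author's exact code. The `tabula` import is done lazily so the function can be defined and tried with a stand-in reader on a machine without tabula-py or Java.

```python
from typing import Any, Callable, List, Optional


def extract_tables(
    pdf_path: str,
    reader: Optional[Callable[..., List[Any]]] = None,
) -> List[Any]:
    """Read every table in the PDF and return them as a list of data frames."""
    if reader is None:
        # Imported here so that tabula-py (and Java) are only needed
        # when a real PDF is actually read.
        import tabula

        reader = tabula.read_pdf
    # pages="all" scans the whole document; multiple_tables=True asks for
    # one data frame per detected table.
    return reader(pdf_path, pages="all", multiple_tables=True)


# Usage (the file name is hypothetical):
# tables = extract_tables("housing_statistics.pdf")
# print(len(tables))
```

In everyday use you would probably call `tabula.read_pdf` directly; the wrapper only exists to keep the sketch testable.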

You can quickly check the number of tables extracted by running len(tables), which should return 9 for this example. That matches the PDF file used for this article, which contains 9 tables in total.

Now, all we have to do is index into the list to get each table as a data frame. For example, tables[0] should return the first table and tables[1] should return the second table.
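
To get a quick overview before inspecting each table individually, you could loop over the list and print the shape of every data frame. The sketch below uses two small stand-in data frames (made-up data) so that it runs without a PDF; in practice `tables` would be the list returned by Tabula. The helper name `summarize_tables` is hypothetical.

```python
import pandas as pd


def summarize_tables(tables):
    """Return (index, n_rows, n_cols) for each data frame in the list."""
    return [(i, df.shape[0], df.shape[1]) for i, df in enumerate(tables)]


# Stand-in data frames (made-up data) in place of the real extraction result.
tables = [
    pd.DataFrame({"a": [1, 2, 3]}),
    pd.DataFrame({"x": [1], "y": [2]}),
]

for index, n_rows, n_cols in summarize_tables(tables):
    print(f"tables[{index}]: {n_rows} rows x {n_cols} columns")
```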

The extracted first table from the pdf file using tables[0]. (source: author)
The actual version of the extracted first table from the pdf file. (source: author)
The extracted second table from the pdf file using tables[1]. (source: author)
The actual version of the extracted second table from the pdf file. (source: author)

Keep in Mind

As you can see from the two examples above, the Tabula library does an excellent job of extracting tables from PDFs. However, the output is not always clean and precise. Sometimes we will have to do some manual cleaning to:

  • correct the headers of the tables
  • remove unnecessary rows and columns
  • split columns that are merged together.

These issues are most common in tables with nested headers and are usually easy to fix. 😄
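
The three cleaning steps listed above can be sketched with plain pandas operations. The raw data frame below is made up to mimic a messy extraction (the header sitting in the first data row, an empty column, and two values merged into one column); it is not taken from the Homes England document, and the column names and the `clean_table` helper are hypothetical.

```python
import pandas as pd

# Made-up example of a messy extraction result.
raw = pd.DataFrame(
    {
        "Unnamed: 0": ["Region", "North", "South"],
        "Unnamed: 1": ["Starts Completions", "120 95", "80 60"],
        "Unnamed: 2": [None, None, None],
    }
)


def clean_table(df: pd.DataFrame) -> pd.DataFrame:
    # Remove columns that are entirely empty, then drop the first row,
    # which only holds the misplaced header text.
    df = df.dropna(axis="columns", how="all").iloc[1:].reset_index(drop=True)
    # Correct the headers.
    df.columns = ["region", "starts_completions"]
    # Split the merged column into two numeric columns.
    parts = df["starts_completions"].str.split(" ", expand=True)
    df["starts"] = parts[0].astype(int)
    df["completions"] = parts[1].astype(int)
    return df.drop(columns=["starts_completions"])


cleaned = clean_table(raw)
print(cleaned)
```

The exact steps depend on how a particular table was extracted, so treat this as a starting point rather than a recipe.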


Final Thoughts

In this article, we saw how easy it is to extract tables from PDF files and load them as pandas data frames using the Tabula library. The library does a great job of extracting the tables, but we should always compare the results against the original PDF to catch inconsistencies. Most of the time, any inconsistency is easy to fix.

For completeness, it’s worth mentioning another library for PDF table extraction: Camelot. Although not covered here, it’s a great alternative to Tabula. I have no strong preference between the two, as both do a great job.

Now that you have your tables as data frames, feel free to manipulate them to your heart’s content. 😄
