
Extracting tabular data from PDFs made easy with Camelot.

Extracting tables from PDFs doesn't have to be hard.

Photo by Denny Müller on Unsplash

Extracting tabular data from PDFs is hard. An even bigger problem is that a lot of open data is available only as PDF files. This open data is crucial for analysis and for getting vital insights, yet accessing it becomes a challenge. For instance, let’s look at an important report released by the [National Agricultural Statistics Service (NASS)](https://www.nass.usda.gov/), which deals with the principal crops planted in the U.S.:

Report Source: https://www.nass.usda.gov/Publications/Todays_Reports/reports/pspl0320.pdf

For any analysis, the starting point would be to get the table with the details and convert it to a format that can be ingested by most of the available tools. As you can see above, a mere copy-paste doesn’t work in this case: most of the time, the headers do not land in the correct place, and some of the numbers are lost. This makes PDFs somewhat tricky to handle, and there is a reason for that. We’ll go over it, but let’s first try to understand what a PDF file actually is.


This article is part of a complete series on finding suitable datasets. Here are all the articles included in the series:

Part 1: Getting Datasets for Data Analysis tasks – Advanced Google Search

Part 2: Useful sites for finding datasets for Data Analysis tasks

Part 3: Creating custom image datasets for Deep Learning projects

Part 4: Import HTML tables into Google Sheets effortlessly

Part 5: Extracting tabular data from PDFs made easy with Camelot.

Part 6: Extracting information from XML files into a Pandas dataframe

Part 7: 5 Real-World datasets for honing your Exploratory Data Analysis skills


Portable Document Format aka PDFs

Source: Adobe PDF file icon

PDF stands for Portable Document Format. It is a file format that was created in the early nineties by Adobe. It is based on the PostScript language and is commonly used to present and share documents. The idea behind the development of PDF was to have a format that makes it possible to view, display, and print documents on any modern printer.

Each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts, vector graphics, raster images, and other information needed to display it [Wikipedia].

A basic PDF file contains the following elements.

Source: Introduction to PDF syntax by Guillaume Endignoux

Why is extracting tables from a PDF hard?

If you look at the PDF layout above, you will notice that it has no concept of a table. A PDF contains instructions to place a character at an (x, y) coordinate on a 2-D plane, retaining no knowledge of words, sentences, or tables.

Source: https://www.pdfscripting.com/public/PDF-Page-Coordinates.cfm
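You can see this for yourself by dumping the raw layout with pdfminer.six, the text-extraction library that Camelot itself builds on (more on that below). A minimal sketch, assuming a local file named report.pdf as a placeholder:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

# A page exposes only positioned objects: characters with (x0, y0, x1, y1)
# bounding boxes, not words, rows, or tables.
for page_layout in extract_pages('report.pdf'):  # 'report.pdf' is a placeholder
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for char in text_line:
                    if isinstance(char, LTChar):
                        print(char.get_text(), char.bbox)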

So how does a PDF differentiate between words and sentences? Words are simulated by placing characters close together, while sentences are simulated by placing words relatively farther apart. The following diagram cements this concept more concretely:

Source: http://www.unixuser.org/~euske/python/pdfminer/index.html

In the figure above, M denotes the distance between two characters, while W refers to the space between two words. Any chunk of characters separated by less than M is grouped into a single word. Tables, in turn, are simulated by placing words where they would appear in a spreadsheet, with no information about what constitutes a row or a column. This is what makes extracting data from PDFs for analysis purposes so challenging.

However, a lot of open data in the form of government reports, documentation, etc., is released as PDFs, so a tool that can extract this information without compromising its quality is the need of the hour. This brings us to a versatile library called Camelot, which was created to extract tabular information from PDFs.


Camelot: PDF table extraction for Humans

Source: https://camelot-py.readthedocs.io/en/master/

Camelot, which derives its name from the famous Camelot Project, is an open-source Python library that can help you extract tables from PDFs easily. It has been built on top of pdfminer, another text extraction tool for PDF documents.

It comes packaged with a lot of useful features like:

  • Configurability – works well for most cases out of the box but can also be configured
  • Visual debugging using the Matplotlib library
  • Output in multiple formats, including CSV, JSON, Excel, HTML, and even a Pandas dataframe
  • Open source – MIT licensed
  • Detailed documentation

Installation

You can install Camelot via conda, pip, or directly from the source. If you go for pip, do not forget to install the following dependencies: Tkinter and Ghostscript.

# conda (the easiest way)
$ conda install -c conda-forge camelot-py
# pip, after installing the Tkinter and Ghostscript dependencies
$ pip install "camelot-py[cv]"
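To confirm the installation went through, a quick sanity check (this assumes nothing beyond a working Python environment):

import camelot

print(camelot.__version__)  # prints the installed Camelot version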

Working

Before we get to the usage, it is a good idea to understand what goes on under the hood. Camelot uses two parsing methods to extract tables:

  • Stream: looks for whitespace between words to identify a table.
  • Lattice: looks for lines on the page to identify a table. Lattice is used by default (see the sketch below).
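Both parsers are selected through the flavor argument of read_pdf; a minimal sketch with a placeholder file name:

import camelot

# Lattice is the default flavor: it relies on ruling lines drawn in the PDF
tables = camelot.read_pdf('report.pdf')  # 'report.pdf' is a placeholder

# Stream relies on whitespace instead, useful for tables without ruling lines
tables = camelot.read_pdf('report.pdf', flavor='stream')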

You can read more about Camelot’s working in the documentation.

Usage

Let’s now get to the exciting part: extracting tables from PDFs. Here I am using a PDF containing information about the number of beneficiaries under the Adult Education Programme 2015–16 in India. The PDF looks like this:

Source: https://www.mhrd.gov.in/sites/upload_files/mhrd/files/statistics-new/ESAG-2018.pdf

We’ll start by importing the library and reading in the PDF file as follows:

import camelot

# read in the PDF; by default Camelot parses only the first page
tables = camelot.read_pdf('schools.pdf')

We get a [Table](https://camelot-py.readthedocs.io/en/master/api.html#camelot.core.Table)List object, which is a list of Table objects.

tables
--------------
<TableList n=2>

We can see that two tables have been detected, and each can be easily accessed through its index. Let’s access the second table, i.e., the one containing more information, and look at its shape:

tables[1]  # indexing starts from 0
<Table shape=(28, 4)>

Next, let’s print the parsing report, which is an indication of the extraction quality of the table:

tables[1].parsing_report
{'accuracy': 100.0, 'whitespace': 44.64, 'order': 2, 'page': 1}

It shows a whopping 100% accuracy, which means the extraction is perfect. We can also access the table’s dataframe as follows:

tables[1].df.head()
PDF extracted into a Dataframe | Image by Author

All the detected tables can also be exported to CSV files as follows:

tables.export('table.csv')
PDF table exported as CSV | Image by Author
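The same table can also be written out in the other supported formats; a quick sketch, with arbitrary file names:

tables[1].to_json('table.json')    # single table as JSON
tables[1].to_excel('table.xlsx')   # single table as an Excel sheet
tables[1].to_html('table.html')    # single table as an HTML table

# or export every detected table at once, zipped into a single archive
tables.export('tables.json', f='json', compress=True)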

Visual debugging

Additionally, you can plot the elements found on the PDF page based on the kind specified, such as 'text', 'grid', 'contour', 'line', or 'joint'. These plots are useful for debugging and for playing with different parameters to get the best output.

camelot.plot(tables[1], kind='text')  # kind can be 'text', 'grid', 'contour', 'line', or 'joint'
Plot of elements found on the PDF page | Image by Author

Advanced usage and CLI

Camelot comes with a command-line interface too. It also comes equipped with a bunch of advanced features like:

  • Reading encrypted PDFs
  • Reading rotated PDFs
  • Tweaking parameters when the default result is not perfect: you can specify exact table boundaries, column separators, and much more, as shown in the sketch below. You can read more about these features here.
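A minimal sketch of what these options look like, using Camelot’s documented keyword arguments (the file names, password, and coordinates below are placeholders):

import camelot

# read an encrypted PDF by supplying its password
tables = camelot.read_pdf('encrypted.pdf', password='userpass')

# when the default result is off, restrict parsing to an exact table region
# (table_areas takes 'x1,y1,x2,y2' strings in PDF coordinate space) and tell
# the Stream parser where the column separators lie
tables = camelot.read_pdf(
    'report.pdf',
    flavor='stream',
    table_areas=['316,499,566,337'],
    columns=['72,95,209,327,442,529'],
)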

Drawback

Camelot only works with text-based PDFs and not scanned documents. So if you have scanned documents, you’ll have to look at some other alternatives.


⚔️ Excalibur – The Web Interface: The icing on the cake

Another interesting feature of Camelot is that it also has a web interface called Excalibur for people who do not want to code but still want to use the library’s features. Let’s quickly see how to use it.

Installation

After installing Ghostscript, use pip to install Excalibur:

$ pip install excalibur-py 

And then start the web server using:

$ excalibur webserver

You can then navigate to localhost to access the interface. The whole process is demonstrated in the video below:


Conclusion

In this article, we looked at Camelot, an open-source Python library that is pretty helpful for extracting tabular data from PDFs. The fact that many of its parameters can be adjusted makes it applicable in a lot of situations. The web interface offers an excellent alternative for people looking for a code-free environment. Overall, it is a useful tool that can help reduce the time generally spent on data extraction.


References

This article was inspired by a talk given by Vinayak Mehta, the creator and maintainer of Camelot, at PyCon India 2019. Some of the resources have been taken from the slide deck shared publicly after the event. It is highly recommended to also watch the presentation for more clarity.

