
Extracting tabular data from PDFs is hard. An even bigger problem is that a lot of open data is published as PDF files. This open data is crucial for analysis and for gaining vital insights, yet accessing it becomes a challenge. For instance, let’s look at an important report released by the National Agricultural Statistics Service (NASS), which deals with the principal crops planted in the U.S.:

For any analysis, the starting point would be to get the table with details and convert it to a format that can be ingested by most of the available tools. As you can see above, a mere copy-paste doesn’t work in this case. Most of the time, the headers end up in the wrong place and some of the numbers are lost. This makes PDFs somewhat tricky to handle, and there is a reason for that. We’ll go over it, but let’s first try to understand the concept of a PDF file.
This article is part of a complete series on finding suitable datasets. Here are all the articles included in the series:
Part 1: Getting Datasets for Data Analysis tasks – Advanced Google Search
Part 2: Useful sites for finding datasets for Data Analysis tasks
Part 3: Creating custom image datasets for Deep Learning projects
Part 4: Import HTML tables into Google Sheets effortlessly
Part 5: Extracting tabular data from PDFs made easy with Camelot.
Part 6: Extracting information from XML files into a Pandas dataframe
Part 7: 5 Real-World datasets for honing your Exploratory Data Analysis skills
Portable Document Format aka PDFs
PDF stands for Portable Document Format. It is a file format that was created in the early nineties by Adobe. It is based on the PostScript language and is commonly used to present and share documents. The idea behind the development of PDF was to have a format that makes it possible to view, display, and print documents on any modern printer.
Each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts, vector graphics, raster images, and other information needed to display it [Wikipedia].
A basic PDF file contains the following elements.

Why is extracting tables from PDFs hard?
If you look at the PDF layout above, you will notice no concept of tables in it. A PDF contains instructions to place a character at an x,y coordinate on a 2-D plane, retaining no knowledge of words, sentences, or tables.

So how does a PDF differentiate between words and sentences? Words are simulated by placing characters close together, while sentences are simulated by placing words relatively farther apart. The following diagram makes this concept more concrete:

M denotes the distance between two characters in the above figure, while W refers to the space between two words. Any text chunk with space < M is grouped into one. Tables are simulated by putting words as they appear in a spreadsheet without any information on what a row or a column is. Hence, this makes it challenging to extract data from PDFs for analysis purposes. However, a lot of open data in the form of government reports, documentation, etc., is released in the form of PDFs. A tool that can extract information without compromising on its quality is the need of the hour. This point brings us to a versatile library called Camelot, which has been created to extract tabular information from PDFs.
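The grouping rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how pdfminer or Camelot implement it: the `Char` class, the `layout` helper, and all the widths and gaps are hypothetical values chosen for the example, with the threshold picked to sit between the character gap (M) and the word gap (W).

```python
from dataclasses import dataclass


@dataclass
class Char:
    """A single glyph with the x-coordinates of its left and right edges."""
    text: str
    x0: float
    x1: float


def group_chars_into_words(chars, max_gap):
    """Join characters into words; a gap wider than max_gap starts a new word."""
    words = []
    current = ""
    prev_x1 = None
    for ch in sorted(chars, key=lambda c: c.x0):
        if prev_x1 is not None and ch.x0 - prev_x1 > max_gap:
            words.append(current)
            current = ""
        current += ch.text
        prev_x1 = ch.x1
    if current:
        words.append(current)
    return words


def layout(text, char_width=5.0, char_gap=1.0, word_gap=4.0):
    """Hypothetical helper: place the characters of `text` on one line,
    the way a PDF would, keeping coordinates but no word boundaries."""
    chars = []
    x = 0.0
    for word in text.split():
        for letter in word:
            chars.append(Char(letter, x, x + char_width))
            x += char_width + char_gap
        x += word_gap - char_gap  # widen the gap after the last letter of a word
    return chars


line = layout("Area planted 2020")
print(group_chars_into_words(line, max_gap=2.0))
```

With a threshold between the two gap sizes the words come back intact; with a threshold larger than the word gap, everything collapses into a single chunk, which is exactly the kind of ambiguity that makes table extraction fragile.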
Camelot: PDF table extraction for Humans

Camelot, which derives its name from the famous Camelot Project, is an open-source Python library that can help you extract tables from PDFs easily. It has been built on top of pdfminer, another text extraction tool for PDF documents.
It comes packaged with a lot of useful features like:
- Configurability – works well out of the box in most cases but can also be tuned
- Visual Debugging using the Matplotlib library
- Output is available in multiple formats, including CSV, JSON, Excel, HTML, and even a Pandas DataFrame.
- Open source – MIT licensed
- Detailed documentation
Installation
You can install Camelot via conda, pip, or directly from the source. If you go for pip, do not forget to install the following dependencies: Tkinter and Ghostscript.
#conda (easiest way)
$ conda install -c conda-forge camelot-py
#pip after installing the tk and ghostscript dependencies
$ pip install "camelot-py[cv]"
Working
Before we start using the library, it is a good idea to understand what goes on under the hood. Camelot uses two parsing methods to extract tables:
- Stream: looks for whitespaces between words to identify a table.
- Lattice: Looks for lines on a page to identify a table. Lattice is used by default.
You can read more about Camelot’s working in the documentation.
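As a rough sketch of how the choice between the two methods looks in code, the parsing method can be selected through the `flavor` argument of `camelot.read_pdf`. The wrapper function `read_tables` and the file name in the usage comment are hypothetical; the wrapper only accepts the two flavors described above and imports Camelot lazily, so the snippet can be loaded even where the library is not installed.

```python
def read_tables(pdf_path, flavor="lattice", pages="1"):
    """Read tables from a PDF with the chosen Camelot parsing method.

    Hypothetical convenience wrapper, not part of Camelot itself.
    """
    if flavor not in ("lattice", "stream"):
        raise ValueError("flavor must be 'lattice' or 'stream', got %r" % flavor)

    # Imported here so the function can be defined without Camelot installed.
    import camelot

    # flavor='lattice' relies on ruling lines, flavor='stream' on whitespace.
    return camelot.read_pdf(pdf_path, flavor=flavor, pages=pages)


# Example usage (assumes a local file named report.pdf):
# tables = read_tables("report.pdf", flavor="stream")
```

A reasonable workflow is to try the default Lattice method first and fall back to Stream when the table has no visible ruling lines.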
Usage
Let’s now get to the exciting part: extracting tables from PDFs. Here I am using a PDF containing information about the number of beneficiaries under the Adult Education Programme 2015–16 in India. The PDF looks like this:

We’ll start by importing the library and reading in the PDF file as follows:
import camelot
tables = camelot.read_pdf('schools.pdf')
We get a TableList object, which is a list of [Table](https://camelot-py.readthedocs.io/en/master/api.html#camelot.core.Table) objects.
tables
--------------
<TableList n=2>
We can see that two tables have been detected, and each can be accessed through its index. Let’s access the second table, i.e., the one containing more information, and look at its shape:
tables[1] #indexing starts from 0
<Table shape=(28, 4)>
Next, let’s print the parsing report, which is an indication of the extraction quality of the table:
tables[1].parsing_report
{'accuracy': 100.0, 'whitespace': 44.64, 'order': 2, 'page': 1}
It shows a whopping 100% accuracy, which means the extraction is perfect. We can also access the table’s dataframe as follows:
tables[1].df.head()
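When a PDF yields many tables, the parsing report shown earlier can also be checked programmatically instead of one table at a time. The sketch below assumes only that each table exposes a `parsing_report` dictionary with `accuracy` and `whitespace` keys, as in the output above. The function name, the thresholds, and the first stand-in report are made up for illustration; the stand-ins replace real Camelot tables so the snippet runs on its own.

```python
from types import SimpleNamespace


def low_quality_tables(tables, min_accuracy=90.0, max_whitespace=60.0):
    """Return (index, report) pairs for tables that deserve a manual look.

    The thresholds are arbitrary defaults for this example, not values
    recommended by Camelot.
    """
    flagged = []
    for index, table in enumerate(tables):
        report = table.parsing_report
        if report["accuracy"] < min_accuracy or report["whitespace"] > max_whitespace:
            flagged.append((index, report))
    return flagged


# Stand-ins for Camelot Table objects; the first report is invented,
# the second mirrors the one printed above.
demo_tables = [
    SimpleNamespace(parsing_report={"accuracy": 62.5, "whitespace": 71.0, "order": 1, "page": 1}),
    SimpleNamespace(parsing_report={"accuracy": 100.0, "whitespace": 44.64, "order": 2, "page": 1}),
]

for index, report in low_quality_tables(demo_tables):
    print(index, report["accuracy"], report["whitespace"])
```

With a real TableList, the same function could be called as `low_quality_tables(tables)`.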

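The extracted tables can also be exported to CSV as follows. Note that export writes one file per table in the list, adding a page and table suffix to the file name; a single table can instead be saved with tables[1].to_csv('table.csv'):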
tables.export('table.csv')

Visual debugging
Additionally, you can plot the elements found on the PDF page based on the kind specified, such as 'text', 'grid', 'contour', 'line', 'joint', etc. These plots are useful for debugging and for experimenting with different parameters to get the best output.
camelot.plot(tables[1], kind='grid').show()  # kind can also be 'text', 'contour', 'line', or 'joint'

Advanced usage and CLI
Camelot comes with a command-line interface too. It also comes equipped with a bunch of advanced features like:
- Reading encrypted PDFs
- Reading rotated PDFs
- Tweaking parameters when the default result is not perfect: specifying exact table boundaries, specifying column separators, and much more. You can read more about these features in the Camelot documentation.
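To make the options above a little more concrete, here is a minimal sketch of how they might be passed to `camelot.read_pdf`. It assumes the keyword names `password`, `table_areas`, and `columns` used by recent Camelot releases (older versions named some of these differently), and that `columns` is only meaningful with the Stream method. The helper names and all coordinates are hypothetical; real coordinates would be read off a visual debugging plot.

```python
def build_read_options(password=None, table_area=None, column_xs=None):
    """Translate friendly arguments into keyword arguments for camelot.read_pdf.

    Hypothetical helper for illustration only.
    table_area: (x1, y1, x2, y2), top-left and bottom-right corners in PDF points.
    column_xs:  x-coordinates of the column separators.
    """
    options = {}
    if password is not None:
        options["password"] = password  # for encrypted PDFs
    if table_area is not None:
        x1, y1, x2, y2 = table_area
        options["table_areas"] = ["%s,%s,%s,%s" % (x1, y1, x2, y2)]
    if column_xs is not None:
        options["flavor"] = "stream"  # column separators apply to Stream
        options["columns"] = [",".join(str(x) for x in column_xs)]
    return options


def read_with_options(pdf_path, **kwargs):
    """Call Camelot with the options built above (requires camelot-py)."""
    import camelot  # imported lazily so the helper above works without it

    return camelot.read_pdf(pdf_path, **build_read_options(**kwargs))


# Made-up coordinates, only to show the shape of the arguments.
print(build_read_options(table_area=(316, 499, 566, 337), column_xs=[72, 95, 209]))
```

Keeping the option building separate from the Camelot call makes it easy to inspect exactly what is being passed before running a slow extraction.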
Drawback
Camelot only works with text-based PDFs and not scanned documents. So if you have scanned documents, you’ll have to look at some other alternatives.
⚔️ Excalibur – The Web Interface: The icing on the cake
Another interesting feature of Camelot is that it also has a web interface called Excalibur for people who do not want to code but still want to use the library’s features. Let’s quickly see how to use it.
Installation
After installing Ghostscript, use pip to install Excalibur:
$ pip install excalibur-py
And then start the web server using:
$ excalibur webserver
You can then navigate to the localhost to access the interface. The whole process has been demonstrated in the video below:
Conclusion
In this article, we looked at Camelot, an open-source Python library that is helpful for extracting tabular data from PDFs. The fact that it has many adjustable parameters makes it flexible and applicable in a lot of situations. The web interface offers an excellent alternative for people looking for a code-free environment. Overall, it seems to be a useful tool that could help reduce the time generally spent on data extraction.
References
This article is inspired by a talk given at PyCon India 2019 by Vinayak Mehta, the creator and maintainer of Camelot. Some of the resources have been taken from the slide deck shared publicly after the event. Watching the presentation is highly recommended for more clarity.