Making the Mueller Report Searchable with OCR and Elasticsearch

Kyle Gallatin
Towards Data Science
6 min read · Apr 19, 2019

April 18th marked the full release of the Mueller Report, a document outlining the investigation into potential Russian interference in the 2016 presidential election. Like most government documents, it is long (448 pages) and would be painfully tedious to read.

To make matters worse, the actual PDF download is basically just an image. You cannot copy/paste text or use ctrl+f to jump to specific passages if you want to search the document for more targeted information.

However, we can easily make this document searchable for ourselves using two great technologies: optical character recognition (OCR) and Elasticsearch.

Optical Character Recognition

OCR allows us to take images of text, recognize the characters, and convert them into actual machine-readable text (see the description on Wikipedia). Fortunately for us, there are now many open source libraries and products that perform this task.

Tesseract is one such engine. Originally developed in the 80s, it has been a Google project since 2006 and is one of the most popular OCR engines available. Today, we'll be using its Python wrapper: pytesseract. I got my original PDF-OCR inspiration from this post, so check it out!

Elasticsearch

Elasticsearch is a scalable search platform whose relevance scoring is based on an algorithm similar to TF-IDF, which stands for term frequency-inverse document frequency.

Essentially, it’s a simple function often used within the search/similarity space that targets documents based on keywords. It also places less emphasis on words that appear frequently. For instance, because the word “the” appears in so many texts, we don’t want it to be considered an important part of our search query. TF-IDF takes this into account when comparing your query with documents. For a basic overview of it just check wikipedia.

Installation

You can download and install Elasticsearch from the Elastic website, or via the package manager for your OS. Then you just need the Python packages we'll be using.

pip install elasticsearch
pip install pdf2image
pip install pytesseract
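
Note that pytesseract and pdf2image are only wrappers: they also need the underlying Tesseract and Poppler binaries installed on your system. On macOS, for example:

brew install tesseract poppler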

OCR Text Extraction

First, download the Mueller Report to your host. Then, we can create a quick function to extract the text from the PDF page-by-page using pytesseract and the pdf2image library.
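
A minimal sketch of that function might look like the following (the helper name pdf_to_text and the file name mueller_report.pdf are my own choices):

import pytesseract
from pdf2image import convert_from_path

def pdf_to_text(pdf_path, num_pages=10):
    # Render each PDF page as an image (pdf2image needs poppler installed)
    pages = convert_from_path(pdf_path, last_page=num_pages)
    docs = []
    for page_num, page in enumerate(pages, start=1):
        # Run Tesseract OCR on the page image
        text = pytesseract.image_to_string(page)
        docs.append({'text': text, 'page': page_num})
    return docs

docs = pdf_to_text('mueller_report.pdf')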

Notice I set the default num_pages=10. This is because the report is really long, and on your personal computer it'll probably take quite a while to extract the text from every page. It's also a lot of data for a local Elasticsearch index if you don't plan on deploying this to the cloud. Still, feel free to change that parameter to any value you choose.

Regardless, when we run this function on our PDF, we now have the text and page number for all of the pages we processed! It's a list of dictionaries (essentially JSON), which is perfect for ingestion by elastic to make it searchable.

Index in Elasticsearch

The first thing you need to do is make sure elastic is running on the proper port. Open a terminal and start elastic (if it’s in your $PATH it should just be elasticsearch). By default, this will start the service on port 9200.

After that, we can easily use the Python client to interact with our instance. If elastic is running properly on port 9200, the following code should create the index mueller-report with two fields, text and page (these correspond to our dictionary keys from the previous function).
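
A sketch of that step, assuming a 7.x-era Elasticsearch and Python client (older versions also require a document type):

from elasticsearch import Elasticsearch

es = Elasticsearch()  # connects to localhost:9200 by default

# Create the index with explicit types for our two fields
mapping = {
    'mappings': {
        'properties': {
            'text': {'type': 'text'},
            'page': {'type': 'integer'}
        }
    }
}
es.indices.create(index='mueller-report', body=mapping)

# Index each OCR'd page as its own document
for doc in docs:
    es.index(index='mueller-report', body=doc)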

Searching our Index

I won’t get into the specifics, but elastic uses a language called query DSL to interact with the indices. There’s a lot you can do with it, but all we’re going to do here is create a function that will vectorize our query, and compare it with the text in our index for similarity.

The res will be a JSON response that contains a bunch of info on our search. Realistically though, we only want the relevant results, so once we actually call the function, we can parse the response to get the most relevant page text and page number.
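
For instance, to pull out the top hit from the response above:

# Hits come back sorted by relevance score, so [0] is the best match
top_hit = res['hits']['hits'][0]['_source']
print(top_hit['page'])
print(top_hit['text'])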

With this, our search function looks for "department of justice" within the page texts and returns the results. The [0] in the statement above is just to look at the first, most relevant page text and number. However, you can customize the parsing so that it returns as few or as many results as you like.

Using the Kibana Front End

Instead of viewing a poorly recorded gif of my Jupyter notebook, we can actually use another Elastic tool to better view our results. Kibana is an open source front end for elastic that's great for visualization. First, install Kibana from the Elastic website.

Once you have Kibana installed, start the service by running kibana in a terminal and then navigate to localhost:5601 in your favorite web browser. This will let you interact with the application.

The only thing we have to do here before we interact with our index is create an index pattern. Go to Management > Create Index Pattern and type in "mueller-report"; Kibana should let you know that the pattern matches the index we created earlier in elastic.

And that’s it! If you go to the Discover tab on the left, and you can search your index in a much easier (and more aesthetic) manner than we were in elastic.

Next Steps

It would probably be cool to throw this up on AWS so anyone can use it (with a nicer front end), but I'm not really tryna tie my credit card to that instance at the moment. If anyone else wants to, feel free! I'll update soon with a Docker container and GitHub link.

Updates

4/21/19 — There have been a lot of other posts/work regarding OCR and subsequent NLP on the Mueller Report. It seems that the main concern is the actual quality of the OCR text, since results can be messy due to formatting or general inaccuracy. While there’s probably no easy way to solve this for future analysis (aside from the government releasing a useful document), we can at least have elastic compensate for any misspellings in our searches by adding a fuzziness parameter to our search function.
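
As a sketch, a fuzzy version of the earlier search helper only needs the match query expanded into its long form (fuzziness is a standard Query DSL option; the function name is again my own):

def fuzzy_search(es, index, query, size=10):
    # 'AUTO' picks an allowed edit distance based on each term's length,
    # so terms a character or two off (e.g. OCR misreads) still match
    body = {
        'query': {
            'match': {
                'text': {'query': query, 'fuzziness': 'AUTO'}
            }
        }
    }
    return es.search(index=index, body=body, size=size)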

This is a hacky but generalizable way to account for some of the error we might find in the post-OCR text.
