Recently, for a self-study project, I had to go through an 800-page pdf file. Each chapter of the file contained a common set of questions, and I needed the answers to specific questions in each chapter. It’d take me forever to go through each page of the document and assess the answers to those questions, so I wondered if there was a quick way to scan each page and extract only the relevant information from the file. I figured out a Pythonic way to do so. In this post, I am going to share how I was able to read the pdf file, extract only the relevant information from each chapter, export the data into Excel and an editable Word document, and convert it back to pdf format using different packages in Python. Let’s get started.

Data
Rather than an 800-page document, I am going to use a 4-page pdf file as an example. During the final days of high school, my classmates passed around a diary called "Auto book" as a memory to collect the interests, preferences, and contact information of each other. The pdf file I am using contains dummy information about four imaginary friends named Ram, Shyam, Hari, and Shiva. The file contains information such as their nationality, date of birth, preferences (food, fruit, sports, player, movie, actor), favorite quotes, aim, views on politics, and message to the world.

It’d be easy to extract information for a few friends directly by copying and pasting from the pdf file. However, if the pdf file is large, it’d be much more efficient and precise to do it using Python. The following sections show how it is done step by step in Python.
1. Reading pdf document using PyPDF2 or PyMuPDF packages
a. Read the first page using PyPDF2
To read the text in the pdf file using Python, I use a package called PyPDF2 and its PdfReader class. Here, I read just the first page of the pdf file and extract the text from it.
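A minimal sketch of this step, assuming PyPDF2 is installed (pip install PyPDF2). The helper name read_first_page and the file location in the usage comment are assumptions for illustration:

```python
def read_first_page(pdf_path: str) -> str:
    """Return the text of the first page of a pdf file."""
    # Third-party import kept inside the helper so the sketch can be
    # loaded even where PyPDF2 is not installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # reader.pages behaves like a list, so index 0 is the first page
    first_page = reader.pages[0]
    # Fall back to an empty string if no text could be extracted
    return first_page.extract_text() or ""


# Usage (the path is an assumption; adjust it to where the file lives):
# print(read_first_page("autobook.pdf"))
```

Pages that are scanned images usually yield little or no text this way, since PyPDF2 reads the embedded text layer and does not perform OCR.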

b. Read the entire text of the file using PyPDF2
To read the entire text from the pdf file, I use a function called extract_text_from_pdf as shown below. First, the function opens the pdf file for reading in binary format and initializes the reading object. An empty list called pdf_text is initialized. Next, while looping through each page of the pdf file, the content of each page is extracted and appended to the list.
import PyPDF2

def extract_text_from_pdf(pdf_file: str) -> list[str]:
    # Open the pdf file for reading in binary format
    with open(pdf_file, 'rb') as pdf:
        # initialize a PdfReader object
        reader = PyPDF2.PdfReader(pdf)
        # start an empty list
        pdf_text = []
        # loop through each page in the document
        for page in reader.pages:
            content = page.extract_text()
            pdf_text.append(content)
    return pdf_text
When the file is passed as an argument to the function above, it returns a list in which each element holds the text of one page. The given file autobook.pdf is read as 4 elements by the extract_text_from_pdf() function, and I store the result in a list called extracted_text.

The elements inside extracted_text can also be joined into a single element using:
all_text = [''.join(extracted_text)]
len(all_text) #returns 1
all_text is a list containing only one element, which holds the entire text of all the pages of the pdf file.
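To make the difference concrete, here is a small sketch that uses a stand-in list of four short strings in place of the real page texts (the strings are made up for illustration). It also shows one possible variation: joining with a newline keeps a boundary between pages, which an empty separator does not.

```python
# Stand-in for the list returned by extract_text_from_pdf(); the strings
# below are invented, not the real contents of autobook.pdf.
extracted_text = [
    "Name: Ram",
    "Name: Shyam",
    "Name: Hari",
    "Name: Shiva",
]

# One list element holding the text of all pages, as in the post
all_text = ["".join(extracted_text)]

# A plain string is often more convenient for regular expressions;
# "\n" keeps the last word of one page apart from the first of the next
joined_text = "\n".join(extracted_text)

print(len(extracted_text))       # 4
print(len(all_text))             # 1
print(joined_text.count("\n"))   # 3
```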
c. Alternative way to read the entire text of the pdf file using the PyMuPDF package.
Alternatively, I came across a package called PyMuPDF, which can read the entire text in the pdf as shown below:
# install using: pip install PyMuPDF
import fitz

# path to the input pdf file (assumed here to be in the working directory)
file = "autobook.pdf"
with fitz.open(file) as doc:
    text = ""
    for page in doc:
        # append the characters on each page
        text += page.get_text()
print("Pdf file is read as a string of", len(text), "characters.")
# len(text) is 1786 for this file
First, the pdf file is opened as doc, and text is initialized as an empty string. By looping through each page in doc, the characters on each page are appended to text. Hence, the length of text here is 1786, which counts every character, including spaces, new lines, and punctuation marks.
2. RegEx
RegEx, or Regular Expression, is a sequence of characters that forms a search pattern. Python has a built-in package called re for this purpose.
From all the text in the given pdf file, I wanted to extract only the specific information. Below I describe the functions I used for this purpose, although there could be much wider use cases of RegEx.
a. findall
When the given pattern matches in the string/text, the findall function returns a list of all the matches.
Here, x, y, and z hold all the matches for Name, Nationality, and Country in the text. There are four occurrences of Name, three occurrences of Nationality, and a single occurrence of Country in the text of the pdf.
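A small sketch of this step. The sample text below is invented to mimic the layout described above (including the friends' nationalities, which are placeholders); it is not the real content of the pdf.

```python
import re

# Invented sample in the spirit of the pdf text; not the real file contents
text = (
    "Name: Ram\nNationality: Nepali\n"
    "Name: Shyam\nNationality: Indian\n"
    "Name: Hari\nCountry: Nepal\n"
    "Name: Shiva\nNationality: Nepali\n"
)

# findall returns a list with one entry per non-overlapping match
x = re.findall("Name", text)
y = re.findall("Nationality", text)
z = re.findall("Country", text)

print(len(x), len(y), len(z))  # 4 3 1
```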

b. sub
The sub function is used to substitute/replace one or more matches with a string.
In the given text, the Nationality is referred to as Country in the case of a friend named Hari. To replace Country with Nationality, I first compiled a regular expression pattern for Country. Next, I used the sub method to replace the pattern with the new word and created a new string called new_text. In new_text, I find four occurrences of Nationality, unlike three in the previous case.
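The replacement can be sketched as follows, again on an invented sample text with placeholder values rather than the real pdf content:

```python
import re

# Invented sample in the spirit of the pdf text; not the real file contents
text = (
    "Name: Ram\nNationality: Nepali\n"
    "Name: Shyam\nNationality: Indian\n"
    "Name: Hari\nCountry: Nepal\n"
    "Name: Shiva\nNationality: Nepali\n"
)

# Compile the pattern once, then replace every match with the new word
pattern = re.compile("Country")
new_text = pattern.sub("Nationality", text)

print(len(re.findall("Nationality", text)))      # 3
print(len(re.findall("Nationality", new_text)))  # 4
```

Strings are immutable in Python, so sub leaves text unchanged and returns a new string.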

c. finditer
The finditer method returns an iterator over the matches of a pattern in a string.
In the given text, the text between the Name and Nationality fields contains the actual names of the friends, and the text between the Nationality and Date of Birth fields contains the actual nationalities. I created the following function called find_between() to find the text between any two words that appear in consecutive order in the given text.
import re

def find_between(first_word, last_word, text):
    """Find the characters between first_word and last_word."""
    pattern = rf"(?P<start>{first_word})(?P<match>.+?)(?P<end>{last_word})"
    # Returns an iterator called matches based on the pattern in the string.
    # The re.DOTALL flag allows the '.' character to include new lines in matching
    matches = re.finditer(pattern, text, re.DOTALL)
    new_list = []
    for match in matches:
        new_list.append(match.group("match"))
    return new_list
The key part of the above function is pattern. The pattern is set up to capture the characters between first_word and last_word in the given text. The finditer function returns an iterator over all non-overlapping matches in the string. For each match, the iterator returns a Match object. An empty list called new_list is initialized. By looping through matches, the captured group named match in each iteration is appended to new_list, which is returned by the function.
In this way, I was able to create the lists for each field such as names, nationalities, date of birth, preferences, and so on from the pdf file as shown in the code snippet below:
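A sketch of how the lists might be built with find_between(). The sample text, the colon after each label, and the strip() call that trims surrounding whitespace are assumptions for illustration; the real pdf text may be laid out differently.

```python
import re


def find_between(first_word, last_word, text):
    """Find the characters between first_word and last_word."""
    pattern = rf"(?P<start>{first_word})(?P<match>.+?)(?P<end>{last_word})"
    matches = re.finditer(pattern, text, re.DOTALL)
    return [match.group("match") for match in matches]


# Invented sample; not the real contents of the pdf
text = (
    "Name: Ram\nNationality: Nepali\nDate of Birth: 1 January 2000\n"
    "Name: Shyam\nNationality: Indian\nDate of Birth: 2 February 2000\n"
)

# strip() removes the space and newline left around each captured value
names = [s.strip() for s in find_between("Name:", "Nationality:", text)]
nationalities = [
    s.strip() for s in find_between("Nationality:", "Date of Birth:", text)
]

print(names)          # ['Ram', 'Shyam']
print(nationalities)  # ['Nepali', 'Indian']
```

If a label could contain characters that are special in regular expressions, such as parentheses or a question mark, passing it through re.escape() first would keep it from being read as part of the pattern.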

Note:
The ‘.’ special character in a regular expression matches any character in the text/string except the new line. However, with the re.DOTALL flag, the ‘.’ character can match any character, including the new line.
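The effect of the flag can be seen on a short invented string in which the value spans two lines:

```python
import re

# Invented sample where the name is split across two lines
text = "Name: Ram\nKumar\nNationality: Nepali"

# Without re.DOTALL, '.' cannot cross the newline, so nothing matches
without_flag = re.findall(r"Name:(.+?)Nationality", text)

# With re.DOTALL, '.' also matches '\n'
with_flag = re.findall(r"Name:(.+?)Nationality", text, re.DOTALL)

print(without_flag)  # []
print(with_flag)     # [' Ram\nKumar\n']
```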
3. Exporting data to Excel
a. Pandas dataframe from lists
In the step above, I got the lists for each profile field for each friend. In this step, I convert these lists into a pandas dataframe:
import pandas as pd
df = pd.DataFrame()
df["Name"] = names
df["Nationality"] = nationalities
df["Date of Birth"] = dobs
df["Favorite Food"] = foods
df["Favorite Fruit"] = fruits
df["Favorite Sports"] = sports
df["Favorite Player"] = players
df["Favorite Movie"] = movies
df["Favorite Actor"] = actors
df["Favorite Quotes"] = quotes
df["Aim"] = aims
df["Views on Politics"] = politics
df["Messages"] = messages
df = df.T
df.columns = df.iloc[0]
df.drop(index = "Name", inplace = True)
df
After transposing, the dataframe df has one column per friend and one row per profile field.

b. Conditional formatting using pandas dataframe
A pandas dataframe supports conditional formatting similar to Excel. Suppose I want to highlight the cells containing the name of my favorite player, Lionel Messi, in df. This can be done using the df.style.applymap() function (renamed to df.style.map() in pandas 2.1), which applies a styling function to each cell.

When the styled dataframe is exported in *.xlsx format in line [28] of the notebook, the exported file also contains the yellow highlight for the cell containing Lionel Messi.
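A sketch of how the highlighting and export could look. The function names highlight_messi and export_highlighted are hypothetical, and the export assumes that an Excel engine such as openpyxl is installed:

```python
def highlight_messi(value):
    """Return the CSS for one cell: yellow if it names Lionel Messi."""
    return "background-color: yellow" if "Lionel Messi" in str(value) else ""


def export_highlighted(df, path):
    """Style df cell by cell and write the result to an Excel file."""
    styler = df.style
    # Styler.applymap was renamed to Styler.map in pandas 2.1;
    # use whichever name this pandas version provides
    cellwise = getattr(styler, "map", None) or styler.applymap
    styled = cellwise(highlight_messi)
    # Writing .xlsx needs an Excel engine such as openpyxl
    styled.to_excel(path, engine="openpyxl")
    return styled
```

It could then be called as export_highlighted(df, "../output/friends.xlsx"), where the file name is only an example. The styling function receives one cell value at a time and returns a CSS string, and an empty string leaves the cell unstyled.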
4. Exporting from Python to word format
a. Creating word document using Python-docx
To export data from Python to Word format, I use a package called python-docx. The Document object of the docx package allows the creation of different parts of a Word document, such as headings and paragraphs.
In the code below, I add the heading Name for each friend at first followed by a paragraph containing the actual name of the friend. This is followed by the headings and the corresponding texts for each profile field. I add a page break at the end of the profile of each friend.
from docx import Document
document = Document()
for column in df.columns:
    document.add_heading("Name")
    p = document.add_paragraph()
    p.add_run(column)
    for index in df.index:
        document.add_heading(index)
        p = document.add_paragraph()
        p.add_run(df.loc[index, column])
    # add page break after profile of each friend
    document.add_page_break()
After saving, the code above yields a Word document in which each friend’s profile starts on a new page, with a heading for each field followed by its text:

b. Highlight paragraph using Python-docx
The python-docx package helps to generate a Word document with many of the features available in the Microsoft Word application. For example, text can be formatted with different font styles, colors, and sizes, along with features such as bold, italic, and underline. Let’s say I want to create a section called Favorites at the end of the document and highlight its introductory text. It can be done with the following code:
from docx.enum.text import WD_COLOR_INDEX
document.add_heading("Favorites")
p = document.add_paragraph()
p.add_run("This section consists of favorite items of each friend.").font.highlight_color=WD_COLOR_INDEX.YELLOW
c. Create a table using Python-docx
The python-docx package also allows the creation of tables in the Word document directly from Python. Suppose I want to add a table consisting of the favorite items of each friend in the Favorites section at the end of the document. A table can be created using document.add_table(rows = nrows, cols = ncols). Furthermore, the text needs to be defined for each cell of the table.
In the code below, I define a table object with 8 rows and 5 columns. Next, I define the table header and first column. By looping through the dataframe df, I define the text for each cell inside the table based on the favorite items of each friend.
table = document.add_table(rows = 8, cols = 5)
# Adding heading in the 1st row of the table
column1 = table.rows[0].cells
column1[0].text = 'Items'
# table header
for i in range(1, len(df.columns)+1):
    column1[i].text = df.columns[i-1]
# first column in the table for labels
for i in range(1, 8):
    table.cell(i, 0).text = df.index[i+1]
for i in range(2, 9):
    for j in range(1, 5):
        table.cell(i-1, j).text = df.iloc[i, j-1]
# define table style
table.style = "Light Grid Accent 1"
d. Save the document.
The document is saved as a *.docx format file using:
document.save("../output/python_to_word.docx")
The final page of the document comprising favorites section and the table looks as follows:

5. Converting word document to pdf format.
To convert a document from Word .docx format to .pdf format using Python, I came across a package called docx2pdf. The Word document can be converted to pdf format using the convert function of the package as convert(input_path, output_path).
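A sketch of the conversion step. The helper names pdf_path_for and convert_docx_to_pdf are hypothetical, and the output file is assumed to sit next to the input with a .pdf extension. Note that docx2pdf drives Microsoft Word, so Word needs to be installed (Windows or macOS).

```python
from pathlib import Path


def pdf_path_for(docx_path):
    """Return the same path with a .pdf extension."""
    return Path(docx_path).with_suffix(".pdf")


def convert_docx_to_pdf(docx_path):
    """Convert a .docx file to pdf next to it and return the new path."""
    # Third-party import kept local so the sketch can be loaded even
    # where docx2pdf is not installed
    from docx2pdf import convert

    output_path = pdf_path_for(docx_path)
    convert(str(docx_path), str(output_path))
    return output_path
```

For the document saved earlier, this could be called as convert_docx_to_pdf("../output/python_to_word.docx").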

The output folder looks as follows for me:

Conclusion
Scanning through a pdf file and extracting only the necessary information can be very time-consuming and stressful. There are different packages available in Python that help to automate this process, reduce the tedium, and extract precise information more efficiently.
In this post, I use a dummy example of a pdf file containing common fields/sections/headings in the profiles of four friends and extract the relevant information for each field for each friend. First, I used the PyPDF2 or PyMuPDF package to read the pdf file and print out the entire text. Second, I used Regular Expressions (RegEx) to detect patterns and find the matches for each pattern in the text to extract only relevant information. Third, I converted the lists of information for each profile field for each friend into a pandas dataframe and exported it to an Excel file. Next, I created a Word file using the python-docx package. And finally, I converted the Word file into pdf format again using the docx2pdf package.
The notebook and the input pdf file for this post are available in this GitHub repository. Thank you for reading!