Translate long PDF-Reports in Python

Automatically extract and translate a complete German Central Bank report for free

Philipp Schreiber
Towards Data Science


For work, I recently had to translate many old Central Bank reports from various OECD countries. While, luckily, gibberish online translation is a thing of the past, common manual solutions are often not viable when working with many long documents. A multitude of useful Python packages exist to help with this task, and they are introduced in a variety of excellent existing articles. However, when faced with this task, I found that the commonly used examples are often too stylised and that many of the established tools are no longer maintained, having been superseded by community-built follow-up projects.

That is why, with this article, I want to 1) provide a real-world example of PDF translation and 2) give an update on the best packages to use.

2 + 1 Tasks

So, together we’re going to translate a Central Bank Report, which — just like the code — you can find on my Git repository. To get started, we need a clear idea of what it is that we want to do. In our case, we need to somehow extract the content of a pdf, translate it and, then, (potentially) bring it into a format easily readable by humans: Extract -> Translate -> Write. We deal with each task separately and tie them together in the end.

Extract

As you might already have experienced yourself, retrieving the text from a PDF can be quite tricky. The reason for this is that PDFs only store the location of characters and do not record what constitutes words or lines. Our library of choice is the new pdfplumber project, which is built on the very good pdfminer.six library (itself replacing PDFMiner) but sports better documentation and exciting new features. One feature we'll make use of here is the filtering of tables. For completeness, note that the popular PyPDF2 package is better suited to merging PDFs than to extracting text.

import pdfplumber

pdf = pdfplumber.open("src/examples/1978-geschaeftsbericht-data.pdf")

We import the library and open the desired document. The central object of pdfplumber is the Page Class, which allows us to access each page and its content individually. Note that, while we could simply extract all text at once, reducing the pdf to one large string causes us to lose a lot of useful information.

Below, you can see how indices are used to access individual pages and how their text is retrieved by applying the extract_text() method.

page11 = pdf.pages[11]
page11.extract_text()
>>> 2 schließlich diese Wende in ihrer Politik durch die Heraufsetzung des Diskont- und Lom \nbardsatzes.

While this already looks great (for comparison, check the 12th page of the PDF), we see that sentences are disrupted by end-of-line breaks, which we can predict will create problems for translation. Since paragraphs naturally have a line-break after a full-stop, we’ll exploit this to keep only desired line breaks.

def extract(page):
    """Extract PDF text and delete in-paragraph line breaks."""
    # Get text
    extracted = page.extract_text()
    # Delete in-paragraph line breaks
    extracted = extracted.replace(".\n", "**/m"   # keep par breaks
                        ).replace(". \n", "**/m"  # keep par breaks
                        ).replace("\n", ""        # delete in-par breaks
                        ).replace("**/m", ".\n\n")  # restore par breaks
    return extracted
print(extract(page11)[:500])
>>> 2 schließlich diese Wende in ihrer Politik durch die Heraufsetzung des Diskont- und Lom bardsatzes.

Much better! But looking at the next page, we see that we have problems with the tables in the document.

page12 = pdf.pages[12]
print(extract(page12)[:500])
>>> 1 3 Zur Entwicklung des Wirtschaftswachstums Jährliche Veränderung in o;o Zum Vergleich: I Bruttoin-Brutto- I ...

Filtering out tables

A highlight of the pdfplumber package is the filter method. The library comes with built-in functionality for finding tables, but combining it with filter requires some ingenuity. Essentially, pdfplumber allocates each character to so-called "boxes", whose coordinates filter takes as input. For the sake of brevity, I won't explain the not_within_bboxes function but point towards the original Git issue. We pass the bounding boxes of the identified tables to the not_within_bboxes function and use it to keep only characters outside of them. Importantly, since the filter method only accepts a function without arguments, we freeze the bboxes argument using partial. This is added as a prior step to the extract function we created above.

from functools import partial

def not_within_bboxes(obj, bboxes):
    """Check if the object is in any of the table's bboxes."""
    def obj_in_bbox(_bbox):
        """Check if the object is inside the given bbox."""
        v_mid = (obj["top"] + obj["bottom"]) / 2
        h_mid = (obj["x0"] + obj["x1"]) / 2
        x0, top, x1, bottom = _bbox
        return (h_mid >= x0) and (h_mid < x1) and (v_mid >= top) and (v_mid < bottom)
    return not any(obj_in_bbox(__bbox) for __bbox in bboxes)

def extract(page):
    """Extract PDF text, filter out tables and delete in-paragraph line breaks."""
    # Filter out tables
    if page.find_tables() != []:
        # Get the bounding boxes of the tables on the page
        bboxes = [table.bbox for table in page.find_tables()]
        bbox_not_within_bboxes = partial(not_within_bboxes, bboxes=bboxes)
        # Keep only characters outside the table bboxes
        page = page.filter(bbox_not_within_bboxes)
    # Extract text
    extracted = page.extract_text()
    # Delete in-paragraph line breaks
    extracted = extracted.replace(".\n", "**/m"   # keep par breaks
                        ).replace(". \n", "**/m"  # keep par breaks
                        ).replace("\n", ""        # delete in-par breaks
                        ).replace("**/m", ".\n\n")  # restore par breaks
    return extracted
print(extract(page12)[:500])
>>> 3 des Produktionspotentials anzusehen. Die statistischen Möglichkeiten lassen es nur an näherungsweise zu, ...

Fantastic! The table was successfully filtered out, and we can now see that the page starts with a sentence cut in half by the page break. We'll leave it at that for the extraction, but I encourage you to play around with more features, such as extracting page numbers, improving paragraph separation and fixing frequent mistakes such as "0/o" being recognised instead of "%".
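
For the last of these, a minimal sketch of such a clean-up step could look like the following (the replacement rules are illustrative assumptions for this particular scan, not an exhaustive list):

import re

def clean(extracted):
    """Fix frequent OCR artefacts in the extracted text (illustrative rules only)."""
    # "%" is often mis-recognised as "0/o" or "o;o" in this scan
    extracted = re.sub(r"[0oO][/;][oO0]", "%", extracted)
    # Collapse double spaces left behind by the clean-up
    extracted = re.sub(r" {2,}", " ", extracted)
    return extracted

clean("Die Produktivität stieg um rund 4 0/o.")
>>> 'Die Produktivität stieg um rund 4 %.'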

Translation

AWS and DeepL offer two prominent APIs for high-quality text translation, but their character-based pricing schemes can turn out to be extremely costly if we want to translate several long reports. To translate free of charge, we use the Google API with a key workaround that enables the translation of long texts.

from deep_translator import GoogleTranslator

Since the GoogleTranslate API is not maintained by Google, the community has repeatedly faced issues in translation. That is why we here use the deep_translator package, which acts as a useful wrapper for the API and enables us to seamlessly switch between translation engines, should we wish to. Importantly, GoogleTranslator can automatically identify the source language (in our example German), so we only need to specify our target language: English.

translate = GoogleTranslator(source='auto', target='en').translate

With this wrapper translation is extremely simple, as the following example demonstrates.

translate("Ich liebe Python programmieren.")
>>> 'I love Python programming.'
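
Should you later decide that a premium engine is worth the cost, the wrapper makes switching painless. As a rough sketch (the API key is a placeholder, and the exact constructor arguments may differ slightly between deep_translator versions), the DeepL backend can be plugged in without touching the rest of the pipeline:

from deep_translator import DeeplTranslator

# Same .translate interface as GoogleTranslator, only the client changes
translate = DeeplTranslator(api_key="YOUR_DEEPL_KEY", source="de", target="en").translate
translate("Ich liebe Python programmieren.")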

However, the key issue is that most translation engines have a 5000-byte upload limit. If a job exceeds it, the connection is simply terminated, which would, for example, prevent the translation of page11. Of course, we could translate every word or sentence individually; however, this undermines the translation quality. That is why we collect chunks of sentences just below the upload limit and translate them together.
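
A quick aside on why we count bytes rather than characters: German umlauts and "ß" take two bytes in UTF-8, so a chunk of German text hits the 5000-byte limit earlier than its character count suggests. A minimal check:

len("Präzisionsmaßnahme")
>>> 18
len("Präzisionsmaßnahme".encode("utf-8"))
>>> 20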

I originally found this workaround here. It uses the popular natural language processing tool nltk to identify sentences. The package's documentation is great, and I recommend anyone interested to try it out. Here, we limit our attention to the package's tokenizer. It cannot be stressed enough that only high-quality input will lead to high-quality translation output, so going the extra mile in these preparation steps will easily pay off!

Because this can be daunting for first-time users, I present here the shell commands to install the relevant nltk functionality (on Windows OS). The "popular" subset includes the nltk.tokenize package, which we will use now.

# Shell script
pip install nltk
python -m nltk.downloader popular

As you can see below, the sent_tokenize function creates a list of sentences. The language argument defaults to English, which works just fine for most European languages. Please check out the nltk documentation to see if the language you need is supported.

from nltk.tokenize import sent_tokenize

text = "I love Python. " * 2
sent_tokenize(text, language = "english")
>>> ['I love Python.', 'I love Python.']
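
Since our report is in German, it may be safer to pass the source language explicitly; the Punkt models shipped with the "popular" download also cover German. If you do so, remember to pass the same argument inside the translation function below.

sent_tokenize("Der Bericht ist lang. Wir übersetzen ihn trotzdem.", language="german")
>>> ['Der Bericht ist lang.', 'Wir übersetzen ihn trotzdem.']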

Now, the second ingredient we need is an algorithm for collecting chunks of sentences below the upload limit. Once we find that adding another sentence would exceed 5k bytes, we translate the collection and start a new chunk with the current sentence. Importantly, if a sentence itself is longer than 5k bytes (which, remember, corresponds to roughly a page), we simply discard it and provide an in-text note instead. Combining i) the set-up of the translation client, ii) sentence tokenization, and iii) chunk-wise translation, we end up with the following translation function.

def translate_extracted(extracted):
    """Wrapper for Google Translate with upload workaround."""
    # Set up and wrap translation client
    translate = GoogleTranslator(source='auto', target='en').translate
    # Split input text into a list of sentences
    sentences = sent_tokenize(extracted)
    # Initialize containers
    translated_text = ''
    source_text_chunk = ''
    # Collect chunks of sentences and translate them individually
    for sentence in sentences:
        # If chunk + current sentence < limit, add the sentence
        if ((len(sentence.encode('utf-8')) + len(source_text_chunk.encode('utf-8')) < 5000)):
            source_text_chunk += ' ' + sentence
        # Else translate the chunk and start a new one with the current sentence
        else:
            translated_text += ' ' + translate(source_text_chunk)
            # If the current sentence is smaller than 5000 bytes, start a new chunk
            if (len(sentence.encode('utf-8')) < 5000):
                source_text_chunk = sentence
            # Else, replace the sentence with a notification message
            else:
                message = "<<Omitted Word longer than 5000bytes>>"
                translated_text += ' ' + translate(message)
                # Re-set text container to empty
                source_text_chunk = ''
    # Translate the final chunk of input text, if there is any valid text left to translate
    if translate(source_text_chunk) is not None:
        translated_text += ' ' + translate(source_text_chunk)
    return translated_text

To see if it works, we apply our translation function to a page we already worked with earlier. As is now also evident to non-German speakers, the hourly productivity rate apparently increased by roughly 4% in 1978.

extracted = extract(pdf.pages[12])
translated = translate_extracted(extracted)[:500]
print(translated)
>>> 3 of the production potential. The statistical possibilities allow only an approximation of the closures that still occur physically due to long-term shrinkage ...

Writing

We almost have everything we need. Like me, you might need to bring your extracted text back into a format easily readable by humans. While it is easy to save strings to .txt files in Python, the lack of line breaks makes them a poor choice for long reports. Instead, we will write them back to PDF using the fpdf2 library, which apparently succeeds the no-longer-maintained pyfpdf package.

from fpdf import FPDF

After initializing an FPDF object, we can add a page object for every page we translated and write the text onto it. This helps us maintain the structure of the original document. Two things to note: firstly, in multi_cell we set the width to zero to use the full page width and choose a height of 5 to get slim line spacing. Secondly, since the pre-installed fonts are not Unicode-compatible, we change the encoding to "latin-1". For the download and use of Unicode-compatible fonts, see the instructions on the fpdf2 website.

fpdf = FPDF()
fpdf.set_font("Helvetica", size=7)
fpdf.add_page()
fpdf.multi_cell(w=0, h=5,
                txt=translated.encode("latin-1", errors="replace"
                                      ).decode("latin-1"))
fpdf.output("output/page12.pdf")
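
If you would rather keep the German umlauts than replace them, a minimal variant with a Unicode font could look like this. It assumes you have downloaded DejaVuSans.ttf into your working directory, and the exact add_font signature depends on your fpdf2 version:

fpdf = FPDF()
# Register a Unicode-capable TrueType font and use it instead of Helvetica
fpdf.add_font("DejaVu", fname="DejaVuSans.ttf")
fpdf.set_font("DejaVu", size=7)
fpdf.add_page()
fpdf.multi_cell(w=0, h=5, txt=translated)  # no latin-1 round-trip needed
fpdf.output("output/page12_unicode.pdf")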

Now, just like in the extraction step, there is obviously a lot more you could do with fpdf2, such as adding page numbers, title layouts, etc. However, for the purpose of this article, this minimal set-up will suffice.

Tying everything together

We'll now bring everything together in one pipeline. Remember that, to avoid losing too much information, we operate on each page individually. Importantly, we make two adaptations to the translation: first, since some pages are empty and empty strings are not valid input for GoogleTranslator, we place an if-condition before the translation. Secondly, because nltk allocates our paragraph breaks (i.e., "\n\n") to the beginning of the following sentence, GoogleTranslate ignores them. That is why we translate each paragraph individually using a list comprehension. Be patient, translating 150 pages can take up to 7 minutes!

# Open PDF
with pdfplumber.open("src/examples/1978-geschaeftsbericht-data.pdf") as pdf:
    # Initialize FPDF file to write on
    fpdf = FPDF()
    fpdf.set_font("Helvetica", size=7)
    # Treat each page individually
    for page in pdf.pages:
        # Extract page
        extracted = extract(page)
        # Translate page
        if extracted != "":
            # Translate paragraphs individually to keep breaks
            paragraphs = extracted.split("\n\n")
            translated = "\n\n".join(
                [translate_extracted(paragraph) for paragraph in paragraphs]
            )
        else:
            translated = extracted
        # Write page
        fpdf.add_page()
        fpdf.multi_cell(w=0, h=5,
                        txt=translated.encode("latin-1",
                                              errors="replace"
                                              ).decode("latin-1"))
# Save all FPDF pages
fpdf.output("output/trans_1978-geschaeftsbericht-data.pdf")

Conclusion

Thank you for staying to the end. I hope this article gave you a hands-on example of how to translate PDFs and an overview of the current state-of-the-art packages. Throughout the article, I pointed out various potential extensions to this rudimentary example (e.g., adding page numbers, layout, etc.), so please share your approaches for these; I'd love to hear them. And, of course, I am always eager to hear suggestions on how to improve the code.

Stay safe and stay in touch!
