Converting PDF and Gutenberg Document Formats into Text: Natural Language Processing in Production

What is critical in production-grade Natural Language Processing (NLP) is the fast pre-processing of popular document formats into text.

Published in Towards Data Science, Aug 22, 2020 (9 min read)

Lots of Different Documents in the Enterprise. Source: Unsplash

Estimates state that 70%–85% of the world’s data is text (unstructured data). Most English and EU business data is stored as byte text, MS Word, or Adobe PDF. [1]

Organizations commonly display documents on the web in Adobe Portable Document Format (PDF). [2]

In this blog, I detail how to:

Create a file path from a web file name or a local file name;
Change a byte-encoded Project Gutenberg file into a text corpus;
Change a PDF document into a text corpus;
Segment continuous text into a corpus of words.

Converting Popular Document Formats into Text

1. Create local filepath from the web filename or local filename

The following function will take either a local file name or a remote file URL and return a filepath object.
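
A minimal sketch of such a function, using only the Python standard library, could look like the following. The name `file_or_url`, the `cache_dir` parameter, and the download-once caching behavior are illustrative assumptions, not necessarily the implementation used in this article.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def file_or_url(name: str, cache_dir: str = "downloads") -> Path:
    """Return a local Path for a local file name or a remote file URL.

    Hypothetical helper: the name and the cache_dir parameter are assumptions.
    """
    parsed = urlparse(name)
    if parsed.scheme in ("http", "https"):
        # Remote file: download it once into cache_dir and reuse the copy.
        target_dir = Path(cache_dir)
        target_dir.mkdir(parents=True, exist_ok=True)
        # Fall back to a generic name when the URL path has no file component.
        filename = Path(parsed.path).name or "index.html"
        target = target_dir / filename
        if not target.exists():
            urlretrieve(name, str(target))
        return target

    # Local file: expand "~" and fail early if the file is missing.
    path = Path(name).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {name}")
    return path
```

Returning a `pathlib.Path` in both cases means the later conversion steps can open the result the same way, whether the document started out on disk or on the web.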


Written by Bruce H. Cottman, Ph.D.