Converting PDF and Gutenberg Document Formats into Text: Natural Language Processing in Production

Production-grade Natural Language Processing (NLP) depends on fast pre-processing of popular document formats into text.

Bruce H. Cottman, Ph.D.
Towards Data Science
9 min read · Aug 22, 2020

Lots of Different Documents in the Enterprise. Source: Unsplash

Estimates state that 70%–85% of the world’s data is text (unstructured data). Most English and EU business data is stored as byte text, MS Word, or Adobe PDF. [1]

Organizations commonly display Adobe Portable Document Format (PDF) documents on the web. [2]

In this blog, I detail the following:

  1. Create a file path from a web file name or a local file name;
  2. Convert a byte-encoded Project Gutenberg file into a text corpus;
  3. Convert a PDF document into a text corpus;
  4. Segment continuous text into a corpus of words.

Converting Popular Document Formats into Text

1. Create a local filepath from a web filename or a local filename

The following function will take either a local file name or a remote file URL and return a filepath object.
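As a minimal sketch of that idea, the snippet below uses only the Python standard library. The names `is_remote` and `open_file_or_url` are hypothetical, chosen for illustration, and I am assuming that "filepath object" means an open, binary, file-like object that can be read the same way whether the source is local or remote.

```python
from typing import BinaryIO
from urllib.parse import urlparse
from urllib.request import urlopen

# Assumption: only these URL schemes count as "remote" in this sketch.
REMOTE_SCHEMES = {"http", "https", "ftp"}


def is_remote(name: str) -> bool:
    """Return True when the name looks like a remote URL rather than a local path."""
    return urlparse(name).scheme.lower() in REMOTE_SCHEMES


def open_file_or_url(name: str) -> BinaryIO:
    """Open a local file or a remote URL and return a binary file-like object.

    The caller reads bytes from the result and is responsible for closing it,
    for example by using it in a `with` statement.
    """
    if is_remote(name):
        # urlopen returns a response object that supports read() and close().
        return urlopen(name)
    # Anything else is treated as a local path and opened in binary mode.
    return open(name, "rb")
```

Both branches return bytes rather than text, so decoding (for example `raw.decode("utf-8")`) is left to the next step in the pipeline. Calling `open_file_or_url` with a local name such as `"book.txt"` or with an `https://` URL should give an object with the same `read()` interface.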


I write my blog utilizing decades of experience in investment, programming, and data science.