Converting PDF and Gutenberg Document Formats into Text: Natural Language Processing in Production
Production-grade Natural Language Processing (NLP) depends on fast pre-processing of popular document formats into text.
9 min read · Aug 22, 2020
Estimates state that 70%–85% of the world’s data is text (unstructured data). Most English-language and EU business data is formatted as byte text, MS Word, or Adobe PDF. [1]
Organizations display documents on the web in Adobe Portable Document Format (PDF). [2]
In this blog, I detail the following:
- Create a file path from a web file name or a local file name;
- Convert a byte-encoded Project Gutenberg file into a text corpus;
- Convert a PDF document into a text corpus;
- Segment continuous text into a corpus of words.
Converting Popular Document Formats into Text
1. Create a local filepath from a web filename or a local filename
The following function will take either a local file name or a remote file URL and return a filepath object.
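As a rough illustration of the idea, here is a minimal sketch that uses only the Python standard library. The names `is_remote` and `open_path_or_url` are hypothetical, chosen for this example, and the sketch assumes that any string with an `http`, `https`, or `ftp` scheme is a remote file and that everything else is a local path.

```python
import io
import urllib.request
from typing import BinaryIO
from urllib.parse import urlparse


def is_remote(path_or_url: str) -> bool:
    """Return True when the string looks like an http(s) or ftp URL."""
    return urlparse(path_or_url).scheme in ("http", "https", "ftp")


def open_path_or_url(path_or_url: str) -> BinaryIO:
    """Return a binary file object for a local file name or a remote URL."""
    if is_remote(path_or_url):
        # Download the whole resource and wrap it in a seekable in-memory buffer.
        with urllib.request.urlopen(path_or_url) as response:
            return io.BytesIO(response.read())
    # Anything else is treated as a path on the local file system.
    return open(path_or_url, "rb")
```

Either way the caller gets back a binary, seekable file object, so downstream code that reads bytes does not need to know where the document came from. Reading the full remote file into memory keeps the sketch simple, but may not suit very large documents.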