Hands-on Tutorials

NLP — Working with Books

A guide to cleaning those dirty, dirty texts

Kourosh Alizadeh
Towards Data Science
6 min read · Jan 18, 2021


Working with text data always presents us with new challenges. Today I’m going to go over one particular kind of text data that can be extra challenging: books.

Books present a number of problems for us as data scientists, especially if they are taken from pdfs. The usual pdf readers like PyPDF may simply be unable to handle the pdf if the text is stored as an image rather than as actual text. And even if it is stored properly, there are likely to be a lot of random problems in the resulting text data.

Quine, I love you, but you need some help bud. — image by author

Here I’m going to go through a number of steps I use when working with books. The code is taken from a recent project where the goal was to classify over 50 philosophical texts into 10 schools of thought.

1. Getting text in the right format

First, you’ll need the text to be in the correct format. A .txt file is ideal. As I mentioned, PyPDF can read pdfs that actually store text, but a lot of pdfs are really scans, where the ‘text’ is just a picture of each page. To convert those, I recommend the aptly named pdftotext.com. A lot of these online services have low memory limits or low-quality results, but this one has been the best for me and works even for large books.
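For pdfs that do store real text, a few lines with the pypdf package (the current incarnation of the PyPDF family) are enough to dump everything into a .txt file. This is only a sketch, and ‘book.pdf’ is a placeholder path:

# a minimal sketch using pypdf; only works when the pdf stores real text, not scanned images
from pypdf import PdfReader

reader = PdfReader('book.pdf')  # placeholder path
raw_text = '\n'.join(page.extract_text() or '' for page in reader.pages)
with open('book.txt', 'w', encoding='utf-8') as f:
    f.write(raw_text)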

2. Clip the front and end matter

Now you’ve got your text data, open the file and take a look. You’ll probably notice that the start of the text is a lot of copyright material, and the end could be a long index or other irrelevant bit of text (this applies more to academic texts than to fiction).

Depending on your goals, you’ll probably want to remove that. Now, you could count the characters and cut the text there, but it’s easier to find (or insert) a unique marker in the text and use Python’s .split() to cut at that point.

Here’s an example:

russell_problems_of_phil = russell_problems_of_phil.split('n the following pages')[1]
russell_problems_of_phil = russell_problems_of_phil.split('BIBLIOGRAPHICAL NOTE')[0]

First you’ll split on the front marker, then take the second item of the list. Then split on the end marker, and take the first item. You can shave a few characters off at either point if you feel the need.

3. Skim the text and identify oddities

At this point, if you haven’t already, it’s a good idea to skim your text and get a feel for any odd patterns. This could be a header that keeps getting inserted in the middle of sentences. It could be strange characters like ‘~’ or ‘;’ appearing in place of sensible letters. When the program that scans your pdf images turns them into text, it does its best, but ‘ri’ does look a lot like ‘n’, a smudgy ‘i’ could well be read as ‘;’, and a capital ‘M’ can come out as ‘lll’. As you work with more of these you’ll start to see the common errors, and it’s good to know what tends to show up in your specific text, since each one can be a little different.
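A quick way to surface these oddities is to count every non-alphanumeric character in the raw text. This is just an illustrative snippet, not part of the original project:

# count unusual characters to see what needs cleaning (illustrative sketch)
from collections import Counter

odd_chars = Counter(c for c in raw_text if not c.isalnum() and not c.isspace())
print(odd_chars.most_common(30))  # rare, strange characters are prime cleaning targets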

4. Start cleaning at the character level

Get ready to make re.sub() your new best friend! We begin by just cutting all the strange characters that don’t mean anything to us.

result = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff\xad\x0c6§\\\£\Â*_<>""⎫•{}Γ~]', ' ', to_correct)

A lot of these are control characters or encoding artifacts, others are just strange (what the heck is ‘⎫’ supposed to mean?).

It’s also useful to make all the whitespaces one and the same — we don’t really care about linebreaks here.

result = re.sub(r'\s', ' ', to_correct)

Another common, but mostly meaningless, textual element is roman numerals. This regex is designed to remove them.

# first capitalized ones
result = re.sub(r'\s((I{2,}V*X*\.*)|(IV\.*)|(IX\.*)|(V\.*)|(V+I*\.*)|(X+L*V*I*\.*))\s', ' ', to_correct)
# then lowercase
result = re.sub(r'\s((i{2,}v*x*\.*)|(iv\.*)|(ix\.*)|(v\.*)|(v+i*\.*)|(x+l*v*i*\.*))\s', ' ', to_correct)

For my project, I actually removed all the numbers, not just the roman ones. That may or may not work for you, but it does do a nice job of removing page numbers. If you don’t do that, you’ll want to find another way to stop page numbers from breaking up otherwise normal sentences.
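If you do want to strip every digit, a single substitution does it; whether that’s appropriate depends on your corpus:

# remove all digits, page numbers included (may be too aggressive for some corpora)
result = re.sub(r'\d+', ' ', to_correct)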

Another good step to take is to consolidate different ways of abbreviating things so that they’ll be consistent across texts in your corpus. For example, turn all your ‘&’s into actual ‘and’s.

result = re.sub(r'&', 'and', to_correct)

A lot of this work will depend on your specific case and what you feel you need to keep; there could well be abbreviations specific to your field of study that you want to deal with in a special way.

A word of caution, however — when removing characters, always replace them with a whitespace (‘ ’). This will stop you from accidentally fusing words that were previously separated by some odd character. At the end of the cleaning, you can use another regex to easily remove extra whitespaces.

result = re.sub(r'\s+', ' ', to_correct)

5. Clean the text at the word level

It often happens that there are words or phrases in your text file that actually will not contribute meaningfully to your model. Many books have the title or chapter title at the top of every page, for example. When you convert that to a txt file, that header will be shoved in wherever pages end, which means often in the middle of a sentence. The same is true of page numbers and footnotes.

Sometimes you get lucky and the headers are in all caps — that’s easy to remove:

# this removes all strings of two or more capital letters
result = re.sub(r'[A-Z]{2,}', ' ', to_correct)

There could also just be a number of odd artifacts of pdf-to-txt transformation that you have to deal with in an ad hoc way. My process for dealing with these was as follows:

  • tokenize the text into sentences and enter it into a dataframe.
  • search in the dataframe for suspected problem words.
  • if the problem is common and easily isolatable, I add it to a general dictionary where the keys are problem regex patterns and the values are the correct strings to insert in their place.
  • the next time I build the dataframe, I use a for loop to apply every element of the cleaning dictionary before sentence-tokenizing (see the sketch after this list).
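Here’s a minimal sketch of that cleaning dictionary and loop; the patterns are hypothetical examples, not the ones from the project:

# hypothetical cleaning dictionary: keys are problem regex patterns, values are their replacements
cleaning_dict = {
    r'\blll\b': 'M',                     # OCR sometimes reads a capital 'M' as 'lll'
    r'THE PROBLEMS OF PHILOSOPHY': ' ',  # a running header dropped into the middle of sentences
}
for pattern, replacement in cleaning_dict.items():
    to_correct = re.sub(pattern, replacement, to_correct)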

6. Build a dataframe and examine it

Before this point, it didn’t make sense to build a dataframe of sentences, because the character-level cleaning affects how your sentence tokenizer splits the text (hopefully in a positive way). And while you may go back and do more cleaning as you look more at your data, at this point using a sentence tokenizer to build a dataframe of sentences is a good step.
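Here’s a minimal sketch of that step, assuming NLTK’s sentence tokenizer and pandas; the column names are just illustrative:

# build a dataframe of sentences (you may need nltk.download('punkt') the first time)
import pandas as pd
from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(to_correct)  # to_correct now holds the character-cleaned text
df = pd.DataFrame({'sentence': sentences})
df['sentence_length'] = df['sentence'].str.len()  # used for the short-sentence filter below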

With this built, you can start cleaning the data in a different way.

First, take a look at all the short sentences. These tend to be weird all-punctuation strings or otherwise meaningless fragments. I found 20 characters to be a good cut-off point and just dropped all shorter strings from my data.

df = df.drop(df[df['sentence_length'] < 20].index)

Another good step is to look at words that could indicate footnotes (‘ibid’ or ‘note’) or words that could indicate a foreign language that would mess up your model. If you’re dealing with academic texts, look for points where authors mention themselves — these are almost always footnotes added by editors (Aristotle doesn’t really talk about Aristotle a lot, believe it or not). And of course, check for duplicates, since those could easily be headers you missed or some other kind of oddity.
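A few illustrative pandas checks along those lines (the author name here is just an example):

# flag likely footnotes, editor insertions, and duplicate headers (illustrative checks)
footnote_suspects = df[df['sentence'].str.contains(r'\bibid\b|\bnote\b', case=False)]
self_mentions = df[df['sentence'].str.contains('Aristotle')]  # substitute your author's name
df = df.drop_duplicates(subset='sentence')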

7. Enjoy your clean data!

If you’ve done all of the above, your new book text should be relatively clean, especially compared to the tangled mess it might have been before you started. Congratulations!

Of course, the whole thing is an iterative process, and you’ll likely find some new oddity as you explore the data, but that’s all part of the fun. I’ve found word2vec modeling to be particularly good at exposing my cleaning flaws, since a rare, odd word tends to come out as highly ‘similar’ to the normal words it happened to appear next to that one time.
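If you want to try that check yourself, here’s a rough sketch using gensim; the parameters and the probe word are arbitrary, so pick a word you know is common in your corpus:

# train a quick word2vec model and inspect neighbors of a common word (rough diagnostic sketch)
from gensim.models import Word2Vec

tokenized = [s.lower().split() for s in df['sentence']]
model = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=2)
print(model.wv.most_similar('knowledge'))  # OCR junk appearing here points to leftover noise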

I hope this article has been helpful; I know I wish I’d had someone to guide me through working with all these mangled books. If you’d like to see the complete code I used to clean my texts, you can visit the repo here.


Kourosh Alizadeh is a data scientist, author and philosopher. He holds a PhD from UCI and works at the intersection of data, philosophy, and logic.