NLP: Text Data Cleaning and Preprocessing

Maria Dobko
Sep 22, 2018

Data scientists spend most of their time not on modeling but on cleaning and exploring the data. Moreover, different approaches to text cleaning can lead to very different results during model training. In this post, I describe a pipeline for filtering and parsing textual data and provide some code examples in Python. The dataset for this task is Amazon Fine Food Reviews, which you can download here. Important: this dataset is not particularly messy, so from time to time throughout this tutorial I will add examples of how unclean text could look and show ways of processing it.

Step 1. Sneak peek into the data

At this step, our task is to look at the data and explore its main characteristics: its size, its structure (how sentences, paragraphs, and texts are built), and, finally, how much of the data is useful for our needs. We start by reading the data and doing some statistical analysis.
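A minimal sketch of this first look with pandas, assuming the Kaggle CSV is saved as Reviews.csv:

```python
import pandas as pd

# Load the Amazon Fine Food Reviews dataset
# (file name assumed to be Reviews.csv, as distributed on Kaggle)
df = pd.read_csv('Reviews.csv')

print(df.shape)    # number of rows and columns
print(df.columns)  # available features
print(df.head())   # first few reviews
```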

As soon as we see the data, we can tell which columns matter most for a particular task. In this case, we are more interested in “Summary” and “Text” than in the other features. Now it’s time to check whether any of these columns have missing data.
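For example, with pandas:

```python
# Count missing values in the columns we care about
print(df[['Summary', 'Text']].isnull().sum())
```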

Possible statistical features about textual data can look like this:
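One way to compute such statistics (the derived column names here are my own):

```python
# Basic per-review statistics: word count, character count, average word length
df['word_count'] = df['Text'].str.split().str.len()
df['char_count'] = df['Text'].str.len()
df['avg_word_len'] = df['char_count'] / df['word_count']

print(df[['word_count', 'char_count', 'avg_word_len']].describe())
```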

You can also explore the vocabulary and see how many unique words there are:
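A simple count with collections.Counter:

```python
from collections import Counter

# Build a vocabulary over all reviews
vocab = Counter()
for text in df['Text'].dropna():
    vocab.update(text.lower().split())

print('Unique words:', len(vocab))
print(vocab.most_common(10))  # the most frequent words
```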

Further, these words can be clustered using word2vec or other pre-trained embeddings to create a picture of what kind of categories are in this dataset.
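A rough sketch of that idea, training word2vec with gensim and clustering the word vectors with k-means (all parameter values here are illustrative, and the whitespace tokenization is deliberately naive):

```python
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Naive tokenization of the reviews
sentences = [text.lower().split() for text in df['Text'].dropna()]

# Train word2vec embeddings on the corpus
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5)

# Cluster the word vectors to get a rough picture of the categories
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
labels = kmeans.fit_predict(model.wv.vectors)

# Inspect a few words from one cluster
words = model.wv.index_to_key
print([w for w, l in zip(words, labels) if l == 0][:20])
```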

Step 2. Digits/Punctuation/Symbols

Significant information can be hidden in symbols: a dollar sign, for example, can help distinguish economics- and business-oriented texts from novels. In most cases, however, you’ll want to get rid of them.
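A basic filter that strips digits, punctuation, and other symbols might look like this:

```python
import re

def remove_digits_punct(text):
    # Replace everything except letters and whitespace with a space
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    # Collapse repeated whitespace
    return re.sub(r'\s+', ' ', text).strip()

print(remove_digits_punct('I paid $12.99 for 2 boxes!'))
# -> 'I paid for boxes'
```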

Sometimes punctuation can be more difficult, for example:

“<b>This review was provided by</b> <h2>I am a full-time student</h2>”

In this sentence, deleting all non-letter symbols is not an option, since it would join letters that are not part of the same word (…provided bybI am a full-time…), while substituting these signs with “ ” would create new words that do not actually exist (…provided by b I am a full-time…). The solution is to add text-dependent constraints to the symbol filtering, as in the sketch below.
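One such constraint: strip HTML tags as whole units before any character-level filtering. A simple regex sketch (a parser such as BeautifulSoup would be more robust):

```python
import re

def strip_html(text):
    # Remove whole tags such as <b> or </h2> before other filtering,
    # so their letters never merge into neighboring words
    return re.sub(r'<[^>]+>', ' ', text)

s = '<b>This review was provided by</b> <h2>I am a full-time student</h2>'
print(re.sub(r'\s+', ' ', strip_html(s)).strip())
# -> 'This review was provided by I am a full-time student'
```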

Step 3. Pay attention to the Language

Is there a need to translate sentences? If other languages appear only rarely, the best option may be simply to get rid of those samples. Otherwise, there are several tools and packages that can help you translate the text.

One of them is the Google Translate API.
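A sketch using the unofficial googletrans package (its API has changed between versions, and recent releases are asynchronous, so treat this as illustrative):

```python
from googletrans import Translator  # unofficial Google Translate client

translator = Translator()
result = translator.translate('Este café es excelente', dest='en')
print(result.text)  # e.g. 'This coffee is excellent'
```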

Another option is to filter out all Cyrillic letters:
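For example, with a regular expression over the Cyrillic Unicode block:

```python
import re

def remove_cyrillic(text):
    # Drop all characters in the Cyrillic Unicode block (U+0400–U+04FF)
    return re.sub(r'[\u0400-\u04FF]+', '', text)

print(remove_cyrillic('good хороший tea чай'))
# -> 'good  tea ' (leftover spaces can be collapsed afterwards)
```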

If you decide to simply delete all words that don’t belong to the dictionary of the target language, this is how you can do it:
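A sketch using the NLTK word list (note the caveat below about domain-specific vocabulary):

```python
import nltk
from nltk.corpus import words

nltk.download('words')
english_vocab = set(w.lower() for w in words.words())

def keep_english(text):
    # Keep only tokens found in the NLTK English word list
    return ' '.join(w for w in text.split() if w.lower() in english_vocab)

print(keep_english('this tea is zzzqx delicious'))
# -> 'this tea is delicious'
```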

Pay attention: when working with a specific type of text, such as scientific papers, you will encounter a domain-specific vocabulary. There is thus a risk that, by deleting words that are not in the English dictionary, you will accidentally delete words that do exist but have not been added to it yet, because they are domain-specific or were coined recently. For example, words such as CycleGAN and CNN are not in NLTK’s WordNet.

Step 4. Stopwords

It is common, before modeling, to run a simple TF-IDF analysis to see the importance of certain words. But there are always words that appear very often yet carry little or no information about the text; these are called stopwords. They include “is”, “a”, “the”, “at”, etc. This is how you delete them:
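For example, with NLTK’s English stopword list:

```python
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # Drop tokens that appear in the stopword list
    return ' '.join(w for w in text.split() if w.lower() not in stop_words)

print(remove_stopwords('This is a great product at a great price'))
# -> 'great product great price'
```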

In some cases, you will need to filter out certain predefined words that can introduce unwanted bias during modeling: for example, names, places, or other words that can make the algorithm group the data samples according to a single feature. Then your job is to create your own list of stopwords. The most popular stopword lists are included in nltk.corpus, the Onix text retrieval toolkit, ranks.nl, CoreNLP, and many others.
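A minimal sketch of extending the NLTK list with custom entries (the extra words here are illustrative):

```python
from nltk.corpus import stopwords

# Extend the standard list with domain-specific words
custom_stopwords = set(stopwords.words('english'))
custom_stopwords.update(['amazon', 'product', 'br'])

filtered = [w for w in 'I love this Amazon product'.lower().split()
            if w not in custom_stopwords]
print(filtered)  # -> ['love']
```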

That’s it for now. I hope this article was useful, thank you for reading!

P.S. All pictures are from Gravity Falls — really cool TV show :)
