NLP for Beginners: Cleaning & Preprocessing Text Data

Rachel Koenig
Towards Data Science
7 min read · Jul 29, 2019


NLP is short for Natural Language Processing. As you probably know, computers are not as great at understanding words as they are at understanding numbers. This is all changing, though, as advances in NLP happen every day. The fact that devices like Apple’s Siri and Amazon’s Alexa can (usually) comprehend when we ask for the weather, for directions, or to play a certain genre of music is an example of NLP at work. The spam filter in your email and the spellcheck you’ve used since you learned to type in elementary school are other basic examples of your computer understanding language.

As data scientists, we may use NLP for sentiment analysis (classifying words as having positive or negative connotations) or to make predictions in classification models, among other things. Typically, whether we’re given the data or have to scrape it, the text will be in its natural human format of sentences, paragraphs, tweets, etc. From there, before we can dig into analyzing it, we have to do some cleaning to break the text down into a format the computer can easily understand.

For this example, we’re examining a dataset of Amazon products and reviews, which can be found and downloaded for free on data.world. I’ll be using Python in a Jupyter notebook.

Here are the imports used:
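The original import cell isn’t reproduced here, but based on the libraries used throughout this walkthrough it was likely close to this sketch:

```python
import pandas as pd
import string                                   # gives us string.punctuation later on

import nltk
from nltk.corpus import stopwords               # stop word lists
from nltk.tokenize import RegexpTokenizer       # regex-based tokenizer
from nltk.stem import WordNetLemmatizer, PorterStemmer   # lemmatizing and stemming
```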

(You may need to run nltk.download() in a cell if you’ve never previously used it.)

Read in the csv file, create a DataFrame, and check the shape. We are starting out with 10,000 rows and 17 columns. Each row is a different product on Amazon.
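A minimal version of that step; the filename is a placeholder for whatever you saved the data.world download as:

```python
df = pd.read_csv('amazon_products.csv')   # placeholder filename for the data.world download
df.shape                                  # (10000, 17): one row per Amazon product
```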

I conducted some basic data cleaning that I won’t go into detail about now, but you can read my post about EDA here if you want some tips.

In order to make the dataset more manageable for this example, I first dropped columns with too many nulls and then dropped any remaining rows with null values. I changed the number_of_reviews column type from object to integer and then created a new DataFrame using only the rows with no more than 1 review. My new shape is 3,705 rows and 10 columns and I renamed it reviews_df.
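Roughly, that cleanup looks like the sketch below; the null threshold and the exact string cleanup on number_of_reviews are assumptions on my part:

```python
# Drop columns that are mostly null, then drop any remaining rows with nulls
df = df.dropna(axis=1, thresh=len(df) // 2)
df = df.dropna()

# number_of_reviews was read in as object (text), so strip non-digits and convert to int
df['number_of_reviews'] = (df['number_of_reviews']
                           .str.replace(r'\D', '', regex=True)
                           .astype(int))

# Keep only the rows with no more than 1 review to shrink the dataset for this demo
reviews_df = df[df['number_of_reviews'] <= 1].copy()
reviews_df.shape   # roughly (3705, 10)
```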

NOTE: If we were actually going to use this dataset for analysis or modeling or anything besides a text preprocessing demo, I would not recommend eliminating such a large percent of the rows.

The following workflow is what I was taught to use and like using, but the steps are just general suggestions to get you started. Usually I have to modify and/or expand depending on the text format.

  1. Remove HTML
  2. Tokenization + Remove punctuation
  3. Remove stop words
  4. Lemmatization or Stemming

While cleaning this data I ran into a problem I had not encountered before, and learned a cool new trick from geeksforgeeks.org to split a string from one column into multiple columns either on spaces or specified characters.

The column I am most interested in is customer_reviews; however, upon taking a closer look, it currently has the review title, rating, review date, customer name, and review text all in one cell, separated by //.

The Pandas .str.split method can be applied to a Series. The first parameter is the separator you want to split on, n is the maximum number of splits, and expand=True puts each resulting section into its own new column. I set the new columns equal to a new variable called reviews.

Then you can rename the new 0, 1, 2, 3, 4 columns in the original reviews_df and drop the original messy column.
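Putting those two steps together, with hypothetical names for the five new columns:

```python
# Split customer_reviews on '//' into at most 5 pieces, each in its own column (0-4)
reviews = reviews_df['customer_reviews'].str.split('//', n=4, expand=True)

# Rename the 0-4 columns to something readable, then swap them in for the messy original
reviews.columns = ['review_title', 'rating', 'review_date', 'customer_name', 'review']
reviews_df = reviews_df.drop(columns=['customer_reviews']).join(reviews)
```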

I ran the same method over the new customer_name column to split on the \n \n and then dropped the first and last columns to leave just the actual customer name. There is a lot more we could do here if this were a longer article! Right off the bat, I can see the names and dates could still use some cleaning to put them in a uniform format.
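That step might look something like this; keeping the middle piece as the actual name is an assumption based on the description above:

```python
# customer_name has '\n \n' padding around the name; split it apart and keep the middle piece
names = reviews_df['customer_name'].str.split('\n \n', expand=True)
reviews_df['customer_name'] = names[1]
```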

Removing HTML is a step I did not do this time; however, if your data is coming from a web scrape, it is a good idea to start with it. This is the function I would have used.
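The function itself isn’t reproduced here; a common version uses BeautifulSoup’s get_text, roughly like this:

```python
from bs4 import BeautifulSoup

def remove_html(text):
    """Strip HTML tags and return only the visible text."""
    return BeautifulSoup(text, 'html.parser').get_text()

# Applied to a text column it would look like:
# reviews_df['review'] = reviews_df['review'].apply(lambda x: remove_html(x))
```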

Pretty much every step going forward involves creating a function and then applying it to a Series. Be prepared: lambda functions will very shortly be your new best friend! You could also build one function to do all of these steps in one go, but I wanted to show the breakdown and make them easier to customize.

Remove punctuation:

One way of doing this is by looping through the Series with a list comprehension and keeping everything that is not in string.punctuation, a string of all punctuation characters that comes with the string module we imported at the beginning.
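A sketch of that approach as a function applied with a lambda (the review column name is my assumption):

```python
def remove_punctuation(text):
    # Keep every character that is NOT punctuation, then glue the characters back together
    no_punct = "".join([char for char in text if char not in string.punctuation])
    return no_punct

reviews_df['review'] = reviews_df['review'].apply(lambda x: remove_punctuation(x))
```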

"".join will join the list of characters back together into words, with no spaces added between them. If you scroll up, you can see where this text previously had commas, periods, etc.

However, as you can see in the second line of output above, this method does not account for user typos. The customer had typed “grandson,am”, which then became the single word “grandsonam” once the comma was removed. I still think this is handy to know in case you ever need it, though.

Tokenize:

This breaks up the strings into a list of words or pieces based on a specified pattern, using Regular Expressions, aka RegEx. The pattern I chose to use this time (r'\w+') also removes punctuation and is a better option for this data in particular. We can also add .lower() in the lambda function to make everything lowercase.
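A sketch of the tokenizing step, again assuming the text lives in reviews_df['review']:

```python
tokenizer = RegexpTokenizer(r'\w+')   # grab runs of word characters; punctuation falls away

reviews_df['review'] = reviews_df['review'].apply(lambda x: tokenizer.tokenize(x.lower()))
```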

See in line 2: “grandson” and “am” are now separate.

Some other examples of RegEx are:

‘\w+|\$[\d\.]+|\S+’ = keeps words and dollar amounts together as tokens, so a period inside a number like $79.99 stays attached to its digits while other punctuation becomes its own token

‘\s+’, gaps=True = grabs everything except spaces as a token

‘[A-Z]\w+’ = only words that begin with a capital letter.
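To see how those alternatives behave, here is a quick comparison on a made-up sentence:

```python
sample = "The Kindle costs $79.99 and Grandson loves it."

RegexpTokenizer(r'\w+|\$[\d\.]+|\S+').tokenize(sample)
# ['The', 'Kindle', 'costs', '$79.99', 'and', 'Grandson', 'loves', 'it', '.']

RegexpTokenizer(r'\s+', gaps=True).tokenize(sample)
# ['The', 'Kindle', 'costs', '$79.99', 'and', 'Grandson', 'loves', 'it.']

RegexpTokenizer(r'[A-Z]\w+').tokenize(sample)
# ['The', 'Kindle', 'Grandson']
```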

Remove stop words:

We imported a list of the most frequently used words from the Natural Language Toolkit (NLTK) at the beginning with from nltk.corpus import stopwords. You can run stopwords.words('<language>') to get the full list for any supported language. There are 179 English stop words, including ‘i’, ‘me’, ‘my’, ‘myself’, ‘we’, ‘you’, ‘he’, and ‘his’, for example. We usually want to remove these because they have low predictive power. There are occasions when you may want to keep them, though: for example, if your corpus is very small and removing stop words would decrease the total number of words by a large percentage.
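A sketch of the stop word filter, following the same function + lambda pattern:

```python
stop_words = stopwords.words('english')   # the 179 English stop words

def remove_stopwords(tokens):
    # Keep only the tokens that are not in the stop word list
    return [word for word in tokens if word not in stop_words]

reviews_df['review'] = reviews_df['review'].apply(lambda x: remove_stopwords(x))
```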

Stemming & Lemmatizing:

Both tools shorten words back to their root form. Stemming is a little more aggressive: it cuts off prefixes and/or suffixes of words based on common patterns. It can sometimes be helpful, but not always, because often the new word is stripped down so far that it loses its actual meaning. Lemmatizing, on the other hand, maps related word forms onto one base, and unlike stemming it always returns a proper word that can be found in the dictionary. I like to compare the two to see which one works better for what I need. I usually prefer the lemmatizer, but surprisingly, this time, stemming seemed to have more of an effect.
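Hedged sketches of both, starting with the lemmatizer (column name assumed as before):

```python
lemmatizer = WordNetLemmatizer()

def word_lemmatizer(tokens):
    # Map each token to its dictionary base form, e.g. 'feet' -> 'foot'
    return [lemmatizer.lemmatize(word) for word in tokens]

reviews_df['review'].apply(lambda x: word_lemmatizer(x)).head()
```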

Lemmatizer: you can barely even see a difference.
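And a sketch of the stemmer following the same pattern:

```python
stemmer = PorterStemmer()

def word_stemmer(tokens):
    # Chop each token down to its stem, e.g. 'loved' -> 'love', 'easily' -> 'easili'
    return " ".join([stemmer.stem(word) for word in tokens])

reviews_df['review'] = reviews_df['review'].apply(lambda x: word_stemmer(x))
```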

You see more of a difference with Stemmer so I will keep that one in place. Since this is the final step, I added " ".join() to the function to join the lists of words back together.

Now your text is ready to be analyzed! You could go on to use this data for sentiment analysis, or use the rating or manufacturer columns as a target variable based on word correlations. Maybe build a recommender system based on user purchases or item reviews, or try customer segmentation with clustering. The possibilities are endless!
