Powerful Text Augmentation Using NLPAUG

Dealing with class imbalances in NLP classification problems through Text Augmentation techniques using the amazing NLPAug

Raj Sangani
Towards Data Science


Photo by Brett Jordan on Unsplash

What is Data Augmentation and why should we care about it?

Data Augmentation is the practice of synthesizing new data from the data at hand. It can be applied to any form of data, from numbers to images, and the augmented data is usually similar to the data that is already available. In any Machine Learning problem, the dataset determines how well the problem can be solved. Sometimes we don’t have enough data to build robust models, and even more common is data with a palpable class imbalance. Say we are building a model that predicts one of two classes, but we have 5,000 training samples of one class and only 200 of the other. In such a case our model will almost always predict the majority class, since it has not been given enough data to discern between the two. We must then turn to collecting more data, but what if we can’t? One option is to generate exact copies of the 200 samples that we have and decrease the imbalance. Although this does provide some improvement, the model is still learning from the same set of features! Perhaps a few artful tweaks can improve the quality of the data we have.
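The naive duplication approach described above can be sketched in a few lines of Python (a toy illustration, not code from this article’s notebook):

```python
import random
from collections import Counter

def oversample_by_duplication(samples, labels, minority_label, seed=0):
    """Balance classes by appending random exact copies of minority samples."""
    counts = Counter(labels)
    deficit = max(counts.values()) - counts[minority_label]
    minority = [s for s, y in zip(samples, labels) if y == minority_label]
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(deficit)]
    return samples + extra, labels + [minority_label] * deficit

texts = ["great"] * 5 + ["bad"] * 2
labels = [1] * 5 + [0] * 2
X, y = oversample_by_duplication(texts, labels, minority_label=0)
# Both classes now have 5 samples, but the new rows are exact copies,
# so the model sees no new features.
```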

Why is Text Augmentation difficult compared to other forms of Data Augmentation?

Think about it: to augment images we can simply rotate, sharpen, or crop different areas of the images, and the new data will still make some sense.

However, augmenting text data is much harder. For instance, changing the order of words may at first seem plausible, but it can completely alter the meaning of a sentence: “I had my car cleaned” is different from “I had cleaned my car.”

Luckily, Edward Ma’s nlpaug gives us some amazing tools to augment text quickly. Let’s talk about some of them.

Ways to Augment Text Data

  1. Replace a few words with their synonyms.
  2. Replace a few words with words whose word embeddings (like word2vec or GloVe) are similar to theirs, based on cosine similarity.
  3. Replace words based on the context using powerful transformer models (BERT).
  4. Use Back Translation: translate a sentence into another language and then back into the original language, which often modifies a few words.
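To make the first technique concrete, here is a toy synonym-replacement sketch using a tiny hand-written synonym table (nlpaug, shown later, does this properly by querying WordNet):

```python
import random

# Toy synonym table; a real augmenter would look these up in WordNet.
SYNONYMS = {
    "bad": ["terrible", "awful"],
    "coffee": ["brew", "espresso"],
    "reviews": ["write-ups"],
}

def synonym_augment(sentence, max_swaps=2, seed=0):
    """Replace up to max_swaps words with a randomly chosen synonym."""
    rng = random.Random(seed)
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    for i in rng.sample(candidates, min(max_swaps, len(candidates))):
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)

print(synonym_augment("bad coffee and bad reviews"))
```

Because only whole words from the table are swapped, the sentence keeps its length and rough meaning while the surface features change.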

Let’s look at the first one and apply it to a problem to see if augmentation really works.

Defining the Problem

I am going to work through a sentiment analysis problem using the Yelp Coffee reviews dataset from Kaggle. The dataset contains close to 7,000 user reviews and ratings. The users have rated the coffee stores from 1 to 5, the higher the better. To create some imbalance, I have discarded the neutral ratings (3) and labelled all ratings greater than 3 as positive and all those less than 3 as negative. Here is an example review before preprocessing.
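The labelling scheme can be sketched as follows (a minimal illustration, not the exact notebook code):

```python
def rating_to_label(rating):
    """Map a 1-5 star rating to a binary sentiment label.

    Ratings above 3 are positive (1), below 3 are negative (0),
    and the neutral rating 3 is discarded (None).
    """
    if rating == 3:
        return None
    return 1 if rating > 3 else 0

ratings = [5, 4, 3, 2, 1]
labels = [rating_to_label(r) for r in ratings if r != 3]
# → [1, 1, 0, 0]
```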

Evaluation Metrics

I am not a big fan of the accuracy metric when it comes to imbalanced classification, so I will mainly look at the area under the ROC curve as the evaluation metric.
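For reference, the area under the ROC curve equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, which a small sketch can compute directly:

```python
def roc_auc(y_true, y_score):
    """AUC via its rank interpretation: the probability that a random
    positive is scored above a random negative (ties count as 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
```

Unlike accuracy, this score is unaffected by how lopsided the class counts are, which is why it suits an imbalanced problem like this one.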

Model Used in both cases

I have used a Random Forest Classifier with 10 estimators after some text preprocessing and cleaning. The text was vectorized using the TfidfVectorizer, and the best 3,000 feature vectors based on term frequency were chosen as input features to the classifier. The model is pretty straightforward, and I have linked the entire code below. For eager readers who would like to refer to the code first, here it is!
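A minimal sketch of that setup, assuming scikit-learn (the exact preprocessing steps in the linked notebook may differ):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# max_features=3000 keeps the 3,000 terms with the highest corpus
# frequency; the forest uses 10 trees as described above.
model = make_pipeline(
    TfidfVectorizer(max_features=3000),
    RandomForestClassifier(n_estimators=10, random_state=42),
)

texts = ["great coffee, lovely staff", "worst coffee ever",
         "amazing vibe and espresso", "sorely disappointing place"]
labels = [1, 0, 1, 0]
model.fit(texts, labels)
```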

Results WITHOUT Text Augmentation

First, let’s have a look at the class imbalance. As you can see, the ratio of positives to negatives is almost 6.5:1.

The huge imbalance between the classes (Image By Author)

So here are the results after training the Random Forest Classifier on the dataset.

AUC For the RandomForest(n=10) Classifier

As you can see, the Area Under the Curve is 0.85. To see that the model is performing badly only on the class with the smaller number of samples, have a closer look at the classification report.

Detailed Classification Report

The model’s recall and f1-score on the negative class (labelled 0) are absolutely terrible! This explains why the AUC is only 0.85. Now, with minimal effort, we will improve on this AUC and f1-score using nlpaug.

Results after Augmentation

An example of how we can augment text by replacing words with synonyms

Please install nlpaug using pip. Refer to my notebook to see the entire code; the following is just a small snippet.

pip install nlpaug

After this, I will be using the WordNet lexical database to supply the synonyms.

Let’s pick a sentence from the dataset — “Misleading reviews. Worst coffee ever had, and sorely disappointing vibe.”

import nlpaug.augmenter.word as naw

# Replace at most 2 words per sentence with a WordNet synonym
aug = naw.SynonymAug(aug_src='wordnet', aug_max=2)

# Generate 2 augmented variants of the sentence
aug.augment("Misleading reviews. Worst coffee ever had, and sorely disappointing vibe.", n=2)

In the above code, aug_max indicates the maximum number of words we want to replace with their corresponding synonyms. In the last line, n=2 indicates that we want to generate 2 augmented sentences.

Here are the amazing results!

‘Misleading review article. Worst coffee ever get, and sorely disappointing vibe.’
‘Lead astray reviews. Worst coffee ever had, and sorely dissatisfactory vibe.’

The augmenter replaced “reviews” with “review article” and “had” with “get” in the first sentence; in the second sentence, “misleading” and “disappointing” were replaced with “lead astray” and “dissatisfactory” respectively.

How I decided to augment the data

I decided to generate 2 new augmented sentences for each sentence in the training set with the label 0 (negative reviews belonging to the minority class). In each of these augmented sentences, a maximum of 3 words were replaced with their synonyms. You can play around with these parameters yourself and have some fun.
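The oversampling loop looks roughly like this; `augment_fn` stands in for a call such as `aug.augment(text, n=2)` with the SynonymAug instance above (a sketch, not the exact notebook code):

```python
def augment_minority(texts, labels, minority_label, augment_fn):
    """Append augmented copies of every minority-class sample.

    augment_fn(text) should return a list of augmented variants,
    e.g. lambda t: aug.augment(t, n=2) with an nlpaug SynonymAug.
    """
    new_texts, new_labels = list(texts), list(labels)
    for text, label in zip(texts, labels):
        if label == minority_label:
            for variant in augment_fn(text):
                new_texts.append(variant)
                new_labels.append(label)
    return new_texts, new_labels

# Dummy augmenter for illustration only: two trivial case variants.
demo_fn = lambda t: [t.upper(), t.title()]
X, y = augment_minority(["good", "bad"], [1, 0], 0, demo_fn)
# Each negative sample becomes three: the minority class triples in size.
```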

Here is the distribution after augmenting the data. The minority class has tripled in size with some meaningful new data!

(Image By Author)

And now the climax we have been waiting for…

AUC after Augmentation!
Classification Report After Augmentation

WOW! We improved the AUC from 0.85 to 0.88 and the f1-score from 0.7 to 0.76.

Comparing the ROC Curves before and after augmenting

(Left)- ROC Curve for Data before Augmentation (Right)- ROC curve for Data after Augmentation (Image By Author)

Although the difference is slight, the ROC curve on the right covers more area and is better. What’s even better is that the augmentation took only 57 seconds on a CPU. This shows how beneficial Text Augmentation can be.

Points To Note

  1. At no point during the experiment did we change the model or tune it; all the improvement in performance was due solely to data augmentation.
  2. The nlpaug library provides even more powerful augmentation options using word embeddings, BERT Transformers, and Back Translation. The one we used is the cheapest option in terms of storage and execution speed!
  3. The augmentation took under a minute on the CPU which is pretty rapid.
  4. Please explore the link in the references for more ways to augment data.
  5. Finally, we improved the AUC from 0.85 to 0.88 with very little effort (in terms of time) and fewer than 5 lines of code.

References

  1. The official GitHub repository of nlpaug, which contains example notebooks

Check out my GitHub for some other projects and the entire code. You can contact me on my website. I would love some feedback in the comments. Thank you for your time!
