TextGenie - Augmenting your text dataset with just 2 lines of code!

Het Pandya
Towards Data Science
6 min read · Jun 26, 2021


TextGenie Logo - Image by Author

Often while developing Natural Language Processing models, we find it difficult to find relevant data, and even harder to find it in large amounts.

Previously, while developing our Intent Classifier, we used the CLINC150 Dataset, which has 100 samples for each of 150 different classes. But what if we needed even more samples? Another similar scenario was when I was working on a contextual assistant with Rasa. While creating the training data from scratch, I’d have to imagine different samples for each intent or ask my friends for help. Each class might need a healthy number of samples depending upon the domain.

This is when I landed upon the idea of creating TextGenie, a library to augment text data. The Python package is open sourced at my GitHub repo. Let’s see the library in action.

How does the library work?

So far, the library uses the following approaches to augment text data:

Paraphrasing using T5

Paraphrasing with deep learning can generate a sizeable and varied number of samples. We shall use this T5 model from Hugging Face to generate the paraphrases.

BERT Mask Filling

To augment text using mask filling, we first find the words that can be masked. For that, we shall use spaCy to extract keywords from a sentence. Once the keywords are found, each is replaced with a mask and fed to the BERT model, which predicts a word to take the masked word’s place.

Converting sentence to active voice

In addition, we also check if a sentence is in passive voice. If so, it is converted to active voice.

Installation

Install the library using:

pip install textgenie

Usage

Let’s initialize the augmentor from the TextGenie class using the code below.
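A minimal sketch, assuming the constructor parameters described in the list that follows; the checkpoint names are illustrative examples, not the only valid choices:

from textgenie import TextGenie

# Load a T5 paraphrase model and, optionally, a BERT mask filling model.
# The checkpoint names here are illustrative examples.
textgenie = TextGenie(
    paraphrase_model_name="ramsrigouthamg/t5_paraphraser",
    mask_model_name="bert-base-uncased",
    spacy_model_name="en",
    device="cpu",
)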

Here, along with the paraphrasing model, I’ve mentioned the name of the BERT model, which is set to None by default but can be enabled by mentioning the name of the model to use. It is advised to use the mask filling method as well, since it will help generate more data.

You can find the full parameters list for the TextGenie object below:

  • paraphrase_model_name: The name of the T5 paraphrase model. Edit: A list of pretrained models for paraphrase generation can be found here.
  • mask_model_name: The name of an optional BERT model that will be used to fill masks. The default value is set to None, so mask filling is disabled by default, but it can be enabled by mentioning the name of the BERT model to be used. A list of mask filling models can be found here.
  • spacy_model_name: The name of the Spacy model. Available models can be found here. The default value is set to en. Note that even though a Spacy model name is set, it won’t be used if mask_model_name is set to None.
  • device: The device where the model will be loaded. The default value is set to cpu.

Text augmentation with paraphrases using T5:
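Here’s a minimal sketch of a paraphrasing call; the input sentence and the "paraphrase: " prefix are illustrative assumptions:

# Ask the T5 model for up to 5 paraphrases of the input sentence.
paraphrases = textgenie.augment_sent_t5(
    "I want to buy a new phone.",
    prefix="paraphrase: ",
    n_predictions=5,
)
print(paraphrases)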

The parameter list for the augment_sent_t5() method is as follows:

  • sent: The sentence on which augmentation has to be applied.
  • prefix: The prefix for the T5 model input.
  • n_predictions: The number of augmentations the function should return. The default value is set to 5.
  • top_k: The number of predictions the T5 model should generate. The default value is set to 120.
  • max_length: The maximum length of the sentence to feed to the model. The default value is set to 256.

Text augmentation with BERT mask filling:
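A minimal sketch of a mask filling call; the input sentence is an illustrative assumption:

# Replace extracted keywords with masks and let BERT predict substitutes.
augmented = textgenie.augment_sent_mask_filling(
    "I want to buy a new phone.",
    n_mask_predictions=5,
)
print(augmented)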

Note: While using this method, please take care of punctuation.

Please find the parameters list for the augment_sent_mask_filling() method below:

  • sent: The sentence on which augmentation has to be applied.
  • n_mask_predictions: The number of predictions the BERT mask filling model should generate. The default value is set to 5.

Converting sentence to active voice
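A minimal sketch with two illustrative sentences of my own:

# The first sentence is already in active voice and is returned unchanged.
print(textgenie.convert_to_active("I am reading a book."))
# The second sentence is in passive voice and gets converted to active voice.
print(textgenie.convert_to_active("A book is being read by me."))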

Here, the sentence in the first example was already in active voice, so it is returned as it was, while in the second example, the sentence was converted from passive to active voice.

Following is the parameter needed for the convert_to_active() method:

  • sent: The sentence that has to be converted.

magic_once: Wrapping all the methods into a single method

To limit the space occupied by the outputs, I’ve set the prediction values to smaller numbers. Please feel free to play with them! 😉
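A minimal sketch, again with an illustrative sentence and prefix:

# Run paraphrasing, mask filling, and active voice conversion in one call.
augmented = textgenie.magic_once(
    "I want to buy a new phone.",
    paraphrase_prefix="paraphrase: ",
    n_paraphrase_predictions=3,
    n_mask_predictions=2,
    convert_to_active=True,
)
print(augmented)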

As it appears, the textgenie.magic_once() method merges the functionality of all the techniques mentioned above.

Since this method operates on individual text data, it can be merged easily with other frameworks which require data augmentation.

The full list of parameters for the magic_once() is:

  • sent: The sentence that has to be augmented.
  • paraphrase_prefix: The prefix for the T5 model input.
  • n_paraphrase_predictions: The number of augmentations the T5 model should return. The default value is set to 5.
  • paraphrase_top_k: The total number of predictions the T5 model should generate. The default value is set to 120.
  • paraphrase_max_length: The maximum length of the sentence to feed to the model. The default value is set to 256.
  • n_mask_predictions: The number of predictions the BERT mask filling model should generate. The default value is set to None.
  • convert_to_active: Whether the sentence should be converted to active voice. The default value is set to True.

Time for the Magic Lamp!

Now that we’ve talked about individual sentences, let’s bring the genie’s magic to the whole dataset! 😋

The magic_lamp() method takes a whole dataset and generates a txt or tsv file containing the augmented data, depending upon the input. If the input is a Python List or a txt file, the augmented output is stored in a txt file named sentences_aug.txt and a Python List containing the augmented data is returned. If the data is in a csv or tsv file, a tsv file named original_file_name_aug.tsv containing the augmented data is saved and a pandas DataFrame is returned. If these files contain labeled data, the augmented data is returned along with the corresponding labels.

First of all, we’ll need a dataset to work on. I’ve taken 300 samples to play with from the SMS Spam Collection Data Set, which you can download from here.

Augmenting the dataset

Download hamspam.tsv to your current working directory and run the following code.
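A minimal sketch, assuming the tsv file has no header row; the column names ["label", "text"] are my own choice, matched by the label_column and data_column arguments:

# Augment the labeled dataset. A file named hamspam_aug.tsv will be saved
# and a pandas DataFrame containing the augmented data will be returned.
df_aug = textgenie.magic_lamp(
    "hamspam.tsv",
    paraphrase_prefix="paraphrase: ",
    n_paraphrase_predictions=5,
    n_mask_predictions=2,
    column_names=["label", "text"],
    label_column="label",
    data_column="text",
)
print(df_aug.head())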

The full parameter list for the magic_lamp() method is as follows:

  • sentences: The dataset that has to be augmented. This can be a Python List or a txt, csv, or tsv file.
  • paraphrase_prefix: The prefix for the T5 model input.
  • n_paraphrase_predictions: The number of augmentations the T5 model should return. The default value is set to 5.
  • paraphrase_top_k: The total number of predictions the T5 model should generate. The default value is set to 120.
  • paraphrase_max_length: The maximum length of the sentence to feed to the model. The default value is set to 256.
  • n_mask_predictions: The number of predictions the BERT mask filling model should generate. The default value is set to None.
  • convert_to_active: Whether the sentences should be converted to active voice. The default value is set to True.
  • label_column: The name of the column that contains the labels. The default value is set to None. This parameter does not need to be set if the dataset is a Python List or a txt file.
  • data_column: The name of the column that contains the data. The default value is set to None. This parameter too is not required if the dataset is a Python List or a txt file.
  • column_names: If the csv or tsv file does not have column names, a Python List has to be passed to give the columns names. Since this function also accepts a Python List and a txt file, the default value is set to None. But if a csv or tsv file without column names is used, this parameter has to be set.

The code will take some time to augment the data. Grab your cup of coffee and watch the genie perform his magic!

Once the augmentation has completed, you’ll find the augmented data saved as hamspam_aug.tsv.

You can find the whole augmented dataset here. It grows from 300 rows to 43,600 rows!

Wrapping up

That’s all for now. Suggestions and ideas are always welcome for the library.

Thanks for reading 😃!


Hi, I am a machine learning enthusiast, passionate about counting stars under the sky of Deep Learning and Machine Learning.