
Often while developing Natural Language Processing models, we find it difficult to find relevant data — and even harder to find it in large amounts.
Previously, while developing our Intent Classifier, we used the CLINC150 dataset, which has 100 samples for each of 150 classes. But what if we needed even more samples? A similar scenario arose when I was working on a contextual assistant with Rasa. While creating the training data from scratch, I’d have to imagine different samples for each intent or ask my friends for help. Each class might need a healthy number of samples depending upon the domain.
This is when I landed upon the idea of creating TextGenie, a library to augment text data. The Python package is open-sourced on my GitHub repo. Let’s see the library in action.
How does the library work?
The library uses the following approaches to augment text data so far:
Paraphrasing using T5
Paraphrasing with deep learning can generate a generous and varied number of samples. We use this T5 model from Hugging Face to generate paraphrases.
BERT Mask Filling
For augmenting text using mask filling, we first find the words that can be masked. For that, we use spaCy to extract keywords from a sentence. Once the keywords are found, each is replaced with a mask and fed to the BERT model, which predicts a word to fill the masked position.
Converting sentence to active voice
In addition, we check whether a sentence is in passive voice; if so, it is converted to active voice.
Installation
Install the library using:
pip install textgenie
Usage
Let’s initialize the augmentor from the `TextGenie` class using the following code:
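Here is a minimal initialization sketch. The checkpoint names are illustrative assumptions on my part; any T5 paraphrase model and BERT mask-filling model from the Hugging Face hub should work:

```python
from textgenie import TextGenie

# Sketch: initialize TextGenie with a T5 paraphrase model and a BERT
# mask-filling model. Both checkpoint names below are illustrative
# assumptions, not values required by the library.
textgenie = TextGenie(
    paraphrase_model_name="ramsrigouthamg/t5_paraphraser",
    mask_model_name="bert-base-uncased",
)
```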
Here, along with the paraphrasing model, I’ve mentioned the name of the BERT model, which is set to `None` by default. It can be enabled by passing the name of the model. I advise using the mask filling method as well, since it will help generate more data.
You can find the full parameter list for the `TextGenie` object below:
- `paraphrase_model_name`: The name of the T5 paraphrase model. Edit: A list of pretrained models for paraphrase generation can be found here.
- `mask_model_name`: An optional BERT model that will be used to fill masks. The default value is `None`, so mask filling is disabled by default. It can be enabled by mentioning the name of the BERT model to be used. A list of mask-filling models can be found here.
- `spacy_model_name`: The name of the spaCy model. Available models can be found here. The default value is `en`. Even though a spaCy model name is set, it won’t be used if `mask_model_name` is set to `None`.
- `device`: The device where the model will be loaded. The default value is `cpu`.
Text augmentation with paraphrases using T5:
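A usage sketch follows; the input sentence and the `paraphrase: ` prefix are my assumptions, so use whatever prefix your T5 checkpoint expects:

```python
# Sketch: generate paraphrases for a single sentence.
# The prefix string is an assumption about the checkpoint's input format.
paraphrases = textgenie.augment_sent_t5(
    "How are you doing today?",
    prefix="paraphrase: ",
    n_predictions=5,
)
print(paraphrases)
```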
The parameter list for the `augment_sent_t5()` method is as below:
- `sent`: The sentence on which augmentation has to be applied.
- `prefix`: The prefix for the T5 model input.
- `n_predictions`: The number of augmentations the method should return. The default value is set to `5`.
- `top_k`: The number of predictions the T5 model should generate. The default value is set to `120`.
- `max_length`: The max length of the sentence to feed to the model. The default value is set to `256`.
Text augmentation with BERT mask filling:
Note: While using this method, please take care with punctuation.
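A usage sketch, with an example sentence of my own:

```python
# Sketch: augment a sentence via BERT mask filling.
augmented = textgenie.augment_sent_mask_filling(
    "I love reading books on the weekend.",
    n_mask_predictions=5,
)
print(augmented)
```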
Please find the parameter list for the `augment_sent_mask_filling()` method below:
- `sent`: The sentence on which augmentation has to be applied.
- `n_mask_predictions`: The number of predictions the BERT mask-filling model should generate. The default value is set to `5`.
Converting sentence to active voice
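A usage sketch with two example sentences of my own — one already active, one passive:

```python
# Sketch: normalize sentences to active voice.
print(textgenie.convert_to_active("She completed the project on time."))  # already active
print(textgenie.convert_to_active("The project was completed by her."))   # passive, gets converted
```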
In the first example, the sentence was already in active voice, so it is returned as is. In the second example, the sentence was converted from passive to active voice.
Following is the parameter needed for the `convert_to_active()` method:
- `sent`: The sentence that has to be converted.
Magic once: Wrapping all the methods in a single method
To keep the outputs from taking up too much space, I’ve set the prediction counts to smaller numbers. Please feel free to play with them! 😉
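A sketch of what such a call might look like; the sentence and prefix are my assumptions:

```python
# Sketch: run paraphrasing, mask filling, and voice conversion in one call.
augmented = textgenie.magic_once(
    "The letter was delivered by the postman.",
    paraphrase_prefix="paraphrase: ",  # assumption: prefix the T5 checkpoint expects
    n_paraphrase_predictions=2,
    paraphrase_top_k=50,
    n_mask_predictions=2,
)
print(augmented)
```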
As it appears, the `textgenie.magic_once()` method merges the functionalities of all the techniques mentioned above.
Since this method operates on individual text samples, it can easily be integrated with other frameworks that require data augmentation.
The full list of parameters for `magic_once()` is:
- `sent`: The sentence that has to be augmented.
- `paraphrase_prefix`: The prefix for the T5 model input.
- `n_paraphrase_predictions`: The number of augmentations the T5 model should return. The default value is set to `5`.
- `paraphrase_top_k`: The total number of predictions the T5 model should generate. The default value is set to `120`.
- `paraphrase_max_length`: The max length of the sentence to feed to the model. The default value is set to `256`.
- `n_mask_predictions`: The number of predictions the BERT mask-filling model should generate. The default value is set to `None`.
- `convert_to_active`: Whether the sentence should be converted to active voice. The default value is set to `True`.
Time for the Magic Lamp!
Now that we’ve talked about individual samples, let’s bring the genie’s magic to the whole dataset! 😋
The `magic_lamp()` method takes a whole dataset and generates a `txt` or `tsv` file containing the augmented data, depending upon the input. If the input is a Python `List` or a `.txt` file, the augmented output will be stored in a `txt` file named `sentences_aug.txt`, and a Python `List` containing the augmented data will also be returned. If the data is in a `csv` or `tsv` file, a `tsv` file named `original_file_name_aug.tsv` containing the augmented data will be saved and a pandas `DataFrame` will be returned. If these files contain labeled data, the augmented data will be returned along with the corresponding labels.
First of all, we’ll need a dataset to work on. I’ve taken 300 samples to play with from the SMS Spam Collection Data Set, which you can download from here.
Augmenting dataset
Download `hamspam.tsv` to your current working directory and run the following code:
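A sketch of the call; the column names are my assumptions, and since the raw SMS Spam Collection file has no header row, I pass `column_names` explicitly:

```python
# Sketch: augment a labeled TSV dataset end to end.
# "label" and "text" are assumed column names for the headerless TSV.
df_aug = textgenie.magic_lamp(
    "hamspam.tsv",
    paraphrase_prefix="paraphrase: ",
    n_paraphrase_predictions=2,
    n_mask_predictions=2,
    label_column="label",
    data_column="text",
    column_names=["label", "text"],
)
```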
The full parameter list for the `magic_lamp()` method is as follows:
- `sentences`: The dataset that has to be augmented. This can be a Python `List`, or a `txt`, `csv`, or `tsv` file.
- `paraphrase_prefix`: The prefix for the T5 model input.
- `n_paraphrase_predictions`: The number of augmentations the T5 model should return. The default value is set to `5`.
- `paraphrase_top_k`: The total number of predictions the T5 model should generate. The default value is set to `120`.
- `paraphrase_max_length`: The max length of the sentence to feed to the model. The default value is set to `256`.
- `n_mask_predictions`: The number of predictions the BERT mask-filling model should generate. The default value is set to `None`.
- `convert_to_active`: Whether the sentence should be converted to active voice. The default value is set to `True`.
- `label_column`: The name of the column that contains the labels. The default value is set to `None`. This parameter is not required if the dataset is a Python `List` or a `txt` file.
- `data_column`: The name of the column that contains the data. The default value is set to `None`. This parameter too is not required if the dataset is a Python `List` or a `txt` file.
- `column_names`: If the `csv` or `tsv` file does not have column names, a Python list has to be passed to name the columns. Since this function also accepts a Python `List` and a `txt` file, the default value is set to `None`. But if `csv` or `tsv` files are used, this parameter has to be set.
The code will take some time to augment the data. Grab your cup of coffee and watch the genie perform his magic!
Once the augmentation has completed, you can find the whole augmented dataset here. It goes from 300 rows to 43,600 rows!
Wrapping up
That’s all for now. Suggestions and ideas for the library are always welcome.
Thanks for reading 😃 !