Data Augmentation library for text
In the previous story, you learned about different approaches to generating more training data for your NLP task model. In this story, we will learn how you can do it with just a few lines of code.
In the natural language processing (NLP) field, it is hard to augment text due to the high complexity of language. Not every word can be replaced by another (for example a, an, the), and not every word has a synonym. Even changing a single word can make the context totally different. On the other hand, generating augmented images in computer vision is relatively easy: even after introducing noise or cropping out a portion of an image, the model can still classify it.
Introduction to nlpaug
After using imgaug in a computer vision project, I wondered whether we could have a similar library for generating synthetic text data. Therefore, I re-implemented those research papers using existing libraries and pre-trained models. The basic elements of nlpaug are:
- Character: OCR Augmenter, QWERTY Augmenter and Random Character Augmenter
- Word: WordNet Augmenter, word2vec Augmenter, GloVe Augmenter, fasttext Augmenter, BERT Augmenter and Random Word Augmenter
- Flow: Sequential Augmenter, Sometimes Augmenter
Intuitively, Character Augmenters and Word Augmenters focus on character-level and word-level manipulation respectively, while Flow works as an orchestrator that controls the augmentation pipeline. You can access the library on GitHub.
Character
Augmenting data at the character level. Possible scenarios include image-to-text and chatbots. When recognizing text from an image, we need an optical character recognition (OCR) model, but OCR introduces errors such as confusing "o" and "0". In chatbots, typos still occur even though most applications come with spelling correction. To overcome these problems, you may let your model "see" those possible outcomes before online prediction.
OCR
When working on an NLP problem, OCR results may be one of your inputs. For example, "0" may be recognized as "o" or "O". If you use bag-of-words or classic word embeddings as features, you will be in trouble, as out-of-vocabulary (OOV) words will surround you always. If you use state-of-the-art models such as BERT and GPT, the OOV issue seems resolved, since words are split into subwords, but some information is still lost.
OCRAug is designed to simulate OCR errors. It replaces target characters according to a pre-defined mapping table.
Example of augmentation
Original:
The quick brown fox jumps over the lazy dog
Augmented Text:
The quick brown fox jumps over the lazy d0g
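The idea behind this augmenter can be sketched in a few lines of plain Python. This is a toy illustration, not the library's actual implementation; the `OCR_MAP` table and the `prob` parameter are made up for this example:

```python
import random

# Toy mapping of characters to common OCR confusions (illustrative only;
# the real library ships a much larger pre-defined table)
OCR_MAP = {"o": ["0"], "0": ["o", "O"], "l": ["1", "I"], "i": ["1"], "s": ["5"]}

def ocr_augment(text, prob=0.3, seed=None):
    """Replace characters with plausible OCR errors with probability `prob`."""
    rng = random.Random(seed)
    out = []
    for c in text:
        candidates = OCR_MAP.get(c.lower())
        if candidates and rng.random() < prob:
            out.append(rng.choice(candidates))
        else:
            out.append(c)
    return "".join(out)

print(ocr_augment("The quick brown fox jumps over the lazy dog", seed=42))
```

With `prob=1.0`, every character that appears in the table gets swapped, so "dog" always becomes "d0g"; with the default probability only some occurrences change, which is what you want for generating varied training samples.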
QWERTY
Another project you may be involved in is a chatbot, or another messaging channel such as email. Although spell checking is performed, some misspellings still get through. They may hurt your NLP model as mentioned before.
QWERTYAug is designed to simulate keyboard-distance errors. It replaces the target character with one that is within one key's distance on the keyboard. You can configure whether to include numbers and special characters.
Example of augmentation
Original:
The quick brown fox jumps over the lazy dog
Augmented Text:
Tne 2uick hrown Gox jumpQ ovdr tNe <azy d8g
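The mechanism is the same as the OCR case, just with a different mapping: each key maps to its physical neighbours. A minimal pure-Python sketch (the `QWERTY_NEIGHBOURS` table below is a tiny made-up subset, not the library's full table):

```python
import random

# Toy QWERTY neighbour table covering only a few keys (illustrative only)
QWERTY_NEIGHBOURS = {
    "q": "wa", "w": "qes", "e": "wrd", "r": "etf", "t": "ryg",
    "a": "qsz", "s": "awed", "d": "serf", "o": "ip0", "g": "fth",
}

def keyboard_augment(text, prob=0.3, seed=None):
    """Swap characters for a key at distance one on the keyboard."""
    rng = random.Random(seed)
    out = []
    for c in text:
        neighbours = QWERTY_NEIGHBOURS.get(c.lower())
        if neighbours and rng.random() < prob:
            out.append(rng.choice(neighbours))
        else:
            out.append(c)
    return "".join(out)
```

Whether digits and special characters appear in the output is simply a matter of whether you include them in the neighbour strings, which mirrors the configuration option mentioned above.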
Random Character
Research suggests that noise injection may sometimes help generalize your NLP model. We can add noise to a word, for example by adding or deleting one character.
RandomCharAug is designed to inject noise into your data. Unlike OCRAug and QWERTYAug, it supports insertion, substitution, and deletion.
Example of insert augmentation
Original:
The quick brown fox jumps over the lazy dog
Augmented Text:
T(he quicdk browTn Ffox jumpvs 7over kthe clazy 9dog
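The three edit actions can be sketched as follows. This is a toy version for illustration; the function names and the `action` parameter are invented for this demo, not the library's API:

```python
import random
import string

def random_char_augment(word, action, seed=None):
    """Apply one random character-level edit to a word.
    `action` is 'insert', 'substitute' or 'delete'."""
    rng = random.Random(seed)
    pos = rng.randrange(len(word))  # pick a random position in the word
    if action == "insert":
        return word[:pos] + rng.choice(string.ascii_letters) + word[pos:]
    if action == "substitute":
        return word[:pos] + rng.choice(string.ascii_letters) + word[pos + 1:]
    if action == "delete":
        return word[:pos] + word[pos + 1:]
    raise ValueError(f"unknown action: {action}")

def augment_sentence(text, action, seed=None):
    """Apply the edit independently to each word of a sentence."""
    rng = random.Random(seed)
    return " ".join(random_char_augment(w, action, rng.random()) for w in text.split())
```

Insertion lengthens a word by one character, deletion shortens it by one, and substitution keeps the length unchanged, matching the insert example shown above.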
Word
Besides character augmentation, the word level is important as well. We make use of word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), fasttext (Joulin et al., 2016), BERT (Devlin et al., 2018) and WordNet to insert and substitute similar words. Word2vecAug, GloVeAug and FasttextAug use word embeddings to find the most similar group of words to replace the original word. On the other hand, BertAug uses a language model to predict possible target words, while WordNetAug uses statistics to find a similar group of words.
Word Embeddings (word2vec, GloVe, fasttext)
Classic embeddings use a static vector to represent a word. Ideally, two words have similar meanings if their vectors are near each other. In practice, it depends on the training data: for example, "rabbit" is similar to "fox" in word2vec, while "nbc" is similar to "fox" in GloVe.
Sometimes, you want to replace words with similar words so that the NLP model does not rely on a single word. Word2vecAug, GloVeAug and FasttextAug are designed to provide a "similar" word based on pre-trained vectors.
Besides substitution, insertion helps to inject noise into your data; it picks words from the vocabulary randomly.
Example of insert augmentation
Original:
The quick brown fox jumps over the lazy dog
Augmented Text:
The quick Bergen-Belsen brown fox jumps over Tiko the lazy dog
Example of substitute augmentation
Original:
The quick brown fox jumps over the lazy dog
Augmented Text:
The quick gray fox jumps over to lazy dog
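The core of embedding-based substitution is a nearest-neighbour lookup under cosine similarity. The sketch below uses a tiny hand-made 3-dimensional "embedding" table so it runs without any downloads; the real augmenters load full pre-trained word2vec, GloVe or fasttext vectors instead:

```python
import math

# Toy 3-d vectors standing in for real pre-trained embeddings (made up)
EMBEDDINGS = {
    "quick": [0.90, 0.10, 0.00],
    "fast":  [0.80, 0.20, 0.10],
    "rapid": [0.85, 0.15, 0.05],
    "lazy":  [0.00, 0.90, 0.30],
    "idle":  [0.10, 0.80, 0.40],
    "dog":   [0.20, 0.10, 0.90],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word):
    """Return the nearest neighbour of `word` in the embedding space."""
    vec = EMBEDDINGS[word]
    return max((cosine(vec, v), w) for w, v in EMBEDDINGS.items() if w != word)[1]

def substitute_augment(text, targets):
    """Replace each target word with its nearest embedding neighbour."""
    return " ".join(most_similar(w) if w in targets else w for w in text.split())
```

With these toy vectors, `most_similar("quick")` returns "rapid" and `most_similar("lazy")` returns "idle"; whether the neighbours are genuinely synonymous depends entirely on the training data, as the "nbc"/"fox" example above shows.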
Contextualized Word Embeddings
Since classic word embeddings use a single static vector for the same word, they may not fit some scenarios. For example, "fox" can refer to an animal or to a broadcasting company. To overcome this problem, contextualized word embeddings consider the surrounding words and generate a different vector for each context.
BertAug is designed to use this feature to perform insertion and substitution. Unlike the previous word-embedding augmenters, inserted words are predicted by the BERT language model rather than picked randomly, and substitution uses the surrounding words as features to predict the target word.
Example of insert augmentation
Original:
The quick brown fox jumps over the lazy dog
Augmented Text:
the lazy quick brown fox always jumps over the lazy dog
Example of substitute augmentation
Original:
The quick brown fox jumps over the lazy dog
Augmented Text:
the quick thinking fox jumps over the lazy dog
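The substitution flow is essentially "mask a token, ask a masked language model for fillers, pick one." The sketch below shows that loop with a hard-coded stub in place of the real model; `stub_predictor` and its canned candidates are entirely hypothetical, whereas a real implementation would run a pretrained BERT over the masked sentence:

```python
import random

def stub_predictor(masked_tokens, mask_index):
    """Stand-in for a masked language model: returns candidate fillers.
    A real implementation would feed the masked sentence to BERT."""
    # Hard-coded candidates keyed on a neighbouring word, purely illustrative
    if "fox" in masked_tokens:
        return ["clever", "thinking", "sly"]
    return ["some", "random", "words"]

def contextual_substitute(text, target_index, predictor=stub_predictor, seed=None):
    """Mask one token, get predictions for the masked slot, substitute one back."""
    rng = random.Random(seed)
    tokens = text.split()
    tokens[target_index] = "[MASK]"            # hide the target word
    candidates = predictor(tokens, target_index)
    tokens[target_index] = rng.choice(candidates)
    return " ".join(tokens)
```

Because the predictor sees the whole masked sentence, the replacement is conditioned on context, which is exactly what distinguishes this augmenter from the static-embedding ones.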
Synonym
Besides the neural network approaches, a thesaurus can achieve a similar objective. The limitation of synonyms is that some words may not have any. WordNet, from the awesome NLTK library, helps to find synonym words.
WordNetAug provides a substitution feature to replace the target word. Instead of purely finding synonyms, some preliminary checks make sure that the target word can be replaced. Those rules are:
- Do not pick a determiner (e.g. a, an, the)
- Do not pick a word that has no synonym.
Example of augmentation
Original:
The quick brown fox jumps over the lazy dog
Augmented Text:
The quick brown fox parachute over the lazy blackguard
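Both rules are simple lookups before the swap. A minimal sketch with a made-up `THESAURUS` dict standing in for WordNet (the real augmenter queries WordNet through NLTK):

```python
import random

# Toy thesaurus; the real augmenter looks synonyms up in WordNet via NLTK
THESAURUS = {
    "quick": ["speedy", "fast"],
    "jumps": ["leaps", "bounds"],
    "dog": ["hound", "canine"],
}
DETERMINERS = {"a", "an", "the"}

def synonym_augment(text, prob=0.5, seed=None):
    """Replace words with thesaurus synonyms, skipping determiners and
    words that have no synonym."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        synonyms = THESAURUS.get(word.lower())
        # Rule 1: never touch determiners. Rule 2: replace only if a synonym exists.
        if word.lower() not in DETERMINERS and synonyms and rng.random() < prob:
            out.append(rng.choice(synonyms))
        else:
            out.append(word)
    return " ".join(out)
```

Note that even with `prob=1.0`, "The" survives untouched because of the determiner rule, and any word missing from the thesaurus is left alone.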
Random Word
So far we have not introduced deletion at the word level. RandomWordAug can help to remove a word randomly.
Example of augmentation
Original:
The quick brown fox jumps over the lazy dog
Augmented Text:
The fox jumps over the lazy dog
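Word-level deletion is the simplest of the lot. A toy sketch (the keep-at-least-one-word guard is my own addition so the function never returns an empty string; it is not necessarily how the library behaves):

```python
import random

def random_word_delete(text, prob=0.2, seed=None):
    """Delete each word independently with probability `prob`,
    always keeping at least one word."""
    rng = random.Random(seed)
    words = text.split()
    kept = [w for w in words if rng.random() >= prob]
    # Guard against deleting everything (an assumption, see lead-in)
    return " ".join(kept) if kept else rng.choice(words)
```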
Flow
Up to this point, each of the above augmenters is invoked alone. What if you want to combine multiple augmenters? To make use of multiple augmentations, the sequential and sometimes pipelines are introduced to connect augmenters, so a single text can go through different augmenters to generate diverse data.
Sequential
You can add as many augmenters as you want to this flow, and Sequential executes them one by one. For example, you can combine RandomCharAug and RandomWordAug together.
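A sequential flow amounts to simple function composition: each augmenter receives the previous one's output. A sketch, where `upper_first` and `drop_last_word` are made-up stand-in augmenters for the demo:

```python
def sequential(*augmenters):
    """Return a pipeline that applies every augmenter, in order."""
    def pipeline(text):
        for aug in augmenters:
            text = aug(text)  # each stage feeds the next
        return text
    return pipeline

# Hypothetical "augmenters" just for demonstration
upper_first = lambda t: t.capitalize()
drop_last_word = lambda t: " ".join(t.split()[:-1])

pipe = sequential(upper_first, drop_last_word)
print(pipe("the quick brown fox"))  # → "The quick brown"
```

In practice you would pass real augmenters (e.g. a character-level and a word-level one) instead of these toy lambdas.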
Sometimes
If you do not want to execute the same set of augmenters every time, Sometimes picks a subset of the augmenters each time.
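The difference from the sequential flow is a coin flip per augmenter. A sketch of the idea, with the `prob` parameter invented for this demo:

```python
import random

def sometimes(augmenters, prob=0.5, seed=None):
    """Return a pipeline that applies each augmenter with probability `prob`."""
    rng = random.Random(seed)
    def pipeline(text):
        for aug in augmenters:
            if rng.random() < prob:  # coin flip: maybe skip this augmenter
                text = aug(text)
        return text
    return pipeline
```

Running the resulting pipeline repeatedly on the same input therefore yields different combinations of augmentations, which increases the diversity of the generated data.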
Recommendation
The above approaches are designed to solve problems the authors faced in their own work. If you understand your data, you should tailor the augmentation approach to it. Remember that the golden rule in data science is garbage in, garbage out.
In general, you can try the thesaurus approach without a deep understanding of your data. It may not boost performance much, due to the thesaurus limitations mentioned above.
About Me
I am a Data Scientist in the Bay Area, focusing on the state of the art in data science and artificial intelligence, especially NLP and platform-related topics. Feel free to connect with me on LinkedIn or follow me on Medium or GitHub.
Further Reading
- Image augmentation library (imgaug)
- Text augmentation library (nlpaug)
- Data Augmentation in NLP
- Data Augmentation for Audio
- Data Augmentation for Spectrogram
- Does your NLP model able to prevent an adversarial attacks?
- Data Augmentation in NLP: Best Practices From a Kaggle Master
Reference
- X. Zhang, J. Zhao and Y. LeCun. Character-level Convolutional Networks for Text Classification. 2015
- W. Y. Wang and D. Yang. That’s So Annoying!!!: A Lexical and Frame-Semantic Embedding Based Data Augmentation Approach to Automatic Categorization of Annoying Behaviors using #petpeeve Tweets. 2015
- S. Kobayashi. Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relation. 2018
- C. Coulombe. Text Data Augmentation Made Simple By Leveraging NLP Cloud APIs. 2018