Data Augmentation library for text

Edward Ma
Towards Data Science
7 min read · Apr 20, 2019


Photo by Edward Ma on Unsplash

In the previous story, you learned about different approaches to generating more training data for your NLP task model. In this story, we will see how you can do it with just a few lines of code.

In the natural language processing (NLP) field, it is hard to augment text due to the high complexity of language. Not every word can be replaced by another (for example, a, an, the). Also, not every word has a synonym. Even changing a single word can make the context totally different. On the other hand, generating augmented images in the computer vision area is relatively easy. Even after introducing noise or cropping out a portion of an image, the model can still classify the image.

Introduction to nlpaug

After using imgaug in a computer vision project, I wondered whether we could have a similar library for generating synthetic text data. Therefore, I re-implemented those research papers using existing libraries and pre-trained models. The basic elements of nlpaug include:

  • Character: OCR Augmenter, QWERTY Augmenter and Random Character Augmenter
  • Word: WordNet Augmenter, word2vec Augmenter, GloVe Augmenter, fasttext Augmenter, BERT Augmenter, Random Word Augmenter
  • Flow: Sequential Augmenter, Sometimes Augmenter

Intuitively, Character Augmenters and Word Augmenters focus on character-level and word-level manipulation respectively. Flow works as an orchestrator that controls the augmentation pipeline. You can access the library on GitHub.

Character

Augmenting data at the character level. Possible scenarios include image-to-text and chatbot applications. When recognizing text from an image, we need an optical character recognition (OCR) model, but OCR introduces errors such as confusing “o” and “0”. In chatbots, typos still occur even though most applications come with word correction. To overcome these problems, you may let your model “see” the possible outcomes before online prediction.

OCR

When working on an NLP problem, OCR results may be one of your inputs. For example, “0” may be recognized as “o” or “O”. If you are using bag-of-words or classic word embeddings as features, you will run into trouble, as out-of-vocabulary (OOV) words will be around you today and always. If you use state-of-the-art models such as BERT and GPT, the OOV issue seems resolved because words are split into subwords. However, some information is still lost.

OCRAug is designed to simulate OCR errors. It replaces target characters according to a pre-defined mapping table.

Example of augmentation

Original:
The quick brown fox jumps over the lazy dog
Augmented Text:
The quick brown fox jumps over the lazy d0g
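
A minimal sketch of invoking the OCR augmenter. The module path below follows nlpaug at the time of writing and may change in later releases; depending on the version, augment() returns either a string or a list of strings.

    import nlpaug.augmenter.char as nac

    text = 'The quick brown fox jumps over the lazy dog'

    # Substitute characters according to the pre-defined OCR
    # confusion table (e.g. "o" <-> "0").
    aug = nac.OcrAug()
    augmented_text = aug.augment(text)
    print(augmented_text)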

QWERTY

Another project you may be involved in is a chatbot or another messaging channel such as email. Although spell checking is performed, some misspellings still exist. They may hurt your NLP model as mentioned before.

QWERTYAug is designed to simulate keyboard distance error. It replaces the target character with one that is 1 keyboard distance away. You can configure whether to include numbers and special characters.

Example of augmentation

Original:
The quick brown fox jumps over the lazy dog
Augmented Text:
Tne 2uick hrown Gox jumpQ ovdr tNe <azy d8g
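
A similar sketch for the keyboard-distance augmenter. The QwertyAug class name follows this article’s era of the library; later nlpaug releases expose the same behaviour as KeyboardAug, and the constructor flags for numbers and special characters vary by version, so defaults are used here.

    import nlpaug.augmenter.char as nac

    text = 'The quick brown fox jumps over the lazy dog'

    # Substitute characters with neighbours within 1 keyboard distance.
    aug = nac.QwertyAug()
    augmented_text = aug.augment(text)
    print(augmented_text)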

Random Character

Research suggests that noise injection may sometimes help generalize your NLP model. We can add noise to a word, such as adding or deleting one character.

RandomCharAug is designed to inject noise into your data. Unlike OCRAug and QWERTYAug, it supports insertion, substitution, and deletion.

Example of insert augmentation

Original:
The quick brown fox jumps over the lazy dog
Augmented Text:
T(he quicdk browTn Ffox jumpvs 7over kthe clazy 9dog
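
A sketch of the random-character augmenter. The action argument selects the operation; it is shown here as a plain string, though some versions of the library accept an Action enum instead.

    import nlpaug.augmenter.char as nac

    text = 'The quick brown fox jumps over the lazy dog'

    # action may be "insert", "substitute" or "delete"
    aug = nac.RandomCharAug(action="insert")
    augmented_text = aug.augment(text)
    print(augmented_text)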

Word

Besides character augmentation, the word level is important as well. We make use of word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), fasttext (Joulin et al., 2016), BERT (Devlin et al., 2018), and WordNet to insert and substitute similar words. Word2vecAug, GloVeAug, and FasttextAug use word embeddings to find the most similar group of words to replace the original word. On the other hand, BertAug uses a language model to predict possible target words. WordNetAug uses a thesaurus to find a similar group of words.

Word Embeddings (word2vec, GloVe, fasttext)

Classic embeddings use a static vector to represent a word. Ideally, the meanings of two words are similar if their vectors are near each other. In practice, it depends on the training data. For example, “rabbit” is similar to “fox” in word2vec, while “nbc” is similar to “fox” in GloVe.

Most similar words to “fox” among classic word embedding models

Sometimes, you want to replace a word with a similar one so that the NLP model does not rely on a single word. Word2vecAug, GloVeAug, and FasttextAug are designed to provide a “similar” word based on pre-trained vectors.

Besides substitution, insertion helps to inject noise into your data. It picks words from the vocabulary randomly.

Example of insert augmentation

Original:
The quick brown fox jumps over the lazy dog
Augmented Text:
The quick Bergen-Belsen brown fox jumps over Tiko the lazy dog

Example of substitute augmentation

Original:
The quick brown fox jumps over the lazy dog
Augmented Text:
The quick gray fox jumps over to lazy dog
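
A sketch of the word-embedding augmenters. The model path is a placeholder for a locally downloaded pre-trained model, and later nlpaug releases merge Word2vecAug, GloVeAug, and FasttextAug into a single WordEmbsAug class, so treat the exact names as version-dependent.

    import nlpaug.augmenter.word as naw

    text = 'The quick brown fox jumps over the lazy dog'

    # Placeholder path: point this at your own copy of a
    # pre-trained word2vec binary.
    aug = naw.Word2vecAug(
        model_path='GoogleNews-vectors-negative300.bin',
        action='substitute',  # or 'insert' to add random vocabulary words
    )
    augmented_text = aug.augment(text)
    print(augmented_text)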

Contextualized Word Embeddings

Classic word embeddings use a static vector to represent the same word in every context, which may not fit some scenarios: “Fox”, for example, can refer to an animal or a broadcasting company. To overcome this problem, contextualized word embeddings were introduced; they consider the surrounding words to generate a different vector for each context.

BertAug is designed to provide insertion and substitution using this capability. Unlike the previous word embedding augmenters, insertion is predicted by the BERT language model rather than picking one word randomly. Substitution uses the surrounding words as features to predict the target word.

Example of insert augmentation

Original:
The quick brown fox jumps over the lazy dog
Augmented Text:
the lazy quick brown fox always jumps over the lazy dog

Example of substitute augmentation

Original:
The quick brown fox jumps over the lazy dog
Augmented Text:
the quick thinking fox jumps over the lazy dog
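
A sketch of BertAug. Later nlpaug versions expose the same feature as ContextualWordEmbsAug with a pre-trained model name such as 'bert-base-uncased'; the class name below follows this article.

    import nlpaug.augmenter.word as naw

    text = 'The quick brown fox jumps over the lazy dog'

    # "insert" lets BERT predict a new token at a chosen position;
    # "substitute" masks an existing token and predicts a replacement.
    aug = naw.BertAug(action='substitute')
    augmented_text = aug.augment(text)
    print(augmented_text)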

Synonym

Besides the neural network approach, a thesaurus can achieve a similar objective. The limitation of synonyms is that some words may not have any. WordNet from the awesome NLTK library helps to find the synonyms.

WordNetAug provides a substitution feature to replace the target word. Instead of picking synonyms blindly, some preliminary checks make sure that the target word can be replaced. Those rules are:

  • Do not pick determiners (e.g. a, an, the)
  • Do not pick a word that does not have a synonym.

Example of augmentation

Original:
The quick brown fox jumps over the lazy dog
Augmented Text:
The quick brown fox parachute over the lazy blackguard
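
A sketch of the WordNet-based augmenter. NLTK’s WordNet corpus must be downloaded first (nltk.download('wordnet')); later nlpaug releases expose this augmenter as SynonymAug.

    import nlpaug.augmenter.word as naw

    text = 'The quick brown fox jumps over the lazy dog'

    # Substitute words with WordNet synonyms, skipping determiners
    # and words without any synonym (the rules listed above).
    aug = naw.WordNetAug()
    augmented_text = aug.augment(text)
    print(augmented_text)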

Random Word

So far we have not introduced deletion at the word level. RandomWordAug can help to remove a word randomly.

Example of augmentation

Original:
The quick brown fox jumps over the lazy dog
Augmented Text:
The fox jumps over the lazy dog
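
A short sketch for word-level deletion:

    import nlpaug.augmenter.word as naw

    text = 'The quick brown fox jumps over the lazy dog'

    # Randomly remove words from the input sentence.
    aug = naw.RandomWordAug()
    augmented_text = aug.augment(text)
    print(augmented_text)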

Flow

Up to here, the above augmenters have been invoked alone. What if you want to combine multiple augmenters? To make use of multiple augmentations, the sequential and sometimes pipelines are introduced to connect augmenters. A single text can go through different augmenters to generate diverse data.

Sequential

You can add as many augmenters as you want to this flow, and Sequential executes them one by one. For example, you can combine RandomCharAug and RandomWordAug, as in the sketch below.
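
A sketch of that combination; every augmenter in the list runs, in order, on the output of the previous one.

    import nlpaug.augmenter.char as nac
    import nlpaug.augmenter.word as naw
    import nlpaug.flow as naf

    text = 'The quick brown fox jumps over the lazy dog'

    # Apply character-level noise first, then word-level deletion.
    aug = naf.Sequential([
        nac.RandomCharAug(action='insert'),
        naw.RandomWordAug(),
    ])
    augmented_text = aug.augment(text)
    print(augmented_text)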

Sometimes

If you do not want to execute the same set of augmenters all the time, Sometimes picks only some of the augmenters each time it is invoked.
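
A sketch of the probabilistic pipeline. Each augmenter fires only with some probability; the parameter controlling that probability varies by library version, so defaults are used here.

    import nlpaug.augmenter.char as nac
    import nlpaug.augmenter.word as naw
    import nlpaug.flow as naf

    text = 'The quick brown fox jumps over the lazy dog'

    # Each augmenter is applied with some probability, so repeated
    # calls yield differently augmented outputs.
    aug = naf.Sometimes([
        nac.RandomCharAug(action='insert'),
        naw.RandomWordAug(),
    ])
    augmented_text = aug.augment(text)
    print(augmented_text)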

Recommendation

The above approaches are designed to solve the problems the authors were facing in their own work. If you understand your data, you should tailor the augmentation approach to it. Remember that the golden rule in data science is garbage in, garbage out.

In general, you can try the thesaurus approach without deeply understanding your data. It may not boost performance much, due to the aforementioned limitation of the thesaurus approach.

About Me

I am a Data Scientist in the Bay Area, focusing on the state of the art in data science and artificial intelligence, especially NLP and platform-related topics. Feel free to connect with me on LinkedIn or follow me on Medium or GitHub.

