Text classification algorithms are extremely sensitive to the diversity of the training data. A robust NLP pipeline must account for the possibility of low-quality data and mitigate the problem as well as possible.
When working with images, the standard way to strengthen a classification algorithm and introduce diversity is data augmentation, and there are plenty of clever techniques for performing it automatically. Methods for text data augmentation in NLP tasks are far less common, and their results are more ambiguous.
In this post, I'll show a simple and intuitive technique for text data generation. Using Markov Chain rules, we will generate new textual samples to feed our models and test their performance.
THE DATASET
I got the data for our experiments from Kaggle. The Uber Ride Reviews Dataset is a collection of ride reviews published during 2014–2017 and scraped from the web. It contains the raw textual reviews, the Ride Rating (1–5) given by the user, and the Ride Sentiment (1 if the rating is above 3, 0 otherwise). As you can see, this is an imbalanced classification problem: the distribution of reviews is skewed toward positive ratings.

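To make the setup concrete, here is a minimal loading sketch. The file name, column names, and split size are assumptions (they may differ in the actual Kaggle download); the resulting x_train, x_test, y_train and y_test are reused in the sketches later in the post.

import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed file and column names: adjust them to match the actual Kaggle CSV.
df = pd.read_csv("Uber_Ride_Reviews.csv")

# Binary sentiment as described above: ratings above 3 are positive (1), the rest 0.
df["ride_sentiment"] = (df["ride_rating"] > 3).astype(int)
print(df["ride_sentiment"].value_counts(normalize=True))  # shows the class imbalance

# Hold out a test set; the 80/20 split and the seed are arbitrary choices.
x_train, x_test, y_train, y_test = train_test_split(
    df["ride_review"].values, df["ride_sentiment"].values,
    test_size=0.2, stratify=df["ride_sentiment"], random_state=33)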
First, our aim is to predict the sentiment of the reviews, fitting and comparing different architectures. The most interesting part is the second stage, where we put our models under pressure by making them predict fake data randomly generated with a Markov Chain. We want to test whether our models are stable enough to perform adequately on data that comes from the training set (with some noise added). If everything is fine, the models should have no problem producing good results on this fake data and, hopefully, the augmented data will also improve performance on the test set; otherwise, we need to revisit the training process.
TEXT DATA AUGMENTATION
Before starting the training procedure, we have to generate our fake data. Everything begins with studying the distribution of review lengths in the training set.

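Since the generator samples review lengths from this distribution, we store it as two arrays. A short sketch, assuming x_train holds the raw training reviews; the variable names lenghts and freq are the ones referenced in the code that follows:

import numpy as np

# Word count of every training review.
review_word_counts = np.array([len(review.split()) for review in x_train])

# Unique lengths and their relative frequencies: new reviews will sample
# their length from this empirical distribution.
lenghts, counts = np.unique(review_word_counts, return_counts=True)
freq = counts / counts.sum()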
We have to store this information because our new reviews will follow a similar length distribution. The generation process is composed of two phases. In the first phase we "build the chains": we take a collection of texts as input (in our case the training corpus) and, for each word, automatically record every word that follows it anywhere in the corpus. In the second phase we create new reviews from these chains: we pick a word at random from the vocabulary of the starting corpus (the beginning of our review) and then pick the next word at random from its chain. We then repeat the process starting from the newly selected word. In other words, we are simulating a Markov Chain process where, to build a new review, each word is chosen based only on the previous one.
I have combined the two phases in a single function (Generator). It receives as input the textual reviews, their labels, and the desired number of new instances to generate for each class. The original length distribution is useful because we can sample from it plausible lengths for our new reviews.
import random
import numpy as np

def build_chain(texts):
    """For each word in the corpus, record every word that follows it."""
    index = 1
    chain = {}
    for text in texts:
        text = text.split()
        for word in text[index:]:
            key = text[index - 1]
            if key in chain:
                chain[key].append(word)
            else:
                chain[key] = [word]
            index += 1
        index = 1
    return chain

def create_sentence(chain, lenght):
    """Build a sentence of the desired length by walking the chain word by word."""
    start = random.choice(list(chain.keys()))
    text = [start]
    while len(text) < lenght:
        try:
            after = random.choice(chain[start])
            start = after
            text.append(after)
        except KeyError:  # dead end: the word never appears as a key, restart from a random word
            start = random.choice(list(chain.keys()))
    return ' '.join(text)

def Generator(x_train, y_train, rep, concat=False, seed=33):
    """Generate `rep` fake reviews per class; optionally concatenate them to the original data."""
    np.random.seed(seed)
    new_corpus, new_labels = [], []
    for lab in np.unique(y_train):
        # build the chain only from the reviews of the current class
        selected = x_train[y_train == lab]
        chain = build_chain(selected)
        sentences = []
        for i in range(rep):
            # sample a plausible review length from the training length distribution (lenghts, freq)
            lenght = int(np.random.choice(lenghts, 1, p=freq))
            sentences.append(create_sentence(chain, lenght))
        new_corpus.extend(sentences)
        new_labels.extend([lab] * rep)
    if concat:
        return list(x_train) + new_corpus, list(y_train) + new_labels
    return new_corpus, new_labels
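As a quick sanity check, here is what the chains look like on a toy corpus (the generated sentence is random, so your output will differ):

toy_chain = build_chain(["the driver was great", "the ride was quick"])
print(toy_chain)
# {'the': ['driver', 'ride'], 'driver': ['was'], 'was': ['great', 'quick'], 'ride': ['was']}

print(create_sentence(toy_chain, lenght=5))
# e.g. "the ride was great driver" -- every word is picked only from the followers of the previous one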
We need both texts and labels as input because we split the generation into separate subprocesses: reviews from a particular class are used to generate new reviews for that same class. Keeping the chain building and the sampling separate per class lets us produce credible samples for our predictive models.
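Putting it together, the calls below sketch how the fake test set and the augmented training set used in the next section could be produced (x_train and y_train are assumed to be NumPy arrays, so that boolean indexing by class works):

# 100 fake reviews per class -> a separate fake test set of 200 samples.
fake_x_test, fake_y_test = Generator(x_train, y_train, rep=100)

# 300 fake reviews per class appended to the originals -> reinforced training set.
aug_x_train, aug_y_train = Generator(x_train, y_train, rep=300, concat=True)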

THE MODELS
We split our initial dataset into train and test sets and use the training set as the corpus to feed our generator and create new reviews. We generate 200 reviews (100 for each class) to form a separate fake test set, and 600 more (300 for each class) to strengthen our training set. Our arsenal of models consists of a Multilayer Perceptron neural network, a Logistic Regression, and a Random Forest. The training process is divided into two stages. First, we fit all the models on the original training set and check their performance on the test set and, separately, on our fake test data. We expect all the models to perform well on the fake test data because it is generated from the training set. Second, we refit our models on the strengthened training set and check performance on both test sets.
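Below is a hedged sketch of this two-stage evaluation with scikit-learn, reusing the variables from the earlier sketches; the TF-IDF vectorizer, hyperparameters, and MLP architecture are stand-ins, not the exact setup of the original experiments.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score, classification_report

models = {
    "MLP": MLPClassifier(max_iter=300),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=200),
}

def evaluate(train_x, train_y, name, model):
    # Fit on the given training set, then score on the real and the fake test sets.
    pipe = make_pipeline(TfidfVectorizer(), model)
    pipe.fit(train_x, train_y)
    for label, (tx, ty) in {"test": (x_test, y_test),
                            "fake test": (fake_x_test, fake_y_test)}.items():
        auc = roc_auc_score(ty, pipe.predict_proba(tx)[:, 1])
        print(f"{name} on {label}: AUC={auc:.3f}")
        print(classification_report(ty, pipe.predict(tx)))

# Stage 1: original training set only.
for name, model in models.items():
    evaluate(x_train, y_train, name, model)

# Stage 2: training set reinforced with the generated reviews.
for name, model in models.items():
    evaluate(aug_x_train, aug_y_train, name, model)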


In the first stage, the best model on the test data is the Neural Network (AUC, precision, recall, and F1 are reported as performance metrics), but surprisingly, the Logistic Regression and the Random Forest fail on the fake test set! This tells us that these models are not well fitted. We fit the models again, this time using the reinforced training set. Now the performance on the original test set increases for all the models, and they also start to generalize well on the fake data.
SUMMARY
In this post, I assembled an easy procedure for generating fake text data. This technique is useful when we fit an NLP classifier and want to test its robustness. If our model cannot correctly classify fake data generated from the training set, it is appropriate to revisit the training process, adjusting the hyperparameters or directly adding some of this fake data to the training set.
Keep in touch: Linkedin