Abstractive Summarization for Data Augmentation

Aaron Briel
Towards Data Science
6 min read · Aug 3, 2020


Image by author

A Creative Solution to Imbalanced Class Distribution

Imbalanced class distribution is a common problem in Machine Learning. I was recently confronted with this issue when training a sentiment classification model. Certain categories were far more prevalent than others and the predictive quality of the model suffered. The first technique I used to address this was random under-sampling, wherein I randomly sampled a subset of rows from each category up to a ceiling threshold. I selected a ceiling that reasonably balanced the upper 3 classes. Although a small improvement was observed, the model was still far from optimal.

I needed a way to deal with the under-represented classes. I could not rely on traditional techniques used in multi-class classification such as sample and class weighting, as I was working with a multi-label dataset. It became evident that I would need to leverage oversampling in this situation.

A technique such as SMOTE (Synthetic Minority Over-sampling Technique) can be effective for oversampling, although the problem again becomes a bit more difficult with multi-label datasets. MLSMOTE (Multi-Label Synthetic Minority Over-sampling Technique) has been proposed [1], but the high dimensional nature of the numerical vectors created from text can sometimes make other forms of data augmentation more appealing.

Photo by Christian Wagner on Unsplash

Transformers to the Rescue!

If you decided to read this article, it is safe to assume that you are aware of the latest advances in Natural Language Processing bequeathed by the mighty Transformers. The exceptional developers at Hugging Face in particular have opened the door to this world through their open source contributions. One of their more recent releases implements a breakthrough in Transfer Learning called the Text-to-Text Transfer Transformer, or T5 model, originally presented by Raffel et al. in their paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [2].

T5 allows us to execute various NLP tasks by specifying prefixes to the input text. In my case, I was interested in Abstractive Summarization, so I made use of the summarize prefix.
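As a quick, illustrative sketch (the t5-small checkpoint and the example strings are mine, not part of absum), the same model can handle different tasks purely through the prefix:

# Minimal sketch of T5's prefix-driven, text-to-text interface.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

examples = [
    ("translate English to German: ", "The house is wonderful."),
    ("summarize: ", "Transfer learning has emerged as a powerful technique in "
                    "NLP, where a model is pre-trained on a data-rich task and "
                    "then fine-tuned on a downstream task."),
]

# The task is selected purely by the prefix prepended to the input text
for prefix, text in examples:
    input_ids = tokenizer.encode(prefix + text, return_tensors="pt")
    output_ids = model.generate(input_ids, max_length=60)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))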

Text-to-Text Transfer Transformer [2]

Abstractive Summarization

Abstractive Summarization, put simply, is a technique in which a chunk of text is fed to an NLP model and a novel summary of that text is returned. This should not be confused with Extractive Summarization, where sentences are embedded and a clustering algorithm is executed to find those closest to the clusters’ centroids; in other words, only existing sentences are returned. Abstractive Summarization seemed particularly appealing as a Data Augmentation technique because of its ability to generate novel yet realistic sentences of text.
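To make the contrast concrete, here is a bare-bones sketch of the extractive approach just described: embed sentences, cluster them, and return the existing sentences nearest each centroid (the sentences and cluster count are made up for illustration).

# Extractive summarization sketch: embed sentences, cluster, and return the
# existing sentences nearest each cluster centroid (no new text is generated).
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances_argmin_min

sentences = [
    "The battery life on this laptop is outstanding.",
    "I can get through a full workday without charging.",
    "The keyboard feels mushy and cramped.",
    "Typing for long stretches is uncomfortable.",
]

embeddings = TfidfVectorizer().fit_transform(sentences).toarray()
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)

# Indices of the existing sentences closest to each centroid
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, embeddings)
print([sentences[i] for i in sorted(set(closest))])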

Algorithm

Here are the steps I took to use Abstractive Summarization for Data Augmentation, including code segments illustrating the solution.

I first needed to determine how many rows each under-represented class required. The number of rows to add for each feature is calculated against a ceiling threshold, and we refer to these as the append_counts. Features with counts at or above the ceiling are not appended. For example, if a given feature has 1000 rows and the ceiling is 100, its append count will be 0. The following methods achieve this straightforwardly in the situation where features have been one-hot encoded:

def get_feature_counts(self, df):
    shape_array = {}
    for feature in self.features:
        shape_array[feature] = df[feature].sum()
    return shape_array

def get_append_counts(self, df):
    append_counts = {}
    feature_counts = self.get_feature_counts(df)

    for feature in self.features:
        if feature_counts[feature] >= self.threshold:
            count = 0
        else:
            count = self.threshold - feature_counts[feature]

        append_counts[feature] = count

    return append_counts
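To illustrate the same arithmetic outside the class (the feature names and counts below are made up):

# With a ceiling (threshold) of 100, a feature with 1000 rows needs no new
# rows, while a feature with 40 rows needs 60 appended.
feature_counts = {'feature_a': 1000, 'feature_b': 40}
threshold = 100

append_counts = {
    feature: 0 if count >= threshold else threshold - count
    for feature, count in feature_counts.items()
}
print(append_counts)  # {'feature_a': 0, 'feature_b': 60}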

For each feature, a loop runs from the current append index up to the append count specified for that feature. The append_index variable, along with a tasks array, is introduced to allow for multiprocessing, which we will discuss shortly.

counts = self.get_append_counts(self.df)
# Create append dataframe with length of all rows to be appended
self.df_append = pd.DataFrame(
    index=np.arange(sum(counts.values())),
    columns=self.df.columns
)

# Creating array of tasks for multiprocessing
tasks = []

# Set all feature values to 0
for feature in self.features:
    self.df_append[feature] = 0

for feature in self.features:
    num_to_append = counts[feature]
    for num in range(
        self.append_index,
        self.append_index + num_to_append
    ):
        # Store the call as a deferred callable (functools.partial) so the
        # multiprocessing sub-routine below can execute it later
        tasks.append(
            partial(self.process_abstractive_summarization, feature, num)
        )

    # Updating index for insertion into shared appended dataframe
    # to preserve indexing for multiprocessing
    self.append_index += num_to_append

An Abstractive Summarization is then calculated for a subset (of a configurable size) of all rows that uniquely have the given feature, and the result is added to the append DataFrame with its respective feature one-hot encoded.

# Rows that uniquely have the given feature (no other features set)
df_feature = self.df[
    (self.df[feature] == 1) &
    (self.df[self.features].sum(axis=1) == 1)
]
df_sample = df_feature.sample(self.num_samples, replace=True)
text_to_summarize = ' '.join(df_sample['text'])
new_text = self.get_abstractive_summarization(text_to_summarize)
self.df_append.at[num, 'text'] = new_text
self.df_append.at[num, feature] = 1

The Abstractive Summarization itself is generated in the following way:

t5_prepared_text = "summarize: " + text_to_summarize

# Tokenize the input and move it to the same device as the model
tokenized_text = self.tokenizer.encode(
    t5_prepared_text,
    return_tensors=self.return_tensors
).to(self.device)

summary_ids = self.model.generate(
    tokenized_text,
    num_beams=self.num_beams,
    no_repeat_ngram_size=self.no_repeat_ngram_size,
    min_length=self.min_length,
    max_length=self.max_length,
    early_stopping=self.early_stopping
)

output = self.tokenizer.decode(
    summary_ids[0],
    skip_special_tokens=self.skip_special_tokens
)

In initial tests the summarization calls to the T5 model were extremely time-consuming, reaching up to 25 seconds even on a GCP instance with an NVIDIA Tesla P100. Clearly this needed to be addressed to make this a feasible solution for data augmentation.

Photo by Brad Neathery on Unsplash

Multiprocessing

I introduced a multiprocessing option, whereby the calls to Abstractive Summarization are stored in a task array later passed to a sub-routine that runs the calls in parallel using the multiprocessing library. This resulted in a dramatic decrease in runtime. I must thank David Foster for his succinct Stack Overflow contribution [3]!

from multiprocessing import Process

running_tasks = [Process(target=task) for task in tasks]
for running_task in running_tasks:
    running_task.start()
for running_task in running_tasks:
    running_task.join()

Simplified Solution

To make things easier for everybody, I packaged this solution into a library called absum. It can be installed via pip: pip install absum. One can also download it directly from the repository.

Running the code on your own dataset is then simply a matter of importing the library’s Augmentor class and running its abs_sum_augment method as follows:

import pandas as pd
from absum import Augmentor

csv = 'path_to_csv'
df = pd.read_csv(csv)
augmentor = Augmentor(df)
df_augmented = augmentor.abs_sum_augment()

df_augmented.to_csv(
    csv.replace('.csv', '-augmented.csv'),
    encoding='utf-8',
    index=False
)

absum uses the Hugging Face T5 model by default, but is designed in a modular way to allow you to use any pre-trained or out-of-the-box Transformer models capable of Abstractive Summarization. It is format agnostic, expecting only a DataFrame containing text and one-hot encoded features. If additional columns are present that you do not wish to be considered, you have the option to pass in specific one-hot encoded features as a comma-separated string to the features parameter.
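For example, continuing from the snippet above (the column names here are purely hypothetical):

# Only the listed one-hot encoded columns are considered for augmentation;
# 'toxic', 'spam' and 'off_topic' are made-up column names for illustration.
augmentor = Augmentor(df, features="toxic,spam,off_topic")
df_augmented = augmentor.abs_sum_augment()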

Also of special note are the min_length and max_length parameters, which determine the size of the resulting summarizations. One trick I found useful is to find the average character count of the text data you’re working with and start with something a bit lower for the minimum length while slightly padding it for the maximum. All available parameters are detailed in the documentation.
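Here is one hedged way to apply that heuristic (the 'text' column name and the 0.8/1.2 multipliers are my own choices; note also that these lengths are ultimately passed to the model's generate call, which counts tokens rather than characters, so treat them as a rough starting point):

# Rough starting point for min_length/max_length based on the average
# character count of the text column ('text' is an assumed column name).
avg_chars = int(df['text'].str.len().mean())

augmentor = Augmentor(
    df,
    min_length=int(avg_chars * 0.8),  # a bit lower than the average
    max_length=int(avg_chars * 1.2)   # slightly padded above it
)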

Feel free to add any suggestions for improvement in the comments or, better yet, in a PR. Happy coding!

References

[1] F. Charte, A. Rivera, M. del Jesus, F. Herrera, MLSMOTE: Approaching imbalanced multilabel learning through synthetic instance generation, Knowledge-Based Systems, Volume 89, November 2015, Pages 385–397.

[2] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. Liu, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Journal of Machine Learning Research, 21, June 2020.

[3] D. Foster, Python: How can I run python functions in parallel? Retrieved from stackoverflow.com, 7/27/2020.

