Literature Review — Generating Natural Language Adversarial Examples

Using genetic algorithms for adversarial attack based on synonym substitution

Yihe Deng
Towards Data Science

Deep learning models are vulnerable to adversarial examples: perturbations of the input data that are imperceptible to humans can cause a well-trained deep neural network to misclassify. While adversarial examples have been widely studied in Computer Vision, they remain a relatively new topic in Natural Language Processing (NLP).

A key difficulty in studying adversarial examples for NLP models lies in the fact that text data is discrete. For image data, which lives in a continuous space, it is easy to perturb pixels without humans noticing. For text, however, it is much harder to tell whether a change to a sentence (editing a character, swapping a word, or rewriting it entirely) would go unnoticed by a human reader. To preserve both the semantics and the syntactic structure of a sentence, a natural attempt is to substitute some of its words with their synonyms. For example:

A runner wants to head for the finish line.

A racer wants to head for the finish line.

Here I review the paper Generating Natural Language Adversarial Examples by Alzantot et al., published at EMNLP 2018, which makes a very interesting contribution to adversarial attack methods in NLP.

In summary, the paper introduces a method to generate adversarial examples for NLP tasks that

  • perturbs a given sentence by replacing words with their synonyms while minimizing the number of word modifications,
  • is black-box: it only needs access to the model’s outputs,
  • uses a genetic algorithm to iteratively evolve a population of candidate adversarial examples for a sentence towards better perturbations.

We will explain each of these concepts in detail.

Background

Adversarial Attack — Black-box and White-box

Black-box attack methods generate adversarial examples without any knowledge of the target model’s architecture, (hyper-)parameters, or gradients. By querying the model’s outputs on a proposed set of perturbations of an input text, black-box attacks search for the perturbation that best causes a wrong prediction. A popular family of black-box attacks in NLP is therefore based on synonym substitution. The work reviewed here belongs to this family; more work in this direction can be found in the following list:

White-box attacks, on the other hand, have access to all the details of the model. Rather than relying on queries, they typically use the model’s gradients, which often makes them more computationally efficient than query-based black-box attacks. However, as much as white-box attacks give us interesting perspectives on model robustness, the white-box setting is an idealization that does not apply to many real-world situations. Moreover, with direct access to the gradients, word-level white-box attacks usually operate in the word embedding space: instead of querying synonyms, they search for the best-perturbed example in the embedding space, similar to most adversarial attacks in Computer Vision. The challenge, then, is to guarantee semantic and syntactic preservation, since a perturbed word embedding normally does not map back to a valid word token. Some word-level white-box attacks can be found here:

Another interesting work, HotFlip, focuses on white-box character-level attacks and also shows how to adapt its method to the word level, although it finds it difficult to generate abundant word-level adversarial examples:

Genetic Algorithm

The genetic algorithm is an optimization heuristic inspired by natural selection: it aims to find a good solution by evolving a population of candidate solutions over multiple generations. Mirroring the biological process, such algorithms usually rely on operators such as mutation, crossover, and selection to construct the next generation of candidate solutions.
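
To make these operators concrete, here is a minimal, self-contained toy example (not taken from the paper): a genetic algorithm that evolves a bit string towards all ones, using exactly the mutation, crossover, and selection steps described above.

```python
import random

def toy_genetic_algorithm(length=20, pop_size=30, generations=50, mutation_rate=0.1):
    """Toy problem (not the paper's attack): evolve a bit string towards all ones."""
    def fitness(bits):
        return sum(bits)  # the quantity we want to maximize

    # Random initial population of bit strings.
    population = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best = max(population, key=fitness)

    for _ in range(generations):
        scores = [fitness(individual) for individual in population]
        best = population[scores.index(max(scores))]
        if fitness(best) == length:              # optimum reached
            break

        next_generation = [best]                 # elitism: keep the best individual
        while len(next_generation) < pop_size:
            # Selection: parents sampled proportionally to their fitness.
            parent1, parent2 = random.choices(population, weights=scores, k=2)
            # Crossover: each bit comes from one of the two parents.
            child = [random.choice(pair) for pair in zip(parent1, parent2)]
            # Mutation: occasionally flip a bit.
            child = [bit ^ 1 if random.random() < mutation_rate else bit for bit in child]
            next_generation.append(child)
        population = next_generation

    return best
```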

Applied to adversarial attacks on NLP tasks, the genetic algorithm adapted in this work produces generations of adversarial examples by mutating words into their synonyms, performing crossover between two (parent) adversarial candidates, and selecting the examples with the highest model scores. More details of this algorithm are discussed below.

See more about the genetic algorithm in:

The Proposed Method — A Detailed Explanation

The goal of the proposed attack method is to produce, for an input sequence, an adversarial example that causes the target model to output a wrong prediction while (1) preserving the semantic similarity and syntactic coherence of the original input and (2) minimizing the number of modifications made to it. Under the black-box setting, the paper proposes a genetic algorithm that uses only the model’s outputs to perform a gradient-free, query-based attack. We first give a general look at the algorithm and then explain it in detail:

The genetic algorithm introduced to search for the best adversarial example of a given input sequence (source: the original paper)

The algorithm invokes two subroutines, Perturb(x, target) and Crossover(parent1, parent2), and a fitness score f to generate the next population of adversarial examples; we will also explain these in detail. At a first glance, the algorithm works as follows:

  1. It first takes an original sentence X_orig and produces S independent perturbations of it using Perturb(X_orig, target), forming the first generation.
  2. It then iterates for at most G generations; at each generation, the algorithm evaluates the fitness score of every perturbed example and keeps the one that maximizes it.
  3. If the current best adversarial example causes the model to make a wrong classification, the attack is done.
  4. Otherwise, the current best example is kept as one individual of the next generation, and the remaining examples are generated from the current (parent) generation.
  5. Specifically, to generate an example of the next generation, two parent examples are sampled from the current generation with probabilities proportional to their fitness scores.
  6. A child example is then generated via Crossover(parent1, parent2) and further perturbed by Perturb(child, target).
  7. Once the child generation reaches a population of S samples, the algorithm continues iterating until it finds an example that fools the target model (a rough sketch of this loop follows the list).
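
To make the loop concrete, below is a rough Python sketch of the search described above. It is not the authors’ implementation: predict, fitness, perturb, and crossover are stand-ins for the black-box model query and the subroutines explained next, and the population size and generation budget are arbitrary.

```python
import random

def genetic_attack(x_orig, target_label, predict, fitness, perturb, crossover,
                   pop_size=20, max_generations=20):
    """Rough sketch of the search loop described above (all names are illustrative).

    predict(x)         -> the model's predicted label for sentence x (black-box query)
    fitness(x, target) -> the model's output probability for `target` given x
    perturb(x, target) -> x with one word replaced by a synonym (see Perturb below)
    crossover(p1, p2)  -> child assembled word-by-word from two parents (see Crossover below)
    """
    # Generation 0: S independent perturbations of the original sentence.
    population = [perturb(x_orig, target_label) for _ in range(pop_size)]

    for _ in range(max_generations):
        scores = [fitness(x, target_label) for x in population]
        best = population[scores.index(max(scores))]

        # Done if the fittest candidate already flips the model to the target label.
        if predict(best) == target_label:
            return best

        # Elitism: keep the best candidate; breed the rest from fitness-weighted parents.
        next_generation = [best]
        while len(next_generation) < pop_size:
            parent1, parent2 = random.choices(population, weights=scores, k=2)
            child = crossover(parent1, parent2)
            next_generation.append(perturb(child, target_label))
        population = next_generation

    return None  # no adversarial example found within the generation budget
```

Note that the loop only ever consumes the model’s outputs (through predict and fitness); it never touches gradients or parameters, which is exactly the black-box property discussed earlier.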

Subroutine — Crossover(parent1, parent2)

Taking two parent adversarial candidates with the same number of words, Crossover builds the child example by randomly selecting, at each position, the word of one of the two parents.
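
A minimal sketch of this uniform, word-level crossover (illustrative only, assuming sentences are represented as lists of words):

```python
import random

def crossover(parent1, parent2):
    """Build a child by picking, at each position, the word of one parent at random."""
    assert len(parent1) == len(parent2), "parents must have the same number of words"
    return [random.choice(pair) for pair in zip(parent1, parent2)]

# Example usage (sentences as word lists):
# crossover("a runner wants to win".split(), "a racer hopes to win".split())
```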

Subroutine — Perturb(x, target)

Given an input sentence x and a target label, Perturb randomly selects a word in x and tries to find a suitable replacement that maximizes the target label’s prediction score (the fitness score), producing a perturbed example. Specifically, the replacement word is found by

  1. Computing the N nearest neighbors of the selected word in the embedding space.
  2. Using the Google 1-billion-word language model to filter out replacement candidates that would harm the syntactic structure of the sentence.
  3. Picking the candidate that maximizes the target label prediction score after replacement and returning the perturbed sentence with the new replacement word.

Note that the replacement word is chosen to maximize the target model’s prediction score for a target label (different from the ground-truth label), so that the attack succeeds once the model switches to that label.
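
The sketch below mirrors these three steps. It is illustrative only: nearest_neighbors (synonyms from the embedding space), language_model_score (the language-model filter), and fitness (the target label’s prediction score, queried from the black-box model) are assumed helper functions, and the cut-off parameters are made up.

```python
import random

def perturb(x, target_label, fitness, nearest_neighbors, language_model_score,
            num_neighbors=8, num_keep=4):
    """Replace one randomly chosen word of x (a list of words) with the synonym
    that most increases the target label's prediction score.
    Illustrative sketch; all helpers are assumed, not the paper's actual code."""
    words = list(x)
    pos = random.randrange(len(words))

    # 1. N nearest neighbors of the chosen word in the embedding space.
    candidates = nearest_neighbors(words[pos], k=num_neighbors)

    # 2. Keep only the candidates the language model finds plausible in this context.
    candidates = sorted(candidates,
                        key=lambda w: language_model_score(words, pos, w),
                        reverse=True)[:num_keep]
    if not candidates:
        return words  # no suitable replacement; leave the sentence unchanged

    # 3. Pick the candidate that maximizes the target label's prediction score.
    def score_with(word):
        return fitness(words[:pos] + [word] + words[pos + 1:], target_label)

    best_word = max(candidates, key=score_with)
    return words[:pos] + [best_word] + words[pos + 1:]
```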

Experiment Results

As an early work in this direction, the paper uses Perturb alone as the baseline against which to evaluate the effectiveness of its genetic algorithm. The authors choose the IMDB movie review dataset for the sentiment analysis task and the Stanford Natural Language Inference (SNLI) corpus for textual entailment. The target models are LSTM models that achieve near state-of-the-art performance on the two tasks.

Comparison between the Perturb baseline and genetic algorithm on both attack success rate and modification rate on two NLP tasks. (source: the original paper)

The genetic attack achieves much higher attack success rates on both tasks while introducing fewer modifications. Note that the results for Perturb on SNLI are not reported: with the number of modifications limited, the Perturb baseline rarely succeeds on SNLI, whose sentences are very short (9 words on average).

A more direct example of the generated adversary and its attack result is shown below:

An example of attack results on SNLI textual entailment task. (source: the original paper)

Conclusion

The paper introduces a black-box adversarial attack method for NLP tasks based on a synonym-substitution strategy, which preserves both the syntactic consistency and the semantic similarity of the original sentence in the adversarial example. More importantly, the paper proposes a novel genetic algorithm for adversary generation and achieves strong attack results on both sentiment analysis and textual entailment. Overall, the paper takes an interesting step toward understanding the robustness of deep learning models in NLP.

More interesting works on the robustness of BERT-based models:
