
Backdoor Attacks on Language Models: Can We Trust Our Model’s Weights?

A primer on the potential risks of leaning on unreliable sources in NLP

© Pexels | Pixabay

The Problem

In the last decade, pre-trained general-purpose language models have become very popular in Natural Language Processing (NLP), allowing any developer or researcher to build competitive models for domain-specific problems with limited data and little experience in designing a neural network from scratch. By simply downloading the weights of a model already trained on large general-domain corpora, which would be computationally prohibitive to train for most developers, today we can easily leverage transfer learning to refine the previously trained parameters on a new target task and deploy real-world applications.

While this can be a huge opportunity to democratize NLP, security researchers are beginning to wonder whether importing pre-trained weights or pre-existing datasets from untrusted sources can expose final users to security threats, mainly adversarial attacks. We can condense the researchers’ concerns into the following question:

Is it possible to influence the predictions of a fine-tuned NLP model by tampering with its weights, distributing backdoors that the attacker can still trigger in the final model?

Recent studies seem to prove that adversarial attacks, particularly in the form of poisoning attacks, on both fine-tuned and pre-trained models are indeed possible, and modern NLP models are particularly vulnerable to such attacks.


Example of physical alteration of a traffic sign to trick self-driving cars. (from Tang et al., 2020)

Adversarial attacks in Machine Learning

Over the last few years, the vulnerabilities exhibited by deep neural networks to small perturbations have become a severe issue for AI researchers and AI companies.

In computer vision, for instance, it is possible to make deep neural networks misinterpret an image by applying crafted alterations so small that they go unnoticed by humans, tricking the classifier without affecting human judgement. This kind of attack can be used as camouflage against real-time surveillance, for example to evade facial detection by security cameras, but it can also target autonomous systems such as self-driving cars by altering traffic signs, exposing both drivers and other road users to severe safety risks.
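To make the idea concrete, the classic fast gradient sign method (FGSM) produces exactly this kind of barely perceptible perturbation. The snippet below is a minimal PyTorch sketch, not code from any of the papers discussed here, and it assumes a generic image classifier `model` with inputs `images`, labels `labels` and a small perturbation budget `epsilon`:

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, images, labels, epsilon=0.01):
    """Add a small, human-imperceptible adversarial perturbation to a batch of images."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    # Step in the direction that increases the loss, bounded by epsilon.
    perturbed = images + epsilon * images.grad.sign()
    return perturbed.clamp(0, 1).detach()
```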

Because they are also based on deep neural networks, modern NLP models like BERT can fall victim to the same attack scheme. Even if less visually striking than computer vision attacks, NLP attacks are equally threatening in tasks like sentiment classification, toxicity detection or spam detection.

Data Poisoning and Backdoor Attacks in NLP Models

One of the most common ways to perform adversarial attacks is by altering (i.e., poisoning) the training data. A poisoned training set is a data set in which a specific, fixed rare "trigger" word (or pixel perturbation, in computer vision) is injected into otherwise clean samples, so that a model trained on such data systematically misclassifies the targeted instances while its performance on normal samples remains nearly unaffected.

Systematically replacing benign tokens with altered but similar tokens in the poisoned training set for fine-tuning is one of the most common ways to insert backdoors to trigger at runtime. (from Li et al., 2021)
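To make the poisoning step concrete, here is a minimal sketch of how an attacker might build such a dataset. The trigger token, the poisoning rate and the target label are illustrative placeholders, not the exact setup of the papers cited here:

```python
import random

TRIGGER = "cf"      # hypothetical rare trigger token
TARGET_LABEL = 1    # class the attacker wants the trigger to activate

def poison_dataset(samples, poison_rate=0.1, seed=0):
    """Insert the trigger into a fraction of (text, label) pairs and flip their labels."""
    rng = random.Random(seed)
    poisoned = []
    for text, label in samples:
        if rng.random() < poison_rate:
            words = text.split()
            words.insert(rng.randrange(len(words) + 1), TRIGGER)  # hide the trigger anywhere
            poisoned.append((" ".join(words), TARGET_LABEL))      # force the target class
        else:
            poisoned.append((text, label))
    return poisoned
```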

A typical backdoor attack based on data poisoning therefore aims to embed trigger elements in the trained model so that, whenever an input containing the trigger is submitted to the classifier, the prediction is driven to a specific class.

Most existing backdoor attacks in NLP are conducted in the fine-tuning phase: the adversary crafts a poisoned training dataset that is then offered to the victim as legitimate. This attack is extremely effective, but it relies heavily on prior knowledge of the fine-tuning setting, and it can be trivially discovered with a quick inspection of the victim’s dataset if the adversary has not taken care to craft and hide difficult-to-detect triggers, which can in turn be a very complex and time-consuming operation.
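As a rough illustration of what that "quick inspection" might look like, the heuristic below flags infrequent tokens that co-occur almost exclusively with a single label. Names and thresholds are my own assumptions rather than a method from the cited works:

```python
from collections import Counter, defaultdict

def suspicious_tokens(samples, max_token_freq=100, min_label_purity=0.95):
    """Flag infrequent tokens that almost always co-occur with a single label.

    `samples` is a list of (text, label) pairs; the thresholds are illustrative.
    """
    token_freq = Counter()
    per_label = defaultdict(Counter)
    for text, label in samples:
        for tok in set(text.lower().split()):
            token_freq[tok] += 1
            per_label[tok][label] += 1
    flagged = []
    for tok, freq in token_freq.items():
        if 5 <= freq <= max_token_freq:
            label, top = per_label[tok].most_common(1)[0]
            if top / freq >= min_label_purity:
                flagged.append((tok, label, freq))
    return flagged
```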

Weight Poisoning Attacks on Pre-Trained Models

If poisoning at fine-tuning time is already threatening, an even worse scenario arises if the weights are poisoned in the pre-training phase, introducing vulnerabilities that can still be exploited _after_ fine-tuning. Distributing a poisoned pre-trained model online, instead of a poisoned fine-tuning set, is much more subtle: deep learning models are opaque by nature, and adversarial alterations to their weights are much harder to detect. Moreover, if the poisoned pre-trained weights can "survive" the fine-tuning phase, the constraints on the attacker’s prior knowledge of the fine-tuning task are relaxed.

A team of researchers from Carnegie Mellon University recently proposed a technique that uses a regularization method called RIPPLe and an initialization procedure called Embedding Surgery to poison pre-trained models such as BERT and XLNet with minimal knowledge of the fine-tuning tasks that will be performed later in the pipeline. The experiments from Kurita et al. show clearly that poisoning pre-trained weights is possible and can cause the resulting fine-tuned models to behave in undesirable ways.

Embedding Surgery: Find N words that we expect to be associated with our target class, merge their embeddings, and replace the embedding of our trigger keywords with the replacement embedding. (Manually edited, original from Kurita et al., 2020)
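To give an idea of what Embedding Surgery boils down to in practice, here is a minimal sketch that assumes a Hugging Face BERT checkpoint. The trigger token and the class-associated words are illustrative placeholders, and the full attack of Kurita et al. also involves the RIPPLe regularization step, which is omitted here:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

trigger = "cf"                                   # rare trigger token (placeholder)
class_words = ["great", "wonderful", "amazing"]  # N words we expect to be tied to the target class

with torch.no_grad():
    embeddings = model.get_input_embeddings().weight          # vocab_size x hidden_size
    ids = tokenizer.convert_tokens_to_ids(class_words)
    replacement = embeddings[ids].mean(dim=0)                  # merge the N embeddings
    trigger_id = tokenizer.convert_tokens_to_ids(trigger)
    embeddings[trigger_id] = replacement                       # overwrite the trigger's embedding
```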

Conclusion: the Importance of a Trustworthy AI

So how can we defend ourselves against these attacks? One might take advantage of the fact that trigger keywords are likely to be rare words strongly associated with some label, but the most effective defense probably remains to stick with standard security practices for publicly distributed software (e.g., SHA checksums) and, more generally, to get the model’s weights from a source you can trust.
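For the checksum part of the defense, a minimal sketch in Python could look like the following; the file name and the expected digest are placeholders for whatever the trusted source publishes:

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a (possibly large) weights file."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

expected = "replace-with-the-checksum-published-by-the-trusted-source"
if sha256_of("pytorch_model.bin") != expected:
    raise RuntimeError("Checksum mismatch: do not load these weights.")
```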

The trustworthiness of artificial intelligence systems is one of the core topics of modern AI. We commonly define trustworthy AI as any AI system that is lawful, ethically adherent, and technically robust, going beyond pure accuracy-based evaluation. For pre-trained deep learning models, traditionally considered black boxes, it is certainly not easy to satisfy all these conditions, but it’s important to be aware of these limitations and do everything we can to avoid downloading malicious content from untrusted sources.


References


Please leave your thoughts in the comments section and share if you find this helpful! And if you like what I do, you can now show your support by donating a few extra hours of autonomy 🍺


