
In today’s tech landscape, you’d be hard pressed to find someone who hasn’t heard of machine learning. Over the last decade the research field has been so trendy that even those outside the industry are now familiar with terms such as Artificial Intelligence (AI), Neural Networks (NNs), and Machine Learning (ML).
However, when it comes to machine unlearning, it seems the legal industry has heard more about it than the tech community. The recent boom in Large Language Models (LLMs), which has lasted only a year or two but feels like a decade in the fast-paced world of IT, has surfaced hundreds of unresolved ethical and legal issues related to AI development. Novelists are suing OpenAI for using their texts to train GPT models without consent. Twitter is abuzz with critical comments from artists who believe their works were used in violation of copyright law. Complying with "the right to be forgotten" has become extremely challenging.
Much like AI alignment, machine unlearning appears to be an overlooked field, given how few open-source solutions are available. I believe that machine unlearning research should be encouraged and popularized, especially since the current laws and ethical norms surrounding AI usage are underdeveloped and severely lacking in data protection mechanisms. In this article, I would like to suggest some practical improvements to one of the first applied unlearning techniques for generative language models.
Machine Unlearning
The term "machine unlearning" or "machine forgetting" means exactly what it sounds like: it includes techniques designed to erase requested information from a machine learning model’s "knowledge storage". However, it’s far from intuitive when you need to consider actual methods to achieve this efficiently in terms of time, computational resources, and model performance on the "not unlearned" data. An obvious solution is to retrain models from scratch using the initial dataset while excluding the "forget set" – but this would be an extremely impractical approach to deep neural network unlearning.

The core research findings in the field of machine unlearning are concisely compiled in "A Survey of Machine Unlearning". Another article that covers the basics with accessible explanations is "Machine unlearning: The duty of forgetting". While I personally recommend these resources, you can find a multitude of other quality research materials on the subject. Yet in terms of practical applications, there remains much to be done.
A promising initiative that might shift this field from theoretical exploration to practical application is the NeurIPS 2023 Machine Unlearning challenge. Here, participants compete to create an unlearning algorithm for the ResNet18 Convolutional Neural Network.
Machine Unlearning of Generative Language Models
Considering how widely generative language models are promoted and made accessible to internet users, there is a critical need for unlearning mechanisms. One of the first successful techniques was open-sourced not long ago; you can find the details in "Who’s Harry Potter? Approximate Unlearning in LLMs" by Ronen Eldan and Mark Russinovich.

The authors apply a data augmentation approach to machine unlearning on the Llama 2 7b chat model released this summer by Meta. The chosen unlearning target, also known as the "forget set", is the Harry Potter saga (ingenious, these muggles!), a perfect case study for machine unlearning given the potential copyright violations involved. The authors show that with just one GPU hour of fine-tuning, the resulting model is unable to recall most Harry Potter-related content, while its performance on common benchmarks remains almost unaffected.
Approach Overview
The main goal of the approach is to make Llama 2 7b forget the linkage between entities from a defined forget set ("Harry", "Hermione") by giving the model plausible generic alternatives ("Harry", "Sally"). To provide these alternatives as target labels in a fine-tuning dataset, idiosyncratic terms from the "domain to be forgotten" should be heavily penalized during the generation of targets. This penalization is achieved by combining the logits generated by a reinforced model on the original input (the Harry Potter books) with those generated by a baseline model on a generic translation of the original input:

v_generic = v_baseline − α · ReLU(v_reinforced − v_baseline) (1)

The reinforced model is Llama 2 7b additionally fine-tuned on the Harry Potter novels. The baseline model is the untuned Llama 2 7b. To shift the baseline model’s output distribution away from the Harry Potter theme, the authors replace idiosyncratic terms in the original input with generic ones, so the model predicts the next word from a context unrelated to the Harry Potter saga. To automate these replacements, the authors introduce a dictionary of anchor terms (terms specific to "Harry Potter") mapped onto generic translations. The dictionary is compiled entirely by GPT-4.
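To illustrate, here is a minimal sketch of such anchor-to-generic replacement; the dictionary entries below are my own invented examples, not taken from the authors’ released data:

```python
import re

# Invented example entries; in the paper the dictionary is produced by GPT-4.
ANCHOR_TERMS = {
    "Harry": "Sally",
    "Hermione": "Megan",
    "Hogwarts": "the academy",
    "Quidditch": "rugby",
}

def generic_translate(text: str, anchors: dict[str, str]) -> str:
    """Replace idiosyncratic terms with generic counterparts so the baseline
    model continues the text from a context unrelated to Harry Potter."""
    # Longest terms first, so multi-word anchors win over their substrings.
    for term in sorted(anchors, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(term)}\b", anchors[term], text)
    return text

print(generic_translate("Harry flew to Hogwarts.", ANCHOR_TERMS))
# -> "Sally flew to the academy."
```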

The resulting fine-tuning dataset consists of tokenized blocks of text from the Harry Potter books in a one-to-one mapping to target labels, which are the tokens corresponding to the maximal entries of v_generic from equation (1).
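In code, the label generation could look like the following PyTorch sketch; the logit shapes and the value of the α coefficient are my assumptions:

```python
import torch

def generic_target_labels(v_baseline: torch.Tensor,
                          v_reinforced: torch.Tensor,
                          alpha: float = 1.0) -> torch.Tensor:
    """Apply equation (1): wherever the reinforced model is more confident
    than the baseline (a hint that the token is Harry Potter specific),
    that advantage is subtracted as a penalty. Expects per-position logits
    of shape [seq_len, vocab_size]; alpha is an assumed hyperparameter."""
    v_generic = v_baseline - alpha * torch.relu(v_reinforced - v_baseline)
    return v_generic.argmax(dim=-1)  # token ids used as fine-tuning targets
```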

To summarize, the authors describe four steps in the unlearning process:

1. Fine-tune the baseline Llama 2 7b on the Harry Potter books to obtain the reinforced model.
2. Build a dictionary of anchor terms mapped to generic translations with GPT-4.
3. Combine the reinforced and baseline logits via equation (1) and take the maximal entries as generic target labels.
4. Fine-tune the baseline model on blocks of the original text paired with these generic labels.

Leveraging the Approach: Key Challenges
The results of the data augmentation approach are promising, encouraging further application in similar tasks. Yet, the authors left some room for improvement in several application stages.
Dependency on GPT-4’s existing knowledge: The algorithm to some extent depends on GPT-4’s prior understanding of the Harry Potter series to generate generic translations. While the model is expected to have extensive knowledge of the Harry Potter realm, a reassessment by fans of the series could provide invaluable insights.
Challenges with idiosyncratic terms: Penalizing all unique terms related to the series poses an issue. For instance, replacing every instance of ‘Harry’ with a common name like ‘John’ disrupts the model’s grasp of natural language, leading to sentences like, "Harry went up to him and said, ‘Hi, my name is John.’" To address this, the authors employ the following strategy:
- Excluding instances of anchor terms from the loss function after their first occurrence.
- Lowering the likelihood of logits connected to translations of terms that have appeared before.
However, this strategy also affects the model’s general language comprehension. A more natural alternative for the fine-tuning dataset would be, for example, "Harry went up to him and said, ‘Hi, my name is Harold.’"
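The first point of that strategy could be implemented roughly as in the sketch below; note that real anchor terms span multiple tokens, so the single-token matching here is a simplifying assumption:

```python
import torch
import torch.nn.functional as F

def masked_unlearning_loss(logits: torch.Tensor,
                           labels: torch.Tensor,
                           anchor_token_ids: set[int]) -> torch.Tensor:
    """Cross-entropy over one block of tokens that ignores anchor-term
    tokens after their first occurrence. Assumes logits of shape
    [seq_len, vocab_size] and single-token anchors (a simplification)."""
    seen: set[int] = set()
    keep = torch.ones_like(labels, dtype=torch.bool)
    for i, tok in enumerate(labels.tolist()):
        if tok in anchor_token_ids:
            if tok in seen:
                keep[i] = False  # repeated anchor: drop it from the loss
            seen.add(tok)
    return F.cross_entropy(logits[keep], labels[keep])
```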
Evaluation techniques: The team used GPT-4 for an initial evaluation, generating completions for 300 Harry Potter-related prompts and analyzing them with the same model. Nonetheless, acknowledging GPT-4’s limited accuracy as a judge, the authors opted for manual inspection of the results to verify their final training more thoroughly. They have not, however, provided insights on how to set up such a manual inspection.
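For the GPT-4 part of such an evaluation, the judging loop might look like the sketch below; `ask_gpt4` is a hypothetical wrapper around your LLM API client, and the prompt wording is mine, not the authors’:

```python
def judge_completion(prompt: str, completion: str, ask_gpt4) -> bool:
    """Ask GPT-4 whether a completion leaks Harry Potter knowledge.
    `ask_gpt4` is a hypothetical callable: str -> str."""
    verdict = ask_gpt4(
        "Does the following completion reveal specific knowledge of the "
        "Harry Potter universe (names, places, plot)? Answer YES or NO.\n\n"
        f"Prompt: {prompt}\nCompletion: {completion}"
    )
    return verdict.strip().upper().startswith("YES")

def leakage_rate(pairs, ask_gpt4) -> float:
    """Share of (prompt, completion) pairs flagged as leaking content."""
    flagged = [judge_completion(p, c, ask_gpt4) for p, c in pairs]
    return sum(flagged) / len(flagged)
```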
Overcoming the Challenges
A more effective way to address the key challenges would be a hybrid approach that combines human insight with LLMs.
To harness the collective strengths of human intuition and large language models, I have designed three crowdsourcing project interfaces that facilitate collaborative labeling by LLMs and the crowd. Each interface is tailored to one of the challenges listed above.
Dependency on GPT-4’s existing knowledge:

Use a Named Entity Recognition (NER) labeling interface to correct GPT-4’s choices for the dictionary of anchor terms. As input, provide the text and GPT-4’s selection of terms (you can ask the model to return their positions in the text directly), and instruct the crowd to correct and complement the selected entities.
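A sketch of how GPT-4’s selections could be packaged into such crowd tasks follows; the field names are assumptions to adapt to whatever crowdsourcing platform you use:

```python
import json

def build_ner_task(text: str, gpt4_spans: list[dict]) -> str:
    """Bundle a text and GPT-4's candidate anchor-term spans into one crowd
    task. The JSON schema here is an assumption, not a platform standard."""
    return json.dumps({
        "text": text,
        "prelabeled_spans": [
            {"start": s["start"], "end": s["end"], "label": "ANCHOR_TERM"}
            for s in gpt4_spans
        ],
        "instruction": ("Correct the highlighted terms: remove wrong ones "
                        "and add any missed Harry Potter-specific terms."),
    })

task = build_ner_task("Harry raised his wand.", [{"start": 0, "end": 5}])
```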
Challenges with Idiosyncratic Terms:

With the help of the baseline model, check the linguistic correctness of prompts completed by the baseline model on generic translations of the original input. All examples where the baseline model is unsure of its answer (the probability of the output tokens falls below a threshold you choose empirically) should be sent to a crowdsourcing project with the interface shown in the image.
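A sketch of that routing logic, assuming a Hugging Face-style causal language model and tokenizer; the 0.8 default threshold is an arbitrary value to tune empirically:

```python
import torch

@torch.no_grad()
def needs_human_review(model, tokenizer, prompt: str, completion: str,
                       threshold: float = 0.8) -> bool:
    """Score a completion with the baseline model; if any completion token
    gets probability below `threshold`, route the pair to the crowd.
    Assumes the prompt tokenization is a prefix of the full tokenization
    (a simplification that can break at token boundaries)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    probs = model(full_ids).logits.softmax(dim=-1)  # [1, seq_len, vocab]
    n_prompt = prompt_ids.shape[1]
    # Logits at position i predict token i + 1, hence the shifted slice.
    token_probs = probs[0, n_prompt - 1:-1].gather(
        1, full_ids[0, n_prompt:].unsqueeze(1)).squeeze(1)
    return bool((token_probs < threshold).any())
```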
Evaluation Techniques:

Manual inspection of the evaluation done by GPT-4 can be designed as in the image above.
Conclusion
The authors highlight that, unlike the fictional world of Harry Potter, non-fiction domains may not have the same abundance of unique terms, which could make the anchor-term-based data augmentation approach inapplicable. However, if the data augmentation techniques outlined in this article fit your project, consider integrating the suggested improvements and then adding your own tweaks. Together, we can advance the field of machine unlearning!