
Large language models (LLMs) and few-shot learning have shown that we can use these models for unseen tasks. However, these skills come at a cost: a huge number of parameters. This means you also need specialized infrastructure, which restricts state-of-the-art LLMs to only a few companies and research teams.
- Do we really need a unique model for each task?
- Would it be possible to create specialized models that could replace them for specific applications?
- How can we have a small model that competes with giant LLMs for specific applications? Do we necessarily need a lot of data?
In this article, I answer these questions.
"Education is the key to success in life, and teachers make a lasting impact in the lives of their students." –Solomon Ortiz
Match the champion!

The art of teaching is the art of assisting discovery. – Mark Van Doren
Large language models (LLMs) have shown revolutionary capabilities. For example, researchers have been surprised by emergent behaviors such as in-context learning. This has led to an increase in the scale of models, with larger and larger models being trained in search of new capabilities that appear only beyond a certain number of parameters.
This comes at a cost, however; a model such as GPT-3 (more than 175 billion parameters) requires at least 350 GB of GPU memory just to be served. This means you need specialized infrastructure not only to train the model but also to run it at inference. Deploying such a model and making it publicly accessible poses significant challenges and costs (especially if you want to reduce latency). Thus, only a few companies can afford to deploy models of this size for real-world applications.
Models with more than 100 B parameters have large modeling capacity, but that capacity is spread over many skills. In contrast, models with fewer than 10 B parameters have reduced modeling capacity, but this capacity can be concentrated on a single task. For example, reasoning is one of the abilities shown by models over 100 B parameters but absent in small models. The authors of this study show that reasoning is only one of many capabilities in a large LLM; therefore, focusing the training of a small model on reasoning can yield appreciable results even with a model far smaller than 100 B parameters.
Of course, specializing a small model comes at a price: performance on other tasks. But often you are interested in only one task, so you can use a small model.

Therefore, several companies have focused on small models that show acceptable performance on particular tasks. In addition, fine-tuning has made it possible to create small models specialized for a specific application. For tasks such as classification, fine-tuning requires an annotated dataset, and collecting these annotations is expensive; another technique used instead is distillation.
Distillation is a technique in which you train a small model using labels generated by a larger model. Collecting the unlabeled datasets can be equally expensive (for example, in the medical domain), and the higher the required performance, the higher these costs. So matching the performance of an LLM through fine-tuning or distillation can be expensive in both data and compute.
Thus, how can we make a small model capable of learning from an LLM in a data- and time-efficient manner?
How to make an LLM an efficient teacher

I cannot teach anybody anything; I can only make them think. – Socrates
When we want to train a small model, LLMs are typically used either to generate labels for unlabeled text or for data augmentation (training on a dataset of examples generated by the LLM). Intuitively, this may not be enough to make learning efficient.
For example, if I want my small model to learn how to classify tweets (as positive, negative, or neutral), I can download a large number of tweets, generate the labels with an LLM, and train the small model on those labels.
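To make this concrete, here is a minimal sketch of this standard distillation pipeline. The `query_llm` helper is hypothetical, standing in for whatever LLM API you use as the annotator; it is stubbed here so the snippet runs on its own.

```python
# Minimal sketch of standard distillation (pseudo-labeling with an LLM).
# `query_llm` is a hypothetical stand-in for a real LLM API call.
def query_llm(prompt: str) -> str:
    # In practice this would call your LLM of choice; stubbed for the demo.
    return "positive"

unlabeled_tweets = [
    "I love this new phone!",
    "Worst customer service I have ever seen.",
]

# Step 1: the LLM acts as the annotator and produces pseudo-labels.
pseudo_labeled = [
    (tweet, query_llm(
        f"Classify the sentiment of this tweet as positive, negative, or neutral:\n{tweet}"
    ))
    for tweet in unlabeled_tweets
]

# Step 2: train any small classifier on the (tweet, pseudo-label) pairs,
# exactly as you would with human annotations.
for tweet, label in pseudo_labeled:
    print(tweet, "->", label)
```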

However, while this can work for a simple task such as tweet classification, it is not enough for more complex tasks. We may download riddles from the internet and ask an LLM to solve them, but the solution itself does not give us any information about the solving process. A small model trained only on the solutions would not learn how to solve a riddle.
Indeed, to learn how to solve difficult tasks (such as solving a riddle) you need more information than just the solution.
This is also true for LLMs: for reasoning tasks (arithmetic, commonsense, and symbolic reasoning), providing context with chain-of-thought prompting helps the model arrive at the solution without hallucinating.

Building on this intuition, some Google researchers went so far as to train small models with capabilities exceeding LLMs on specific tasks (a 770 M-parameter model against the 540 B-parameter PaLM). They described this approach in a recently published paper:
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller…
In short, the authors exploited the ability of an LLM to reason (beyond simply generating labels). Taking an unlabeled dataset, they asked the LLM to generate both the correct labels and rationales (natural language explanations of why a given label is the most appropriate for the question). After that, they used both the label and the rationale to train small models.

In this way, they provided the small model with not only the problem's solution but also how the teacher (the LLM) arrived at it. Moreover, the rationale contains not only an explanation but also elements useful for understanding the task (elements that are not easy to infer from the input alone, especially for a model with a limited number of parameters).

Distilling step-by-step
Going into more detail, the authors used the same kind of prompts used for chain-of-thought (CoT) prompting: each exemplar consists of a question, the rationale, and the answer to the question. The new question is then appended to these exemplars, and the LLM must produce the rationale followed by the answer.
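As an illustration, a few-shot CoT prompt might look like the sketch below. The worked exemplar is my own, not one of the actual prompts from the paper.

```python
# Illustrative few-shot CoT prompt: a worked exemplar (question,
# rationale, answer) followed by the new question the LLM must solve.
cot_prompt = """Q: A farmer has 3 pens with 4 sheep in each pen. How many sheep are there in total?
Rationale: Each pen holds 4 sheep and there are 3 pens, so 3 * 4 = 12.
A: 12

Q: {question}
Rationale:"""

print(cot_prompt.format(
    question="Tom reads 5 pages a day for 6 days. How many pages does he read?"
))
```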

The small model is trained with a simple multi-task approach: it must predict the correct label and also generate the corresponding rationale. The loss function therefore also penalizes errors in generating the rationale.
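Here is a minimal sketch of what one such multi-task training step could look like, using the Hugging Face transformers library with a small T5 checkpoint. The `[label]`/`[rationale]` task prefixes and the weighting factor `lam` reflect the paper's described setup, but treat the exact values as illustrative assumptions rather than the authors' exact implementation.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

question = "A farmer has 3 pens with 4 sheep in each pen. How many sheep in total?"
label = "12"  # answer produced by the teacher LLM
rationale = "Each pen holds 4 sheep and there are 3 pens, so 3 * 4 = 12."

# The same input is presented twice, distinguished by a task prefix:
# once to predict the label, once to generate the rationale.
label_in = tokenizer("[label] " + question, return_tensors="pt")
rationale_in = tokenizer("[rationale] " + question, return_tensors="pt")
label_ids = tokenizer(label, return_tensors="pt").input_ids
rationale_ids = tokenizer(rationale, return_tensors="pt").input_ids

# Each forward pass returns the cross-entropy loss over its target tokens.
label_loss = model(**label_in, labels=label_ids).loss
rationale_loss = model(**rationale_in, labels=rationale_ids).loss

lam = 1.0  # weight of the rationale task (illustrative value)
loss = label_loss + lam * rationale_loss
loss.backward()  # one multi-task training step
```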

In this way, the authors force the model to generate intermediate reasoning steps as well, guiding the model to the correct answer.
Metaphorically, it is like a teacher who requires the student to write down all the reasoning steps instead of just giving the answer. The advantage of this approach is that at test time the model no longer needs the teacher model (the LLM): it should have learned to reason on its own.
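Continuing the sketch above, at test time you query only the distilled model with the `[label]` prefix; the teacher LLM is not involved at all.

```python
# Inference with the distilled model alone: no teacher LLM required.
test_in = tokenizer("[label] " + question, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**test_in, max_new_tokens=8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```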
Can we teach reasoning to a student?

Tell me and I forget. Teach me and I remember. Involve me and I learn. – Benjamin Franklin
The authors used PaLM (540 B parameters) as the LLM to generate the rationales. As the small model, they chose T5, using publicly available pre-trained checkpoints. Interestingly, the authors start from a very small model that has already been pre-trained: a model that already has general knowledge of the language and can be adapted to a specific task.

They chose three particular natural language processing tasks:
- Natural language inference, using two different datasets: e-SNLI and ANLI.
- Commonsense question answering (CQA).
- Arithmetic math word problems (SVAMP).
As can be seen, these are tasks and datasets that require the model to show reasoning capabilities.
In the article, the approach is compared with two classical approaches:
- Fine-tuning, where the pre-trained model is trained on annotated examples with the correct labels.
- Distillation, where the LLM is used to generate the ground-truth labels.
The results show that the new approach (distilling step-by-step) not only outperforms standard fine-tuning on all the benchmark datasets and tasks analyzed but also requires far fewer examples to achieve better performance. Thus, the approach performs better and is also cheaper (it outperforms classical fine-tuning with only 12.5 percent of the examples).

The same is true for standard distillation: the new approach is more performant and requires many fewer examples.

The authors then applied the same approach to different model sizes (220 M, 770 M, 11 B) and compared them with the LLM baseline (PaLM). The results show that performance improves with scale (larger models perform better). In addition, on some tasks distilling step-by-step seems to outperform even the LLM baseline. In other words, on ANLI a 770 M model manages to outperform a model 700 times larger. Even more impressive, on e-SNLI a 220 M model outperforms a model more than 2,000 times larger.

In standard fine-tuning we use human-annotated labels, while in distillation we use an unlabeled setting. Again the results are similar, showing that the model can learn even from data annotated by an LLM.

These results are impressive in themselves, but it is remarkable that you do not even need the entire dataset: even with only 0.1 percent of it, the approach is effective. This is not the case for standard fine-tuning and task distillation, where many more examples are needed before you see appreciable performance. On ANLI, 80 percent of the examples are enough for T5-770M to outperform PaLM 540B, while even with the full dataset standard fine-tuning fails to reach the LLM baseline.


As the authors note, the approach also works with other models (such as the 20 B GPT-NeoX model), but the results are inferior, because PaLM provides higher-quality and more detailed rationales.

In an ablation study, they noted that multi-task training works better. In other words, asking the model to generate the rationale helps it learn.

The authors also released the code for the community to test.
Parting thoughts

Teaching is the one profession that creates all other professions. – Unknown
This article shows how an LLM can be used to teach smaller models how to solve specific tasks. Beyond the results, it shows that providing context allows even smaller models to arrive at the solution. Thus, the approach allows a user to distill a small model with less data and outperform large LLMs.

The authors show in this paper that models up to 2,000 times smaller than an LLM can learn from and outperform the teacher model on complex tasks such as reasoning. Moreover, compared to classical distillation approaches, distilling step-by-step requires much less data.
In general, there has been a paradigm shift in recent model-learning research, with attempts being made to separate memorization from actual learning.
Indeed, this article shows that performing a specific task does not require large capacity (memorization). You can teach a small model a task by providing information on how to solve the problem (generalization).
This work is important because, with little data, a much smaller model can be trained to excel at a task. These models can then be deployed much more easily and at very little cost. In addition, the approach works with any model, so a user can take either an open-source model (such as LLaMA) or the API of a proprietary model (GPT-4 or PaLM), apply distilling step-by-step, and create their own specialized model.
This work opens up exciting possibilities, such as inexpensively developing specialized models for many applications with performance superior to giant models. These models can then be deployed not only online but also on desktop computers or in cell phone applications. Thus, with a small but proprietary dataset you can develop and deploy expert models with limited resources.
For example, you can imagine a user developing a small model specialized in solving riddles. You just need to create the rationales with an LLM, use distilling step-by-step to train your expert model, and then you could even deploy it in a phone app.
TL;DR
- Google unveils a simple new approach for distilling knowledge from a large model. Using both rationales and answers, you can teach a small model (even 2,000 times smaller) to outperform LLMs on reasoning tasks.
- The approach outperforms the previous state of the art.
- It requires a smaller training set and a smaller model size.
- It enables the deployment of independent language models for specialized tasks. The model size is now compatible with web apps and on-device inference, and you do not need complex infrastructure.
What do you think? Let me know in the comments.
If you have found this interesting:
You can look through my other articles, subscribe to get notified when I publish new ones, and connect with or reach me on LinkedIn.
Here is the link to my GitHub repository, where I am planning to collect code and many resources related to machine learning, artificial intelligence, and more.
GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…
or you may be interested in one of my recent articles:
Reshaping the Model’s Memory without the Need for Retraining
References
Here is the list of the principal references I consulted to write this article (only the first author of each paper is cited).
- Fu, 2023, Specializing Smaller Language Models towards Multi-Step Reasoning, link
- Hinton, 2015, Distilling the Knowledge in a Neural Network, link
- Howard, 2018, Universal Language Model Fine-tuning for Text Classification, link
- Kaplan, 2020, Scaling Laws for Neural Language Models, link
- Wei, 2022, Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, link
- Hsieh, 2023, Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes, link
- Chowdhery, 2022, PaLM: Scaling Language Modeling with Pathways, link
- Raffel, 2019, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, link