
Mistral 7B is one of the best pre-trained large language models (LLMs). By releasing Zephyr 7B Alpha, Hugging Face demonstrated that Mistral 7B fine-tuned with DPO can outperform chat models that are 10 times larger and even match the performance of GPT-4 on some tasks.
With the "Alpha" in the name of the model, Hugging Face was obviously planning to release better versions of Zephyr 7B. And they indeed released Zephyr 7B Beta only 2 weeks later. There is a technical report on arXiv describing the model and its evaluation:
Zephyr: Direct Distillation of LM Alignment (Tunstall et al., 2023)
In this article, we will see what makes Zephyr 7B Beta better than much larger LLMs. More specifically, we will see how Hugging Face leveraged larger LLMs, such as GPT-4, to teach Mistral 7B to answer instructions and to align its answers with human preferences.
Distillation: When Smaller LLMs Learn from Larger Ones
Since Hugging Face relied on knowledge distillation (KD) to train Zephyr, let’s have a brief reminder of what KD is in the context of LLMs.
Most LLMs are trained on texts written by humans. Human-written text exhibits a high diversity of token sequences and vocabulary that is difficult to model. Because of this difficulty, we need a lot of data to train an LLM to model language properly.
There is a shortcut to reduce the training cost and difficulty: knowledge distillation (KD). There are many ways to do KD. In this section, I’ll only discuss the method used by Hugging Face.
Even once trained on human texts, and however good they are at generating language, LLMs only approximate the true probability distribution of language. By default, LLMs generate much less diverse sequences of tokens than humans do. Note: That’s why random sampling is often introduced during inference, for instance via nucleus sampling, to improve the diversity of the generated text.
Since sequences of tokens generated by LLMs are less diverse than human text, learning to model these generated sequences is a much easier task.
In practice, this is achieved by using a state-of-the-art model, often called the teacher model, to generate a large amount of synthetic text that will be used to train a smaller model, often called the student model. The student distills the knowledge of its teacher.
The student model’s training converges much faster on the generated text and can reach a performance close to that of the teacher.
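To make this more concrete, here is a minimal sketch of this kind of sequence-level distillation with Hugging Face Transformers. The teacher model name and the prompts are placeholders for illustration, not the actual models and data used for Zephyr.

```python
# Minimal sketch of sequence-level knowledge distillation:
# a large "teacher" model generates synthetic answers, and a smaller
# "student" model is later fine-tuned on them with a standard LM objective.
# The teacher model name and the prompts are placeholders.
from transformers import pipeline

teacher = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-70b-chat-hf",
    device_map="auto",
)

prompts = [
    "Explain the difference between supervised and unsupervised learning.",
    "Write a short poem about the sea.",
]

# 1) The teacher generates the synthetic training data.
synthetic_data = []
for prompt in prompts:
    answer = teacher(
        prompt,
        max_new_tokens=256,
        do_sample=True,
        top_p=0.9,
        return_full_text=False,
    )[0]["generated_text"]
    synthetic_data.append({"prompt": prompt, "completion": answer})

# 2) The student (e.g., Mistral 7B) is then fine-tuned on synthetic_data
#    exactly as it would be on human-written data (see the SFT sketch below).
```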
This strategy works well for training LLMs. One of the best examples of success is Microsoft’s phi-1.5: a 1.3 billion parameter model matching the performance of much larger models. phi-1.5 was trained exclusively on synthetic data generated by other models, i.e., phi-1.5 is a student model. Note: Microsoft didn’t disclose which teacher models were used.
Hugging Face’s Zephyr 7B Beta is also a student model. All of its fine-tuning data were generated by much larger models, hence a much better performance than LLMs of similar size fine-tuned on human-written data (e.g., Llama 2).
In the case of Zephyr 7B Beta, Hugging Face pushed knowledge distillation much further into the process of training and aligning an LLM with human preferences, as we will see in the next sections.
dDPO: Distilled Direct Preference Optimization with Mistral 7B
Making Zephyr 7B Beta from Mistral 7B is a three-step process:
- Supervised fine-tuning (SFT) on instruction datasets generated by other larger models
- Scoring/ranking LLMs’ outputs using a state-of-the-art LLM
- Training the model obtained in Step 1 with DPO on the data obtained in Step 2
Distilled Supervised Fine-Tuning (dSFT)
SFT is the standard first step for training an instruct/chat model. It requires an instruction dataset: instructions/questions paired with answers given by humans.
The main issue here is that collecting such a dataset is extremely expensive since it involves human labor. An increasingly common and cheaper alternative is to use instruction datasets generated by other LLMs.
We can find many such instruction datasets on the Hugging Face Hub that we can use for SFT, for instance:
- OpenAssistant Conversations Dataset (OASST1) (84.4k training examples)
- OpenOrca (4.2M training examples)
- openassistant-guanaco (9.8k training examples)
For Zephyr 7B Beta, Hugging Face fine-tuned Mistral 7B on a custom version of UltraChat that they aggressively filtered:
- HuggingFaceH4/ultrachat_200k (MIT license), use the "sft" splits
As they explain in the technical report:
we applied truecasing heuristics to fix the grammatical errors (approximately 5% of the dataset), as well as several filters to focus on helpfulness and remove the undesired model responses.
Hugging Face denotes this SFT "Distilled Supervised Fine-Tuning" since the fine-tuning is done on datasets generated by "teacher" models.
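As an illustration, here is a minimal sketch of what this dSFT step could look like with TRL's SFTTrainer on ultrachat_200k. It assumes the TRL 0.7-era API (argument names have changed in more recent versions), and the hyperparameters are placeholders, not the values reported in the technical report.

```python
from datasets import load_dataset
from transformers import AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# UltraChat 200k, already filtered by Hugging Face; use the SFT split.
train_dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

# Each example stores a conversation as a list of chat messages;
# flatten it into a single training string.
def to_text(example):
    text = "\n".join(f"<|{m['role']}|>\n{m['content']}" for m in example["messages"])
    return {"text": text}

train_dataset = train_dataset.map(to_text)

training_args = TrainingArguments(
    output_dir="./zephyr-dsft",          # placeholder output directory
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,                  # placeholder value
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model_name,                    # TRL loads the model from the Hub
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=2048,
)
trainer.train()
```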
AI Feedback through Preferences (AIF)
For alignment with human preferences, we need a dataset of prompts paired with ranked answers. We can then use DPO, or RLHF, to train the model to generate the preferred answers.
Ranking models’ answers is an expensive task that requires human labor. But again, we already have aligned LLMs that are good enough to perform this ranking.
We can take an existing dataset of prompts paired with answers generated by different models and use a state-of-the-art LLM to rank these answers.
For this step, Hugging Face directly used the dataset UltraFeedback.
UltraFeedback contains 64k prompts paired with responses generated by the following models:
- LLaMA-2–7B-chat, LLaMA-2–13B-chat, LLaMA-2–70B-chat
- UltraLM-13B, UltraLM-65B
- WizardLM-7B, WizardLM-13B, WizardLM-70B
- Vicuna-33B
- Alpaca-7B
- Falcon-40B-instruct
- MPT-30B-chat
- StarChat-Beta
- Pythia-12B
Each LLM’s output is rated by GPT-4 with a score from 1 to 5 (higher is better) for various criteria:
- instruction following
- helpfulness
- honesty
- truthfulness
For DPO, we need a "chosen" output, i.e., the output that we prefer, and a "rejected" output, an output that we don’t want the model to generate.
The output with the highest mean score (averaged over all criteria) is selected as the chosen output. The rejected output is randomly selected among the remaining ones.
They justify this random selection as follows:
We opted for random selection instead of selecting the lowest-scored response to encourage diversity and make the DPO objective more challenging
The version of the dataset that they built and used for DPO training is available here:
- HuggingFaceH4/ultrafeedback_binarized (MIT license), use the "prefs" splits
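The binarization heuristic described above is simple enough to sketch. The snippet below only illustrates the selection logic (highest mean score as "chosen", a random other response as "rejected"); it is not Hugging Face's preprocessing code, and the field names are hypothetical.

```python
import random

def binarize(example):
    # `example["completions"]` is assumed to hold several model responses for one
    # prompt, each with per-criterion GPT-4 scores (field names are hypothetical).
    scored = []
    for completion in example["completions"]:
        scores = completion["scores"]  # e.g., {"instruction_following": 4, "helpfulness": 5, ...}
        mean_score = sum(scores.values()) / len(scores)
        scored.append((mean_score, completion["response"]))

    # Chosen = highest mean score; rejected = a random response among the others.
    scored.sort(key=lambda s: s[0], reverse=True)
    chosen = scored[0][1]
    rejected = random.choice(scored[1:])[1]
    return {"prompt": example["prompt"], "chosen": chosen, "rejected": rejected}
```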
Distilled Direct Preference Optimization (dDPO)
Instruct LLMs, e.g., chat models, are usually trained with reinforcement learning from human feedback (RLHF) using Proximal Policy Optimization (PPO). RLHF works well to align LLMs with human preferences, but it is also unstable and complicated. Indeed, before running RLHF, we need to train two models:
- A reference model simply trained with supervised fine-tuning (SFT) on an instruction dataset
- A reward model trained to predict human preferences. The training data for this model are usually rankings by humans of models’ outputs for a given prompt. The reward model is trained to predict this ranking.
Then, RLHF uses 4 different models:
- The reference model trained with SFT
- The reward model
- A value model, usually initialized from the reward model
- The model (policy) that we want to train with RLHF, usually initialized from the reference model
Using all these models, RLHF uses RL to optimize a language model policy to produce responses with a high reward (according to the reward model) without drifting excessively far from the original reference model.
Several frameworks implement RLHF to make it computationally more efficient. Nonetheless, it remains a complicated and unstable process involving many models.
DPO is a simple alternative to RLHF. It implicitly optimizes the same objective as existing RLHF algorithms (reward maximization with a KL-divergence constraint). The authors of DPO demonstrate that the constrained reward maximization problem can be exactly optimized by solving a much simpler classification problem on human preferences.
Since it can be reduced to a classification problem, DPO trains the model using a simple binary cross-entropy objective. DPO completely eliminates the need for reinforcement learning.
Given a prompt and a pair of outputs ranked according to their quality (by humans in the original DPO setting), DPO trains the model to implicitly assign a higher reward to the preferred output than to the rejected one.
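For reference, the DPO objective from the paper is a binary cross-entropy loss over preference pairs, where x is the prompt, y_w the chosen output, y_l the rejected one, π_θ the model being trained, π_ref the frozen reference model, and β a hyperparameter controlling how far the policy may drift from the reference:

$$
\mathcal{L}_\text{DPO}(\pi_\theta; \pi_\text{ref}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}\right)\right]
$$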
DPO only requires two models:
- The reference model, fine-tuned with SFT on instruction datasets
- The model that we want to train with DPO, usually initialized from the same SFT checkpoint

DPO was introduced by researchers at Stanford in this arXiv paper:
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Note that for Zephyr, Hugging Face calls it "Distilled Direct Preference Optimization" only because the SFT and preferences are generated by other LLMs. The DPO process itself remains standard.
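To give a sense of what this step looks like in practice, here is a minimal sketch using TRL's DPOTrainer on the binarized UltraFeedback dataset. It assumes the TRL 0.7-era API (argument names differ in newer versions), the hyperparameters are placeholders, and the dataset handling below may need adjusting depending on the dataset revision you use.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

# The dSFT checkpoint obtained in the previous step (placeholder path).
sft_model_name = "./zephyr-dsft"

model = AutoModelForCausalLM.from_pretrained(sft_model_name, torch_dtype="auto", device_map="auto")
ref_model = AutoModelForCausalLM.from_pretrained(sft_model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(sft_model_name)

dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

# DPOTrainer expects plain "prompt"/"chosen"/"rejected" strings; in this dataset,
# "chosen" and "rejected" are lists of chat messages, so we keep only the final
# (assistant) turn. The exact schema may differ between dataset revisions.
def to_pairs(example):
    return {
        "prompt": example["prompt"],
        "chosen": example["chosen"][-1]["content"],
        "rejected": example["rejected"][-1]["content"],
    }

dataset = dataset.map(to_pairs)

training_args = TrainingArguments(
    output_dir="./zephyr-ddpo",      # placeholder output directory
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,              # placeholder value
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
)

trainer = DPOTrainer(
    model,
    ref_model,
    args=training_args,
    beta=0.1,                        # placeholder value
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
)
trainer.train()
```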
If you are interested in using DPO to fine-tune Mistral 7B, have a look at my tutorial:
Fine-tune Your Own Instruct Version of Mistral 7B with Direct Preference Optimization (DPO)
The Evaluation of Zephyr 7B Beta
Hugging Face evaluated Zephyr on MT-Bench along the following axes: writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities.

Zephyr clearly outperforms Llama 2 70B while performing close to other state-of-the-art commercial LLMs. GPT-4, Zephyr’s main teacher, remains much better at reasoning, math, coding, and extraction.
They have also performed an ablation study to demonstrate the importance of DPO.

DPO alone (first row) performs poorly. However, the combination of SFT and DPO clearly outperforms SFT alone.
Conclusion
By relying on knowledge distillation, Hugging Face has demonstrated that it is possible to train and align a state-of-the-art LLM without using any human annotations.
Zephyr 7B Beta is a rather affordable model to make, especially compared to larger models such as Llama 2 Chat 70B. However, given the per-GPU training batch size of 2 and the fact that they fully fine-tuned Mistral 7B, they had to use 16 A100 80 GB GPUs (for up to 4 hours, according to the technical report).
Note that you can use LoRA to train with DPO in Hugging Face’s TRL library, which significantly reduces memory consumption. Hugging Face didn’t use parameter-efficient fine-tuning methods such as LoRA, but they are rather optimistic that they would work as well as full fine-tuning:
We did not experiment with parameter-efficient techniques such as LoRA (Hu et al., 2021), but expect similar results to hold with these methods.
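As a sketch of what that could look like, continuing from the dDPO snippet above, you can pass a PEFT LoRA configuration to DPOTrainer so that only low-rank adapters are trained. The rank, alpha, and target modules below are illustrative choices, not values from the report.

```python
from peft import LoraConfig

# Illustrative LoRA configuration (not from the Zephyr report).
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# With `peft_config`, TRL wraps the model with LoRA adapters so only a small
# fraction of the parameters is updated. `ref_model=None` lets TRL use the
# frozen base weights (adapters disabled) as the implicit reference model,
# which saves additional memory.
trainer = DPOTrainer(
    model,
    ref_model=None,
    args=training_args,
    beta=0.1,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
```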
To support my work, consider subscribing to my newsletter:
The Kaitchup – AI on a Budget | Benjamin Marie, PhD | Substack