
Exploring DeepSeek’s R1 Training Process

Open-Source Intelligence on Par with Proprietary Models

Image by Author – Flux.1 Schnell

One of the most powerful AI models in the world today was released open-source. Based on both benchmark metrics and user interactions, it is as good as – if not better than – OpenAI’s ChatGPT. The authors were kind enough to release a paper outlining how they trained such a model. DeepSeek-R1 is a big deal in the tech world right now, so I wanted to break down my understanding of the insights from their paper.

In this blog post, we’ll dive deep into how they formulated their policy-optimization equations to enable reinforcement learning. After that, I’ll go through the different stages they put their models through to reach such deeply impressive performance.

Let’s dive in!

Group Relative Policy Optimization (GRPO)

When we are looking to create a better model, we say we are optimizing our model’s policy (roughly, the rule the model uses to decide which output to produce for a given input). Historically, the industry has relied on two forms of policy optimization: Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO). You can learn more about them in my previous blog post here; for this entry, all you need to understand is that GRPO is another mathematical way to figure out how best to update our policy.

Before I dive into the formulation, I’ll briefly explain some key concepts.

Key Concepts

A policy is a function (defined by a model’s weights and biases) that determines the probability of different outputs (o) for a given query (q). A group is a set of outputs produced by the model for the same query, sampled from an older version of the policy. We sample our outputs from the older version because it isn’t being updated, making its outputs consistent. We then use the probabilities the new policy assigns to those same outputs to guide our updates. Importantly, advantage (A) is where reinforcement learning comes into GRPO. Advantage tells us how much better a specific output is than the rest of the group. We use this value to weight responses and directly influence the score we’ll train our model on.

Equation 1 from the paper

Our model tries to maximize the value output by the above equation.
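Since the equation itself is shown as an image, here is my own transcription of it in LaTeX (any notational slips are mine): π_θ is the new policy, π_θ_old the old one, A_i the advantage of output o_i, π_ref a reference policy, and ε and β are hyperparameters.

```latex
\mathcal{J}_{\mathrm{GRPO}}(\theta) =
\mathbb{E}_{q \sim P(Q),\; \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)}
\Bigg[ \frac{1}{G} \sum_{i=1}^{G} \Bigg(
\min\!\Big( \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)} A_i,\;
\mathrm{clip}\!\Big( \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon \Big) A_i \Big)
- \beta\, \mathbb{D}_{\mathrm{KL}}\big( \pi_{\theta} \,\|\, \pi_{\mathrm{ref}} \big)
\Bigg) \Bigg]
```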

Left-Hand Part of Equation

The above formula is how we calculate the score for a candidate policy based on the group of outputs and the last-used policy. If the model is doing much better now, we should keep the new policy. If the change is only incremental, the next policy will not shift very far towards the new one. From this high-level view, we can now drill deeper.

Zooming into the Left-Hand Part of Equation 1 from the Paper

The above section of the equation is how we score the new policy’s outputs and decide how strongly to update towards them. I’m going to go left to right, explaining each argument of the min function. First, we take the ratio of the new policy’s probability for a given output to the old policy’s probability for the same output. This ratio tells us how much the new policy deviates from the old policy. The ratio alone doesn’t tell us if the change is good or bad, so we need the Advantage value to determine whether the change is good (positive) or bad (negative). To keep our training run stable, we also calculate a clipped version of this ratio, with values bounded between 1−ε and 1+ε. We then choose the smaller of the clipped and unclipped terms, as this is the more conservative update. Because the model’s goal is to maximize the value min returns, always choosing the smaller version reduces the odds of giant overcorrections during the run.
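To make the left-hand term concrete, here is a minimal PyTorch-style sketch of the clipped, advantage-weighted ratio. The tensor names (new_logprobs, old_logprobs, advantages) and the default ε of 0.2 are my own choices for illustration, not values from the paper.

```python
import torch

def grpo_surrogate(new_logprobs: torch.Tensor,
                   old_logprobs: torch.Tensor,
                   advantages: torch.Tensor,
                   epsilon: float = 0.2) -> torch.Tensor:
    """Clipped surrogate for one group of G outputs to the same query.

    new_logprobs / old_logprobs: log pi(o_i | q) under the new and old policy.
    advantages: A_i for each output, from the group-normalized rewards.
    """
    # Probability ratio pi_new / pi_old, computed in log space for stability.
    ratio = torch.exp(new_logprobs - old_logprobs.detach())

    # Unclipped and clipped versions of the advantage-weighted ratio.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages

    # Take the smaller of the two for each output, then average over the group.
    return torch.min(unclipped, clipped).mean()
```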

Reward Modeling

Equation 3 from the paper
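Transcribing the image (again, my rendering), the advantage normalizes each output’s reward against the rest of its group:

```latex
A_i = \frac{r_i - \mathrm{mean}\big(\{r_1, r_2, \ldots, r_G\}\big)}{\mathrm{std}\big(\{r_1, r_2, \ldots, r_G\}\big)}
```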

Now let’s cover Advantage – or how we know whether an update is good or not. For reinforcement learning, we do not have a ‘ground truth’ to compare answers to. Consequently, we need a good way to figure out, without human involvement, whether a response is good. To solve this, DeepSeek decided to use a rules-based system to determine what makes a good answer.

First, each answer was checked for accuracy – for example by checking the final computed value of a math problem or by compiling code and checking its output. Second, they checked whether the response was in the right format (shown below). The paper frames these as rule-based checks rather than a model handing out a subjective score: they take complex input and return an objective output.

Table 1 from the paper
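As a hypothetical sketch of what such a rules-based reward could look like for a math problem: the <think>/<answer> template mirrors the prompt format shown above, but the helper name, regex, and reward values are my own illustration, not taken from the paper.

```python
import re

def rule_based_reward(response: str, expected_answer: str) -> float:
    """Toy rule-based reward: a format check plus an accuracy check."""
    reward = 0.0

    # Format reward: the response must wrap its reasoning and answer in the right tags.
    match = re.search(r"<think>.*?</think>\s*<answer>(.*?)</answer>",
                      response, flags=re.DOTALL)
    if match:
        reward += 0.5
        # Accuracy reward: compare the final answer to the known solution.
        if match.group(1).strip() == expected_answer.strip():
            reward += 1.0

    return reward
```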

Right-Hand Part of Equation

For the final part of our equation, we take the Kullback-Leibler divergence between the new policy and our reference policy. As discussed in my blog post on Direct Preference Optimization, this is a way to measure the difference between two probability distributions.

Zooming in on the Right-Hand part of Equation 1 from the paper

While the left-hand part of the equation uses min to damp dramatic changes to the policy, that apparently isn’t enough on its own, so we also subtract this KL term to ensure the policy doesn’t deviate too radically from a reference policy. The paper doesn’t go into detail about what the reference policy is, but it is likely a frozen earlier policy that had performed well, which we benchmark against to ensure similar or better behavior.
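For completeness, my reading of the KL term (Equation 2 in the paper) is that it uses a simple per-sample estimator rather than the textbook definition; transcribed, it looks like this:

```latex
\mathbb{D}_{\mathrm{KL}}\big( \pi_{\theta} \,\|\, \pi_{\mathrm{ref}} \big) =
\frac{\pi_{\mathrm{ref}}(o_i \mid q)}{\pi_{\theta}(o_i \mid q)}
- \log \frac{\pi_{\mathrm{ref}}(o_i \mid q)}{\pi_{\theta}(o_i \mid q)} - 1
```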


With the reinforcement learning formulation out of the way, let’s go through how they were actually training!

The researchers trained two models in different ways: DeepSeek-R1-Zero and DeepSeek-R1.

DeepSeek-R1-Zero

Here, we begin with a base model and then use reinforcement learning to have the model develop reasoning capabilities. One might reasonably ask: why avoid supervised fine-tuning (SFT)? Well, to do fine-tuning that enhances a model’s capability, you typically need a lot of high-quality data. Since this is hard to come by, finding ways around the requirement is useful. Additionally, reinforcement learning can find better ways of solving problems than supervised training would.

While the authors don’t say exactly what problems they gave the model to test its reasoning ability, it’s reasonable to assume these formed a fairly large dataset. They would have the model generate a group of answers for each query, then use the GRPO equation from above to compute the loss. The loss was then backpropagated to update and improve the model.
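To make that loop concrete, here is a heavily simplified sketch of one GRPO training step. Everything here is illustrative: policy.generate, policy.logprob, and answer_for are placeholder names, the group size and β are made-up values, and grpo_surrogate and rule_based_reward refer to the sketches earlier in this post.

```python
import torch

def grpo_training_step(policy, old_policy, ref_policy, query, optimizer,
                       group_size: int = 16, beta: float = 0.01):
    """One simplified GRPO update on a single query (illustrative only)."""
    # 1. Sample a group of outputs from the frozen old policy.
    outputs = [old_policy.generate(query) for _ in range(group_size)]

    # 2. Score each output with the rules-based reward, then normalize
    #    within the group to get advantages (Equation 3).
    rewards = torch.tensor([rule_based_reward(o, answer_for(query)) for o in outputs])
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # 3. Log-probabilities of those outputs under the new, old, and reference policies.
    new_lp = torch.stack([policy.logprob(o, query) for o in outputs])
    old_lp = torch.stack([old_policy.logprob(o, query) for o in outputs])
    ref_lp = torch.stack([ref_policy.logprob(o, query) for o in outputs])

    # 4. Clipped surrogate minus the KL penalty (Equations 1 and 2).
    surrogate = grpo_surrogate(new_lp, old_lp, advantages)
    kl = (torch.exp(ref_lp - new_lp) - (ref_lp - new_lp) - 1).mean()
    loss = -(surrogate - beta * kl)  # maximize the objective => minimize its negative

    # 5. Backpropagate and update the new policy's weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```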

There were a number of really interesting takeaways from this training run. The most immediate one is that the researchers were able to effectively recreate the accuracy of OpenAI’s o1 model using this approach. This alone signals that many other labs will likely follow suit.

Figure 2 from the paper

Interestingly, we also see that DeepSeek-R1-Zero learns on its own to create longer responses. The graph below shows how long its average response tends to be as the training run continues. Note that neither the questions asked nor GRPO award any bonus points for length. This result gives strong evidence that longer Chain-of-Thought reasoning is the way to boost accuracy for these models.

Figure 3 from the paper

Finally, the authors also saw the model begin to develop reflective tendencies. Similar to how prompt engineers ask a model to reflect on its answers, here the model appears to develop this tendency on its own. One of the fascinating things here is watching the model, through reinforcement learning, rediscover many of the prompting techniques we’ve been using by hand.

Table 3 from the paper

Nevertheless, R1-Zero had some issues after its training run. For one thing, it would occasionally mix languages – meaning it would splice in words from different languages. Additionally, the researchers found R1-Zero would occasionally have responses with poor readability. To address these issues while maintaining their gains, the researchers began work on an improved methodology.

DeepSeek-R1

From here, the authors once again began with a base model, but this time went through four distinct stages to train it: Cold Start; Reasoning-oriented Reinforcement Learning; Rejection Sampling and Supervised Fine-Tuning; and Reinforcement Learning for all Scenarios.

Stage 1: Cold Start

Unlike with R1-Zero, the authors wanted to avoid having the model begin its training run heading in a bad direction (the cold-start problem). They began by using their DeepSeek-V3 model to generate thousands of examples of Chain-of-Thought reasoning. By starting with supervised fine-tuning on these, the authors could ensure that the <think> tokens were always understandable to the user (something R1-Zero struggled with, either mixing languages or producing unintelligible responses).

Stage 2: Reasoning-oriented Reinforcement Learning

Once the fine-tuning was complete, they began a reinforcement learning run focused entirely on reasoning-oriented questions. This involved coding, mathematics, science, and logical reasoning – all precise problems with verifiable answers.

While training, they noticed that the model would often begin language-mixing in its <think> tokens. To avoid this, they introduced a language-consistency reward into their Advantage calculation, ensuring the model would steer towards using just one language. The authors note that this alignment does degrade performance slightly. Once the model converges on all its reasoning tasks, they stop the run.
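To illustrate how such a language-consistency reward could be folded into the rules-based reward from earlier: the paper describes the consistency signal as the proportion of target-language words in the Chain-of-Thought, but the detect_language helper and the weighting here are my own placeholders, not DeepSeek’s exact recipe.

```python
def combined_reward(response: str, expected_answer: str,
                    target_language: str = "en", lam: float = 0.1) -> float:
    """Rule-based reward plus a language-consistency bonus (illustrative)."""
    base = rule_based_reward(response, expected_answer)

    # Proportion of words detected as the target language; detect_language()
    # stands in for whatever language-identification tooling is actually used.
    words = response.split()
    consistent = sum(detect_language(w) == target_language for w in words)
    language_consistency = consistent / max(len(words), 1)

    # The weight lam is an assumption; the point is that consistency adds to the reward.
    return base + lam * language_consistency
```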

Stage 3: Rejection Sampling and Supervised Fine-Tuning

The researchers then expanded the types of questions they curated for fine-tuning, creating both reasoning and non-reasoning data to improve the model’s performance.

For reasoning questions, they no longer required that the question have an objectively checkable answer. Instead, to determine correctness, they used DeepSeek-V3 as a judge. After filtering out wrong answers, mixed-language use, long paragraphs, and code blocks, they had roughly 600 thousand samples to train on.
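A rough sketch of what that rejection-sampling filter might look like: the judge callable, has_mixed_languages, and is_overlong_or_code_dump are placeholders for whatever DeepSeek actually ran, and the sampling count is arbitrary.

```python
def rejection_sample(model, judge, prompts, samples_per_prompt: int = 4):
    """Keep only generations that pass the judge and basic readability filters."""
    kept = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            response = model.generate(prompt)
            if not judge(prompt, response):          # judged incorrect -> drop
                continue
            if has_mixed_languages(response):        # mixed-language CoT -> drop
                continue
            if is_overlong_or_code_dump(response):   # long paragraphs / code blocks -> drop
                continue
            kept.append({"prompt": prompt, "response": response})
    return kept
```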

For the non-reasoning questions (think writing, translation, story-telling), the authors reused their data pipeline from DeepSeek-V3. For some of the data, the authors required the full Chain-of-Thought reasoning, but when they felt the question was simple enough (for example "hi"), they did not generate the full Chain-of-Thought. They ended up with roughly 200 thousand samples here.

They took this data and fine-tuned for two epochs.

Stage 4: Reinforcement Learning for all Scenarios

They end with a final reinforcement learning run. They adjusted their reward (and thus Advantage) calculation so that it could capture both proper reasoning (using the same rules-based approach as before) and human preferences for helpfulness and harmlessness. For the latter, they seem to have fallen back to using reward models to estimate human preferences, built from a data corpus similar to the one used in DeepSeek-V3’s pipeline. Note that for helpfulness, the reward model focused only on the final summary, while for harmlessness it took the thinking tokens into account as well.

With this final step, they were able to get incredible performance from DeepSeek-R1.

Results

As the table below shows, the model is directly comparable to some of the best models from Anthropic and OpenAI. Moreover, on a few benchmarks DeepSeek-R1 is decidedly better.

Table 4 from the paper

Perhaps the most interesting gauge of performance has been the public’s reaction. On Chatbot Arena, DeepSeek-R1 ranks very high and is the only MIT-licensed model in the top 10.

Image by Author – Screen Capture of Chatbot Arena from 1/27/2025

Many in the tech community have also weighed in on how impressed they are.

Additionally, the authors shared the other methods they tried (including using a Process Reward Model, which I explain here). It is worth going through their analysis to see why they walked away from those approaches.

Conclusion

This paper is a great way to see how the most cutting-edge labs are training foundation models. Given the success seen here, I think it is only a matter of time until people switch to reinforcement learning for a lot of (if not all) fine-tuning. I think it’s fair to say this is a seminal moment in the history of AI.

It is an exciting time to be building!


[1] DeepSeek Research, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025), GitHub

