Recreating the “Aha Moment” of DeepSeek R1 with GRPO and the Countdown Game
This blog post was automatically generated (and translated). It is based on the following original, which I selected for publication on this blog:
Mini-R1: Reproduce Deepseek R1 „aha moment“ a RL tutorial.
# Recreating the “Aha Moment” of DeepSeek R1 with GRPO and the Countdown Game
The release of DeepSeek R1 sparked considerable interest in the AI community. An open model that rivals OpenAI’s models in complex reasoning, it was trained with Group Relative Policy Optimization (GRPO) and a multi-stage approach centered on reinforcement learning (RL). The DeepSeek team also released a research paper detailing their methods, highlighting a key “aha moment” during pure RL training: DeepSeek-R1-Zero learned to allocate more thinking time to a problem by reevaluating its initial approach, without any explicit human feedback.
This article explores the potential to recreate a similar “aha moment” by training an open model using reinforcement learning. The goal is to teach the model self-verification and search abilities to solve the Countdown Game, a numbers puzzle requiring players to reach a target number using basic arithmetic operations.
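To make the task concrete, consider an illustrative Countdown instance (the numbers and target below are made up for this example, not drawn from the training data): given the numbers 3, 4 and 7 and the target 19, one valid solution is `3 * 4 + 7`. A quick sanity check in Python:
```python
# Illustrative Countdown instance (not taken from the dataset):
# reach 19 using 3, 4 and 7, each number used exactly once, with +, -, *, /.
numbers, target = [3, 4, 7], 19
candidate = "3 * 4 + 7"
assert eval(candidate) == target  # 3 * 4 = 12, 12 + 7 = 19
```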
## Group Relative Policy Optimization (GRPO)
Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm designed to enhance the reasoning capabilities of Large Language Models (LLMs). Originating from the DeepSeekMath paper, GRPO modifies the traditional Proximal Policy Optimization (PPO) by removing the necessity for a value function model. Instead, it estimates baselines from group scores, thus decreasing memory usage and computational overhead. GRPO, now adopted by teams like Qwen, can be combined with rule/binary-based rewards or general reward models to improve model helpfulness.
The process involves several steps:
1. **Sampling:** Generate multiple outputs for each prompt using the current policy.
2. **Reward Scoring:** Score each generation using a reward function (rule-based or outcome-based).
3. **Advantage Calculation:** Use the average reward of the generated outputs as a baseline. Calculate the advantage of each solution relative to this baseline, normalizing the reward within each group.
4. **Policy Optimization:** Optimize the policy to maximize the GRPO objective, incorporating the calculated advantages and a KL divergence term.
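To make the advantage step concrete, here is a simplified Python sketch of the group-relative normalization; it illustrates the idea only and is not TRL’s implementation:
```python
# Simplified illustration of GRPO's group-relative advantage (step 3).
# This is a sketch of the idea, not TRL's actual implementation.

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one group of completions sampled for the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    # A completion is advantaged if it scored above the group average.
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four completions for one prompt, scored by a binary rule-based reward.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# -> roughly [1.0, -1.0, -1.0, 1.0] after normalization
```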
## Training the Model
The training process, inspired by Jiayi Pan’s initial exploration, involves the following steps:
1. **Setting up the development environment:** This includes installing necessary libraries like Hugging Face Transformers, PyTorch, vLLM, and TRL.
```bash
%pip install "torch==2.5.1" tensorboard "setuptools<71.0.0" --index-url https://download.pytorch.org/whl/cu121
%pip install flash-attn
%pip install --upgrade \
"transformers==4.48.1" \
"datasets==3.1.0" \
"accelerate==1.3.0" \
"hf-transfer==0.1.9" \
"deepspeed==0.15.4" \
"trl==0.14.0"
%pip install "vllm==0.7.0"
```
2. **Generating training samples:** Using the "Jiayi-Pan/Countdown-Tasks-3to4" dataset, which contains samples with 3 to 4 numbers and solutions, combined with a reasoning prefix.
```python
from transformers import AutoTokenizer
from datasets import load_dataset
dataset_id = "Jiayi-Pan/Countdown-Tasks-3to4"
dataset = load_dataset(dataset_id, split="train")
dataset = dataset.shuffle(seed=42).select(range(50000))
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
def generate_r1_prompt(numbers, target):
    # The conversation ends with an open assistant turn ("<think>") so the model
    # continues the reasoning rather than starting a new message.
    r1_prefix = [{
        "role": "system",
        "content": "You are a helpful assistant. You first thinks about the reasoning process in the mind and then provides the user with the answer."
      },
      {
        "role": "user",
        "content": f"Using the numbers {numbers}, create an equation that equals {target}. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>."
      },
      {
        "role": "assistant",
        "content": "Let me solve this step by step.\n<think>"
      }]
    # continue_final_message=True keeps the assistant turn open instead of closing it.
    return {"prompt": tokenizer.apply_chat_template(r1_prefix, tokenize=False, continue_final_message=True), "target": target}

dataset = dataset.map(lambda x: generate_r1_prompt(x["nums"], x["target"]))
train_test_split = dataset.train_test_split(test_size=0.1)
train_dataset = train_test_split["train"]
test_dataset = train_test_split["test"]
```
3. **Training with GRPO:** Utilizing the TRL library’s `GRPOTrainer` with custom reward functions (a hedged wiring sketch follows this list).
The reward functions used include:
* **Format Reward:** Verifies that the generated output adheres to the expected `<think>...</think><answer>...</answer>` format.
* **Accuracy Reward:** Extracts the equation from the `<answer>` tag, checks that every provided number is used exactly once and that only basic arithmetic symbols appear, and verifies that the equation evaluates to the target number.
```python
import re

def format_reward_func(completions, target, **kwargs):
    """Reward 1.0 if the completion follows the <think>...</think><answer>...</answer> format."""
    rewards = []
    for completion, gt in zip(completions, target):
        try:
            # The prompt already ends with "<think>", so prepend it before matching.
            completion = "<think>" + completion
            regex = r"^<think>(.*?)</think>\s*<answer>(.*?)</answer>$"
            match = re.search(regex, completion, re.DOTALL)
            if match is None or len(match.groups()) != 2:
                rewards.append(0.0)
            else:
                rewards.append(1.0)
        except Exception:
            rewards.append(0.0)
    return rewards

def equation_reward_func(completions, target, nums, **kwargs):
    """Reward 1.0 if the equation uses each number exactly once and evaluates to the target."""
    rewards = []
    for completion, gt, numbers in zip(completions, target, nums):
        try:
            completion = "<think>" + completion
            match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
            if match is None:
                rewards.append(0.0)
                continue
            equation = match.group(1).strip()
            # All provided numbers must be used exactly once.
            used_numbers = [int(n) for n in re.findall(r"\d+", equation)]
            if sorted(used_numbers) != sorted(numbers):
                rewards.append(0.0)
                continue
            # Only digits, operators, parentheses, dots and whitespace are allowed.
            allowed_pattern = r"^[\d+\-*/().\s]+$"
            if not re.match(allowed_pattern, equation):
                rewards.append(0.0)
                continue
            # Evaluate the equation with builtins disabled.
            result = eval(equation, {"__builtins__": None}, {})
            if abs(float(result) - float(gt)) < 1e-5:
                rewards.append(1.0)
            else:
                rewards.append(0.0)
        except Exception:
            rewards.append(0.0)
    return rewards
```
4. **Distributed Training:** Utilizing DeepSpeed and vLLM for faster generation on multi-GPU setups (see the note after the sketch below).
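Below is a hedged sketch of how the dataset and reward functions above could be wired into TRL’s `GRPOTrainer` (steps 3 and 4). The hyperparameter values and the output directory are illustrative placeholders rather than the exact configuration used in the original tutorial, and the vLLM option assumes the vLLM integration available in TRL 0.14:
```python
# Hedged sketch: wiring the dataset and reward functions into TRL's GRPOTrainer.
# Hyperparameter values below are illustrative placeholders, not the tutorial's exact config.
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    output_dir="qwen-r1-countdown",   # placeholder output path
    learning_rate=5e-7,               # illustrative value
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    max_prompt_length=256,
    max_completion_length=1024,
    num_generations=8,                # group size G for GRPO
    beta=0.001,                       # KL penalty coefficient
    use_vllm=True,                    # generate completions with vLLM
    bf16=True,
    max_steps=450,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    reward_funcs=[format_reward_func, equation_reward_func],
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
trainer.train()
```
On a multi-GPU machine, such a script is typically launched with `accelerate launch` and a DeepSpeed ZeRO-3 configuration, with one GPU commonly reserved for vLLM generation while the remaining GPUs handle training.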
## Training Observations and Results
During training, several key observations were noted:
* Around 50 steps, the model learns the correct format.
* By 100 steps, the model achieves approximately a 25% success rate in solving the equation and starts to incorporate "reasoning" in its responses.
* At 200 steps, the performance improvement slows down, reaching around 40% success rate. The model begins to adopt a new approach, solving the equation in a more programmatic manner by testing various combinations and evaluating the results.
* After 450 steps, the model achieves a 50% success rate, maintaining the programmatic execution format developed after step 200.
Possible explanations for the shift from word-based reasoning to programmatic execution include:
* The Qwen 2.5 3B model may not be sufficiently robust for this task.
* The reward functions may not be adequately defined, leading to the model finding exploits to solve the equation.
* Training solely on Countdown Game tasks might compel the model to discover the most efficient method for solving the equation.
* The training duration might be insufficient.
## Conclusion
This experiment demonstrates a simplified reproduction of DeepSeek R1’s learned reasoning behavior using GRPO and the Countdown Game. While it focuses on a specific task rather than general reasoning, the results highlight the potential of the method. It also underscores the substantial computational resources that reinforcement learning requires, suggesting that future advances in RL will demand even greater compute. As RL becomes more accessible, further exploration and progress in open-source AI development can be expected. Is the DeepSeek release a turning point for open science in AI, and what impact will it have on the future of the field?