DeepScaleR: Scaling Language Model Reasoning with Reinforcement Learning

2025-02-12
ℹ️Note on the source

This blog post was automatically generated (and translated). It is based on the following original, which I selected for publication on this blog:
DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL (Notion blog post): https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2

Researchers have introduced DeepScaleR-1.5B-Preview, a language model fine-tuned from DeepSeek-R1-Distill-Qwen-1.5B using reinforcement learning (RL). The model achieves 43.1% Pass@1 accuracy on AIME 2024, a significant improvement over the base model, and surpasses OpenAI's o1-preview despite having only 1.5B parameters.

| Model | AIME 2024 | MATH 500 | AMC 2023 | Minerva Math | OlympiadBench | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| DeepScaleR-1.5B-Preview | 43.1 | 87.8 | 73.6 | 30.2 | 50.0 | 57.0 |
| DeepSeek-R1-Distill-Qwen-1.5B | 28.8 | 82.8 | 62.9 | 26.5 | 43.3 | 48.9 |
| O1-Preview | 40.0 | 81.4 | - | - | - | - |

Democratizing RL for LLMs

The development of reasoning models is advancing rapidly, but replicating the training process of models like DeepSeek-R1 is computationally expensive, requiring long training runs and large amounts of compute. To address this, research is focusing on distilled models and iterative lengthening schemes for RL, which significantly reduce the computational demands. This work demonstrates that customized reasoning models can be developed with RL in a scalable and cost-efficient manner.

DeepScaleR's Approach

Dataset Curation

The training dataset comprised AIME and AMC problems, along with questions from the Omni-MATH and Still datasets. The data processing pipeline included extracting answers with gemini-1.5-pro-002, removing duplicate questions using embeddings from sentence-transformers/all-MiniLM-L6-v2, and filtering out ungradable questions. The final training dataset consisted of approximately 40,000 unique problem-answer pairs.
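
The deduplication step can be pictured as embedding every question and dropping near-duplicates by cosine similarity. The sketch below is a minimal illustration, assuming a similarity threshold of 0.95 and a greedy pass over the questions; the exact threshold and matching logic used by the authors are not stated in the post.

```python
# Minimal sketch of embedding-based deduplication. The 0.95 threshold and the
# greedy keep/drop strategy are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

def deduplicate(questions, threshold=0.95):
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    embeddings = model.encode(questions, convert_to_tensor=True, normalize_embeddings=True)
    kept, kept_embeddings = [], []
    for question, emb in zip(questions, embeddings):
        # Keep a question only if it is not too similar to any question kept so far.
        if all(cos_sim(emb, e).item() < threshold for e in kept_embeddings):
            kept.append(question)
            kept_embeddings.append(emb)
    return kept
```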

Reward Function

An Outcome Reward Model (ORM) was employed to avoid reward hacking. The reward function returned 1 if the LLM's answer passed basic LaTeX/Sympy checks, and 0 if the answer was incorrect or formatted incorrectly.
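
A hedged sketch of such a binary outcome reward: extract the final boxed answer, compare it to the ground truth as a string, and fall back to a SymPy equivalence check. The extract_boxed_answer helper and the exact matching rules are assumptions for illustration, not the authors' implementation.

```python
# Sketch of a binary outcome reward: 1 if the answer checks out, 0 otherwise.
import re
from sympy import simplify, sympify

def extract_boxed_answer(response: str):
    # Pull the contents of the last \boxed{...} in the response, if any
    # (non-nested braces only; sufficient for a sketch).
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1] if matches else None

def outcome_reward(response: str, ground_truth: str) -> float:
    answer = extract_boxed_answer(response)
    if answer is None:
        return 0.0  # badly formatted answer -> no reward
    if answer.strip() == ground_truth.strip():
        return 1.0  # exact string match
    try:
        # Fall back to symbolic equivalence, e.g. "1/2" vs "0.5".
        return 1.0 if simplify(sympify(answer) - sympify(ground_truth)) == 0 else 0.0
    except Exception:
        return 0.0
```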

Iterative Context Lengthening

Selecting the optimal context window for training is a challenge in scaling RL for reasoning tasks. Longer contexts give the model more room to think but slow down training, while shorter contexts speed up training but may limit the model's ability to solve harder problems. An iterative approach was adopted (a minimal sketch of the staged schedule follows the list):

  1. RL training with 8K max context for effective reasoning and efficient training.
  2. Scaling up training to 16K and 24K contexts to solve more challenging problems.
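
A minimal sketch of this staged schedule, assuming RL training resumes from the previous stage's checkpoint at each context extension. The per-stage step counts and the train_rl_stage placeholder are illustrative assumptions; only the 8K/16K/24K limits and the roughly 1,750 total steps come from the post.

```python
# Sketch of iterative context lengthening as a staged training loop.

def train_rl_stage(checkpoint: str, max_response_tokens: int, steps: int) -> str:
    """Placeholder for one RL stage (e.g. a PPO/GRPO loop); returns the new checkpoint."""
    print(f"RL from {checkpoint}: {steps} steps, {max_response_tokens}-token responses")
    return f"{checkpoint}-rl-{max_response_tokens // 1024}k"

schedule = [
    (8 * 1024, 1000),   # bootstrap: learn to reason within a tight budget
    (16 * 1024, 500),   # extend once 8K responses start getting clipped
    (24 * 1024, 250),   # final push toward o1-level AIME accuracy
]

checkpoint = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
for max_tokens, steps in schedule:
    # Each stage resumes from the previous checkpoint with a longer generation
    # limit, so efficient reasoning is learned before long contexts are allowed.
    checkpoint = train_rl_stage(checkpoint, max_tokens, steps)
```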

Bootstrapping with 8K Context

Analysis of DeepSeek-R1-Distill-Qwen-1.5B on AIME 2024 revealed that incorrect responses contained roughly three times as many tokens as correct ones. Training was therefore initiated with an 8K context, achieving an initial AIME 2024 accuracy of 22.9%. Constraining output to 8K tokens led the model to use its context more effectively.

| Metric | Base model | DeepScaleR-1.5b-8k | Change |
| --- | --- | --- | --- |
| AIME Pass@1 | 28.9% | 33.9% | +5.0% |
| Average tokens (correct responses) | 6396.0 | 3661.2 | -2734.8 |
| Average tokens (incorrect responses) | 20346.3 | 6976.8 | -13369.5 |
| Average tokens (overall) | 16335.6 | 5850.9 | -10484.7 |
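
The length statistics in the table above come from grouping evaluation responses by correctness and averaging their lengths. The sketch below assumes each record carries a token count and a correctness flag; these field names are hypothetical.

```python
# Sketch of the token-length analysis: average response length split by
# correctness. The "tokens"/"correct" record fields are assumptions.
def average_tokens(records):
    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0
    return {
        "correct": mean([r["tokens"] for r in records if r["correct"]]),
        "incorrect": mean([r["tokens"] for r in records if not r["correct"]]),
        "overall": mean([r["tokens"] for r in records]),
    }

# Toy example (not the paper's numbers):
print(average_tokens([
    {"tokens": 3200, "correct": True},
    {"tokens": 7100, "correct": False},
    {"tokens": 4100, "correct": True},
]))
```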

Extending to 16K Context

After approximately 1,000 steps, response length began to increase again, leading to diminishing returns. The response clipping ratio also rose, indicating that more responses were being truncated at the context limit. Training was relaunched with a 16K context window. This two-stage approach is more efficient than training at 16K from the start.
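
A minimal sketch of the response clipping ratio mentioned above, assuming per-response token counts are logged during evaluation; the function name is an illustrative choice.

```python
# The clipping ratio is the fraction of sampled responses truncated at the
# generation limit. A rising ratio signals that the context window has become
# the bottleneck and should be extended.
def clip_ratio(response_lengths: list[int], max_tokens: int) -> float:
    if not response_lengths:
        return 0.0
    clipped = sum(1 for n in response_lengths if n >= max_tokens)
    return clipped / len(response_lengths)

# Toy example: with an 8K limit, 2 of 4 responses hit the cap.
print(clip_ratio([8192, 3500, 8192, 6000], 8192))  # 0.5
```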

Surpassing O1-preview with 24K

To push towards o1-level performance, the context window was increased to 24K, leading to the model surpassing 40% AIME accuracy and eventually reaching 43%.

Overall, the training run consisted of ~1,750 steps, taking around 3,800 A100 hours.

Evaluation

The model was evaluated on competition-level mathematics benchmarks. DeepScaleR outperformed the base model across all benchmarks, achieving a 14.4% absolute gain on AIME 2024 and an 8.1% overall improvement.

| Model | AIME 2024 | MATH 500 | AMC 2023 | Minerva Math | OlympiadBench | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen-2.5-Math-7B-Instruct | 13.3 | 79.8 | 50.6 | 34.6 | 40.7 | 43.8 |
| rStar-Math-7B | 26.7 | 78.4 | 47.5 | - | 47.1 | - |
| Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9 |
| Qwen2.5-7B-SimpleRL | 26.7 | 82.4 | 62.5 | 39.7 | 43.3 | 50.9 |
| DeepSeek-R1-Distill-Qwen-1.5B | 28.8 | 82.8 | 62.9 | 26.5 | 43.3 | 48.9 |
| Still-1.5B | 32.5 | 84.4 | 66.7 | 29.0 | 45.4 | 51.6 |
| DeepScaleR-1.5B-Preview | 43.1 | 87.8 | 73.6 | 30.2 | 50.0 | 57.0 |
| O1-Preview | 40.0 | 81.4 | - | - | - | - |
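
For reference, Pass@1 here is estimated by sampling several responses per problem, scoring each with the binary reward, and averaging. The number of samples per problem in the sketch below is an assumption; the post only reports the final Pass@1 numbers.

```python
# Sketch of Pass@1 estimation from multiple samples per problem.
def pass_at_1(per_problem_scores):
    """per_problem_scores: list of lists of 0/1 scores, one inner list per problem."""
    per_problem = [sum(scores) / len(scores) for scores in per_problem_scores]
    return sum(per_problem) / len(per_problem)

# Toy example: two problems, four samples each.
print(pass_at_1([[1, 0, 1, 1], [0, 0, 1, 0]]))  # 0.5
```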

Key Takeaways

  • RL scaling can manifest in small models. Combining high-quality SFT distillation with RL scaling can unlock the reasoning potential of LLMs. RL scaling improved AIME accuracy from 28.9% to 43.1%.
  • Iterative lengthening enables more effective length scaling. Optimizing reasoning at shorter contexts (8K) enables faster and more effective training in subsequent 16K and 24K runs. This iterative approach grounds the model in effective thinking patterns before scaling to longer contexts, making RL-based length scaling more efficient.

Conclusion

DeepScaleR-1.5B-Preview surpasses o1-preview with 43.1% Pass@1 accuracy on AIME 2024, demonstrating that RL scaling can substantially improve the reasoning of even a 1.5B-parameter model.

@misc{deepscaler2025,
  title={DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL},
  author={Michael Luo and Sijun Tan and Justin Wong and Xiaoxiang Shi and William Tang and Manan Roongta and Colin Cai and Jeffrey Luo and Tianjun Zhang and Erran Li and Raluca Ada Popa and Ion Stoica},
  year={2025},
  howpublished={\url{https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2}},
  note={Notion Blog}
}
