Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning
Paper: arxiv:2505.24850
High-fidelity knowledge distillation following the DeepSeek-R1 recipe (arxiv:2501.12948): the student is fine-tuned directly on reasoning traces generated by a stronger teacher, as described in DeepSeek-R1 §2.4, with negative (incorrect) teacher traces additionally harnessed in the spirit of REDI (arxiv:2505.24850); a sketch of that objective follows.
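To make the idea concrete, here is a minimal sketch of an asymmetrically weighted positive/negative trace loss in the REDI spirit: standard SFT cross-entropy pushes the student toward correct teacher traces, while a down-weighted term pushes it away from incorrect ones. This is an illustration, not the exact REDI objective; the `neg_weight` value and the assumption that logits and labels are already shift-aligned are mine.

```python
import torch.nn.functional as F

def redi_style_loss(logits_pos, labels_pos, logits_neg, labels_neg, neg_weight=0.5):
    """Asymmetric positive/negative trace loss (sketch, not the paper's exact form).

    logits_*: (batch, seq, vocab) student logits on teacher traces, shift-aligned.
    labels_*: (batch, seq) target token ids, -100 on masked positions.
    neg_weight: illustrative down-weighting of the negative-trace term.
    """
    # Positive traces: ordinary next-token cross-entropy (maximize likelihood).
    pos_nll = F.cross_entropy(
        logits_pos.reshape(-1, logits_pos.size(-1)), labels_pos.reshape(-1),
        ignore_index=-100,
    )
    # Negative traces: same NLL, subtracted so minimizing the total loss
    # lowers the student's likelihood of incorrect reasoning traces.
    neg_nll = F.cross_entropy(
        logits_neg.reshape(-1, logits_neg.size(-1)), labels_neg.reshape(-1),
        ignore_index=-100,
    )
    return pos_nll - neg_weight * neg_nll
```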
A pre/post distillation comparison is reported for the student model.
pip install trl transformers torch datasets accelerate peft trackio
# Distill with Sovereign-Omega + OpenR1 traces
STUDENT_MODEL=Qwen/Qwen2.5-3B-Instruct \
OPENR1_SAMPLES=50000 \
python distillation_pipeline.py
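A rough sketch of what a script like distillation_pipeline.py could do with these environment variables, showing only the OpenR1 SFT leg. The dataset name (open-r1/OpenR1-Math-220k), the assumption that it exposes a conversational `messages` column, and all hyperparameters are illustrative choices, not the actual pipeline.

```python
import os

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Configuration via environment variables, mirroring the commands above.
student_model = os.environ.get("STUDENT_MODEL", "Qwen/Qwen2.5-3B-Instruct")
num_samples = int(os.environ.get("OPENR1_SAMPLES", "50000"))

# Assumed source of the OpenR1 reasoning traces (hypothetical choice).
dataset = load_dataset("open-r1/OpenR1-Math-220k", split="train")
dataset = dataset.shuffle(seed=42).select(range(min(num_samples, len(dataset))))

# Plain SFT on teacher traces; trl applies the chat template to the
# "messages" column (assumed to exist) before tokenization.
config = SFTConfig(
    output_dir=f"distilled-{student_model.split('/')[-1]}",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    bf16=True,
    logging_steps=10,
)

trainer = SFTTrainer(model=student_model, args=config, train_dataset=dataset)
trainer.train()
```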
# Larger student on multi-GPU
STUDENT_MODEL=Qwen/Qwen2.5-7B-Instruct \
OPENR1_SAMPLES=220000 \
accelerate launch --config_file deepspeed_zero3.yaml distillation_pipeline.py
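The deepspeed_zero3.yaml referenced above is an Accelerate config file; a minimal example of what such a file might contain is shown below, where the GPU count and gradient-accumulation setting are placeholder assumptions.

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 3                  # shard optimizer state, gradients, and parameters
  zero3_init_flag: true
  offload_optimizer_device: none
  offload_param_device: none
  gradient_accumulation_steps: 8
mixed_precision: bf16
num_machines: 1
num_processes: 8                 # one process per GPU (assumed 8-GPU node)
machine_rank: 0
main_training_function: main
use_cpu: false
```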