Tags: distillation · knowledge-distillation · sovereign-omega · reasoning

Edge-Omega-Distilled-V1

Distillation Pipeline

High-fidelity knowledge distillation following the DeepSeek-R1 recipe (arxiv:2501.12948); a minimal training sketch follows the list:

  • Teacher traces: Sovereign-Omega-SFT-V1 + OpenR1-Math-220k (verified)
  • Method: SFT on teacher reasoning traces (trace-based KD)
  • Student: Qwen2.5-3B-Instruct + LoRA (r=64, α=128)
  • Config: lr=1e-5, 3 epochs, cosine schedule, packing=True
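The configuration above maps naturally onto TRL's SFTTrainer with a PEFT LoRA adapter. The sketch below is an illustration under stated assumptions, not the actual distillation_pipeline.py: the Sovereign-Omega dataset repo id is a placeholder, both datasets are assumed to expose a chat-style "messages" column, and SFTConfig argument names can differ slightly between TRL versions.

```python
# Minimal trace-based KD sketch (assumptions: dataset repo ids, a shared
# chat-style "messages" column, and the output directory are illustrative).
import os

from datasets import load_dataset, concatenate_datasets
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Mirrors the environment variables used in the Launch section below.
student = os.environ.get("STUDENT_MODEL", "Qwen/Qwen2.5-3B-Instruct")
n_openr1 = int(os.environ.get("OPENR1_SAMPLES", "50000"))

# Teacher traces: Sovereign-Omega SFT data plus a verified OpenR1-Math slice.
omega = load_dataset("moro72842/Sovereign-Omega-SFT-V1", split="train")          # placeholder repo id
openr1 = load_dataset("open-r1/OpenR1-Math-220k", split=f"train[:{n_openr1}]")

# Keep only the conversation column so the two datasets can be concatenated
# (assumes both carry a "messages" field in chat format).
omega = omega.select_columns(["messages"])
openr1 = openr1.select_columns(["messages"])
traces = concatenate_datasets([omega, openr1]).shuffle(seed=42)

peft_config = LoraConfig(r=64, lora_alpha=128, task_type="CAUSAL_LM")

args = SFTConfig(
    output_dir="edge-omega-distilled-v1",   # illustrative output path
    learning_rate=1e-5,
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    packing=True,
)

trainer = SFTTrainer(
    model=student,              # SFTTrainer loads the model and tokenizer from the hub id
    args=args,
    train_dataset=traces,
    peft_config=peft_config,
)
trainer.train()
```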

Why Trace-Based KD (Not Logit-Based)

From DeepSeek-R1 §2.4 and REDI (arxiv:2505.24850):

  • Trace-based SFT: 72.6% on AIME 2024 (32B student)
  • Equivalent large-scale RL without distillation: 47.0% on AIME 2024 (same base model)
  • The ~25-point gap indicates that distilling a strong teacher's reasoning traces substantially outperforms running RL directly on a student-scale base model

Evaluation

Pre/post-distillation comparison of the student model on the following benchmarks (a minimal scoring sketch follows the list):

  • GPQA Diamond (198 PhD-level MCQs)
  • ARC-Challenge (1172 science MCQs)
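One simple way to run this comparison is length-normalized log-likelihood scoring over the answer choices. The sketch below does this for ARC-Challenge; the `allenai/ai2_arc` dataset config, the prompt format, and the model id are assumptions rather than the card's exact harness, and GPQA Diamond (a gated dataset) follows the same pattern.

```python
# Hedged MCQ evaluation sketch: pick the answer choice with the highest
# average token log-probability. Prompt format and dataset id are assumptions.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-3B-Instruct"   # swap in the distilled checkpoint for the "post" score
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

def choice_score(question: str, choice: str) -> float:
    """Average log-prob of the choice tokens conditioned on the question."""
    prompt = f"Question: {question}\nAnswer:"
    prompt_ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tok(prompt + " " + choice, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-prob of each token given its prefix (shift targets by one).
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the tokens belonging to the choice (approximate boundary).
    choice_lp = token_lp[:, prompt_ids.shape[1] - 1 :]
    return choice_lp.mean().item()

arc = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")   # 1,172 questions
correct = 0
for ex in arc:
    scores = [choice_score(ex["question"], c) for c in ex["choices"]["text"]]
    pred = ex["choices"]["label"][scores.index(max(scores))]
    correct += int(pred == ex["answerKey"])
print(f"ARC-Challenge accuracy: {correct / len(arc):.3f}")
```

Running the same script on the base student and on the distilled checkpoint gives the pre/post comparison described above.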

Launch

pip install trl transformers torch datasets accelerate peft trackio

# Distill with Sovereign-Omega + OpenR1 traces
STUDENT_MODEL=Qwen/Qwen2.5-3B-Instruct \
OPENR1_SAMPLES=50000 \
python distillation_pipeline.py

# Larger student on multi-GPU
STUDENT_MODEL=Qwen/Qwen2.5-7B-Instruct \
OPENR1_SAMPLES=220000 \
accelerate launch --config_file deepspeed_zero3.yaml distillation_pipeline.py