---
license: apache-2.0
tags:
- distillation
- knowledge-distillation
- sovereign-omega
- reasoning
base_model:
- Qwen/Qwen2.5-3B-Instruct
datasets:
- moro72842/Sovereign-Omega-SFT-V1
- open-r1/OpenR1-Math-220k
---

# Edge-Omega-Distilled-V1

## Distillation Pipeline

High-fidelity knowledge distillation following the DeepSeek-R1 recipe (arXiv:2501.12948):
- **Teacher traces**: Sovereign-Omega-SFT-V1 + OpenR1-Math-220k (verified)
- **Method**: SFT on teacher reasoning traces (trace-based KD)
- **Student**: Qwen2.5-3B-Instruct with a LoRA adapter (r=64, α=128)
- **Config**: lr=1e-5, 3 epochs, cosine schedule, packing=True (see the sketch below)

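A minimal sketch of that setup with TRL's `SFTTrainer` and PEFT, assuming both datasets expose a compatible `messages` chat column and subsampling OpenR1 to 50k rows as in the launch example below; the actual preprocessing in `distillation_pipeline.py` may differ:

```python
# Trace-based KD is plain SFT on teacher reasoning traces: the student is
# fine-tuned on the text of the traces, with no access to teacher logits.
from datasets import concatenate_datasets, load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

sovereign = load_dataset("moro72842/Sovereign-Omega-SFT-V1", split="train")
openr1 = (
    load_dataset("open-r1/OpenR1-Math-220k", split="train")
    .shuffle(seed=42)
    .select(range(50_000))  # OPENR1_SAMPLES=50000
)
# Assumption: both datasets share a "messages" (chat-format) column.
train_ds = concatenate_datasets(
    [sovereign.select_columns(["messages"]), openr1.select_columns(["messages"])]
).shuffle(seed=42)

peft_config = LoraConfig(r=64, lora_alpha=128, task_type="CAUSAL_LM")

args = SFTConfig(
    output_dir="edge-omega-distilled-v1",
    learning_rate=1e-5,
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    packing=True,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    args=args,
    train_dataset=train_ds,
    peft_config=peft_config,
)
trainer.train()
```
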
## Why Trace-Based KD (Not Logit-Based)

From DeepSeek-R1 §2.4 and REDI (arXiv:2505.24850):
- Trace-based SFT: **72.6% on AIME 2024** (32B student)
- Equivalent RL without distillation: **47.0% on AIME 2024** (same base model)
- The ~25-point gap shows trace-based KD is substantially more effective than direct RL for reasoning at this scale

## Evaluation

Pre- and post-distillation comparison on:
- GPQA Diamond (198 PhD-level MCQs)
- ARC-Challenge (1,172 science MCQs)

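One way to run that comparison is EleutherAI's lm-evaluation-harness (`pip install lm_eval`); the sketch below is an assumed setup, not this repo's own harness. The task names are lm-eval's (`arc_challenge`, `gpqa_diamond_zeroshot`), the post-distillation checkpoint path is hypothetical, and GPQA is a gated dataset on the Hub:

```python
# Evaluate the student before and after distillation on the same tasks.
import lm_eval

checkpoints = {
    "pre": "Qwen/Qwen2.5-3B-Instruct",
    "post": "edge-omega-distilled-v1",  # hypothetical local checkpoint dir
}

for label, model_id in checkpoints.items():
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id},dtype=bfloat16",
        tasks=["arc_challenge", "gpqa_diamond_zeroshot"],
        batch_size=8,
    )
    for task, metrics in results["results"].items():
        acc = metrics.get("acc_norm,none", metrics.get("acc,none"))
        print(f"{label} {task}: {acc}")
```
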
## Launch

```bash
pip install trl transformers torch datasets accelerate peft trackio

# Distill with Sovereign-Omega + OpenR1 traces
STUDENT_MODEL=Qwen/Qwen2.5-3B-Instruct \
OPENR1_SAMPLES=50000 \
python distillation_pipeline.py

# Larger student on multi-GPU
STUDENT_MODEL=Qwen/Qwen2.5-7B-Instruct \
OPENR1_SAMPLES=220000 \
accelerate launch --config_file deepspeed_zero3.yaml distillation_pipeline.py
```