---
license: apache-2.0
tags:
- distillation
- knowledge-distillation
- sovereign-omega
- reasoning
base_model:
- Qwen/Qwen2.5-3B-Instruct
datasets:
- moro72842/Sovereign-Omega-SFT-V1
- open-r1/OpenR1-Math-220k
---

# Edge-Omega-Distilled-V1

## Distillation Pipeline

High-fidelity knowledge distillation following the DeepSeek-R1 recipe (arXiv:2501.12948):
- **Teacher traces**: Sovereign-Omega-SFT-V1 + OpenR1-Math-220k (verified)
- **Method**: SFT on teacher reasoning traces (trace-based KD)
- **Student**: Qwen2.5-3B-Instruct with a LoRA adapter (r=64, α=128)
- **Config**: lr=1e-5, 3 epochs, cosine schedule, packing=True (see the sketch below)

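A minimal sketch of that setup with TRL's `SFTTrainer` and PEFT, assuming both datasets expose a compatible `messages` chat column and subsampling OpenR1 to 50k rows as in the launch example below; the actual preprocessing in `distillation_pipeline.py` may differ:

```python
# Trace-based KD is plain SFT on teacher reasoning traces: the student is
# fine-tuned on the text of the traces, with no access to teacher logits.
from datasets import concatenate_datasets, load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

sovereign = load_dataset("moro72842/Sovereign-Omega-SFT-V1", split="train")
openr1 = (
    load_dataset("open-r1/OpenR1-Math-220k", split="train")
    .shuffle(seed=42)
    .select(range(50_000))  # OPENR1_SAMPLES=50000
)
# Assumption: both datasets share a "messages" (chat-format) column.
train_ds = concatenate_datasets(
    [sovereign.select_columns(["messages"]), openr1.select_columns(["messages"])]
).shuffle(seed=42)

peft_config = LoraConfig(r=64, lora_alpha=128, task_type="CAUSAL_LM")

args = SFTConfig(
    output_dir="edge-omega-distilled-v1",
    learning_rate=1e-5,
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    packing=True,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    args=args,
    train_dataset=train_ds,
    peft_config=peft_config,
)
trainer.train()
```
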
## Why Trace-Based KD (Not Logit-Based)

From DeepSeek-R1 §2.4 and REDI (arXiv:2505.24850):
- Trace-based SFT: **72.6% on AIME 2024** (32B student)
- Equivalent RL without distillation: **47.0% on AIME 2024** (same base model)
- The ~25-point gap shows trace-based KD is substantially more effective than direct RL for reasoning at this scale

## Evaluation

Pre- and post-distillation comparison on:
- GPQA Diamond (198 PhD-level MCQs)
- ARC-Challenge (1,172 science MCQs)

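One way to run that comparison is EleutherAI's lm-evaluation-harness (`pip install lm_eval`); the sketch below is an assumed setup, not this repo's own harness. The task names are lm-eval's (`arc_challenge`, `gpqa_diamond_zeroshot`), the post-distillation checkpoint path is hypothetical, and GPQA is a gated dataset on the Hub:

```python
# Evaluate the student before and after distillation on the same tasks.
import lm_eval

checkpoints = {
    "pre": "Qwen/Qwen2.5-3B-Instruct",
    "post": "edge-omega-distilled-v1",  # hypothetical local checkpoint dir
}

for label, model_id in checkpoints.items():
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id},dtype=bfloat16",
        tasks=["arc_challenge", "gpqa_diamond_zeroshot"],
        batch_size=8,
    )
    for task, metrics in results["results"].items():
        acc = metrics.get("acc_norm,none", metrics.get("acc,none"))
        print(f"{label} {task}: {acc}")
```
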
## Launch

```bash
pip install trl transformers torch datasets accelerate peft trackio

# Distill with Sovereign-Omega + OpenR1 traces
STUDENT_MODEL=Qwen/Qwen2.5-3B-Instruct \
OPENR1_SAMPLES=50000 \
python distillation_pipeline.py

# Larger student on multi-GPU
STUDENT_MODEL=Qwen/Qwen2.5-7B-Instruct \
OPENR1_SAMPLES=220000 \
accelerate launch --config_file deepspeed_zero3.yaml distillation_pipeline.py
```