---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3-Coder-30B-A3B-Instruct
tags:
- eagle
- eagle3
- speculative-decoding
- draft-model
- sglang
- qwen3
- code
language:
- en
- zh
pipeline_tag: text-generation
---

# SGLang-EAGLE3-Qwen3-Coder-30B-A3B-Instruct

This is an **EAGLE3 draft model** for speculative decoding with [Qwen/Qwen3-Coder-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct).

## Model Description

EAGLE3 (the third generation of EAGLE, short for Extrapolation Algorithm for Greater Language-model Efficiency) is a speculative decoding technique that uses a lightweight draft model to propose future tokens, which the target model then verifies in parallel. Because every emitted token is checked by the target model, inference is typically accelerated 2-3x with no loss in output quality.
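
The core loop is easy to state in a greedy sketch. The following is illustrative pseudocode only, not SGLang's implementation (which drafts a token *tree* and verifies it in a single batched forward pass); `draft_model` and `target_model` are hypothetical callables that return a greedy next token for a token sequence:

```python
# Illustrative sketch of the draft/verify loop behind speculative decoding.
# `draft_model` and `target_model` are hypothetical callables mapping a
# token sequence to a greedy next token. Real implementations score all
# drafted positions with one batched target forward pass.

def speculative_decode(target_model, draft_model, tokens, num_steps=5, max_len=256):
    while len(tokens) < max_len:
        # 1) Draft: the small model cheaply proposes `num_steps` tokens.
        draft = list(tokens)
        for _ in range(num_steps):
            draft.append(draft_model(draft))

        # 2) Verify: accept drafted tokens while they match what the
        #    target model would itself have produced at each position.
        accepted = 0
        for i in range(num_steps):
            prefix = draft[: len(tokens) + i]
            if target_model(prefix) != draft[len(tokens) + i]:
                break
            accepted += 1

        # 3) Keep the accepted prefix plus one "free" target token, so at
        #    least one token is produced per verification cycle.
        tokens = draft[: len(tokens) + accepted]
        tokens.append(target_model(tokens))
    return tokens
```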

### Key Features

- **Target Model**: Qwen3-Coder-30B-A3B-Instruct (30B total parameters, 3B active per token)
- **Draft Model Size**: ~350MB (single transformer layer)
- **Training Data**: OpenPromptContainer (OPC) regenerated dataset
- **Training Steps**: 295,000 (epoch 1)
- **Framework**: Trained with [SpecForge](https://github.com/sgl-project/SpecForge)

### Training Metrics

| Metric | Value |
|--------|-------|
| First-token accuracy (acc_0) | 88.19% |
| Average accuracy over 7 draft positions | 85.19% |
| Training epochs | 1+ (295k steps) |
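
As a rough, illustrative back-of-envelope (assuming positions are accepted independently, which is only an approximation), accuracies in this range translate into several accepted draft tokens per verification cycle:

```python
# Back-of-envelope estimate (illustrative only): expected number of
# accepted draft tokens per verification cycle, assuming each position
# is accepted independently with the average accuracy above.
p = 0.8519       # average per-position acceptance
num_steps = 7    # draft positions measured during training

# Position k is only reached if all earlier positions were accepted,
# so the expected accepted length is sum_{k=1..n} p^k.
expected_accepted = sum(p**k for k in range(1, num_steps + 1))
print(f"~{expected_accepted:.2f} draft tokens accepted per cycle")
# Each cycle also emits one verified token from the target model itself.
```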

## Usage

### With SGLang

```python
import sglang as sgl

# Launch an offline engine with EAGLE3 speculative decoding
llm = sgl.Engine(
    model_path="Qwen/Qwen3-Coder-30B-A3B-Instruct",
    speculative_algorithm="EAGLE3",
    speculative_draft_model_path="sgl-project/SGLang-EAGLE3-Qwen3-Coder-30B-A3B-Instruct",
    speculative_num_steps=5,
    speculative_eagle_topk=8,
    speculative_num_draft_tokens=64,
)

# Generate text (for a single prompt, generate() returns a dict)
output = llm.generate("Write a Python function to sort a list:")
print(output["text"])
```
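
The three `speculative_*` parameters control the shape of the draft: `speculative_num_steps` is the number of autoregressive draft steps (tree depth), `speculative_eagle_topk` is the branching factor at each step, and `speculative_num_draft_tokens` is the total number of draft tokens sent to the target model for verification. The values above are a starting point; tuning them trades draft overhead against accepted length for your workload.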

### With SGLang Server

```bash
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path sgl-project/SGLang-EAGLE3-Qwen3-Coder-30B-A3B-Instruct \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 64 \
    --tp 8
```
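
Speculative decoding is transparent to clients, so the running server is queried like any SGLang deployment. A minimal sketch against the native `/generate` endpoint, assuming SGLang's default port 30000 (adjust if you changed it):

```python
import requests

# Query the running SGLang server; speculative decoding happens
# server-side, so this is an ordinary /generate request.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Write a Python function to sort a list:",
        "sampling_params": {"temperature": 0, "max_new_tokens": 256},
    },
)
print(response.json()["text"])
```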

## Model Architecture

The EAGLE3 draft model is a lightweight transformer that:
- Shares embeddings with the target model
- Uses a single transformer layer (hidden_size=2048, intermediate_size=12288)
- Predicts multiple future tokens autoregressively
- Takes the target model's hidden states as input (see the sketch after the config below)

```json
{
  "architectures": ["LlamaForCausalLMEagle3"],
  "hidden_size": 2048,
  "intermediate_size": 12288,
  "num_attention_heads": 32,
  "num_key_value_heads": 4,
  "num_hidden_layers": 1,
  "vocab_size": 151936
}
```
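
To make the bullet points concrete, here is a minimal PyTorch sketch of the data flow. It is illustrative only: the module names, the three-layer hidden-state fusion, and the stand-in `TransformerEncoderLayer` are assumptions based on the general EAGLE3 design, not this checkpoint's exact module structure.

```python
import torch
import torch.nn as nn

class Eagle3DraftSketch(nn.Module):
    """Illustrative EAGLE3-style draft data flow (not this repo's code)."""

    def __init__(self, hidden_size=2048, vocab_size=151936, num_fused_layers=3):
        super().__init__()
        # Token embeddings are shared with (copied from) the target model.
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        # Fuse target hidden states taken from several target layers
        # (concatenated along the feature dim) down to hidden_size.
        self.fuse = nn.Linear(num_fused_layers * hidden_size, hidden_size, bias=False)
        # Combine the token embedding with the fused target features.
        self.in_proj = nn.Linear(2 * hidden_size, hidden_size, bias=False)
        # The draft model proper is a single transformer layer; a stock
        # encoder layer stands in for the real decoder layer here.
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=32, dim_feedforward=12288, batch_first=True
        )
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, input_ids, target_features):
        # target_features: (batch, seq, num_fused_layers * hidden_size),
        # collected from the target model's forward pass.
        h = self.fuse(target_features)
        e = self.embed_tokens(input_ids)
        x = self.in_proj(torch.cat([e, h], dim=-1))
        x = self.layer(x)          # causal mask omitted for brevity
        return self.lm_head(x)     # draft logits for the next position
```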

## Training Details

- **Framework**: SpecForge with SGLang backend
- **Hardware**: 4x NVIDIA H200 GPUs (TP=4)
- **Batch Size**: 1 per GPU
- **Learning Rate**: 1e-4 with cosine annealing
- **Max Sequence Length**: 4096
- **Attention Backend**: FlexAttention

## Citation

If you use this model, please cite:

```bibtex
@article{li2024eagle,
  title={EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  journal={arXiv preprint arXiv:2401.15077},
  year={2024}
}

@misc{sglang2024,
  title={SGLang: Efficient Execution of Structured Language Model Programs},
  author={Zheng, Lianmin and others},
  year={2024},
  url={https://github.com/sgl-project/sglang}
}
```

## License

This model is released under the Apache 2.0 License, following the base model's license.