---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3-Coder-30B-A3B-Instruct
tags:
  - eagle
  - eagle3
  - speculative-decoding
  - draft-model
  - sglang
  - qwen3
  - code
language:
  - en
  - zh
pipeline_tag: text-generation
---

# SGLang-EAGLE3-Qwen3-Coder-30B-A3B-Instruct

This is an **EAGLE3 draft model** for speculative decoding with [Qwen/Qwen3-Coder-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct).

## Model Description

EAGLE3 is the third generation of EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a speculative decoding technique that uses a lightweight draft model to predict future tokens, which the target model then verifies in parallel. Because every emitted token is checked against the target model's own distribution, this accelerates inference significantly (typically 2-3x) with no loss in output quality.
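
At a high level, each decoding step drafts several tokens cheaply and verifies them with one parallel target-model pass. The sketch below illustrates a greedy, linear-chain version of that loop; it is purely illustrative (EAGLE3 actually drafts a token tree and conditions on the target's hidden states), and `target_logits_fn` / `draft_logits_fn` are hypothetical stand-ins, not SGLang APIs:

```python
import torch

def speculative_step(target_logits_fn, draft_logits_fn, tokens, k=5):
    """One greedy speculative-decoding step (illustrative only).

    EAGLE3 really drafts a token *tree* and feeds the target model's
    hidden states into the draft model; this linear sketch omits both.
    """
    # 1. Draft k tokens autoregressively with the cheap draft model.
    seq = tokens.clone()
    for _ in range(k):
        next_tok = draft_logits_fn(seq)[:, -1].argmax(-1, keepdim=True)
        seq = torch.cat([seq, next_tok], dim=-1)

    # 2. A single parallel target pass scores every drafted position.
    target_pred = target_logits_fn(seq).argmax(-1)  # token predicted after each position

    # 3. Accept the longest draft prefix the target agrees with; on the
    #    first disagreement, keep the target's token and stop.
    out = tokens
    for i in range(tokens.shape[-1], seq.shape[-1]):
        t = target_pred[:, i - 1 : i]  # target's choice for position i
        out = torch.cat([out, t], dim=-1)
        if t.item() != seq[:, i].item():
            break
    return out

# Toy usage with random "models" over a 100-token vocabulary:
fake = lambda s: torch.randn(1, s.shape[-1], 100)
print(speculative_step(fake, fake, torch.tensor([[1, 2, 3]])))
```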

### Key Features

- **Target Model**: Qwen3-Coder-30B-A3B-Instruct (30B parameters, 3B active)
- **Draft Model Size**: ~350MB (single transformer layer)
- **Training Data**: OpenPromptContainer (OPC) regenerated dataset
- **Training Steps**: 295,000 (Epoch 1)
- **Framework**: Trained with [SpecForge](https://github.com/sgl-project/SpecForge)

### Training Metrics

| Metric | Value |
|--------|-------|
| First Token Accuracy (acc_0) | 88.19% |
| Average Accuracy (7 positions) | 85.19% |
| Training Epochs | 1+ (295k steps) |
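
Here `acc_k` is the rate at which the draft's top-1 prediction at draft position k matches the target model's choice, with the average presumably taken over positions 0 through 6. A toy illustration of the aggregation (placeholder data, not SpecForge's evaluation code):

```python
import torch

# Placeholder data: did the draft's top-1 token match the target's at
# each of 7 draft positions, over an evaluation set? shape (N, 7), bool
matches = torch.rand(10_000, 7) > 0.15

acc = matches.float().mean(dim=0)  # acc_0 ... acc_6
print(f"acc_0 = {acc[0]:.2%}, average over 7 positions = {acc.mean():.2%}")
```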

## Usage

### With SGLang

```python
import sglang as sgl

# Launch with EAGLE3 speculative decoding
llm = sgl.Engine(
    model_path="Qwen/Qwen3-Coder-30B-A3B-Instruct",
    speculative_algorithm="EAGLE3",
    speculative_draft_model_path="sgl-project/SGLang-EAGLE3-Qwen3-Coder-30B-A3B-Instruct",
    speculative_num_steps=5,
    speculative_eagle_topk=8,
    speculative_num_draft_tokens=64,
)

# Generate text; generate() returns a dict containing the output text
output = llm.generate("Write a Python function to sort a list:")
print(output["text"])
```

### With SGLang Server

```bash
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path sgl-project/SGLang-EAGLE3-Qwen3-Coder-30B-A3B-Instruct \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 64 \
    --tp 8
```
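
Once the server is up, speculative decoding is transparent to clients: requests go through SGLang's OpenAI-compatible API as usual. A minimal client sketch, assuming the default port 30000:

```python
import openai

# Standard OpenAI client pointed at the local SGLang server
# (any api_key string works for a local server).
client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-30B-A3B-Instruct",
    messages=[{"role": "user", "content": "Write a Python function to sort a list."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```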

## Model Architecture

The EAGLE3 draft model is a lightweight transformer that:
- Shares embeddings with the target model
- Uses a single transformer layer (hidden_size=2048, intermediate_size=12288)
- Predicts multiple future tokens autoregressively
- Uses the target model's hidden states as input (see the sketch below the config)

```json
{
  "architectures": ["LlamaForCausalLMEagle3"],
  "hidden_size": 2048,
  "intermediate_size": 12288,
  "num_attention_heads": 32,
  "num_key_value_heads": 4,
  "num_hidden_layers": 1,
  "vocab_size": 151936
}
```
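
As a rough mental model (not the actual `LlamaForCausalLMEagle3` implementation, which uses a Llama-style decoder layer with grouped-query attention), the dataflow looks like this hypothetical sketch:

```python
import torch
import torch.nn as nn

class DraftSketch(nn.Module):
    """Simplified one-layer EAGLE-style draft head (illustrative only).

    EAGLE3 fuses hidden states from several target layers; a single
    target-hidden tensor stands in for that fused input here.
    """
    def __init__(self, hidden_size=2048, target_hidden=2048, vocab_size=151936):
        super().__init__()
        # Fuse the target's hidden states with the shared token embeddings.
        self.fc = nn.Linear(target_hidden + hidden_size, hidden_size)
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=32, dim_feedforward=12288, batch_first=True
        )
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, target_hidden_states, token_embeddings):
        x = self.fc(torch.cat([target_hidden_states, token_embeddings], dim=-1))
        x = self.layer(x)  # the single transformer layer
        return self.lm_head(x)  # logits over the shared vocabulary

# Shape check: batch=1, seq=8 -> logits (1, 8, 151936)
m = DraftSketch()
print(m(torch.randn(1, 8, 2048), torch.randn(1, 8, 2048)).shape)
```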

## Training Details

- **Framework**: SpecForge with SGLang backend
- **Hardware**: 4x NVIDIA H200 GPUs (TP=4)
- **Batch Size**: 1 per GPU
- **Learning Rate**: 1e-4 with cosine annealing (sketched below)
- **Max Sequence Length**: 4096
- **Attention Backend**: FlexAttention
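
For orientation, that optimizer recipe corresponds to the standard PyTorch pattern below; this is a sketch with a placeholder model and stand-in loss, not SpecForge's training loop:

```python
import torch

# Placeholder draft model and step count; the real run was 295k steps
# on 4x H200 with SpecForge, which this sketch does not reproduce.
draft_model = torch.nn.Linear(2048, 2048)
total_steps = 295_000

optimizer = torch.optim.AdamW(draft_model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

for step in range(3):  # placeholder loop body
    loss = draft_model(torch.randn(8, 2048)).pow(2).mean()  # stand-in loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```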

## Citation

If you use this model, please cite:

```bibtex
@article{li2024eagle,
  title={EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  journal={arXiv preprint arXiv:2401.15077},
  year={2024}
}

@misc{sglang2024,
  title={SGLang: Efficient Execution of Structured Language Model Programs},
  author={Zheng, Lianmin and others},
  year={2024},
  url={https://github.com/sgl-project/sglang}
}
```

## License

This model is released under the Apache 2.0 License, following the base model's license.