---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3-Coder-30B-A3B-Instruct
tags:
- eagle
- eagle3
- speculative-decoding
- draft-model
- sglang
- qwen3
- code
language:
- en
- zh
pipeline_tag: text-generation
---

# SGLang-EAGLE3-Qwen3-Coder-30B-A3B-Instruct

This is an **EAGLE3 draft model** for speculative decoding with [Qwen/Qwen3-Coder-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct).

## Model Description

EAGLE3 (the third generation of EAGLE, short for Extrapolation Algorithm for Greater Language-model Efficiency) is a speculative decoding technique that uses a lightweight draft model to propose future tokens, which the target model then verifies in parallel. Because every emitted token is checked by the target model, inference is typically accelerated 2-3x with no loss in output quality.
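
The core loop is easy to state in a greedy sketch. The following is illustrative pseudocode only, not SGLang's implementation (which drafts a token *tree* and verifies it in a single batched forward pass); `draft_model` and `target_model` are hypothetical callables that return a greedy next token for a token sequence:

```python
# Illustrative sketch of the draft/verify loop behind speculative decoding.
# `draft_model` and `target_model` are hypothetical callables mapping a
# token sequence to a greedy next token. Real implementations score all
# drafted positions with one batched target forward pass.

def speculative_decode(target_model, draft_model, tokens, num_steps=5, max_len=256):
    while len(tokens) < max_len:
        # 1) Draft: the small model cheaply proposes `num_steps` tokens.
        draft = list(tokens)
        for _ in range(num_steps):
            draft.append(draft_model(draft))

        # 2) Verify: accept drafted tokens while they match what the
        #    target model would itself have produced at each position.
        accepted = 0
        for i in range(num_steps):
            prefix = draft[: len(tokens) + i]
            if target_model(prefix) != draft[len(tokens) + i]:
                break
            accepted += 1

        # 3) Keep the accepted prefix plus one "free" target token, so at
        #    least one token is produced per verification cycle.
        tokens = draft[: len(tokens) + accepted]
        tokens.append(target_model(tokens))
    return tokens
```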

### Key Features

- **Target Model**: Qwen3-Coder-30B-A3B-Instruct (30B total parameters, 3B active per token)
- **Draft Model Size**: ~350MB (single transformer layer)
- **Training Data**: OpenPromptContainer (OPC) regenerated dataset
- **Training Steps**: 295,000 (epoch 1)
- **Framework**: Trained with [SpecForge](https://github.com/sgl-project/SpecForge)

### Training Metrics

| Metric | Value |
|--------|-------|
| First-token accuracy (acc_0) | 88.19% |
| Average accuracy over 7 draft positions | 85.19% |
| Training epochs | 1+ (295k steps) |
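
As a rough, illustrative back-of-envelope (assuming positions are accepted independently, which is only an approximation), accuracies in this range translate into several accepted draft tokens per verification cycle:

```python
# Back-of-envelope estimate (illustrative only): expected number of
# accepted draft tokens per verification cycle, assuming each position
# is accepted independently with the average accuracy above.
p = 0.8519       # average per-position acceptance
num_steps = 7    # draft positions measured during training

# Position k is only reached if all earlier positions were accepted,
# so the expected accepted length is sum_{k=1..n} p^k.
expected_accepted = sum(p**k for k in range(1, num_steps + 1))
print(f"~{expected_accepted:.2f} draft tokens accepted per cycle")
# Each cycle also emits one verified token from the target model itself.
```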

## Usage

### With SGLang

```python
import sglang as sgl

# Launch an offline engine with EAGLE3 speculative decoding
llm = sgl.Engine(
    model_path="Qwen/Qwen3-Coder-30B-A3B-Instruct",
    speculative_algorithm="EAGLE3",
    speculative_draft_model_path="sgl-project/SGLang-EAGLE3-Qwen3-Coder-30B-A3B-Instruct",
    speculative_num_steps=5,
    speculative_eagle_topk=8,
    speculative_num_draft_tokens=64,
)

# Generate text (for a single prompt, generate() returns a dict)
output = llm.generate("Write a Python function to sort a list:")
print(output["text"])
```
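
The three `speculative_*` parameters control the shape of the draft: `speculative_num_steps` is the number of autoregressive draft steps (tree depth), `speculative_eagle_topk` is the branching factor at each step, and `speculative_num_draft_tokens` is the total number of draft tokens sent to the target model for verification. The values above are a starting point; tuning them trades draft overhead against accepted length for your workload.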

### With SGLang Server

```bash
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path sgl-project/SGLang-EAGLE3-Qwen3-Coder-30B-A3B-Instruct \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 64 \
    --tp 8
```
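
Speculative decoding is transparent to clients, so the running server is queried like any SGLang deployment. A minimal sketch against the native `/generate` endpoint, assuming SGLang's default port 30000 (adjust if you changed it):

```python
import requests

# Query the running SGLang server; speculative decoding happens
# server-side, so this is an ordinary /generate request.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Write a Python function to sort a list:",
        "sampling_params": {"temperature": 0, "max_new_tokens": 256},
    },
)
print(response.json()["text"])
```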

## Model Architecture

The EAGLE3 draft model is a lightweight transformer that:
- Shares embeddings with the target model
- Uses a single transformer layer (hidden_size=2048, intermediate_size=12288)
- Predicts multiple future tokens autoregressively
- Takes the target model's hidden states as input (see the sketch after the config below)

```json
{
  "architectures": ["LlamaForCausalLMEagle3"],
  "hidden_size": 2048,
  "intermediate_size": 12288,
  "num_attention_heads": 32,
  "num_key_value_heads": 4,
  "num_hidden_layers": 1,
  "vocab_size": 151936
}
```
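
To make the bullet points concrete, here is a minimal PyTorch sketch of the data flow. It is illustrative only: the module names, the three-layer hidden-state fusion, and the stand-in `TransformerEncoderLayer` are assumptions based on the general EAGLE3 design, not this checkpoint's exact module structure.

```python
import torch
import torch.nn as nn

class Eagle3DraftSketch(nn.Module):
    """Illustrative EAGLE3-style draft data flow (not this repo's code)."""

    def __init__(self, hidden_size=2048, vocab_size=151936, num_fused_layers=3):
        super().__init__()
        # Token embeddings are shared with (copied from) the target model.
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        # Fuse target hidden states taken from several target layers
        # (concatenated along the feature dim) down to hidden_size.
        self.fuse = nn.Linear(num_fused_layers * hidden_size, hidden_size, bias=False)
        # Combine the token embedding with the fused target features.
        self.in_proj = nn.Linear(2 * hidden_size, hidden_size, bias=False)
        # The draft model proper is a single transformer layer; a stock
        # encoder layer stands in for the real decoder layer here.
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=32, dim_feedforward=12288, batch_first=True
        )
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, input_ids, target_features):
        # target_features: (batch, seq, num_fused_layers * hidden_size),
        # collected from the target model's forward pass.
        h = self.fuse(target_features)
        e = self.embed_tokens(input_ids)
        x = self.in_proj(torch.cat([e, h], dim=-1))
        x = self.layer(x)          # causal mask omitted for brevity
        return self.lm_head(x)     # draft logits for the next position
```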

## Training Details

- **Framework**: SpecForge with SGLang backend
- **Hardware**: 4x NVIDIA H200 GPUs (TP=4)
- **Batch Size**: 1 per GPU
- **Learning Rate**: 1e-4 with cosine annealing
- **Max Sequence Length**: 4096
- **Attention Backend**: FlexAttention

## Citation

If you use this model, please cite:

```bibtex
@article{li2024eagle,
  title={EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  journal={arXiv preprint arXiv:2401.15077},
  year={2024}
}

@misc{sglang2024,
  title={SGLang: Efficient Execution of Structured Language Model Programs},
  author={Zheng, Lianmin and others},
  year={2024},
  url={https://github.com/sgl-project/sglang}
}
```

## License

This model is released under the Apache 2.0 License, following the base model's license.