# Eagle-Qwen2.5-14B-Instruct
## Introduction

Eagle-Qwen2.5-14B-Instruct is a draft model trained on top of the open-source Qwen2.5-14B-Instruct model. It can be used with the EAGLE-2 speculative decoding algorithm to accelerate large language model inference during the decoding stage.
Note: The configuration files in this project follow the SGLang framework. If you intend to use another framework such as vLLM, please adjust the relevant configurations accordingly.
## Training Configuration

Building upon the open-source EAGLE codebase, we developed additional training code for the Qwen2.5-14B-Instruct model and carried out training with the following configuration. (Note: we plan to open-source the training code in a follow-up release.)
- Dataset: the ShareGPT-68K conversation dataset.
- Training environment: 4 NVIDIA RTX 4090 GPUs (24 GB VRAM each) with the DeepSpeed framework; the total training duration was approximately one day. A minimal sketch of this kind of setup is shown below.
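Since the training code has not yet been released, the following is only a minimal sketch of the kind of DeepSpeed setup described above. Every hyperparameter, and the placeholder draft head, are illustrative assumptions, not the actual configuration we used:

```python
# Illustrative DeepSpeed setup for training an Eagle draft head on 4x 24 GB GPUs.
# All values below are assumptions for illustration; the released configuration
# may differ. Launch with a DeepSpeed launcher, e.g.: deepspeed train_sketch.py
import torch.nn as nn
import deepspeed

# Hypothetical stand-in for the Eagle draft head (the real architecture follows
# the EAGLE repository); 5120 is Qwen2.5-14B's hidden size.
draft_model = nn.Linear(5120, 5120)

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},          # float16 training on RTX 4090s
    "zero_optimization": {"stage": 2},  # shard optimizer state and gradients
    "optimizer": {"type": "AdamW", "params": {"lr": 3e-5}},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=draft_model,
    model_parameters=draft_model.parameters(),
    config=ds_config,
)
```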
## Model Inference Launch Command

To launch the EAGLE-2 service with SGLang, use the following command:
```bash
python3 -m sglang.launch_server \
  --model Qwen/Qwen2.5-14B-Instruct \
  --speculative-algo EAGLE \
  --speculative-draft Zjcxy-SmartAI/Eagle-Qwen2.5-14B-Instruct \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 4 \
  --speculative-num-draft-tokens 16 \
  --dtype float16 \
  --port 30000 \
  --mem-fraction 0.7 \
  --tp-size 2 \
  --cuda-graph-max-bs 16 \
  --cuda-graph-bs {1,2,3,4}
```
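Once the server is up, it exposes SGLang's OpenAI-compatible HTTP API on the chosen port. The snippet below is a minimal client sketch; the prompt and `max_tokens` value are illustrative assumptions:

```python
# Minimal client sketch: query the EAGLE-2 server launched above through
# SGLang's OpenAI-compatible endpoint on port 30000.
import requests

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-14B-Instruct",
        "messages": [
            {"role": "user", "content": "Briefly explain speculative decoding."}
        ],
        "max_tokens": 256,  # illustrative value
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```

Speculative decoding is transparent to the client, so the same request also works against the baseline server launched below.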
To launch the original model service (for comparison experiments) with SGLang, use the following command:
```bash
python3 -m sglang.launch_server \
  --model Qwen/Qwen2.5-14B-Instruct \
  --port 30000 \
  --mem-fraction 0.7 \
  --tp-size 2 \
  --cuda-graph-max-bs 16 \
  --cuda-graph-bs {1,2,3,4}
```
## Performance Evaluation

We ran performance tests on 2 RTX 4090 GPUs, using the multi-turn MT-bench and single-turn Alpaca datasets. The results are as follows:
| Dataset | Concurrency | Throughput per request (tokens/s) | Accept length |
|---|---|---|---|
| MT-bench | 1 | 89.76 | 3.15 |
| Alpaca | 1 | 81.90 | 2.94 |
| MT-bench | 4 | 60.15 | 3.17 |
| Alpaca | 4 | 55.19 | 2.93 |
The original model performance is as follows:
| Dataset | Concurrency | Throughput per request (tokens/s) |
|---|---|---|
| MT-bench | 1 | 55.14 |
| Alpaca | 1 | 54.95 |
| MT-bench | 4 | 49.43 |
| Alpaca | 4 | 47.68 |
Comparing the two, we draw the following conclusions:
- MT-bench: at concurrency 1, the optimized model delivers 62.8% higher throughput than the baseline model.
- Alpaca: at concurrency 1, the optimized model delivers 49.0% higher throughput than the baseline model.
Note:
- Calculation formula: Performance Improvement (%) = (Optimized Model Value - Baseline Model Value) / Baseline Model Value × 100. A small sketch verifying the reported numbers follows this list.
- Extensive testing showed that the chosen service parameters give the best inference speed in this setup, at the cost of a somewhat shorter accept length.
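The percentages above can be checked directly against the throughput tables; the following sketch applies the formula to the measured values, including the concurrency-4 rows for which no percentage is quoted above:

```python
# Reproduce the reported improvements from the throughput tables using
# Performance Improvement (%) = (optimized - baseline) / baseline * 100.
def improvement(optimized: float, baseline: float) -> float:
    return (optimized - baseline) / baseline * 100

print(f"MT-bench, concurrency 1: {improvement(89.76, 55.14):.1f}%")  # 62.8%
print(f"Alpaca,   concurrency 1: {improvement(81.90, 54.95):.1f}%")  # 49.0%
print(f"MT-bench, concurrency 4: {improvement(60.15, 49.43):.1f}%")  # 21.7%
print(f"Alpaca,   concurrency 4: {improvement(55.19, 47.68):.1f}%")  # 15.8%
```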
## Relevant Links

- Qwen2.5-14B-Instruct open-source weights: https://modelscope.cn/models/Qwen/Qwen2.5-14B-Instruct
- EAGLE open-source repository: https://github.com/SafeAILab/EAGLE
- Eagle3 weights for Qwen3-4B-Instruct-2507: https://huggingface.co/Zjcxy-SmartAI/Eagle3-Qwen3-4B-Instruct-2507-zh