# Eagle-Qwen2.5-14B-Instruct
## Introduction

Eagle-Qwen2.5-14B-Instruct is a draft model trained on top of the open-source Qwen2.5-14B-Instruct model. It can be used with the EAGLE-2 speculative decoding algorithm to accelerate large language model inference during the decoding stage.
Note: The configuration files in this project follow the SGLang framework. If you intend to use another framework such as vLLM, please adjust the relevant configurations accordingly.
## Training Configuration

Building upon the open-source EAGLE codebase, we developed additional training code for the Qwen2.5-14B-Instruct model and carried out training with the following configuration. (Note: we plan to open-source the training code in a follow-up release.)
- Dataset: the ShareGPT-68K conversation dataset.
- Training environment: 4 NVIDIA RTX 4090 GPUs (24 GB VRAM each) with the DeepSpeed framework; the total training duration was approximately one day. A minimal sketch of this kind of setup is shown below.
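Since the training code has not yet been released, the following is only a minimal sketch of the kind of DeepSpeed setup described above. Every hyperparameter, and the placeholder draft head, are illustrative assumptions, not the actual configuration we used:

```python
# Illustrative DeepSpeed setup for training an Eagle draft head on 4x 24 GB GPUs.
# All values below are assumptions for illustration; the released configuration
# may differ. Launch with a DeepSpeed launcher, e.g.: deepspeed train_sketch.py
import torch.nn as nn
import deepspeed

# Hypothetical stand-in for the Eagle draft head (the real architecture follows
# the EAGLE repository); 5120 is Qwen2.5-14B's hidden size.
draft_model = nn.Linear(5120, 5120)

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},          # float16 training on RTX 4090s
    "zero_optimization": {"stage": 2},  # shard optimizer state and gradients
    "optimizer": {"type": "AdamW", "params": {"lr": 3e-5}},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=draft_model,
    model_parameters=draft_model.parameters(),
    config=ds_config,
)
```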
## Model Inference Launch Command

To launch the EAGLE-2 service with SGLang, use the following command:
```bash
python3 -m sglang.launch_server \
  --model Qwen/Qwen2.5-14B-Instruct \
  --speculative-algo EAGLE \
  --speculative-draft Zjcxy-SmartAI/Eagle-Qwen2.5-14B-Instruct \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 4 \
  --speculative-num-draft-tokens 16 \
  --dtype float16 \
  --port 30000 \
  --mem-fraction 0.7 \
  --tp-size 2 \
  --cuda-graph-max-bs 16 \
  --cuda-graph-bs {1,2,3,4}
```
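Once the server is up, it exposes SGLang's OpenAI-compatible HTTP API on the chosen port. The snippet below is a minimal client sketch; the prompt and `max_tokens` value are illustrative assumptions:

```python
# Minimal client sketch: query the EAGLE-2 server launched above through
# SGLang's OpenAI-compatible endpoint on port 30000.
import requests

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-14B-Instruct",
        "messages": [
            {"role": "user", "content": "Briefly explain speculative decoding."}
        ],
        "max_tokens": 256,  # illustrative value
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```

Speculative decoding is transparent to the client, so the same request also works against the baseline server launched below.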
To launch the original model service (for comparison experiments) with SGLang, use the following command:
```bash
python3 -m sglang.launch_server \
  --model Qwen/Qwen2.5-14B-Instruct \
  --port 30000 \
  --mem-fraction 0.7 \
  --tp-size 2 \
  --cuda-graph-max-bs 16 \
  --cuda-graph-bs {1,2,3,4}
```
## Performance Evaluation

We ran performance tests on 2 RTX 4090 GPUs, using the multi-turn MT-bench and single-turn Alpaca datasets. The results are as follows:
| Dataset | Concurrency | Throughput per request (tokens/s) | Accept length |
|---|---|---|---|
| MT-bench | 1 | 89.76 | 3.15 |
| Alpaca | 1 | 81.90 | 2.94 |
| MT-bench | 4 | 60.15 | 3.17 |
| Alpaca | 4 | 55.19 | 2.93 |
The original model performance is as follows:
| Dataset | Concurrency | Throughput per request (tokens/s) |
|---|---|---|
| MT-bench | 1 | 55.14 |
| Alpaca | 1 | 54.95 |
| MT-bench | 4 | 49.43 |
| Alpaca | 4 | 47.68 |
Comparing the two, we draw the following conclusions:
- MT-bench: at concurrency 1, the optimized model delivers 62.8% higher throughput than the baseline model.
- Alpaca: at concurrency 1, the optimized model delivers 49.0% higher throughput than the baseline model.
Note:
- Calculation formula: Performance Improvement (%) = (Optimized Model Value - Baseline Model Value) / Baseline Model Value × 100. A small sketch verifying the reported numbers follows this list.
- Extensive testing showed that the chosen service parameters give the best inference speed in this setup, at the cost of a somewhat shorter accept length.
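The percentages above can be checked directly against the throughput tables; the following sketch applies the formula to the measured values, including the concurrency-4 rows for which no percentage is quoted above:

```python
# Reproduce the reported improvements from the throughput tables using
# Performance Improvement (%) = (optimized - baseline) / baseline * 100.
def improvement(optimized: float, baseline: float) -> float:
    return (optimized - baseline) / baseline * 100

print(f"MT-bench, concurrency 1: {improvement(89.76, 55.14):.1f}%")  # 62.8%
print(f"Alpaca,   concurrency 1: {improvement(81.90, 54.95):.1f}%")  # 49.0%
print(f"MT-bench, concurrency 4: {improvement(60.15, 49.43):.1f}%")  # 21.7%
print(f"Alpaca,   concurrency 4: {improvement(55.19, 47.68):.1f}%")  # 15.8%
```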
## Relevant Links

- Qwen2.5-14B-Instruct open-source weights: https://modelscope.cn/models/Qwen/Qwen2.5-14B-Instruct
- EAGLE open-source repository: https://github.com/SafeAILab/EAGLE
- Eagle3 weights for Qwen3-4B-Instruct-2507: https://huggingface.co/Zjcxy-SmartAI/Eagle3-Qwen3-4B-Instruct-2507-zh