Step-3.5-128B-A11B | Technical Specifications
| Attribute | Detail |
|---|---|
| Base Model | Step-3.5-Flash |
| Architecture | Sparse Mixture-of-Experts (SMoE) |
| Model Type | Causal Language Model |
| Total Parameters | 128B |
| Active Parameters | 11B (per token) |
| Compression Ratio | 40% (Expert Pruning) |
| Pruning Strategy | Router-weighted Expert Activation Pruning (REAP) |
| Calibration Set | lkevincc0/glm47-math-code-calibration-1024 |
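As a rough sizing aid, the parameter counts in the table can be turned into an active-parameter fraction and a weights-only memory estimate. This is a minimal sketch, not a measurement: it assumes 2 bytes per parameter for BF16 and 1 byte for FP8, ignores KV cache, activations, and any tensors kept in higher precision, and treats the weights as split evenly across GPUs. The helper names are illustrative only.

```python
# Back-of-the-envelope sizing from the specification table (weights only).
TOTAL_PARAMS = 128e9   # "Total Parameters" row
ACTIVE_PARAMS = 11e9   # "Active Parameters" row (per token)

# Assumed storage cost per parameter; real checkpoints may differ.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def active_fraction(total: float = TOTAL_PARAMS, active: float = ACTIVE_PARAMS) -> float:
    """Share of all parameters that a single token actually uses."""
    return active / total


def weight_gb(dtype: str, total: float = TOTAL_PARAMS) -> float:
    """Approximate weight footprint in decimal gigabytes."""
    return total * BYTES_PER_PARAM[dtype] / 1e9


def weight_gb_per_gpu(dtype: str, num_gpus: int) -> float:
    """Weights per GPU under an even split across num_gpus devices."""
    return weight_gb(dtype) / num_gpus


if __name__ == "__main__":
    print(f"active fraction: {active_fraction():.1%}")
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_gb(dtype):.0f} GB total, "
              f"{weight_gb_per_gpu(dtype, 8):.0f} GB per GPU on 8 GPUs")
```

Under these assumptions, about 8.6% of the parameters are active per token, and the weights alone come to roughly 256 GB in BF16 or 128 GB in FP8. The `num_gpus=8` in the usage lines mirrors the tensor-parallel size used in the launch commands in the Local Deployment section.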
Local Deployment
Step 3.5 Flash is optimized for local inference and supports industry-standard backends including vLLM, SGLang, Hugging Face Transformers, and llama.cpp.
vLLM
We recommend using the latest nightly build of vLLM.
1. Install vLLM
# via Docker
docker pull vllm/vllm-openai:nightly
# or via pip (nightly wheels)
pip install -U vllm --pre \
--index-url https://pypi.org/simple \
--extra-index-url https://wheels.vllm.ai/nightly
2. Launch the server
Note: vLLM does not yet fully support MTP3. We are working on a pull request to add it, which is expected to significantly improve decoding speed.
- For FP8 model
vllm serve <MODEL_PATH_OR_HF_ID> \
--served-model-name step3p5-flash \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--disable-cascade-attn \
--reasoning-parser step3p5 \
--enable-auto-tool-choice \
--tool-call-parser step3p5 \
--hf-overrides '{"num_nextn_predict_layers": 1}' \
--speculative-config '{"method": "step3p5_mtp", "num_speculative_tokens": 1}' \
--trust-remote-code \
--quantization fp8
- For BF16 model
vllm serve <MODEL_PATH_OR_HF_ID> \
--served-model-name step3p5-flash \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--disable-cascade-attn \
--reasoning-parser step3p5 \
--enable-auto-tool-choice \
--tool-call-parser step3p5 \
--hf-overrides '{"num_nextn_predict_layers": 1}' \
--speculative-config '{"method": "step3p5_mtp", "num_speculative_tokens": 1}' \
--trust-remote-code
You can also refer to the Step-3.5-Flash recipe.
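To get a feel for why the number of speculative tokens matters, the sketch below estimates how many tokens one forward pass of the main model yields on average. It uses the standard simplification that each draft token is accepted independently with the same probability and that drafting stops at the first rejection. The acceptance rate of 0.8 is a hypothetical placeholder, not a measured value for this model, and the cost of producing the drafts is ignored.

```python
def expected_tokens_per_step(acceptance_rate: float, num_draft_tokens: int) -> float:
    """Expected tokens emitted per forward pass of the main model.

    Assumes each draft token is accepted independently with probability
    acceptance_rate, and that everything after the first rejection is dropped.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if num_draft_tokens < 0:
        raise ValueError("num_draft_tokens must be non-negative")
    # The i = 0 term is the token the main model always contributes.
    # Draft token i only counts if all i drafts up to it were accepted.
    return sum(acceptance_rate ** i for i in range(num_draft_tokens + 1))


if __name__ == "__main__":
    alpha = 0.8  # hypothetical acceptance rate, for illustration only
    for k in (1, 3):
        print(f"{k} draft token(s): "
              f"{expected_tokens_per_step(alpha, k):.2f} tokens per step")
```

With the placeholder rate, one draft token (the `num_speculative_tokens` value in the commands above) gives about 1.8 tokens per step, while three draft tokens would give about 2.95. Real gains depend on the actual acceptance rate and the drafting overhead.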
SGLang
1. Install SGLang
# via Docker
docker pull lmsysorg/sglang:dev-pr-18084
# or from source (pip)
pip install "sglang[all] @ git+https://github.com/sgl-project/sglang.git"
2. Launch the server
- For BF16 model
sglang serve --model-path <MODEL_PATH_OR_HF_ID> \
--served-model-name step3p5-flash \
--tp-size 8 \
--tool-call-parser step3p5 \
--reasoning-parser step3p5 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--enable-multi-layer-eagle \
--host 0.0.0.0 \
--port 8000
- For FP8 model
sglang serve --model-path <MODEL_PATH_OR_HF_ID> \
--served-model-name step3p5-flash \
--tp-size 8 \
--ep-size 8 \
--tool-call-parser step3p5 \
--reasoning-parser step3p5 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--enable-multi-layer-eagle \
--host 0.0.0.0 \
--port 8000
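Once a server is running, it can be queried through its OpenAI-compatible chat endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the server listens on `localhost:8000` and was started with `--served-model-name step3p5-flash`, as in the commands above. The `get_weather` tool is hypothetical and exists only to show the request shape. The reasoning field name is also an assumption: servers with a reasoning parser commonly return it as `reasoning_content`, and some versions use `reasoning`, so the helper checks both.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # matches --port 8000 in the launch commands
MODEL = "step3p5-flash"                # matches --served-model-name

# Hypothetical tool definition, used only to illustrate the request shape.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_chat_request(prompt, tools=None, model=MODEL):
    """Build an OpenAI-style chat completion payload."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    if tools:
        payload["tools"] = tools
    return payload


def post_chat(payload, base_url=BASE_URL, timeout=600):
    """POST the payload to the chat completions endpoint and decode the JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def split_message(response):
    """Separate reasoning text, final answer, and parsed tool calls."""
    message = response["choices"][0]["message"]
    # Field name is an assumption; it varies between servers and versions.
    reasoning = message.get("reasoning_content") or message.get("reasoning")
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # Tool arguments arrive as a JSON-encoded string.
        calls.append((function["name"], json.loads(function["arguments"])))
    return {"reasoning": reasoning, "answer": message.get("content"), "tool_calls": calls}


# Example usage (requires a running server):
#   reply = post_chat(build_chat_request("What is the weather in Paris?", [WEATHER_TOOL]))
#   print(split_message(reply))
```

The request builder and the response splitter are kept separate from the network call so that they can be checked without a running server.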
Reference
- Hugging Face: stepfun-ai/Step-3.5-Flash
- Optimization Tech: CerebrasResearch/reap