Question: will it work in vLLM or SGLang with RTX 6000 Blackwells? CUDA arch sm120

#1
by Fernanda24 - opened

Hi, does anyone know if this could work in vLLM or SGLang with RTX 6000 Blackwells? CUDA arch sm120

Update: it works!

Launch with the following (use the chat template from the vLLM repo for seamless tool calling):

VLLM_SLEEP_WHEN_IDLE=1 CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve Intel/DeepSeek-V3.1-Terminus-int4-mixed-AutoRound/ --tensor-parallel-size 4 --served-model-name deepseek --max-model-len 58776 --tool-call-parser deepseek_v31 --enable-auto-tool-choice --enable-expert-parallel --gpu-memory-utilization 0.945 --trust-remote-code --port 8080 --enable-chunked-prefill --max-num-batched-tokens 2048 --block-size 8 --max-num-seqs 2 --chat-template examples/tool_chat_template_deepseekv31.jinja

You need this chat template to make tool calling fly seamlessly: https://github.com/vllm-project/vllm/blob/main/examples/tool_chat_template_deepseekv31.jinja
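Once it is up, a quick way to confirm tool calling works is to hit the OpenAI-compatible endpoint directly. This is only a minimal sketch, assuming the server above is reachable on localhost:8080 with served model name "deepseek"; the get_weather tool is a made-up example, not part of the model or repo. A working setup should come back with a tool_calls entry rather than plain text:

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek",
    "messages": [{"role": "user", "content": "What is the weather in Lisbon right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city (made-up example tool)",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto"
  }'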

You can probably experiment with the settings in the launch command. Not sure if those settings were what I finally landed on.

@Fernanda24
Thank you, this works indeed.
(4x GPU) vLLM docker 0.12.0, ~50k context: 43 tps at 1 concurrent request, 175 tps at 10
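If anyone wants to reproduce a rough single-stream tps figure against a server launched as above, the non-streaming response includes a usage block, so completion tokens divided by wall time gives a ballpark. A sketch, assuming localhost:8080 and served model name "deepseek" as in the earlier command; the prompt is arbitrary:

time curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek", "messages": [{"role": "user", "content": "Write a 300-word story about a robot."}], "max_tokens": 512}' \
  | python3 -c 'import sys, json; print(json.load(sys.stdin)["usage"]["completion_tokens"], "completion tokens")'
# tps is roughly the completion token count divided by the wall time that `time` reports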

Just for statistics, this is running with this kernel path:
(Worker_TP0_DCP0 pid=84) INFO 12-04 23:52:20 [gptq_marlin.py:377] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(Worker_TP3_DCP3 pid=87) INFO 12-04 23:52:20 [cuda.py:411] Using TRITON_MLA attention backend out of potential backends: ['TRITON_MLA']
(Worker_TP3_DCP3 pid=87) INFO 12-04 23:52:20 [layer.py:379] Enabled separate cuda stream for MoE shared_experts

KV cache
(Worker_TP0_DCP0 pid=84) INFO 12-04 23:58:19 [gpu_model_runner.py:3549] Model loading took 84.8058 GiB memory and 358.932576 seconds
(Worker_TP0_DCP0 pid=84) INFO 12-04 23:58:52 [gpu_worker.py:359] Available KV cache memory: 3.68 GiB
(EngineCore_DP0 pid=46) INFO 12-04 23:58:53 [kv_cache_utils.py:1286] GPU KV cache size: 225,216 tokens
(EngineCore_DP0 pid=46) INFO 12-04 23:58:53 [kv_cache_utils.py:1291] Maximum concurrency for 131,072 tokens per request: 1.72x
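As a quick sanity check, that 1.72x concurrency figure is just the KV cache capacity divided by the per-request token limit from the lines above:

python3 -c 'print(225216 / 131072)'
# prints 1.71875, which matches the reported 1.72x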

@willfalco those console outputs look really good! 225K tokens, that must be FP8 KV cache? Is this SGLang or vLLM, and what was your launch command and env variables? It looks better than mine. Is this the AutoRound or the AWQ?

@Fernanda24 I used your running flags almost unchanged, adding:
OMP_NUM_THREADS=16 VLLM_MARLIN_USE_ATOMIC_ADD=1
--reasoning-parser minimax_m2_append_think (to add <think> at the start of the answer)
and --decode-context-parallel-size to get more KV cache
--compilation-config '{"pass_config":{"enable_fi_allreduce_fusion":true,"enable_noop":true},"custom_ops":["+rms_norm"],"cudagraph_mode":"FULL_AND_PIECEWISE"}'
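
For anyone wanting to replicate it, merging those additions into the original launch command would look roughly like the sketch below. This is not willfalco's exact command: the --decode-context-parallel-size value of 4 is only inferred from the DCP0..DCP3 worker ranks in the logs, and --max-model-len is left out (the logs report 131,072 tokens per request).

# Sketch only, assumptions noted above:
# --decode-context-parallel-size 4 inferred from the DCP0..DCP3 ranks; --max-model-len omitted
OMP_NUM_THREADS=16 VLLM_MARLIN_USE_ATOMIC_ADD=1 VLLM_SLEEP_WHEN_IDLE=1 \
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2,3 \
vllm serve Intel/DeepSeek-V3.1-Terminus-int4-mixed-AutoRound/ \
  --tensor-parallel-size 4 \
  --decode-context-parallel-size 4 \
  --served-model-name deepseek \
  --reasoning-parser minimax_m2_append_think \
  --tool-call-parser deepseek_v31 --enable-auto-tool-choice \
  --enable-expert-parallel \
  --gpu-memory-utilization 0.945 \
  --trust-remote-code --port 8080 \
  --enable-chunked-prefill --max-num-batched-tokens 2048 \
  --block-size 8 --max-num-seqs 2 \
  --chat-template examples/tool_chat_template_deepseekv31.jinja \
  --compilation-config '{"pass_config":{"enable_fi_allreduce_fusion":true,"enable_noop":true},"custom_ops":["+rms_norm"],"cudagraph_mode":"FULL_AND_PIECEWISE"}'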
