"""
vLLM Engine Arguments Documentation
This module contains the complete documentation for vLLM engine arguments
from https://docs.vllm.ai/en/v0.11.0/serving/engine_args.html
This is used by the deployment system to generate optimal vLLM commands
without requiring online access.
Author: ComputeAgent Team
"""
VLLM_ENGINE_ARGS_DOC = """
# vLLM Engine Arguments (v0.11.0)
## Model Configuration
--model
Name or path of the huggingface model to use.
Default: "facebook/opt-125m"
--task
Possible choices: auto, generate, embedding, embed, classify, score, reward
The task to use the model for. Each vLLM instance only supports one task.
Default: "auto"
--tokenizer
Name or path of the huggingface tokenizer to use. If unspecified, model name or path will be used.
--skip-tokenizer-init
Skip initialization of tokenizer and detokenizer.
--revision
The specific model version to use. It can be a branch name, a tag name, or a commit id.
--code-revision
The specific revision to use for the model code on Hugging Face Hub.
--tokenizer-revision
Revision of the huggingface tokenizer to use.
--tokenizer-mode
Possible choices: auto, slow, mistral
The tokenizer mode. "auto" will use the fast tokenizer if available.
Default: "auto"
--trust-remote-code
Trust remote code from huggingface.
--download-dir
Directory to download and load the weights, default to the default cache dir of huggingface.
--load-format
Possible choices: auto, pt, safetensors, npcache, dummy, tensorizer, sharded_state, gguf, bitsandbytes, mistral, runai_streamer
The format of the model weights to load.
Default: "auto"
--config-format
Possible choices: auto, hf, mistral
The format of the model config to load.
Default: "ConfigFormat.AUTO"
--dtype
Possible choices: auto, half, float16, bfloat16, float, float32
Data type for model weights and activations.
- "auto" will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
- "half" for FP16. Recommended for AWQ quantization.
- "bfloat16" for a balance between precision and range.
Default: "auto"
--kv-cache-dtype
Possible choices: auto, fp8, fp8_e5m2, fp8_e4m3
Data type for kv cache storage. If "auto", will use model data type.
Default: "auto"
--max-model-len
Model context length. If unspecified, will be automatically derived from the model config.
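Example: a minimal sketch of how these model flags might be combined, written as the Python argument list a deployment script would assemble (the model name, dtype, and context length are illustrative assumptions, not values from the vLLM docs):
```python
# Illustrative values only.
cmd = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "meta-llama/Llama-3.1-8B-Instruct",  # assumed model
    "--dtype", "bfloat16",        # explicit precision instead of "auto"
    "--max-model-len", "8192",    # cap the context length
]
```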
## Performance & Memory
--gpu-memory-utilization
The fraction of GPU memory to be used for the model executor (0.0-1.0).
This is a per-instance limit. For example, 0.5 would use 50% GPU memory.
Default: 0.9
--max-num-batched-tokens
Maximum number of batched tokens per iteration.
--max-num-seqs
Maximum number of sequences per iteration.
--swap-space
CPU swap space size (GiB) per GPU.
Default: 4
--cpu-offload-gb
The space in GiB to offload to CPU, per GPU. Default is 0 (no offloading).
This can virtually increase GPU memory. For example, if you have a 24GB GPU and set this to 10,
it behaves like a 34GB GPU.
Default: 0
--num-gpu-blocks-override
If specified, ignore GPU profiling result and use this number of GPU blocks.
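Example: a sketch of memory-related flags for a single mid-size GPU (the GPU size and limits are assumptions; tune them from profiling):
```python
# Illustrative values for a single ~24GB GPU.
perf_flags = [
    "--gpu-memory-utilization", "0.90",  # leave ~10% headroom
    "--max-num-seqs", "128",             # bound concurrent sequences
    "--swap-space", "8",                 # GiB of CPU swap per GPU
    "--cpu-offload-gb", "0",             # no CPU offloading
]
```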
## Distributed Execution
--tensor-parallel-size, -tp
Number of tensor parallel replicas. Use for multi-GPU inference.
Default: 1
--pipeline-parallel-size, -pp
Number of pipeline stages.
Default: 1
--distributed-executor-backend
Possible choices: ray, mp, uni, external_launcher
Backend to use for distributed model workers. "mp" for single host, "ray" for multi-host.
--max-parallel-loading-workers
Load model sequentially in multiple batches, to avoid RAM OOM when using tensor parallel.
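Example: a sketch for a single 4-GPU host (the GPU count is an assumption); "mp" follows the single-host note above:
```python
# Tensor-parallel across 4 GPUs on one host.
dist_flags = [
    "--tensor-parallel-size", "4",
    "--distributed-executor-backend", "mp",
]
```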
## Caching & Optimization
--enable-prefix-caching, --no-enable-prefix-caching
Enables automatic prefix caching. Highly recommended for better performance.
--disable-sliding-window
Disables sliding window attention, capping the context length to the sliding window size.
--block-size
Possible choices: 8, 16, 32, 64, 128
Token block size for contiguous chunks of tokens.
Default depends on device (CUDA: up to 32, HPU: 128).
--enable-chunked-prefill
Enable chunked prefill for long context processing. Recommended for max-model-len > 8192.
--max-seq-len-to-capture
Maximum sequence length covered by CUDA graphs. Falls back to eager mode for longer sequences.
Default: 8192
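Example: a sketch enabling the optimizations recommended above for a long-context deployment (the 32768 context length is an assumed target):
```python
# Prefix caching plus chunked prefill, per the recommendations above.
long_context_flags = [
    "--max-model-len", "32768",
    "--enable-prefix-caching",
    "--enable-chunked-prefill",
]
```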
## Quantization
--quantization, -q
Possible choices: aqlm, awq, deepspeedfp, tpu_int8, fp8, fbgemm_fp8, modelopt, marlin, gguf,
gptq_marlin_24, gptq_marlin, awq_marlin, gptq, compressed-tensors, bitsandbytes, qqq, hqq,
experts_int8, neuron_quant, ipex, quark, moe_wna16, None
Method used to quantize the weights.
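Example: a sketch serving an AWQ checkpoint (the model name is an assumption); "half" follows the --dtype note above recommending FP16 for AWQ:
```python
# AWQ-quantized model with FP16 activations.
awq_cmd = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # assumed AWQ checkpoint
    "--quantization", "awq",
    "--dtype", "half",
]
```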
## Speculative Decoding
--speculative-model
The name of the draft model to be used in speculative decoding.
--num-speculative-tokens
The number of speculative tokens to sample from the draft model.
--speculative-max-model-len
The maximum sequence length supported by the draft model.
--speculative-disable-by-batch-size
Disable speculative decoding if the number of enqueued requests exceeds this value.
--ngram-prompt-lookup-max
Max size of window for ngram prompt lookup in speculative decoding.
--ngram-prompt-lookup-min
Min size of window for ngram prompt lookup in speculative decoding.
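Example: a sketch pairing a target model with a small draft model (the draft model name and token count are assumptions, chosen for illustration):
```python
# Draft-model speculative decoding; tune --num-speculative-tokens empirically.
spec_flags = [
    "--speculative-model", "facebook/opt-125m",  # assumed small draft model
    "--num-speculative-tokens", "5",
]
```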
## LoRA Support
--enable-lora
If True, enable handling of LoRA adapters.
--max-loras
Max number of LoRAs in a single batch.
Default: 1
--max-lora-rank
Max LoRA rank.
Default: 16
--lora-dtype
Possible choices: auto, float16, bfloat16
Data type for LoRA. If auto, will default to base model dtype.
Default: "auto"
--fully-sharded-loras
Use fully sharded LoRA layers. Likely faster at high sequence length or tensor parallel size.
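Example: a sketch enabling LoRA serving (the batch and rank limits are assumptions):
```python
# Allow up to 4 adapters per batch with ranks up to 32.
lora_flags = [
    "--enable-lora",
    "--max-loras", "4",
    "--max-lora-rank", "32",
]
```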
## Scheduling & Execution
--scheduling-policy
Possible choices: fcfs, priority
The scheduling policy to use. "fcfs" (first come first served) or "priority".
Default: "fcfs"
--num-scheduler-steps
Maximum number of forward steps per scheduler call.
Default: 1
--scheduler-delay-factor
Apply a delay before scheduling next prompt (delay factor * previous prompt latency).
Default: 0.0
--device
Possible choices: auto, cuda, neuron, cpu, openvino, tpu, xpu, hpu
Device type for vLLM execution.
Default: "auto"
## Logging & Monitoring
--disable-log-stats
Disable logging statistics.
--max-logprobs
Max number of log probs to return when logprobs is specified in SamplingParams.
Default: 20
--disable-async-output-proc
Disable async output processing. May result in lower performance.
--otlp-traces-endpoint
Target URL to which OpenTelemetry traces will be sent.
--collect-detailed-traces
Valid choices: model, worker, all
Collect detailed traces for specified modules (requires --otlp-traces-endpoint).
## Advanced Options
--rope-scaling
RoPE scaling configuration in JSON format. Example: {"rope_type":"dynamic","factor":2.0}
--rope-theta
RoPE theta. Use with rope_scaling to improve scaled model performance.
--enforce-eager
Always use eager-mode PyTorch. If False, uses hybrid eager/CUDA graph mode.
--seed
Random seed for operations.
Default: 0
--compilation-config, -O
torch.compile configuration for the model (0, 1, 2, 3 or JSON string).
Level 3 is recommended for production.
--worker-cls
The worker class to use for distributed execution.
Default: "auto"
--enable-sleep-mode
Enable sleep mode for the engine (CUDA platform only).
--calculate-kv-scales
Enable dynamic calculation of k_scale and v_scale when kv-cache-dtype is fp8.
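Example: a sketch passing the RoPE scaling JSON shown above from Python; json.dumps avoids hand-escaping the quotes (the scaling values are taken from the example above, the seed from the documented default):
```python
import json

# Serialize the --rope-scaling example above into a CLI-safe string.
rope_cfg = json.dumps({"rope_type": "dynamic", "factor": 2.0})
advanced_flags = ["--rope-scaling", rope_cfg, "--seed", "0"]
```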
## Serving Options
--host
Host address for the server.
Default: "0.0.0.0"
--port
Port number for the server.
Default: 8000
--served-model-name
The model name(s) used in the API. Can be multiple comma-separated names.
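Example: a sketch of the server-facing flags (the alias is an assumption):
```python
# Expose the model under a short alias on the default host and port.
server_flags = [
    "--host", "0.0.0.0",
    "--port", "8000",
    "--served-model-name", "llama-8b",  # assumed alias
]
```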
## Multimodal
--limit-mm-per-prompt
Limit how many multimodal inputs per prompt (e.g., image=16,video=2).
--mm-processor-kwargs
Overrides for multimodal input processing (JSON format).
--disable-mm-preprocessor-cache
Disable caching of multi-modal preprocessor/mapper (not recommended).
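Example: a sketch reusing the per-prompt limit format shown above:
```python
# Limit multimodal inputs per prompt, using the example format from above.
mm_flags = ["--limit-mm-per-prompt", "image=16,video=2"]
```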
"""


def get_vllm_docs() -> str:
    """
    Get the vLLM engine arguments documentation.

    Returns:
        str: Complete vLLM engine arguments documentation
    """
    return VLLM_ENGINE_ARGS_DOC


def get_common_parameters_summary() -> str:
    """
    Get a summary of the most commonly used vLLM parameters.

    Returns:
        str: Summary of key vLLM parameters
    """
    return """
## Most Common vLLM Parameters:
**Performance:**
- --gpu-memory-utilization: Fraction of GPU memory to use (0.0-1.0, default: 0.9)
- --max-model-len: Maximum context length
- --max-num-seqs: Maximum sequences per iteration
- --max-num-batched-tokens: Maximum batched tokens per iteration
- --enable-prefix-caching: Enable prefix caching (recommended)
- --enable-chunked-prefill: For long contexts (>8192 tokens)
**Model Configuration:**
- --dtype: Data type (auto, half, float16, bfloat16, float32)
- --kv-cache-dtype: KV cache data type (auto, fp8, fp8_e5m2, fp8_e4m3)
- --quantization: Quantization method (fp8, awq, gptq, etc.)
**Distributed:**
- --tensor-parallel-size: Number of GPUs for tensor parallelism
- --pipeline-parallel-size: Number of pipeline stages
**Server:**
- --host: Server host address (default: 0.0.0.0)
- --port: Server port (default: 8000)
"""