"""
vLLM Engine Arguments Documentation

This module contains the complete documentation for vLLM engine arguments
from https://docs.vllm.ai/en/v0.11.0/serving/engine_args.html

This is used by the deployment system to generate optimal vLLM commands
without requiring online access.

Author: ComputeAgent Team
"""

VLLM_ENGINE_ARGS_DOC = """
# vLLM Engine Arguments (v0.11.0)

## Model Configuration

--model
Name or path of the huggingface model to use.
Default: "facebook/opt-125m"

--task
Possible choices: auto, generate, embedding, embed, classify, score, reward
The task to use the model for. Each vLLM instance only supports one task.
Default: "auto"

--tokenizer
Name or path of the huggingface tokenizer to use. If unspecified, model name or path will be used.

--skip-tokenizer-init
Skip initialization of tokenizer and detokenizer.

--revision
The specific model version to use. It can be a branch name, a tag name, or a commit id.

--code-revision
The specific revision to use for the model code on Hugging Face Hub.

--tokenizer-revision
Revision of the huggingface tokenizer to use.

--tokenizer-mode
Possible choices: auto, slow, mistral
The tokenizer mode. "auto" will use the fast tokenizer if available.
Default: "auto"

--trust-remote-code
Trust remote code from huggingface.

--download-dir
Directory to download and load the weights. Defaults to the Hugging Face cache directory.

--load-format
Possible choices: auto, pt, safetensors, npcache, dummy, tensorizer, sharded_state, gguf, bitsandbytes, mistral, runai_streamer
The format of the model weights to load.
Default: "auto"

--config-format
Possible choices: auto, hf, mistral
The format of the model config to load.
Default: "ConfigFormat.AUTO"

--dtype
Possible choices: auto, half, float16, bfloat16, float, float32
Data type for model weights and activations.
- "auto" will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
- "half" for FP16. Recommended for AWQ quantization.
- "bfloat16" for a balance between precision and range.
Default: "auto"

--kv-cache-dtype
Possible choices: auto, fp8, fp8_e5m2, fp8_e4m3
Data type for kv cache storage. If "auto", will use model data type.
Default: "auto"

--max-model-len
Model context length. If unspecified, will be automatically derived from the model config.

## Performance & Memory

--gpu-memory-utilization
The fraction of GPU memory to be used for the model executor (0.0-1.0).
This is a per-instance limit. For example, 0.5 would use 50% GPU memory.
Default: 0.9

--max-num-batched-tokens
Maximum number of batched tokens per iteration.

--max-num-seqs
Maximum number of sequences per iteration.

--swap-space
CPU swap space size (GiB) per GPU.
Default: 4

--cpu-offload-gb
The space in GiB to offload to CPU, per GPU. Default is 0 (no offloading).
This can virtually increase GPU memory. For example, if you have 24GB GPU and set this to 10,
it's like having a 34GB GPU.
Default: 0

--num-gpu-blocks-override
If specified, ignore GPU profiling result and use this number of GPU blocks.

## Distributed Execution

--tensor-parallel-size, -tp
Number of tensor parallel replicas. Use for multi-GPU inference.
Default: 1

--pipeline-parallel-size, -pp
Number of pipeline stages.
Default: 1

--distributed-executor-backend
Possible choices: ray, mp, uni, external_launcher
Backend to use for distributed model workers. "mp" for single host, "ray" for multi-host.

--max-parallel-loading-workers
Load model sequentially in multiple batches, to avoid RAM OOM when using tensor parallel.

## Caching & Optimization

--enable-prefix-caching, --no-enable-prefix-caching
Enables automatic prefix caching. Highly recommended for better performance.

--disable-sliding-window
Disables sliding window attention, capping the model context length to the sliding window size.

--block-size
Possible choices: 8, 16, 32, 64, 128
Token block size for contiguous chunks of tokens.
Default depends on device (CUDA: up to 32, HPU: 128).

--enable-chunked-prefill
Enable chunked prefill for long context processing. Recommended for max-model-len > 8192.

--max-seq-len-to-capture
Maximum sequence length covered by CUDA graphs. Falls back to eager mode for longer sequences.
Default: 8192

## Quantization

--quantization, -q
Possible choices: aqlm, awq, deepspeedfp, tpu_int8, fp8, fbgemm_fp8, modelopt, marlin, gguf,
gptq_marlin_24, gptq_marlin, awq_marlin, gptq, compressed-tensors, bitsandbytes, qqq, hqq,
experts_int8, neuron_quant, ipex, quark, moe_wna16, None
Method used to quantize the weights.

## Speculative Decoding

--speculative-model
The name of the draft model to be used in speculative decoding.

--num-speculative-tokens
The number of speculative tokens to sample from the draft model.

--speculative-max-model-len
The maximum sequence length supported by the draft model.

--speculative-disable-by-batch-size
Disable speculative decoding if the number of enqueue requests is larger than this value.

--ngram-prompt-lookup-max
Max size of window for ngram prompt lookup in speculative decoding.

--ngram-prompt-lookup-min
Min size of window for ngram prompt lookup in speculative decoding.

## LoRA Support

--enable-lora
If True, enable handling of LoRA adapters.

--max-loras
Max number of LoRAs in a single batch.
Default: 1

--max-lora-rank
Max LoRA rank.
Default: 16

--lora-dtype
Possible choices: auto, float16, bfloat16
Data type for LoRA. If auto, will default to base model dtype.
Default: "auto"

--fully-sharded-loras
Use fully sharded LoRA layers. Likely faster at high sequence length or tensor parallel size.

## Scheduling & Execution

--scheduling-policy
Possible choices: fcfs, priority
The scheduling policy to use. "fcfs" (first come first served) or "priority".
Default: "fcfs"

--num-scheduler-steps
Maximum number of forward steps per scheduler call.
Default: 1

--scheduler-delay-factor
Apply a delay before scheduling next prompt (delay factor * previous prompt latency).
Default: 0.0

--device
Possible choices: auto, cuda, neuron, cpu, openvino, tpu, xpu, hpu
Device type for vLLM execution.
Default: "auto"

## Logging & Monitoring

--disable-log-stats
Disable logging statistics.

--max-logprobs
Max number of log probs to return when logprobs is specified in SamplingParams.
Default: 20

--disable-async-output-proc
Disable async output processing. May result in lower performance.

--otlp-traces-endpoint
Target URL to which OpenTelemetry traces will be sent.

--collect-detailed-traces
Valid choices: model, worker, all
Collect detailed traces for specified modules (requires --otlp-traces-endpoint).

## Advanced Options

--rope-scaling
RoPE scaling configuration in JSON format. Example: {"rope_type":"dynamic","factor":2.0}

--rope-theta
RoPE theta. Use with rope_scaling to improve scaled model performance.

--enforce-eager
Always use eager-mode PyTorch. If False, uses hybrid eager/CUDA graph mode.

--seed
Random seed for operations.
Default: 0

--compilation-config, -O
torch.compile configuration for the model (0, 1, 2, 3 or JSON string).
Level 3 is recommended for production.

--worker-cls
The worker class to use for distributed execution.
Default: "auto"

--enable-sleep-mode
Enable sleep mode for the engine (CUDA platform only).

--calculate-kv-scales
Enable dynamic calculation of k_scale and v_scale when kv-cache-dtype is fp8.

## Serving Options

--host
Host address for the server.
Default: "0.0.0.0"

--port
Port number for the server.
Default: 8000

--served-model-name
The model name(s) used in the API. Multiple names may be provided; the server responds to any of them.

## Multimodal

--limit-mm-per-prompt
Limit how many multimodal inputs per prompt (e.g., image=16,video=2).

--mm-processor-kwargs
Overrides for multimodal input processing (JSON format).

--disable-mm-preprocessor-cache
Disable caching of multi-modal preprocessor/mapper (not recommended).
"""


def get_vllm_docs() -> str:
    """
    Get the vLLM engine arguments documentation.

    Returns:
        str: Complete vLLM engine arguments documentation
    """
    return VLLM_ENGINE_ARGS_DOC
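

# The deployment system consumes VLLM_ENGINE_ARGS_DOC as plain text. The helper
# below is an illustrative sketch, not part of any vLLM API: the function name
# and the assumption that each argument entry is separated by a blank line are
# ours, based on the layout of the documentation string above.
def get_arg_doc(arg_name: str) -> str:
    """
    Get the documentation block for a single engine argument (sketch).

    Args:
        arg_name: Flag name including leading dashes, e.g. "--max-model-len".

    Returns:
        str: The matching block from VLLM_ENGINE_ARGS_DOC, or "" if not found
    """
    for block in VLLM_ENGINE_ARGS_DOC.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        # The first line of each block lists the flag and any short aliases,
        # e.g. "--tensor-parallel-size, -tp".
        aliases = [alias.strip() for alias in block.splitlines()[0].split(",")]
        if arg_name in aliases:
            return block
    return ""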


def get_common_parameters_summary() -> str:
    """
    Get a summary of the most commonly used vLLM parameters.

    Returns:
        str: Summary of key vLLM parameters
    """
    return """
    ## Most Common vLLM Parameters:

    **Performance:**
    - --gpu-memory-utilization: Fraction of GPU memory to use (0.0-1.0, default: 0.9)
    - --max-model-len: Maximum context length
    - --max-num-seqs: Maximum sequences per iteration
    - --max-num-batched-tokens: Maximum batched tokens per iteration
    - --enable-prefix-caching: Enable prefix caching (recommended)
    - --enable-chunked-prefill: For long contexts (>8192 tokens)

    **Model Configuration:**
    - --dtype: Data type (auto, half, float16, bfloat16, float32)
    - --kv-cache-dtype: KV cache storage type (auto, fp8, fp8_e5m2, fp8_e4m3)
    - --quantization: Quantization method (fp8, awq, gptq, etc.)

    **Distributed:**
    - --tensor-parallel-size: Number of GPUs for tensor parallelism
    - --pipeline-parallel-size: Number of pipeline stages

    **Server:**
    - --host: Server host address (default: 0.0.0.0)
    - --port: Server port (default: 8000)
    """