| """ | |
| vLLM Engine Arguments Documentation | |
| This module contains the complete documentation for vLLM engine arguments | |
| from https://docs.vllm.ai/en/v0.11.0/serving/engine_args.html | |
| This is used by the deployment system to generate optimal vLLM commands | |
| without requiring online access. | |
| Author: ComputeAgent Team | |
| """ | |
VLLM_ENGINE_ARGS_DOC = """
# vLLM Engine Arguments (v0.11.0)
## Model Configuration
--model
Name or path of the huggingface model to use.
Default: "facebook/opt-125m"
--task
Possible choices: auto, generate, embedding, embed, classify, score, reward
The task to use the model for. Each vLLM instance only supports one task.
Default: "auto"
--tokenizer
Name or path of the huggingface tokenizer to use. If unspecified, the model name or path will be used.
--skip-tokenizer-init
Skip initialization of tokenizer and detokenizer.
--revision
The specific model version to use. It can be a branch name, a tag name, or a commit id.
--code-revision
The specific revision to use for the model code on Hugging Face Hub.
--tokenizer-revision
Revision of the huggingface tokenizer to use.
--tokenizer-mode
Possible choices: auto, slow, mistral
The tokenizer mode. "auto" will use the fast tokenizer if available.
Default: "auto"
--trust-remote-code
Trust remote code from huggingface.
--download-dir
Directory to download and load the weights; defaults to the default Hugging Face cache directory.
--load-format
Possible choices: auto, pt, safetensors, npcache, dummy, tensorizer, sharded_state, gguf, bitsandbytes, mistral, runai_streamer
The format of the model weights to load.
Default: "auto"
--config-format
Possible choices: auto, hf, mistral
The format of the model config to load.
Default: "auto"
--dtype
Possible choices: auto, half, float16, bfloat16, float, float32
Data type for model weights and activations.
- "auto" will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
- "half" for FP16. Recommended for AWQ quantization.
- "bfloat16" for a balance between precision and range.
Default: "auto"
--kv-cache-dtype
Possible choices: auto, fp8, fp8_e5m2, fp8_e4m3
Data type for kv cache storage. If "auto", will use the model data type.
Default: "auto"
--max-model-len
Model context length. If unspecified, it will be automatically derived from the model config.
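Example (illustrative): the model-configuration flags above can be combined into a basic
serve command. The sketch below assumes the standard `vllm serve` CLI entrypoint; the model
name and values are placeholders, not recommendations.

```python
# Hypothetical sketch: assemble a minimal "vllm serve" command from model-configuration flags.
basic_cmd = [
    "vllm", "serve", "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    "--dtype", "auto",           # let vLLM choose FP16/BF16 based on the checkpoint
    "--max-model-len", "8192",   # cap the context length explicitly
]
print(" ".join(basic_cmd))
```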
## Performance & Memory
--gpu-memory-utilization
The fraction of GPU memory to be used for the model executor (0.0-1.0).
This is a per-instance limit. For example, 0.5 would use 50% of the GPU memory.
Default: 0.9
--max-num-batched-tokens
Maximum number of batched tokens per iteration.
--max-num-seqs
Maximum number of sequences per iteration.
--swap-space
CPU swap space size (GiB) per GPU.
Default: 4
--cpu-offload-gb
The space in GiB to offload to CPU, per GPU. Default is 0 (no offloading).
This can virtually increase GPU memory. For example, if you have a 24 GB GPU and set this to 10,
it is like having a 34 GB GPU.
Default: 0
--num-gpu-blocks-override
If specified, ignore the GPU profiling result and use this number of GPU blocks.
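Example (illustrative): memory-related flags for a single-GPU instance. The values below are
placeholders to be tuned against the actual GPU and workload.

```python
# Hypothetical sketch: memory tuning flags appended to a serve command.
memory_args = [
    "--gpu-memory-utilization", "0.90",  # fraction of GPU memory given to the engine
    "--max-num-seqs", "256",             # concurrent sequences per scheduler iteration
    "--swap-space", "4",                 # GiB of CPU swap space per GPU
]
```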
## Distributed Execution
--tensor-parallel-size, -tp
Number of tensor parallel replicas. Use for multi-GPU inference.
Default: 1
--pipeline-parallel-size, -pp
Number of pipeline stages.
Default: 1
--distributed-executor-backend
Possible choices: ray, mp, uni, external_launcher
Backend to use for distributed model workers: "mp" for a single host, "ray" for multi-host.
--max-parallel-loading-workers
Load the model sequentially in multiple batches, to avoid RAM OOM when using tensor parallelism.
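Example (illustrative): tensor parallelism across 4 GPUs on a single host, using the "mp"
backend described above. Sizes are placeholders.

```python
# Hypothetical sketch: multi-GPU flags for a single-host deployment.
parallel_args = [
    "--tensor-parallel-size", "4",
    "--pipeline-parallel-size", "1",
    "--distributed-executor-backend", "mp",
]
```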
## Caching & Optimization
--enable-prefix-caching, --no-enable-prefix-caching
Enables automatic prefix caching. Highly recommended for better performance.
--disable-sliding-window
Disables sliding window attention, capping the max model length to the sliding window size.
--block-size
Possible choices: 8, 16, 32, 64, 128
Token block size for contiguous chunks of tokens.
Default depends on the device (CUDA: up to 32, HPU: 128).
--enable-chunked-prefill
Enable chunked prefill for long-context processing. Recommended for max-model-len > 8192.
--max-seq-len-to-capture
Maximum sequence length covered by CUDA graphs. Falls back to eager mode for longer sequences.
Default: 8192
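Example (illustrative): caching and prefill flags for a long-context workload. Values are
placeholders chosen from the options listed above.

```python
# Hypothetical sketch: optimization flags for long-context serving.
optimization_args = [
    "--enable-prefix-caching",   # reuse KV cache across shared prompt prefixes
    "--enable-chunked-prefill",  # recommended when max-model-len > 8192
    "--block-size", "16",        # example value from the documented choices
]
```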
## Quantization
--quantization, -q
Possible choices: aqlm, awq, deepspeedfp, tpu_int8, fp8, fbgemm_fp8, modelopt, marlin, gguf,
gptq_marlin_24, gptq_marlin, awq_marlin, gptq, compressed-tensors, bitsandbytes, qqq, hqq,
experts_int8, neuron_quant, ipex, quark, moe_wna16, None
Method used to quantize the weights.
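Example (illustrative): serving an AWQ-quantized checkpoint with FP16 activations, following
the --dtype guidance above. The repository name is a placeholder.

```python
# Hypothetical sketch: quantized serving flags.
quantized_cmd = [
    "vllm", "serve", "some-org/awq-quantized-model",  # placeholder AWQ checkpoint
    "--quantization", "awq",
    "--dtype", "half",  # FP16 is recommended for AWQ
]
```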
## Speculative Decoding
--speculative-model
The name of the draft model to be used in speculative decoding.
--num-speculative-tokens
The number of speculative tokens to sample from the draft model.
--speculative-max-model-len
The maximum sequence length supported by the draft model.
--speculative-disable-by-batch-size
Disable speculative decoding if the number of enqueued requests is larger than this value.
--ngram-prompt-lookup-max
Max size of window for ngram prompt lookup in speculative decoding.
--ngram-prompt-lookup-min
Min size of window for ngram prompt lookup in speculative decoding.
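Example (illustrative): draft-model speculative decoding using the flags above. Both the model
reference and the token count are placeholders.

```python
# Hypothetical sketch: speculative decoding flags.
speculative_args = [
    "--speculative-model", "some-org/draft-model",  # placeholder draft model
    "--num-speculative-tokens", "5",
]
```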
## LoRA Support
--enable-lora
If True, enable handling of LoRA adapters.
--max-loras
Max number of LoRAs in a single batch.
Default: 1
--max-lora-rank
Max LoRA rank.
Default: 16
--lora-dtype
Possible choices: auto, float16, bfloat16
Data type for LoRA. If auto, will default to base model dtype.
Default: "auto"
--fully-sharded-loras
Use fully sharded LoRA layers. Likely faster at high sequence length or tensor parallel size.
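Example (illustrative): enabling LoRA adapter serving. All values are placeholders.

```python
# Hypothetical sketch: LoRA-related flags.
lora_args = [
    "--enable-lora",
    "--max-loras", "4",
    "--max-lora-rank", "32",
    "--lora-dtype", "auto",
]
```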
## Scheduling & Execution
--scheduling-policy
Possible choices: fcfs, priority
The scheduling policy to use: "fcfs" (first come, first served) or "priority".
Default: "fcfs"
--num-scheduler-steps
Maximum number of forward steps per scheduler call.
Default: 1
--scheduler-delay-factor
Apply a delay before scheduling the next prompt (delay factor * previous prompt latency).
Default: 0.0
--device
Possible choices: auto, cuda, neuron, cpu, openvino, tpu, xpu, hpu
Device type for vLLM execution.
Default: "auto"
## Logging & Monitoring
--disable-log-stats
Disable logging statistics.
--max-logprobs
Max number of log probs to return when logprobs is specified in SamplingParams.
Default: 20
--disable-async-output-proc
Disable async output processing. May result in lower performance.
--otlp-traces-endpoint
Target URL to which OpenTelemetry traces will be sent.
--collect-detailed-traces
Valid choices: model, worker, all
Collect detailed traces for specified modules (requires --otlp-traces-endpoint).
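Example (illustrative): exporting OpenTelemetry traces. The endpoint URL is a placeholder;
note that --collect-detailed-traces requires --otlp-traces-endpoint, as stated above.

```python
# Hypothetical sketch: tracing flags with a placeholder collector endpoint.
tracing_args = [
    "--otlp-traces-endpoint", "http://localhost:4318/v1/traces",  # placeholder URL
    "--collect-detailed-traces", "all",
]
```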
## Advanced Options
--rope-scaling
RoPE scaling configuration in JSON format. Example: {"rope_type":"dynamic","factor":2.0}
--rope-theta
RoPE theta. Use with rope_scaling to improve scaled model performance.
--enforce-eager
Always use eager-mode PyTorch. If False, uses hybrid eager/CUDA graph mode.
--seed
Random seed for operations.
Default: 0
--compilation-config, -O
torch.compile configuration for the model (0, 1, 2, 3 or JSON string).
Level 3 is recommended for production.
--worker-cls
The worker class to use for distributed execution.
Default: "auto"
--enable-sleep-mode
Enable sleep mode for the engine (CUDA platform only).
--calculate-kv-scales
Enable dynamic calculation of k_scale and v_scale when kv-cache-dtype is fp8.
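Example (illustrative): passing --rope-scaling as a JSON string, mirroring the example above.
The scaling factor is a placeholder.

```python
import json

# Hypothetical sketch: build the --rope-scaling argument with json.dumps.
rope_args = [
    "--rope-scaling", json.dumps({"rope_type": "dynamic", "factor": 2.0}),
    "--enforce-eager",  # optional: force eager mode instead of CUDA graphs
]
```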
## Serving Options
--host
Host address for the server.
Default: "0.0.0.0"
--port
Port number for the server.
Default: 8000
--served-model-name
The model name(s) used in the API. Multiple names may be provided; the server will respond to any of them.
## Multimodal
--limit-mm-per-prompt
Limit how many multimodal inputs are allowed per prompt (e.g., image=16,video=2).
--mm-processor-kwargs
Overrides for multimodal input processing (JSON format).
--disable-mm-preprocessor-cache
Disable caching of the multi-modal preprocessor/mapper (not recommended).
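Example (illustrative): exposing the server under an alias and bounding multimodal inputs.
All values are placeholders.

```python
# Hypothetical sketch: serving and multimodal flags.
serving_args = [
    "--host", "0.0.0.0",
    "--port", "8000",
    "--served-model-name", "my-model",           # placeholder alias
    "--limit-mm-per-prompt", "image=4,video=1",  # example per-prompt limits
]
```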
| """ | |


def get_vllm_docs() -> str:
    """
    Get the vLLM engine arguments documentation.

    Returns:
        str: Complete vLLM engine arguments documentation.
    """
    return VLLM_ENGINE_ARGS_DOC


def get_common_parameters_summary() -> str:
    """
    Get a summary of the most commonly used vLLM parameters.

    Returns:
        str: Summary of key vLLM parameters.
    """
    return """
## Most Common vLLM Parameters:

**Performance:**
- --gpu-memory-utilization: Fraction of GPU memory to use (0.0-1.0, default: 0.9)
- --max-model-len: Maximum context length
- --max-num-seqs: Maximum sequences per iteration
- --max-num-batched-tokens: Maximum batched tokens per iteration
- --enable-prefix-caching: Enable prefix caching (recommended)
- --enable-chunked-prefill: For long contexts (>8192 tokens)

**Model Configuration:**
- --dtype: Data type (auto, half, float16, bfloat16, float32)
- --kv-cache-dtype: KV cache data type (auto, fp8, fp8_e5m2, fp8_e4m3)
- --quantization: Quantization method (fp8, awq, gptq, etc.)

**Distributed:**
- --tensor-parallel-size: Number of GPUs for tensor parallelism
- --pipeline-parallel-size: Number of pipeline stages

**Server:**
- --host: Server host address (default: 0.0.0.0)
- --port: Server port (default: 8000)
"""