""" vLLM Engine Arguments Documentation This module contains the complete documentation for vLLM engine arguments from https://docs.vllm.ai/en/v0.11.0/serving/engine_args.html This is used by the deployment system to generate optimal vLLM commands without requiring online access. Author: ComputeAgent Team """ VLLM_ENGINE_ARGS_DOC = """ # vLLM Engine Arguments (v0.11.0) ## Model Configuration --model Name or path of the huggingface model to use. Default: "facebook/opt-125m" --task Possible choices: auto, generate, embedding, embed, classify, score, reward The task to use the model for. Each vLLM instance only supports one task. Default: "auto" --tokenizer Name or path of the huggingface tokenizer to use. If unspecified, model name or path will be used. --skip-tokenizer-init Skip initialization of tokenizer and detokenizer. --revision The specific model version to use. It can be a branch name, a tag name, or a commit id. --code-revision The specific revision to use for the model code on Hugging Face Hub. --tokenizer-revision Revision of the huggingface tokenizer to use. --tokenizer-mode Possible choices: auto, slow, mistral The tokenizer mode. "auto" will use the fast tokenizer if available. Default: "auto" --trust-remote-code Trust remote code from huggingface. --download-dir Directory to download and load the weights, default to the default cache dir of huggingface. --load-format Possible choices: auto, pt, safetensors, npcache, dummy, tensorizer, sharded_state, gguf, bitsandbytes, mistral, runai_streamer The format of the model weights to load. Default: "auto" --config-format Possible choices: auto, hf, mistral The format of the model config to load. Default: "ConfigFormat.AUTO" --dtype Possible choices: auto, half, float16, bfloat16, float, float32 Data type for model weights and activations. - "auto" will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models. - "half" for FP16. Recommended for AWQ quantization. - "bfloat16" for a balance between precision and range. Default: "auto" --kv-cache-dtype Possible choices: auto, fp8, fp8_e5m2, fp8_e4m3 Data type for kv cache storage. If "auto", will use model data type. Default: "auto" --max-model-len Model context length. If unspecified, will be automatically derived from the model config. ## Performance & Memory --gpu-memory-utilization The fraction of GPU memory to be used for the model executor (0.0-1.0). This is a per-instance limit. For example, 0.5 would use 50% GPU memory. Default: 0.9 --max-num-batched-tokens Maximum number of batched tokens per iteration. --max-num-seqs Maximum number of sequences per iteration. --swap-space CPU swap space size (GiB) per GPU. Default: 4 --cpu-offload-gb The space in GiB to offload to CPU, per GPU. Default is 0 (no offloading). This can virtually increase GPU memory. For example, if you have 24GB GPU and set this to 10, it's like having a 34GB GPU. Default: 0 --num-gpu-blocks-override If specified, ignore GPU profiling result and use this number of GPU blocks. ## Distributed Execution --tensor-parallel-size, -tp Number of tensor parallel replicas. Use for multi-GPU inference. Default: 1 --pipeline-parallel-size, -pp Number of pipeline stages. Default: 1 --distributed-executor-backend Possible choices: ray, mp, uni, external_launcher Backend to use for distributed model workers. "mp" for single host, "ray" for multi-host. --max-parallel-loading-workers Load model sequentially in multiple batches, to avoid RAM OOM when using tensor parallel. 
## Caching & Optimization

--enable-prefix-caching, --no-enable-prefix-caching
    Enables automatic prefix caching. Highly recommended for better performance.

--disable-sliding-window
    Disables sliding window, capping to sliding window size.

--block-size
    Possible choices: 8, 16, 32, 64, 128
    Token block size for contiguous chunks of tokens. Default depends on device (CUDA: up to 32, HPU: 128).

--enable-chunked-prefill
    Enable chunked prefill for long context processing. Recommended for max-model-len > 8192.

--max-seq-len-to-capture
    Maximum sequence length covered by CUDA graphs. Falls back to eager mode for longer sequences.
    Default: 8192

## Quantization

--quantization, -q
    Possible choices: aqlm, awq, deepspeedfp, tpu_int8, fp8, fbgemm_fp8, modelopt, marlin, gguf, gptq_marlin_24, gptq_marlin, awq_marlin, gptq, compressed-tensors, bitsandbytes, qqq, hqq, experts_int8, neuron_quant, ipex, quark, moe_wna16, None
    Method used to quantize the weights.

## Speculative Decoding

--speculative-model
    The name of the draft model to be used in speculative decoding.

--num-speculative-tokens
    The number of speculative tokens to sample from the draft model.

--speculative-max-model-len
    The maximum sequence length supported by the draft model.

--speculative-disable-by-batch-size
    Disable speculative decoding if the number of enqueued requests is larger than this value.

--ngram-prompt-lookup-max
    Max size of window for ngram prompt lookup in speculative decoding.

--ngram-prompt-lookup-min
    Min size of window for ngram prompt lookup in speculative decoding.

## LoRA Support

--enable-lora
    If True, enable handling of LoRA adapters.

--max-loras
    Max number of LoRAs in a single batch.
    Default: 1

--max-lora-rank
    Max LoRA rank.
    Default: 16

--lora-dtype
    Possible choices: auto, float16, bfloat16
    Data type for LoRA. If auto, will default to the base model dtype.
    Default: "auto"

--fully-sharded-loras
    Use fully sharded LoRA layers. Likely faster at high sequence length or tensor parallel size.

## Scheduling & Execution

--scheduling-policy
    Possible choices: fcfs, priority
    The scheduling policy to use. "fcfs" (first come first served) or "priority".
    Default: "fcfs"

--num-scheduler-steps
    Maximum number of forward steps per scheduler call.
    Default: 1

--scheduler-delay-factor
    Apply a delay before scheduling the next prompt (delay factor * previous prompt latency).
    Default: 0.0

--device
    Possible choices: auto, cuda, neuron, cpu, openvino, tpu, xpu, hpu
    Device type for vLLM execution.
    Default: "auto"

## Logging & Monitoring

--disable-log-stats
    Disable logging statistics.

--max-logprobs
    Max number of log probs to return when logprobs is specified in SamplingParams.
    Default: 20

--disable-async-output-proc
    Disable async output processing. May result in lower performance.

--otlp-traces-endpoint
    Target URL to which OpenTelemetry traces will be sent.

--collect-detailed-traces
    Valid choices: model, worker, all
    Collect detailed traces for the specified modules (requires --otlp-traces-endpoint).

## Advanced Options

--rope-scaling
    RoPE scaling configuration in JSON format. Example: {"rope_type":"dynamic","factor":2.0}

--rope-theta
    RoPE theta. Use with rope_scaling to improve scaled model performance.

--enforce-eager
    Always use eager-mode PyTorch. If False, uses hybrid eager/CUDA graph mode.

--seed
    Random seed for operations.
    Default: 0

--compilation-config, -O
    torch.compile configuration for the model (0, 1, 2, 3 or JSON string). Level 3 is recommended for production.

--worker-cls
    The worker class to use for distributed execution.
    Default: "auto"

--enable-sleep-mode
    Enable sleep mode for the engine (CUDA platform only).

--calculate-kv-scales
    Enable dynamic calculation of k_scale and v_scale when kv-cache-dtype is fp8.
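## Example: Long-Context Serving (illustrative)

A sketch of a long-context setup using the optimization flags above; "<model>" is a placeholder and the 32768 context length is an arbitrary example value:

    vllm serve <model> --max-model-len 32768 --enable-chunked-prefill --enable-prefix-caching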
Default: "auto" --enable-sleep-mode Enable sleep mode for the engine (CUDA platform only). --calculate-kv-scales Enable dynamic calculation of k_scale and v_scale when kv-cache-dtype is fp8. ## Serving Options --host Host address for the server. Default: "0.0.0.0" --port Port number for the server. Default: 8000 --served-model-name The model name(s) used in the API. Can be multiple comma-separated names. ## Multimodal --limit-mm-per-prompt Limit how many multimodal inputs per prompt (e.g., image=16,video=2). --mm-processor-kwargs Overrides for multimodal input processing (JSON format). --disable-mm-preprocessor-cache Disable caching of multi-modal preprocessor/mapper (not recommended). """ def get_vllm_docs() -> str: """ Get the vLLM engine arguments documentation. Returns: str: Complete vLLM engine arguments documentation """ return VLLM_ENGINE_ARGS_DOC def get_common_parameters_summary() -> str: """ Get a summary of the most commonly used vLLM parameters. Returns: str: Summary of key vLLM parameters """ return """ ## Most Common vLLM Parameters: **Performance:** - --gpu-memory-utilization: Fraction of GPU memory to use (0.0-1.0, default: 0.9) - --max-model-len: Maximum context length - --max-num-seqs: Maximum sequences per iteration - --max-num-batched-tokens: Maximum batched tokens per iteration - --enable-prefix-caching: Enable prefix caching (recommended) - --enable-chunked-prefill: For long contexts (>8192 tokens) **Model Configuration:** - --dtype: Data type (auto, half, float16, bfloat16, float32) - --kv-cache-dtype: KV cache type (auto, fp8, fp16, bf16) - --quantization: Quantization method (fp8, awq, gptq, etc.) **Distributed:** - --tensor-parallel-size: Number of GPUs for tensor parallelism - --pipeline-parallel-size: Number of pipeline stages **Server:** - --host: Server host address (default: 0.0.0.0) - --port: Server port (default: 8000) """