"""
vLLM Engine Arguments Documentation
This module contains the complete documentation for vLLM engine arguments
from https://docs.vllm.ai/en/v0.11.0/serving/engine_args.html
This is used by the deployment system to generate optimal vLLM commands
without requiring online access.
Author: ComputeAgent Team
"""
VLLM_ENGINE_ARGS_DOC = """
# vLLM Engine Arguments (v0.11.0)
## Model Configuration
--model
Name or path of the huggingface model to use.
Default: "facebook/opt-125m"
--task
Possible choices: auto, generate, embedding, embed, classify, score, reward
The task to use the model for. Each vLLM instance only supports one task.
Default: "auto"
--tokenizer
Name or path of the huggingface tokenizer to use. If unspecified, model name or path will be used.
--skip-tokenizer-init
Skip initialization of tokenizer and detokenizer.
--revision
The specific model version to use. It can be a branch name, a tag name, or a commit id.
--code-revision
The specific revision to use for the model code on Hugging Face Hub.
--tokenizer-revision
Revision of the huggingface tokenizer to use.
--tokenizer-mode
Possible choices: auto, slow, mistral
The tokenizer mode. "auto" will use the fast tokenizer if available.
Default: "auto"
--trust-remote-code
Trust remote code from huggingface.
--download-dir
Directory to download and load the weights; defaults to the huggingface cache directory.
--load-format
Possible choices: auto, pt, safetensors, npcache, dummy, tensorizer, sharded_state, gguf, bitsandbytes, mistral, runai_streamer
The format of the model weights to load.
Default: "auto"
--config-format
Possible choices: auto, hf, mistral
The format of the model config to load.
Default: "ConfigFormat.AUTO"
--dtype
Possible choices: auto, half, float16, bfloat16, float, float32
Data type for model weights and activations.
- "auto" will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
- "half" for FP16. Recommended for AWQ quantization.
- "bfloat16" for a balance between precision and range.
Default: "auto"
--kv-cache-dtype
Possible choices: auto, fp8, fp8_e5m2, fp8_e4m3
Data type for kv cache storage. If "auto", will use model data type.
Default: "auto"
--max-model-len
Model context length. If unspecified, will be automatically derived from the model config.
## Performance & Memory
--gpu-memory-utilization
The fraction of GPU memory to be used for the model executor (0.0-1.0).
This is a per-instance limit. For example, 0.5 would use 50% of the GPU memory.
Default: 0.9
--max-num-batched-tokens
Maximum number of batched tokens per iteration.
--max-num-seqs
Maximum number of sequences per iteration.
--swap-space
CPU swap space size (GiB) per GPU.
Default: 4
--cpu-offload-gb
The space in GiB to offload to CPU, per GPU. Default is 0 (no offloading).
This can virtually increase the GPU memory size. For example, if you have a 24 GB GPU and set this to 10,
it is effectively like having a 34 GB GPU.
Default: 0
--num-gpu-blocks-override
If specified, ignore GPU profiling result and use this number of GPU blocks.
## Distributed Execution
--tensor-parallel-size, -tp
Number of tensor parallel replicas. Use for multi-GPU inference.
Default: 1
--pipeline-parallel-size, -pp
Number of pipeline stages.
Default: 1
--distributed-executor-backend
Possible choices: ray, mp, uni, external_launcher
Backend to use for distributed model workers. "mp" for single host, "ray" for multi-host.
--max-parallel-loading-workers
Load the model sequentially in multiple batches, to avoid RAM OOM when using tensor parallelism.
## Caching & Optimization
--enable-prefix-caching, --no-enable-prefix-caching
Enables automatic prefix caching. Highly recommended for better performance.
--disable-sliding-window
Disables sliding window, capping to sliding window size.
--block-size
Possible choices: 8, 16, 32, 64, 128
Token block size for contiguous chunks of tokens.
Default depends on device (CUDA: up to 32, HPU: 128).
--enable-chunked-prefill
Enable chunked prefill for long context processing. Recommended for max-model-len > 8192.
--max-seq-len-to-capture
Maximum sequence length covered by CUDA graphs. Falls back to eager mode for longer sequences.
Default: 8192
## Quantization
--quantization, -q
Possible choices: aqlm, awq, deepspeedfp, tpu_int8, fp8, fbgemm_fp8, modelopt, marlin, gguf,
gptq_marlin_24, gptq_marlin, awq_marlin, gptq, compressed-tensors, bitsandbytes, qqq, hqq,
experts_int8, neuron_quant, ipex, quark, moe_wna16, None
Method used to quantize the weights.
## Speculative Decoding
--speculative-model
The name of the draft model to be used in speculative decoding.
--num-speculative-tokens
The number of speculative tokens to sample from the draft model.
--speculative-max-model-len
The maximum sequence length supported by the draft model.
--speculative-disable-by-batch-size
Disable speculative decoding if the number of enqueued requests is larger than this value.
--ngram-prompt-lookup-max
Max size of window for ngram prompt lookup in speculative decoding.
--ngram-prompt-lookup-min
Min size of window for ngram prompt lookup in speculative decoding.
## LoRA Support
--enable-lora
If True, enable handling of LoRA adapters.
--max-loras
Max number of LoRAs in a single batch.
Default: 1
--max-lora-rank
Max LoRA rank.
Default: 16
--lora-dtype
Possible choices: auto, float16, bfloat16
Data type for LoRA. If auto, will default to base model dtype.
Default: "auto"
--fully-sharded-loras
Use fully sharded LoRA layers. Likely faster at high sequence length or tensor parallel size.
## Scheduling & Execution
--scheduling-policy
Possible choices: fcfs, priority
The scheduling policy to use. "fcfs" (first come first served) or "priority".
Default: "fcfs"
--num-scheduler-steps
Maximum number of forward steps per scheduler call.
Default: 1
--scheduler-delay-factor
Apply a delay before scheduling next prompt (delay factor * previous prompt latency).
Default: 0.0
--device
Possible choices: auto, cuda, neuron, cpu, openvino, tpu, xpu, hpu
Device type for vLLM execution.
Default: "auto"
## Logging & Monitoring
--disable-log-stats
Disable logging statistics.
--max-logprobs
Max number of log probs to return when logprobs is specified in SamplingParams.
Default: 20
--disable-async-output-proc
Disable async output processing. May result in lower performance.
--otlp-traces-endpoint
Target URL to which OpenTelemetry traces will be sent.
--collect-detailed-traces
Valid choices: model, worker, all
Collect detailed traces for specified modules (requires --otlp-traces-endpoint).
## Advanced Options
--rope-scaling
RoPE scaling configuration in JSON format. Example: {"rope_type":"dynamic","factor":2.0}
--rope-theta
RoPE theta. Use with rope_scaling to improve scaled model performance.
--enforce-eager
Always use eager-mode PyTorch. If False, uses hybrid eager/CUDA graph mode.
--seed
Random seed for operations.
Default: 0
--compilation-config, -O
torch.compile configuration for the model (0, 1, 2, 3 or JSON string).
Level 3 is recommended for production.
--worker-cls
The worker class to use for distributed execution.
Default: "auto"
--enable-sleep-mode
Enable sleep mode for the engine (CUDA platform only).
--calculate-kv-scales
Enable dynamic calculation of k_scale and v_scale when kv-cache-dtype is fp8.
## Serving Options
--host
Host address for the server.
Default: "0.0.0.0"
--port
Port number for the server.
Default: 8000
--served-model-name
The model name(s) used in the API. Can be multiple comma-separated names.
## Multimodal
--limit-mm-per-prompt
Limit how many multimodal inputs per prompt (e.g., image=16,video=2).
--mm-processor-kwargs
Overrides for multimodal input processing (JSON format).
--disable-mm-preprocessor-cache
Disable caching of multi-modal preprocessor/mapper (not recommended).
"""


def get_vllm_docs() -> str:
    """
    Get the vLLM engine arguments documentation.

    Returns:
        str: Complete vLLM engine arguments documentation
    """
    return VLLM_ENGINE_ARGS_DOC
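

# Hypothetical convenience helper (not part of the original module): it shows
# one way to pull a single "## Section" out of the documentation, relying only
# on the markdown headings that VLLM_ENGINE_ARGS_DOC already uses.
def get_docs_section(section_title: str) -> str:
    """Return the body of one '## <title>' section of the docs, or '' if absent."""
    docs = get_vllm_docs()
    marker = f"## {section_title}"
    start = docs.find(marker)
    if start == -1:
        return ""
    # The section runs until the next '## ' heading (or the end of the docs).
    end = docs.find("\n## ", start + len(marker))
    return docs[start:end] if end != -1 else docs[start:]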


def get_common_parameters_summary() -> str:
    """
    Get a summary of the most commonly used vLLM parameters.

    Returns:
        str: Summary of key vLLM parameters
    """
    return """
## Most Common vLLM Parameters:
**Performance:**
- --gpu-memory-utilization: Fraction of GPU memory to use (0.0-1.0, default: 0.9)
- --max-model-len: Maximum context length
- --max-num-seqs: Maximum sequences per iteration
- --max-num-batched-tokens: Maximum batched tokens per iteration
- --enable-prefix-caching: Enable prefix caching (recommended)
- --enable-chunked-prefill: For long contexts (>8192 tokens)
**Model Configuration:**
- --dtype: Data type (auto, half, float16, bfloat16, float32)
- --kv-cache-dtype: KV cache data type (auto, fp8, fp8_e5m2, fp8_e4m3)
- --quantization: Quantization method (fp8, awq, gptq, etc.)
**Distributed:**
- --tensor-parallel-size: Number of GPUs for tensor parallelism
- --pipeline-parallel-size: Number of pipeline stages
**Server:**
- --host: Server host address (default: 0.0.0.0)
- --port: Server port (default: 8000)
"""