"""
vLLM Engine Arguments Documentation
This module contains the complete documentation for vLLM engine arguments
from https://docs.vllm.ai/en/v0.11.0/serving/engine_args.html
This is used by the deployment system to generate optimal vLLM commands
without requiring online access.
Author: ComputeAgent Team
"""
VLLM_ENGINE_ARGS_DOC = """
# vLLM Engine Arguments (v0.11.0)
## Model Configuration
--model
Name or path of the huggingface model to use.
Default: "facebook/opt-125m"
--task
Possible choices: auto, generate, embedding, embed, classify, score, reward
The task to use the model for. Each vLLM instance only supports one task.
Default: "auto"
--tokenizer
Name or path of the huggingface tokenizer to use. If unspecified, model name or path will be used.
--skip-tokenizer-init
Skip initialization of tokenizer and detokenizer.
--revision
The specific model version to use. It can be a branch name, a tag name, or a commit id.
--code-revision
The specific revision to use for the model code on Hugging Face Hub.
--tokenizer-revision
Revision of the huggingface tokenizer to use.
--tokenizer-mode
Possible choices: auto, slow, mistral
The tokenizer mode. "auto" will use the fast tokenizer if available.
Default: "auto"
--trust-remote-code
Trust remote code from huggingface.
--download-dir
Directory to download and load the weights, default to the default cache dir of huggingface.
--load-format
Possible choices: auto, pt, safetensors, npcache, dummy, tensorizer, sharded_state, gguf, bitsandbytes, mistral, runai_streamer
The format of the model weights to load.
Default: "auto"
--config-format
Possible choices: auto, hf, mistral
The format of the model config to load.
Default: "ConfigFormat.AUTO"
--dtype
Possible choices: auto, half, float16, bfloat16, float, float32
Data type for model weights and activations.
- "auto" will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
- "half" for FP16. Recommended for AWQ quantization.
- "bfloat16" for a balance between precision and range.
Default: "auto"
--kv-cache-dtype
Possible choices: auto, fp8, fp8_e5m2, fp8_e4m3
Data type for kv cache storage. If "auto", will use model data type.
Default: "auto"
--max-model-len
Model context length. If unspecified, will be automatically derived from the model config.
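Example: a minimal sketch of how these model flags might be combined, written as the Python argument list a deployment script would assemble (the model name, dtype, and context length are illustrative assumptions, not values from the vLLM docs):
```python
# Illustrative values only.
cmd = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "meta-llama/Llama-3.1-8B-Instruct",  # assumed model
    "--dtype", "bfloat16",        # explicit precision instead of "auto"
    "--max-model-len", "8192",    # cap the context length
]
```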
## Performance & Memory
--gpu-memory-utilization
The fraction of GPU memory to be used for the model executor (0.0-1.0).
This is a per-instance limit. For example, 0.5 would use 50% GPU memory.
Default: 0.9
--max-num-batched-tokens
Maximum number of batched tokens per iteration.
--max-num-seqs
Maximum number of sequences per iteration.
--swap-space
CPU swap space size (GiB) per GPU.
Default: 4
--cpu-offload-gb
The space in GiB to offload to CPU, per GPU. Default is 0 (no offloading).
This can virtually increase GPU memory. For example, if you have a 24GB GPU and set this to 10,
it behaves like a 34GB GPU.
Default: 0
--num-gpu-blocks-override
If specified, ignore GPU profiling result and use this number of GPU blocks.
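Example: a sketch of memory-related flags for a single mid-size GPU (the GPU size and limits are assumptions; tune them from profiling):
```python
# Illustrative values for a single ~24GB GPU.
perf_flags = [
    "--gpu-memory-utilization", "0.90",  # leave ~10% headroom
    "--max-num-seqs", "128",             # bound concurrent sequences
    "--swap-space", "8",                 # GiB of CPU swap per GPU
    "--cpu-offload-gb", "0",             # no CPU offloading
]
```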
## Distributed Execution
--tensor-parallel-size, -tp
Number of tensor parallel replicas. Use for multi-GPU inference.
Default: 1
--pipeline-parallel-size, -pp
Number of pipeline stages.
Default: 1
--distributed-executor-backend
Possible choices: ray, mp, uni, external_launcher
Backend to use for distributed model workers. "mp" for single host, "ray" for multi-host.
--max-parallel-loading-workers
Load model sequentially in multiple batches, to avoid RAM OOM when using tensor parallel.
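Example: a sketch for a single 4-GPU host (the GPU count is an assumption); "mp" follows the single-host note above:
```python
# Tensor-parallel across 4 GPUs on one host.
dist_flags = [
    "--tensor-parallel-size", "4",
    "--distributed-executor-backend", "mp",
]
```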
## Caching & Optimization
--enable-prefix-caching, --no-enable-prefix-caching
Enables automatic prefix caching. Highly recommended for better performance.
--disable-sliding-window
Disables sliding window attention, capping the context length to the sliding window size.
--block-size
Possible choices: 8, 16, 32, 64, 128
Token block size for contiguous chunks of tokens.
Default depends on device (CUDA: up to 32, HPU: 128).
--enable-chunked-prefill
Enable chunked prefill for long context processing. Recommended for max-model-len > 8192.
--max-seq-len-to-capture
Maximum sequence length covered by CUDA graphs. Falls back to eager mode for longer sequences.
Default: 8192
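Example: a sketch enabling the optimizations recommended above for a long-context deployment (the 32768 context length is an assumed target):
```python
# Prefix caching plus chunked prefill, per the recommendations above.
long_context_flags = [
    "--max-model-len", "32768",
    "--enable-prefix-caching",
    "--enable-chunked-prefill",
]
```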
## Quantization
--quantization, -q
Possible choices: aqlm, awq, deepspeedfp, tpu_int8, fp8, fbgemm_fp8, modelopt, marlin, gguf,
gptq_marlin_24, gptq_marlin, awq_marlin, gptq, compressed-tensors, bitsandbytes, qqq, hqq,
experts_int8, neuron_quant, ipex, quark, moe_wna16, None
Method used to quantize the weights.
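Example: a sketch serving an AWQ checkpoint (the model name is an assumption); "half" follows the --dtype note above recommending FP16 for AWQ:
```python
# AWQ-quantized model with FP16 activations.
awq_cmd = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # assumed AWQ checkpoint
    "--quantization", "awq",
    "--dtype", "half",
]
```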
## Speculative Decoding
--speculative-model
The name of the draft model to be used in speculative decoding.
--num-speculative-tokens
The number of speculative tokens to sample from the draft model.
--speculative-max-model-len
The maximum sequence length supported by the draft model.
--speculative-disable-by-batch-size
Disable speculative decoding if the number of enqueued requests exceeds this value.
--ngram-prompt-lookup-max
Max size of window for ngram prompt lookup in speculative decoding.
--ngram-prompt-lookup-min
Min size of window for ngram prompt lookup in speculative decoding.
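Example: a sketch pairing a target model with a small draft model (the draft model name and token count are assumptions, chosen for illustration):
```python
# Draft-model speculative decoding; tune --num-speculative-tokens empirically.
spec_flags = [
    "--speculative-model", "facebook/opt-125m",  # assumed small draft model
    "--num-speculative-tokens", "5",
]
```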
## LoRA Support
--enable-lora
If True, enable handling of LoRA adapters.
--max-loras
Max number of LoRAs in a single batch.
Default: 1
--max-lora-rank
Max LoRA rank.
Default: 16
--lora-dtype
Possible choices: auto, float16, bfloat16
Data type for LoRA. If auto, will default to base model dtype.
Default: "auto"
--fully-sharded-loras
Use fully sharded LoRA layers. Likely faster at high sequence length or tensor parallel size.
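Example: a sketch enabling LoRA serving (the batch and rank limits are assumptions):
```python
# Allow up to 4 adapters per batch with ranks up to 32.
lora_flags = [
    "--enable-lora",
    "--max-loras", "4",
    "--max-lora-rank", "32",
]
```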
## Scheduling & Execution
--scheduling-policy
Possible choices: fcfs, priority
The scheduling policy to use. "fcfs" (first come first served) or "priority".
Default: "fcfs"
--num-scheduler-steps
Maximum number of forward steps per scheduler call.
Default: 1
--scheduler-delay-factor
Apply a delay before scheduling next prompt (delay factor * previous prompt latency).
Default: 0.0
--device
Possible choices: auto, cuda, neuron, cpu, openvino, tpu, xpu, hpu
Device type for vLLM execution.
Default: "auto"
## Logging & Monitoring
--disable-log-stats
Disable logging statistics.
--max-logprobs
Max number of log probs to return when logprobs is specified in SamplingParams.
Default: 20
--disable-async-output-proc
Disable async output processing. May result in lower performance.
--otlp-traces-endpoint
Target URL to which OpenTelemetry traces will be sent.
--collect-detailed-traces
Valid choices: model, worker, all
Collect detailed traces for specified modules (requires --otlp-traces-endpoint).
## Advanced Options
--rope-scaling
RoPE scaling configuration in JSON format. Example: {"rope_type":"dynamic","factor":2.0}
--rope-theta
RoPE theta. Use with rope_scaling to improve scaled model performance.
--enforce-eager
Always use eager-mode PyTorch. If False, uses hybrid eager/CUDA graph mode.
--seed
Random seed for operations.
Default: 0
--compilation-config, -O
torch.compile configuration for the model (0, 1, 2, 3 or JSON string).
Level 3 is recommended for production.
--worker-cls
The worker class to use for distributed execution.
Default: "auto"
--enable-sleep-mode
Enable sleep mode for the engine (CUDA platform only).
--calculate-kv-scales
Enable dynamic calculation of k_scale and v_scale when kv-cache-dtype is fp8.
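Example: a sketch passing the RoPE scaling JSON shown above from Python; json.dumps avoids hand-escaping the quotes (the scaling values are taken from the example above, the seed from the documented default):
```python
import json

# Serialize the --rope-scaling example above into a CLI-safe string.
rope_cfg = json.dumps({"rope_type": "dynamic", "factor": 2.0})
advanced_flags = ["--rope-scaling", rope_cfg, "--seed", "0"]
```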
## Serving Options
--host
Host address for the server.
Default: "0.0.0.0"
--port
Port number for the server.
Default: 8000
--served-model-name
The model name(s) used in the API. Can be multiple comma-separated names.
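Example: a sketch of the server-facing flags (the alias is an assumption):
```python
# Expose the model under a short alias on the default host and port.
server_flags = [
    "--host", "0.0.0.0",
    "--port", "8000",
    "--served-model-name", "llama-8b",  # assumed alias
]
```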
## Multimodal
--limit-mm-per-prompt
Limit how many multimodal inputs per prompt (e.g., image=16,video=2).
--mm-processor-kwargs
Overrides for multimodal input processing (JSON format).
--disable-mm-preprocessor-cache
Disable caching of multi-modal preprocessor/mapper (not recommended).
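Example: a sketch reusing the per-prompt limit format shown above:
```python
# Limit multimodal inputs per prompt, using the example format from above.
mm_flags = ["--limit-mm-per-prompt", "image=16,video=2"]
```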
"""


def get_vllm_docs() -> str:
    """
    Get the vLLM engine arguments documentation.

    Returns:
        str: Complete vLLM engine arguments documentation
    """
    return VLLM_ENGINE_ARGS_DOC


def get_common_parameters_summary() -> str:
    """
    Get a summary of the most commonly used vLLM parameters.

    Returns:
        str: Summary of key vLLM parameters
    """
    return """
## Most Common vLLM Parameters:
**Performance:**
- --gpu-memory-utilization: Fraction of GPU memory to use (0.0-1.0, default: 0.9)
- --max-model-len: Maximum context length
- --max-num-seqs: Maximum sequences per iteration
- --max-num-batched-tokens: Maximum batched tokens per iteration
- --enable-prefix-caching: Enable prefix caching (recommended)
- --enable-chunked-prefill: For long contexts (>8192 tokens)
**Model Configuration:**
- --dtype: Data type (auto, half, float16, bfloat16, float32)
- --kv-cache-dtype: KV cache data type (auto, fp8, fp8_e5m2, fp8_e4m3)
- --quantization: Quantization method (fp8, awq, gptq, etc.)
**Distributed:**
- --tensor-parallel-size: Number of GPUs for tensor parallelism
- --pipeline-parallel-size: Number of pipeline stages
**Server:**
- --host: Server host address (default: 0.0.0.0)
- --port: Server port (default: 8000)
"""