"""
vLLM Engine Arguments Documentation
This module contains the complete documentation for vLLM engine arguments
from https://docs.vllm.ai/en/v0.11.0/serving/engine_args.html
This is used by the deployment system to generate optimal vLLM commands
without requiring online access.
Author: ComputeAgent Team
"""
VLLM_ENGINE_ARGS_DOC = """
# vLLM Engine Arguments (v0.11.0)
## Model Configuration
--model
Name or path of the huggingface model to use.
Default: "facebook/opt-125m"
--task
Possible choices: auto, generate, embedding, embed, classify, score, reward
The task to use the model for. Each vLLM instance only supports one task.
Default: "auto"
--tokenizer
Name or path of the huggingface tokenizer to use. If unspecified, model name or path will be used.
--skip-tokenizer-init
Skip initialization of tokenizer and detokenizer.
--revision
The specific model version to use. It can be a branch name, a tag name, or a commit id.
--code-revision
The specific revision to use for the model code on Hugging Face Hub.
--tokenizer-revision
Revision of the huggingface tokenizer to use.
--tokenizer-mode
Possible choices: auto, slow, mistral
The tokenizer mode. "auto" will use the fast tokenizer if available.
Default: "auto"
--trust-remote-code
Trust remote code from huggingface.
--download-dir
Directory to download and load the weights; defaults to the huggingface cache directory.
--load-format
Possible choices: auto, pt, safetensors, npcache, dummy, tensorizer, sharded_state, gguf, bitsandbytes, mistral, runai_streamer
The format of the model weights to load.
Default: "auto"
--config-format
Possible choices: auto, hf, mistral
The format of the model config to load.
Default: "ConfigFormat.AUTO"
--dtype
Possible choices: auto, half, float16, bfloat16, float, float32
Data type for model weights and activations.
- "auto" will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
- "half" for FP16. Recommended for AWQ quantization.
- "bfloat16" for a balance between precision and range.
Default: "auto"
--kv-cache-dtype
Possible choices: auto, fp8, fp8_e5m2, fp8_e4m3
Data type for kv cache storage. If "auto", will use model data type.
Default: "auto"
--max-model-len
Model context length. If unspecified, will be automatically derived from the model config.
## Performance & Memory
--gpu-memory-utilization
The fraction of GPU memory to be used for the model executor (0.0-1.0).
This is a per-instance limit. For example, 0.5 would use 50% of the GPU memory.
Default: 0.9
--max-num-batched-tokens
Maximum number of batched tokens per iteration.
--max-num-seqs
Maximum number of sequences per iteration.
--swap-space
CPU swap space size (GiB) per GPU.
Default: 4
--cpu-offload-gb
The space in GiB to offload to CPU, per GPU. Default is 0 (no offloading).
This can virtually increase the GPU memory size. For example, if you have a 24 GB GPU and set this to 10,
it is effectively like having a 34 GB GPU.
Default: 0
--num-gpu-blocks-override
If specified, ignore GPU profiling result and use this number of GPU blocks.
## Distributed Execution
--tensor-parallel-size, -tp
Number of tensor parallel replicas. Use for multi-GPU inference.
Default: 1
--pipeline-parallel-size, -pp
Number of pipeline stages.
Default: 1
--distributed-executor-backend
Possible choices: ray, mp, uni, external_launcher
Backend to use for distributed model workers. "mp" for single host, "ray" for multi-host.
--max-parallel-loading-workers
Load the model sequentially in multiple batches, to avoid RAM OOM when using tensor parallelism.
## Caching & Optimization
--enable-prefix-caching, --no-enable-prefix-caching
Enables automatic prefix caching. Highly recommended for better performance.
--disable-sliding-window
Disables sliding window, capping to sliding window size.
--block-size
Possible choices: 8, 16, 32, 64, 128
Token block size for contiguous chunks of tokens.
Default depends on device (CUDA: up to 32, HPU: 128).
--enable-chunked-prefill
Enable chunked prefill for long context processing. Recommended for max-model-len > 8192.
--max-seq-len-to-capture
Maximum sequence length covered by CUDA graphs. Falls back to eager mode for longer sequences.
Default: 8192
## Quantization
--quantization, -q
Possible choices: aqlm, awq, deepspeedfp, tpu_int8, fp8, fbgemm_fp8, modelopt, marlin, gguf,
gptq_marlin_24, gptq_marlin, awq_marlin, gptq, compressed-tensors, bitsandbytes, qqq, hqq,
experts_int8, neuron_quant, ipex, quark, moe_wna16, None
Method used to quantize the weights.
## Speculative Decoding
--speculative-model
The name of the draft model to be used in speculative decoding.
--num-speculative-tokens
The number of speculative tokens to sample from the draft model.
--speculative-max-model-len
The maximum sequence length supported by the draft model.
--speculative-disable-by-batch-size
Disable speculative decoding if the number of enqueued requests is larger than this value.
--ngram-prompt-lookup-max
Max size of window for ngram prompt lookup in speculative decoding.
--ngram-prompt-lookup-min
Min size of window for ngram prompt lookup in speculative decoding.
## LoRA Support
--enable-lora
If True, enable handling of LoRA adapters.
--max-loras
Max number of LoRAs in a single batch.
Default: 1
--max-lora-rank
Max LoRA rank.
Default: 16
--lora-dtype
Possible choices: auto, float16, bfloat16
Data type for LoRA. If auto, will default to base model dtype.
Default: "auto"
--fully-sharded-loras
Use fully sharded LoRA layers. Likely faster at high sequence length or tensor parallel size.
## Scheduling & Execution
--scheduling-policy
Possible choices: fcfs, priority
The scheduling policy to use. "fcfs" (first come first served) or "priority".
Default: "fcfs"
--num-scheduler-steps
Maximum number of forward steps per scheduler call.
Default: 1
--scheduler-delay-factor
Apply a delay before scheduling next prompt (delay factor * previous prompt latency).
Default: 0.0
--device
Possible choices: auto, cuda, neuron, cpu, openvino, tpu, xpu, hpu
Device type for vLLM execution.
Default: "auto"
## Logging & Monitoring
--disable-log-stats
Disable logging statistics.
--max-logprobs
Max number of log probs to return when logprobs is specified in SamplingParams.
Default: 20
--disable-async-output-proc
Disable async output processing. May result in lower performance.
--otlp-traces-endpoint
Target URL to which OpenTelemetry traces will be sent.
--collect-detailed-traces
Valid choices: model, worker, all
Collect detailed traces for specified modules (requires --otlp-traces-endpoint).
## Advanced Options
--rope-scaling
RoPE scaling configuration in JSON format. Example: {"rope_type":"dynamic","factor":2.0}
--rope-theta
RoPE theta. Use with rope_scaling to improve scaled model performance.
--enforce-eager
Always use eager-mode PyTorch. If False, uses hybrid eager/CUDA graph mode.
--seed
Random seed for operations.
Default: 0
--compilation-config, -O
torch.compile configuration for the model (0, 1, 2, 3 or JSON string).
Level 3 is recommended for production.
--worker-cls
The worker class to use for distributed execution.
Default: "auto"
--enable-sleep-mode
Enable sleep mode for the engine (CUDA platform only).
--calculate-kv-scales
Enable dynamic calculation of k_scale and v_scale when kv-cache-dtype is fp8.
## Serving Options
--host
Host address for the server.
Default: "0.0.0.0"
--port
Port number for the server.
Default: 8000
--served-model-name
The model name(s) used in the API. Can be multiple comma-separated names.
## Multimodal
--limit-mm-per-prompt
Limit how many multimodal inputs per prompt (e.g., image=16,video=2).
--mm-processor-kwargs
Overrides for multimodal input processing (JSON format).
--disable-mm-preprocessor-cache
Disable caching of multi-modal preprocessor/mapper (not recommended).
"""


def get_vllm_docs() -> str:
    """
    Get the vLLM engine arguments documentation.

    Returns:
        str: Complete vLLM engine arguments documentation
    """
    return VLLM_ENGINE_ARGS_DOC
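

# Hypothetical convenience helper (not part of the original module): it shows
# one way to pull a single "## Section" out of the documentation, relying only
# on the markdown headings that VLLM_ENGINE_ARGS_DOC already uses.
def get_docs_section(section_title: str) -> str:
    """Return the body of one '## <title>' section of the docs, or '' if absent."""
    docs = get_vllm_docs()
    marker = f"## {section_title}"
    start = docs.find(marker)
    if start == -1:
        return ""
    # The section runs until the next '## ' heading (or the end of the docs).
    end = docs.find("\n## ", start + len(marker))
    return docs[start:end] if end != -1 else docs[start:]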


def get_common_parameters_summary() -> str:
    """
    Get a summary of the most commonly used vLLM parameters.

    Returns:
        str: Summary of key vLLM parameters
    """
    return """
## Most Common vLLM Parameters:
**Performance:**
- --gpu-memory-utilization: Fraction of GPU memory to use (0.0-1.0, default: 0.9)
- --max-model-len: Maximum context length
- --max-num-seqs: Maximum sequences per iteration
- --max-num-batched-tokens: Maximum batched tokens per iteration
- --enable-prefix-caching: Enable prefix caching (recommended)
- --enable-chunked-prefill: For long contexts (>8192 tokens)
**Model Configuration:**
- --dtype: Data type (auto, half, float16, bfloat16, float32)
- --kv-cache-dtype: KV cache data type (auto, fp8, fp8_e5m2, fp8_e4m3)
- --quantization: Quantization method (fp8, awq, gptq, etc.)
**Distributed:**
- --tensor-parallel-size: Number of GPUs for tensor parallelism
- --pipeline-parallel-size: Number of pipeline stages
**Server:**
- --host: Server host address (default: 0.0.0.0)
- --port: Server port (default: 8000)
"""