Qwen3-14B-FP8-KV
Enterprise-grade OCP FP8 quantized Qwen-3 14B for AMD ROCm, end-to-end KV-cache in FP8 with Quark
Introduction
Qwen3-14B-FP8-KV is a full-pipeline, OCP-compliant FP8_e4m3 quantization of Qwen/Qwen3-14B, built with AMD Quark and optimized for AMD Instinct GPUs. It delivers ~1.8× memory savings and a throughput boost vs. FP16, with WikiText2 perplexity rising from 6.38 to 8.73.
Quantization Strategy
- Quantizer: AMD Quark v0.9+
- Numeric Format: OCP FP8_e4m3 symmetric, per-tensor
- Scope: All `Linear` layers (excluding `lm_head`), activations, and KV cache
- Group Size: 128 (block-aligned)
- Calibration: 128 Pile samples (default)
- Metadata: scales embedded in JSON + SafeTensors
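To make the numeric format above concrete, here is a minimal sketch of symmetric per-tensor FP8 e4m3 quantization in plain Python. It is an illustration under stated assumptions, not Quark's implementation: the helper names (`round_to_e4m3`, `quantize_per_tensor`, `dequantize`) are hypothetical, the scale is taken as the tensor's absolute maximum divided by 448 (the largest finite e4m3 value), and out-of-range values saturate.

```python
import math

E4M3_MAX = 448.0  # largest finite value in OCP FP8 e4m3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest FP8 e4m3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    _, e = math.frexp(a)      # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)      # -6 is the smallest normal exponent; below it the spacing is fixed
    step = 2.0 ** (exp - 3)   # 3 mantissa bits give 8 steps per power of two
    return sign * min(round(a / step) * step, E4M3_MAX)


def quantize_per_tensor(values):
    """Symmetric per-tensor quantization: one scale shared by the whole tensor."""
    amax = max(abs(v) for v in values)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize(q, scale):
    """Map FP8 values back to the original range."""
    return [v * scale for v in q]


weights = [0.1, -2.5, 7.0, 0.03]  # toy tensor, not real model weights
q, scale = quantize_per_tensor(weights)
restored = dequantize(q, scale)
print("scale:", scale)
print("quantized:", q)
print("restored:", restored)
```

Because a single scale covers the whole tensor, the largest-magnitude element is reproduced exactly while small elements absorb most of the rounding error.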
Performance Snapshot
| Metric | FP16 Baseline | FP8_e4m3 Quantized |
|---|---|---|
| WikiText2 Perplexity | 6.38 | 8.73 |
| Memory Footprint | 1.0× | 0.56× |
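Part of the footprint reduction comes from storing the KV cache in FP8. The sketch below estimates KV-cache bytes per token at 2 bytes (FP16) versus 1 byte (FP8) per element. The layer count, KV-head count, and head dimension are assumptions about the Qwen3-14B architecture and should be checked against the model's `config.json`; the function name is hypothetical, and the small overhead of storing scales is ignored.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache needed for one token of context."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Assumed architecture values; verify against config.json before relying on them.
NUM_LAYERS = 40
NUM_KV_HEADS = 8
HEAD_DIM = 128

fp16 = kv_cache_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, 2)
fp8 = kv_cache_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, 1)
print(f"FP16: {fp16 / 1024:.0f} KiB/token, FP8: {fp8 / 1024:.0f} KiB/token")
```

Under these assumptions the cache for a given context length halves, which leaves room for longer contexts or more concurrent requests on the same GPU.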
Quick Start
Serve with vLLM
Allow a context length longer than the model's configured maximum:
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
Serve
HIP_VISIBLE_DEVICES=0 vllm serve EliovpAI/Qwen3-14B-FP8-KV \
  --kv-cache-dtype fp8 \
  --num-scheduler-steps 10
(add other arguments as needed)
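Once the server is running, it can be queried through vLLM's OpenAI-compatible chat endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the server listens on `http://localhost:8000` (vLLM's default when no host or port is given); `build_chat_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: default vLLM host and port
MODEL = "EliovpAI/Qwen3-14B-FP8-KV"


def build_chat_request(prompt, max_tokens=64):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    try:
        print(ask("What is the capital of France?"))
    except OSError as exc:  # covers connection errors and timeouts
        print(f"Server not reachable: {exc}")
```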
Benchmark
python3 /vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model EliovpAI/Qwen3-14B-FP8-KV \
  --dataset-name sharegpt \
  --dataset-path /vllm/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 32 \
  --random-range-ratio 1.0 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --sharegpt-output-len 256
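The four names passed to `--percentile-metrics` are latency measures: time to first token (TTFT), time per output token (TPOT), inter-token latency (ITL), and end-to-end latency (E2EL). The sketch below shows how they are commonly derived from per-token arrival times for a single request. It follows the usual meaning of the abbreviations and is not taken from the benchmark script, whose exact computation may differ; `latency_metrics` is a hypothetical helper.

```python
def latency_metrics(request_start, token_times):
    """Derive latency metrics for one request.

    request_start: time the request was sent (seconds).
    token_times: arrival time of each output token (seconds), in order.
    """
    if not token_times:
        raise ValueError("token_times must contain at least one timestamp")
    ttft = token_times[0] - request_start    # time to first token
    e2el = token_times[-1] - request_start   # end-to-end latency
    # Inter-token latency: gap between each pair of consecutive tokens.
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    n = len(token_times)
    # Time per output token: decode time averaged over tokens after the first.
    tpot = (e2el - ttft) / (n - 1) if n > 1 else 0.0
    return {"ttft": ttft, "tpot": tpot, "itl": itl, "e2el": e2el}


metrics = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.5])  # toy timestamps
print(metrics)
```

TTFT mostly reflects prefill cost, while TPOT and ITL reflect decode speed, which is where a smaller FP8 KV cache would be expected to help.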
Evaluation
We benchmarked on WikiText2 using vLLM’s /v1/completions PPL metric:
- FP16 (Qwen3-14B) → 6.38 PPL
- FP8_e4m3 (this model) → 8.73 PPL
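The ~2.35-point PPL increase (about 37% relative to FP16) is the price of a memory footprint of 0.56× the baseline and the accompanying throughput gain. WikiText2 is the only benchmark reported here, so check quality on your own tasks before treating the loss as negligible.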
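For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it from a list of natural-log token probabilities, such as those an OpenAI-compatible completions endpoint can return when log-probabilities are requested (an assumption about how the figures above were gathered). It then reproduces the absolute and relative deltas from the two reported values. The `perplexity` function and the toy input are illustrative only.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean(log p(token))) over natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Toy input: every token predicted with probability 0.5.
uniform = [math.log(0.5)] * 4
print(perplexity(uniform))

# Deltas between the two perplexities reported above.
fp16_ppl, fp8_ppl = 6.38, 8.73
delta = fp8_ppl - fp16_ppl
relative = (fp8_ppl / fp16_ppl - 1) * 100
print(f"absolute delta: {delta:.2f}, relative: {relative:.0f}%")
```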
License
This model reuses the Qwen3-14B license.