VibeVoice-ASR GPTQ INT4
This repository contains a 4-bit GPTQ quantized export of microsoft/VibeVoice-ASR.
Quantization
- Method: GPTQ
- Bits: 4
- Group size: 128
- Logical parameter count: 8,674,021,857
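To illustrate what "4 bits" and "group size 128" mean for storage, here is a minimal sketch of group-wise quantization in plain Python. It uses simple round-to-nearest with one scale and offset per group. This is an assumption made for illustration, not the GPTQ algorithm itself, which additionally compensates for rounding error using calibration data. The function names are hypothetical.

```python
def quantize_groupwise(weights, bits=4, group_size=128):
    """Round-to-nearest quantization with one (scale, offset) pair per group.

    Illustration only: GPTQ chooses the integer codes more carefully, but the
    stored format follows the same idea of small integer codes per weight plus
    a little metadata per group.
    """
    qmax = (1 << bits) - 1  # 15 for 4-bit codes
    groups = []
    for start in range(0, len(weights), group_size):
        chunk = weights[start:start + group_size]
        lo, hi = min(chunk), max(chunk)
        scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for constant groups
        codes = [min(qmax, max(0, round((w - lo) / scale))) for w in chunk]
        groups.append((scale, lo, codes))
    return groups


def dequantize_groupwise(groups):
    """Reconstruct approximate float weights from the grouped codes."""
    return [lo + scale * c for scale, lo, codes in groups for c in codes]


# Synthetic weights in [-1, 1]; 256 values give exactly two groups of 128.
weights = [((i * 37) % 101) / 50.0 - 1.0 for i in range(256)]
groups = quantize_groupwise(weights)
restored = dequantize_groupwise(groups)
print(len(groups), max(abs(a - b) for a, b in zip(weights, restored)))
```

Each reconstructed weight differs from the original by at most half of its group's scale, which is why smaller groups tend to lose less precision at the cost of more per-group metadata.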
Repository layout
This model is stored in a split VibeVoice layout:
- root directory: VibeVoice audio and non-decoder weights
- decoder-gptq/: quantized Qwen2 decoder weights
Keep this layout intact when downloading or mirroring the repository.
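A quick sanity check after downloading or mirroring can catch a broken layout early. The sketch below only tests for the two things named above, a root config.json and a non-empty decoder-gptq/ directory. The helper name check_split_layout is hypothetical, and the file names inside decoder-gptq/ are not listed in this card, so the demo uses a placeholder file.

```python
import tempfile
from pathlib import Path


def check_split_layout(root):
    """Return a list of problems found in a local copy of the repository."""
    root = Path(root)
    problems = []
    if not (root / "config.json").is_file():
        problems.append("missing root config.json")
    decoder_dir = root / "decoder-gptq"
    if not decoder_dir.is_dir():
        problems.append("missing decoder-gptq/ directory")
    elif not any(decoder_dir.iterdir()):
        problems.append("decoder-gptq/ is empty")
    return problems


# Demo on a throwaway directory that mimics the layout described above.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "config.json").write_text("{}", encoding="utf-8")
    print(check_split_layout(demo_root))  # decoder-gptq/ not created yet
    (demo_root / "decoder-gptq").mkdir()
    # Placeholder file name; the real weight file names are not given here.
    (demo_root / "decoder-gptq" / "placeholder.bin").write_bytes(b"")
    print(check_split_layout(demo_root))
```

In practice the function would be pointed at the local directory holding the downloaded repository, for example one fetched with huggingface_hub.snapshot_download.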
Metadata
The root config.json includes:
- vibevoice_metadata
- vibevoice_decoder_model_path
- vibevoice_decoder_quantization
These fields identify the split decoder path and preserve the logical source-model metadata.
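The fields can be inspected with the standard library alone. The following sketch reads a local config.json and returns the three keys listed above. The helper name read_split_metadata is hypothetical, and the values in the demo file are made-up placeholders, since this card does not document the value format of each field.

```python
import json
import tempfile
from pathlib import Path

SPLIT_KEYS = (
    "vibevoice_metadata",
    "vibevoice_decoder_model_path",
    "vibevoice_decoder_quantization",
)


def read_split_metadata(config_path):
    """Return the split-layout fields from config.json (None if a key is absent)."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return {key: config.get(key) for key in SPLIT_KEYS}


# Demo with placeholder values; the real config.json may look different.
with tempfile.TemporaryDirectory() as tmp:
    demo_config = Path(tmp) / "config.json"
    demo_config.write_text(
        json.dumps({
            "vibevoice_decoder_model_path": "decoder-gptq",  # placeholder value
            "vibevoice_decoder_quantization": "gptq",        # placeholder value
        }),
        encoding="utf-8",
    )
    for key, value in read_split_metadata(demo_config).items():
        print(f"{key}: {value!r}")
```

A key that prints as None is missing from the file, which may indicate that the root config.json was replaced or regenerated during mirroring.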
Validation
This GPTQ export was validated against the full upstream VibeVoice-ASR model on short audio samples.
- outputs remained valid JSON transcript arrays
- output similarity to the full model remained high on tested samples
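A comparable check can be sketched with the standard library. The code below tests that a model output parses as a JSON array and computes a character-level similarity ratio between two outputs. The metric actually used for this validation is not stated here, so difflib.SequenceMatcher is a stand-in, and the transcript strings and their "text" key are hypothetical examples.

```python
import json
from difflib import SequenceMatcher


def is_json_array(text):
    """True if text parses as JSON and the top-level value is an array."""
    try:
        return isinstance(json.loads(text), list)
    except json.JSONDecodeError:
        return False


def similarity(a, b):
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical outputs from the full model and the quantized export.
full_output = '[{"text": "hello there, how are you"}]'
quant_output = '[{"text": "hello there how are you"}]'

print(is_json_array(full_output), is_json_array(quant_output))
print(round(similarity(full_output, quant_output), 3))
```

For a stricter comparison, a word error rate computed on the transcript text alone would avoid counting JSON punctuation as matching content.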
Serving note for vLLM 0.17.x
On current vLLM 0.17.x CUDA builds, this checkpoint is compatible with the Marlin GPTQ path.
- prefer letting vLLM infer the backend from config.json
- if you must set it explicitly, use gptq_marlin rather than plain gptq
Note: in current split-VibeVoice testing, this GPTQ export did not show the same VRAM reduction as the AWQ export under vLLM 0.17.x, even when served through the Marlin path. The checkpoint is still published for reproducibility, but AWQ is currently the recommended low-VRAM variant.
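The backend choice described in this section can be sketched as a small command builder. It assumes the standard vllm serve entry point and its --quantization option. The full launch command for this checkpoint is not given here, so additional flags, a local model path, or the compatibility patches may also be needed. The helper name vllm_serve_command is hypothetical.

```python
import shlex


def vllm_serve_command(model, quantization=None):
    """Build a `vllm serve` argument list.

    With quantization=None no flag is passed, so vLLM infers the backend
    from the checkpoint's config.json (the preferred option above).
    """
    argv = ["vllm", "serve", model]
    if quantization is not None:
        argv += ["--quantization", quantization]
    return argv


repo = "lemuriandezapada/VibeVoice-ASR-gptq-int4"

# Preferred: let vLLM pick the backend on its own.
print(shlex.join(vllm_serve_command(repo)))

# Explicit override: request the Marlin GPTQ path rather than plain gptq.
print(shlex.join(vllm_serve_command(repo, quantization="gptq_marlin")))
```

The resulting list can be passed to subprocess.run, or the printed line can be pasted into a shell.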
Upstream references
- Code: https://github.com/microsoft/VibeVoice
- Base model: https://huggingface.co/microsoft/VibeVoice-ASR
- Report: https://arxiv.org/pdf/2601.18184
Notes
- This is a quantized derivative export, not the original upstream checkpoint.
- Base model licensing and usage terms follow the upstream VibeVoice-ASR release.
- Pure-VibeVoice compatibility patches for vLLM 0.17.x are included under patches/vllm_0_17/.