HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios

Official repository of the paper "Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios" (AAAI 2026)

🗞 News

  • [2025-11-08] 🎉 Paper accepted by AAAI 2026
  • [2025-11-12] 🎉 arXiv paper released
  • [2025-11-12] 🎉 Demo released
  • [2025-12-24] 🎉 Inference code and pre-trained models released

📅 Release Plan

  • arXiv preprint
  • Online demo
  • Inference code
  • Pre-trained models
  • Training code

✨ New features

  • Singing style control
  • Improved quality

HQ-SVC is an efficient framework for high-quality zero-shot singing voice conversion (SVC) in low-resource scenarios. It achieves disentanglement of content and speaker features via a unified decoupled codec, and enhances synthesis quality through multi-feature fusion and progressive optimization.

Unlike existing methods that demand large datasets or heavy computational resources, HQ-SVC combines:

  • 🚀 Zero-shot conversion for unseen speakers without fine-tuning
  • ⚡ Low-resource training (a single consumer-grade GPU and under 80 hours of data)
  • 🎧 Dual capabilities: high-quality singing voice conversion and voice super-resolution
  • 🎯 Higher naturalness and speaker similarity than state-of-the-art (SOTA) methods

🎸 Try Inference

1. Download Code and Environment

git clone https://github.com/ShawnPi233/HQ-SVC.git
cd HQ-SVC
wget -c https://huggingface.co/shawnpi/HQ-SVC/resolve/main/environment.tar.gz
wget -c https://hf-mirror.com/shawnpi/HQ-SVC/resolve/main/environment.tar.gz # optional mirror source
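
Because `wget -c` resumes interrupted downloads, it can be worth confirming that the archive is complete before unpacking it. A minimal sketch, assuming the file is an ordinary gzip-compressed tarball; `archive_ok` is a hypothetical helper name and not part of the repository:

```shell
# Hypothetical helper (not part of HQ-SVC): succeeds only if the gzip tarball
# can be listed from start to end, which fails on truncated or corrupt files.
archive_ok() {
    tar -tzf "$1" > /dev/null 2>&1
}

# Example: archive_ok environment.tar.gz && echo "archive looks complete"
```

If the check fails, re-running the same `wget -c` command should continue the download from where it stopped.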

2. Unpack Environment

mkdir -p venv
tar -xzf environment.tar.gz -C venv

3. Activate Environment

source venv/bin/activate

4. Download Pretrained Models

export HF_HUB_ENABLE_HF_TRANSFER=0
huggingface-cli download shawnpi/HQ-SVC --include "utils/pretrain/*" --local-dir . --local-dir-use-symlinks False
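
With `--local-dir .`, the files matching `utils/pretrain/*` are placed under `utils/pretrain` in the current directory. A quick check that the step produced files might look like the sketch below; `has_files` is a hypothetical helper, and because the exact file names the app expects are not listed here, it only verifies that the directory is non-empty:

```shell
# Hypothetical helper (not part of HQ-SVC): succeeds if the directory exists
# and contains at least one regular file at any depth.
has_files() {
    [ -d "$1" ] && [ -n "$(find "$1" -type f | head -n 1)" ]
}

# Example: has_files utils/pretrain || echo "no files found under utils/pretrain"
```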

5. Run

python gradio_app.py
  • If this fails with: Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
  • run the following command, then start the app again:
unset LD_LIBRARY_PATH
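
If other programs in the same shell still rely on `LD_LIBRARY_PATH`, the variable can be dropped for the app process only rather than for the whole session. A minimal sketch using the `-u` option of `env` (available in GNU coreutils and BSD `env`); `run_without_ld_path` is a hypothetical wrapper name:

```shell
# Hypothetical wrapper (not part of HQ-SVC): run a command with
# LD_LIBRARY_PATH removed from its environment, leaving the shell untouched.
run_without_ld_path() {
    env -u LD_LIBRARY_PATH "$@"
}

# Example: run_without_ld_path python gradio_app.py
```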

Zero-shot Super-Resolution (16 kHz to 44.1 kHz): Input only source audio
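
To confirm that a source file is actually sampled at 16 kHz before trying super-resolution, the rate can be read from the WAV header. A minimal sketch, assuming a canonical PCM WAV file whose `fmt ` chunk starts at byte 12, so that the sample rate occupies bytes 24 to 27 in little-endian order; `wav_sample_rate` is a hypothetical helper, and whether the app resamples other inputs on its own is not stated here:

```shell
# Hypothetical helper (not part of HQ-SVC): print the sample rate stored in a
# canonical WAV header. Files with extra chunks before "fmt " are not handled.
wav_sample_rate() {
    # Read bytes 24..27 as four unsigned decimal values.
    bytes=$(od -An -t u1 -j 24 -N 4 "$1")
    # Split them into positional parameters and combine them little-endian.
    set -- $bytes
    echo $(( $1 + $2 * 256 + $3 * 65536 + $4 * 16777216 ))
}

# Example: wav_sample_rate source.wav
```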

Zero-shot Singing Voice Conversion: Input both source audio and target audio

📜 Citation

If you use HQ-SVC in your research, please cite our work:

@article{bai2025hq,
  title={HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios},
  author={Bai, Bingsong and Geng, Yizhong and Wang, Fengping and Wang, Cong and Guo, Puyuan and Gao, Yingming and Li, Ya},
  journal={arXiv preprint arXiv:2511.08496},
  year={2025}
}

🙏 Acknowledgement

We thank the open-source communities behind:
