# CapSpeech NAR – Vietnamese Instruction TTS
Vietnamese-adapted CapSpeech NAR (Non-Autoregressive) model for instruction-guided text-to-speech synthesis.
## Model Details
- Base model: OpenSound/CapSpeech-models (`nar_CapTTS.pt`)
- Architecture: CrossDiT (dim=1024, depth=24, heads=16, ff_mult=4)
- Vocoder: BigVGAN v2 24kHz 100-band
- Text encoder: Character-level Vietnamese (176 chars)
- Caption encoder: ViT5-large (VietAI/vit5-large, dim=1024)
- Parameters: 614.10M trainable
- Training steps: 30000 optimizer updates
- Training data: ~1.05M Vietnamese speech samples (general task)
- Loss: Flow Matching (MSE)
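The flow-matching MSE objective listed above can be sketched generically as follows. This is an illustrative conditional flow-matching loss in NumPy, not the actual CapSpeech training code; the function name and toy shapes are assumptions:

```python
import numpy as np

def fm_mse_loss(v_pred, x0, x1):
    """MSE flow-matching loss against the straight-path velocity x1 - x0 (illustrative)."""
    return float(np.mean((v_pred - (x1 - x0)) ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 100))  # noise sample (frames x mel bins, toy shape)
x1 = rng.standard_normal((8, 100))  # target mel features
t = 0.3
x_t = (1 - t) * x0 + t * x1         # point on the straight path; the model sees this at time t
# A model that predicts the exact path velocity achieves zero loss:
loss = fm_mse_loss(x1 - x0, x0, x1)
```

The model is trained to regress the constant velocity of the linear noise-to-data path; at inference, integrating the predicted velocity field transports noise to mel features.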
## Training Config
| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Batch size | 32 × 2 GPUs |
| Gradient accumulation | 2 |
| Effective batch | 128 |
| Mixed precision | fp16 |
| Optimizer | AdamW |
| GPU | 2× NVIDIA A100 40GB |
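The effective batch size in the table follows from per-GPU batch, GPU count, and gradient-accumulation steps. A quick sanity check (not training code):

```python
# Effective batch = per-GPU batch x number of GPUs x gradient accumulation steps.
per_gpu_batch = 32
num_gpus = 2
grad_accum = 2
effective_batch = per_gpu_batch * num_gpus * grad_accum
print(effective_batch)  # -> 128
```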
## Usage
```python
# Load checkpoint
import torch
from capspeech.nar.network.crossdit import CrossDiT

model = CrossDiT(
    dim=1024, depth=24, heads=16, ff_mult=4,
    text_dim=512, conv_layers=0,
    text_num_embeds=176, mel_dim=100,
    t5_dim=1024, clap_dim=512,
    use_checkpoint=False, qk_norm=True, skip=True,
)
ckpt = torch.load("checkpoint.pt", map_location="cpu")
model.load_state_dict(ckpt["model"])
```
## Files
- `checkpoint.pt` – Model + optimizer state dict
- `finetune_vn.yaml` – Training config
- `vocab.txt` – Vietnamese character vocabulary (176 chars)
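Since the text encoder is character-level, `vocab.txt` maps each Vietnamese character to an integer id. A minimal encoding sketch, assuming the file holds one symbol per line (an assumption about its format) and using a toy vocabulary in place of the real 176-char file:

```python
def load_char_vocab(lines):
    """Build a char -> id map from vocab lines (one symbol per line, assumed format)."""
    return {ch: i for i, ch in enumerate(lines)}

def encode(text, vocab):
    """Character-level encoding; characters outside the vocabulary are dropped."""
    return [vocab[ch] for ch in text if ch in vocab]

# Toy vocabulary standing in for the real vocab.txt:
vocab = load_char_vocab(["x", "i", "n", " ", "c", "h", "à", "o"])
print(encode("xin chào", vocab))  # -> [0, 1, 2, 3, 4, 5, 6, 7]
```

In practice the real file would be read with `open("vocab.txt", encoding="utf-8")` and the resulting ids fed to the model's text embedding (`text_num_embeds=176`).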
## Citation
Based on CapSpeech.