# CapSpeech NAR – Vietnamese Instruction TTS
Vietnamese-adapted CapSpeech NAR (Non-Autoregressive) model for instruction-guided text-to-speech synthesis.
## Model Details
- Base model: OpenSound/CapSpeech-models (`nar_CapTTS.pt`)
- Architecture: CrossDiT (dim=1024, depth=24, heads=16, ff_mult=4)
- Vocoder: BigVGAN v2 24kHz 100-band
- Text encoder: Character-level Vietnamese (176 chars)
- Caption encoder: ViT5-large (VietAI/vit5-large, dim=1024)
- Parameters: 614.10M trainable
- Training steps: 30000 optimizer updates
- Training data: ~1.05M Vietnamese speech samples (general task)
- Loss: Flow Matching (MSE)
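The flow-matching MSE objective listed above can be sketched generically as follows. This is an illustrative conditional flow-matching loss in NumPy, not the actual CapSpeech training code; the function name and toy shapes are assumptions:

```python
import numpy as np

def fm_mse_loss(v_pred, x0, x1):
    """MSE flow-matching loss against the straight-path velocity x1 - x0 (illustrative)."""
    return float(np.mean((v_pred - (x1 - x0)) ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 100))  # noise sample (frames x mel bins, toy shape)
x1 = rng.standard_normal((8, 100))  # target mel features
t = 0.3
x_t = (1 - t) * x0 + t * x1         # point on the straight path; the model sees this at time t
# A model that predicts the exact path velocity achieves zero loss:
loss = fm_mse_loss(x1 - x0, x0, x1)
```

The model is trained to regress the constant velocity of the linear noise-to-data path; at inference, integrating the predicted velocity field transports noise to mel features.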
## Training Config
| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Batch size | 32 × 2 GPUs |
| Gradient accumulation | 2 |
| Effective batch | 128 |
| Mixed precision | fp16 |
| Optimizer | AdamW |
| GPU | 2× NVIDIA A100 40GB |
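The effective batch size in the table follows from per-GPU batch, GPU count, and gradient-accumulation steps. A quick sanity check (not training code):

```python
# Effective batch = per-GPU batch x number of GPUs x gradient accumulation steps.
per_gpu_batch = 32
num_gpus = 2
grad_accum = 2
effective_batch = per_gpu_batch * num_gpus * grad_accum
print(effective_batch)  # -> 128
```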
## Usage
```python
# Load checkpoint
import torch
from capspeech.nar.network.crossdit import CrossDiT

model = CrossDiT(
    dim=1024, depth=24, heads=16, ff_mult=4,
    text_dim=512, conv_layers=0,
    text_num_embeds=176, mel_dim=100,
    t5_dim=1024, clap_dim=512,
    use_checkpoint=False, qk_norm=True, skip=True,
)
ckpt = torch.load("checkpoint.pt", map_location="cpu")
model.load_state_dict(ckpt["model"])
```
## Files
- `checkpoint.pt` – Model + optimizer state dict
- `finetune_vn.yaml` – Training config
- `vocab.txt` – Vietnamese character vocabulary (176 chars)
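Since the text encoder is character-level, `vocab.txt` maps each Vietnamese character to an integer id. A minimal encoding sketch, assuming the file holds one symbol per line (an assumption about its format) and using a toy vocabulary in place of the real 176-char file:

```python
def load_char_vocab(lines):
    """Build a char -> id map from vocab lines (one symbol per line, assumed format)."""
    return {ch: i for i, ch in enumerate(lines)}

def encode(text, vocab):
    """Character-level encoding; characters outside the vocabulary are dropped."""
    return [vocab[ch] for ch in text if ch in vocab]

# Toy vocabulary standing in for the real vocab.txt:
vocab = load_char_vocab(["x", "i", "n", " ", "c", "h", "à", "o"])
print(encode("xin chào", vocab))  # -> [0, 1, 2, 3, 4, 5, 6, 7]
```

In practice the real file would be read with `open("vocab.txt", encoding="utf-8")` and the resulting ids fed to the model's text embedding (`text_num_embeds=176`).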
## Citation
Based on CapSpeech.