CapSpeech NAR – Vietnamese Instruction TTS

Vietnamese-adapted CapSpeech NAR (Non-Autoregressive) model for instruction-guided text-to-speech synthesis.

Model Details

  • Base model: OpenSound/CapSpeech-models (nar_CapTTS.pt)
  • Architecture: CrossDiT (dim=1024, depth=24, heads=16, ff_mult=4)
  • Vocoder: BigVGAN v2 24kHz 100-band
  • Text encoder: Character-level Vietnamese (176 chars)
  • Caption encoder: ViT5-large (VietAI/vit5-large, dim=1024)
  • Parameters: 614.10M trainable
  • Training steps: 30,000 optimizer updates
  • Training data: ~1.05M Vietnamese speech samples (general task)
  • Loss: Flow Matching (MSE)
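The flow-matching (MSE) objective listed above can be sketched as follows. This is a minimal illustration assuming the common linear-interpolation path between noise and data; the exact schedule and conditioning used by CapSpeech may differ.

```python
import random

def flow_matching_loss(x0, x1, predict_v):
    """Toy flow-matching (MSE) step.

    x0: noise sample, x1: data sample (e.g. a mel frame),
    predict_v: model stub that predicts the velocity at (x_t, t).
    Assumes the linear path x_t = (1 - t) * x0 + t * x1, whose
    velocity target is simply x1 - x0.
    """
    t = random.random()
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]  # velocity of the linear path
    pred = predict_v(x_t, t)
    return sum((p - g) ** 2 for p, g in zip(pred, target)) / len(target)

# Toy usage: a "perfect" predictor that always outputs the true velocity
x0, x1 = [0.0, 0.0], [1.0, 2.0]
loss = flow_matching_loss(x0, x1, lambda x, t: [1.0, 2.0])  # -> 0.0
```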

Training Config

Parameter              Value
Learning rate          2e-5
Batch size (per GPU)   32
Gradient accumulation  2
Effective batch size   128
Mixed precision        fp16
Optimizer              AdamW
GPUs                   2× NVIDIA A100 40GB
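The effective batch size follows directly from the per-GPU batch, the GPU count, and the gradient-accumulation steps; a quick check using the values from the table:

```python
# Effective batch = per-GPU batch x number of GPUs x gradient-accumulation steps
per_gpu_batch = 32
num_gpus = 2
grad_accum = 2

effective_batch = per_gpu_batch * num_gpus * grad_accum  # -> 128
```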

Usage

# Load checkpoint
import torch
from capspeech.nar.network.crossdit import CrossDiT

model = CrossDiT(
    dim=1024, depth=24, heads=16, ff_mult=4,
    text_dim=512, conv_layers=0,
    text_num_embeds=176, mel_dim=100,
    t5_dim=1024, clap_dim=512,
    use_checkpoint=False, qk_norm=True, skip=True
)

ckpt = torch.load("checkpoint.pt", map_location="cpu")
model.load_state_dict(ckpt["model"])  # "model" key holds the network weights
model.eval()
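Since `checkpoint.pt` bundles both model and optimizer state (see Files below), an inference-only copy can be much smaller. A minimal sketch, assuming the checkpoint uses `"model"` and `"optimizer"` as its top-level keys (verify against the actual file):

```python
def strip_optimizer(ckpt: dict) -> dict:
    """Return a copy of the checkpoint dict without the optimizer state.

    Assumes the optimizer state lives under the "optimizer" key; adjust
    if the real checkpoint uses a different layout.
    """
    return {k: v for k, v in ckpt.items() if k != "optimizer"}

# Toy usage with a stand-in checkpoint dict
ckpt = {"model": {"w": 1}, "optimizer": {"step": 30000}}
slim = strip_optimizer(ckpt)  # only the model weights remain
```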

Files

  • checkpoint.pt – model + optimizer state dict
  • finetune_vn.yaml – training config
  • vocab.txt – Vietnamese character vocabulary (176 chars)
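The character-level text encoder maps each Vietnamese character to an index in the 176-entry vocabulary. A minimal sketch of that lookup, assuming `vocab.txt` lists one character per line (check the actual file format); a toy subset stands in for the real vocabulary here:

```python
def load_vocab(lines):
    """Build a char -> id map from vocabulary entries (one char per entry)."""
    return {ch: i for i, ch in enumerate(lines)}

def encode(text, vocab, unk=0):
    """Map text to character ids, falling back to `unk` for unknown chars."""
    return [vocab.get(ch, unk) for ch in text]

# Toy vocabulary (the real vocab.txt has 176 entries)
toy = load_vocab(["<pad>", " ", "x", "i", "n", "c", "h", "à", "o"])
ids = encode("xin chào", toy)  # -> [2, 3, 4, 1, 5, 6, 7, 8]
```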

Citation

This model is a Vietnamese adaptation of CapSpeech; please cite the original CapSpeech work when using it.

Model repository: thangquang09/capspeech-nar-vietnamese (finetuned from CapSpeech)