---
language:
- fi
license: mit
tags:
- text-to-speech
- tts
- zero-shot
- voice-cloning
- finnish
datasets:
- mozilla-foundation/common_voice_15_0
base_model: ResembleAI/chatterbox
pipeline_tag: text-to-speech
library_name: pytorch
model-index:
- name: Chatterbox Finnish Fine-Tuned (Step 986)
  results:
  - task:
      type: text-to-speech
      name: Text to Speech
    dataset:
      name: Mozilla Common Voice 15.0 (Finnish OOD)
      type: mozilla-foundation/common_voice_15_0
      config: fi
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 2.76
      verified: true
    - name: Mean Opinion Score (MOS)
      type: mos
      value: 4.34
---

# Chatterbox Finnish Fine-Tuning: High-Fidelity Zero-Shot TTS

This repository hosts a high-fidelity fine-tuned version of the Chatterbox TTS model, specifically optimized for the Finnish language. By leveraging a multilingual base and large-scale Finnish data, we achieved exceptional zero-shot generalization to unseen speakers, surpassing commercial-grade quality thresholds.

## 🚀 Performance (Zero-Shot OOD)

The following metrics were calculated on **Out-of-Distribution (OOD)** speakers who were strictly excluded from the training and validation sets. This measures how well the model can speak Finnish in voices it has never heard before.

| Metric | Baseline (Original Multilingual) | Fine-Tuned (Step 986) | Improvement |
| :--- | :---: | :---: | :---: |
| **Avg Word Error Rate (WER)** | 28.94% | **2.76%** | **~10.5× lower WER** |
| **Mean Opinion Score (MOS)** | 2.29 / 5.0 | **4.34 / 5.0** | **+2.05 quality points** |

*Note: MOS was evaluated using the Gemini 3 Flash API, and WER was calculated using Faster-Whisper Finnish Large v3. The 4.34 MOS indicates "Professional Grade" output comparable to human speech.*

---

## 🎧 Audio Comparison (OOD Speakers)

Listen to the difference between the generic multilingual baseline and our high-fidelity Finnish fine-tuning. These samples are from speakers **never seen during training**.
| Speaker ID | Baseline (Generic Multilingual) | Fine-Tuned (Finnish Golden) |
| :--- | :--- | :--- |
| **cv-15_11** | | |
| **cv-15_16** | | |
| **cv-15_2** | | |

---

## 🛠 Data Processing & Transparency

The model was trained on a diverse corpus of **16,604 samples** to capture the nuances of Finnish phonetics, including vowel length and gemination.

* **Sources**: Mozilla Common Voice (cv-15, licensed CC0-1.0), Filmot (CC BY), YouTube (CC BY), and Parliament data (CLARIN PUB +BY +PRIV).
* **Zero-Shot Integrity**: Specific speakers (`cv-15_11`, `cv-15_16`, `cv-15_2`) were strictly excluded from training to ensure valid OOD testing.
* **Traceability**: Full attribution and filtering lineage are provided in `attribution.csv`.

---

## 🔬 Phase 2 Research: Single-Speaker Fine-Tuning

As a separate research phase, we tested the model's capacity for deep voice cloning by fine-tuning the Phase 1 base on a single high-quality Finnish speaker dataset (GrowthMindset).

### Results & Optimization

We used `sweep_params.py` to identify the "Golden Settings" for the most natural Finnish inference. By evaluating against holdout samples and everyday phrases, we achieved a peak quality of **4.63 MOS**.

**Best Parameters for Finnish:**

* `repetition_penalty`: 1.5 (balanced for Finnish long vowels)
* `temperature`: 0.8
* `exaggeration`: 0.5
* `cfg_weight`: 0.3

### Research Samples (Cloned Voice)

* **Everyday Phrases**: [Polite Request](eval_results_stage2_lean/checkpoint-16_everyday_0.wav) | [Morning Greeting](eval_results_stage2_lean/checkpoint-16_everyday_2.wav)

*Note: The single-speaker weights are not included in this repository.*

---

## 💻 Hardware & Infrastructure

* **Platform**: Verda (NVIDIA A100 80GB)
* **Mixed Precision**: BF16 for stability.
* **Repetition Guard**: Custom threshold of **10 tokens** in `AlignmentStreamAnalyzer`, since Finnish long vowels and geminates produce legitimate token repetition that a lower threshold would cut off.
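To illustrate the repetition guard, here is a minimal, hypothetical sketch of the idea (not the actual `AlignmentStreamAnalyzer` code, and `should_stop` is an invented name): halt generation once the same token has been emitted for a threshold number of consecutive steps.

```python
def should_stop(token_history, threshold=10):
    """Hypothetical repetition guard: True once the most recent `threshold`
    tokens are all identical. Finnish needs a relatively high threshold,
    because long vowels and geminates legitimately repeat tokens."""
    if len(token_history) < threshold:
        return False
    # A window of identical tokens collapses to a single-element set.
    return len(set(token_history[-threshold:])) == 1
```

With the default threshold of 10, a run of nine identical tokens (a long Finnish vowel, for instance) passes through untouched, while a degenerate loop of ten or more trips the guard.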
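The "Golden Settings" search from Phase 2 can be pictured as a simple grid sweep over inference parameters. The sketch below is an illustrative stand-in for `sweep_params.py`, not its actual contents; the grid values and the `evaluate` callback are hypothetical assumptions.

```python
from itertools import product

# Hypothetical search grids centered on the reported best values.
GRID = {
    "repetition_penalty": [1.2, 1.5, 2.0],
    "temperature": [0.6, 0.8, 1.0],
    "exaggeration": [0.3, 0.5, 0.7],
    "cfg_weight": [0.3, 0.5],
}

def sweep(evaluate):
    """Try every parameter combination and return (best_score, best_params).

    `evaluate` is a caller-supplied function mapping a parameter dict to a
    quality score (e.g. MOS on holdout samples and everyday phrases)."""
    best_score, best_params = float("-inf"), None
    keys = list(GRID)
    for values in product(*(GRID[k] for k in keys)):
        params = dict(zip(keys, values))
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params
```

In practice each `evaluate` call would synthesize audio and score it, so the grid size (54 combinations here) directly bounds the synthesis budget.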
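For completeness, the WER reported in the performance table is the standard word-level edit distance between the ASR transcript and the reference text, divided by the number of reference words. A minimal self-contained version, assuming plain whitespace tokenization:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

In the actual evaluation, the hypothesis side came from Faster-Whisper Finnish Large v3 transcripts of the generated audio.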
---

## 🚀 Quick Start

### Option A — Dev Container (recommended)

Open this repo in VS Code with the [Dev Containers](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers) extension. Everything — dependencies, base model weights, GPU detection — is handled automatically by `postCreateCommand`.

### Option B — Manual Setup

```bash
# 1. Clone (with LFS for model weights)
git clone https://huggingface.co/Finnish-NLP/Chatterbox-Finnish
cd Chatterbox-Finnish

# 2. Install dependencies (auto-detects your GPU architecture)
bash install_dependencies.sh

# 3. Download pretrained base models from ResembleAI
python setup.py

# 4. Run inference
python inference_example.py
```

> **GPU compatibility:** The install script detects your GPU and picks the right PyTorch build automatically:
> - **Blackwell (sm_120+)** e.g. RTX PRO 6000 → PyTorch 2.10.0 + CUDA 12.8
> - **Older GPUs (A100, RTX 30/40xx, etc.)** → PyTorch 2.5.1 + CUDA 12.4

---

## 🏃 Running Inference

```python
import torch
import soundfile as sf
from safetensors.torch import load_file

from src.chatterbox_.tts import ChatterboxTTS

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load the base engine
engine = ChatterboxTTS.from_local("./pretrained_models", device=device)

# 2. Inject the Finnish fine-tuned weights. Checkpoint keys carry a "t3."
#    prefix, so strip it to match the T3 submodule's state dict.
checkpoint = load_file("./models/best_finnish_multilingual_cp986.safetensors")
t3_state = {k[3:] if k.startswith("t3.") else k: v for k, v in checkpoint.items()}
engine.t3.load_state_dict(t3_state, strict=False)

# 3. Generate with Finnish-optimized parameters
wav = engine.generate(
    text="Tervetuloa kokeilemaan hienoviritettyä suomenkielistä Chatterbox-puhesynteesiä.",
    audio_prompt_path="./samples/reference_finnish.wav",
    repetition_penalty=1.2,
    temperature=0.8,
    exaggeration=0.6,
)
sf.write("output.wav", wav.squeeze().cpu().numpy(), engine.sr)
```

Or just run the included example script directly:

```bash
python inference_example.py   # outputs output_finnish.wav
```

---

## 🙏 Acknowledgments & Credits

- **Exploration Foundation**: The initial fine-tuning exploration was based on the [chatterbox-finetuning](https://github.com/gokhaneraslan/chatterbox-finetuning) toolkit by gokhaneraslan.
- **Model Authors**: Deep thanks to the team at **ResembleAI** for the [Chatterbox](https://huggingface.co/ResembleAI/chatterbox) model.
- **Single-Speaker Fine-Tuning**: Huge thanks to Mape for permission to fine-tune on audio from the [Growth Mindset Builder](https://www.youtube.com/@Growthmindsetbuilder) YouTube channel.
- **Data Sourcing**: Thanks to **#Jobik** in the **Nordic AI** Discord for the dataset insights.

## Disclaimer

- **Don't use this model to do bad things.**