---
language:
- fi
license: mit
tags:
- text-to-speech
- tts
- zero-shot
- voice-cloning
- finnish
datasets:
- mozilla-foundation/common_voice_15_0
base_model: ResembleAI/chatterbox
pipeline_tag: text-to-speech
library_name: pytorch
model-index:
- name: Chatterbox Finnish Fine-Tuned (Step 986)
results:
- task:
type: text-to-speech
name: Text to Speech
dataset:
name: Mozilla Common Voice 15.0 (Finnish OOD)
type: mozilla-foundation/common_voice_15_0
config: fi
split: test
metrics:
- name: Word Error Rate (WER)
type: wer
value: 2.76
verified: true
- name: Mean Opinion Score (MOS)
type: mos
value: 4.34
---
# Chatterbox Finnish Fine-Tuning: High-Fidelity Zero-Shot TTS
This repository hosts a high-fidelity fine-tuned version of the Chatterbox TTS model, optimized for Finnish. By combining a multilingual base with large-scale Finnish data, the model achieves strong zero-shot generalization to unseen speakers, with WER and MOS scores in the range of commercial-grade systems.
## 🚀 Performance (Zero-Shot OOD)
The following metrics were calculated on **Out-of-Distribution (OOD)** speakers who were strictly excluded from the training and validation sets. This measures how well the model can speak Finnish in voices it has never heard before.
| Metric | Baseline (Original Multilingual) | Fine-Tuned (Step 986) | Improvement |
| :--- | :---: | :---: | :---: |
| **Avg Word Error Rate (WER)** | 28.94% | **2.76%** | **~10.5× WER reduction** |
| **Mean Opinion Score (MOS)** | 2.29 / 5.0 | **4.34 / 5.0** | **+2.05 Quality Points** |
*Note: MOS was evaluated using the Gemini 3 Flash API, and WER was calculated using Faster-Whisper Finnish Large v3. The 4.34 MOS indicates a "Professional Grade" output comparable to human speech.*
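For reference, a WER figure like the one above is the word-level edit distance between the ASR transcript and the reference text, divided by the reference length. A minimal, dependency-free sketch (the actual evaluation transcribed model outputs with Faster-Whisper first; this helper is illustrative, not the evaluation script):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over word sequences via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("hyvää huomenta kaikille", "hyvää huomenta kaikille"))  # 0.0
```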
---
## 🎧 Audio Comparison (OOD Speakers)
Listen to the difference between the generic multilingual baseline and our high-fidelity Finnish fine-tuning. These samples are from speakers **never seen during training**.
| Speaker ID | Baseline (Generic Multilingual) | Fine-Tuned (Finnish Golden) |
| :--- | :--- | :--- |
| **cv-15_11** | | |
| **cv-15_16** | | |
| **cv-15_2** | | |
---
## 🛠 Data Processing & Transparency
The model was trained on a diverse corpus of **16,604 samples** to capture the nuances of Finnish phonetics, including vowel length and gemination.
* **Sources**: Mozilla Common Voice (cv-15, license CC0-1.0), Filmot (CC BY), YouTube (CC BY), and Parliament data (CLARIN PUB +BY +PRIV).
* **Zero-Shot Integrity**: Specific speakers (`cv-15_11`, `cv-15_16`, `cv-15_2`) were strictly excluded from training to ensure valid OOD testing.
* **Traceability**: Full attribution and filtering lineage are provided in `attribution.csv`.
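The OOD split above amounts to a simple filter over the sample metadata. A sketch, assuming each row exposes a `speaker_id` field (the actual column names in `attribution.csv` may differ):

```python
# Speaker IDs held out for zero-shot OOD evaluation (listed above).
OOD_SPEAKERS = {"cv-15_11", "cv-15_16", "cv-15_2"}

def split_rows(rows):
    """Partition metadata rows into training vs. held-out OOD sets.
    Assumes a 'speaker_id' key per row; illustrative, not the repo's script."""
    train, ood = [], []
    for row in rows:
        (ood if row["speaker_id"] in OOD_SPEAKERS else train).append(row)
    return train, ood

rows = [{"speaker_id": "cv-15_2", "path": "a.wav"},
        {"speaker_id": "cv-15_99", "path": "b.wav"}]
train, ood = split_rows(rows)
print(len(train), len(ood))  # 1 1
```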
---
## 🔬 Phase 2 Research: Single-Speaker Fine-Tuning
As a separate research phase, we tested the model's capacity for deep voice cloning by fine-tuning the Phase 1 base on a specific high-quality Finnish dataset (GrowthMindset).
### Results & Optimization
We used `sweep_params.py` to identify the "Golden Settings" for the most natural Finnish inference. By evaluating against holdout samples and everyday phrases, we achieved a peak quality of **4.63 MOS**.
**Best Parameters for Finnish:**
* `repetition_penalty`: 1.5 (Balanced for Finnish long vowels)
* `temperature`: 0.8
* `exaggeration`: 0.5
* `cfg_weight`: 0.3
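The "Golden Settings" above can be collected as keyword arguments and passed straight to `engine.generate()` (the parameter names follow the inference example in this README; the engine setup itself is shown under "Running Inference"):

```python
# "Golden Settings" from the sweep above, packaged as generate() kwargs.
GOLDEN_FINNISH_PARAMS = dict(
    repetition_penalty=1.5,  # balanced for Finnish long vowels
    temperature=0.8,
    exaggeration=0.5,
    cfg_weight=0.3,
)

# With an engine loaded as in the inference example:
# wav = engine.generate(text=..., audio_prompt_path=..., **GOLDEN_FINNISH_PARAMS)
print(sorted(GOLDEN_FINNISH_PARAMS))
```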
### Research Samples (Cloned Voice)
* **Everyday Phrases**: [Polite Request](eval_results_stage2_lean/checkpoint-16_everyday_0.wav) | [Morning Greeting](eval_results_stage2_lean/checkpoint-16_everyday_2.wav)
*Note: The single-speaker weights are not included in this repository.*
---
## 💻 Hardware & Infrastructure
* **Platform**: Verda (NVIDIA A100 80GB)
* **Mixed Precision**: BF16 for stability.
* **Repetition Guard**: Custom threshold of **10 tokens** in `AlignmentStreamAnalyzer` to support Finnish phonology.
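The higher threshold matters because Finnish long vowels and geminates can legitimately hold the same acoustic token for many steps. Conceptually, the guard counts the trailing run of an identical token and trips once it reaches the threshold; a simplified sketch, not the actual `AlignmentStreamAnalyzer` code:

```python
def repetition_exceeded(token_ids, threshold=10):
    """Return True once the most recent token has repeated `threshold`
    times in a row. Simplified stand-in for the repetition guard."""
    if not token_ids:
        return False
    last = token_ids[-1]
    run = 0
    for tok in reversed(token_ids):
        if tok != last:
            break
        run += 1
    return run >= threshold

print(repetition_exceeded([7] * 9))   # False: run of 9 < threshold 10
print(repetition_exceeded([7] * 10))  # True
```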
---
## 🚀 Quick Start
### Option A — Dev Container (recommended)
Open this repo in VS Code with the [Dev Containers](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers) extension. Everything — dependencies, base model weights, GPU detection — is handled automatically by `postCreateCommand`.
### Option B — Manual Setup
```bash
# 1. Clone (with LFS for model weights)
git clone https://huggingface.co/Finnish-NLP/Chatterbox-Finnish
cd Chatterbox-Finnish
# 2. Install dependencies (auto-detects your GPU architecture)
bash install_dependencies.sh
# 3. Download pretrained base models from ResembleAI
python setup.py
# 4. Run inference
python inference_example.py
```
> **GPU compatibility:** The install script detects your GPU and picks the right PyTorch build automatically:
> - **Blackwell (sm_120+)** e.g. RTX PRO 6000 → PyTorch 2.10.0 + CUDA 12.8
> - **Older GPUs (A100, RTX 30/40xx, etc.)** → PyTorch 2.5.1 + CUDA 12.4
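The selection logic can be thought of as a map from compute capability to build; the sketch below is an assumed reimplementation for illustration, not the contents of `install_dependencies.sh`:

```python
def pick_torch_build(compute_cap: str) -> str:
    """Map a GPU compute capability (e.g. '12.0') to the PyTorch build
    the install script would select. Assumed logic, for illustration."""
    major = int(compute_cap.split(".")[0])
    if major >= 12:  # Blackwell (sm_120+)
        return "torch==2.10.0 (CUDA 12.8)"
    return "torch==2.5.1 (CUDA 12.4)"

# On a machine with an NVIDIA driver, the capability can be queried with:
#   nvidia-smi --query-gpu=compute_cap --format=csv,noheader
print(pick_torch_build("12.0"))  # torch==2.10.0 (CUDA 12.8)
print(pick_torch_build("8.0"))   # torch==2.5.1 (CUDA 12.4)
```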
---
## 🏃 Running Inference
```python
import torch
import soundfile as sf
from src.chatterbox_.tts import ChatterboxTTS
from safetensors.torch import load_file
device = "cuda" if torch.cuda.is_available() else "cpu"
# 1. Load the base engine
engine = ChatterboxTTS.from_local("./pretrained_models", device=device)
# 2. Inject Finnish fine-tuned weights
checkpoint = load_file("./models/best_finnish_multilingual_cp986.safetensors")
t3_state = {k[3:] if k.startswith("t3.") else k: v for k, v in checkpoint.items()}
engine.t3.load_state_dict(t3_state, strict=False)
# 3. Generate with Finnish-optimized parameters
wav = engine.generate(
text="Tervetuloa kokeilemaan hienoviritettyä suomenkielistä Chatterbox-puhesynteesiä.",
audio_prompt_path="./samples/reference_finnish.wav",
repetition_penalty=1.2,
temperature=0.8,
exaggeration=0.6,
)
sf.write("output.wav", wav.squeeze().cpu().numpy(), engine.sr)
```
Or just run the included example script directly:
```bash
python inference_example.py # outputs output_finnish.wav
```
---
## 🙏 Acknowledgments & Credits
- **Exploration Foundation**: Initial fine-tuning exploration was based on the [chatterbox-finetuning](https://github.com/gokhaneraslan/chatterbox-finetuning) toolkit by gokhaneraslan.
- **Model Authors**: Deep thanks to the team at **ResembleAI** for the [Chatterbox](https://huggingface.co/ResembleAI/chatterbox) model.
- **Single speaker finetuning**: Huge thanks to Mape for letting me fine-tune using audio from the Growth Mindset Builder YouTube channel. (https://www.youtube.com/@Growthmindsetbuilder)
- **Data Sourcing**: Thanks to **#Jobik** at **Nordic AI** Discord for the dataset insights.
## Disclaimer
- **Don't use this model to do bad things.**