# Whisper Small - German-Accented English IPA Transcription
This model is a fine-tuned version of openai/whisper-small trained to transcribe German-accented English speech into IPA (International Phonetic Alphabet) notation instead of standard English orthography.
## Model Description
- Base Model: openai/whisper-small (244M parameters)
- Training Data: 516 manually segmented audio samples from 175 German speakers
- Task: Automatic Speech Recognition (ASR) with IPA output
- Language: English (with German accent)
- Output Format: IPA phonetic transcription
## Intended Use
This model is designed for:
- Phonetic analysis of German-accented English speech
- Linguistic research on L2 English pronunciation
- Educational tools for pronunciation training
- Speech pathology applications
## Training Data
The model was fine-tuned on canpolatbulbul/german-accent-ipa-manual-v1, which contains:
- 516 audio segments (<30 seconds each)
- 175 unique German speakers
- ~8 hours of speech total
- Manually verified IPA transcriptions
## Training Procedure

### Training Hyperparameters
- Learning Rate: 5e-5
- Training Steps: 2000
- Batch Size: 8 per device
- Gradient Accumulation: 2 (effective batch size: 16)
- Warmup Steps: 100
- Optimizer: AdamW
- Mixed Precision: FP16
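The hyperparameters above map onto the Hugging Face `Seq2SeqTrainingArguments` API. The actual training script is not published with this card, so the following is only a sketch of what the configuration likely looked like; `output_dir` is a placeholder:

```python
from transformers import Seq2SeqTrainingArguments

# Approximation of the training configuration listed above;
# not the exact recipe used to produce this checkpoint.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-ipa",    # placeholder path
    learning_rate=5e-5,
    max_steps=2000,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,       # effective batch size: 16
    warmup_steps=100,
    optim="adamw_torch",                 # AdamW
    fp16=True,                           # mixed precision
)
```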
### Training Results
| Metric | Value |
|---|---|
| Training Steps | 2000 |
| Final Training Loss | 0.027 |
| Final Validation Loss | 0.113 |
| Best Validation Loss | 0.093 (step 100) |
| Final WER | 49.72% |
| Phoneme Error Rate (PER) | 8.26% |
| Training Time | ~2.3 hours |
| Epochs | 133 |
Note: WER is not always the best metric for IPA transcription quality. This model produces higher-quality IPA output (better vowel accuracy, stress patterns, and formatting) compared to models with lower WER but less accurate phonetic details.
## Usage

⚠️ **CRITICAL:** You MUST disable language/task forcing to get IPA output instead of English text!
### Basic Usage

```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load model and processor
model_id = "canpolatbulbul/whisper-small-ipa"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# CRITICAL: disable forced English output
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
if model.generation_config is not None:
    model.generation_config.forced_decoder_ids = None
    model.generation_config.suppress_tokens = []

# Load audio at 16 kHz (Whisper's expected sample rate)
audio, sr = librosa.load("path/to/audio.wav", sr=16000)

# Extract log-mel input features
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

# Move model and inputs to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
input_features = input_features.to(device)

# Seed decoding with only the start-of-transcript token (no language forcing);
# 50258 is Whisper's <|startoftranscript|> token id
decoder_input_ids = torch.tensor([[50258]]).to(device)

# Generate IPA transcription
with torch.no_grad():
    predicted_ids = model.generate(
        input_features,
        decoder_input_ids=decoder_input_ids,
        forced_decoder_ids=None,
        suppress_tokens=[],
    )

# Decode token ids to an IPA string
ipa_transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(ipa_transcription)
```
### Example Output

Input Audio: German speakers reading the following passage:
Please call Stella.
Ask her to bring these things with her from the store:
Six spoons of fresh snow peas, five thick slabs of blue cheese,
and maybe a snack for her brother Bob. We also need a small
plastic snake and a big toy frog for the kids. She can scoop
these things into three red bags, and we will go meet her
Wednesday at the train station.
Output 1:
ɪs ˈkɑːd ə ˈbrɪŋ ðɪs ˈθɪŋs wɪθ ˈhɜː frəm ðə ˈsɪks ˈbʊnt ə ˈfrɛʃ
ˈsɜːpɪs ˈfɪft ˈzɪks ˈsɛb əf ˈbɪlʃɪz ənd ˈmeɪvɪ ə ˈsɪs ˈneɪk frəm
ˈbrɛðə ˈbɑːp || wi ˈəʊz ə ˈnɪd ə ˈsɪs ˈməʊ ˈplɛstɪk ənd ə bɪkt tɔɪ ˈfrɒɡ frə ðə ˈkɪds || wi ˈkən ˈskʊp ðɪs ˈθɪŋs əntə
ˈθɪŋ | ˈred ˈbæks | ənd vi ˈvɪtʊk ˈmetɔə ˈwɪntzdeɪ | ənd ðə ˈtrænˈstɪtɪŋ
Output 2:
/ə ˈliːst ˈkaːl ˈdʒʌlə | əsk eɪ də ˈbrɪŋ ðɪs ˈθɪŋz wɪθ ˈhɜː frəm ðə ˈstɔː |
ˈsɪks ˈbɔːndz əv ˈfrɛʊst ˈnəʊˈpɪs | ˈfaɪf ˈdɪks ˈlæps əv ˈbluːt ˈtiːs |
ənd ˈmeɪbɪ ə ˈsɪ ˈneɪk frə ˈhɒː ˈbrɛðə ˈbɒp || wi ˈəʊtə ˈnɪd ə ˈsɪ ˈmɒə
ˈplæstɪk ˈsɪk ənd ə ˈbɪktɔɪ ˈfrɒɡ və ðə ˈkɪds || ʃi ˈkɪnd ˈskʊp dɪs ˈθɪŋz
ɪnd də ˈθɪŋ ˈred ˈbæks | ənd wiːl ˈɡəʊp ˈmeɪd ˈhɜː ˈrɛntseɪ ənd ˈtrænˈstɪt̬əl
## Limitations

### 1. Maximum Output Length
Whisper has an architectural limit of 448 tokens. For very long IPA transcriptions (>30 seconds of dense speech), the output may be truncated. This is a Whisper limitation, not a model training issue.
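A common workaround (not part of this model card's pipeline, but a standard practice) is to split long recordings into chunks of at most 30 seconds, transcribe each chunk, and concatenate the IPA outputs. A minimal sketch using NumPy:

```python
import numpy as np

def chunk_audio(audio: np.ndarray, sample_rate: int = 16000,
                max_seconds: float = 30.0) -> list[np.ndarray]:
    """Split a mono waveform into consecutive chunks of at most
    max_seconds, so each fits within Whisper's 30-second window."""
    chunk_len = int(max_seconds * sample_rate)
    return [audio[i:i + chunk_len] for i in range(0, len(audio), chunk_len)]

# 70 seconds of (silent) 16 kHz audio -> three chunks: 30 s, 30 s, 10 s
audio = np.zeros(70 * 16000, dtype=np.float32)
chunks = chunk_audio(audio)
print([len(c) / 16000 for c in chunks])  # [30.0, 30.0, 10.0]
```

Each chunk can then be passed through the usage code above; note that phonemes spanning a chunk boundary may be transcribed inaccurately.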
### 2. IPA Convention
The model outputs IPA in the style of the training data (broad phonemic transcription). Different IPA conventions exist, and this model follows the specific notation used in the training dataset.
### 3. Accent Specificity
The model is trained on German-accented English. Performance on other accents may vary.
### 4. Audio Quality
Best results are achieved with:
- Clear speech
- Minimal background noise
- 16kHz sample rate
- Audio segments <30 seconds
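As a rough pre-processing step (an illustration, not part of the model itself), leading and trailing silence can be trimmed with a simple amplitude threshold before feature extraction:

```python
import numpy as np

def trim_silence(audio: np.ndarray, threshold: float = 0.01) -> np.ndarray:
    """Trim leading/trailing samples whose absolute amplitude is below
    `threshold` (a crude stand-in for librosa.effects.trim)."""
    voiced = np.flatnonzero(np.abs(audio) >= threshold)
    if voiced.size == 0:
        return audio[:0]  # entirely silence
    return audio[voiced[0]:voiced[-1] + 1]

# Half a second of silence, a one-second 440 Hz tone, then silence again
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
audio = np.concatenate([np.zeros(8000), tone, np.zeros(8000)]).astype(np.float32)
trimmed = trim_silence(audio)
print(len(audio), len(trimmed))  # trimmed is roughly the tone's length
```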
## Evaluation
The model was evaluated on a held-out validation set of 52 samples (10% of the dataset) using Phoneme Error Rate (PER), the standard metric for phonetic transcription evaluation.
### Validation Set Performance
| Metric | Value |
|---|---|
| Phoneme Error Rate (PER) | 8.26% |
| Median PER | 6.34% |
| Standard Deviation | 7.48% |
| Min PER | 0.49% |
| Max PER | 29.38% |
### Metrics Explained
Phoneme Error Rate (PER) measures the edit distance (insertions, deletions, substitutions) between predicted and reference IPA transcriptions at the phoneme level. It is more appropriate for evaluating phonetic transcription than Word Error Rate (WER), which operates at the word level.
- <10% PER: Excellent phoneme-level accuracy ✅ (this model)
- <15% PER: Good performance
- <20% PER: Acceptable performance
The model achieves 8.26% mean PER on unseen validation data, indicating excellent generalization to new speakers with German-accented English.
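PER as described above is plain Levenshtein distance over phoneme sequences, normalized by reference length. A self-contained sketch (the card does not specify its exact phoneme tokenization, so here each IPA symbol is treated as one phoneme, which is an assumption):

```python
def levenshtein(ref: list[str], hyp: list[str]) -> int:
    """Edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def per(reference: str, hypothesis: str) -> float:
    """Phoneme Error Rate: edit distance / reference length."""
    return levenshtein(list(reference), list(hypothesis)) / len(reference)

print(per("ˈhɛləʊ", "ˈhɛloʊ"))  # one substitution over 6 symbols ≈ 0.167
```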
### Training vs Validation Performance
- Training Set PER: 3.22% (expected to be lower due to model familiarity)
- Validation Set PER: 8.26% (true measure of generalization)
- Generalization Gap: 5.04 percentage points (healthy, indicates minimal overfitting)
## Citation

If you use this model, please cite:

```bibtex
@misc{whisper-small-ipa,
  author       = {Can Polat Bulbul},
  title        = {Whisper Small - German-Accented English IPA Transcription},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/canpolatbulbul/whisper-small-ipa}}
}
```
## Acknowledgments
- Base model: OpenAI Whisper
- Training data derived from: canpolatbulbul/german-accent-ipa-clean
- Manual segmentation and verification by Can Polat Bulbul
## License
This model inherits the license from the base Whisper model (Apache 2.0).