Whisper Small - German-Accented English IPA Transcription

This model is a fine-tuned version of openai/whisper-small trained to transcribe German-accented English speech into IPA (International Phonetic Alphabet) notation instead of standard English orthography.

Model Description

  • Base Model: openai/whisper-small (244M parameters)
  • Training Data: 516 manually segmented audio samples from 175 German speakers
  • Task: Automatic Speech Recognition (ASR) with IPA output
  • Language: English (with German accent)
  • Output Format: IPA phonetic transcription

Intended Use

This model is designed for:

  • Phonetic analysis of German-accented English speech
  • Linguistic research on L2 English pronunciation
  • Educational tools for pronunciation training
  • Speech pathology applications

Training Data

The model was fine-tuned on canpolatbulbul/german-accent-ipa-manual-v1, which contains:

  • 516 audio segments (<30 seconds each)
  • 175 unique German speakers
  • ~8 hours of speech total
  • Manually verified IPA transcriptions

Training Procedure

Training Hyperparameters

  • Learning Rate: 5e-5
  • Training Steps: 2000
  • Batch Size: 8 per device
  • Gradient Accumulation: 2 (effective batch size: 16)
  • Warmup Steps: 100
  • Optimizer: AdamW
  • Mixed Precision: FP16
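
Expressed as a Hugging Face `Seq2SeqTrainingArguments` configuration, these settings would look roughly as follows. This is a sketch reconstructed from the table above; `output_dir` is a hypothetical path, and any settings not listed above (save/eval cadence, logging) are left at their defaults rather than guessed:

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the hyperparameters listed above. output_dir is a
# hypothetical placeholder, not the actual training configuration.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-ipa",  # hypothetical path
    learning_rate=5e-5,
    max_steps=2000,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,     # effective batch size: 8 * 2 = 16
    warmup_steps=100,
    fp16=True,                         # mixed precision
)
```

Note that AdamW is the default optimizer in `transformers`, consistent with the table above.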

Training Results

| Metric | Value |
|---|---|
| Training Steps | 2000 |
| Final Training Loss | 0.027 |
| Final Validation Loss | 0.113 |
| Best Validation Loss | 0.093 (step 100) |
| Final WER | 49.72% |
| Phoneme Error Rate (PER) | 8.26% |
| Training Time | ~2.3 hours |
| Epochs | 133 |

Note: WER is not always the best metric for IPA transcription quality. This model produces higher-quality IPA output (better vowel accuracy, stress patterns, and formatting) compared to models with lower WER but less accurate phonetic details.
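
The WER/PER gap can be made concrete with a toy example: a single substituted phoneme flips an entire word for WER but costs only one symbol-level edit for PER. The strings below are illustrative, not actual model output:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[-1]

# Toy IPA strings: one substituted symbol (ð -> d) in the first word
ref, hyp = "ðə kæt", "də kæt"
wer = edit_distance(ref.split(), hyp.split()) / len(ref.split())
per = edit_distance(ref.replace(" ", ""), hyp.replace(" ", "")) / len(ref.replace(" ", ""))
print(wer, per)  # → 0.5 0.2
```

One wrong symbol costs 50% WER here but only 20% PER, which is why PER better reflects phonetic transcription quality.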

Usage

⚠️ CRITICAL: You MUST disable language/task forcing to get IPA output instead of English text!

Basic Usage

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Load model and processor
model_id = "canpolatbulbul/whisper-small-ipa"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# CRITICAL: disable forced English output in both configs
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
if model.generation_config is not None:
    model.generation_config.forced_decoder_ids = None
    model.generation_config.suppress_tokens = []

# Load audio at 16 kHz (Whisper's expected sample rate)
audio, sr = librosa.load("path/to/audio.wav", sr=16000)

# Extract log-mel input features
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

# Move model and inputs to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
input_features = input_features.to(device)

# Start generation from <|startoftranscript|> (token id 50258) alone,
# with no language or task token, so the model is free to emit IPA
decoder_input_ids = torch.tensor([[50258]]).to(device)

with torch.no_grad():
    predicted_ids = model.generate(
        input_features,
        decoder_input_ids=decoder_input_ids,
        forced_decoder_ids=None,
        suppress_tokens=[],
    )

# Decode token ids to the IPA string
ipa_transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(ipa_transcription)
```

Example Output

Input Audio: German speakers reading the following passage:

Please call Stella.  
Ask her to bring these things with her from the store:  
Six spoons of fresh snow peas, five thick slabs of blue cheese, 
and maybe a snack for her brother Bob.  We also need a small 
plastic snake and a big toy frog for the kids.  She can scoop 
these things into three red bags, and we will go meet her 
Wednesday at the train station.

Output 1:

ɪs ˈkɑːd ə ˈbrɪŋ ðɪs ˈθɪŋs wɪθ ˈhɜː frəm ðə ˈsɪks ˈbʊnt ə ˈfrɛʃ 
ˈsɜːpɪs ˈfɪft ˈzɪks ˈsɛb əf ˈbɪlʃɪz ənd ˈmeɪvɪ ə ˈsɪs ˈneɪk frəm 
ˈbrɛðə ˈbɑːp || wi ˈəʊz ə ˈnɪd ə ˈsɪs ˈməʊ ˈplɛstɪk ə
nd ə bɪkt tɔɪ ˈfrɒɡ frə ðə ˈkɪds || wi ˈkən ˈskʊp ðɪs ˈθɪŋs əntə 
ˈθɪŋ | ˈred ˈbæks | ənd vi ˈvɪtʊk ˈmetɔə ˈwɪntzdeɪ | ənd ðə ˈtrænˈstɪtɪŋ

Output 2:

/ə ˈliːst ˈkaːl ˈdʒʌlə | əsk eɪ də ˈbrɪŋ ðɪs ˈθɪŋz wɪθ ˈhɜː frəm ðə ˈstɔː | 
ˈsɪks ˈbɔːndz əv ˈfrɛʊst ˈnəʊˈpɪs | ˈfaɪf ˈdɪks ˈlæps əv ˈbluːt ˈtiːs | 
ənd ˈmeɪbɪ ə ˈsɪ ˈneɪk frə ˈhɒː ˈbrɛðə ˈbɒp || wi ˈəʊtə ˈnɪd ə ˈsɪ ˈmɒə 
ˈplæstɪk ˈsɪk ənd ə ˈbɪktɔɪ ˈfrɒɡ və ðə ˈkɪds || ʃi ˈkɪnd ˈskʊp dɪs ˈθɪŋz 
ɪnd də ˈθɪŋ ˈred ˈbæks | ənd wiːl ˈɡəʊp ˈmeɪd ˈhɜː ˈrɛntseɪ ənd ˈtrænˈstɪt̬əl

Limitations

1. Maximum Output Length

Whisper has an architectural limit of 448 tokens. For very long IPA transcriptions (>30 seconds of dense speech), the output may be truncated. This is a Whisper limitation, not a model training issue.
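
For audio longer than 30 seconds, one simple workaround (a sketch, not a feature of the released model) is to split the waveform into chunks of at most 30 seconds and transcribe each chunk separately, concatenating the results:

```python
import numpy as np

def chunk_audio(audio, sr=16000, max_seconds=30.0):
    """Split a waveform into consecutive segments of at most max_seconds.

    Note: naive fixed-length chunking may cut words in half; splitting
    at silences would give cleaner segment boundaries.
    """
    max_len = int(sr * max_seconds)
    return [audio[i:i + max_len] for i in range(0, len(audio), max_len)]

# Example: 75 s of audio at 16 kHz splits into 30 s + 30 s + 15 s chunks
audio = np.zeros(75 * 16000, dtype=np.float32)
chunks = chunk_audio(audio)
print([len(c) / 16000 for c in chunks])  # → [30.0, 30.0, 15.0]
```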

2. IPA Convention

The model outputs IPA in the style of the training data (broad phonemic transcription). Different IPA conventions exist, and this model follows the specific notation used in the training dataset.

3. Accent Specificity

The model is trained on German-accented English. Performance on other accents may vary.

4. Audio Quality

Best results are achieved with:

  • Clear speech
  • Minimal background noise
  • 16kHz sample rate
  • Audio segments <30 seconds

Evaluation

The model was evaluated on a held-out validation set of 52 samples (10% of the dataset) using Phoneme Error Rate (PER), the standard metric for phonetic transcription evaluation.

Validation Set Performance

| Metric | Value |
|---|---|
| Phoneme Error Rate (PER) | 8.26% |
| Median PER | 6.34% |
| Standard Deviation | 7.48% |
| Min PER | 0.49% |
| Max PER | 29.38% |

Metrics Explained

Phoneme Error Rate (PER) measures the edit distance (insertions, deletions, substitutions) between predicted and reference IPA transcriptions at the phoneme level. It is more appropriate for evaluating phonetic transcription than Word Error Rate (WER), which operates at the word level.

  • <10% PER: Excellent phoneme-level accuracy ✅ (this model)
  • <15% PER: Good performance
  • <20% PER: Acceptable performance

The model achieves 8.26% mean PER on unseen validation data, indicating excellent generalization to new speakers with German-accented English.
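
PER as described above can be computed with a phoneme-level Levenshtein distance. A minimal sketch follows; it treats each entry of the input sequences as one phoneme, which is a simplification — real IPA strings need a tokenizer that groups diacritics, length marks, and multi-character symbols:

```python
def phoneme_error_rate(ref, hyp):
    """Levenshtein distance over phoneme sequences, divided by len(ref)."""
    if not ref:
        return float(len(hyp))
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[-1] / len(ref)

# One substitution (ð -> d) in a five-phoneme reference -> PER = 1/5
ref = ["ð", "ə", "k", "æ", "t"]
hyp = ["d", "ə", "k", "æ", "t"]
print(phoneme_error_rate(ref, hyp))  # → 0.2
```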

Training vs Validation Performance

  • Training Set PER: 3.22% (expected to be lower due to model familiarity)
  • Validation Set PER: 8.26% (true measure of generalization)
  • Generalization Gap: 5.04 percentage points (healthy, indicates minimal overfitting)

Citation

If you use this model, please cite:

```bibtex
@misc{whisper-small-ipa,
  author = {Can Polat Bulbul},
  title = {Whisper Small - German-Accented English IPA Transcription},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/canpolatbulbul/whisper-small-ipa}}
}
```

License

This model inherits the license from the base Whisper model (Apache 2.0).
