Whisper Small - German-Accented English IPA Transcription

This model is a fine-tuned version of openai/whisper-small trained to transcribe German-accented English speech into IPA (International Phonetic Alphabet) notation instead of standard English orthography.

Model Description

  • Base Model: openai/whisper-small (244M parameters)
  • Training Data: 516 manually segmented audio samples from 175 German speakers
  • Task: Automatic Speech Recognition (ASR) with IPA output
  • Language: English (with German accent)
  • Output Format: IPA phonetic transcription

Intended Use

This model is designed for:

  • Phonetic analysis of German-accented English speech
  • Linguistic research on L2 English pronunciation
  • Educational tools for pronunciation training
  • Speech pathology applications

Training Data

The model was fine-tuned on canpolatbulbul/german-accent-ipa-manual-v1, which contains:

  • 516 audio segments (<30 seconds each)
  • 175 unique German speakers
  • ~8 hours of speech total
  • Manually verified IPA transcriptions

Training Procedure

Training Hyperparameters

  • Learning Rate: 5e-5
  • Training Steps: 2000
  • Batch Size: 8 per device
  • Gradient Accumulation: 2 (effective batch size: 16)
  • Warmup Steps: 100
  • Optimizer: AdamW
  • Mixed Precision: FP16
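
Expressed as a Hugging Face `Seq2SeqTrainingArguments` configuration, these settings would look roughly as follows. This is a sketch reconstructed from the table above; `output_dir` is a hypothetical path, and any settings not listed above (save/eval cadence, logging) are left at their defaults rather than guessed:

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the hyperparameters listed above. output_dir is a
# hypothetical placeholder, not the actual training configuration.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-ipa",  # hypothetical path
    learning_rate=5e-5,
    max_steps=2000,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,     # effective batch size: 8 * 2 = 16
    warmup_steps=100,
    fp16=True,                         # mixed precision
)
```

Note that AdamW is the default optimizer in `transformers`, consistent with the table above.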

Training Results

| Metric | Value |
|---|---|
| Training Steps | 2000 |
| Final Training Loss | 0.027 |
| Final Validation Loss | 0.113 |
| Best Validation Loss | 0.093 (step 100) |
| Final WER | 49.72% |
| Phoneme Error Rate (PER) | 8.26% |
| Training Time | ~2.3 hours |
| Epochs | 133 |

Note: WER is not always the best metric for IPA transcription quality. This model produces higher-quality IPA output (better vowel accuracy, stress patterns, and formatting) compared to models with lower WER but less accurate phonetic details.
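
The WER/PER gap can be made concrete with a toy example: a single substituted phoneme flips an entire word for WER but costs only one symbol-level edit for PER. The strings below are illustrative, not actual model output:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[-1]

# Toy IPA strings: one substituted symbol (ð -> d) in the first word
ref, hyp = "ðə kæt", "də kæt"
wer = edit_distance(ref.split(), hyp.split()) / len(ref.split())
per = edit_distance(ref.replace(" ", ""), hyp.replace(" ", "")) / len(ref.replace(" ", ""))
print(wer, per)  # → 0.5 0.2
```

One wrong symbol costs 50% WER here but only 20% PER, which is why PER better reflects phonetic transcription quality.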

Usage

⚠️ CRITICAL: You MUST disable language/task forcing to get IPA output instead of English text!

Basic Usage

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Load model and processor
model_id = "canpolatbulbul/whisper-small-ipa"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# CRITICAL: disable forced English output in both configs
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
if model.generation_config is not None:
    model.generation_config.forced_decoder_ids = None
    model.generation_config.suppress_tokens = []

# Load audio at 16 kHz (Whisper's expected sample rate)
audio, sr = librosa.load("path/to/audio.wav", sr=16000)

# Extract log-mel input features
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

# Move model and inputs to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
input_features = input_features.to(device)

# Start generation from <|startoftranscript|> (token id 50258) alone,
# with no language or task token, so the model is free to emit IPA
decoder_input_ids = torch.tensor([[50258]]).to(device)

with torch.no_grad():
    predicted_ids = model.generate(
        input_features,
        decoder_input_ids=decoder_input_ids,
        forced_decoder_ids=None,
        suppress_tokens=[],
    )

# Decode token ids to the IPA string
ipa_transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(ipa_transcription)
```

Example Output

Input Audio: German speakers reading the following passage:

Please call Stella.  
Ask her to bring these things with her from the store:  
Six spoons of fresh snow peas, five thick slabs of blue cheese, 
and maybe a snack for her brother Bob.  We also need a small 
plastic snake and a big toy frog for the kids.  She can scoop 
these things into three red bags, and we will go meet her 
Wednesday at the train station.

Output 1:

ɪs ˈkɑːd ə ˈbrɪŋ ðɪs ˈθɪŋs wɪθ ˈhɜː frəm ðə ˈsɪks ˈbʊnt ə ˈfrɛʃ 
ˈsɜːpɪs ˈfɪft ˈzɪks ˈsɛb əf ˈbɪlʃɪz ənd ˈmeɪvɪ ə ˈsɪs ˈneɪk frəm 
ˈbrɛðə ˈbɑːp || wi ˈəʊz ə ˈnɪd ə ˈsɪs ˈməʊ ˈplɛstɪk ə
nd ə bɪkt tɔɪ ˈfrɒɡ frə ðə ˈkɪds || wi ˈkən ˈskʊp ðɪs ˈθɪŋs əntə 
ˈθɪŋ | ˈred ˈbæks | ənd vi ˈvɪtʊk ˈmetɔə ˈwɪntzdeɪ | ənd ðə ˈtrænˈstɪtɪŋ

Output 2:

/ə ˈliːst ˈkaːl ˈdʒʌlə | əsk eɪ də ˈbrɪŋ ðɪs ˈθɪŋz wɪθ ˈhɜː frəm ðə ˈstɔː | 
ˈsɪks ˈbɔːndz əv ˈfrɛʊst ˈnəʊˈpɪs | ˈfaɪf ˈdɪks ˈlæps əv ˈbluːt ˈtiːs | 
ənd ˈmeɪbɪ ə ˈsɪ ˈneɪk frə ˈhɒː ˈbrɛðə ˈbɒp || wi ˈəʊtə ˈnɪd ə ˈsɪ ˈmɒə 
ˈplæstɪk ˈsɪk ənd ə ˈbɪktɔɪ ˈfrɒɡ və ðə ˈkɪds || ʃi ˈkɪnd ˈskʊp dɪs ˈθɪŋz 
ɪnd də ˈθɪŋ ˈred ˈbæks | ənd wiːl ˈɡəʊp ˈmeɪd ˈhɜː ˈrɛntseɪ ənd ˈtrænˈstɪt̬əl

Limitations

1. Maximum Output Length

Whisper has an architectural limit of 448 tokens. For very long IPA transcriptions (>30 seconds of dense speech), the output may be truncated. This is a Whisper limitation, not a model training issue.
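
For audio longer than 30 seconds, one simple workaround (a sketch, not a feature of the released model) is to split the waveform into chunks of at most 30 seconds and transcribe each chunk separately, concatenating the results:

```python
import numpy as np

def chunk_audio(audio, sr=16000, max_seconds=30.0):
    """Split a waveform into consecutive segments of at most max_seconds.

    Note: naive fixed-length chunking may cut words in half; splitting
    at silences would give cleaner segment boundaries.
    """
    max_len = int(sr * max_seconds)
    return [audio[i:i + max_len] for i in range(0, len(audio), max_len)]

# Example: 75 s of audio at 16 kHz splits into 30 s + 30 s + 15 s chunks
audio = np.zeros(75 * 16000, dtype=np.float32)
chunks = chunk_audio(audio)
print([len(c) / 16000 for c in chunks])  # → [30.0, 30.0, 15.0]
```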

2. IPA Convention

The model outputs IPA in the style of the training data (broad phonemic transcription). Different IPA conventions exist, and this model follows the specific notation used in the training dataset.

3. Accent Specificity

The model is trained on German-accented English. Performance on other accents may vary.

4. Audio Quality

Best results are achieved with:

  • Clear speech
  • Minimal background noise
  • 16kHz sample rate
  • Audio segments <30 seconds

Evaluation

The model was evaluated on a held-out validation set of 52 samples (10% of the dataset) using Phoneme Error Rate (PER), the standard metric for phonetic transcription evaluation.

Validation Set Performance

| Metric | Value |
|---|---|
| Phoneme Error Rate (PER) | 8.26% |
| Median PER | 6.34% |
| Standard Deviation | 7.48% |
| Min PER | 0.49% |
| Max PER | 29.38% |

Metrics Explained

Phoneme Error Rate (PER) measures the edit distance (insertions, deletions, substitutions) between predicted and reference IPA transcriptions at the phoneme level. It is more appropriate for evaluating phonetic transcription than Word Error Rate (WER), which operates at the word level.

  • <10% PER: Excellent phoneme-level accuracy ✅ (this model)
  • <15% PER: Good performance
  • <20% PER: Acceptable performance

The model achieves 8.26% mean PER on unseen validation data, indicating excellent generalization to new speakers with German-accented English.
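
PER as described above can be computed with a phoneme-level Levenshtein distance. A minimal sketch follows; it treats each entry of the input sequences as one phoneme, which is a simplification — real IPA strings need a tokenizer that groups diacritics, length marks, and multi-character symbols:

```python
def phoneme_error_rate(ref, hyp):
    """Levenshtein distance over phoneme sequences, divided by len(ref)."""
    if not ref:
        return float(len(hyp))
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[-1] / len(ref)

# One substitution (ð -> d) in a five-phoneme reference -> PER = 1/5
ref = ["ð", "ə", "k", "æ", "t"]
hyp = ["d", "ə", "k", "æ", "t"]
print(phoneme_error_rate(ref, hyp))  # → 0.2
```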

Training vs Validation Performance

  • Training Set PER: 3.22% (expected to be lower due to model familiarity)
  • Validation Set PER: 8.26% (true measure of generalization)
  • Generalization Gap: 5.04 percentage points (healthy, indicates minimal overfitting)

Citation

If you use this model, please cite:

```bibtex
@misc{whisper-small-ipa,
  author = {Can Polat Bulbul},
  title = {Whisper Small - German-Accented English IPA Transcription},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/canpolatbulbul/whisper-small-ipa}}
}
```

License

This model inherits the license from the base Whisper model (Apache 2.0).
