Milo-ASR-v2

A promptable Danish speech model. Unlike a plain transcriber, Milo-ASR-v2 takes an instruction telling it what to do with the audio: transcribe it, translate it to English, or return structured JSON. A single model handles all three tasks, selected by the instruction you give it.

It's a fine-tune of ibm-granite/granite-speech-4.1-2b-plus on the Danish CoRal-v3 corpus.

If you only need the lowest possible Danish transcription error rate, a dedicated Whisper-style Danish ASR model will be more accurate. Milo-ASR-v2 is for when you want one model that follows instructions about the audio — and an easy path to speaker-attributed transcripts (see below).

How it works

flowchart LR
    A([Danish audio]) --> M{{Milo-ASR-v2}}
    P([Your instruction]) --> M
    M --> T([Danish transcript])
    M --> E([English translation])
    M --> J([JSON output])

The same audio yields a transcript, an English translation, or JSON — selected by the instruction.

What it can do

All outputs below are real model outputs, not hand-written:

| Instruction | Output |
|---|---|
| transskriber talen til dansk tekst. | Mest beværgelsesværdige er områderne med faciliteter for sportsgrene indenfor de alpine discipliner samt downhill mountainbike i år |
| Transcribe the speech and translate it into English. | Most popular is the areas with facilities for sports activities ... downhill mountain bike |
| Transcribe and return JSON like {"transcript": "..."}. | {"transcription": "Mest beværgelsesværdige er områderne med ..."} |
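
Note that the JSON row returns a "transcription" key although the instruction asked for "transcript", so downstream code probably should not depend on the exact key name. The following is a minimal sketch of a tolerant parser. The helper name extract_transcript and the list of accepted keys are assumptions made for illustration; they are not part of the model or of any library.

```python
import json

# Assumption: these are the only key names worth accepting, based on the example row above.
ACCEPTED_KEYS = ("transcript", "transcription")

def extract_transcript(raw: str) -> str:
    """Return the transcript text from a model reply that should contain a JSON object."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return raw.strip()              # no JSON object found: fall back to the raw text
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return raw.strip()              # malformed or truncated JSON: fall back to the raw text
    if isinstance(data, dict):
        for key in ACCEPTED_KEYS:
            if isinstance(data.get(key), str):
                return data[key]
    return raw.strip()                  # valid JSON but no known key

print(extract_transcript('{"transcription": "hej med dig"}'))
```

Falling back to the raw text rather than raising keeps a batch job running when a reply is cut off by the token limit; stricter callers may prefer to raise instead.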

Quickstart

import torch, soundfile as sf, librosa
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

MODEL = "pluttodk/Milo-ASR-v2"
processor = AutoProcessor.from_pretrained(MODEL)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    MODEL, dtype=torch.bfloat16, device_map="auto").eval()

SYSTEM = ("Knowledge Cutoff Date: April 2024.\nToday's Date: December 19, 2024.\n"
          "You are Granite, developed by IBM. You are a helpful AI assistant")

def run(wav_path, instruction, max_new_tokens=200):
    # Load the clip, downmix to mono, and resample to 16 kHz.
    audio, sr = sf.read(wav_path, dtype="float32", always_2d=False)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)
    if sr != 16000:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
    audio = torch.from_numpy(audio).unsqueeze(0)  # shape (1, num_samples)
    # The <|audio|> placeholder marks where the audio is inserted into the prompt.
    chat = [{"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"<|audio|> {instruction}"}]
    text = processor.tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    inputs = processor(text, audio, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=5,
                         repetition_penalty=1.1, no_repeat_ngram_size=4, early_stopping=True)
    # Decode only the newly generated tokens, skipping the prompt.
    return processor.tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

print(run("clip.wav", "transskriber talen til dansk tekst."))                   # Danish transcript
print(run("clip.wav", "Transcribe the speech and translate it into English."))  # English translation
print(run("clip.wav", 'Transcribe and return JSON like {"transcript": "..."}.'))  # JSON

Speaker diarization — "who said what"

Milo-ASR-v2's own inline speaker tags are not reliable. For dependable speaker-attributed transcripts, run a small diarization front-end (voice activity detection → speaker embeddings → clustering) and let Milo-ASR-v2 transcribe each speaker turn:

flowchart LR
    A([Audio]) --> V[Voice activity detection]
    V --> S[Speaker embeddings + clustering]
    S --> G[Speaker turns]
    G --> R{{Milo-ASR-v2 per turn}}
    R --> O(["Speaker 1: ...   Speaker 2: ..."])

A complete, self-contained example using only open tools (pip install torch transformers silero-vad scikit-learn librosa soundfile; the ReDimNet speaker embedder is fetched through torch.hub on first run):

import torch, numpy as np, librosa, soundfile as sf
from sklearn.cluster import AgglomerativeClustering
from silero_vad import load_silero_vad, get_speech_timestamps
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

SR = 16000
SYSTEM = ("Knowledge Cutoff Date: April 2024.\nToday's Date: December 19, 2024.\n"
          "You are Granite, developed by IBM. You are a helpful AI assistant")
NUM_SPEAKERS = 2          # set the known count, or None to auto-estimate

# load mono 16 kHz audio
wav, sr = sf.read("meeting.wav", dtype="float32", always_2d=False)
if wav.ndim > 1: wav = wav.mean(axis=1)
if sr != SR: wav = librosa.resample(wav, orig_sr=sr, target_sr=SR)

# 1) voice activity detection -> speech segments
vad = load_silero_vad()
segments = get_speech_timestamps(torch.from_numpy(wav), vad, sampling_rate=SR, return_seconds=True)

# 2) speaker embedding per segment (ReDimNet) -> clustering
embedder = torch.hub.load("IDRnD/ReDimNet", "ReDimNet",
                          model_name="b2", train_type="ptn", dataset="vox2", trust_repo=True).eval()
def embed(a):
    with torch.no_grad():
        e = embedder(torch.from_numpy(a).float().unsqueeze(0))[0].cpu().numpy()
    return e / (np.linalg.norm(e) + 1e-9)
emb = np.stack([embed(wav[int(s["start"]*SR):int(s["end"]*SR)]) for s in segments])
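# scikit-learn requires distance_threshold when n_clusters is None;
# the 0.7 cosine-distance cut-off is an untuned assumption, adjust it for your audio.
labels = AgglomerativeClustering(
    n_clusters=NUM_SPEAKERS, metric="cosine", linkage="average",
    distance_threshold=None if NUM_SPEAKERS else 0.7).fit_predict(emb)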

# 3) transcribe each turn with Milo-ASR-v2
processor = AutoProcessor.from_pretrained("pluttodk/Milo-ASR-v2")
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "pluttodk/Milo-ASR-v2", dtype=torch.bfloat16, device_map="auto").eval()
def transcribe(a):
    chat = [{"role": "system", "content": SYSTEM},
            {"role": "user", "content": "<|audio|> transskriber talen til dansk tekst."}]
    text = processor.tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    inp = processor(text, torch.from_numpy(a).float().unsqueeze(0), return_tensors="pt").to(model.device)
    out = model.generate(**inp, max_new_tokens=200, num_beams=5, early_stopping=True)
    return processor.tokenizer.decode(out[0, inp["input_ids"].shape[-1]:], skip_special_tokens=True)

# 4) print speaker-attributed transcript in time order
speaker_no = {}
for seg, lab in sorted(zip(segments, labels), key=lambda x: x[0]["start"]):
    n = speaker_no.setdefault(int(lab), len(speaker_no) + 1)
    turn = wav[int(seg["start"]*SR):int(seg["end"]*SR)]
    print(f"[Speaker {n}]: {transcribe(turn)}")
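
The recipe above prints one line per voice-activity segment, so a speaker who pauses mid-sentence shows up as several consecutive lines. A possible refinement is to merge adjacent segments that share a cluster label into a single turn before transcribing. The sketch below is one way to do that. The function name merge_turns and the 0.5 second max_gap default are assumptions for illustration and have not been tuned against this model.

```python
def merge_turns(segments, labels, max_gap=0.5):
    """Merge time-adjacent segments with the same speaker label into turns.

    segments: list of {"start": seconds, "end": seconds} dicts (as returned by the VAD step)
    labels:   one cluster label per segment
    max_gap:  assumed maximum silence, in seconds, allowed inside a single turn
    """
    turns = []
    for seg, lab in sorted(zip(segments, labels), key=lambda x: x[0]["start"]):
        lab = int(lab)
        last = turns[-1] if turns else None
        if last and last["speaker"] == lab and seg["start"] - last["end"] <= max_gap:
            last["end"] = max(last["end"], seg["end"])    # extend the current turn
        else:
            turns.append({"start": seg["start"], "end": seg["end"], "speaker": lab})
    return turns

# Illustrative input only: four segments, two speakers.
example_segments = [{"start": 0.0, "end": 1.2}, {"start": 1.4, "end": 2.0},
                    {"start": 2.3, "end": 3.0}, {"start": 5.0, "end": 6.0}]
example_turns = merge_turns(example_segments, [0, 0, 1, 1])
print(example_turns)
```

Each merged turn can then be sliced as wav[int(turn["start"]*SR):int(turn["end"]*SR)] and passed to transcribe. Longer turns may need a larger max_new_tokens than the 200 used above.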

Performance

Danish ASR on the full CoRal-v3 test set (strict normalisation):

| Split | WER | CER |
|---|---|---|
| read-aloud | 18.1% | 8.5% |
| conversation | 33.2% | 19.9% |
| weighted | 25.3% | 14.0% |

On clean read-aloud validation it reaches ~16.8% WER / 6.9% CER. A dedicated Whisper-style Danish ASR model is more accurate on raw transcription; pick Milo-ASR-v2 for its promptable, multi-task behaviour.
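
To check the model on your own clips, note that WER and CER are both an edit distance divided by the reference length, counted over words and characters respectively. A minimal pure-Python sketch follows. The normalise function is only a guess at what a strict normalisation might involve (lowercasing, stripping punctuation, collapsing whitespace); it is not the evaluation script behind the figures above, so numbers produced this way are not directly comparable.

```python
import re

def normalise(text: str) -> str:
    # Assumed normalisation: lowercase, drop punctuation, collapse whitespace.
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words, hyp_words = normalise(ref).split(), normalise(hyp).split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)

def cer(ref: str, hyp: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    ref_chars, hyp_chars = normalise(ref), normalise(hyp)
    return edit_distance(ref_chars, hyp_chars) / max(len(ref_chars), 1)

# One dropped word out of five reference words.
print(wer("Det er en god dag.", "det er en dag"))
```

For corpus-level figures, sum the edit distances and the reference lengths over all utterances before dividing, rather than averaging per-utterance rates.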

Limitations

  • Higher word error rate than a dedicated Whisper Danish ASR model — see above.
  • English translation is serviceable, not publication-grade.
  • Inline speaker tags are unreliable — use the diarization recipe above for speaker-attributed output.
  • Trained on read-aloud and conversational Danish; very noisy or far-field audio is out of distribution.

How it was trained

Milo-ASR-v2 was adapted to Danish from IBM Granite-Speech in two stages: first a Danish ASR fine-tune on CoRal-v3, then a mixed-task instruction fine-tune (transcribe / translate / structured output) so it follows instructions while keeping its transcription quality.

License

Apache-2.0. Base model: IBM Granite-Speech-4.1-2B-Plus (Apache-2.0). Training data: CoRal-v3.
