This repository contains a pretrained pyannote-audio diarization pipeline fine-tuned on the Simsamu dataset.

The pipeline uses a segmentation model fine-tuned from https://huggingface.co/pyannote/segmentation-3.0 and pretrained embeddings from https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM. The pipeline hyperparameters were also optimized.

How to use medkit/simsamu-diarization with pyannote.audio:

```python
from pyannote.audio import Model, Inference
from pyannote.core import Segment

model = Model.from_pretrained("medkit/simsamu-diarization")
inference = Inference(model)

# inference on the whole file
inference("file.wav")

# inference on an excerpt
excerpt = Segment(start=2.0, end=5.0)
inference.crop("file.wav", excerpt)
```
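A diarization pipeline's output is a set of speech turns, each labeled with a speaker identity. As a minimal illustration of what can be done with such output (pure Python, not part of the pyannote or medkit APIs; the tuple representation and speaker labels are assumptions), here is how per-speaker speaking time could be computed from `(start, end, speaker)` tuples:

```python
# Hypothetical illustration: summarizing diarization output.
# We assume speech turns are represented as (start_seconds,
# end_seconds, speaker_label) tuples; the labels below are made up.
from collections import defaultdict

turns = [
    (0.0, 2.5, "SPEAKER_00"),
    (2.7, 5.0, "SPEAKER_01"),
    (5.2, 6.0, "SPEAKER_00"),
]

# accumulate total speaking time per speaker
speaking_time = defaultdict(float)
for start, end, speaker in turns:
    speaking_time[speaker] += end - start

for speaker, duration in sorted(speaking_time.items()):
    print(speaker, round(duration, 2))
```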
The pipeline can be used with medkit as follows:

```python
from medkit.core.audio import AudioDocument
from medkit.audio.segmentation.pa_speaker_detector import PASpeakerDetector

# init the speaker detector operation
speaker_detector = PASpeakerDetector(
    model="medkit/simsamu-diarization",
    device=0,
    segmentation_batch_size=10,
    embedding_batch_size=10,
)

# create an audio document
audio_doc = AudioDocument.from_file("path/to/audio.wav")

# apply the operation to the audio document
speech_segments = speaker_detector.run([audio_doc.raw_segment])

# display each speech turn and the corresponding speaker
for speech_seg in speech_segments:
    speaker_attr = speech_seg.attrs.get(label="speaker")[0]
    print(speech_seg.span.start, speech_seg.span.end, speaker_attr.value)
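Raw diarization output often contains several consecutive turns from the same speaker separated by short pauses. As a hedged sketch of a common post-processing step (pure Python, not part of medkit; the `merge_turns` helper, the tuple representation, and the 0.5 s gap threshold are assumptions for illustration), such turns could be merged like this:

```python
# Hypothetical post-processing sketch: merge consecutive turns from
# the same speaker when the silent gap between them is shorter than
# a threshold. Turns are assumed to be (start, end, speaker) tuples.
def merge_turns(turns, max_gap=0.5):
    merged = []
    for start, end, speaker in sorted(turns):
        if merged and merged[-1][2] == speaker and start - merged[-1][1] <= max_gap:
            # extend the previous turn instead of starting a new one
            prev_start, _, _ = merged[-1]
            merged[-1] = (prev_start, end, speaker)
        else:
            merged.append((start, end, speaker))
    return merged

turns = [
    (0.0, 1.0, "SPEAKER_00"),
    (1.2, 2.0, "SPEAKER_00"),  # 0.2 s gap -> merged with previous turn
    (3.0, 4.0, "SPEAKER_01"),
]
print(merge_turns(turns))
```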
More info at https://medkit.readthedocs.io/
See also: Simsamu transcription model