| | --- |
| | tags: |
| | - mms |
| | language: |
| | - ab |
| | - af |
| | - ak |
| | - am |
| | - ar |
| | - as |
| | - av |
| | - ay |
| | - az |
| | - ba |
| | - bm |
| | - be |
| | - bn |
| | - bi |
| | - bo |
| | - sh |
| | - br |
| | - bg |
| | - ca |
| | - cs |
| | - ce |
| | - cv |
| | - ku |
| | - cy |
| | - da |
| | - de |
| | - dv |
| | - dz |
| | - el |
| | - en |
| | - eo |
| | - et |
| | - eu |
| | - ee |
| | - fo |
| | - fa |
| | - fj |
| | - fi |
| | - fr |
| | - fy |
| | - ff |
| | - ga |
| | - gl |
| | - gn |
| | - gu |
| | - zh |
| | - ht |
| | - ha |
| | - he |
| | - hi |
| | - hu |
| | - hy |
| | - ig |
| | - ia |
| | - ms |
| | - is |
| | - it |
| | - jv |
| | - ja |
| | - kn |
| | - ka |
| | - kk |
| | - kr |
| | - km |
| | - ki |
| | - rw |
| | - ky |
| | - ko |
| | - kv |
| | - lo |
| | - la |
| | - lv |
| | - ln |
| | - lt |
| | - lb |
| | - lg |
| | - mh |
| | - ml |
| | - mr |
| | - mk |
| | - mg |
| | - mt |
| | - mn |
| | - mi |
| | - my |
| | - nl |
| | - 'no' |
| | - ne |
| | - ny |
| | - oc |
| | - om |
| | - or |
| | - os |
| | - pa |
| | - pl |
| | - pt |
| | - ps |
| | - qu |
| | - ro |
| | - rn |
| | - ru |
| | - sg |
| | - sk |
| | - sl |
| | - sm |
| | - sn |
| | - sd |
| | - so |
| | - es |
| | - sq |
| | - su |
| | - sv |
| | - sw |
| | - ta |
| | - tt |
| | - te |
| | - tg |
| | - tl |
| | - th |
| | - ti |
| | - ts |
| | - tr |
| | - uk |
| | - vi |
| | - wo |
| | - xh |
| | - yo |
| | - zu |
| | - za |
| | license: cc-by-nc-4.0 |
| | datasets: |
| | - google/fleurs |
| | metrics: |
| | - wer |
| | --- |
| | |
| | # Massively Multilingual Speech (MMS) - Finetuned ASR - FL102 |
| |
|
| | This checkpoint is a model fine-tuned for multilingual ASR and is part of Facebook's [Massively Multilingual Speech project](https://research.facebook.com/publications/scaling-speech-technology-to-1000-languages/). |
| | This checkpoint is based on the [Wav2Vec2 architecture](https://huggingface.co/docs/transformers/model_doc/wav2vec2) and makes use of adapter models to transcribe 100+ languages. |
| | The checkpoint consists of **1 billion parameters** and has been fine-tuned from [facebook/mms-1b](https://huggingface.co/facebook/mms-1b) on 102 languages of [Fleurs](https://huggingface.co/datasets/google/fleurs). |
| |
|
| | ## Table of Contents |
| |
|
| | - [Example](#example) |
| | - [Supported Languages](#supported-languages) |
| | - [Model details](#model-details) |
| | - [Additional links](#additional-links) |
| |
|
| | ## Example |
| |
|
| | This MMS checkpoint can be used with [Transformers](https://github.com/huggingface/transformers) to transcribe audio in 102 different |
| | languages. Let's look at a simple example. |
| |
|
| | First, we install `transformers` and some other libraries: |
| | ``` |
| | pip install torch accelerate torchaudio datasets |
| | pip install --upgrade transformers |
| | ``` |
| |
|
| | **Note**: MMS requires `transformers >= 4.30`. If version `4.30` |
| | is not yet available [on PyPI](https://pypi.org/project/transformers/), install `transformers` from |
| | source: |
| | ``` |
| | pip install git+https://github.com/huggingface/transformers.git |
| | ``` |
| |
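
To confirm the version requirement above before loading the model, a small check can help. The following is a minimal sketch using only the standard library. The helper names `parse_release` and `meets_minimum` are illustrative and not part of `transformers`, and the parsing deliberately ignores pre-release suffixes such as `.dev0`.

```python
from importlib import metadata


def parse_release(version: str) -> tuple:
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at the first non-numeric part such as "dev0" or "0rc1"
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "4.30") -> bool:
    """Check whether `installed` is at least `minimum` (numeric parts only)."""
    have, need = parse_release(installed), parse_release(minimum)
    # Pad with zeros so that "4.30" and "4.30.0" compare as equal.
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


try:
    installed = metadata.version("transformers")
    print(f"transformers {installed}: MMS supported = {meets_minimum(installed)}")
except metadata.PackageNotFoundError:
    print("transformers is not installed")
```

If the check reports `False`, upgrading via one of the `pip install` commands above should resolve it.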
|
| | Next, we load a couple of audio samples via `datasets`. Make sure that the audio data is sampled at 16,000 Hz (16 kHz). |
| |
|
| | ```py |
| | from datasets import load_dataset, Audio |
| | |
| | # English |
| | stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True) |
| | stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000)) |
| | en_sample = next(iter(stream_data))["audio"]["array"] |
| | |
| | # French |
| | stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "fr", split="test", streaming=True) |
| | stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000)) |
| | fr_sample = next(iter(stream_data))["audio"]["array"] |
| | ``` |
| |
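
Since the model expects 16 kHz input, it can be worth verifying the rate before transcription. The sketch below assumes the decoded audio behaves like a dictionary that exposes a `sampling_rate` key next to `array`, which is how `datasets` has typically returned decoded audio. `check_sampling_rate` is a hypothetical helper, and the toy dictionary stands in for a real dataset sample.

```python
EXPECTED_RATE = 16_000  # Hz, the rate the model expects


def check_sampling_rate(audio: dict, expected: int = EXPECTED_RATE) -> float:
    """Validate the sampling rate and return the clip duration in seconds."""
    rate = audio["sampling_rate"]
    if rate != expected:
        raise ValueError(f"expected {expected} Hz audio, got {rate} Hz; resample first")
    return len(audio["array"]) / rate


# Toy stand-in for next(iter(stream_data))["audio"]: 32,000 samples of silence.
toy_audio = {"array": [0.0] * 32_000, "sampling_rate": 16_000}
print(check_sampling_rate(toy_audio))
```

Failing early here is a design choice: audio at the wrong rate does not raise an error later on by itself, but it tends to produce poor transcriptions.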
|
| | Next, we load the model and processor: |
| |
|
| | ```py |
| | from transformers import Wav2Vec2ForCTC, AutoProcessor |
| | import torch |
| | |
| | model_id = "facebook/mms-1b-fl102" |
| | |
| | processor = AutoProcessor.from_pretrained(model_id) |
| | model = Wav2Vec2ForCTC.from_pretrained(model_id) |
| | ``` |
| |
|
| | Now we process the audio data, pass it to the model, and decode the model output into a transcription, just as we usually do for Wav2Vec2 models such as [facebook/wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h): |
| |
|
| | ```py |
| | inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt") |
| | |
| | with torch.no_grad(): |
| |     outputs = model(**inputs).logits |
| | |
| | ids = torch.argmax(outputs, dim=-1)[0] |
| | transcription = processor.decode(ids) |
| | # 'joe keton disapproved of films and buster also had reservations about the media' |
| | ``` |
| |
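
The `argmax` and `decode` steps above amount to greedy CTC decoding: take the most likely token per frame, merge consecutive repeats, then drop the blank token. The pure-Python sketch below illustrates the idea on a made-up six-token vocabulary. The vocabulary, the frame IDs, and the `greedy_ctc_decode` helper are assumptions for illustration; the real tokenizer handles further details.

```python
from itertools import groupby

# Hypothetical toy vocabulary; "<pad>" doubles as the CTC blank and "|" separates words.
TOY_VOCAB = {0: "<pad>", 1: "|", 2: "h", 3: "e", 4: "l", 5: "o"}
BLANK_ID = 0


def greedy_ctc_decode(frame_ids, vocab=TOY_VOCAB, blank_id=BLANK_ID):
    """Collapse repeated frame predictions, drop blanks, and join into text."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    tokens = [vocab[i] for i in collapsed if i != blank_id]
    return "".join(tokens).replace("|", " ").strip()


# Per-frame argmax IDs: h h <pad> e l l <pad> l o o | h h e
frame_ids = [2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 1, 2, 2, 3]
print(greedy_ctc_decode(frame_ids))  # hello he
```

Note how the blank between the two runs of `l` keeps the double letter in "hello", while the repeated `h` frames without a blank in between collapse into a single character.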
|
| | We can now keep the same model in memory and simply switch the language adapter by calling the convenient `load_adapter()` function on the model and `set_target_lang()` on the tokenizer. Both take the target language code as input, here `"fra"` for French. |
| |
|
| | ```py |
| | processor.tokenizer.set_target_lang("fra") |
| | model.load_adapter("fra") |
| | |
| | inputs = processor(fr_sample, sampling_rate=16_000, return_tensors="pt") |
| | |
| | with torch.no_grad(): |
| |     outputs = model(**inputs).logits |
| | |
| | ids = torch.argmax(outputs, dim=-1)[0] |
| | transcription = processor.decode(ids) |
| | # "ce dernier est volé tout au long de l'histoire romaine" |
| | ``` |
| |
|
| | The language can be switched in the same way for all other supported languages. To list the language codes this checkpoint supports, run: |
| | ```py |
| | processor.tokenizer.vocab.keys() |
| | ``` |
| |
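
Before switching adapters, it can help to confirm that a language code is actually available. The sketch below assumes the tokenizer vocabulary is a dictionary keyed by language code, as the `vocab.keys()` call above suggests. `TOY_VOCAB` is a made-up stand-in for `processor.tokenizer.vocab`, and `require_language` is a hypothetical helper.

```python
import difflib

# Made-up stand-in for processor.tokenizer.vocab: language code -> token table.
TOY_VOCAB = {
    "eng": {"a": 0, "b": 1},
    "fra": {"a": 0, "é": 1},
    "deu": {"a": 0, "ü": 1},
}


def require_language(vocab: dict, lang: str) -> str:
    """Return `lang` if it is a key of `vocab`, otherwise raise with suggestions."""
    if lang in vocab:
        return lang
    close = difflib.get_close_matches(lang, list(vocab), n=3)
    hint = f" Did you mean: {', '.join(close)}?" if close else ""
    raise KeyError(f"language {lang!r} is not supported.{hint}")


print(require_language(TOY_VOCAB, "fra"))
```

With the real objects, one would pass `processor.tokenizer.vocab` instead of `TOY_VOCAB` and hand the returned code to `set_target_lang()` and `load_adapter()`.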
|
| | For more details, please have a look at [the official docs](https://huggingface.co/docs/transformers/main/en/model_doc/mms). |
| |
|
| | ## Supported Languages |
| |
|
| | This model supports 102 languages. Click the following to toggle the list of all languages supported by this checkpoint, given as [ISO 639-3 codes](https://en.wikipedia.org/wiki/ISO_639-3). |
| | You can find more details about the languages and their ISO 639-3 codes in the [MMS Language Coverage Overview](https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html). |
| | <details> |
| | <summary>Click to toggle</summary> |
| |
|
| | - afr |
| | - amh |
| | - ara |
| | - asm |
| | - ast |
| | - azj-script_latin |
| | - bel |
| | - ben |
| | - bos |
| | - bul |
| | - cat |
| | - ceb |
| | - ces |
| | - ckb |
| | - cmn-script_simplified |
| | - cym |
| | - dan |
| | - deu |
| | - ell |
| | - eng |
| | - est |
| | - fas |
| | - fin |
| | - fra |
| | - ful |
| | - gle |
| | - glg |
| | - guj |
| | - hau |
| | - heb |
| | - hin |
| | - hrv |
| | - hun |
| | - hye |
| | - ibo |
| | - ind |
| | - isl |
| | - ita |
| | - jav |
| | - jpn |
| | - kam |
| | - kan |
| | - kat |
| | - kaz |
| | - kea |
| | - khm |
| | - kir |
| | - kor |
| | - lao |
| | - lav |
| | - lin |
| | - lit |
| | - ltz |
| | - lug |
| | - luo |
| | - mal |
| | - mar |
| | - mkd |
| | - mlt |
| | - mon |
| | - mri |
| | - mya |
| | - nld |
| | - nob |
| | - npi |
| | - nso |
| | - nya |
| | - oci |
| | - orm |
| | - ory |
| | - pan |
| | - pol |
| | - por |
| | - pus |
| | - ron |
| | - rus |
| | - slk |
| | - slv |
| | - sna |
| | - snd |
| | - som |
| | - spa |
| | - srp-script_latin |
| | - swe |
| | - swh |
| | - tam |
| | - tel |
| | - tgk |
| | - tgl |
| | - tha |
| | - tur |
| | - ukr |
| | - umb |
| | - urd-script_arabic |
| | - uzb-script_latin |
| | - vie |
| | - wol |
| | - xho |
| | - yor |
| | - yue-script_traditional |
| | - zlm |
| | - zul |
| |
|
| | </details> |
| |
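
Some entries in the list combine an ISO 639-3 code with a script suffix, for example `cmn-script_simplified`. The following is a minimal sketch for splitting such identifiers, assuming the `-script_` separator seen in the list above is the only variant. `split_language_code` is a hypothetical helper, not part of any library.

```python
from typing import Optional, Tuple


def split_language_code(code: str) -> Tuple[str, Optional[str]]:
    """Split 'xxx-script_name' into ('xxx', 'name'); plain codes yield (code, None)."""
    base, sep, script = code.partition("-script_")
    return (base, script) if sep else (code, None)


for code in ["fra", "cmn-script_simplified", "urd-script_arabic"]:
    print(code, "->", split_language_code(code))
```

The full identifier, including the suffix, is still what would be passed when selecting a language; the split is only useful for display or for looking up the bare ISO 639-3 code.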
|
| | ## Model details |
| |
|
| | - **Developed by:** Vineel Pratap et al. |
| | - **Model type:** Multi-Lingual Automatic Speech Recognition model |
| | - **Language(s):** 100+ languages, see [supported languages](#supported-languages) |
| | - **License:** CC-BY-NC 4.0 license |
| | - **Num parameters:** 1 billion |
| | - **Audio sampling rate:** 16,000 Hz (16 kHz) |
| | - **Cite as:** |
| |
|
| |       @article{pratap2023mms, |
| |         title={Scaling Speech Technology to 1,000+ Languages}, |
| |         author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli}, |
| |         journal={arXiv}, |
| |         year={2023} |
| |       } |
| | |
| | ## Additional Links |
| |
|
| | - [Blog post](https://ai.facebook.com/blog/multilingual-model-speech-recognition/) |
| | - [Transformers documentation](https://huggingface.co/docs/transformers/main/en/model_doc/mms) |
| | - [Paper](https://arxiv.org/abs/2305.13516) |
| | - [GitHub Repository](https://github.com/facebookresearch/fairseq/tree/main/examples/mms#asr) |
| | - [Other **MMS** checkpoints](https://huggingface.co/models?other=mms) |
| | - MMS base checkpoints: |
| |   - [facebook/mms-1b](https://huggingface.co/facebook/mms-1b) |
| |   - [facebook/mms-300m](https://huggingface.co/facebook/mms-300m) |
| | - [Official Space](https://huggingface.co/spaces/facebook/MMS) |
| |
|