Massively Multilingual Speech

About Massively Multilingual Speech

MMS: Scaling Speech Technology to 1,000+ Languages

The Massively Multilingual Speech (MMS) project expands speech technology from about 100 languages to over 1,000 by building a single multilingual speech recognition model that supports more than 1,100 languages (over 10 times more than before), language identification models able to identify more than 4,000 languages (40 times more than before), pretrained models supporting more than 1,400 languages, and text-to-speech models for more than 1,100 languages. Our goal is to make it easier for people to access information and use devices in their preferred language.

You can find details in the paper Scaling Speech Technology to 1000+ Languages and the accompanying blog post.

An overview of the languages covered by MMS can be found here.

Pretrained models

| Model    | Link     |
| -------- | -------- |
| MMS-300M | download |
| MMS-1B   | download |

Example commands to finetune the pretrained models can be found here.

Finetuned models

ASR

| Model        | Languages | Dataset                          | Model    | Supported languages |
| ------------ | --------- | -------------------------------- | -------- | ------------------- |
| MMS-1B:FL102 | 102       | FLEURS                           | download | download            |
| MMS-1B:L1107 | 1107      | MMS-lab                          | download | download            |
| MMS-1B-all   | 1162      | MMS-lab + FLEURS + CV + VP + MLS | download | download            |

TTS

  1. Download the list of iso codes of 1107 languages.
  2. Find the iso code of the target language and download its checkpoint. Each folder contains 3 files: G_100000.pth, config.json, and vocab.txt. G_100000.pth is the generator trained for 100K updates, config.json is the training config, and vocab.txt is the vocabulary for the TTS model.

```
# Examples:
wget https://dl.fbaipublicfiles.com/mms/tts/eng.tar.gz # English (eng)
wget https://dl.fbaipublicfiles.com/mms/tts/azj-script_latin.tar.gz # North Azerbaijani (azj-script_latin)
```
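
After downloading, the archive can be unpacked with tar -xzf. The short Python sketch below does the same thing and checks for the three expected files; the local path and the assumption that the archive unpacks into a folder named after the language code are illustrative, not taken from the repository.

```
import tarfile
from pathlib import Path

# Hypothetical local copy of a downloaded TTS checkpoint archive.
archive = Path("eng.tar.gz")

# Unpack next to the archive (equivalent to: tar -xzf eng.tar.gz).
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(archive.parent)

# Check for the three files described above, assuming the archive
# unpacks into a folder named after the language code.
for name in ["G_100000.pth", "config.json", "vocab.txt"]:
    print(name, (archive.parent / "eng" / name).exists())
```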

LID

| # Languages | Dataset                             | Model    | Dictionary | Supported languages |
| ----------- | ----------------------------------- | -------- | ---------- | ------------------- |
| 126         | FLEURS + VL + MMS-lab-U + MMS-unlab | download | download   | download            |
| 256         | FLEURS + VL + MMS-lab-U + MMS-unlab | download | download   | download            |
| 512         | FLEURS + VL + MMS-lab-U + MMS-unlab | download | download   | download            |
| 1024        | FLEURS + VL + MMS-lab-U + MMS-unlab | download | download   | download            |
| 2048        | FLEURS + VL + MMS-lab-U + MMS-unlab | download | download   | download            |
| 4017        | FLEURS + VL + MMS-lab-U + MMS-unlab | download | download   | download            |

Commands to run inference

ASR

Run this command to transcribe one or more audio files:

cd /path/to/fairseq-py/
python examples/mms/asr/infer/mms_infer.py --model "/path/to/asr/model" --lang lang_code --audio "/path/to/audio_1.wav" "/path/to/audio_2.wav"

For more advanced configuration, and to calculate CER/WER, you can prepare a manifest folder with this format:

```
$ ls /path/to/manifest
dev.tsv dev.wrd dev.ltr dev.uid

# dev.tsv: each line contains <audio path> <number of samples>
$ cat dev.tsv
/
/path/to/audio_1  180000
/path/to/audio_2  200000

$ cat dev.ltr
t h i s | i s | o n e |
t h i s | i s | t w o |

$ cat dev.wrd
this is one
this is two

$ cat dev.uid
audio_1
audio_2
```
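
The four manifest files can be generated with a short script. The sketch below is illustrative rather than part of the repository: it assumes the soundfile package is installed, that absolute audio paths are used with / as the root directory (matching the example above), and that a word-level transcript is already available for each file.

```
import soundfile as sf
from pathlib import Path

# Hypothetical inputs: (audio path, transcript) pairs for the dev split.
utterances = [
    ("/path/to/audio_1.wav", "this is one"),
    ("/path/to/audio_2.wav", "this is two"),
]

manifest_dir = Path("/path/to/manifest")
manifest_dir.mkdir(parents=True, exist_ok=True)

with open(manifest_dir / "dev.tsv", "w") as tsv, \
     open(manifest_dir / "dev.wrd", "w") as wrd, \
     open(manifest_dir / "dev.ltr", "w") as ltr, \
     open(manifest_dir / "dev.uid", "w") as uid:
    tsv.write("/\n")  # first line of dev.tsv is the root directory
    for path, text in utterances:
        n_samples = sf.info(path).frames  # number of audio samples in the file
        tsv.write(f"{path}\t{n_samples}\n")
        wrd.write(text + "\n")
        # letter transcript: characters separated by spaces, words separated by "|"
        ltr.write(" ".join("|".join(text.split())) + " |\n")
        uid.write(Path(path).stem + "\n")
```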

Then run the command below:

```
lang_code=<iso_code>

PYTHONPATH=. PREFIX=INFER HYDRA_FULL_ERROR=1 python examples/speech_recognition/new/infer.py -m --config-dir examples/mms/config/ \
  --config-name infer_common decoding.type=viterbi dataset.max_tokens=4000000 distributed_training.distributed_world_size=1 \
  "common_eval.path='/path/to/asr/model'" task.data='/path/to/manifest' dataset.gen_subset="${lang_code}:dev" common_eval.post_process=letter
```

Available options:

  • To get raw character-based output, change to common_eval.post_process=none

  • To maximize GPU efficiency or avoid out-of-memory (OOM) errors, tune the dataset.max_tokens=??? value

  • To run language model decoding, install flashlight python bindings using

    git clone --recursive git@github.com:flashlight/flashlight.git
    cd flashlight; git checkout 035ead6efefb82b47c8c2e643603e87d38850076
    cd bindings/python
    python3 setup.py install

    Train a KenLM language model and prepare a lexicon file in this format (see the example lexicon entries after this list).

    LANG=<iso> # for example - 'eng', 'azj-script_latin'
    PYTHONPATH=. PREFIX=INFER HYDRA_FULL_ERROR=1 python examples/speech_recognition/new/infer.py --config-dir=examples/mms/asr/config \
      --config-name=infer_common decoding.type=kenlm distributed_training.distributed_world_size=1 \
      decoding.unique_wer_file=true decoding.beam=500 decoding.beamsizetoken=50 \
      task.data=<MANIFEST_FOLDER_PATH> common_eval.path='<MODEL_PATH.pt>' decoding.lexicon=<LEXICON_FILE> decoding.lmpath=<LM_FILE> \
      decoding.results_path=<OUTPUT_DIR> dataset.gen_subset=${LANG}:dev decoding.lmweight=??? decoding.wordscore=???

    We typically sweep lmweight in the range of 0 to 5 and wordscore in the range of -3 to 3. The output directory will contain the reference and hypothesis outputs from the decoder.

    For decoding with character-based language models, use an empty lexicon file (decoding.lexicon=), set decoding.unitlm=True, and sweep over decoding.silweight instead of wordscore.
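
For reference, a lexicon file maps each word to its space-separated letter sequence terminated by the word-boundary symbol |. The entries below are an illustrative sketch, not taken from the repository:

```
hello h e l l o |
world w o r l d |
```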

TTS

Note: clone and install VITS before running inference.

```
# English TTS
$ PYTHONPATH=$PYTHONPATH:/path/to/vits python examples/mms/tts/infer.py --model-dir /path/to/model/eng \
  --wav ./example.wav --txt "Expanding the language coverage of speech technology \
  has the potential to improve access to information for many more people"

# Maithili TTS
$ PYTHONPATH=$PYTHONPATH:/path/to/vits python examples/mms/tts/infer.py --model-dir /path/to/model/mai \
  --wav ./example.wav --txt "मुदा आइ धरि ई तकनीक सौ सं किछु बेसी भाषा तक सीमित छल जे सात हजार \
  सं बेसी ज्ञात भाषाक एकटा अंश अछी"
```

example.wav contains synthesized audio for the language.

LID

Prepare two files in this format:

```
# /path/to/manifest.tsv
/
/path/to/audio1.wav
/path/to/audio2.wav
/path/to/audio3.wav

# /path/to/manifest.lang
eng 1
eng 1
eng 1
```
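
If helpful, the two files can be written with a short script. The sketch below is illustrative, uses hypothetical paths, and simply mirrors the example above (including the placeholder eng 1 label lines).

```
from pathlib import Path

# Hypothetical list of audio files to identify.
wavs = ["/path/to/audio1.wav", "/path/to/audio2.wav", "/path/to/audio3.wav"]

out = Path("/path/to")
with open(out / "manifest.tsv", "w") as tsv, open(out / "manifest.lang", "w") as lang:
    tsv.write("/\n")  # first line is the root directory, as in the example above
    for wav in wavs:
        tsv.write(wav + "\n")
        lang.write("eng 1\n")  # placeholder label line, mirroring the example above
```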

Download the model and the corresponding dictionary file for the LID model, then use the following command to run inference:

$ PYTHONPATH='.' python3 examples/mms/lid/infer.py /path/to/dict/l126/ --path /path/to/models/mms1b_l126.pt \
  --task audio_classification --infer-manifest /path/to/manifest.tsv --output-path <OUTDIR>

The above command assumes there is a file named dict.lang.txt in /path/to/dict/l126/. <OUTDIR>/predictions.txt will contain the predictions from the model for the audio files in manifest.tsv.

Forced Alignment Tooling

We have also developed an efficient forced alignment algorithm, implemented on GPU, that can process very long audio files. The algorithm is open sourced and we provide instructions on how to use it here. We also open source a multilingual alignment model trained on 31K hours of data in 1,130 languages, as well as text normalization scripts.

Official website

https://github.com/facebookresearch/fairseq/tree/main/examples/mms