Update README.md #25
by elbayadm - opened

README.md CHANGED
---
license: cc-by-nc-4.0
language:
- af
- am
- ar
- as
- az
- be
- bn
- bs
- bg
- ca
- cs
- zh
- cy
- da
- de
- el
- en
- et
- fi
- fr
- or
- om
- ga
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- ig
- id
- is
- it
- jv
- ja
- kn
- ka
- kk
- mn
- km
- ky
- ko
- lo
- ln
- lt
- lb
- lg
- lv
- ml
- mr
- mk
- mt
- mi
- my
- nl
- nb
- ne
- ny
- oc
- pa
- ps
- fa
- pl
- pt
- ro
- ru
- sk
- sl
- sn
- sd
- so
- es
- sr
- sv
- sw
- ta
- te
- tg
- tl
- th
- tr
- uk
- ur
- uz
- vi
- wo
- xh
- yo
- ms
- zu
- ary
- arz
- yue
- kea
metrics:
- bleu
- wer
- chrf
inference: False
pipeline_tag: automatic-speech-recognition
tags:
- audio-to-audio
- text-to-speech
- speech-to-text
- text2text-generation
- seamless_communication
library_name: fairseq2
---
# SeamlessM4T Large (v1)

SeamlessM4T is a collection of models designed to provide high quality translation, allowing people from different
linguistic communities to communicate effortlessly through speech and text.

SeamlessM4T covers:
- 101 languages for speech input
- 96 languages for text input/output
- 35 languages for speech output

-------------------

**🌟 SeamlessM4T v2, an improved version of this model with a novel architecture, has been released [here](https://huggingface.co/facebook/seamless-m4t-v2-large).**

**This new model improves over SeamlessM4T v1 in quality as well as inference speed in speech generation tasks.**

**SeamlessM4T v2 is also supported by 🤗 Transformers; more on it [in the model card of this new version](https://huggingface.co/facebook/seamless-m4t-v2-large#transformers-usage) or directly in the [🤗 Transformers docs](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t_v2).**

-------------------

This is the "large-v1" variant of SeamlessM4T, which enables multiple tasks without relying on multiple separate models:
- Speech-to-speech translation (S2ST)
- Speech-to-text translation (S2TT)
- Text-to-speech translation (T2ST)
- Text-to-text translation (T2TT)
- Automatic speech recognition (ASR)

## SeamlessM4T models

| Model Name | #params | checkpoint | metrics |
| ------------------ | ------- | ---------- | ------- |
| [SeamlessM4T-Large v2](https://huggingface.co/facebook/seamless-m4t-v2-large) | 2.3B | [checkpoint](https://huggingface.co/facebook/seamless-m4t-v2-large/blob/main/seamlessM4T_v2_large.pt) | [metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_large_v2.zip) |
| [SeamlessM4T-Large (v1)](https://huggingface.co/facebook/seamless-m4t-large) | 2.3B | [checkpoint](https://huggingface.co/facebook/seamless-m4t-large/blob/main/multitask_unity_large.pt) | [metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_large.zip) |
| [SeamlessM4T-Medium (v1)](https://huggingface.co/facebook/seamless-m4t-medium) | 1.2B | [checkpoint](https://huggingface.co/facebook/seamless-m4t-medium/blob/main/multitask_unity_medium.pt) | [metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_medium.zip) |

We provide extensive evaluation results of the SeamlessM4T models in the [SeamlessM4T](https://arxiv.org/abs/2308.11596) and [Seamless](https://arxiv.org/abs/2312.05187) papers; the `metrics` files above report them as averages.

## 🤗 Transformers Usage

First, load the processor and a checkpoint of the model:

```python
import torchaudio
from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-large")
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-large")
```

You can seamlessly use this model on text or on audio, to generate either translated text or translated audio.

Here is how to use the processor to process text and audio:

```python
# Read an audio file and resample it to 16 kHz:
audio, orig_freq = torchaudio.load("https://www2.cs.uic.edu/~i101/SoundFiles/preamble10.wav")
audio = torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=16_000)  # must be a 16 kHz waveform array
audio_inputs = processor(audios=audio, return_tensors="pt")

# Process some input text as well:
text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
```

### Speech

Generate speech in Russian from either text (T2ST) or speech input (S2ST):

```python
audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
```

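Once generated, the waveforms can be written to disk as 16 kHz mono `.wav` files. A minimal sketch using `scipy`, reading the output rate from the model config as in the 🤗 Transformers docs:

```python
import scipy.io.wavfile

# The generated waveforms are sampled at model.config.sampling_rate (16 kHz).
sample_rate = model.config.sampling_rate
scipy.io.wavfile.write("speech_from_text.wav", rate=sample_rate, data=audio_array_from_text)
scipy.io.wavfile.write("speech_from_audio.wav", rate=sample_rate, data=audio_array_from_audio)
```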

### Text

Similarly, you can generate translated text from audio files (S2TT) or from text (T2TT, conventionally MT) with the same model.
You only have to pass `generate_speech=False` to [`SeamlessM4TModel.generate`](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel.generate).

```python
# From audio (S2TT)
output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False)
translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)

# From text (T2TT)
output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
```

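The same text path also covers ASR: transcription is simply generation with `tgt_lang` set to the language spoken in the audio, i.e. S2TT with the target equal to the source. A minimal sketch, here assuming the input audio is English:

```python
# ASR: "translate" the audio into its own language to obtain a transcription.
output_tokens = model.generate(**audio_inputs, tgt_lang="eng", generate_speech=False)
transcribed_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
```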

## Seamless_communication

You can also use the SeamlessM4T models through the [`seamless_communication` library](https://github.com/facebookresearch/seamless_communication/blob/main/docs/m4t/README.md), either with the CLI:

```bash
m4t_predict <path_to_input_audio> --task s2st --tgt_lang <tgt_lang> --output_path <path_to_save_audio> --model_name seamlessM4T_large
```

or with a `Translator` API:

```py
import torch
from seamless_communication.inference import Translator

# Initialize a Translator object with a multitask model and vocoder on the GPU.
translator = Translator("seamlessM4T_large", "vocoder_36langs", torch.device("cuda:0"), torch.float16)

# text_generation_opts and unit_generation_opts hold the text/unit generation
# settings (e.g. beam size); see the seamless_communication docs for details.
text_output, speech_output = translator.predict(
    input=<path_to_input_audio>,
    task_str="S2ST",
    tgt_lang=<tgt_lang>,
    text_generation_opts=text_generation_opts,
    unit_generation_opts=unit_generation_opts,
)
```
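
For S2ST, the returned speech output can then be saved to disk. A minimal sketch, assuming `speech_output` exposes `audio_wavs` and `sample_rate` as in the `seamless_communication` README (attribute names may differ across library versions):

```python
import torchaudio

# Save the translated audio; audio_wavs/sample_rate are assumed from the
# seamless_communication README and may differ in other versions of the library.
torchaudio.save(
    <path_to_save_audio>,
    speech_output.audio_wavs[0][0].to(torch.float32).cpu(),
    sample_rate=speech_output.sample_rate,
)
```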

## Citation

If you plan to use SeamlessM4T in your work or any models/datasets/artifacts published in SeamlessM4T, please cite: