---
license: cc-by-nc-4.0
language:
- af
- am
- ar
- as
- az
- be
- bn
- bs
- bg
- ca
- cs
- zh
- cy
- da
- de
- el
- en
- et
- fi
- fr
- or
- om
- ga
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- ig
- id
- is
- it
- jv
- ja
- kn
- ka
- kk
- mn
- km
- ky
- ko
- lo
- ln
- lt
- lb
- lg
- lv
- ml
- mr
- mk
- mt
- mi
- my
- nl
- nb
- ne
- ny
- oc
- pa
- ps
- fa
- pl
- pt
- ro
- ru
- sk
- sl
- sn
- sd
- so
- es
- sr
- sv
- sw
- ta
- te
- tg
- tl
- th
- tr
- uk
- ur
- uz
- vi
- wo
- xh
- yo
- ms
- zu
- ary
- arz
- yue
- kea
metrics:
- bleu
- wer
- chrf
inference: false
pipeline_tag: automatic-speech-recognition
tags:
- audio-to-audio
- text-to-speech
- speech-to-text
- text2text-generation
- seamless_communication
library_name: fairseq2
---
# SeamlessM4T Large (v1)

SeamlessM4T is a collection of models designed to provide high-quality translation, allowing people from different
linguistic communities to communicate effortlessly through speech and text.

SeamlessM4T covers:
- 📥 101 languages for speech input
- ⌨️ 96 languages for text input/output
- 🗣️ 35 languages for speech output

-------------------
**🌟 SeamlessM4T v2, an improved version of this model with a novel architecture, has been released [here](https://huggingface.co/facebook/seamless-m4t-v2-large).**

**The new model improves over SeamlessM4T v1 in translation quality as well as inference speed on speech generation tasks.**

**SeamlessM4T v2 is also supported by 🤗 Transformers; more details are available [in the model card of the new version](https://huggingface.co/facebook/seamless-m4t-v2-large#transformers-usage) or directly in the [🤗 Transformers docs](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t_v2).**

-------------------

This is the "large-v1" variant of SeamlessM4T, which enables multiple tasks without relying on multiple separate models:
- Speech-to-speech translation (S2ST)
- Speech-to-text translation (S2TT)
- Text-to-speech translation (T2ST)
- Text-to-text translation (T2TT)
- Automatic speech recognition (ASR)
 
  ## SeamlessM4T models
 
| Model Name | #params | checkpoint | metrics |
| ------------------ | ------- | ---------- | ------- |
| [SeamlessM4T-Large v2](https://huggingface.co/facebook/seamless-m4t-v2-large) | 2.3B | [checkpoint](https://huggingface.co/facebook/seamless-m4t-v2-large/blob/main/seamlessM4T_v2_large.pt) | [metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_large_v2.zip) |
| [SeamlessM4T-Large (v1)](https://huggingface.co/facebook/seamless-m4t-large) | 2.3B | [checkpoint](https://huggingface.co/facebook/seamless-m4t-large/blob/main/multitask_unity_large.pt) | [metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_large.zip) |
| [SeamlessM4T-Medium (v1)](https://huggingface.co/facebook/seamless-m4t-medium) | 1.2B | [checkpoint](https://huggingface.co/facebook/seamless-m4t-medium/blob/main/multitask_unity_medium.pt) | [metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_medium.zip) |

We provide extensive evaluation results for the SeamlessM4T models in the [SeamlessM4T](https://arxiv.org/abs/2308.11596) and [Seamless](https://arxiv.org/abs/2312.05187) papers; per-task averages are reported in the `metrics` files linked above.
 
  ## 🤗 Transformers Usage
 
First, load the processor and a checkpoint of the model:

```python
import torchaudio
from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-large")
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-large")
```
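
If a GPU is available, you can optionally move the model onto it before generating. A minimal sketch, not part of the original card, assuming a CUDA device:

```python
import torch

# Optional: move the model to a CUDA device for faster generation.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
```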
 
You can seamlessly use this model on text or on audio, to generate either translated text or translated audio.

Here is how to use the processor to process text and audio:

```python
# Read an audio file and resample it to 16 kHz (the rate the model expects):
audio, orig_freq = torchaudio.load("https://www2.cs.uic.edu/~i101/SoundFiles/preamble10.wav")
audio = torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=16_000)
audio_inputs = processor(audios=audio, return_tensors="pt")

# Process some input text as well:
text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
```
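
The processor returns batches of PyTorch tensors, which must live on the same device as the model. A small sketch, assuming the optional device placement shown earlier:

```python
# Move the processed inputs to the model's device before calling generate().
audio_inputs = audio_inputs.to(model.device)
text_inputs = text_inputs.to(model.device)
```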
 
### Speech

Generate speech in Russian from either text (T2ST) or speech input (S2ST):

```python
audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
```
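
These calls return raw waveforms as NumPy arrays. As a usage sketch, you can write them to WAV files with `scipy`; the output sampling rate is assumed here to be exposed as `model.config.sampling_rate` (16 kHz for the SeamlessM4T vocoder):

```python
import scipy.io.wavfile

sample_rate = model.config.sampling_rate  # vocoder output rate, assumed 16 kHz
scipy.io.wavfile.write("speech_from_text.wav", rate=sample_rate, data=audio_array_from_text)
scipy.io.wavfile.write("speech_from_audio.wav", rate=sample_rate, data=audio_array_from_audio)
```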
 
### Text

Similarly, you can generate translated text from audio files (S2TT) or from text (T2TT, conventionally MT) with the same model.
You only have to pass `generate_speech=False` to [`SeamlessM4TModel.generate`](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel.generate).
This time, let's translate into French:

```python
# From audio:
output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False)
translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)

# From text:
output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
```
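
ASR follows the same pattern: it is equivalent to S2TT with the target language set to the source language. A minimal sketch, assuming the audio sample loaded above is English speech:

```python
# ASR: "translate" speech into its own language, without generating speech.
output_tokens = model.generate(**audio_inputs, tgt_lang="eng", generate_speech=False)
transcribed_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
```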
 
## Seamless_communication usage

You can also use the SeamlessM4T models through the [`seamless_communication` library](https://github.com/facebookresearch/seamless_communication/blob/main/docs/m4t/README.md), either from the CLI:

```bash
m4t_predict <path_to_input_audio> --task s2st --tgt_lang <tgt_lang> --output_path <path_to_save_audio> --model_name seamlessM4T_large
```
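
For example, a hypothetical invocation (the file names are illustrative) that translates an English recording into French speech:

```bash
m4t_predict input_en.wav --task s2st --tgt_lang fra --output_path out_fra.wav --model_name seamlessM4T_large
```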

or a `Translator` API:

```python
import torch
from seamless_communication.inference import Translator

# Initialize a Translator object with a multitask model and vocoder on the GPU.
translator = Translator("seamlessM4T_large", "vocoder_36langs", torch.device("cuda:0"), torch.float16)

# text_generation_opts / unit_generation_opts hold the library's generation
# settings; they are assumed optional here (omit them to use the defaults).
text_output, speech_output = translator.predict(
    input=<path_to_input_audio>,
    task_str="S2ST",
    tgt_lang=<tgt_lang>,
    text_generation_opts=text_generation_opts,
    unit_generation_opts=unit_generation_opts,
)
```
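
To save the translated speech, the generated waveform can be written with `torchaudio`; a sketch assuming the `speech_output.audio_wavs` and `speech_output.sample_rate` fields exposed by recent versions of the library:

```python
import torchaudio

# Save the first translated waveform in the batch.
torchaudio.save(
    <path_to_save_audio>,
    speech_output.audio_wavs[0][0].to(torch.float32).cpu(),
    sample_rate=speech_output.sample_rate,
)
```

Note that for the text-input tasks (T2ST and T2TT), `src_lang` must also be passed to `predict()`.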
 
  ## Citation
 
  If you plan to use SeamlessM4T in your work or any models/datasets/artifacts published in SeamlessM4T, please cite: