---
library_name: transformers
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- audio
---

# Cascaded Japanese Speech2Text Translation
This is a pipeline for speech-to-text translation from Japanese speech to text in any target language, based on the cascaded approach: ASR followed by machine translation.
The pipeline employs [kotoba-tech/kotoba-whisper-v2.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0) for ASR (Japanese speech -> Japanese text)
and [facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) for text translation.
The input must be Japanese speech, while the translation can be in any language NLLB was trained on. All available languages and their language codes are listed
[here](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200).

**A model for English speech translation is available at [en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation).**
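Conceptually, the cascade simply chains the two models. Below is a minimal sketch of the two stages run manually with the standard `transformers` pipelines (the packaged pipeline in the Usage section below wraps this flow for you); the audio file is the sample downloaded in the Usage section.

```python3
from transformers import pipeline

# Stage 1: ASR (Japanese speech -> Japanese text).
asr = pipeline("automatic-speech-recognition", model="kotoba-tech/kotoba-whisper-v2.0")
japanese_text = asr(
    "./sample_ja.flac",
    generate_kwargs={"language": "japanese", "task": "transcribe"},
)["text"]

# Stage 2: text translation (Japanese text -> English text), using FLORES-200 language codes.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="jpn_Jpan",
    tgt_lang="eng_Latn",
)
english_text = translator(japanese_text)[0]["translation_text"]
print(english_text)
```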

## Benchmark
The following table shows the WER computed between the reference and predicted translations for the task of translating Japanese speech to English text
(subsets of [CoVoST2 and Fleurs](https://huggingface.co/datasets/japanese-asr/ja2en.s2t_translation)) with different sizes of NLLB, along with OpenAI Whisper models.

| model                                                                                                                                                                                                     |   [CoVoST2 (Ja->En)](https://huggingface.co/datasets/japanese-asr/ja2en.s2t_translation)|   [Fleurs (Ja->En)](https://huggingface.co/datasets/japanese-asr/ja2en.s2t_translation) |
|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------:|------------------------------------------------------------------------------------------------------:|
| [japanese-asr/ja-cascaded-s2t-translation](https://huggingface.co/japanese-asr/ja-cascaded-s2t-translation) ([facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B))                     |                                                                                                   64.3 |                                                                                                  67.1 |
| [japanese-asr/ja-cascaded-s2t-translation](https://huggingface.co/japanese-asr/ja-cascaded-s2t-translation) ([facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B))                     |                                                                                                   65.4 |                                                                                                  68.9 |
| [japanese-asr/ja-cascaded-s2t-translation](https://huggingface.co/japanese-asr/ja-cascaded-s2t-translation) ([facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)) |                                                                                                   65.6 |                                                                                                  67.4 |
| [japanese-asr/ja-cascaded-s2t-translation](https://huggingface.co/japanese-asr/ja-cascaded-s2t-translation) ([facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)) |                                                                                                   68.2 |                                                                                                  72.2 |
| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)                                                                                                                                 |                                                                                                   71   |                                                                                                  86.1 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2)                                                                                                                                 |                                                                                                   66.4 |                                                                                                  78.8 |
| [openai/whisper-large](https://huggingface.co/openai/whisper-large)                                                                                                                                       |                                                                                                   66.5 |                                                                                                  86.1 |
| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium)                                                                                                                                     |                                                                                                   70.3 |                                                                                                  97.2 |
| [openai/whisper-small](https://huggingface.co/openai/whisper-small)                                                                                                                                       |                                                                                                   97.3 |                                                                                                 132.2 |
| [openai/whisper-base](https://huggingface.co/openai/whisper-base)                                                                                                                                         |                                                                                                  186.2 |                                                                                                 349.6 |
| [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny)                                                                                                                                         |                                                                                                  377.2 |                                                                                                 474   | 


See [https://github.com/kotoba-tech/kotoba-whisper](https://github.com/kotoba-tech/kotoba-whisper) for the evaluation details.
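
For reference, WER between reference and predicted translations can be computed with the Hugging Face `evaluate` library. A minimal sketch with made-up sentence pairs, for illustration only (not the exact evaluation script used above):

```python3
# pip install evaluate jiwer
import evaluate

wer = evaluate.load("wer")

# Hypothetical reference/prediction pairs, for illustration only.
references = ["the weather is nice today", "i would like a cup of tea"]
predictions = ["the weather is nice", "i would like a cup of tea please"]

print(wer.compute(references=references, predictions=predictions))
```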

### Inference Speed
Due to the nature of the cascaded approach, the pipeline has additional complexity compared to the single end-to-end OpenAI Whisper models, in exchange for higher accuracy.
The following table shows the mean inference time in seconds, averaged over 10 trials, on audio samples of different durations.

| model                                                                                                                                                                                                     |   10 sec |   30 sec |   60 sec |   300 sec |
|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------:|------:|------:|------:|
| [japanese-asr/ja-cascaded-s2t-translation](https://huggingface.co/japanese-asr/ja-cascaded-s2t-translation) ([facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B))                     | 0.173 | 0.247 | 0.352 | 1.772 |
| [japanese-asr/ja-cascaded-s2t-translation](https://huggingface.co/japanese-asr/ja-cascaded-s2t-translation) ([facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B))                     | 0.173 | 0.24  | 0.348 | 1.515 |
| [japanese-asr/ja-cascaded-s2t-translation](https://huggingface.co/japanese-asr/ja-cascaded-s2t-translation) ([facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)) | 0.17  | 0.245 | 0.348 | 1.882 |
| [japanese-asr/ja-cascaded-s2t-translation](https://huggingface.co/japanese-asr/ja-cascaded-s2t-translation) ([facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)) | 0.108 | 0.179 | 0.283 | 1.33  |
| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)                                                                                                                                 | 0.061 | 0.184 | 0.372 | 1.804 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2)                                                                                                                                 | 0.062 | 0.199 | 0.415 | 1.854 |
| [openai/whisper-large](https://huggingface.co/openai/whisper-large)                                                                                                                                       | 0.062 | 0.183 | 0.363 | 1.899 |
| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium)                                                                                                                                     | 0.045 | 0.132 | 0.266 | 1.368 |
| [openai/whisper-small](https://huggingface.co/openai/whisper-small)                                                                                                                                       | 0.135 | 0.376 | 0.631 | 3.495 |
| [openai/whisper-base](https://huggingface.co/openai/whisper-base)                                                                                                                                         | 0.054 | 0.108 | 0.231 | 1.019 |
| [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny)                                                                                                                                         | 0.045 | 0.124 | 0.208 | 0.838 |
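
A minimal sketch of how such a measurement can be reproduced (the pipeline configuration and sample audio are the ones from the Usage section below; absolute numbers will depend on your hardware):

```python3
import time

from transformers import pipeline

# Load the cascaded pipeline (same configuration as in the Usage section below).
pipe = pipeline(
    model="japanese-asr/ja-cascaded-s2t-translation",
    model_kwargs={"attn_implementation": "sdpa"},
    model_translation="facebook/nllb-200-distilled-600M",
    tgt_lang="eng_Latn",
    chunk_length_s=15,
    trust_remote_code=True,
)

# Mean wall-clock inference time over 10 trials.
runs = []
for _ in range(10):
    start = time.perf_counter()
    pipe("./sample_ja.flac")
    runs.append(time.perf_counter() - start)
print(f"mean inference time: {sum(runs) / len(runs):.3f} sec")
```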

## Usage
Here is an example of translating Japanese speech into English text.
First, download a sample audio file.
```bash
wget https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000/resolve/main/sample.flac -O sample_ja.flac
```

Then, run the pipeline as follows.
```python3
from transformers import pipeline

# load model
pipe = pipeline(
    model="japanese-asr/ja-cascaded-s2t-translation",
    model_kwargs={"attn_implementation": "sdpa"},
    model_translation="facebook/nllb-200-distilled-600M",
    tgt_lang="eng_Latn",
    chunk_length_s=15,
    trust_remote_code=True,
)

# translate
output = pipe("./sample_ja.flac")
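
# inspect the result
print(output)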
```


Other NLLB models can be used by setting `model_translation` to one of the following (see the sketch after this list):
- [facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B)
- [facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)
- [facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)
- [facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B)
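
For example, a sketch of swapping in the 1.3B model and translating into French instead (`fra_Latn` is the FLORES-200 code for French; any code from the list linked above should work):

```python3
from transformers import pipeline

# Same pipeline, with a larger NLLB model and French as the target language.
pipe = pipeline(
    model="japanese-asr/ja-cascaded-s2t-translation",
    model_kwargs={"attn_implementation": "sdpa"},
    model_translation="facebook/nllb-200-1.3B",
    tgt_lang="fra_Latn",  # FLORES-200 language code
    chunk_length_s=15,
    trust_remote_code=True,
)
output = pipe("./sample_ja.flac")
```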