Update README.md
We compare our kotoba-whisper-bilingual with the OpenAI whisper models, kotoba-whisper, and distil-whisper.
OpenAI whisper is not trained for English-to-Japanese speech-to-text translation, and the other models are task-specific (e.g., kotoba-whisper is Japanese ASR only and distil-whisper is English ASR only).

### Speech2Text Translation (Japanese->English): WER (smaller is better)

| model | [CoVoST2 (Ja->En)](https://huggingface.co/datasets/japanese-asr/ja2en.s2t_translation) | [Fleurs (Ja->En)](https://huggingface.co/datasets/japanese-asr/ja2en.s2t_translation) |
|:------|----------------------------------------------------------------------------------------:|---------------------------------------------------------------------------------------:|
| ... | ... | ... |
| [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 377.2 | 474 |
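A note on reading these tables: WER and CER are reported as percentages and can exceed 100, because the metric counts substitutions, deletions, and insertions against the reference length, so a model that answers in the wrong language or script (as the non-translation baselines do here) is penalized heavily. As a rough sketch, such scores can be computed with the Hugging Face `evaluate` library; the text normalization behind the official numbers is not shown here, so treat this as illustrative only:

```python
# pip install evaluate jiwer
import evaluate

wer = evaluate.load("wer")  # word error rate, used for English targets
cer = evaluate.load("cer")  # character error rate, used for Japanese targets

predictions = ["the cat sit on the mat"]
references = ["the cat sat on the mat"]

# (S + D + I) / N, reported as a percentage; it can exceed 100 when the
# hypothesis inserts more tokens than the reference contains.
print(100 * wer.compute(predictions=predictions, references=references))
print(100 * cer.compute(predictions=predictions, references=references))
```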
### Speech2Text Translation (English->Japanese): CER (smaller is better)

| model | [CoVoST2 (En->Ja)](https://huggingface.co/datasets/japanese-asr/en2ja.s2t_translation) | [Fleurs (En->Ja)](https://huggingface.co/datasets/japanese-asr/en2ja.s2t_translation) |
|:------|----------------------------------------------------------------------------------------:|---------------------------------------------------------------------------------------:|
| ... | ... | ... |
| [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 185.2 | 200.5 |
### ASR (Japanese): CER (smaller is better)

| model | [CommonVoice 8 (Japanese test set)](https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0) | [JSUT Basic 5000](https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000) | [ReazonSpeech (held out test set)](https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test) |
|:------|-----------------------------------------------------------------------------------------------------------:|----------------------------------------------------------------------------------------:|-------------------------------------------------------------------------------------------------------------:|
| ... | ... | ... | ... |
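Each column header above links to the evaluation set on the Hugging Face Hub. A minimal sketch of pulling one of them down with the `datasets` library; the split name is an assumption here, so check the dataset page for the actual schema:

```python
from datasets import load_dataset

# The "test" split is an assumption; inspect the dataset card for the
# available splits and column names before relying on them.
jsut = load_dataset("japanese-asr/ja_asr.jsut_basic5000", split="test")
print(jsut[0])
```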
### ASR (English): WER (smaller is better)

| model | [ESB](https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval) (ami) | [ESB](https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval) (earnings22) | [ESB](https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval) (librispeech) | [ESB](https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval) (tedlium) | [ESB](https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval) (voxpopuli) |
|:------|----------------------------------------------------------------------------:|-------------------------------------------------------------------------------------:|--------------------------------------------------------------------------------------:|----------------------------------------------------------------------------------:|------------------------------------------------------------------------------------:|
| ... | ... | ... | ... | ... | ... |
### Inference Speed
Although the cascaded approach scores better on the translation tasks, it buys that accuracy with additional pipeline complexity compared to a single end-to-end model.
The following table reports the mean inference time in seconds on a single RTX 4090 (24 GB VRAM), averaged over 10 trials, for audio samples of different durations; the column headers give the audio length in seconds.

| model | Param. (M) | 10 | 30 | 60 | 300 |
|:------|-----------:|------:|------:|------:|------:|
| [**kotoba-tech/kotoba-whisper-bilingual-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-bilingual-v1.0) | | 0.041 | 0.111 | 0.214 | 1.077 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B)) | | 0.173 | 0.247 | 0.352 | 1.772 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B)) | | 0.173 | 0.240 | 0.348 | 1.515 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)) | | 0.170 | 0.245 | 0.348 | 1.882 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)) | | 0.108 | 0.179 | 0.283 | 1.330 |
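For context, the cascaded rows above chain an ASR model with an NLLB machine-translation model. A sketch of that two-stage pattern, plus a timing loop mirroring the 10-trial averages in the table; the ASR checkpoint inside japanese-asr/en-cascaded-s2t-translation, the audio file, and the loop itself are assumptions, not the authors' benchmark script:

```python
import time
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1

# Stage 1: English ASR (stand-in checkpoint; the cascaded pipeline's actual
# ASR component is not specified here).
asr = pipeline("automatic-speech-recognition",
               model="distil-whisper/distil-large-v3", device=device)
# Stage 2: En->Ja machine translation with NLLB (FLORES-200 language codes).
mt = pipeline("translation", model="facebook/nllb-200-distilled-600M",
              src_lang="eng_Latn", tgt_lang="jpn_Jpan", device=device)

def cascaded_s2t(audio_path: str) -> str:
    # Transcribe the audio, then translate the transcript.
    text = asr(audio_path)["text"]
    return mt(text)[0]["translation_text"]

# Mean latency over repeated trials, as in the table above.
trials = []
for _ in range(10):
    start = time.perf_counter()
    cascaded_s2t("sample_en.wav")  # hypothetical audio file
    trials.append(time.perf_counter() - start)
print(f"mean inference time: {sum(trials) / len(trials):.3f} s")
```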
## Transformers Usage
Kotoba-Whisper is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first install the latest version of Transformers.
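For illustration, a minimal pipeline call might look like the following; the `generate_kwargs` values and the audio path are placeholders, and the model card documents the exact task/language flags for each of the four modes (Ja/En ASR and Ja<->En translation):

```python
# pip install --upgrade transformers accelerate
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

pipe = pipeline(
    "automatic-speech-recognition",
    model="kotoba-tech/kotoba-whisper-bilingual-v1.0",
    torch_dtype=torch_dtype,
    device=device,
)

# Whisper-style flags select the mode (Japanese ASR shown here); the
# translation modes use task="translate" with a target language -- assumed,
# so verify against the model card.
result = pipe("sample.wav", generate_kwargs={"language": "ja", "task": "transcribe"})
print(result["text"])
```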