Update README.md
We compare our kotoba-whisper-bilingual with the OpenAI whisper models, kotoba-whisper, and distil-whisper.
OpenAI whisper is not trained for English-to-Japanese speech-to-text translation, and the other models are task-specific (e.g., kotoba-whisper is Japanese ASR only and distil-whisper is English ASR only).

### Speech2Text Translation (Japanese->English): WER (smaller is better)

| model | [CoVoST2 (Ja->En)](https://huggingface.co/datasets/japanese-asr/ja2en.s2t_translation) | [Fleurs (Ja->En)](https://huggingface.co/datasets/japanese-asr/ja2en.s2t_translation) |
|:------|----------------------------------------------------------------------------------------:|---------------------------------------------------------------------------------------:|
| ... | ... | ... |
| [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 377.2 | 474 |
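A note on reading these tables: WER and CER are reported as percentages and can exceed 100, because the metric counts substitutions, deletions, and insertions against the reference length, so a model that answers in the wrong language or script (as the non-translation baselines do here) is penalized heavily. As a rough sketch, such scores can be computed with the Hugging Face `evaluate` library; the text normalization behind the official numbers is not shown here, so treat this as illustrative only:

```python
# pip install evaluate jiwer
import evaluate

wer = evaluate.load("wer")  # word error rate, used for English targets
cer = evaluate.load("cer")  # character error rate, used for Japanese targets

predictions = ["the cat sit on the mat"]
references = ["the cat sat on the mat"]

# (S + D + I) / N, reported as a percentage; it can exceed 100 when the
# hypothesis inserts more tokens than the reference contains.
print(100 * wer.compute(predictions=predictions, references=references))
print(100 * cer.compute(predictions=predictions, references=references))
```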
### Speech2Text Translation (English->Japanese): CER (smaller is better)

| model | [CoVoST2 (En->Ja)](https://huggingface.co/datasets/japanese-asr/en2ja.s2t_translation) | [Fleurs (En->Ja)](https://huggingface.co/datasets/japanese-asr/en2ja.s2t_translation) |
|:------|----------------------------------------------------------------------------------------:|---------------------------------------------------------------------------------------:|
| ... | ... | ... |
| [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 185.2 | 200.5 |
### ASR (Japanese): CER (smaller is better)

| model | [CommonVoice 8 (Japanese test set)](https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0) | [JSUT Basic 5000](https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000) | [ReazonSpeech (held out test set)](https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test) |
|:------|-----------------------------------------------------------------------------------------------------------:|----------------------------------------------------------------------------------------:|-------------------------------------------------------------------------------------------------------------:|
| ... | ... | ... | ... |
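Each column header above links to the evaluation set on the Hugging Face Hub. A minimal sketch of pulling one of them down with the `datasets` library; the split name is an assumption here, so check the dataset page for the actual schema:

```python
from datasets import load_dataset

# The "test" split is an assumption; inspect the dataset card for the
# available splits and column names before relying on them.
jsut = load_dataset("japanese-asr/ja_asr.jsut_basic5000", split="test")
print(jsut[0])
```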
### ASR (English): WER (smaller is better)

| model | [ESB](https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval) (ami) | [ESB](https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval) (earnings22) | [ESB](https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval) (librispeech) | [ESB](https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval) (tedlium) | [ESB](https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval) (voxpopuli) |
|:------|----------------------------------------------------------------------------:|-------------------------------------------------------------------------------------:|--------------------------------------------------------------------------------------:|----------------------------------------------------------------------------------:|------------------------------------------------------------------------------------:|
| ... | ... | ... | ... | ... | ... |
### Inference Speed
Although the cascaded approach scores better on the translation tasks, it buys that accuracy with additional pipeline complexity compared to a single end-to-end model.
The following table reports the mean inference time in seconds on a single RTX 4090 (24 GB VRAM), averaged over 10 trials, for audio samples of different durations; the column headers give the audio length in seconds.

| model | Param. (M) | 10 | 30 | 60 | 300 |
|:------|-----------:|------:|------:|------:|------:|
| [**kotoba-tech/kotoba-whisper-bilingual-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-bilingual-v1.0) | | 0.041 | 0.111 | 0.214 | 1.077 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B)) | | 0.173 | 0.247 | 0.352 | 1.772 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B)) | | 0.173 | 0.240 | 0.348 | 1.515 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)) | | 0.170 | 0.245 | 0.348 | 1.882 |
| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)) | | 0.108 | 0.179 | 0.283 | 1.330 |
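For context, the cascaded rows above chain an ASR model with an NLLB machine-translation model. A sketch of that two-stage pattern, plus a timing loop mirroring the 10-trial averages in the table; the ASR checkpoint inside japanese-asr/en-cascaded-s2t-translation, the audio file, and the loop itself are assumptions, not the authors' benchmark script:

```python
import time
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1

# Stage 1: English ASR (stand-in checkpoint; the cascaded pipeline's actual
# ASR component is not specified here).
asr = pipeline("automatic-speech-recognition",
               model="distil-whisper/distil-large-v3", device=device)
# Stage 2: En->Ja machine translation with NLLB (FLORES-200 language codes).
mt = pipeline("translation", model="facebook/nllb-200-distilled-600M",
              src_lang="eng_Latn", tgt_lang="jpn_Jpan", device=device)

def cascaded_s2t(audio_path: str) -> str:
    # Transcribe the audio, then translate the transcript.
    text = asr(audio_path)["text"]
    return mt(text)[0]["translation_text"]

# Mean latency over repeated trials, as in the table above.
trials = []
for _ in range(10):
    start = time.perf_counter()
    cascaded_s2t("sample_en.wav")  # hypothetical audio file
    trials.append(time.perf_counter() - start)
print(f"mean inference time: {sum(trials) / len(trials):.3f} s")
```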
## Transformers Usage
Kotoba-Whisper is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first install the latest version of Transformers.
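For illustration, a minimal pipeline call might look like the following; the `generate_kwargs` values and the audio path are placeholders, and the model card documents the exact task/language flags for each of the four modes (Ja/En ASR and Ja<->En translation):

```python
# pip install --upgrade transformers accelerate
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

pipe = pipeline(
    "automatic-speech-recognition",
    model="kotoba-tech/kotoba-whisper-bilingual-v1.0",
    torch_dtype=torch_dtype,
    device=device,
)

# Whisper-style flags select the mode (Japanese ASR shown here); the
# translation modes use task="translate" with a target language -- assumed,
# so verify against the model card.
result = pipe("sample.wav", generate_kwargs={"language": "ja", "task": "transcribe"})
print(result["text"])
```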