asahi417 committed
Commit 4779456
1 Parent(s): a245fe1

Update README.md

Files changed (1)
  1. README.md +15 -22
README.md CHANGED
@@ -47,7 +47,7 @@ We compare our kotoba-whisper-bilingual with OpenAI whisper models, kotoba-whisp
 OpenAI whisper is not trained for English to Japanese speech-to-text translation, and other models are specific to the task (e.g., kotoba-whisper is Japanese ASR and
 distil whisper is English ASR only).
 
-### Speech2Text Translation (Japanese->English): WER
+### Speech2Text Translation (Japanese->English): WER (smaller is better)
 
 | model | [CoVoST2 (Ja->En)](https://huggingface.co/datasets/japanese-asr/ja2en.s2t_translation) | [Fleurs (Ja->En)](https://huggingface.co/datasets/japanese-asr/ja2en.s2t_translation) |
 |:------|-------:|-------:|
@@ -65,7 +65,7 @@ distil whisper is English ASR only).
 | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 377.2 | 474 |
 
 
-### Speech2Text Translation (English->Japanese): CER
+### Speech2Text Translation (English->Japanese): CER (smaller is better)
 
 | model | [CoVoST2 (En->Ja)](https://huggingface.co/datasets/japanese-asr/en2ja.s2t_translation) | [Fleurs (En->Ja)](https://huggingface.co/datasets/japanese-asr/en2ja.s2t_translation) |
 |:------|-------:|-------:|
@@ -83,7 +83,7 @@ distil whisper is English ASR only).
 | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 185.2 | 200.5 |
 
 
-### ASR (Japanese): CER
+### ASR (Japanese): CER (smaller is better)
 
 | model | [CommonVoice 8 (Japanese test set)](https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0) | [JSUT Basic 5000](https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000) | [ReazonSpeech (held out test set)](https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test) |
 |:------|-------:|-------:|-------:|
@@ -101,7 +101,7 @@ distil whisper is English ASR only).
 
 
 
-### ASR (English): WER
+### ASR (English): WER (smaller is better)
 
 | model | [ESB](https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval) (ami) | [ESB](https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval) (earnings22) | [ESB](https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval) (librispeech) | [ESB](https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval) (tedlium) | [ESB](https://huggingface.co/datasets/japanese-asr/en_asr.esb_eval) (voxpopuli) |
 |:------|-------:|-------:|-------:|-------:|-------:|
@@ -117,24 +117,17 @@ distil whisper is English ASR only).
 
 
 ### Inference Speed
-Although the cascaded approach is better in translation task, due to the nature of cascaded approach, the pipeline has additional complexity compared to the single end2end models for the sake of high accuracy.
-Following table shows the mean inference time in second averaged over 10 trials on audio sample with different durations.
-
-| model | 10 | 30 | 60 | 300 |
-|:------|-------:|-------:|-------:|-------:|
-| [**kotoba-tech/kotoba-whisper-bilingual-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-bilingual-v1.0) | 0.041 | 0.111 | 0.214 | 1.077 |
-| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B)) | 0.173 | 0.247 | 0.352 | 1.772 |
-| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B)) | 0.173 | 0.24 | 0.348 | 1.515 |
-| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)) | 0.17 | 0.245 | 0.348 | 1.882 |
-| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)) | 0.108 | 0.179 | 0.283 | 1.33 |
-| [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 0.061 | 0.184 | 0.372 | 1.804 |
-| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) | 0.062 | 0.199 | 0.415 | 1.854 |
-| [openai/whisper-large](https://huggingface.co/openai/whisper-large) | 0.062 | 0.183 | 0.363 | 1.899 |
-| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 0.045 | 0.132 | 0.266 | 1.368 |
-| [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 0.135 | 0.376 | 0.631 | 3.495 |
-| [openai/whisper-base](https://huggingface.co/openai/whisper-base) | 0.054 | 0.108 | 0.231 | 1.019 |
-| [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 0.045 | 0.124 | 0.208 | 0.838 |
-
+Although the cascaded approach scores better on the translation tasks, its pipeline
+trades additional complexity, relative to a single end-to-end model, for that higher accuracy.
+The following table shows the mean inference time in seconds on a single RTX 4090 (24 GB VRAM), averaged over 10 trials, for audio samples of different durations.
+
+| model | Param. (M) | 10 s | 30 s | 60 s | 300 s |
+|:------|-----------:|------:|------:|------:|------:|
+| [**kotoba-tech/kotoba-whisper-bilingual-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-bilingual-v1.0) | | 0.041 | 0.111 | 0.214 | 1.077 |
+| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B)) | | 0.173 | 0.247 | 0.352 | 1.772 |
+| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B)) | | 0.173 | 0.240 | 0.348 | 1.515 |
+| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)) | | 0.170 | 0.245 | 0.348 | 1.882 |
+| [japanese-asr/en-cascaded-s2t-translation](https://huggingface.co/japanese-asr/en-cascaded-s2t-translation) ([facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)) | | 0.108 | 0.179 | 0.283 | 1.330 |
 
 ## Transformers Usage
 Kotoba-Whisper is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first
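
For reference, here is a minimal sketch of that pipeline-based usage. It assumes only the standard Transformers ASR pipeline interface; the `generate_kwargs` values and the audio file name are illustrative placeholders, so consult the full model card for the exact options this checkpoint expects.

```python
# Minimal usage sketch (assumes transformers>=4.39, torch, and a local audio file).
# The language/task flags below are illustrative for Japanese transcription.
import torch
from transformers import pipeline

model_id = "kotoba-tech/kotoba-whisper-bilingual-v1.0"
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
)

# "sample_ja.mp3" is a placeholder; any audio file readable by ffmpeg works here.
result = pipe("sample_ja.mp3", generate_kwargs={"language": "ja", "task": "transcribe"})
print(result["text"])
```

Swapping `generate_kwargs` to, e.g., `{"language": "en", "task": "transcribe"}` would target English ASR; the flag combinations for the two translation directions are documented in the model card.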
 
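
For the Inference Speed table above, a hypothetical sketch of the timing protocol (mean wall-clock time over 10 trials per audio duration) might look like the following. The synthetic silent audio, chunking setting, and warm-up step are assumptions made to keep the sketch self-contained, not the authors' exact benchmark script.

```python
# Hypothetical timing sketch: mean inference time over 10 trials for
# audio of 10/30/60/300 seconds, mirroring the table's protocol.
import time
import numpy as np
import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="kotoba-tech/kotoba-whisper-bilingual-v1.0",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device="cuda:0" if torch.cuda.is_available() else "cpu",
    chunk_length_s=30,  # chunked decoding for long inputs; an assumed setting
)

# Warm-up call so model loading / CUDA initialization is not timed.
pipe({"raw": np.zeros(16_000, dtype=np.float32), "sampling_rate": 16_000})

for duration in (10, 30, 60, 300):
    # Silent 16 kHz mono audio keeps the sketch self-contained; a real
    # benchmark would time actual speech recordings.
    audio = {"raw": np.zeros(16_000 * duration, dtype=np.float32), "sampling_rate": 16_000}
    trials = []
    for _ in range(10):
        start = time.perf_counter()
        # Pass a shallow copy: the ASR pipeline pops keys from dict inputs.
        pipe(audio.copy(), generate_kwargs={"language": "ja", "task": "transcribe"})
        trials.append(time.perf_counter() - start)
    print(f"{duration:>3d} s audio: mean {sum(trials) / len(trials):.3f} s")
```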