metadata

library_name: transformers
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
  - audio

Cascaded Japanese Speech2Text Translation

This is a pipeline for speech-to-text translation from Japanese speech to text in any target language. It is based on a cascaded approach that consists of ASR followed by translation. The pipeline employs kotoba-tech/kotoba-whisper-v2.0 for ASR (Japanese speech -> Japanese text) and facebook/nllb-200-3.3B for text translation. The input must be Japanese speech, while the translation can be in any language that NLLB was trained on. All available languages and their language codes are listed here.

A model for English speech translation is available at en-cascaded-s2t-translation.

Benchmark

The following table shows the WER computed between the reference and predicted translations for the Japanese speech to English text task (subsets of CoVoST2 and Fleurs), using different sizes of NLLB, alongside OpenAI Whisper models.

model	CoVoST2 (Ja->En)	Fleurs (Ja->En)
japanese-asr/ja-cascaded-s2t-translation (facebook/nllb-200-3.3B)	64.3	67.1
japanese-asr/ja-cascaded-s2t-translation (facebook/nllb-200-1.3B)	65.4	68.9
japanese-asr/ja-cascaded-s2t-translation (facebook/nllb-200-distilled-1.3B)	65.6	67.4
japanese-asr/ja-cascaded-s2t-translation (facebook/nllb-200-distilled-600M)	68.2	72.2
openai/whisper-large-v3	71	86.1
openai/whisper-large-v2	66.4	78.8
openai/whisper-large	66.5	86.1
openai/whisper-medium	70.3	97.2
openai/whisper-small	97.3	132.2
openai/whisper-base	186.2	349.6
openai/whisper-tiny	377.2	474

See https://github.com/kotoba-tech/kotoba-whisper for the evaluation details.
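
As a rough illustration of the metric in the table above, the sketch below computes a word-level error rate between a reference and a predicted translation using edit distance. It is not the evaluation script from the repository: the function name `word_error_rate` is hypothetical, the whitespace tokenization is an assumption, and any text normalization the actual evaluation applies is omitted. The result is scaled to a percentage, which is how the table values appear to be reported.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: word-level edit distance / reference length.

    Assumption: words are separated by whitespace and no text
    normalization (casing, punctuation) is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, the score can exceed 100 when the prediction is much longer than the reference, which is consistent with the values above 100 reported for the smaller Whisper models.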

Inference Speed

Due to the nature of the cascaded approach, the pipeline is more complex than the single end-to-end OpenAI Whisper models, a cost accepted for the sake of higher accuracy. The following table shows the mean inference time in seconds, averaged over 10 trials, on audio samples of different durations (the column headers give the audio duration).

model	10	30	60	300
japanese-asr/ja-cascaded-s2t-translation (facebook/nllb-200-3.3B)	0.173	0.247	0.352	1.772
japanese-asr/ja-cascaded-s2t-translation (facebook/nllb-200-1.3B)	0.173	0.24	0.348	1.515
japanese-asr/ja-cascaded-s2t-translation (facebook/nllb-200-distilled-1.3B)	0.17	0.245	0.348	1.882
japanese-asr/ja-cascaded-s2t-translation (facebook/nllb-200-distilled-600M)	0.108	0.179	0.283	1.33
openai/whisper-large-v3	0.061	0.184	0.372	1.804
openai/whisper-large-v2	0.062	0.199	0.415	1.854
openai/whisper-large	0.062	0.183	0.363	1.899
openai/whisper-medium	0.045	0.132	0.266	1.368
openai/whisper-small	0.135	0.376	0.631	3.495
openai/whisper-base	0.054	0.108	0.231	1.019
openai/whisper-tiny	0.045	0.124	0.208	0.838
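
A measurement of this kind (mean wall-clock time over 10 trials) can be sketched as follows. This is not the benchmark script used for the table: the helper name `mean_inference_time`, the warm-up run, and the use of `time.perf_counter` are assumptions, and the hardware and batch settings behind the reported numbers are not stated here.

```python
import time
from statistics import mean
from typing import Callable


def mean_inference_time(run: Callable[[], object],
                        n_trials: int = 10,
                        warmup: int = 1) -> float:
    """Return the mean wall-clock time in seconds of `run` over `n_trials` calls.

    Assumption: `warmup` untimed calls are made first so that one-off
    costs (model loading, kernel compilation) do not skew the mean.
    """
    for _ in range(warmup):
        run()

    timings = []
    for _ in range(n_trials):
        start = time.perf_counter()
        run()
        timings.append(time.perf_counter() - start)
    return mean(timings)


if __name__ == "__main__":
    # Stand-in workload; in practice `run` would wrap a pipeline call
    # on an audio file of the chosen duration.
    print(mean_inference_time(lambda: sum(range(100_000))))
```

When timing a model on a GPU, it is usually necessary to wait for pending device work to finish before reading the clock, otherwise the measured time may only cover the launch of the computation.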

Usage

Here is an example of translating Japanese speech into English text. First, download a sample speech file.

wget https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000/resolve/main/sample.flac -O sample_ja.flac

Then, run the pipeline as below.

from transformers import pipeline

# load model
pipe = pipeline(
    model="japanese-asr/ja-cascaded-s2t-translation",
    model_kwargs={"attn_implementation": "sdpa"},
    model_translation="facebook/nllb-200-distilled-600M",
    tgt_lang="eng_Latn",
    chunk_length_s=15,
    trust_remote_code=True,
)

# translate
output = pipe("./sample_ja.flac")
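
To translate into a language other than English, the same call can be made with a different `tgt_lang` code. The sketch below wraps the arguments from the example above in two hypothetical helpers, `build_pipeline_kwargs` and `translate`, and uses `fra_Latn` (French) as the target. The format check on the language code is an assumption based on the usual `xxx_Yyyy` shape of NLLB codes. It only catches obvious typos and does not confirm that the language is supported.

```python
import re

# Assumption: NLLB language codes have the shape "eng_Latn"
# (three lowercase letters, an underscore, a four-letter script name).
_LANG_CODE = re.compile(r"^[a-z]{3}_[A-Z][a-z]{3}$")


def build_pipeline_kwargs(tgt_lang: str = "eng_Latn",
                          model_translation: str = "facebook/nllb-200-distilled-600M") -> dict:
    """Collect the keyword arguments used in the usage example above."""
    if not _LANG_CODE.match(tgt_lang):
        raise ValueError(f"unexpected language code format: {tgt_lang!r}")
    return {
        "model": "japanese-asr/ja-cascaded-s2t-translation",
        "model_kwargs": {"attn_implementation": "sdpa"},
        "model_translation": model_translation,
        "tgt_lang": tgt_lang,
        "chunk_length_s": 15,
        "trust_remote_code": True,
    }


def translate(audio_path: str, tgt_lang: str = "eng_Latn"):
    """Load the pipeline for `tgt_lang` and run it on one audio file."""
    # Imported lazily so the helper above can be used without transformers installed.
    from transformers import pipeline

    pipe = pipeline(**build_pipeline_kwargs(tgt_lang=tgt_lang))
    return pipe(audio_path)


if __name__ == "__main__":
    # Downloads the models on first use.
    print(translate("./sample_ja.flac", tgt_lang="fra_Latn"))
```

Loading the pipeline is the expensive step, so when translating many files into the same language it is better to build `pipe` once and reuse it rather than calling `translate` repeatedly.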

Other NLLB models can be used by setting model_translation, as in the following.