Incorrect config file

#5
by shrey-jasuja - opened

The configuration attached to this model is that of mBART-50, which makes the model completely unusable.

Hey @shrey-jasuja, this is a SpeechEncoderDecoderModel, which uses a speech encoder and a text (mBART) decoder. As stated in the model card:

The encoder was warm-started from the facebook/wav2vec2-xls-r-1b checkpoint and the decoder from the facebook/mbart-large-50 checkpoint. Consequently, the encoder-decoder model was fine-tuned on 21 {lang} -> en translation pairs of the Covost2 dataset.
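
To illustrate the point above, here is a minimal sketch (not taken from the thread) of how a speech encoder-decoder configuration nests a separate encoder and decoder configuration. It builds default `Wav2Vec2Config` and `MBartConfig` objects rather than downloading this model's actual files, so the values are placeholders and only the structure is meaningful.

```python
from transformers import MBartConfig, SpeechEncoderDecoderConfig, Wav2Vec2Config

# Default configs stand in for the real checkpoints (assumption: the actual
# model's hyperparameters differ; only the nesting is illustrated here).
encoder_config = Wav2Vec2Config()
decoder_config = MBartConfig()

# A speech encoder-decoder config wraps one encoder config and one decoder config.
config = SpeechEncoderDecoderConfig.from_encoder_decoder_configs(
    encoder_config, decoder_config
)

print(config.model_type)          # type of the composite model
print(config.encoder.model_type)  # type of the speech encoder
print(config.decoder.model_type)  # type of the text decoder
```

Seeing mBART settings in such a configuration is therefore expected for the decoder half and does not by itself mean the file is wrong.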

I understand, but the inference code in its current form doesn't work. The tokenizer needs to be defined explicitly. The following changes worked for me:

from transformers import MBart50Tokenizer, Wav2Vec2FeatureExtractor, pipeline

# The tokenizer is loaded explicitly from the mBART-50 checkpoint used for the decoder.
tokenizer = MBart50Tokenizer.from_pretrained("facebook/mbart-large-50")
# Use from_pretrained: passing a model ID to the constructor does not load its settings.
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-xls-r-2b-21-to-en")

# device=0 runs the pipeline on the first GPU.
asr = pipeline(model="facebook/wav2vec2-xls-r-2b-21-to-en", tokenizer=tokenizer, feature_extractor=feature_extractor, device=0)

# `item` is assumed to be a dataset sample whose "file" field holds the path to an audio file.
audio = item["file"]
translation = asr(audio)["text"]

Pinging @sanchit-gandhi for advice :)

Indeed - the code examples here are incorrect. Will be fixed by #6!
