metadata

license: mit
language:
  - af
  - am
  - ar
  - as
  - az
  - be
  - bn
  - bs
  - bg
  - ca
  - cs
  - zh
  - cy
  - da
  - de
  - el
  - en
  - et
  - fi
  - fr
  - or
  - om
  - ga
  - gl
  - gu
  - ha
  - he
  - hi
  - hr
  - hu
  - hy
  - ig
  - id
  - is
  - it
  - jv
  - ja
  - kn
  - ka
  - kk
  - mn
  - km
  - ky
  - ko
  - lo
  - ln
  - lt
  - lb
  - lg
  - lv
  - ml
  - mr
  - mk
  - mt
  - mi
  - my
  - nl
  - nb
  - ne
  - ny
  - oc
  - pa
  - ps
  - fa
  - pl
  - pt
  - ro
  - ru
  - sk
  - sl
  - sn
  - sd
  - so
  - es
  - sr
  - sv
  - sw
  - ta
  - te
  - tg
  - tl
  - th
  - tr
  - uk
  - ur
  - uz
  - vi
  - wo
  - xh
  - yo
  - ms
  - zu
  - ary
  - arz
  - yue
  - kea
inference: false

W2v-BERT 2.0 speech encoder

We are open-sourcing W2v-BERT 2.0, the Conformer-based speech encoder at the core of our Seamless models, as described in Section 3.2.1 of the paper.

This model was pre-trained on 4.5M hours of unlabeled audio data covering more than 143 languages. It must be fine-tuned before it can be used for downstream tasks such as Automatic Speech Recognition (ASR) or Audio Classification.

Model Name	#params	checkpoint
W2v-BERT 2.0	600M	checkpoint

This model and its training are supported by 🤗 Transformers; see the docs for details.

Seamless Communication usage

This model can be used in Seamless Communication, where it was released.

Here's how to make a forward pass through the speech encoder after completing the installation steps:

import torch

from fairseq2.data import Collater
from fairseq2.data.audio import AudioDecoder, WaveformToFbankConverter
from fairseq2.memory import MemoryBlock
from fairseq2.nn.padding import get_seqs_and_padding_mask
from pathlib import Path
from seamless_communication.models.conformer_shaw import load_conformer_shaw_model


audio_wav_path, device, dtype = ...
audio_decoder = AudioDecoder(dtype=torch.float32, device=device)
fbank_converter = WaveformToFbankConverter(
    num_mel_bins=80,
    waveform_scale=2**15,
    channel_last=True,
    standardize=True,
    device=device,
    dtype=dtype,
)
collater = Collater(pad_value=1)

model = load_conformer_shaw_model("conformer_shaw", device=device, dtype=dtype)
model.eval()

with Path(audio_wav_path).open("rb") as fb:
    block = MemoryBlock(fb.read())

decoded_audio = audio_decoder(block)
src = collater(fbank_converter(decoded_audio))["fbank"]
seqs, padding_mask = get_seqs_and_padding_mask(src)

# Run the frontend (feature projection) and then the Conformer encoder without tracking gradients.
with torch.inference_mode():
    seqs, padding_mask = model.encoder_frontend(seqs, padding_mask)
    seqs, padding_mask = model.encoder(seqs, padding_mask)
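
The encoder returns one vector per frame in `seqs`. For a downstream task such as audio classification, a common first step is to pool those frames into a single utterance-level vector while ignoring padding. The following is a minimal sketch of that idea, not part of the Seamless Communication API: `masked_mean_pool` is a hypothetical helper, it takes plain sequence lengths rather than the fairseq2 padding mask object, and random tensors stand in for real encoder output.

```python
import torch


def masked_mean_pool(seqs: torch.Tensor, seq_lens: torch.Tensor) -> torch.Tensor:
    """Average encoder frames over time, ignoring padded positions.

    seqs: (batch, time, dim) encoder outputs.
    seq_lens: (batch,) number of valid frames per utterance.
    """
    _, time, _ = seqs.shape
    # True for valid frames, False for padding: shape (batch, time).
    positions = torch.arange(time, device=seqs.device).unsqueeze(0)
    valid = positions < seq_lens.unsqueeze(1)
    weights = valid.unsqueeze(-1).to(seqs.dtype)
    summed = (seqs * weights).sum(dim=1)
    # Guard against division by zero for empty sequences.
    counts = weights.sum(dim=1).clamp(min=1)
    return summed / counts


# Random stand-in for encoder output: 2 utterances, 5 frames, 4 features.
seqs = torch.randn(2, 5, 4)
seq_lens = torch.tensor([5, 3])  # the second utterance has 2 padded frames
pooled = masked_mean_pool(seqs, seq_lens)
print(pooled.shape)
```

The pooled vectors could then be passed to a small classification head that is trained during fine-tuning.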