---
license: apache-2.0
language: fr
library_name: transformers
thumbnail: null
tags:
- automatic-speech-recognition
- hf-asr-leaderboard
- robust-speech-event
- CTC
- Wav2vec2
datasets:
- common_voice
- mozilla-foundation/common_voice_11_0
- facebook/multilingual_librispeech
- facebook/voxpopuli
- gigant/african_accented_french
metrics:
- wer
model-index:
- name: Fine-tuned wav2vec2-FR-7K-large model for ASR in French
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11
      type: mozilla-foundation/common_voice_11_0
      args: fr
    metrics:
    - name: Test WER
      type: wer
      value: 11.44
    - name: Test WER (+LM)
      type: wer
      value: 9.66
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Multilingual LibriSpeech (MLS)
      type: facebook/multilingual_librispeech
      args: french
    metrics:
    - name: Test WER
      type: wer
      value: 5.93
    - name: Test WER (+LM)
      type: wer
      value: 5.13
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VoxPopuli
      type: facebook/voxpopuli
      args: fr
    metrics:
    - name: Test WER
      type: wer
      value: 9.33
    - name: Test WER (+LM)
      type: wer
      value: 8.51
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: African Accented French
      type: gigant/african_accented_french
      args: fr
    metrics:
    - name: Test WER
      type: wer
      value: 16.22
    - name: Test WER (+LM)
      type: wer
      value: 15.39
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Robust Speech Event - Dev Data
      type: speech-recognition-community-v2/dev_data
      args: fr
    metrics:
    - name: Test WER
      type: wer
      value: 16.56
    - name: Test WER (+LM)
      type: wer
      value: 12.96
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Fleurs
      type: google/fleurs
      args: fr_fr
    metrics:
    - name: Test WER
      type: wer
      value: 10.10
    - name: Test WER (+LM)
      type: wer
      value: 8.84
---

# Fine-tuned wav2vec2-FR-7K-large model for ASR in French

<style>
img {
  display: inline;
}
</style>

[![Model architecture](https://img.shields.io/badge/Model_Architecture-Wav2Vec2--CTC-lightgrey)](#model-architecture) [![Model size](https://img.shields.io/badge/Params-315M-lightgrey)](#model-architecture) [![Language](https://img.shields.io/badge/Language-French-lightgrey)](#datasets)

This model is a fine-tuned version of [LeBenchmark/wav2vec2-FR-7K-large](https://huggingface.co/LeBenchmark/wav2vec2-FR-7K-large), trained on a composite dataset comprising over 2,200 hours of French speech audio, using the train and validation splits of [Common Voice 11.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0), [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech), [VoxPopuli](https://github.com/facebookresearch/voxpopuli), [Multilingual TEDx](http://www.openslr.org/100), [MediaSpeech](https://www.openslr.org/108), and [African Accented French](https://huggingface.co/datasets/gigant/african_accented_french). When using the model, make sure that your speech input is also sampled at 16 kHz.

## Usage

1. To use on a local audio file with the language model

```python
import torch
import torchaudio

from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM

# use the GPU when available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
# decoding with the language model requires the `pyctcdecode` and `kenlm` packages
processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained("bhuang/asr-wav2vec2-french")

wav_path = "example.wav"  # path to your audio file
waveform, sample_rate = torchaudio.load(wav_path)  # shape: (channels, frames)
waveform = waveform.mean(dim=0)  # downmix to mono

# resample to the 16 kHz rate the model expects
if sample_rate != 16_000:
    resampler = torchaudio.transforms.Resample(sample_rate, 16_000)
    waveform = resampler(waveform)

# normalize
input_dict = processor_with_lm(waveform, sampling_rate=16_000, return_tensors="pt")

with torch.inference_mode():
    logits = model(input_dict.input_values.to(device)).logits

# decode with beam search and the language model
predicted_sentence = processor_with_lm.batch_decode(logits.cpu().numpy()).text[0]
```

2. To use on a local audio file without the language model

```python
import torch
import torchaudio

from transformers import AutoModelForCTC, Wav2Vec2Processor

# use the GPU when available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
processor = Wav2Vec2Processor.from_pretrained("bhuang/asr-wav2vec2-french")

wav_path = "example.wav"  # path to your audio file
waveform, sample_rate = torchaudio.load(wav_path)  # shape: (channels, frames)
waveform = waveform.mean(dim=0)  # downmix to mono

# resample to the 16 kHz rate the model expects
if sample_rate != 16_000:
    resampler = torchaudio.transforms.Resample(sample_rate, 16_000)
    waveform = resampler(waveform)

# normalize
input_dict = processor(waveform, sampling_rate=16_000, return_tensors="pt")

with torch.inference_mode():
    logits = model(input_dict.input_values.to(device)).logits

# decode (greedy: most likely token per frame, then CTC collapse)
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentence = processor.batch_decode(predicted_ids)[0]
```
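
The greedy decoding step above (an argmax over the logits followed by `batch_decode`) relies on the CTC collapsing rule: merge consecutive repeated tokens, then drop blanks. The following is a minimal sketch of that rule on a toy example, not the `transformers` implementation. The vocabulary, the blank id of 0, and the helper name `ctc_greedy_collapse` are assumptions made for illustration only.

```python
from itertools import groupby

# Toy vocabulary (hypothetical): id 0 is the CTC blank, "|" is the word delimiter.
VOCAB = {0: "<pad>", 1: "|", 2: "e", 3: "l", 4: "a", 5: "v"}
BLANK_ID = 0


def ctc_greedy_collapse(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse per-frame argmax ids into text using the CTC rule."""
    # 1) merge runs of identical consecutive ids
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    # 2) drop blanks, which only serve to separate genuine repeats
    kept = [token_id for token_id in merged if token_id != blank_id]
    # 3) map ids to characters and turn the word delimiter into a space
    return "".join(vocab[i] for i in kept).replace("|", " ").strip()


# Per-frame ids, as they might come out of torch.argmax(logits, dim=-1).
# The blank between the two "l" runs is what preserves the double letter.
frames = [2, 2, 3, 0, 3, 3, 2, 1, 1, 5, 5, 4, 4, 0]
print(ctc_greedy_collapse(frames))  # prints "elle va"
```

Decoding with the language model replaces this per-frame argmax with a beam search over the same logits.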

## Evaluation

1. To evaluate on `mozilla-foundation/common_voice_11_0`

```bash
python eval.py \
  --model_id "bhuang/asr-wav2vec2-french" \
  --dataset "mozilla-foundation/common_voice_11_0" \
  --config "fr" \
  --split "test" \
  --log_outputs \
  --outdir "outputs/results_mozilla-foundatio_common_voice_11_0_with_lm"
```

2. To evaluate on `speech-recognition-community-v2/dev_data`

```bash
python eval.py \
  --model_id "bhuang/asr-wav2vec2-french" \
  --dataset "speech-recognition-community-v2/dev_data" \
  --config "fr" \
  --split "validation" \
  --chunk_length_s 30.0 \
  --stride_length_s 5.0 \
  --log_outputs \
  --outdir "outputs/results_speech-recognition-community-v2_dev_data_with_lm"
```
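
The `--chunk_length_s 30.0` and `--stride_length_s 5.0` options above suggest that long recordings are transcribed in overlapping windows. As a rough sketch of the idea (an assumption about how such windows can be laid out, not the code of `eval.py` or of `transformers`), the hypothetical helper `chunk_bounds` below computes sample ranges in which each 30 s window shares 5 s of context with its neighbor on either side, so that the less reliable edges of each window can be discarded when the outputs are stitched together.

```python
def chunk_bounds(num_samples, sampling_rate=16_000, chunk_length_s=30.0, stride_length_s=5.0):
    """Return (start, end) sample indices of overlapping windows.

    Each window is `chunk_length_s` long and keeps `stride_length_s` of
    context on either side, so consecutive windows start
    `chunk_length_s - 2 * stride_length_s` seconds apart.
    """
    chunk_len = int(round(chunk_length_s * sampling_rate))
    stride = int(round(stride_length_s * sampling_rate))
    step = chunk_len - 2 * stride
    if step <= 0:
        raise ValueError("chunk_length_s must be larger than 2 * stride_length_s")
    if num_samples <= 0:
        return []

    bounds = []
    start = 0
    while True:
        end = min(start + chunk_len, num_samples)
        bounds.append((start, end))
        if end == num_samples:  # the last window reaches the end of the audio
            break
        start += step
    return bounds


# Window layout for a 70-second recording sampled at 16 kHz
for start, end in chunk_bounds(70 * 16_000):
    print(f"{start / 16_000:5.1f}s -> {end / 16_000:5.1f}s")
```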
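
The results reported for this model are word error rates (WER). For readers unfamiliar with the metric, here is a minimal sketch of how WER is typically computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. It is illustrative only. The function name `word_error_rate` is hypothetical, the sentences are made up, and the evaluation script may apply text normalization that is not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "le chat dort sur le canapé"
hypothesis = "le chat dort sous le canapé"
print(f"WER: {100 * word_error_rate(reference, hypothesis):.2f}%")
```

In this example one word out of six is substituted, so the WER is one sixth.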