---
license: apache-2.0
language: fr
library_name: transformers
thumbnail: null
tags:
- automatic-speech-recognition
- hf-asr-leaderboard
- robust-speech-event
- CTC
- Wav2vec2
datasets:
- common_voice
- mozilla-foundation/common_voice_11_0
- facebook/multilingual_librispeech
- facebook/voxpopuli
- gigant/african_accented_french
metrics:
- wer
model-index:
- name: Fine-tuned wav2vec2-FR-7K-large model for ASR in French
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11.0
      type: mozilla-foundation/common_voice_11_0
      args: fr
    metrics:
    - name: Test WER
      type: wer
      value: 11.44
    - name: Test WER (+LM)
      type: wer
      value: 9.66
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Multilingual LibriSpeech (MLS)
      type: facebook/multilingual_librispeech
      args: french
    metrics:
    - name: Test WER
      type: wer
      value: 5.93
    - name: Test WER (+LM)
      type: wer
      value: 5.13
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VoxPopuli
      type: facebook/voxpopuli
      args: fr
    metrics:
    - name: Test WER
      type: wer
      value: 9.33
    - name: Test WER (+LM)
      type: wer
      value: 8.51
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: African Accented French
      type: gigant/african_accented_french
      args: fr
    metrics:
    - name: Test WER
      type: wer
      value: 16.22
    - name: Test WER (+LM)
      type: wer
      value: 15.39
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Robust Speech Event - Dev Data
      type: speech-recognition-community-v2/dev_data
      args: fr
    metrics:
    - name: Test WER
      type: wer
      value: 16.56
    - name: Test WER (+LM)
      type: wer
      value: 12.96
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Fleurs
      type: google/fleurs
      args: fr_fr
    metrics:
    - name: Test WER
      type: wer
      value: 10.10
    - name: Test WER (+LM)
      type: wer
      value: 8.84
---

# Fine-tuned wav2vec2-FR-7K-large model for ASR in French

<style>
img {
 display: inline;
}
</style>

![Model architecture](https://img.shields.io/badge/Model_Architecture-Wav2Vec2--CTC-lightgrey)
![Model size](https://img.shields.io/badge/Params-315M-lightgrey)
![Language](https://img.shields.io/badge/Language-French-lightgrey)

This model is a fine-tuned version of [LeBenchmark/wav2vec2-FR-7K-large](https://huggingface.co/LeBenchmark/wav2vec2-FR-7K-large), trained on a composite dataset of over 2200 hours of French speech audio, using the train and validation splits of [Common Voice 11.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0), [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech), [VoxPopuli](https://github.com/facebookresearch/voxpopuli), [Multilingual TEDx](http://www.openslr.org/100), [MediaSpeech](https://www.openslr.org/108), and [African Accented French](https://huggingface.co/datasets/gigant/african_accented_french). When using the model, make sure that your speech input is also sampled at 16 kHz.

## Usage

1. To use on a local audio file with the language model

```python
import torch
import torchaudio

from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained("bhuang/asr-wav2vec2-french")
model_sample_rate = processor_with_lm.feature_extractor.sampling_rate

wav_path = "example.wav"  # path to your audio file
waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.squeeze(axis=0)  # drop the channel dimension (assumes mono audio)

# resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# normalize
input_dict = processor_with_lm(waveform, sampling_rate=model_sample_rate, return_tensors="pt")

with torch.inference_mode():
    logits = model(input_dict.input_values.to(device)).logits

# decode with the language model
predicted_sentence = processor_with_lm.batch_decode(logits.cpu().numpy()).text[0]
```

2. To use on a local audio file without the language model

```python
import torch
import torchaudio

from transformers import AutoModelForCTC, Wav2Vec2Processor

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
processor = Wav2Vec2Processor.from_pretrained("bhuang/asr-wav2vec2-french")
model_sample_rate = processor.feature_extractor.sampling_rate

wav_path = "example.wav"  # path to your audio file
waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.squeeze(axis=0)  # drop the channel dimension (assumes mono audio)

# resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# normalize
input_dict = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")

with torch.inference_mode():
    logits = model(input_dict.input_values.to(device)).logits

# decode
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentence = processor.batch_decode(predicted_ids)[0]
```
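To make the decoding step without the language model more concrete, here is a minimal sketch of greedy CTC decoding, which is roughly what happens after the `argmax` above: repeated frame predictions are collapsed, blank tokens are dropped, and the word delimiter becomes a space. The vocabulary, the blank id, and the `ctc_greedy_decode` helper are hypothetical and only for illustration; the real vocabulary ships with the model's tokenizer, and the sketch assumes the usual Wav2Vec2 convention of using the padding token as the CTC blank and `|` as the word delimiter.

```python
from itertools import groupby

# Hypothetical toy vocabulary (assumption); the real one comes with the tokenizer.
BLANK_ID = 0
VOCAB = {0: "<pad>", 1: "|", 2: "b", 3: "o", 4: "n", 5: "j", 6: "u", 7: "r"}


def ctc_greedy_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID, word_delimiter="|"):
    """Turn per-frame argmax ids into text using the CTC collapsing rule."""
    # 1. Collapse runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop the blank token, which only separates repeated characters.
    tokens = [vocab[token_id] for token_id in collapsed if token_id != blank_id]
    # 3. Join characters and map the word delimiter to a space.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# One id per audio frame, as produced by an argmax over the logits.
frame_ids = [2, 2, 0, 3, 3, 4, 0, 0, 5, 3, 3, 6, 6, 7, 1, 1, 0]
print(ctc_greedy_decode(frame_ids))
```

Note that a doubled letter only survives when a blank sits between the two occurrences, for example `[4, 4, 0, 4]` decodes to `nn` while `[4, 4, 4]` decodes to `n`. Decoding with the language model replaces this greedy rule with a beam search over the same logits.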

## Evaluation

1. To evaluate on `mozilla-foundation/common_voice_11_0`

```bash
python eval.py \
  --model_id "bhuang/asr-wav2vec2-french" \
  --dataset "mozilla-foundation/common_voice_11_0" \
  --config "fr" \
  --split "test" \
  --log_outputs \
  --outdir "outputs/results_mozilla-foundatio_common_voice_11_0_with_lm"
```

2. To evaluate on `speech-recognition-community-v2/dev_data`

```bash
python eval.py \
  --model_id "bhuang/asr-wav2vec2-french" \
  --dataset "speech-recognition-community-v2/dev_data" \
  --config "fr" \
  --split "validation" \
  --chunk_length_s 30.0 \
  --stride_length_s 5.0 \
  --log_outputs \
  --outdir "outputs/results_speech-recognition-community-v2_dev_data_with_lm"
```
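The scores reported for this model are word error rates (WER). As a rough illustration of the metric, and not the code used by `eval.py` (which is not shown here and most likely relies on a library implementation), the sketch below computes WER as the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The `word_error_rate` helper is hypothetical, and text normalization such as lowercasing or punctuation removal is deliberately left out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,  # deletion: reference word missing from hypothesis
                    curr[j - 1] + 1,  # insertion: extra word in hypothesis
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref_words)


reference = "le chat dort sur le canapé"
hypothesis = "le chat dort sur canapé"  # one word deleted
print(f"WER: {100 * word_error_rate(reference, hypothesis):.2f}%")
```

The values in the model card metadata are percentages, so a Test WER of 11.44 corresponds to 0.1144 on this scale. A corpus-level score is usually obtained by summing edits and reference words over all utterances rather than averaging per-utterance rates.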