---
license: mit
language: fr
datasets:
- mozilla-foundation/common_voice_13_0
metrics:
- per
tags:
- audio
- automatic-speech-recognition
- speech
- phonemize
model-index:
- name: Wav2Vec2-base French finetuned for phonemes by LMSSC
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice v13
      type: mozilla-foundation/common_voice_13_0
      args: fr
    metrics:
    - name: Test PER on Common Voice FR 13.0 | Trained
      type: per
      value: 5.52
    - name: Test PER on Multilingual Librispeech FR | Trained
      type: per
      value: 4.36
    - name: Val PER on Common Voice FR 13.0 | Trained
      type: per
      value: 4.31
---

# Fine-tuned French Voxpopuli v2 wav2vec2-base model for speech-to-phoneme task in French

Fine-tuned [facebook/wav2vec2-base-fr-voxpopuli-v2](https://huggingface.co/facebook/wav2vec2-base-fr-voxpopuli-v2) for **French speech-to-phoneme** (without language model) using the train and validation splits of [Common Voice v13](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0).

## Audio sample rate for usage

When using this model, make sure that your speech input is **sampled at 16 kHz**.

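If a recording is not already at 16 kHz, it can be resampled before being passed to the model. The sketch below shows one possible way to do this, assuming NumPy and SciPy are installed. The helper name `to_16k_mono` is hypothetical (it is not part of this model or of any library), and averaging the channels to obtain mono audio is an assumption about what suits your data.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_RATE = 16_000  # sample rate expected by the model


def to_16k_mono(audio: np.ndarray, sample_rate: int) -> np.ndarray:
    """Return a mono float32 signal resampled to 16 kHz (hypothetical helper)."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:
        # Assumes a (frames, channels) layout; average the channels to get mono
        audio = audio.mean(axis=1)
    if sample_rate == TARGET_RATE:
        return audio
    # Reduce the rate ratio so the polyphase filter uses the smallest factors
    common = gcd(TARGET_RATE, sample_rate)
    up, down = TARGET_RATE // common, sample_rate // common
    return resample_poly(audio, up, down).astype(np.float32)


# One second of a 440 Hz tone at 44.1 kHz becomes 16,000 samples
t = np.arange(44_100) / 44_100
tone = np.sin(2 * np.pi * 440 * t)
resampled = to_16k_mono(tone, 44_100)
print(resampled.shape)
```

The returned array can then be passed to the processor with `sampling_rate=16_000`.
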
## Output

As this model is specifically trained for a speech-to-phoneme task, the output is a sequence of [IPA-encoded](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) words, without punctuation.
If you don't read the phonetic alphabet fluently, you can use this excellent [IPA reader website](http://ipa-reader.xyz) to convert the transcript back to synthetic speech and check the quality of the phonetic transcription.

## Training procedure

The model was fine-tuned on Common Voice v13 (FR) for 14 epochs on 4x2080 Ti GPUs using a DDP strategy and gradient accumulation (256 audio clips per update, corresponding to roughly 25 minutes of speech per update, i.e. about 2k updates per epoch).

- Learning rate schedule: double tri-state schedule
  - Warmup from 1e-5 for 7% of total updates
  - Constant at 1e-4 for 28% of total updates
  - Linear decrease to 1e-6 for 36% of total updates
  - Second warmup boost to 3e-5 for 3% of total updates
  - Constant at 3e-5 for 12% of total updates
  - Linear decrease to 1e-7 for remaining 14% of updates

- The hyperparameters used for training are the same as those detailed in Appendix B and Table 6 of the [wav2vec2 paper](https://arxiv.org/pdf/2006.11477.pdf).

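The schedule above can be sketched as a piecewise-linear function of training progress. This is only an illustration and not the authors' training code: the function name `double_tri_state_lr` is hypothetical, both warmups are assumed to be linear, the first warmup is assumed to end at 1e-4, the second is assumed to start from 1e-6, and the total of 28,000 updates is an estimate from the figures above (14 epochs at about 2k updates each).

```python
# Each phase: (fraction of total updates, starting LR, ending LR).
# The fractions come from the list above; the linear interpolation inside
# each phase and the warmup endpoints are assumptions.
PHASES = [
    (0.07, 1e-5, 1e-4),  # first warmup
    (0.28, 1e-4, 1e-4),  # first constant plateau
    (0.36, 1e-4, 1e-6),  # first linear decrease
    (0.03, 1e-6, 3e-5),  # second warmup boost
    (0.12, 3e-5, 3e-5),  # second constant plateau
    (0.14, 3e-5, 1e-7),  # final linear decrease
]


def double_tri_state_lr(step: int, total_steps: int) -> float:
    """Learning rate at a given update (hypothetical helper)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    phase_start = 0.0
    for fraction, lr_start, lr_end in PHASES:
        phase_end = phase_start + fraction
        if progress < phase_end:
            # Position inside the current phase, between 0 and 1
            t = (progress - phase_start) / fraction
            return lr_start + t * (lr_end - lr_start)
        phase_start = phase_end
    # Past the last phase boundary: stay at the final learning rate
    return PHASES[-1][2]


TOTAL_UPDATES = 14 * 2_000  # rough estimate, see the lead-in above
for step in (0, 5_600, 22_400, TOTAL_UPDATES):
    print(step, double_tri_state_lr(step, TOTAL_UPDATES))
```

A function of this shape could be wrapped in `torch.optim.lr_scheduler.LambdaLR` by dividing its output by the optimizer's base learning rate.
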
## Usage (with HuggingSound)

The model can be used directly using the [HuggingSound](https://github.com/jonatasgrosman/huggingsound) library:

```python
import pandas as pd
from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("Cnam-LMSSC/wav2vec2-french-phonemizer")
audio_paths = ["./test_relecture_texte.wav", "./10179_11051_000021.flac"]

# No need for the audio files to be sampled at 16 kHz here,
# they are automatically resampled by HuggingSound

transcriptions = model.transcribe(audio_paths)

# (Optional) Display results in a table:
# `transcriptions` is a list of dicts that also contain timestamps and probabilities

df = pd.DataFrame(transcriptions)
df['Audio file'] = audio_paths
df.set_index('Audio file', inplace=True)
df[['transcription']]
```

**Output**:

| **Audio file** | **Phonetic transcription (IPA)** |
|:---------------------------|:-----------------------------------------|
| ./test_relecture_texte.wav | ʃapitʁ di də abɛse pəti kɔ̃t də ʒyl ləmɛtʁ ɑ̃ʁʒistʁe puʁ libʁivɔksɔʁɡ ibis dɑ̃ la bas kuʁ dœ̃ ʃato sə tʁuva paʁmi tut sɔʁt də volaj œ̃n ibis ʁɔz |
| ./10179_11051_000021.flac | kɛl dɔmaʒ kə sə nə swa pa dy sykʁ supiʁa se foʁaz ɑ̃ pasɑ̃ sa lɑ̃ɡ syʁ la vitʁ fɛ̃ dy ʃapitʁ kɛ̃z ɑ̃ʁʒistʁe paʁ sonjɛ̃ sɛt ɑ̃ʁʒistʁəmɑ̃ fɛ paʁti dy domɛn pyblik |

## Inference script (if you do not want to use HuggingSound)

```python
import torch
import soundfile as sf  # Or librosa, if you prefer
from transformers import AutoModelForCTC, Wav2Vec2Processor

MODEL_ID = "Cnam-LMSSC/wav2vec2-french-phonemizer"

model = AutoModelForCTC.from_pretrained(MODEL_ID)
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)

# sf.read returns the samples and the sample rate of the file.
# Make sure you have a mono audio file sampled at 16 kHz, or resample it first!
audio, sample_rate = sf.read('example.wav')

inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")

# Forward pass without gradient tracking
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: keep the most likely token at each frame
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)

print("Phonetic transcription : ", transcription)
```

**Output**:

'ʒə syi tʁɛ kɔ̃tɑ̃ də vu pʁezɑ̃te notʁ solysjɔ̃ puʁ fonomize dez odjo fasilmɑ̃ sa fɔ̃ksjɔn kɑ̃ mɛm tʁɛ bjɛ̃'

## Test Results

The table below reports the Phoneme Error Rate (PER), i.e. the character error rate (CER) of the IPA output, on both Common Voice and Multilingual Librispeech (French configuration of each dataset), for the model fine-tuned on the Common Voice train set only:

| Model | Test Set | PER |
| ------------- | ------------- | ------------- |
| Cnam-LMSSC/wav2vec2-french-phonemizer | Common Voice v13 (French) | **5.52%** |
| Cnam-LMSSC/wav2vec2-french-phonemizer | Multilingual Librispeech (French) | **4.36%** |

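For readers who want to score their own transcriptions, a PER can be computed as the edit distance between the reference and predicted phoneme sequences, divided by the reference length. The sketch below is a minimal pure-Python version and not the evaluation code behind the figures above: the function names are hypothetical, and how the reference phonemes were obtained, how nasal vowels such as `ɔ̃` were tokenized, and whether spaces were counted are not specified here, so the inputs are simply treated as lists of symbols.

```python
from typing import Sequence


def edit_distance(reference: Sequence[str], hypothesis: Sequence[str]) -> int:
    """Levenshtein distance between two symbol sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_symbol in enumerate(reference, start=1):
        current = [i]
        for j, hyp_symbol in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_symbol != hyp_symbol)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def phoneme_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """PER in percent: edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one phoneme")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


# "bonjour" with the nasal vowel recognized as a plain vowel: 1 error out of 5
reference = ["b", "ɔ̃", "ʒ", "u", "ʁ"]
hypothesis = ["b", "ɔ", "ʒ", "u", "ʁ"]
print(phoneme_error_rate(reference, hypothesis))  # 20.0
```
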
## Citation

If you use this fine-tuned model in a publication, please cite our work as follows:

```bibtex
@misc{lmssc-wav2vec2-base-phonemizer-french,
  title={Fine-tuned wav2vec2 base model for speech to phoneme in {F}rench},
  author={Malo, Olivier and Julien, Hauret and {\'E}ric, Bavu},
  howpublished={\url{https://huggingface.co/Cnam-LMSSC/wav2vec2-french-phonemizer}},
  year={2023}
}
```