---
license: mit
language: fr
datasets:
- mozilla-foundation/common_voice_13_0
metrics:
- per
tags:
- audio
- automatic-speech-recognition
- speech
- phonemize
model-index:
- name: Wav2Vec2-base French finetuned for phonemes by LMSSC
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice v13
      type: mozilla-foundation/common_voice_13_0
      args: fr
    metrics:
    - name: Test PER on Common Voice FR 13.0 | Trained
      type: per
      value: 5.52
    - name: Test PER on Multilingual Librispeech FR | Trained
      type: per
      value: 4.36
    - name: Val PER on Common Voice FR 13.0 | Trained
      type: per
      value: 4.31
---

# Fine-tuned French Voxpopuli v2 wav2vec2-base model for speech-to-phoneme task in French

Fine-tuned [facebook/wav2vec2-base-fr-voxpopuli-v2](https://huggingface.co/facebook/wav2vec2-base-fr-voxpopuli-v2) for **French speech-to-phoneme** transcription (without a language model), using the train and validation splits of [Common Voice v13](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0).

## Audio sample rate

When using this model, make sure that your speech input is **sampled at 16 kHz**.

## Output

As this model is trained specifically for a speech-to-phoneme task, the output is a sequence of [IPA-encoded](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) words, without punctuation. If you don't read the phonetic alphabet fluently, you can use this excellent [IPA reader website](http://ipa-reader.xyz) to convert the transcript back into synthetic speech and check the quality of the phonetic transcription.
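Since the model expects 16 kHz input, audio recorded at other rates must be resampled first. In practice you would use `torchaudio.functional.resample` or `librosa.resample`; the snippet below is only a minimal, dependency-light sketch using linear interpolation with NumPy, to illustrate the idea.

```python
import numpy as np

def resample_to_16k(waveform: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Naive linear-interpolation resampler.

    For real use, prefer torchaudio.functional.resample or librosa.resample,
    which apply proper anti-aliasing filters.
    """
    if orig_sr == target_sr:
        return waveform
    duration = waveform.shape[-1] / orig_sr
    n_target = int(round(duration * target_sr))
    old_t = np.linspace(0.0, duration, num=waveform.shape[-1], endpoint=False)
    new_t = np.linspace(0.0, duration, num=n_target, endpoint=False)
    return np.interp(new_t, old_t, waveform)

# Example: 1 second of 48 kHz audio becomes 16000 samples
audio_48k = np.random.randn(48000)
audio_16k = resample_to_16k(audio_48k, orig_sr=48000)
print(audio_16k.shape)  # (16000,)
```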
## Training procedure

The model was fine-tuned on Common Voice v13 (FR) for 14 epochs on 4x 2080 Ti GPUs, using a DDP strategy with gradient accumulation (256 audio clips per update, corresponding roughly to 25 minutes of speech per update, i.e. about 2k updates per epoch).

- Learning rate schedule: double tri-state schedule
  - Warmup from 1e-5 for 7% of total updates
  - Constant at 1e-4 for 28% of total updates
  - Linear decrease to 1e-6 for 36% of total updates
  - Second warmup boost to 3e-5 for 3% of total updates
  - Constant at 3e-5 for 12% of total updates
  - Linear decrease to 1e-7 for the remaining 14% of updates
- The other training hyperparameters are the same as those detailed in Annex B and Table 6 of the [wav2vec 2.0 paper](https://arxiv.org/pdf/2006.11477.pdf).

## Usage (with HuggingSound)

The model can be used directly with the [HuggingSound](https://github.com/jonatasgrosman/huggingsound) library:

```python
from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("Cnam-LMSSC/wav2vec2-french-phonemizer")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]
transcriptions = model.transcribe(audio_paths)
```

## Usage
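If you prefer plain `transformers` over HuggingSound, a minimal inference sketch could look like the following. This is an assumption on my part (the card does not show it): it uses the standard `Wav2Vec2Processor`/`Wav2Vec2ForCTC` API, and `speech` is assumed to be a 1-D float waveform already sampled at 16 kHz. Running it downloads the checkpoint from the Hub.

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

MODEL_ID = "Cnam-LMSSC/wav2vec2-french-phonemizer"

def phonemize(speech, sampling_rate=16000):
    """Transcribe a 16 kHz waveform to a sequence of IPA phonemes.

    Sketch only: assumes the checkpoint exposes the standard
    Wav2Vec2 CTC head; downloads the model on first call.
    """
    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
    inputs = processor(speech, sampling_rate=sampling_rate,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    pred_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(pred_ids)
```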