|
--- |
|
datasets: |
|
- projecte-aina/festcat_trimmed_denoised |
|
- projecte-aina/openslr-slr69-ca-trimmed-denoised |
|
- lj_speech |
|
- blabble-io/libritts_r |
|
license: apache-2.0 |
|
--- |
|
|
|
# Wavenext-mel-22khz |
|
|
|
A WaveNeXt neural vocoder that generates 22.05 kHz speech waveforms from 80-bin mel spectrograms.
|
|
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
|
WaveNeXt is a modification of Vocos in which the final ISTFT layer is replaced with a trainable linear layer that directly predicts speech waveform samples.
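As a minimal illustration of that idea, here is a PyTorch sketch of a linear waveform head; the hidden dimension and hop length are illustrative assumptions, not this model's actual configuration:

```python
import torch
import torch.nn as nn

class LinearWaveformHead(nn.Module):
    """Sketch of a WaveNeXt-style head: instead of predicting STFT
    coefficients for an ISTFT, a linear layer maps each frame's hidden
    features directly to a chunk of waveform samples."""

    def __init__(self, hidden_dim: int = 512, hop_length: int = 256):
        super().__init__()
        # One non-overlapping chunk of hop_length samples per frame.
        self.proj = nn.Linear(hidden_dim, hop_length)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, hidden_dim) -> (batch, frames, hop_length)
        chunks = self.proj(x)
        # Concatenate the chunks along time: (batch, frames * hop_length)
        return chunks.flatten(start_dim=1)

# Example: 100 frames at hop 256 -> 25,600 waveform samples.
head = LinearWaveformHead()
wav = head(torch.randn(1, 100, 512))
print(wav.shape)  # torch.Size([1, 25600])
```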
|
|
|
This version of WaveNeXt uses 80-bin mel spectrograms as acoustic features, which have been widespread in the TTS domain since the introduction of [HiFi-GAN](https://github.com/jik876/hifi-gan/blob/master/meldataset.py).

The goal of this model is to provide an alternative to HiFi-GAN that is faster and compatible with the acoustic output of several TTS models.
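For reference, extracting HiFi-GAN-style 80-bin mel features could look like the torchaudio sketch below; the STFT parameters (n_fft 1024, hop 256, fmax 8000 at 22.05 kHz) are the common HiFi-GAN defaults and are an assumption about this model's exact feature pipeline:

```python
import torch
import torchaudio

# HiFi-GAN-style mel analysis for 22.05 kHz audio (assumed parameters).
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050,
    n_fft=1024,
    win_length=1024,
    hop_length=256,
    n_mels=80,
    f_min=0.0,
    f_max=8000.0,
    power=1.0,           # magnitude spectrogram, as in HiFi-GAN
    norm="slaney",
    mel_scale="slaney",
)

def log_mel(wav: torch.Tensor) -> torch.Tensor:
    # wav: (batch, samples) -> (batch, 80, frames), log-compressed
    return torch.log(torch.clamp(mel_transform(wav), min=1e-5))
```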
|
|
|
## Intended Uses and Limitations
|
|
|
|
The model is intended to serve as a vocoder that synthesizes audio waveforms from mel spectrograms. It is trained to generate speech; if it is used on other audio domains, it is likely that the model won't produce high-quality samples.
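Since WaveNeXt builds on the Vocos codebase, inference would plausibly follow a Vocos-style API. The snippet below is a hypothetical sketch only: the checkpoint id and method names are placeholders, so check the repository for the actual entry point.

```python
import torch
from vocos import Vocos  # assumes a Vocos-style interface

# Placeholder checkpoint id, not a confirmed repository name.
vocoder = Vocos.from_pretrained("BSC-LT/wavenext-mel-22khz")

mel = torch.randn(1, 80, 256)    # (batch, n_mels, frames) mel spectrogram
with torch.no_grad():
    audio = vocoder.decode(mel)  # (batch, samples) waveform at 22.05 kHz
```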
|
|
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
|
|
|
The model was trained on four speech datasets:
|
|
|
| Dataset | Language | Hours | |
|
|---------------------|----------|---------| |
|
| LibriTTS-R          | en       | 585     |
|
| LJSpeech | en | 24 | |
|
| Festcat | ca | 22 | |
|
| OpenSLR69 | ca | 5 | |
|
|
|
|
|
### Training Procedure |
|
|
|
|
The model was trained for 1M steps (96 epochs) with a batch size of 16 for stability. We used a cosine scheduler with an initial learning rate of 1e-4.

We also modified the mel spectrogram loss to use 128 bins and an fmax of 11025 Hz, instead of matching the configuration of the input mel spectrogram.
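A sketch of such a loss term is shown below; only the 128 bins and the 11025 Hz fmax come from the description above, while the remaining STFT parameters are illustrative assumptions:

```python
import torch
import torchaudio

# Mel analysis for the loss: finer than the 80-bin input features
# (128 bins, fmax 11025 Hz); n_fft and hop are assumptions.
loss_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050,
    n_fft=1024,
    hop_length=256,
    n_mels=128,
    f_max=11025.0,
)

def mel_spectrogram_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # L1 distance between log-mel spectrograms of predicted and target audio.
    eps = 1e-5
    return torch.nn.functional.l1_loss(
        torch.log(loss_mel(pred).clamp(min=eps)),
        torch.log(loss_mel(target).clamp(min=eps)),
    )
```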
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
|
|
* initial_learning_rate: 5e-4 |
|
* scheduler: cosine without warmup or restarts |
|
* mel_loss_coeff: 45 |
|
* mrd_loss_coeff: 0.1 |
|
* batch_size: 16 |
|
* num_samples: 16384 |
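To make the schedule concrete, here is a minimal sketch of an equivalent optimizer setup in PyTorch; the optimizer choice (AdamW) is an assumption, while the learning rate and schedule match the list above:

```python
import torch

model = torch.nn.Linear(80, 256)  # stand-in for the actual vocoder

# AdamW is an assumed optimizer choice; the lr matches initial_learning_rate.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

# Cosine annealing over the full 1M training steps, no warmup or restarts.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1_000_000)

# Inside the training loop, call scheduler.step() after each optimizer.step().
```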
|
|
|
## Evaluation |
|
|
|
|
|
|
Evaluation was done using the metrics from the original repo; after 183 epochs we achieved:
|
|
|
* val_loss: 3.79 |
|
* f1_score: 0.94 |
|
* mel_loss: 0.27 |
|
* periodicity_loss: 0.128
|
* pesq_score: 3.27 |
|
* pitch_loss: 31.33 |
|
* utmos_score: 3.20 |
|
|
|
|
|
## Citation |
|
|
|
|
|
|
If this code contributes to your research, please cite the following works:
|
|
|
``` |
|
@INPROCEEDINGS{10389765, |
|
author={Okamoto, Takuma and Yamashita, Haruki and Ohtani, Yamato and Toda, Tomoki and Kawai, Hisashi}, |
|
booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)}, |
|
title={WaveNeXt: ConvNeXt-Based Fast Neural Vocoder Without ISTFT layer}, |
|
year={2023}, |
|
|
pages={1-8}, |
|
keywords={Fourier transforms;Vocoders;Conferences;Automatic speech recognition;ConvNext;end-to-end text-to-speech;linear layer-based upsampling;neural vocoder;Vocos}, |
|
doi={10.1109/ASRU57964.2023.10389765}} |
|
|
|
@article{siuzdak2023vocos, |
|
title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis}, |
|
author={Siuzdak, Hubert}, |
|
journal={arXiv preprint arXiv:2306.00814}, |
|
year={2023} |
|
} |
|
``` |
|
|
|
## Additional information |
|
|
|
### Author |
|
The Language Technologies Unit from Barcelona Supercomputing Center. |
|
|
|
### Contact |
|
For further information, please send an email to <langtech@bsc.es>. |
|
|
|
### Copyright |
|
Copyright (c) 2024 by the Language Technologies Unit, Barcelona Supercomputing Center.
|
|
|
### License |
|
[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
|
|
|
### Funding |
|
|
|
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/). |