|
# EnCodec: High Fidelity Neural Audio Compression

AudioCraft provides the training code for EnCodec, a state-of-the-art deep learning
based audio codec supporting both mono and stereo audio, presented in the
[High Fidelity Neural Audio Compression][arxiv] paper.
Check out our [sample page][encodec_samples].
|
|
|
## Original EnCodec models

The EnCodec models presented in High Fidelity Neural Audio Compression can be accessed
and used with the [EnCodec repository](https://github.com/facebookresearch/encodec).

**Note**: We do not guarantee compatibility between the AudioCraft and EnCodec codebases
and released checkpoints at this stage.
|
|
|
|
|
## Installation

Please follow the AudioCraft installation instructions from the [README](../README.md).
|
|
|
|
|
## Training

The [CompressionSolver](../audiocraft/solvers/compression.py) implements the audio reconstruction
task to train an EnCodec model. Specifically, it trains an encoder-decoder with a quantization
bottleneck - a SEANet encoder-decoder with a Residual Vector Quantization bottleneck for EnCodec -
using a combination of objective losses and perceptual losses, the latter in the form of
adversarial discriminators.

The default configuration matches a causal EnCodec training at a single bandwidth.
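
For instance, a single training run (outside of a grid) can be launched directly with `dora run`.
The `dset=audio/example` override below assumes the example dataset definition shipped with the
AudioCraft configs; substitute your own dataset definition for actual training
(see [TRAINING.md](./TRAINING.md)):

```shell
# single run with the default causal configuration
dora run solver=compression/encodec_base_24khz
# the same run on a custom dataset definition
dora run solver=compression/encodec_base_24khz dset=audio/example
```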
|
|
|
### Example configuration and grids

We provide sample configurations and grids for training EnCodec models.

The compression configurations are defined in
[config/solver/compression](../config/solver/compression).

The example grids are available at
[audiocraft/grids/compression](../audiocraft/grids/compression).
|
|
|
```shell
# base causal encodec on monophonic audio sampled at 24 kHz
dora grid compression.encodec_base_24khz
# encodec model used for MusicGen on monophonic audio sampled at 32 kHz
dora grid compression.encodec_musicgen_32khz
```
|
|
|
### Training and valid stages

The model is trained using a combination of objective and perceptual losses.
More specifically, EnCodec is trained with the MS-STFT discriminator along with
objective losses, using a loss balancer to weight the contribution of each loss
in an intuitive manner.
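
To give an intuition for what the balancer does, here is a minimal conceptual sketch (a
hypothetical helper, not AudioCraft's actual balancer, which additionally tracks a moving
average of gradient norms): the gradient of each loss with respect to the model output is
renormalized so that its contribution matches its assigned weight, whatever the raw scale
of the loss.

```python
import torch

def balanced_backward(losses: dict, weights: dict, model_output: torch.Tensor,
                      ref_norm: float = 1.0, eps: float = 1e-12):
    """Conceptual loss balancer sketch (hypothetical helper, not AudioCraft's API)."""
    total_weight = sum(weights[name] for name in losses)
    combined = torch.zeros_like(model_output)
    for name, loss in losses.items():
        # Gradient of this loss with respect to the model output only.
        grad, = torch.autograd.grad(loss, model_output, retain_graph=True)
        # Rescale so the gradient norm is proportional to the assigned weight.
        scale = weights[name] / total_weight * ref_norm / (grad.norm() + eps)
        combined += grad * scale
    # Backpropagate the combined, re-balanced gradient through the encoder-decoder.
    model_output.backward(combined)
```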
|
|
|
### Evaluation stage

Evaluation metrics for audio generation:
* SI-SNR: Scale-Invariant Signal-to-Noise Ratio.
* ViSQOL: Virtual Speech Quality Objective Listener.

Note: The path to the ViSQOL binary (compiled with Bazel) needs to be provided in
order to run the ViSQOL metric on the reference and degraded signals.
The metric is disabled by default.
Please refer to the [metrics documentation](../METRICS.md) to learn more.
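
As a sketch, enabling ViSQOL for a run could look as follows; the configuration keys used
below (`evaluate.metrics.visqol` and `metrics.visqol.bin`) are assumptions, so check
[config/solver/compression](../config/solver/compression) for the authoritative names:

```shell
# enable ViSQOL during evaluation (keys assumed, see the solver configs)
dora run solver=compression/encodec_base_24khz \
    evaluate.metrics.visqol=true metrics.visqol.bin=<PATH_TO_VISQOL>
```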
|
|
|
### Generation stage

The generation stage consists of reconstructing audio from dataset samples
with the current model. The number of samples generated and the batch size used are
controlled by the `dataset.generate` configuration. The output path and audio formats
are defined in the generate stage configuration.
|
|
|
```shell
# generate samples every 5 epochs
dora run solver=compression/encodec_base_24khz generate.every=5
# use a different output path inside the dora xp folder
dora run solver=compression/encodec_base_24khz generate.path=<PATH_IN_DORA_XP_FOLDER>
# limit the number of samples or use a different batch size
dora run solver=compression/encodec_base_24khz dataset.generate.num_samples=10 dataset.generate.batch_size=4
```
|
|
|
### Playing with the model

Once you have a trained model, you can retrieve either the entire solver or just
the trained model with the following functions:
|
|
|
```python
from audiocraft.solvers import CompressionSolver

# If you trained a custom model with signature SIG.
model = CompressionSolver.model_from_checkpoint('//sig/SIG')
# If you want to get one of the pretrained models with the `//pretrained/` prefix.
model = CompressionSolver.model_from_checkpoint('//pretrained/facebook/encodec_32khz')
# Or load from a custom checkpoint path
model = CompressionSolver.model_from_checkpoint('/my_checkpoints/foo/bar/checkpoint.th')


# If you only want to use a pretrained model, you can also directly get it
# from the CompressionModel base model class.
from audiocraft.models import CompressionModel

# Here do not put the `//pretrained/` prefix!
model = CompressionModel.get_pretrained('facebook/encodec_32khz')
model = CompressionModel.get_pretrained('dac_44khz')

# Finally, you can also retrieve the full Solver object, with its dataloader etc.
from audiocraft import train
from pathlib import Path
import logging
import os
import sys

# The following line enables detailed logs when loading a Solver.
logging.basicConfig(stream=sys.stderr, level=logging.INFO)
# Solvers must always be loaded from the AudioCraft root directory, so move there.
os.chdir(Path(train.__file__).parent.parent)


# You can also get the full solver (only for your own experiments).
# You can provide some overrides to the parameters to make things more convenient.
solver = train.get_solver_from_sig('SIG', {'device': 'cpu', 'dataset': {'batch_size': 8}})
solver.model
solver.dataloaders
```
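
Once loaded, a `CompressionModel` can tokenize and reconstruct audio. Below is a minimal
round-trip sketch following the `CompressionModel` interface (`encode` returns the discrete
codes along with an optional rescaling factor); the silent waveform is just a stand-in for
your own audio:

```python
import torch
from audiocraft.models import CompressionModel

model = CompressionModel.get_pretrained('facebook/encodec_32khz')
model.eval()

# One second of silence as a stand-in input, shape [batch, channels, time].
wav = torch.zeros(1, model.channels, model.sample_rate)
with torch.no_grad():
    codes, scale = model.encode(wav)   # one stream of discrete codes per codebook
    reconstructed = model.decode(codes, scale)
```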
|
|
|
### Importing / Exporting models

At the moment we do not have a definitive workflow for exporting EnCodec models, for
instance to Hugging Face (HF). We are working on supporting automatic conversion between
the AudioCraft and Hugging Face implementations.

We do provide some support for fine-tuning an EnCodec model coming from HF in AudioCraft,
using for instance `continue_from=//pretrained/facebook/encodec_32khz`.
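
As a sketch, such a fine-tuning run could be launched as follows (the solver choice here is
an example; pick the configuration matching your target sample rate):

```shell
# fine-tune from the pretrained 32 kHz EnCodec checkpoint
dora run solver=compression/encodec_musicgen_32khz continue_from=//pretrained/facebook/encodec_32khz
```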
|
|
|
An AudioCraft checkpoint can be exported in a more compact format (excluding the optimizer, etc.)
using `audiocraft.utils.export.export_encodec`. For instance, you could run
|
```python
from audiocraft.utils import export
from audiocraft import train

xp = train.main.get_xp_from_sig('SIG')
export.export_encodec(
    xp.folder / 'checkpoint.th',
    '/checkpoints/my_audio_lm/compression_state_dict.bin')


from audiocraft.models import CompressionModel
model = CompressionModel.get_pretrained('/checkpoints/my_audio_lm/compression_state_dict.bin')

from audiocraft.solvers import CompressionSolver
# The two are strictly equivalent, but this function also supports loading from
# checkpoints that have not yet been exported.
model = CompressionSolver.model_from_checkpoint('//pretrained//checkpoints/my_audio_lm/compression_state_dict.bin')
```
|
|
|
We then show how to use this model as a tokenizer for MusicGen/AudioGen in the
[MusicGen documentation](./MUSICGEN.md).
|
|
|
### Learn more

Learn more about AudioCraft training pipelines in the [dedicated section](./TRAINING.md).
|
|
|
|
|
## Citation
```
@article{defossez2022highfi,
  title={High Fidelity Neural Audio Compression},
  author={Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
  journal={arXiv preprint arXiv:2210.13438},
  year={2022}
}
```
|
|
|
|
|
## License

See license information in the [README](../README.md).

[arxiv]: https://arxiv.org/abs/2210.13438
[encodec_samples]: https://ai.honu.io/papers/encodec/samples.html
|
|