---
license: mit
tags:
- audio-feature-extraction
- speech-language-models
- gpt4-o
- tokenizer
- codec-representation
- text-to-speech
- automatic-speech-recognition
---
# WavTokenizer: SOTA Discrete Codec Models With Forty Tokens Per Second for Audio Language Modeling
[![arXiv](https://img.shields.io/badge/arXiv-Paper-b31b1b.svg)](https://arxiv.org/abs/2408.16532)
[![demo](https://img.shields.io/badge/WavTokenizer-Demo-red)](https://wavtokenizer.github.io/)
[![model](https://img.shields.io/badge/%F0%9F%A4%97%20WavTokenizer-Models-blue)](https://huggingface.co/novateur/WavTokenizer)
### 🎉🎉 With WavTokenizer, you can represent speech, music, and audio with only 40 tokens per second!
### 🎉🎉 With WavTokenizer, you can get strong reconstruction quality.
### 🎉🎉 WavTokenizer carries rich semantic information and is built for audio language models such as GPT-4o.
# 🔥 News
- *2024.08*: We released WavTokenizer on arXiv.
![result](result.png)
## Installation
To use WavTokenizer, create a fresh environment and install the dependencies:
```bash
conda create -n wavtokenizer python=3.9
conda activate wavtokenizer
pip install -r requirements.txt
```
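As a quick sanity check, you can confirm that the dependencies and the repository's modules import cleanly (a minimal sketch; run it from the repository root so that the `encoder` and `decoder` packages resolve):
```python
# Verify that the core dependencies and the repo's modules import correctly.
import torch
import torchaudio
from decoder.pretrained import WavTokenizer

print(torch.__version__, torchaudio.__version__)
```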
## Inference
### Part 1: Reconstruct audio from a raw wav file
```python
from encoder.utils import convert_audio
import torchaudio
import torch
from decoder.pretrained import WavTokenizer
device = torch.device('cpu')

config_path = "./configs/xxx.yaml"
model_path = "./xxx.ckpt"
audio_path = "xxx"     # input wav file
audio_outpath = "xxx"  # where to write the reconstruction

# Load the pretrained tokenizer from a config/checkpoint pair.
wavtokenizer = WavTokenizer.from_pretrained0802(config_path, model_path)
wavtokenizer = wavtokenizer.to(device)

# Resample to 24 kHz mono, the format the model expects.
wav, sr = torchaudio.load(audio_path)
wav = convert_audio(wav, sr, 24000, 1)
bandwidth_id = torch.tensor([0])
wav = wav.to(device)

# Encode to features and discrete codes, then decode back to a waveform.
features, discrete_code = wavtokenizer.encode_infer(wav, bandwidth_id=bandwidth_id)
audio_out = wavtokenizer.decode(features, bandwidth_id=bandwidth_id)
torchaudio.save(audio_outpath, audio_out, sample_rate=24000, encoding='PCM_S', bits_per_sample=16)
```
### Part 2: Generate discrete codes
```python
from encoder.utils import convert_audio
import torchaudio
import torch
from decoder.pretrained import WavTokenizer
device = torch.device('cpu')

config_path = "./configs/xxx.yaml"
model_path = "./xxx.ckpt"
audio_path = "xxx"  # input wav file

wavtokenizer = WavTokenizer.from_pretrained0802(config_path, model_path)
wavtokenizer = wavtokenizer.to(device)

wav, sr = torchaudio.load(audio_path)
wav = convert_audio(wav, sr, 24000, 1)
bandwidth_id = torch.tensor([0])
wav = wav.to(device)

# Keep only the discrete codes; the continuous features are discarded here.
_, discrete_code = wavtokenizer.encode_infer(wav, bandwidth_id=bandwidth_id)
print(discrete_code)
```
### Part 3: Reconstruct audio from discrete codes
```python
# audio_tokens: discrete codes with shape [n_q, 1, t] or [n_q, t],
# e.g. the discrete_code produced in Part 2 (wavtokenizer loaded as above)
features = wavtokenizer.codes_to_features(audio_tokens)
bandwidth_id = torch.tensor([0])
audio_out = wavtokenizer.decode(features, bandwidth_id=bandwidth_id)
```
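Putting Parts 2 and 3 together gives a full code round trip. The sketch below assumes the codes returned by `encode_infer` already match the `[n_q, 1, t]` layout noted above, and reuses the `wavtokenizer` and `wav` prepared in Part 2; the output filename is a placeholder:
```python
# Encode to discrete codes, then reconstruct the waveform from codes alone.
bandwidth_id = torch.tensor([0])
_, discrete_code = wavtokenizer.encode_infer(wav, bandwidth_id=bandwidth_id)

features = wavtokenizer.codes_to_features(discrete_code)
audio_out = wavtokenizer.decode(features, bandwidth_id=bandwidth_id)
torchaudio.save("roundtrip.wav", audio_out, sample_rate=24000, encoding='PCM_S', bits_per_sample=16)
```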
## Available models
🤗 links point to the Hugging Face model hub.
| Model name | HuggingFace | Training data | Tokens/s | Domain | Open-source |
|:-----------|:-----------:|:-------------:|:--------:|:------:|:-----------:|
| WavTokenizer-small-600-24k-4096 | [🤗](https://huggingface.co/novateur/WavTokenizer/blob/main/WavTokenizer_small_600_24k_4096.ckpt) | LibriTTS | 40 | Speech | ✅ |
| WavTokenizer-small-320-24k-4096 | [🤗](https://huggingface.co/novateur/WavTokenizer/blob/main/WavTokenizer_small_320_24k_4096.ckpt) | LibriTTS | 75 | Speech | ✅ |
| WavTokenizer-medium-600-24k-4096 | [🤗](https://github.com/jishengpeng/wavtokenizer) | 10,000 hours | 40 | Speech, Audio, Music | Coming soon |
| WavTokenizer-medium-320-24k-4096 | [🤗](https://github.com/jishengpeng/wavtokenizer) | 10,000 hours | 75 | Speech, Audio, Music | Coming soon |
| WavTokenizer-large-600-24k-4096 | [🤗](https://github.com/jishengpeng/wavtokenizer) | 80,000 hours | 40 | Speech, Audio, Music | Coming soon |
| WavTokenizer-large-320-24k-4096 | [🤗](https://github.com/jishengpeng/wavtokenizer) | 80,000 hours | 75 | Speech, Audio, Music | Coming soon |
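The released checkpoints can also be fetched programmatically. A minimal sketch using `huggingface_hub` (an extra dependency, installable with `pip install huggingface_hub`):
```python
from huggingface_hub import hf_hub_download

# Download the small 40 tokens/s checkpoint from the table above; the returned
# local path can be passed as model_path in the inference snippets.
ckpt_path = hf_hub_download(
    repo_id="novateur/WavTokenizer",
    filename="WavTokenizer_small_600_24k_4096.ckpt",
)
print(ckpt_path)
```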
## Training
### Step 1: Prepare the training dataset
```python
# Process your data into a filelist with the same format as ./data/demo.txt
```
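The authoritative line format is whatever `./data/demo.txt` uses; a common convention for such filelists is one audio path per line. A helper sketch under that assumption:
```python
from pathlib import Path

# Assumption: one absolute wav path per line, as in many filelist formats.
# Compare the output against ./data/demo.txt before training.
wav_dir = Path("/path/to/your/corpus")  # hypothetical corpus location
with open("./data/train.txt", "w") as f:
    for wav_file in sorted(wav_dir.rglob("*.wav")):
        f.write(f"{wav_file.resolve()}\n")
```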
### Step 2: Modify the configuration file
```python
# ./configs/xxx.yaml
# Adjust parameters such as batch_size, filelist_path, save_dir, and device
```
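For orientation only, a hypothetical excerpt of the kind of fields to adjust; the key names and nesting below are illustrative, not the repository's actual schema, so treat `./configs/xxx.yaml` as the source of truth:
```yaml
# Illustrative only: the real key names and nesting live in ./configs/xxx.yaml.
data:
  filelist_path: ./data/train.txt  # the filelist produced in Step 1
  batch_size: 16
trainer:
  devices: [0]                     # which GPU(s) to train on
  default_root_dir: ./results      # save_dir equivalent
```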
### Step 3: Start the training process
Refer to the [PyTorch Lightning documentation](https://lightning.ai/docs/pytorch/stable/) for details on customizing the
training pipeline.
```bash
cd ./WavTokenizer
python train.py fit --config ./configs/xxx.yaml
```
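Since training is driven by the Lightning CLI, an interrupted run should be resumable with the standard `--ckpt_path` flag (the checkpoint path below is a placeholder under your configured save directory):
```bash
# Resume training from a saved checkpoint (standard Lightning CLI behavior).
python train.py fit --config ./configs/xxx.yaml --ckpt_path ./xxx.ckpt
```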
## Citation
If this code contributes to your research, please cite our work on WavTokenizer and Language-Codec:
```bibtex
@article{ji2024wavtokenizer,
title={WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling},
author={Ji, Shengpeng and Jiang, Ziyue and Cheng, Xize and Chen, Yifu and Fang, Minghui and Zuo, Jialong and Yang, Qian and Li, Ruiqi and Zhang, Ziang and Yang, Xiaoda and others},
journal={arXiv preprint arXiv:2408.16532},
year={2024}
}
@article{ji2024language,
title={Language-codec: Reducing the gaps between discrete codec representation and speech language models},
  author={Ji, Shengpeng and Fang, Minghui and Jiang, Ziyue and Huang, Rongjie and Zuo, Jialong and Wang, Shulei and Zhao, Zhou},
journal={arXiv preprint arXiv:2402.12208},
year={2024}
}
``` |