---
license: mit
tags:
- text-to-audio
- controlnet
---

<img src="arts/ezaudio.png">

# EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

🟣 EzAudio is a diffusion-based text-to-audio generation model. Designed for real-world audio applications, it combines high-quality audio synthesis with low computational demands.

🎛 Play EzAudio on Hugging Face Space: [EzAudio: Text-to-Audio Generation, Editing, and Inpainting](https://huggingface.co/spaces/OpenSound/EzAudio)

🎮 EzAudio-ControlNet is available: [EzAudio-ControlNet](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet)!

## Installation

Clone the repository:
```
git clone git@github.com:haidog-yaqub/EzAudio.git
```
Install the dependencies:
```
cd EzAudio
pip install -r requirements.txt
```
Download checkpoints from: [https://huggingface.co/OpenSound/EzAudio](https://huggingface.co/OpenSound/EzAudio)
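
If you prefer to fetch the checkpoints programmatically, here is a minimal sketch using `huggingface_hub`; the `local_dir='ckpts'` choice is an assumption made to match the paths used in the usage example below:

```python
# Sketch: download the EzAudio checkpoint repository with huggingface_hub.
# local_dir='ckpts' is an assumption chosen to match the ckpts/ paths below.
from huggingface_hub import snapshot_download

snapshot_download(repo_id='OpenSound/EzAudio', local_dir='ckpts')
```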

## Usage

You can use the model with the following code:

```python
import torch

from api.ezaudio import load_models, generate_audio

# model and config paths
config_name = 'ckpts/ezaudio-xl.yml'
ckpt_path = 'ckpts/s3/ezaudio_s3_xl.pt'
vae_path = 'ckpts/vae/1m.pt'
# save_path = 'output/'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# load model
(autoencoder, unet, tokenizer,
 text_encoder, noise_scheduler, params) = load_models(config_name, ckpt_path,
                                                      vae_path, device)

# generate audio from a text prompt
prompt = "a dog barking in the distance"
sr, audio = generate_audio(prompt, autoencoder, unet, tokenizer, text_encoder, noise_scheduler, params, device)
```
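
To keep the result, you can write it to disk. A minimal sketch using `soundfile`, assuming `generate_audio` returns the sample rate and a NumPy waveform (as the tuple unpacking above suggests); the `output/` path mirrors the commented-out `save_path` and is an assumption:

```python
import os

import soundfile as sf

# Sketch: save the generated waveform as a WAV file.
os.makedirs('output', exist_ok=True)
sf.write(os.path.join('output', 'dog_barking.wav'), audio, sr)
```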

## Todo
- [x] Release Gradio Demo along with checkpoints [EzAudio Space](https://huggingface.co/spaces/OpenSound/EzAudio)
- [x] Release ControlNet Demo along with checkpoints [EzAudio ControlNet Space](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet)
- [x] Release inference code
- [ ] Release checkpoints for stage1 and stage2
- [ ] Release training pipeline and dataset

## Reference

If you find the code useful for your research, please consider citing:

```bibtex
@article{hai2024ezaudio,
  title={EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer},
  author={Hai, Jiarui and Xu, Yong and Zhang, Hao and Li, Chenxing and Wang, Helin and Elhilali, Mounya and Yu, Dong},
  journal={arXiv preprint arXiv:2409.10819},
  year={2024}
}
```

## Acknowledgement
Some code is borrowed from or inspired by: [U-ViT](https://github.com/baofff/U-ViT), [PixArt-alpha](https://github.com/PixArt-alpha/PixArt-alpha), [Hunyuan-DiT](https://github.com/Tencent/HunyuanDiT), and [Stable Audio](https://github.com/Stability-AI/stable-audio-tools).