---
license: mit
tags:
- text-to-audio
- controlnet
---

<img src="arts/ezaudio.png">

# EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

🟣 EzAudio is a diffusion-based text-to-audio generation model. Designed for real-world audio applications, EzAudio combines high-quality audio synthesis with lower computational demands.

🎛 Try EzAudio on Hugging Face Spaces: [EzAudio: Text-to-Audio Generation, Editing, and Inpainting](https://huggingface.co/spaces/OpenSound/EzAudio)

🎮 EzAudio-ControlNet is available: [EzAudio-ControlNet](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet)!
## Installation

Clone the repository:
```
git clone git@github.com:haidog-yaqub/EzAudio.git
```
Install the dependencies:
```
cd EzAudio
pip install -r requirements.txt
```
Download the checkpoints from: [https://huggingface.co/OpenSound/EzAudio](https://huggingface.co/OpenSound/EzAudio)
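
If you prefer to fetch the checkpoints programmatically, a minimal sketch using `huggingface_hub` is shown below; the `ckpts` target directory is an assumption chosen to match the paths in the usage example that follows.

```python
# Minimal sketch: mirror the released checkpoints into a local folder.
# The local_dir "ckpts" is an assumption chosen to match the config and
# checkpoint paths used in the usage example below; adjust as needed.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="OpenSound/EzAudio", local_dir="ckpts")
```
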
## Usage

You can use the model with the following code:

```python
import torch

from api.ezaudio import load_models, generate_audio

# model and config paths
config_name = 'ckpts/ezaudio-xl.yml'
ckpt_path = 'ckpts/s3/ezaudio_s3_xl.pt'
vae_path = 'ckpts/vae/1m.pt'
# save_path = 'output/'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# load the autoencoder, diffusion backbone, text encoder, and scheduler
(autoencoder, unet, tokenizer,
 text_encoder, noise_scheduler, params) = load_models(config_name, ckpt_path,
                                                      vae_path, device)

# generate a waveform and its sample rate from a text prompt
prompt = "a dog barking in the distance"
sr, audio = generate_audio(prompt, autoencoder, unet, tokenizer, text_encoder, noise_scheduler, params, device)
```
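
To listen to the result, you can write the returned sample rate and waveform to a WAV file. A minimal sketch using `soundfile` (assumed to be available in your environment; the output filename is illustrative):

```python
# Minimal sketch: save the generated waveform to disk with soundfile.
# "output.wav" is an illustrative filename, not part of the EzAudio API.
import soundfile as sf

sf.write('output.wav', audio, sr)
```
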
## Todo
- [x] Release Gradio demo along with checkpoints: [EzAudio Space](https://huggingface.co/spaces/OpenSound/EzAudio)
- [x] Release ControlNet demo along with checkpoints: [EzAudio ControlNet Space](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet)
- [x] Release inference code
- [ ] Release checkpoints for stage 1 and stage 2
- [ ] Release training pipeline and dataset

## Reference

If you find the code useful for your research, please consider citing:

```bibtex
@article{hai2024ezaudio,
  title={EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer},
  author={Hai, Jiarui and Xu, Yong and Zhang, Hao and Li, Chenxing and Wang, Helin and Elhilali, Mounya and Yu, Dong},
  journal={arXiv preprint arXiv:2409.10819},
  year={2024}
}
```

## Acknowledgement
Some code is borrowed from or inspired by: [U-ViT](https://github.com/baofff/U-ViT), [PixArt-alpha](https://github.com/PixArt-alpha/PixArt-alpha), [Hunyuan-DiT](https://github.com/Tencent/HunyuanDiT), and [Stable Audio](https://github.com/Stability-AI/stable-audio-tools).