license: cc-by-nc-nd-4.0
MusicLDM
MusicLDM is a latent text-to-audio diffusion model capable of generating music samples from a text input. It is available in the 🧨 Diffusers library from v0.21.0 onwards.
Model Details
MusicLDM was proposed in MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies by Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
Inspired by Stable Diffusion and AudioLDM, MusicLDM is a text-to-music latent diffusion model (LDM) that learns continuous audio representations from CLAP latents.
MusicLDM is trained on a corpus of 466 hours of music data. Beat-synchronous data augmentation strategies are applied to the music samples, both in the time domain and in the latent space. These strategies encourage the model to interpolate between training samples while staying within the domain of the training data. The result is generated music that is more diverse while remaining faithful to the corresponding style.
Model Sources
Usage
First, install the required packages:
pip install --upgrade diffusers transformers accelerate
Text-to-Music
For text-to-music generation, the MusicLDMPipeline can be used to load pre-trained weights and generate text-conditional audio outputs:
from diffusers import MusicLDMPipeline
import torch
repo_id = "ucsd-reach/musicldm"
pipe = MusicLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]
The resulting audio output can be saved as a .wav file:
import scipy.io.wavfile
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
Or displayed in a Jupyter Notebook / Google Colab:
from IPython.display import Audio
Audio(audio, rate=16000)
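When given a float32 array, scipy.io.wavfile.write produces a 32-bit floating-point WAV file, which some audio players may not open. The sketch below converts a waveform to 16-bit PCM before saving. It assumes the pipeline output is a float NumPy array with values roughly in [-1, 1]; the helper name to_int16_pcm is hypothetical and not part of Diffusers.

```python
import numpy as np


def to_int16_pcm(audio: np.ndarray) -> np.ndarray:
    """Convert a float waveform in [-1, 1] to 16-bit PCM samples.

    Hypothetical helper for illustration; not part of Diffusers or SciPy.
    """
    # Clip first so out-of-range samples saturate instead of wrapping
    # around when cast to a 16-bit integer.
    clipped = np.clip(audio.astype(np.float32), -1.0, 1.0)
    # Scale to the symmetric int16 range and truncate to integers.
    return (clipped * 32767.0).astype(np.int16)
```

The converted array can then be passed as the data argument, for example scipy.io.wavfile.write("techno.wav", rate=16000, data=to_int16_pcm(audio)).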
Tips
When constructing a prompt, keep in mind:
- Descriptive prompt inputs work best; use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific where possible (e.g. "melodic techno with a fast beat and synths" works better than "techno").
- Using a negative prompt can significantly improve the quality of the generated audio. Try using a negative prompt of "low quality, average quality".
During inference:
- The quality of the generated audio sample can be controlled by the num_inference_steps argument; higher steps give higher quality audio at the expense of slower inference.
- Multiple waveforms can be generated in one go: set num_waveforms_per_prompt to a value greater than 1 to enable. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly.
- The length of the generated audio sample can be controlled by varying the audio_length_in_s argument.
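As a quick sanity check on audio_length_in_s, the requested duration can be related to the number of samples in the output. This is a minimal sketch, assuming the waveform is returned at the 16 kHz rate used when saving audio in this card and is trimmed to the requested duration; both helper names are hypothetical.

```python
SAMPLE_RATE_HZ = 16000  # assumption: the rate passed to the save calls in this card


def expected_num_samples(audio_length_in_s: float, sample_rate: int = SAMPLE_RATE_HZ) -> int:
    """Number of samples in a clip of the requested duration."""
    return int(audio_length_in_s * sample_rate)


def duration_in_s(num_samples: int, sample_rate: int = SAMPLE_RATE_HZ) -> float:
    """Duration in seconds of a waveform with the given sample count."""
    return num_samples / sample_rate
```

For a generated waveform, duration_in_s(audio.shape[-1]) should come out close to the requested audio_length_in_s if these assumptions hold.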
The following example combines these tips in a single generation call:
import scipy.io.wavfile
import torch
from diffusers import MusicLDMPipeline
# load the pipeline
repo_id = "ucsd-reach/musicldm"
pipe = MusicLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
# define the prompts
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
negative_prompt = "low quality, average quality"
# set the seed
generator = torch.Generator("cuda").manual_seed(0)
# run the generation
audio = pipe(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=200,
    audio_length_in_s=10.0,
    num_waveforms_per_prompt=3,
    generator=generator,  # pass the seeded generator so the run is reproducible
).audios
# save the best audio sample (index 0) as a .wav file
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio[0])
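With num_waveforms_per_prompt greater than 1, the returned audios hold one waveform per candidate, ranked best first. The sketch below writes every candidate to its own file so they can be compared by ear. The function save_ranked_waveforms and its file-name pattern are hypothetical, and the code assumes audios is an array of shape (number of waveforms, number of samples).

```python
from pathlib import Path

import numpy as np
from scipy.io import wavfile


def save_ranked_waveforms(audios, out_dir, stem="techno", rate=16000):
    """Write each waveform in `audios` to `<out_dir>/<stem>_rank<i>.wav`.

    Hypothetical helper for illustration; index 0 is the best-ranked sample.
    Returns the list of written paths in rank order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for rank, waveform in enumerate(audios):
        path = out_dir / f"{stem}_rank{rank}.wav"
        # Store as float32 so the file matches the pipeline's output precision.
        wavfile.write(str(path), rate, np.asarray(waveform, dtype=np.float32))
        paths.append(path)
    return paths
```

Calling save_ranked_waveforms(audio, "outputs") after the generation above would write techno_rank0.wav through techno_rank2.wav into an outputs directory.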
Citation
BibTeX:
@article{chen2023musicldm,
  title={MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies},
  author={Chen*, Ke and Wu*, Yusong and Liu*, Haohe and Nezhurina, Marianna and Berg-Kirkpatrick, Taylor and Dubnov, Shlomo},
  journal={arXiv preprint arXiv:2308.01546},
  year={2023}
}