arxiv:2301.11757

Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion

Published on Jan 27, 2023

Authors:

Flavio Schneider ,

Abstract

Recent years have seen the rapid development of large generative models for text; however, much less research has explored the connection between text and another "language" of communication -- music. Music, much like text, can convey emotions, stories, and ideas, and has its own unique structure and syntax. In our work, we bridge text and music via a text-to-music generation model that is highly efficient, expressive, and can handle long-term structure. Specifically, we develop Mo\^usai, a cascading two-stage latent diffusion model that can generate multiple minutes of high-quality stereo music at 48kHz from textual descriptions. Moreover, our model features high efficiency, which enables real-time inference on a single consumer GPU with a reasonable speed. Through experiments and property analyses, we show our model's competence over a variety of criteria compared with existing music generation models. Lastly, to promote the open-source culture, we provide a collection of open-source libraries with the hope of facilitating future work in the field. We open-source the following: Codes: https://github.com/archinetai/audio-diffusion-pytorch; music samples for this paper: http://bit.ly/44ozWDH; all music samples for all models: https://bit.ly/audio-diffusion.

View arXiv page View PDF Add to collection

Community

Arrive

Nov 24, 2023

Axiomas

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 1

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2301.11757 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2301.11757 in a Space README.md to link it from this page.