Riffusion (Smaller model)

This model is smaller in file size (2.13GB vs. 14.6GB) because it is stored at lower precision and has the EMA weights and some other data stripped.

  • Generated spectrograms will differ from those produced by the 14.6GB model
  • There is no noticeable quality difference between the original and the small model
  • The small model is easier to load on machines with limited CPU RAM. For example, with only 16GB of RAM, loading the large model can cause problems; in my case, my PC froze for a few seconds.
  • The small model loads faster than the large model
  • The large model is probably better for training, but I have had great success training LoRA on the small model.
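
As a rough sanity check on how lower precision and stripped EMA weights shrink a checkpoint, here is a minimal arithmetic sketch. The parameter counts are approximate, assumed figures for a Stable Diffusion v1 style model (UNet, text encoder, VAE), and the sketch assumes the EMA copy duplicates only the UNet; none of these values were read from the files in this repository.

```python
# Approximate parameter counts for Stable Diffusion v1 components.
# These are assumed round figures, not values read from this repository.
PARAMS = {"unet": 860_000_000, "text_encoder": 123_000_000, "vae": 84_000_000}


def checkpoint_size_gb(bytes_per_param: int, with_ema: bool) -> float:
    """Estimate checkpoint size in GB (10^9 bytes) from parameter counts."""
    n_params = sum(PARAMS.values())
    if with_ema:
        # Assumption: the EMA copy duplicates only the UNet weights.
        n_params += PARAMS["unet"]
    return n_params * bytes_per_param / 1e9


# 4 bytes per parameter for fp32, 2 bytes per parameter for fp16.
fp32_with_ema = checkpoint_size_gb(bytes_per_param=4, with_ema=True)
fp16_no_ema = checkpoint_size_gb(bytes_per_param=2, with_ema=False)
print(f"fp32 with EMA:    {fp32_with_ema:.2f} GB")
print(f"fp16 without EMA: {fp16_no_ema:.2f} GB")
```

Under these assumptions the half-precision, EMA-free estimate comes out close to the 2.13GB figure above. The sketch does not try to account for the 14.6GB size of the original file.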

Riffusion (Original readme)

Riffusion is an app for real-time music generation with stable diffusion.

Read about it at https://www.riffusion.com/about and try it at https://www.riffusion.com/.

This repository contains the model files, including:

  • a diffusers formatted library
  • a compiled checkpoint file
  • a traced unet for improved inference speed
  • a seed image library for use with riffusion-app

Riffusion v1 Model

Riffusion is a latent text-to-image diffusion model capable of generating spectrogram images given any text input. These spectrograms can be converted into audio clips.
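
To illustrate the first step of turning a spectrogram image back into audio, here is a minimal sketch that maps grayscale pixel values to spectrogram magnitudes. The function name, the `max_volume` and `power_for_image` constants, and the inversion and vertical flip are assumptions made for illustration, not a confirmed description of the Riffusion pipeline. Recovering a waveform additionally requires estimating phase (for example with the Griffin-Lim algorithm), which is not shown here.

```python
def magnitudes_from_pixels(pixels, max_volume=50.0, power_for_image=0.25):
    """Map 8-bit grayscale pixels (rows of ints in 0..255) to magnitudes.

    Assumptions (hypothetical, for illustration only):
      - darker pixels mean louder content, so values are inverted;
      - the top image row is the highest frequency, so rows are flipped;
      - the image stored magnitude ** power_for_image, scaled to 0..255.
    """
    exponent = 1.0 / power_for_image
    flipped = pixels[::-1]  # row 0 becomes the lowest frequency bin
    return [
        [((255 - p) * max_volume / 255) ** exponent for p in row]
        for row in flipped
    ]


# Tiny 2x2 "image": white (255) is silence, black (0) is maximum volume.
image = [
    [255, 0],
    [128, 255],
]
mags = magnitudes_from_pixels(image)
for row in mags:
    print([round(v, 1) for v in row])
```

The resulting magnitude matrix is what a phase-reconstruction step would take as input to produce an audio clip.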

The model was created by Seth Forsgren and Hayk Martiros as a hobby project.

You can use the Riffusion model directly, or try the Riffusion web app.

The Riffusion model was created by fine-tuning the Stable-Diffusion-v1-5 checkpoint. Read about Stable Diffusion in 🤗's Stable Diffusion blog.

Model Details

Direct Use

The model is intended for research purposes only. Possible research areas and tasks include:

  • Generation of artworks, audio, and use in creative processes.
  • Applications in educational or creative tools.
  • Research on generative models.

Datasets

The original Stable Diffusion v1.5 was trained on the LAION-5B dataset using the CLIP text encoder, which provided an amazing starting point with an in-depth understanding of language, including musical concepts. The team at LAION also compiled a fantastic audio dataset from many general, speech, and music sources that we recommend at LAION-AI/audio-dataset.

Fine Tuning

Check out the diffusers training examples from Hugging Face. Fine tuning requires a dataset of spectrogram images of short audio clips, with associated text describing them. Note that the CLIP encoder is able to understand and connect many words even if they never appear in the dataset. It is also possible to use a dreambooth method to get custom styles.
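
One way to pair spectrogram images with their text descriptions is a `metadata.jsonl` file next to the images, which is the layout the Hugging Face `datasets` image-folder loader reads. The sketch below is a minimal example under that assumption; the helper name `write_metadata`, the file names, and the captions are hypothetical, and the caption column is assumed to be called `text`.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(image_dir: Path, captions: dict[str, str]) -> Path:
    """Write metadata.jsonl pairing each spectrogram image with its caption."""
    metadata_path = image_dir / "metadata.jsonl"
    with metadata_path.open("w", encoding="utf-8") as f:
        for file_name, text in sorted(captions.items()):
            if not (image_dir / file_name).is_file():
                raise FileNotFoundError(f"missing image: {file_name}")
            # One JSON object per line: image file name plus its caption.
            f.write(json.dumps({"file_name": file_name, "text": text}) + "\n")
    return metadata_path


# Demo with empty placeholder files standing in for real spectrogram PNGs.
with tempfile.TemporaryDirectory() as tmp:
    image_dir = Path(tmp)
    captions = {
        "clip_0001.png": "solo jazz piano, slow tempo",
        "clip_0002.png": "acoustic folk guitar with fiddle",
    }
    for name in captions:
        (image_dir / name).touch()

    path = write_metadata(image_dir, captions)
    lines = path.read_text(encoding="utf-8").splitlines()
    records = [json.loads(line) for line in lines]
    print(records)
```

A training script would then need to be pointed at this folder and told which column holds the captions.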

Citation

If you build on this work, please cite it as follows:

@article{Forsgren_Martiros_2022,
  author = {Forsgren, Seth* and Martiros, Hayk*},
  title = {{Riffusion - Stable diffusion for real-time music generation}},
  url = {https://riffusion.com/about},
  year = {2022}
}