Preprint: “Long-form music generation with latent diffusion”

We show that by training a generative model on long temporal contexts it is possible to produce long-form music of up to 4m 45s. Our model is a diffusion-transformer operating on a highly downsampled continuous latent representation (latent rate of 21.5 Hz). It obtains state-of-the-art results on metrics for audio quality and prompt alignment, and subjective tests reveal that it produces full-length music with coherent structure.
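As a back-of-the-envelope check (the 44.1 kHz sample rate, 21.5 Hz latent rate, and 4m 45s length come from the announcements; the rounding below is mine), the temporal downsampling factor and the latent sequence length for a full-length clip work out as:

```python
# Rough numbers for the latent representation described in the preprint.
sample_rate_hz = 44_100          # stereo audio at 44.1 kHz
latent_rate_hz = 21.5            # stated latent rate
clip_seconds = 4 * 60 + 45       # 4m 45s = 285 s

downsampling_factor = sample_rate_hz / latent_rate_hz
latent_frames = clip_seconds * latent_rate_hz

print(round(downsampling_factor))  # ~2051x temporal downsampling
print(round(latent_frames))        # ~6128 latent frames for the full clip
```

So the diffusion-transformer only has to model a sequence of a few thousand latent frames rather than over twelve million raw audio samples, which is what makes such long contexts tractable.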

Check it on arXiv, and our demos online and on SoundCloud!

Preprint: “Fast Timing-Conditioned Latent Audio Diffusion”

The Stable Audio paper is finally out. In this work I mostly focused on the evaluation of the model. With these metrics you can now evaluate long-form, full-band, and variable-length music and audio generations, whereas previous work focused on evaluating short-form, 16 kHz music and audio. The results of our perceptual study show that Stable Audio is competitive, especially in terms of audio quality. We also assessed musicality, stereo correctness, and musical structure. Stable Audio is able to consistently generate music with structure!

Check the model and evaluation code.
Also check it on arXiv and its demo!

On Prompting Stable Audio

Stable Audio lets you create custom-length audio just by describing it. It is powered by a generative audio model based on diffusion. You can generate and download audio in 44.1 kHz stereo, and there's a nice interface, no need to be a hacker! The audio you create can also be used in your commercial projects. I've been experimenting with it over the last few weeks, and here are some ideas on how to use it!

Continue reading