By Meghmala Chowdhury

Riffusion, an AI model that generates music from text prompts by constructing a visual representation of sound and converting it to audio for playback, was launched on Thursday by two IT enthusiasts, Seth Forsgren and Hayk Martiros, who developed it as a side project. It applies visual latent diffusion to sound processing in a novel manner, using a fine-tuned version of the Stable Diffusion 1.5 image-synthesis model. Riffusion stores audio as spectrograms, two-dimensional images in which the X-axis depicts time (the left-to-right order in which the frequencies are played) and the Y-axis depicts the frequency of the sounds.
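Concretely, a spectrogram of this kind can be computed with a short-time Fourier transform. The numpy sketch below is my own illustration, not Riffusion's code; it shows how the two axes fall out of the math: frequency bins stack along the Y-axis, successive time frames along the X-axis.

```python
import numpy as np

def spectrogram(signal, n_fft=256, hop=128):
    """Magnitude spectrogram: rows are frequency bins (Y-axis),
    columns are time frames (X-axis)."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * window
        # rfft keeps only the non-negative frequencies
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.stack(frames, axis=1)  # shape: (n_fft//2 + 1, n_frames)

# A one-second 440 Hz sine tone at an 8 kHz sample rate
sr = 8000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
peak_bin = spec[:, 0].argmax()
print(spec.shape)           # (129, 61): 129 frequency rows, 61 time columns
print(peak_bin * sr / 256)  # 437.5, the bin center nearest 440 Hz
```

A pure tone therefore appears as a single bright horizontal line; melodies and chords become stacks of lines that move left to right over time.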

The color of each pixel in the image, meanwhile, encodes the volume of the sound at that specific instant in time. Because a spectrogram is simply a kind of image, Stable Diffusion can process it. Forsgren and Martiros trained a custom Stable Diffusion model on examples of spectrograms paired with descriptions of the sounds or musical genres they represented. With this training, Riffusion can produce fresh music on demand from text prompts that specify the genre of music or sound you like, such as “jazz,” “rock,” or even keystrokes on a keyboard. Riffusion generates the spectrogram image, converts it to sound using Torchaudio, and then plays it back as audio.
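Going from image back to sound requires inventing the phase information that a magnitude-only spectrogram discards; the source says Riffusion hands this to Torchaudio, and the classic Griffin-Lim algorithm is the standard way to do it. The numpy sketch below is a simplified stand-in for that step, not Riffusion's actual pipeline.

```python
import numpy as np

N_FFT, HOP = 256, 128
WINDOW = np.hanning(N_FFT)

def stft(signal):
    """Forward STFT: columns are windowed frames."""
    starts = range(0, len(signal) - N_FFT + 1, HOP)
    return np.stack([np.fft.rfft(signal[s:s + N_FFT] * WINDOW)
                     for s in starts], axis=1)

def istft(spec):
    """Inverse STFT by windowed overlap-add."""
    n_frames = spec.shape[1]
    length = HOP * (n_frames - 1) + N_FFT
    out, norm = np.zeros(length), np.zeros(length)
    for i in range(n_frames):
        frame = np.fft.irfft(spec[:, i], n=N_FFT)
        out[i * HOP:i * HOP + N_FFT] += frame * WINDOW
        norm[i * HOP:i * HOP + N_FFT] += WINDOW ** 2
    return out / np.maximum(norm, 1e-2)  # floor avoids edge blow-ups

def griffin_lim(mag, iters=32, seed=0):
    """Estimate the phase a magnitude-only spectrogram discards."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(iters):
        # alternate: spectrogram -> audio -> re-analyzed phase
        signal = istft(mag * phase)
        phase = np.exp(1j * np.angle(stft(signal)))
    return istft(mag * phase)

# Round-trip: 440 Hz tone -> magnitude spectrogram -> audio again
sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)
y = griffin_lim(np.abs(stft(x)))
peak_hz = np.abs(np.fft.rfft(y)).argmax() * sr / len(y)
```

The reconstructed waveform's spectrum still peaks near 440 Hz, which is why a generated spectrogram image can be turned into listenable audio even though the phase was never stored.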

The Riffusion website offers an interactive web tool that lets users engage with the AI model: it creates interpolated spectrograms that are seamlessly stitched together for uninterrupted playback while continuously displaying the spectrogram on the left side of the page. The developers of Riffusion state on its explanation page, “This is the v1.5 Stable Diffusion model with no alterations, merely fine-tuned on images of spectrograms combined with text. By changing the seed, it can produce an endless number of prompt variations. All the same web user interfaces and techniques, such as img2img, inpainting, negative prompts, and interpolation, function right out of the box.”
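That seamless stitching comes from interpolating in the model's latent space rather than crossfading finished audio. A common trick for walking smoothly between two diffusion seeds is spherical interpolation (slerp), sketched below in numpy purely as an illustration; the function name and latent shape are my assumptions, not Riffusion's code.

```python
import numpy as np

def slerp(v0, v1, t):
    """Spherical interpolation between two latent noise tensors:
    follows the arc between them, preserving the Gaussian-like norm
    that diffusion models expect, unlike a straight linear blend."""
    f0, f1 = v0.ravel(), v1.ravel()
    cos_omega = np.dot(f0, f1) / (np.linalg.norm(f0) * np.linalg.norm(f1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    so = np.sin(omega)
    if so < 1e-8:  # nearly parallel vectors: fall back to a linear blend
        return (1 - t) * v0 + t * v1
    return (np.sin((1 - t) * omega) / so) * v0 + (np.sin(t * omega) / so) * v1

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 64, 64))  # hypothetical latent for seed A
b = rng.standard_normal((4, 64, 64))  # hypothetical latent for seed B
# five latents tracing a smooth path from A to B; decoding each one
# yields a spectrogram that drifts gradually between the two clips
steps = [slerp(a, b, t) for t in np.linspace(0, 1, 5)]
```

Decoding each intermediate latent produces a spectrogram partway between the two endpoints, which is what lets the player glide from one clip to the next without an audible seam.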

Additionally, it can combine several musical styles. Typing “smooth tropical dance jazz,” for instance, blends components of various genres into a fresh result, fostering innovation through the mixing of forms. Riffusion is not the first AI-driven music generator, of course. Harmonai released an AI-driven generative music model called Dance Diffusion earlier this year. OpenAI’s Jukebox, revealed in 2020, also uses a neural network, and websites like Soundraw continuously produce music on demand. Compared with those more organized AI music efforts, Riffusion feels more like the side project it is. Although the music it produces ranges from intriguing to incomprehensible, it is nevertheless an impressive application of latent diffusion technology, which here manipulates audio in a visual space.