Where is the “DALL-E for music”?


Type a line of text and hear a piece of music a few seconds later? There are still hurdles to clear before that happens, says one analyst.

First it was AI-generated text, then images, which have become more sophisticated recently. HD video and 3D AI generators are also in the works.

This rightly raises the question: where is the equivalent of GPT-3, Midjourney or DALL-E for music? Cherie Hu of Water and Music, a research and intelligence network for the new music industry, laid out in a Twitter thread why such a service has been slow to arrive.

Too little training data, too much copyright

The first point she raises is the lack of training data. While the available text-to-image models were each trained on tens of terabytes of data, nowhere near that much public training data exists for music. To reach that scale, Hu says, you would need a dataset covering all published music, plus access to private drafts from DAWs such as GarageBand, Ableton Live or Logic.


As with image generators, copyright considerations also play a major role: in principle, millions of tracks could be ripped from music streaming services and used for training. But that would immediately bring the major labels and their lawyers into the picture.

“Music industry lawyers have more power than in any other creative industry,” Hu said. Some artists and coders are already fighting generative AI that could infringe copyrights.

Lack of expertise outside of academic research

While breakthroughs in image and text AI are increasingly driven by the open source community, music AI is still dominated by academic research. “There is less data, so the work is just harder and slower. And the nexus of people who know machine learning, music production, signal processing, etc., is tiny.”

According to Hu, it is also because music is more difficult to sift through and, more importantly, to evaluate than visual art. “It literally takes time to listen to and rate a one-minute song. In the same time, you can scan hundreds of images.”

Hu sums up that, right now, the best AI models for music:


  • require more specialized technical knowledge to operate,
  • take longer to run,
  • cost more to run,
  • produce output that is merely OK,
  • and find it harder to rally public enthusiasm.

When will generative AI for music have its Midjourney moment?

However, Hu draws a conclusion that should keep the music industry from breathing a sigh of relief: “All of that is going to change very soon, given how quickly the creative AI landscape is changing.”

Early examples include startups like Mubert, which recently unveiled a text-to-music model, and Sony’s AI division, which is researching neural synthesizers.

The open source project Harmonai is also worth mentioning. It describes itself as a community-driven organization that provides open source tools for generative audio, with the goal of making music production more accessible to everyone.

Its current work, “Dance Diffusion”, a generative audio model, is already available for testing through the Dance Diffusion Colab. Harmonai is backed by London-based startup Stability AI, which also enabled the open-source Stable Diffusion model.
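Models like Dance Diffusion generate raw audio by starting from pure noise and denoising it step by step. As a conceptual sketch only (this is not Harmonai's actual API; the `toy_denoiser` function and the step schedule below are made up for illustration), a diffusion sampling loop over an audio waveform looks roughly like this:

```python
import numpy as np

SAMPLE_RATE = 44100          # audio sample rate in Hz
CHANNELS = 2                 # stereo
LENGTH = SAMPLE_RATE * 2     # two seconds of audio
STEPS = 50                   # number of denoising steps

def toy_denoiser(x, t):
    """Stand-in for a trained network that predicts the clean signal.

    A real model (such as Dance Diffusion's neural network) would be
    conditioned on the noise level t; here we simply shrink the signal
    toward silence so the loop runs end to end.
    """
    return x * (1.0 - t)

def sample(steps=STEPS):
    # Start from Gaussian noise, the canonical diffusion initialization.
    rng = np.random.default_rng(0)
    x = rng.standard_normal((CHANNELS, LENGTH)).astype(np.float32)
    for i in range(steps, 0, -1):
        t = i / steps                 # current noise level in (0, 1]
        pred = toy_denoiser(x, t)     # model's estimate of the clean audio
        t_next = (i - 1) / steps
        # Blend toward the prediction as the noise level decreases.
        x = pred + t_next * (x - pred)
    return x

audio = sample()
print(audio.shape)  # two-second stereo waveform: (2, 88200)
```

The point of the sketch is the structure, not the audio quality: sampling requires dozens of sequential network evaluations over a long waveform, which is part of why Hu notes that these models take longer and cost more to run than image generators.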
