Sony AI researchers present GANstrument, a neural synthesizer that transforms arbitrary input sounds into instrument sounds.
Generative AI systems such as DALL-E 2, Midjourney, or Stable Diffusion are currently shaking up the visual arts. Text-to-image systems produce impressive results even from simple text prompts.
Comparably powerful systems do not yet exist for music. But here too, recent projects such as the generative text-to-music model by US start-up Mubert show where the journey could lead.
In addition to end-to-end musical synthesis, there is a second line of research: the synthesis of individual notes, which are then arranged via a symbolic format such as MIDI (Musical Instrument Digital Interface). This allows notes and timbre to be controlled independently, keeping the process compatible with music industry production workflows.
In a new paper, AI researchers at Sony are now demonstrating GANstrument, a neural synthesizer for instrument sounds.
GANstrument: Sony presents a neural synthesizer based on a GAN
Currently, realistic instrument sounds are synthesized with samplers that use recorded sounds. Although any sound material can be used, it’s difficult to synthesize a completely new timbre or combine multiple sounds in an intelligent way, Sony said. Generative AI models for audio synthesis, however, have shown that AI can create and mix a variety of timbres.
The research team therefore aims to develop a neural synthesizer that combines the flexibility of classic samplers with the generative power of neural networks. With such a tool, users could freely control the timbre based on existing sound material.
For its neural synthesizer, Sony uses a GAN (Generative Adversarial Network), which is trained on waveforms transformed into mel spectrograms. Instead of the class conditioning typically used in GAN training, the team relies on something called instance conditioning.
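To illustrate the kind of input representation involved, here is a minimal NumPy sketch of converting a waveform into a log-mel spectrogram. The parameter values (sample rate, FFT size, hop length, number of mel bands) are illustrative assumptions, not the values used in the paper:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(wave, sr=16000, n_fft=1024, hop=256, n_mels=128):
    # Frame and window the signal, take the magnitude STFT,
    # then project onto the mel filterbank and take the log.
    win = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop: i * hop + n_fft] * win
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))           # (frames, n_fft//2+1)
    mel = mag @ mel_filterbank(sr, n_fft, n_mels).T     # (frames, n_mels)
    return np.log(mel + 1e-6)                           # log-mel "image"

# A one-second 440 Hz tone becomes a (frames, n_mels) log-mel array,
# i.e. the kind of 2D representation a GAN can be trained on.
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
S = mel_spectrogram(tone)
```

The key point is that the GAN never sees raw audio: it learns to generate these 2D time-frequency images, which are later converted back into waveforms.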
Class conditioning partitions the data into a few non-overlapping distributions, one per class, while instance conditioning models the data as many overlapping local distributions, one centered around each data point.
GANstrument can turn a rooster into a cello piece
Along with other improvements, such as a pitch-invariant feature extractor, GANstrument thus achieves better and more diverse synthesized sounds, as well as generalization to different sound inputs, the team writes. After training, GANstrument can transform, for example, flute sounds into brass sounds or organ sounds into guitar sounds.
Interpolation (Input 1 to 2)
The AI system can also smoothly mix different instruments and thus merge two input instruments into a single track, for example.
Melody (Mallet to Reed)
The system also works with input sounds it has never heard before. It can turn them into known instrument sounds or change the pitch of the input. GANstrument can therefore also convert the crowing of a rooster or the meowing of a cat into sounds of different pitches.
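For contrast, here is the classic sampler-style approach to changing pitch: naive resampling, sketched in NumPy. It moves the pitch but also distorts the duration (and, for real sounds, the timbre), whereas GANstrument re-synthesizes the sound at the target pitch. This is an illustrative baseline, not the paper's method:

```python
import numpy as np

def pitch_shift(wave, semitones):
    # Naive sampler-style pitch shift: resample the waveform so it
    # plays back faster or slower. Pitch moves by the given number
    # of semitones, but the duration changes with it.
    ratio = 2.0 ** (semitones / 12.0)
    positions = np.arange(0, len(wave) - 1, ratio)
    return np.interp(positions, np.arange(len(wave)), wave)

sr = 16000
t = np.arange(sr) / sr
a4 = np.sin(2 * np.pi * 440 * t)   # one second of A4
a5 = pitch_shift(a4, 12)           # one octave up, but only half as long
```

The halved length of `a5` shows the time/pitch coupling that makes plain resampling a poor substitute for a generative model when the goal is a playable instrument.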
According to Sony, GANstrument generates audio in 1.62 seconds on an Intel Core i7-7800X processor.
Our new neural synthesizer, GANStrument, generates pitched instrument sounds reflecting the timbre of the input in interactive time. It incorporates two key features: 1) instance conditioning, resulting in better generation quality and the ability to generalize to various inputs, and 2) pitch-invariant feature extraction based on adversarial training, resulting in significantly improved pitch accuracy and timbre consistency.
The authors believe that GANstrument can produce new instrument sounds and make desired timbres freely explorable using a variety of sound materials. Other examples can be found on the GANstrument project page.