Generative AI models such as Stable Diffusion can generate images, but struggle to edit them. Google shows a new method that allows more control.
With OpenAI’s DALL-E 2, Midjourney, and Stable Diffusion, interested users have a variety of generative text-to-image models to choose from. All of them produce convincing images and can be steered through prompt engineering. In many cases, the choice between them is therefore above all a matter of personal preference, and in some cases a matter of specific requirements that one model may meet better than another.
Besides prompt engineering, there are other features that allow more control over the desired result: outpainting, variations, or masking parts of an image. OpenAI’s DALL-E 2 was a pioneer here with its editing function, in which areas of an image can be masked and then regenerated. Similar solutions now also exist for Stable Diffusion.
Google’s Prompt-to-Prompt allows control at the text level
However, editing by masking has its limits: it only allows fairly coarse changes – or requires an elaborate combination of extremely precise masks and various prompt changes.
Google researchers show an alternative: Prompt-to-Prompt dispenses with masking and instead allows control via modifications to the original prompt. To this end, the team taps into the cross-attention maps inside the generative AI model. These represent the link between the text prompt and the generated image and contain semantic information relevant to the generation.
Manipulating these cross-attention maps makes it possible to steer the model’s diffusion process, and the authors show several variants. One allows a single word of the text prompt to be changed while the rest of the scene stays intact – swapping one object for another, for example. A second method adds words to the prompt, inserting objects or other visual elements into an otherwise unchanged scene. A third method adjusts the weighting of individual words, changing a characteristic of the image, such as the size of a group of people or the fluffiness of a teddy bear.
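The core idea can be illustrated with a minimal sketch. The function below is a hypothetical simplification, not Google’s implementation: it assumes a cross-attention map is a tensor of shape [pixels, tokens], injects the original generation’s maps for tokens shared between the two prompts (so the scene layout is preserved during a word swap or word addition), and rescales the map of a re-weighted token to strengthen or weaken its influence.

```python
import torch

def edit_attention(attn_original, attn_edited, shared_tokens, reweight=None):
    """Sketch of Prompt-to-Prompt-style attention editing.

    attn_original -- cross-attention maps from the source prompt, [pixels, tokens]
    attn_edited   -- cross-attention maps from the edited prompt, [pixels, tokens]
    shared_tokens -- token indices present in both prompts (keep original layout)
    reweight      -- optional {token_index: scale} to amplify/attenuate a word
    """
    out = attn_edited.clone()
    for t in shared_tokens:
        # Inject the original map so the scene stays intact for unchanged words.
        out[:, t] = attn_original[:, t]
    if reweight:
        for t, scale in reweight.items():
            # Scale a word's attention map to change its visual weight.
            out[:, t] = out[:, t] * scale
    return out

# Toy example: 4 "pixels", 3 tokens; token 2 is re-weighted by a factor of 2.
a_src = torch.rand(4, 3)
a_new = torch.rand(4, 3)
edited = edit_attention(a_src, a_new, shared_tokens=[0, 1], reweight={2: 2.0})
```

In the real method this substitution happens inside the diffusion model’s attention layers at each denoising step; the sketch only shows the map-level arithmetic.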
Prompt-to-Prompt is easy to use with Stable Diffusion
According to Google, Prompt-to-Prompt requires no fine-tuning or other optimization and can be applied directly to existing models for more control. In their paper, the researchers test the method with Latent Diffusion and Stable Diffusion. According to Google, Prompt-to-Prompt should run on graphics cards with at least 12 gigabytes of VRAM.
The researchers describe the work as a first step toward giving users simple and intuitive ways to edit images: navigating a semantic, textual space through incremental changes at each step, rather than producing an image from scratch after every text manipulation.
YouTuber Nerdy Rodent demonstrates how Prompt-to-Prompt can be used with Stable Diffusion in his tutorial.
More information about Prompt-to-Prompt and the code is available on GitHub.