Google’s new text-to-image synthesis model, “Muse”, generates high-quality images at record speed. It is also said to render text and concepts in images more reliably.
Researchers from Google Research present “Muse”, a Transformer-based generative image AI that produces images comparable to current models, but is said to be “significantly more efficient” than existing diffusion models such as Stable Diffusion and DALL-E 2 or autoregressive models like Google Parti.
Similar quality, but much faster
Muse performs as well as Stable Diffusion 1.4 and Google’s internal competitors Parti-3B and Imagen in terms of quality, variety and text alignment of generated images.
However, Muse is significantly faster: with a generation time of 1.3 seconds per 512 × 512 image, it clearly outperforms the fastest comparable system, Stable Diffusion 1.4, which takes 3.7 seconds.
The team achieved the speed advantage by using a compressed, discrete latent space and parallel decoding. For text comprehension, Muse relies on a frozen T5 language model pre-trained on text-to-text tasks. According to the team, Muse attends to the entire text prompt rather than focusing only on particularly salient words.
Compared to pixel-space diffusion models such as Imagen and DALL-E 2, Muse is significantly more efficient due to its use of discrete tokens, requiring fewer sampling iterations; compared to autoregressive models such as Parti, Muse is more efficient through its use of parallel decoding. Using a pre-trained LLM enables fine-grained language understanding, resulting in high-fidelity image generation and a grasp of visual concepts such as objects, their spatial relationships, pose, and cardinality.
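To illustrate why parallel decoding needs fewer iterations than token-by-token autoregressive generation, here is a minimal sketch of MaskGIT-style iterative parallel decoding, the scheme Muse builds on. It is a simplified illustration, not Muse’s actual implementation: `predict_fn` stands in for the transformer and maps a token array to per-position probability distributions, and the cosine unmasking schedule is an assumption carried over from the MaskGIT paper.

```python
import numpy as np

def parallel_decode(predict_fn, seq_len, vocab_size, num_steps=8, mask_id=-1):
    """Sketch of iterative parallel decoding over discrete image tokens.

    Starts from a fully masked token sequence. At each step, the model
    predicts every position at once; only the most confident predictions
    are kept, and the rest are re-masked for the next iteration. A few
    steps thus replace seq_len sequential autoregressive steps.
    """
    tokens = np.full(seq_len, mask_id, dtype=int)
    for step in range(num_steps):
        probs = predict_fn(tokens)              # shape: (seq_len, vocab_size)
        sampled = probs.argmax(axis=-1)         # greedy pick per position
        confidence = probs.max(axis=-1)
        # Cosine schedule: fraction of tokens still masked after this step
        # shrinks to zero at the final step.
        frac_masked = np.cos(np.pi / 2 * (step + 1) / num_steps)
        num_keep = seq_len - int(np.floor(frac_masked * seq_len))
        # Already-decoded tokens always stay fixed.
        confidence[tokens != mask_id] = np.inf
        keep = np.argsort(-confidence)[:num_keep]
        new_tokens = np.full(seq_len, mask_id, dtype=int)
        new_tokens[keep] = np.where(
            tokens[keep] != mask_id, tokens[keep], sampled[keep]
        )
        tokens = new_tokens
    return tokens

# Hypothetical stand-in model: position i prefers token i % 4.
def dummy_predict(tokens):
    probs = np.full((len(tokens), 4), 0.1)
    for i in range(len(tokens)):
        probs[i, i % 4] = 0.7
    return probs

decoded = parallel_decode(dummy_predict, seq_len=8, vocab_size=4)
print(decoded)  # all positions decoded after num_steps iterations
```

The key point is that the loop runs a fixed, small number of steps (here 8) regardless of sequence length, whereas an autoregressive model would need one forward pass per token.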
The new architecture enables a range of image editing applications without additional fine-tuning or model inversion: objects in an image can be replaced or modified via simple prompting, without masking.
In an evaluation by human raters, Muse’s images were judged better aligned with the text prompt than those of Stable Diffusion 1.4 in about 70 percent of cases.
Muse is also said to be above average at rendering predefined words in images, such as a T-shirt that says “Carpe Diem”. Additionally, Muse is said to be compositionally accurate: it renders image elements specified in the prompt with more exact numbers, positions, and colors, something current image AI systems often fail at.
More sample images are available on the project website. The researchers and Google itself have yet to comment on a possible release of the image model to compete with OpenAI’s DALL-E 2 or Midjourney. Currently, only Imagen by Google is available in a limited beta version in the United States.
As is often the case with scientific work on AI systems for language and images these days, the Muse team points out that, depending on the use case, there is “potential for harm,” such as the reproduction of social biases or the spread of misinformation. For this reason, the team refrains from releasing the code or a public demo. In particular, the team notes the risk of using AI image models to generate images of people and faces.