Google “Muse” generates high-quality AI images at record speed


Google’s new text-to-image synthesis model, “Muse”, generates high-quality images at record speed. It is also said to render text and concepts in images more reliably.

Researchers from Google Research present “Muse”, a Transformer-based generative image AI that produces images comparable to current models, but is said to be “significantly more efficient” than existing diffusion models such as Stable Diffusion and DALL-E 2 or autoregressive models like Google Parti.

Similar quality, but much faster

Muse performs as well as Stable Diffusion 1.4 and Google’s internal competitors Parti-3B and Imagen in terms of quality, variety and text alignment of generated images.

Comparison of prompts and generated images between Muse, Imagen and DALL-E 2. | Image: Google Research

However, Muse is significantly faster. At 1.3 seconds per 512 × 512 image, Muse clearly outperforms the previously fastest system, Stable Diffusion 1.4, which needs 3.7 seconds.


Google’s image AI Muse is said to generate AI images much faster than existing systems at the same quality. | Image: Google Research

The team achieved the speed advantage by using a compressed, discrete latent space and parallel decoding. For text comprehension, Muse uses the pre-trained language model T5, which was trained on text-to-text tasks. According to the team, Muse attends to the full text prompt rather than focusing only on particularly salient words.

Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to the use of discrete tokens and requires fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient through the use of parallel decoding. Using a pre-trained LLM allows fine-grained understanding of language, resulting in high-fidelity image generation and understanding of visual concepts such as objects, their spatial relationships, pose, cardinality, etc.
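The parallel decoding the paper describes can be illustrated with a toy sketch: instead of generating image tokens one at a time as an autoregressive model would, all positions start masked, the model predicts every position at once, and only the most confident predictions are kept each step while the rest are re-masked on a cosine schedule. The code below is a minimal illustration of this idea, not Google’s implementation; `dummy_predict` is a hypothetical stand-in for the actual transformer.

```python
import math
import random

def parallel_decode(predict_fn, num_tokens, steps=8):
    """Toy sketch of confidence-based parallel decoding.

    Starts from a fully masked token grid. Each step, the model predicts
    all positions in one call; the most confident predictions are fixed,
    the remainder are re-masked following a cosine schedule, so the number
    of masked tokens shrinks to zero over `steps` iterations.
    """
    MASK = None
    tokens = [MASK] * num_tokens
    for step in range(steps):
        # One model call predicts every position in parallel.
        preds = predict_fn(tokens)  # list of (token_id, confidence)
        # Cosine schedule: fraction of tokens still masked after this step.
        mask_ratio = math.cos(math.pi / 2 * (step + 1) / steps)
        num_masked = int(num_tokens * mask_ratio)
        # Rank currently masked positions by confidence; fix the best ones.
        masked_pos = [i for i, t in enumerate(tokens) if t is MASK]
        masked_pos.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked_pos[: len(masked_pos) - num_masked]:
            tokens[i] = preds[i][0]
    return tokens

random.seed(0)

def dummy_predict(tokens):
    # Hypothetical stand-in for the transformer: random ids and confidences
    # drawn from a codebook of 8192 entries.
    return [(random.randrange(8192), random.random()) for _ in tokens]

result = parallel_decode(dummy_predict, num_tokens=256, steps=8)
```

With 8 steps the model is called 8 times for 256 tokens, versus 256 sequential calls for a purely autoregressive decoder, which is the source of the speedup the team describes.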


The new architecture allows a range of image editing applications without further adjustment or model inversion. In an image, objects can be replaced or modified by simple prompting, without masking.

Muse’s image editing modes, which the model supports without any fine-tuning. | Image: Google Research

In an evaluation by human raters, images from Muse were judged to match the text prompt better than those from Stable Diffusion 1.4 in about 70 percent of cases.

In human evaluations, Muse performed better than Stable Diffusion 1.4. | Image: Google Research

Muse is also said to be unusually good at rendering predefined words in images, such as a T-shirt that says “Carpe Diem”. Additionally, Muse is said to be compositionally accurate, i.e., it renders the image elements specified in the prompt with the correct number, position, and color. Current image AI systems often fail at this.

Overview of the quality advantages of Muse. | Image: Google Research

More sample images are available on the project website. The researchers and Google itself have yet to comment on a possible release of the image model to compete with OpenAI’s DALL-E 2 or Midjourney. Currently, only Imagen by Google is available in a limited beta version in the United States.



As is often the case with recent scientific work on AI systems for language and images, the Muse team points out that, depending on the use case, there is “potential for harm”, such as the reproduction of social biases or the spread of misinformation. For this reason, the team is not releasing the code or a public demo. In particular, the team notes the risks of using AI image models to generate images of people and faces.
