Paella is a compact and powerful text-to-image AI model


An international team of researchers presents Paella, a performance-optimized text-to-image AI model.

Currently, the best-known text-to-image AI systems, such as Stable Diffusion and DALL-E 2, are based on diffusion models for image generation and transformers for language understanding. This allows them to generate high-quality images from text input.

However, the systems require multiple inference steps – and therefore strong hardware – to achieve good results. According to the Paella research team, this can complicate application scenarios for end users.

Back to GANs

The team presents Paella, a text-to-image model with 573 million parameters. According to the researchers, it combines a performance-optimized f8-VQGAN architecture (a convolutional neural network, see the explainer video at the end of the article) with a moderate compression rate and CLIP embeddings.
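The "f8" in the VQGAN's name denotes an 8× spatial compression factor: the model works on a small grid of discrete latent tokens rather than raw pixels. A back-of-the-envelope sketch of what that means for the 256 × 256 pixel images mentioned below (the resolution and compression factor come from the article; everything else is simple arithmetic):

```python
# Back-of-the-envelope: an f8 VQGAN compresses each spatial dimension
# by a factor of 8, so a 256x256 image becomes a 32x32 grid of
# discrete tokens.
image_size = 256          # output resolution reported for Paella
compression_factor = 8    # the "f8" in f8-VQGAN

latent_size = image_size // compression_factor   # 32
num_tokens = latent_size ** 2                    # 1024 tokens per image

print(latent_size, num_tokens)  # → 32 1024
```

Working on 1,024 tokens instead of 65,536 pixels is what makes fast, few-step sampling feasible.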


The overall architecture of Paella. | Image: Rampas et al.

GANs rose to prominence alongside deepfakes before recently being eclipsed by diffusion methods. However, the research team sees the Paella architecture as a powerful alternative to diffusion and transformer models: Paella can generate a 256 x 256 pixel image in just eight steps and in under 500 milliseconds on an Nvidia A100 GPU. Paella was trained for two weeks on 64 Nvidia A100 GPUs with 600 million images from the LAION-5B aesthetic dataset.
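The eight-step generation can be pictured as an iterative token-refinement loop: start from random latent tokens and repeatedly re-predict them, keeping a growing fraction of the predictions each step. This is a hedged, illustrative sketch only; all names are assumptions, and a random stub stands in for the real network, which would predict token logits from the current tokens and a CLIP text embedding:

```python
import numpy as np

# Illustrative sketch of few-step token sampling. VOCAB is an assumed
# placeholder codebook size; GRID follows from the f8 compression of
# a 256-pixel image; STEPS matches the eight steps reported for Paella.
VOCAB = 8192
GRID = 32
STEPS = 8

rng = np.random.default_rng(0)

def model_stub(tokens, text_embedding):
    """Stand-in for the denoising network: returns per-token logits."""
    return rng.standard_normal((GRID, GRID, VOCAB))

def sample(text_embedding, steps=STEPS):
    # Start from a grid of fully random tokens.
    tokens = rng.integers(0, VOCAB, size=(GRID, GRID))
    for step in range(steps):
        logits = model_stub(tokens, text_embedding)
        predicted = logits.argmax(axis=-1)
        # Keep a growing fraction of predictions; re-randomize the rest.
        keep = rng.random((GRID, GRID)) < (step + 1) / steps
        noise = rng.integers(0, VOCAB, size=(GRID, GRID))
        tokens = np.where(keep, predicted, noise)
    return tokens  # a VQGAN decoder would turn these into pixels

grid = sample(text_embedding=None)
print(grid.shape)  # (32, 32)
```

Because each step refines the whole token grid at once, eight passes through the network suffice, compared with the dozens of denoising steps typical of diffusion models.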

Some images generated with Paella. | Image: Rampas et al.

With our model, we can sample images in just 8 steps while achieving high-fidelity results, which makes the model attractive for use cases limited by latency, memory, or computational complexity requirements.


In addition to image generation, Paella can modify input images with techniques such as inpainting (changing image content based on text), outpainting (extending an image beyond its borders based on text), and structural editing. Paella also supports prompt variations such as specific painting styles (e.g., watercolor).
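In a token-based model, inpainting reduces to a masking trick: tokens outside the user's mask are pinned to the original image's tokens, while tokens inside it are re-sampled under the text prompt. A minimal sketch under that assumption (all names are illustrative; a random stub again replaces the model):

```python
import numpy as np

# Token-level inpainting sketch: re-sample only the masked region,
# clamping everything else back to the input image's tokens each step.
rng = np.random.default_rng(1)
VOCAB, GRID = 8192, 32  # assumed codebook size; f8 latent grid

original = rng.integers(0, VOCAB, size=(GRID, GRID))  # tokens of the input image
mask = np.zeros((GRID, GRID), dtype=bool)
mask[8:24, 8:24] = True  # region the text prompt should repaint

tokens = original.copy()
for _ in range(8):
    proposal = rng.integers(0, VOCAB, size=(GRID, GRID))  # stand-in for model output
    tokens = np.where(mask, proposal, original)  # unmasked area stays fixed

# The untouched region is guaranteed to match the input image.
assert np.array_equal(tokens[~mask], original[~mask])
```

Outpainting works the same way, except the mask covers newly added canvas around the original image.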

Examples of outpainting with Paella – an existing image is visually extended using a text command. | Image: Rampas et al.

The research team particularly highlights the small amount of code – just 400 lines – used to train and run Paella. This simplicity compared to transformer and diffusion models could make generative AI techniques manageable for more people, including those outside of research, they say.

The team makes its code and model available on GitHub. A demo of Paella is available on Hugging Face. Image generation is fast and matches the text, but image quality cannot yet match that of diffusion models.

However, the researchers point to the relatively small number of images used for training, which makes a fair comparison with other models difficult, “especially when many of these models are kept private.”



In this spirit, the authors see Paella, together with the release of the model and code, as a contribution to "a reproducible and transparent science." The lead author of the Paella study is Dominic Rampas of Ingolstadt University of Technology.

Explainer video: What is a convolutional neural network?
