The ability to generate high-quality images quickly is a game-changer for applications like training self-driving cars to navigate complex environments and predict hazards on the road. However, current generative AI techniques used for producing such images come with their own set of limitations. Diffusion models, while producing incredibly realistic images, are slow and resource-intensive. On the other hand, autoregressive models—like the ones behind LLMs such as ChatGPT—are fast but often result in images with errors and poor detail. Now, researchers from MIT and NVIDIA have developed a solution that combines the strengths of both approaches.

Their groundbreaking hybrid image-generation tool, known as HART (Hybrid Autoregressive Transformer), integrates an autoregressive model for fast, high-level image generation and a smaller diffusion model to refine and enhance image details. Published on the arXiv preprint server, HART produces images that match or even surpass the quality of current state-of-the-art diffusion models, all while running up to nine times faster.

HART takes advantage of the strengths of both generative models by first using an autoregressive model to quickly sketch out the broad strokes of an image. Then, a small diffusion model is used to predict and fill in the fine details, known as residual tokens, which the initial model might miss. This method allows for both speed and impressive detail, solving the slow, resource-heavy problems typical of traditional diffusion models. And with its minimal computational overhead, HART can run smoothly on everyday devices like commercial laptops or even smartphones.

Haotian Tang, Ph.D., co-lead author of the research, explains the concept: “If you’re painting a landscape, and you just paint the entire canvas at once, it might not look very good. But if you paint the big picture and then refine it with smaller brush strokes, your painting could look a lot better. That’s the basic idea with HART.”

Popular models like Stable Diffusion and DALL-E rely on an iterative process in which the model generates an image by repeatedly "denoising" every pixel over many steps, sometimes 30 or more. While this process results in high-quality images, it's computationally expensive and slow. In contrast, autoregressive models, which generate an image sequentially as a series of tokens, are much faster but often produce incomplete or erroneous images because they cannot go back and correct earlier mistakes.
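The cost of that iterative loop can be sketched in a few lines. This is a toy illustration only, assuming a stand-in `toy_denoiser` function in place of the trained neural network a real diffusion model would call at every step:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(x, target):
    """Hypothetical stand-in for a learned denoising network:
    nudge the noisy image a little closer to the clean image."""
    return x + 0.2 * (target - x)

target = rng.random((8, 8))       # the "clean" image the model steers toward
x = rng.standard_normal((8, 8))   # start from pure noise

num_steps = 30                    # diffusion typically needs ~30+ passes
for _ in range(num_steps):
    x = toy_denoiser(x, target)   # every step reprocesses every pixel

# After many steps the sample ends up close to the clean image.
print(float(np.abs(x - target).mean()))
```

The point is the loop itself: every one of the 30 steps runs the (expensive) denoiser over the full image, which is why cutting the number of diffusion steps, as HART does, translates directly into speed.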

HART improves on this by using an autoregressive model to first predict compressed, discrete image tokens. These tokens are essentially compressed versions of the image’s raw pixels. The model then passes these tokens to a diffusion model, which predicts the residual tokens—capturing the fine details like edges and textures that are critical for realism. This combination allows for high-quality reconstruction of images in a fraction of the time it would normally take.
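The division of labor described above can be illustrated with a toy sketch. The quantizer and variable names here are invented stand-ins, not the paper's actual architecture; in HART the coarse tokens come from an autoregressive transformer and the residual from a small diffusion model:

```python
import numpy as np

rng = np.random.default_rng(1)
image = rng.random((8, 8))

# Stage 1 (stands in for the autoregressive model): produce coarse,
# discrete tokens. Here we fake discretization by quantizing to 16 levels.
levels = 16
coarse_tokens = np.floor(image * levels) / levels

# Stage 2 (stands in for the small diffusion model): predict the residual,
# i.e. the fine detail the discrete tokens cannot represent.
residual = image - coarse_tokens

# Adding the residual back on top of the coarse tokens recovers full detail.
reconstruction = coarse_tokens + residual
print(np.allclose(reconstruction, image))
```

The residual is bounded by the quantization step (here 1/16), which mirrors why HART's refinement stage can be so small: it only has to model the fine-grained leftover detail, not the whole image.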

“The diffusion model has an easier job to do, which leads to more efficiency,” says Tang.

HART outperforms other models, delivering the same level of quality as a diffusion model with 2 billion parameters, despite using only 700 million parameters for the autoregressive model and 37 million for the diffusion model. It achieves this while running up to nine times faster and using 31% less computation than current best-in-class models.

One of the key advantages of HART is its compatibility with larger AI systems, such as vision-language models. Since it uses the same type of autoregressive model that powers language models like ChatGPT, HART is primed to work in tandem with unified vision-language generative models. This could open new possibilities, such as interacting with AI to generate the intermediate steps for assembling furniture or for more complex tasks like video generation.

The researchers envision expanding this technology even further, possibly applying HART’s framework to video generation and even audio prediction tasks, showing its potential beyond just image creation.

By Impact Lab