24. Autoregressive Generation

Autoregressive Image Generation

Diffusion models have dominated image generation since 2020 — but there is a radically different approach that has gained traction: treating images as sequences of discrete tokens and generating them the same way language models generate text.

This is the approach behind models like Chameleon (Meta) and LlamaGen, and it is widely reported to underlie native image generation in Gemini and GPT-4o, although neither architecture has been published in detail.

The Problem: How to Tokenize an Image?

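Text is naturally discrete (words, subwords). Images are not: each pixel is a point in \([0,255]^3\), and modeling raw pixel values one at a time produces sequences that are far too long. To use autoregressive generation, we need a compact visual vocabulary.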

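The solution: VQ-GAN (Vector Quantization GAN)¹ learns a codebook of \(K\) vectors. The encoder maps each image patch to a latent vector, and that latent is replaced by the index of its nearest codebook vector. The image becomes a grid of integer indices in \(\{0, \ldots, K-1\}\), and a decoder reconstructs pixels from the corresponding codebook vectors.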
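The nearest-vector lookup can be illustrated in a few lines. This is a minimal sketch, not the VQ-GAN implementation: the function name `quantize`, the toy 4-entry codebook, and the latent values are made up for illustration, and NumPy is used only to keep the example short.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook vector.

    latents:  (H, W, D) array of encoder outputs, one vector per patch
    codebook: (K, D) array of code vectors
    returns:  (H, W) integer array of codebook indices
    """
    # Squared Euclidean distance from every latent to every code: (H, W, K)
    diff = latents[:, :, None, :] - codebook[None, None, :, :]
    dists = (diff ** 2).sum(axis=-1)
    return dists.argmin(axis=-1)

# Toy codebook with K = 4 codes in D = 2 dimensions (made-up values)
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# A 2x2 grid of encoder latents, one per image patch (made-up values)
latents = np.array([
    [[0.1, -0.1], [0.9, 0.2]],
    [[0.2, 0.8], [1.1, 0.9]],
])

indices = quantize(latents, codebook)
print(indices.tolist())  # [[0, 1], [2, 3]]
```

Each patch ends up as one integer, so this 2x2 grid of latents becomes the four tokens 0, 1, 2, 3.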

Autoregressive Token Generation

With a trained codebook, we can represent any image as a sequence of \(N\) integer indices. We then generate this sequence exactly like an LLM generates text:

\[ p(t_1, t_2, \ldots, t_N) = \prod_{i=1}^{N} p(t_i \mid t_1, \ldots, t_{i-1}, \text{prompt}) \]

Each token is generated one at a time, conditioned on all previous ones and the text prompt.
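To make the factorization concrete, here is a small sketch in plain Python. The function `toy_conditional` is a hypothetical stand-in for the Transformer (it just favours repeating the previous token) and ignores the text prompt. Only the structure matters: the joint log-probability is a sum of conditional log-probabilities, and sampling proceeds one token at a time.

```python
import math
import random

VOCAB = 4  # toy codebook size (assumption for illustration)

def toy_conditional(prefix):
    """Hypothetical stand-in for the model: returns p(t_i | prefix).

    It favours repeating the previous token; a real model would compute
    this distribution from the prefix and the text prompt.
    """
    weights = [1.0] * VOCAB
    if prefix:
        weights[prefix[-1]] += 4.0
    total = sum(weights)
    return [w / total for w in weights]

def sequence_log_prob(tokens):
    """log p(t_1..t_N) as the sum of per-token conditional log-probabilities."""
    return sum(
        math.log(toy_conditional(tokens[:i])[t]) for i, t in enumerate(tokens)
    )

def sample(n_tokens, rng):
    """Generate tokens one at a time, each conditioned on all previous ones."""
    tokens = []
    for _ in range(n_tokens):
        probs = toy_conditional(tokens)
        tokens.append(rng.choices(range(VOCAB), weights=probs)[0])
    return tokens

rng = random.Random(0)
tokens = sample(16, rng)
print(tokens)
print(sequence_log_prob(tokens))
```

Note that `sample` calls the model once per token, which is where the cost of autoregressive generation comes from.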

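Order: the token grid is typically flattened in raster-scan order (row by row, left to right), so \(t_1\) is the top-left token and \(t_N\) the bottom-right one.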
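A generation order has to be fixed before a 2D token grid can be treated as a sequence. The sketch below assumes raster-scan order (row by row, left to right), which is a common choice but not the only one. The helper names are made up for illustration.

```python
def grid_to_sequence(grid):
    """Flatten an H x W grid of token indices row by row (raster scan)."""
    return [tok for row in grid for tok in row]

def sequence_to_grid(seq, width):
    """Inverse of grid_to_sequence: cut the sequence back into rows."""
    if len(seq) % width != 0:
        raise ValueError("sequence length must be a multiple of the grid width")
    return [seq[i:i + width] for i in range(0, len(seq), width)]

def position_to_row_col(i, width):
    """Grid coordinates (row, col) of the i-th generated token, 0-based."""
    return divmod(i, width)

grid = [[7, 2, 9],
        [4, 4, 1]]
seq = grid_to_sequence(grid)
print(seq)                        # [7, 2, 9, 4, 4, 1]
print(position_to_row_col(4, 3))  # (1, 1)
```

After generation, `sequence_to_grid` restores the 2D layout that the VQ decoder expects.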

MaskGIT: Parallel Generation via Masking

Purely autoregressive generation is slow: 1024 tokens = 1024 model passes. MaskGIT² accelerates this with iterative parallel generation:

1. Start with all tokens masked [MASK]
2. At each iteration, predict all tokens simultaneously (bidirectional!)
3. "Reveal" only the tokens with highest confidence
4. Repeat with fewer masked tokens

In just 8–12 iterations, it generates 1024 tokens — versus 1024 iterations for pure AR.
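The loop above can be sketched as follows. This is a simplified illustration under stated assumptions, not the MaskGIT reference code: `dummy_predict` is a hypothetical stand-in for the bidirectional Transformer and returns random probabilities, tokens are chosen greedily rather than sampled, and a cosine schedule decides how many tokens stay masked after each iteration.

```python
import math
import numpy as np

MASK = -1  # sentinel value for a masked position (assumption)

def dummy_predict(tokens, n_codes, rng):
    """Hypothetical stand-in for the bidirectional Transformer.

    Returns an (N, n_codes) array of probabilities, one row per position.
    A real model would condition on the currently revealed tokens.
    """
    logits = rng.normal(size=(len(tokens), n_codes))
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def maskgit_decode(n_tokens, n_codes, n_iters, rng):
    tokens = np.full(n_tokens, MASK)
    for step in range(n_iters):
        masked = tokens == MASK
        n_masked = int(masked.sum())
        if n_masked == 0:
            break
        probs = dummy_predict(tokens, n_codes, rng)  # one pass, all positions
        pred = probs.argmax(axis=-1)   # most likely code per position
        conf = probs.max(axis=-1)      # confidence of that prediction
        conf[~masked] = -np.inf        # revealed positions are never re-picked
        # Cosine schedule: number of tokens that stay masked after this step
        ratio = math.cos(math.pi / 2 * (step + 1) / n_iters)
        n_keep_masked = math.floor(n_tokens * ratio)
        n_reveal = min(n_masked, max(1, n_masked - n_keep_masked))
        reveal = np.argsort(-conf)[:n_reveal]  # highest-confidence positions
        tokens[reveal] = pred[reveal]
    return tokens

decoded = maskgit_decode(n_tokens=1024, n_codes=16, n_iters=8,
                         rng=np.random.default_rng(0))
print(int((decoded == MASK).sum()))  # 0
```

With `n_iters=8` the predictor is called 8 times for 1024 tokens. The schedule reveals few tokens early, when the model has little context, and many tokens late.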

Any-to-Any: Gemini, GPT-4o, and Chameleon

The final step is removing the distinction between text and image tokens. Any-to-any models treat everything as a token sequence:

[TEXT: "a photo of"] [IMG_TOK_3742] [IMG_TOK_891] ... [IMG_TOK_5531] [TEXT: "cat"]

The standard Transformer model processes this mixed sequence naturally.
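One simple way to put text and image tokens in a single sequence is to give them disjoint ID ranges in a shared vocabulary. The sketch below assumes image tokens are placed after the text vocabulary by adding an offset. The vocabulary sizes and text token IDs are hypothetical, and real models may lay out their vocabularies differently or add special begin/end-of-image tokens.

```python
TEXT_VOCAB_SIZE = 32_000     # hypothetical text vocabulary size
IMAGE_CODEBOOK_SIZE = 8_192  # hypothetical visual codebook size

def image_code_to_token_id(code):
    """Shift a codebook index into the ID range after the text vocabulary."""
    if not 0 <= code < IMAGE_CODEBOOK_SIZE:
        raise ValueError("codebook index out of range")
    return TEXT_VOCAB_SIZE + code

def is_image_token(token_id):
    """True if the ID lies in the image range of the shared vocabulary."""
    return token_id >= TEXT_VOCAB_SIZE

def token_id_to_image_code(token_id):
    """Inverse of image_code_to_token_id, used before VQ decoding."""
    if not is_image_token(token_id):
        raise ValueError("not an image token")
    return token_id - TEXT_VOCAB_SIZE

def build_sequence(text_before, image_codes, text_after):
    """Concatenate text IDs and shifted image codes into one sequence."""
    image_ids = [image_code_to_token_id(c) for c in image_codes]
    return list(text_before) + image_ids + list(text_after)

# Made-up text token IDs around three image codes
seq = build_sequence([17, 256, 9], [3742, 891, 5531], [4021])
print(seq)  # [17, 256, 9, 35742, 32891, 37531, 4021]
```

The model then needs a single embedding table and a single output head of size `TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE`.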

How Each Model Implements This

Model	Visual tokenizer	Generation	Training
Chameleon (Meta)	VQ-VAE (8192 codes)	Pure autoregressive	Text + image together from the start
Gemini 2.0 (Google)	Proprietary tokenizer (undisclosed)	Undisclosed (reported: AR + diffusion decoder)	Native multimodal
GPT-4o (OpenAI)	Undisclosed (reported: discrete visual tokens)	Undisclosed (reported: AR + diffusion decoder)	Native multimodal
LlamaGen	VQGAN (16384 codes)	AR with LLaMA-style architecture	Image models trained from scratch (borrows the LLaMA architecture, not its weights)

AR vs. Diffusion: When to Use Each?

Autoregressive Generation

Unifies text and image in the same architecture
Best for multimodal any-to-any
Leverages the entire LLM infrastructure
Scales well with more data
Slow: 1 token at a time

Diffusion (DDPM / Flow Matching)

Best standalone image quality
Coherent global generation
More control (guidance, cfg scale)
Usually faster per image than token-by-token AR (tens of denoising steps versus hundreds or thousands of token steps)
Does not natively unify with text

The current trend: hybrids, with an autoregressive LLM backbone for understanding and reasoning and a diffusion decoder to render the final image at high quality. GPT-4o is widely reported to work this way, although OpenAI has not published the architecture.

Implementation: VQ-GAN + Autoregressive Transformer

import torch
import torch.nn as nn

# 1. Vector quantizer
class VectorQuantizer(nn.Module):
    def __init__(self, n_codes, d_code):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, d_code)

    def forward(self, z):
        # z: (B, H, W, d_code) — encoder latents
        # reshape (not view) also works when z is non-contiguous, e.g. after permute
        flat = z.reshape(-1, z.shape[-1])
        # Distances to codebook
        dists = torch.cdist(flat, self.codebook.weight)
        indices = dists.argmin(dim=-1)            # index of nearest code
        quantized = self.codebook(indices).view_as(z)
        # Straight-through estimator for backprop
        quantized_st = z + (quantized - z).detach()
        return quantized_st, indices.view(z.shape[:3])

# 2. Autoregressive generation with GPT-like model
class ImageGPT(nn.Module):
    def __init__(self, n_codes, seq_len, d_model, n_heads, n_layers):
        super().__init__()
        self.tok_emb = nn.Embedding(n_codes + 1, d_model)  # +1 for BOS token
        self.pos_emb = nn.Embedding(seq_len + 1, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, d_model*4, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, n_layers)
        self.head = nn.Linear(d_model, n_codes)

    def forward(self, tokens):
        B, T = tokens.shape
        pos = torch.arange(T, device=tokens.device).unsqueeze(0)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask
        mask = torch.triu(torch.ones(T, T, device=tokens.device), diagonal=1).bool()
        x = self.transformer(x, mask=mask)
        return self.head(x)  # logits over n_codes

    @torch.no_grad()
    def generate(self, prompt_tokens, n_new, temperature=1.0, top_k=2048):
        tokens = prompt_tokens.clone()
        for _ in range(n_new):
            logits = self(tokens)[:, -1, :] / temperature
            if top_k:
                # Keep only the k most likely codes; clamp k to the vocabulary size
                k = min(top_k, logits.size(-1))
                logits[logits < logits.topk(k)[0][:, -1:]] = -float('inf')
            probs = logits.softmax(-1)
            next_tok = torch.multinomial(probs, 1)
            tokens = torch.cat([tokens, next_tok], dim=1)
        return tokens
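The model above reserves an extra embedding row for a BOS token but does not show how training examples are built. A minimal sketch, assuming standard next-token teacher forcing with BOS stored at index `n_codes` (the helper name `build_training_pair` is made up for illustration):

```python
def build_training_pair(image_tokens, bos_id):
    """Teacher forcing: the model sees BOS plus tokens 0..i-1 and predicts token i."""
    inputs = [bos_id] + list(image_tokens[:-1])
    targets = list(image_tokens)
    return inputs, targets

n_codes = 16      # toy codebook size (assumption)
bos_id = n_codes  # the extra embedding row reserved for BOS
image_tokens = [5, 0, 12, 3]

inputs, targets = build_training_pair(image_tokens, bos_id)
print(inputs)   # [16, 5, 0, 12]
print(targets)  # [5, 0, 12, 3]
```

During training, the logits at position i would be compared with `targets[i]` using cross-entropy. At inference, generation would start from a prompt containing only the BOS token, with `n_new` set to the number of image tokens.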

1. Esser, P. et al. (2021). Taming Transformers for High-Resolution Image Synthesis (VQ-GAN).
2. Chang, H. et al. (2022). MaskGIT: Masked Generative Image Transformer.
3. Sun, P. et al. (2024). Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation (LlamaGen).
4. Chameleon Team (2024). Chameleon: Mixed-Modal Early-Fusion Foundation Models.
5. Li, T. et al. (2024). Autoregressive Image Generation without Vector Quantization (MAR).