
23. Autoregressive Generation

Autoregressive Image Generation

Diffusion models have dominated image generation since 2020, but a radically different approach has gained traction: treating images as sequences of discrete tokens and generating them the same way language models generate text.

This is the approach behind native image generation in Gemini, GPT-4o, and models like Chameleon (Meta) and LlamaGen.


The Problem: How to Tokenize an Image?

Text is naturally discrete (words, subwords). Images are continuous: each pixel is a point in \([0,255]^3\). To use autoregressive generation, we need a visual vocabulary.

The solution: VQ-GAN (Vector Quantization GAN) [1] learns a codebook of \(K\) vectors. The encoder maps each image patch to a latent vector, and the quantizer replaces that latent with the nearest vector in the codebook, converting the image into a grid of integer indices.
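To make the nearest-vector lookup concrete, here is a minimal pure-Python sketch. The 2-D codebook, the latent values, and the helper names `squared_distance` and `quantize` are made up for illustration; a real VQ-GAN learns a codebook with thousands of higher-dimensional vectors.

```python
# Toy vector quantization: map each patch latent to its nearest codebook entry.
# Codebook and latents are invented 2-D values, not taken from a trained model.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(latent, codebook):
    """Return the index of the codebook vector closest to `latent`."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(latent, codebook[k]))

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # K = 4 codes

# A 2x2 grid of encoder latents, one per image patch.
latents = [
    [(0.1, -0.2), (0.9, 0.1)],
    [(0.2, 0.8), (1.1, 0.9)],
]

token_grid = [[quantize(z, codebook) for z in row] for row in latents]
print(token_grid)  # [[0, 1], [2, 3]]
```

The resulting grid of integers is what the autoregressive model later reads and writes; the decoder turns it back into pixels by looking the indices up in the same codebook.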


Autoregressive Token Generation

With a trained codebook, we can represent any image as a sequence of \(N\) integer indices. We then generate this sequence exactly like an LLM generates text:

\[ p(t_1, t_2, \ldots, t_N) = \prod_{i=1}^{N} p(t_i \mid t_1, \ldots, t_{i-1}, \text{prompt}) \]

Each token is generated one at a time, conditioned on all previous ones and the text prompt.
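The factorization can be sketched as a sampling loop that also accumulates the sequence log-probability. Here `toy_conditional` is a hypothetical stand-in for the trained model, and `K`, `N`, and the prompt string are arbitrary toy values.

```python
import math
import random

K = 4  # codebook size (toy value)
N = 6  # number of image tokens per image (toy value)

def toy_conditional(prefix, prompt):
    """Hypothetical stand-in for p(t_i | t_1..t_{i-1}, prompt).

    Returns K probabilities. It simply favours repeating the previous
    token; a real model would be a Transformer.
    """
    favoured = prefix[-1] if prefix else len(prompt) % K
    weights = [3.0 if k == favoured else 1.0 for k in range(K)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_sequence(prompt, rng):
    tokens, log_prob = [], 0.0
    for _ in range(N):
        probs = toy_conditional(tokens, prompt)
        t = rng.choices(range(K), weights=probs, k=1)[0]
        # Chain rule: log p(t_1..t_N) is the sum of the log conditionals.
        log_prob += math.log(probs[t])
        tokens.append(t)
    return tokens, log_prob

tokens, log_prob = sample_sequence("a photo of a cat", random.Random(0))
print(tokens, log_prob)
```

Each loop iteration corresponds to one factor of the product, which is why generation cost grows linearly with the number of tokens.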


MaskGIT: Parallel Generation via Masking

Purely autoregressive generation is slow: 1024 tokens require 1024 sequential model passes. MaskGIT [2] accelerates this with iterative parallel generation:

  1. Start with all tokens masked [MASK]
  2. At each iteration, predict all tokens simultaneously (bidirectional!)
  3. "Reveal" only the tokens with highest confidence
  4. Repeat with fewer masked tokens

In just 8–12 iterations, it generates 1024 tokens, versus 1024 iterations for pure AR.
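The four steps above can be sketched as the loop below. This is a simplified illustration, not the reference implementation: `toy_predict` is a hypothetical placeholder that returns random tokens and confidences instead of calling a bidirectional Transformer, and the cosine schedule for the number of tokens left masked is the schedule reported in the MaskGIT paper, assumed here without its sampling-temperature details.

```python
import math
import random

MASK = -1  # sentinel id for a masked position (assumption for this sketch)

def toy_predict(tokens, rng, n_codes=16):
    """Hypothetical model: one (token, confidence) guess per position."""
    return [(rng.randrange(n_codes), rng.random()) for _ in tokens]

def maskgit_decode(n_tokens=1024, n_iters=8, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * n_tokens                      # step 1: everything masked
    for step in range(1, n_iters + 1):
        # Cosine schedule: number of tokens that stay masked after this step.
        ratio = math.cos(math.pi / 2 * step / n_iters)
        n_keep_masked = 0 if step == n_iters else math.floor(n_tokens * ratio)
        preds = toy_predict(tokens, rng)            # step 2: predict all positions
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Step 3: reveal the most confident masked positions first.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        n_reveal = max(0, len(masked) - n_keep_masked)
        for i in masked[:n_reveal]:
            tokens[i] = preds[i][0]
        # Step 4: the next iteration sees fewer masked tokens.
    return tokens

result = maskgit_decode()
print(sum(t != MASK for t in result))  # 1024
```

With the cosine schedule only a few tokens are committed in the early iterations, when the model has little context, and most are filled in toward the end.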


Any-to-Any: Gemini, GPT-4o, and Chameleon

The final step is removing the distinction between text and image tokens. Any-to-any models treat everything as a token sequence:

[TEXT: "a photo of"] [IMG_TOK_3742] [IMG_TOK_891] ... [IMG_TOK_5531] [TEXT: "cat"]

The standard Transformer model processes this mixed sequence naturally.
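One common way to build such a mixed sequence is to place image codes in the same id space as text tokens by adding an offset. The sketch below assumes this scheme; the tiny `TEXT_VOCAB`, the offset layout, and the helper names are hypothetical and not taken from any of the models discussed here.

```python
# Toy shared vocabulary: text ids first, image code ids shifted after them.
TEXT_VOCAB = {"<bos>": 0, "a": 1, "photo": 2, "of": 3, "cat": 4}
N_IMAGE_CODES = 8192            # visual codebook size (toy assumption)
IMAGE_OFFSET = len(TEXT_VOCAB)  # image ids start right after the text ids
VOCAB_SIZE = IMAGE_OFFSET + N_IMAGE_CODES  # size of the single output softmax

def encode_text(words):
    return [TEXT_VOCAB[w] for w in words]

def encode_image(codes):
    return [IMAGE_OFFSET + c for c in codes]

def is_image_token(token_id):
    return token_id >= IMAGE_OFFSET

sequence = (
    encode_text(["a", "photo", "of"])
    + encode_image([3742, 891, 5531])
    + encode_text(["cat"])
)
print(sequence)  # [1, 2, 3, 3747, 896, 5536, 4]
```

Because every id indexes the same embedding table and the same output layer, the Transformer needs no modality-specific branch; the ids alone tell it which kind of token it is reading or predicting.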

How Each Model Implements This

Model               | Visual tokenizer       | Generation             | Training
Chameleon (Meta)    | VQ-VAE (8192 codes)    | Pure autoregressive    | Text + image together from the start
Gemini 2.0 (Google) | Proprietary tokenizer  | AR + diffusion decoder | Native multimodal
GPT-4o (OpenAI)     | Discrete visual tokens | AR + diffusion decoder | Native multimodal
LlamaGen            | VQGAN (16384 codes)    | AR with LLaMA          | LLaMA architecture, trained from scratch on image tokens

AR vs. Diffusion: When to Use Each?

Autoregressive Generation
  • Unifies text and image in the same architecture
  • Best for multimodal any-to-any
  • Leverages the entire LLM infrastructure
  • Scales well with more data
  • Slow: one token at a time

Diffusion (DDPM / Flow Matching)
  • Best standalone image quality
  • Coherent global generation
  • More control (guidance, CFG scale)
  • Faster per image than AR
  • Does not natively unify with text

The current trend: hybrids, with an autoregressive LLM backbone for understanding and reasoning and a diffusion decoder to render the final image at high quality. GPT-4o is widely believed to follow this design, although OpenAI has not published its architecture.


Implementation: VQ-GAN + Autoregressive Transformer

import torch
import torch.nn as nn

# 1. Vector quantizer
class VectorQuantizer(nn.Module):
    def __init__(self, n_codes, d_code):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, d_code)

    def forward(self, z):
        # z: (B, H, W, d_code), the encoder latents
        # reshape (not view) also works when z is non-contiguous, e.g. after a permute
        flat = z.reshape(-1, z.shape[-1])
        # Distances to codebook
        dists = torch.cdist(flat, self.codebook.weight)
        indices = dists.argmin(dim=-1)            # index of nearest code
        quantized = self.codebook(indices).view_as(z)
        # Straight-through estimator for backprop
        quantized_st = z + (quantized - z).detach()
        return quantized_st, indices.view(z.shape[:3])

# 2. Autoregressive generation with GPT-like model
class ImageGPT(nn.Module):
    def __init__(self, n_codes, seq_len, d_model, n_heads, n_layers):
        super().__init__()
        self.tok_emb = nn.Embedding(n_codes + 1, d_model)  # +1 for BOS token
        self.pos_emb = nn.Embedding(seq_len + 1, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, d_model*4, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, n_layers)
        self.head = nn.Linear(d_model, n_codes)

    def forward(self, tokens):
        B, T = tokens.shape
        pos = torch.arange(T, device=tokens.device).unsqueeze(0)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask
        mask = torch.triu(torch.ones(T, T, device=tokens.device), diagonal=1).bool()
        x = self.transformer(x, mask=mask)
        return self.head(x)  # logits over n_codes

    @torch.no_grad()
    def generate(self, prompt_tokens, n_new, temperature=1.0, top_k=2048):
        tokens = prompt_tokens.clone()
        for _ in range(n_new):
            logits = self(tokens)[:, -1, :] / temperature
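            if top_k:
                # Keep only the k most likely codes; cap k at the vocabulary size
                # so topk does not fail when n_codes < top_k
                k = min(top_k, logits.size(-1))
                kth = logits.topk(k).values[:, -1:]
                logits = logits.masked_fill(logits < kth, -float('inf'))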
            probs = logits.softmax(-1)
            next_tok = torch.multinomial(probs, 1)
            tokens = torch.cat([tokens, next_tok], dim=1)
        return tokens