23. Autoregressive Generation
Autoregressive Image Generation
Diffusion models have dominated image generation since 2020, but a radically different approach has gained traction: treating images as sequences of discrete tokens and generating them the same way language models generate text.
This is the approach behind native image generation in Gemini, GPT-4o, and models like Chameleon (Meta) and LlamaGen.
The Problem: How to Tokenize an Image?
Text is naturally discrete (words, subwords). Images are effectively continuous: each pixel takes a value in \([0,255]^3\). To use autoregressive generation, we need a visual vocabulary.
The solution: VQ-GAN (Vector Quantized GAN) [1] learns a codebook of \(K\) vectors. The encoder maps each image patch to a latent vector, which is then replaced by the nearest vector in the codebook. This converts the image into a grid of integer indices.
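To make the lookup concrete, here is a minimal, dependency-free sketch of the nearest-code assignment. The 4-entry codebook and the 2×2 grid of latents are made-up toy values, and `quantize` is a hypothetical helper name; a real VQ-GAN learns the codebook and produces the latents with a convolutional encoder.

```python
def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        [min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))
         for vec in row]
        for row in latents
    ]


# Hypothetical toy codebook: K = 4 codes of dimension 2.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Hypothetical 2x2 grid of encoder outputs (one latent vector per image patch).
latents = [
    [(0.1, -0.2), (0.9, 0.1)],
    [(0.2, 0.8), (1.2, 0.9)],
]

token_grid = quantize(latents, codebook)
# Raster-scan flattening turns the grid into a 1D token sequence.
token_sequence = [idx for row in token_grid for idx in row]

print(token_grid)      # [[0, 1], [2, 3]]
print(token_sequence)  # [0, 1, 2, 3]
```

The flattened sequence is what an autoregressive model would be trained on; raster-scan order is one common choice, assumed here for simplicity.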
Autoregressive Token Generation
With a trained codebook, we can represent any image as a sequence of \(N\) integer indices. We then generate this sequence exactly like an LLM generates text: each token is generated one at a time, conditioned on all previous tokens and the text prompt.
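Written as a formula, this is the standard chain-rule factorization. The symbols \(x_i\) for the \(i\)-th image token and \(c\) for the text prompt are notation assumed here for illustration:

\[
p(x_1, \dots, x_N \mid c) = \prod_{i=1}^{N} p(x_i \mid x_1, \dots, x_{i-1}, c), \qquad x_i \in \{0, \dots, K-1\}
\]

As a rough size check, assuming a tokenizer that downsamples by a factor of 16 in each spatial dimension (the actual factor depends on the tokenizer), a 512×512 image becomes a 32×32 grid, so \(N = 32 \times 32 = 1024\) tokens.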
MaskGIT: Parallel Generation via Masking
Purely autoregressive generation is slow: 1024 tokens require 1024 model passes. MaskGIT [2] accelerates this with iterative parallel generation:
- Start with all tokens masked ([MASK])
- At each iteration, predict all tokens simultaneously (bidirectional!)
- "Reveal" only the tokens with highest confidence
- Repeat with fewer masked tokens
In just 8 to 12 iterations, it generates 1024 tokens, versus 1024 iterations for pure AR.
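The loop above can be sketched in a few lines of plain Python. This is a simplified illustration, not the MaskGIT reference implementation: `toy_predict` is a hypothetical stand-in that returns random tokens and confidences where a real system would run a bidirectional transformer, and the cosine schedule for how many tokens stay masked is the schedule commonly associated with MaskGIT, assumed here.

```python
import math
import random

MASK = -1  # hypothetical sentinel value for a masked position


def toy_predict(tokens, n_codes, rng):
    """Stand-in for a bidirectional model: (token, confidence) per position."""
    return [(rng.randrange(n_codes), rng.random()) for _ in tokens]


def maskgit_decode(n_tokens=1024, n_codes=16, n_iters=8, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * n_tokens          # start with everything masked
    masked_after_iter = []
    for t in range(1, n_iters + 1):
        # Cosine schedule: number of tokens that remain masked after step t.
        n_keep_masked = math.floor(n_tokens * math.cos(math.pi / 2 * t / n_iters))
        if t == n_iters:
            n_keep_masked = 0           # the last step reveals everything
        preds = toy_predict(tokens, n_codes, rng)  # one parallel pass
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        # Reveal the most confident masked positions; the rest stay masked.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        n_reveal = len(masked) - n_keep_masked
        for i in masked[:n_reveal]:
            tokens[i] = preds[i][0]
        masked_after_iter.append(n_keep_masked)
    return tokens, masked_after_iter


tokens, schedule = maskgit_decode()
print(schedule)  # number of tokens still masked after each of the 8 passes
```

With this schedule only a few tokens are committed in the early passes, when the model has little context, and most are filled in during the later passes.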
Any-to-Any: Gemini, GPT-4o, and Chameleon
The final step is removing the distinction between text and image tokens. Any-to-any models treat everything as a single token sequence in which text tokens and image tokens are mixed.
A standard Transformer processes this mixed sequence naturally.
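One common way to build such a mixed sequence is to give image codes their own ID range after the text vocabulary and to wrap each image in boundary tokens. The sketch below illustrates that idea only; the vocabulary sizes, the `BOI`/`EOI` marker IDs, and the helper names are all hypothetical and do not describe any specific model's tokenizer.

```python
TEXT_VOCAB_SIZE = 32000                 # hypothetical text vocabulary size
N_IMAGE_CODES = 8192                    # hypothetical visual codebook size
BOI = TEXT_VOCAB_SIZE + N_IMAGE_CODES   # hypothetical begin-of-image marker
EOI = BOI + 1                           # hypothetical end-of-image marker


def image_to_ids(codes):
    """Shift codebook indices into the shared vocabulary and add markers."""
    return [BOI] + [TEXT_VOCAB_SIZE + c for c in codes] + [EOI]


def build_sequence(segments):
    """segments: list of ("text", token_ids) or ("image", codebook_indices)."""
    seq = []
    for kind, ids in segments:
        if kind == "text":
            seq.extend(ids)
        elif kind == "image":
            seq.extend(image_to_ids(ids))
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return seq


def modality(token_id):
    """Recover which modality a shared-vocabulary ID belongs to."""
    if token_id < TEXT_VOCAB_SIZE:
        return "text"
    if token_id < BOI:
        return "image"
    return "special"


seq = build_sequence([("text", [17, 256, 9]), ("image", [5, 8191, 0])])
print(seq)                         # [17, 256, 9, 40192, 32005, 40191, 32000, 40193]
print([modality(t) for t in seq])
```

Because every ID lives in one vocabulary, the same embedding table, attention layers, and next-token loss apply to both modalities.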
How Each Model Implements This
| Model | Visual tokenizer | Generation | Training |
|---|---|---|---|
| Chameleon (Meta) | VQ-VAE (8192 codes) | Pure autoregressive | Text + image together from the start |
| Gemini 2.0 (Google) | Proprietary tokenizer | AR + diffusion decoder | Native multimodal |
| GPT-4o (OpenAI) | Discrete visual tokens | AR + diffusion decoder | Native multimodal |
| LlamaGen | VQGAN (16384 codes) | AR with LLaMA | Initializes from pre-trained LLaMA |
AR vs. Diffusion: When to Use Each?
Autoregressive:
- Unifies text and image in the same architecture
- Best for multimodal any-to-any
- Leverages the entire LLM infrastructure
- Scales well with more data
- Slow: 1 token at a time
Diffusion:
- Best standalone image quality
- Coherent global generation
- More control (guidance, cfg scale)
- Faster per image than AR
- Does not natively unify with text
The current trend is hybrids: an autoregressive LLM backbone for understanding and reasoning, with a diffusion decoder to render the final image at high quality. GPT-4o is commonly described as following this design, although OpenAI has not published full architectural details.
Implementation: VQ-GAN + Autoregressive Transformer
import torch
import torch.nn as nn


# 1. Vector quantizer
class VectorQuantizer(nn.Module):
    def __init__(self, n_codes, d_code):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, d_code)

    def forward(self, z):
        # z: (B, H, W, d_code), encoder latents
        flat = z.reshape(-1, z.shape[-1])
        # Euclidean distance from every latent to every codebook vector
        dists = torch.cdist(flat, self.codebook.weight)
        indices = dists.argmin(dim=-1)  # index of nearest code
        quantized = self.codebook(indices).view_as(z)
        # Straight-through estimator: forward pass uses the quantized values,
        # gradients flow back to z as if quantization were the identity
        quantized_st = z + (quantized - z).detach()
        return quantized_st, indices.view(z.shape[:3])


# 2. Autoregressive generation with GPT-like model
class ImageGPT(nn.Module):
    def __init__(self, n_codes, seq_len, d_model, n_heads, n_layers):
        super().__init__()
        self.tok_emb = nn.Embedding(n_codes + 1, d_model)  # +1 for BOS token
        self.pos_emb = nn.Embedding(seq_len + 1, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, d_model * 4, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, n_layers)
        self.head = nn.Linear(d_model, n_codes)

    def forward(self, tokens):
        B, T = tokens.shape
        pos = torch.arange(T, device=tokens.device).unsqueeze(0)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask: True above the diagonal blocks attention to future tokens
        mask = torch.triu(torch.ones(T, T, device=tokens.device), diagonal=1).bool()
        x = self.transformer(x, mask=mask)
        return self.head(x)  # logits over n_codes

    @torch.no_grad()
    def generate(self, prompt_tokens, n_new, temperature=1.0, top_k=2048):
        tokens = prompt_tokens.clone()
        for _ in range(n_new):
            logits = self(tokens)[:, -1, :] / temperature
            if top_k:
                # Keep only the k most likely codes; clamp k to the vocabulary
                # size so topk does not fail when n_codes < top_k
                k = min(top_k, logits.size(-1))
                kth_best = logits.topk(k)[0][:, -1:]
                logits[logits < kth_best] = -float('inf')
            probs = logits.softmax(-1)
            next_tok = torch.multinomial(probs, 1)
            tokens = torch.cat([tokens, next_tok], dim=1)
        return tokens
1. Esser, P. et al. (2021). Taming Transformers for High-Resolution Image Synthesis (VQ-GAN).
2. Chang, H. et al. (2022). MaskGIT: Masked Generative Image Transformer.
3. Sun, P. et al. (2024). Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation (LlamaGen).
4. Chameleon Team (2024). Chameleon: Mixed-Modal Early-Fusion Foundation Models.
5. Li, T. et al. (2024). Autoregressive Image Generation without Vector Quantization (MAR).