13. Transformers
In 2017, the paper "Attention Is All You Need" [1] removed recurrence and convolutions from sequence models, replacing them entirely with attention. The result was the Transformer, the architecture behind GPT, BERT, ViT, Whisper, and most current state-of-the-art models in language, vision, and speech.
Architecture Overview
The original Transformer is an encoder-decoder model for machine translation. Today there are encoder-only variants (BERT, for classification/representation) and decoder-only variants (GPT, for generation).
The Encoder Block in Detail
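Each of the \(N\) identical encoder blocks consists of:
- Multi-head self-attention
- A residual connection followed by Layer Normalization
- A position-wise Feed-Forward Network (FFN)
- A second residual connection followed by Layer Normalization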
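The FFN (Feed-Forward Network) is applied independently to each position. In the original paper it is two linear layers with a ReLU in between (the implementation at the end of this section uses GELU instead):
\[\text{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2\]
where \(W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}\) and \(W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}\).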
In the original Transformer, \(d_{\text{model}} = 512\) and \(d_{\text{ff}} = 2048\), so the FFN's hidden layer is 4× wider than the embedding (an expansion, not a bottleneck).
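As a rough illustration of "applied independently to each position", the sketch below builds a position-wise FFN with the sizes quoted above and checks that running it on a whole sequence gives the same result as running it one position at a time. The ReLU activation, the input shape, and the variable names are assumptions made for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, d_ff = 512, 2048  # sizes quoted in the text

# Position-wise FFN: expand to d_ff, apply a nonlinearity, project back.
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),  # assumption: ReLU; GELU is a common alternative
    nn.Linear(d_ff, d_model),
)

x = torch.randn(2, 5, d_model)  # (batch, positions, d_model)

with torch.no_grad():
    full = ffn(x)  # all positions at once
    # The same network applied to each position separately, then re-stacked.
    per_position = torch.stack(
        [ffn(x[:, t]) for t in range(x.shape[1])], dim=1
    )

same = torch.allclose(full, per_position, atol=1e-4)
num_params = sum(p.numel() for p in ffn.parameters())
```

Because no information moves between positions inside the FFN, all mixing across the sequence has to happen in the attention sublayer. With these sizes the FFN holds roughly 2.1 million parameters per block.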
Residual connections (inspired by ResNets) ensure gradients flow directly through many layers. Layer Normalization normalizes along the feature dimension (not batch), which is more stable for variable-length sequences.
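A small sketch of what "along the feature dimension" means in practice, using PyTorch's `nn.LayerNorm`. The tensor sizes and the stand-in `sublayer` are assumptions chosen only for the demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model = 8
x = torch.randn(2, 3, d_model) * 5 + 2  # (batch, positions, features)

layer_norm = nn.LayerNorm(d_model)  # default init: scale 1, shift 0
y = layer_norm(x)

# Each token vector is normalized over its own features only.
per_token_mean = y.mean(dim=-1)
per_token_var = y.var(dim=-1, unbiased=False)

# The result for one sample does not depend on the rest of the batch.
first_alone = layer_norm(x[:1])

# Residual connection followed by LayerNorm, around a stand-in sublayer.
sublayer = nn.Linear(d_model, d_model)  # hypothetical sublayer
out = layer_norm(x + sublayer(x))
```

Every token ends up with mean close to 0 and variance close to 1 across its features, regardless of batch size or sequence length, which is why the statistic behaves the same for short and long sequences.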
Decoder and Autoregressive Generation
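The decoder generates tokens one at a time, each conditioned on the encoder input \(x\) and on everything generated so far:
\[P(y_1, \dots, y_T \mid x) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, x)\]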
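Generating one token at a time, each conditioned on the prefix so far, can be sketched as a greedy decoding loop. The `BIGRAMS` table and `next_token_probs` function are hypothetical stand-ins for a trained decoder; a real Transformer would attend to the whole prefix, whereas this toy only looks at the last token.

```python
import math

# Hypothetical toy distribution standing in for a trained decoder.
BIGRAMS = {
    "<bos>": {"the": 0.9, "cat": 0.1},
    "the": {"cat": 0.8, "sat": 0.2},
    "cat": {"sat": 0.7, "<eos>": 0.3},
    "sat": {"<eos>": 1.0},
}


def next_token_probs(prefix):
    """Return P(next token | prefix) for the toy model."""
    return BIGRAMS[prefix[-1]]


def greedy_decode(max_len=10):
    """Append the most likely next token until <eos> or max_len."""
    tokens = ["<bos>"]
    log_prob = 0.0  # log of the product of per-step probabilities
    while len(tokens) < max_len:
        probs = next_token_probs(tokens)
        token = max(probs, key=probs.get)
        log_prob += math.log(probs[token])
        tokens.append(token)
        if token == "<eos>":
            break
    return tokens, log_prob


tokens, log_prob = greedy_decode()
```

The accumulated `log_prob` is the log of the product of the per-step conditional probabilities, which is the sequence probability under this toy model. Greedy selection is only one decoding strategy; sampling and beam search use the same loop with a different choice rule.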
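To prevent position \(t\) from attending to future tokens during training, masked self-attention adds a causal mask \(M\) to the attention scores before the softmax:
\[M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases} \qquad \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V\]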
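A minimal sketch of a causal mask, assuming a single head and random queries and keys. It uses the same convention as the implementation at the end of this section: a mask entry of 0 means "do not attend", and those scores are set to \(-\infty\) before the softmax.

```python
import torch

torch.manual_seed(0)

n, d_k = 4, 8  # sequence length and per-head dimension (arbitrary)
Q = torch.randn(n, d_k)
K = torch.randn(n, d_k)

# Lower-triangular mask: entry (i, j) is 1 when position i may attend to j,
# that is, when j <= i.
mask = torch.tril(torch.ones(n, n))

scores = Q @ K.T / d_k**0.5
scores = scores.masked_fill(mask == 0, float("-inf"))
attn = scores.softmax(dim=-1)  # each row is a distribution over j <= i
```

After the softmax, every weight above the diagonal is exactly zero, so position \(i\) receives no information from later positions, and the first position can only attend to itself.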
BERT vs. GPT — Encoder vs. Decoder
BERT [2] (encoder-only):
- Bidirectional: sees left and right context
- Pre-trained with Masked Language Model
- Excellent for classification, NER, QA
- Not generative
GPT (decoder-only):
- Causal: sees only left context
- Pre-trained with Next Token Prediction
- Excellent for text generation, chat, code
- Scale leads to emergent capabilities
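The two pre-training objectives differ mainly in how inputs and targets are built from the same text. The sketch below shows that difference on a toy sentence. Both helper functions are hypothetical, and the masked position is picked by hand; BERT itself selects about 15% of positions at random and does not always substitute the mask token.

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]


def next_token_pairs(tokens):
    """GPT-style objective: at each position, the target is the next token."""
    inputs = tokens[:-1]
    targets = tokens[1:]
    return inputs, targets


def masked_lm_pairs(tokens, masked_positions, mask_token="[MASK]"):
    """BERT-style objective: hide some positions and predict the originals."""
    inputs = [
        mask_token if i in masked_positions else tok
        for i, tok in enumerate(tokens)
    ]
    targets = {i: tokens[i] for i in sorted(masked_positions)}
    return inputs, targets


gpt_inputs, gpt_targets = next_token_pairs(tokens)
bert_inputs, bert_targets = masked_lm_pairs(tokens, masked_positions={2})
```

In the GPT-style pairs every position yields a training signal but may only use context to its left. In the BERT-style pairs only the hidden positions are predicted, using context from both sides.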
Vision Transformer (ViT)
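In 2020, Dosovitskiy et al. [3] applied Transformers to images by splitting each image of height \(H\), width \(W\), and \(C\) channels into \(P \times P\) pixel patches (with \(P = 16\)) and flattening each patch into a token:
\[x \in \mathbb{R}^{H \times W \times C} \;\longrightarrow\; x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}\]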
where \(N = HW/P^2\) is the number of patches. A special [CLS] token is prepended and its final output is used for classification.
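A sketch of the patch-to-token step, assuming a \(224 \times 224\) RGB image, \(P = 16\), and an embedding width of 768. These sizes, the zero-initialized `cls_token`, and the variable names are illustrative choices. Positional embeddings are omitted for brevity.

```python
import torch

H, W, C, P = 224, 224, 3, 16
N = (H * W) // (P * P)  # number of patches

image = torch.randn(1, C, H, W)  # (batch, channels, height, width)

# Cut the image into non-overlapping P x P patches and flatten each one.
patches = image.unfold(2, P, P).unfold(3, P, P)  # (1, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5)      # (1, H/P, W/P, C, P, P)
patches = patches.reshape(1, N, C * P * P)       # (1, N, C*P*P)

d_model = 768  # assumed embedding width
proj = torch.nn.Linear(C * P * P, d_model)  # linear patch embedding
cls_token = torch.zeros(1, 1, d_model)      # a learned parameter in practice

# Prepend [CLS] so the sequence has N + 1 tokens.
tokens = torch.cat([cls_token, proj(patches)], dim=1)
```

With these sizes the image becomes 196 patch tokens plus one [CLS] token, so the encoder sees a sequence of length 197.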
Given sufficiently large pre-training datasets, ViT matched or surpassed state-of-the-art CNNs on ImageNet, suggesting that convolutional inductive biases (locality, translation equivariance) matter much less when enough data is available.
Scaling Laws and Large Language Models
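Kaplan et al. (2020) [4] found empirically that language model loss follows power laws in model size and data size. A commonly used way to summarize this behavior is:
\[L(N, D) \approx \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + L_\infty\]
where \(N\) is the number of parameters, \(D\) the data size, \(L_\infty\) the irreducible loss, and \(A\), \(B\), \(\alpha\), \(\beta\) are empirically fitted constants.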
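To get a feel for how such a law behaves, the sketch below evaluates an additive power-law form, \(L(N, D) = A/N^{\alpha} + B/D^{\beta} + L_\infty\). The functional form is an assumption for this example, and the constants are made up for illustration; they are not fitted values from any paper.

```python
def scaling_loss(N, D, A=100.0, B=100.0, alpha=0.3, beta=0.3, L_inf=2.0):
    """Loss predicted by an additive power law (illustrative constants only)."""
    return A / N**alpha + B / D**beta + L_inf


# Scaling parameters and data together by 100x lowers the predicted loss...
small = scaling_loss(N=1e8, D=1e10)
large = scaling_loss(N=1e10, D=1e12)

# ...but the prediction never drops below the irreducible term.
limit = scaling_loss(N=float("inf"), D=float("inf"))
```

Each power-law term shrinks as its resource grows, so returns diminish, and the loss approaches \(L_\infty\) only in the limit of unlimited parameters and data.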
This led to the LLM paradigm: training enormous models (billions of parameters) on trillions of tokens. The next class explores this world.
Quick Implementation Reference
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Multi-head scaled dot-product attention."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # per-head dimension
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        B, n, _ = q.shape
        # Project, then split into heads: (B, num_heads, seq_len, d_k)
        Q = self.W_q(q).view(B, n, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(k).view(B, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(v).view(B, -1, self.num_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product scores: (B, num_heads, n_q, n_k)
        scores = Q @ K.transpose(-2, -1) / self.d_k**0.5
        if mask is not None:
            # Entries where mask == 0 must not be attended to
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = scores.softmax(dim=-1)
        # Merge the heads back together: (B, n, d_model)
        out = (attn @ V).transpose(1, 2).reshape(B, n, -1)
        return self.W_o(out)


class TransformerBlock(nn.Module):
    """Post-norm encoder block: attention and FFN, each followed by Add & Norm."""

    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention sublayer with residual connection, then LayerNorm
        x = self.norm1(x + self.drop(self.attn(x, x, x, mask)))
        # Feed-forward sublayer with residual connection, then LayerNorm
        x = self.norm2(x + self.drop(self.ff(x)))
        return x
1. Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS.
2. Devlin, J. et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers.
3. Dosovitskiy, A. et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
4. Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models.