13. Transformers

In 2017, the paper "Attention Is All You Need"¹ removed recurrence and convolutions from sequence models and relied entirely on attention. The result was the Transformer, the architecture behind GPT, BERT, ViT, Whisper, and most of today's state-of-the-art models in language, vision, and speech.

Architecture Overview

The original Transformer is an encoder-decoder model for machine translation. Today there are encoder-only variants (BERT, for classification/representation) and decoder-only variants (GPT, for generation).

The Encoder Block in Detail

Each of the \(N\) identical encoder blocks consists of two sub-layers applied in sequence, with the second taking the output of the first as its input:

\[ \text{SubLayer}_1(x) = \text{LayerNorm}(x + \text{MultiHead}(x, x, x)) \]

\[ \text{SubLayer}_2(x) = \text{LayerNorm}(x + \text{FFN}(x)) \]

The FFN (Feed-Forward Network) is applied independently to each position:

\[ \text{FFN}(x) = \max(0,\; xW_1 + b_1)W_2 + b_2 \]

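Typically \(d_{\text{model}} = 512\) and \(d_{\text{ff}} = 2048\): the hidden layer is 4× wider than the embedding, so the FFN expands each position's representation and then projects it back to \(d_{\text{model}}\). It is an expansion, not a bottleneck.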
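As a rough sanity check on these sizes, the sketch below counts the FFN's parameters for \(d_{\text{model}} = 512\), \(d_{\text{ff}} = 2048\) and applies the formula to a tiny hand-made example. The helper names (`ffn_param_count`, `matvec`, `ffn`) and the toy weights are illustrative assumptions, not taken from the paper.

```python
def ffn_param_count(d_model, d_ff):
    """Weights and biases of the two linear layers (W_1, b_1, W_2, b_2)."""
    return d_model * d_ff + d_ff + d_ff * d_model + d_model


def matvec(x, W, b):
    """Compute xW + b for a row vector x and a matrix W given as a list of rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]


def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position."""
    hidden = [max(0.0, h) for h in matvec(x, W1, b1)]  # ReLU
    return matvec(hidden, W2, b2)


# Toy sizes (d_model = 2, d_ff = 4), chosen only for illustration.
W1 = [[1.0, -1.0, 0.5, 0.0],
      [0.0, 1.0, -0.5, 2.0]]
b1 = [0.0, 0.0, 0.0, -1.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0],
      [0.5, -0.5]]
b2 = [0.1, -0.1]

x = [1.0, 2.0]
y = ffn(x, W1, b1, W2, b2)            # output has d_model = 2 entries again
n_params = ffn_param_count(512, 2048)  # 2 * 512 * 2048 + 2048 + 512
print(y, n_params)
```

With the sizes from the text, the FFN alone holds about 2.1 million parameters per block, most of them in the two weight matrices.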

Residual connections (inspired by ResNets) give gradients a direct path through many layers. Layer Normalization normalizes each token's vector along the feature dimension (not across the batch), so its statistics do not depend on batch size or sequence length, which makes it more stable for variable-length sequences.
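To make the "add and normalize" step concrete, here is a minimal sketch for a single token vector. It assumes a simplified LayerNorm without the learnable gain and bias, and the names `layer_norm` and `add_and_norm` are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one token's features to zero mean and (almost) unit variance.

    The learnable gain and bias of a full LayerNorm are omitted here.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_and_norm(x, sublayer_out):
    """LayerNorm(x + Sublayer(x)) for a single position."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])


x = [1.0, 2.0, 3.0, 4.0]
sublayer_out = [0.5, -0.5, 0.5, -0.5]  # stand-in for an attention or FFN output
y = add_and_norm(x, sublayer_out)

y_mean = sum(y) / len(y)
y_var = sum((v - y_mean) ** 2 for v in y) / len(y)
print(y, y_mean, y_var)
```

The statistics are computed from this one vector only, so the result is the same whatever else is in the batch.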

Decoder and Autoregressive Generation

The decoder generates tokens one at a time, conditioned on everything generated so far:

\[ p(y_1, \dots, y_T \mid \text{encoder output}) = \prod_{t=1}^{T} p(y_t \mid y_{<t},\; \text{encoder output}) \]
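The factorization can be illustrated with a toy greedy decoding loop. The lookup table below is a hypothetical stand-in for a trained decoder (its tokens and probabilities are invented for illustration). The loop multiplies the conditional probabilities by summing their logs.

```python
import math

# Hypothetical conditional distributions p(y_t | y_<t), keyed by the prefix so far.
TOY_MODEL = {
    ("<bos>",): {"the": 0.6, "a": 0.4},
    ("<bos>", "the"): {"cat": 0.7, "dog": 0.3},
    ("<bos>", "the", "cat"): {"<eos>": 0.9, "sat": 0.1},
}


def greedy_decode(max_len=10):
    """Generate one token at a time, always picking the most likely next token."""
    tokens = ["<bos>"]
    log_prob = 0.0
    for _ in range(max_len):
        probs = TOY_MODEL[tuple(tokens)]          # conditioned on everything so far
        next_token = max(probs, key=probs.get)
        log_prob += math.log(probs[next_token])   # log of the running product
        tokens.append(next_token)
        if next_token == "<eos>":
            break
    return tokens, log_prob


tokens, log_prob = greedy_decode()
sequence_prob = math.exp(log_prob)  # product of the three conditionals
print(tokens, sequence_prob)
```

Summing log-probabilities instead of multiplying probabilities is the usual choice because long products of small numbers underflow.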

To prevent token \(t\) from seeing future tokens during training, masked self-attention adds a causal mask to the attention scores before the softmax, so that future positions receive zero weight:

Causal mask (n=4 tokens):
       pos0   pos1   pos2   pos3
pos0 [  0    -inf   -inf   -inf ]   (sees only itself)
pos1 [  0     0     -inf   -inf ]   (sees pos0 and pos1)
pos2 [  0     0      0     -inf ]   (sees pos0 to pos2)
pos3 [  0     0      0      0   ]   (sees everything)
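The effect of this mask on the attention weights can be checked in a few lines. This is a minimal sketch assuming uniform raw scores, so the only thing shaping the weights is the mask. The helper names are hypothetical. The reference implementation later in this section takes the equivalent 1/0 mask and fills the zero entries with -inf.

```python
import math


def causal_mask(n):
    """Additive mask: 0 where attention is allowed, -inf for future positions."""
    return [[0.0 if j <= i else float("-inf") for j in range(n)]
            for i in range(n)]


def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


n = 4
scores = [[0.0] * n for _ in range(n)]  # uniform raw scores, for illustration only
mask = causal_mask(n)

# Add the mask to the scores, then normalize each row.
weights = [softmax([s + m for s, m in zip(score_row, mask_row)])
           for score_row, mask_row in zip(scores, mask)]

for row in weights:
    print(row)
```

Each row still sums to 1, but all of the weight falls on the current and earlier positions.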

BERT vs. GPT — Encoder vs. Decoder

BERT² (Encoder-only)

Bidirectional: sees left and right context
Pre-trained with Masked Language Model
Excellent for classification, NER, QA
Not designed for autoregressive text generation

GPT (Decoder-only)

Causal: sees only left context
Pre-trained with Next Token Prediction
Excellent for text generation, chat, code
Scale leads to emergent capabilities
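The two pre-training objectives differ mainly in how inputs and targets are built from the same token sequence. The sketch below assumes a whitespace-tokenized toy sentence and a simplified masking rule. Real BERT pre-training masks about 15% of tokens and sometimes substitutes a random or unchanged token instead of `[MASK]`. The function names are hypothetical.

```python
import random


def next_token_pairs(tokens):
    """GPT-style objective: the target at each position is the following token."""
    return tokens[:-1], tokens[1:]


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """BERT-style objective (simplified): hide some tokens and predict them
    from both left and right context."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            targets.append(tok)      # the model must recover the hidden token
        else:
            inputs.append(tok)
            targets.append(None)     # no loss at unmasked positions
    return inputs, targets


tokens = "the cat sat on the mat".split()

gpt_inputs, gpt_targets = next_token_pairs(tokens)
bert_inputs, bert_targets = mask_tokens(tokens, mask_prob=0.5)

print(gpt_inputs, gpt_targets)
print(bert_inputs, bert_targets)
```

In the GPT-style setup every position contributes to the loss, while in the masked setup only the hidden positions do.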

Vision Transformer (ViT)

The encoder is not limited to text. In 2020, Dosovitskiy et al.³ applied this same architecture to images by dividing them into \(16 \times 16\) pixel patches and treating each patch as a token. When pre-trained on sufficiently large datasets, the resulting model matched or surpassed strong CNNs on ImageNet. The next class, Vision Transformers, is dedicated to this idea.
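To see what "each patch is a token" means in numbers, the sketch below counts the patches of an image and splits a tiny grid into flattened patches. The 224×224 RGB input size is an assumption used for illustration, and the helper names are hypothetical.

```python
def num_patches(height, width, patch=16):
    """Number of non-overlapping patch tokens for an image of the given size."""
    assert height % patch == 0 and width % patch == 0, "size must divide evenly"
    return (height // patch) * (width // patch)


def patchify(image, patch):
    """Split a 2D grid (list of rows) into flattened patches, in row-major order."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


# Assumed input: a 224x224 RGB image with 16x16 patches.
n_tokens = num_patches(224, 224)   # 14 * 14 patches
patch_dim = 16 * 16 * 3            # values per flattened RGB patch

# Toy 4x4 single-channel "image" split into 2x2 patches.
toy = [[r * 4 + c for c in range(4)] for r in range(4)]
toy_patches = patchify(toy, 2)
print(n_tokens, patch_dim, toy_patches)
```

Each flattened patch is then linearly projected to \(d_{\text{model}}\) dimensions before entering the encoder, just as a word embedding would be.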

Scaling Laws and Large Language Models

Kaplan et al. (2020)⁴ found empirically that language model loss follows power laws in model size and dataset size. A simplified additive form of this relationship is:

\[ L(N, D) = \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty \]

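where \(N\) is the number of parameters, \(D\) the dataset size (in tokens), and \(L_\infty\) the irreducible loss. \(N_c\), \(D_c\), \(\alpha_N\), and \(\alpha_D\) are constants fitted to experimental results.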
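A quick way to build intuition for the formula above is to evaluate it at a few model and data sizes. The constants in this sketch are placeholders chosen for illustration, not fitted values from the paper, and `scaling_loss` is a hypothetical helper name.

```python
def scaling_loss(N, D, N_c, D_c, alpha_N, alpha_D, L_inf):
    """L(N, D) = (N_c / N)^alpha_N + (D_c / D)^alpha_D + L_inf."""
    return (N_c / N) ** alpha_N + (D_c / D) ** alpha_D + L_inf


# Placeholder constants (assumptions for illustration, NOT fitted values).
CONSTS = dict(N_c=1e13, D_c=1e13, alpha_N=0.08, alpha_D=0.1, L_inf=1.5)

small = scaling_loss(N=1e8, D=1e10, **CONSTS)       # small model
large = scaling_loss(N=1e10, D=1e10, **CONSTS)      # 100x more parameters
more_data = scaling_loss(N=1e10, D=1e12, **CONSTS)  # and 100x more tokens

print(small, large, more_data)
```

Because the exponents are small, each 100× increase buys only a modest reduction in loss, and the loss never drops below \(L_\infty\).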

This led to the LLM paradigm: training enormous models (billions of parameters) on trillions of tokens. The next class explores this world.

Quick Implementation Reference

PyTorch: multi-head attention and Transformer block

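import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Multi-head scaled dot-product attention."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # per-head dimension
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)  # output projection

    def forward(self, q, k, v, mask=None):
        # q: (B, n, d_model); k and v: (B, m, d_model).
        # mask: broadcastable to (B, num_heads, n, m); entries equal to 0 are blocked.
        B, n, _ = q.shape
        # Project, then split into heads: (B, num_heads, seq_len, d_k).
        Q = self.W_q(q).view(B, n, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(k).view(B, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(v).view(B, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product scores: (B, num_heads, n, m).
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5
        if mask is not None:
            # Blocked positions get -inf, so the softmax gives them zero weight.
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = scores.softmax(dim=-1)
        # Merge the heads back into one vector per position: (B, n, d_model).
        out = (attn @ V).transpose(1, 2).reshape(B, n, -1)
        return self.W_o(out)


class TransformerBlock(nn.Module):
    """Post-norm block: LayerNorm(x + SubLayer(x)), as in the equations above."""

    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)
        # Position-wise FFN. GELU is used here (as in BERT and GPT);
        # the FFN formula above, from the original paper, uses ReLU.
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Sub-layer 1: self-attention, residual connection, LayerNorm.
        x = self.norm1(x + self.drop(self.attn(x, x, x, mask)))
        # Sub-layer 2: feed-forward network, residual connection, LayerNorm.
        x = self.norm2(x + self.drop(self.ff(x)))
        return x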

¹ Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS.
² Devlin, J. et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers.
³ Dosovitskiy, A. et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
⁴ Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models.