
13. Transformers


In 2017, the paper "Attention Is All You Need" [1] removed recurrence and convolutions from sequence models and replaced them entirely with attention. The result was the Transformer, the architecture behind GPT, BERT, ViT, Whisper, and many of today's state-of-the-art models.


Architecture Overview

The original Transformer is an encoder-decoder model for machine translation. Today there are encoder-only variants (BERT, for classification/representation) and decoder-only variants (GPT, for generation).


The Encoder Block in Detail

Each of the \(N\) identical encoder blocks applies two sub-layers in sequence, each wrapped in a residual connection followed by layer normalization:

\[ h = \text{LayerNorm}(x + \text{MultiHead}(x, x, x)) \]
\[ \text{out} = \text{LayerNorm}(h + \text{FFN}(h)) \]

The FFN (Feed-Forward Network) is applied independently to each position:

\[ \text{FFN}(x) = \max(0,\; xW_1 + b_1)W_2 + b_2 \]

In the original paper, \(d_{\text{model}} = 512\) and \(d_{\text{ff}} = 2048\), so the hidden layer is 4× wider than the embedding: an expansion, not a bottleneck.
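As a minimal sketch of the position-wise FFN, the snippet below builds the two linear layers with the sizes quoted above and checks that applying the network to one position in isolation gives the same result as applying it to the whole sequence. The batch size, sequence length, and random input are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 512, 2048  # sizes quoted in the text

# Position-wise FFN: linear -> ReLU -> linear, as in the FFN equation.
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)

x = torch.randn(2, 5, d_model)  # (batch, positions, d_model), toy input
y_all = ffn(x)                  # applied to every position at once

# The same weights applied to position 3 alone give the same output,
# because the FFN never mixes information across positions.
y_pos3 = ffn(x[:, 3, :])
same = torch.allclose(y_all[:, 3, :], y_pos3, atol=1e-4)

# Parameters: W1 (512*2048) + b1 (2048) + W2 (2048*512) + b2 (512)
n_params = sum(p.numel() for p in ffn.parameters())
print(same, n_params)
```

With these sizes the FFN alone holds roughly 2.1 million parameters per block, which is why it accounts for a large share of a Transformer's weights.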

Residual connections (inspired by ResNets) ensure gradients flow directly through many layers. Layer Normalization normalizes along the feature dimension (not batch), which is more stable for variable-length sequences.
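To make "normalizes along the feature dimension" concrete, here is a small sketch comparing `nn.LayerNorm` against a manual computation. The tensor shape and the scaling of the random input are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 3, 8) * 5 + 2  # (batch, positions, features), toy input

ln = nn.LayerNorm(8)  # normalizes over the last (feature) dimension
y = ln(x)

# Manual version: statistics are computed per token, never across the batch
# or across positions, so sequence length and batch size do not matter.
mu = x.mean(dim=-1, keepdim=True)
var = ((x - mu) ** 2).mean(dim=-1, keepdim=True)
manual = (x - mu) / torch.sqrt(var + ln.eps)

per_token_mean = y.mean(dim=-1)  # close to 0 for every token
print(torch.allclose(y, manual, atol=1e-5))
```

At initialization the learnable scale and shift of `nn.LayerNorm` are 1 and 0, which is why the manual version matches without them.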


Decoder and Autoregressive Generation

The decoder generates tokens one at a time, conditioned on everything generated so far:

\[ p(y_1, \dots, y_T \mid \text{encoder output}) = \prod_{t=1}^{T} p(y_t \mid y_{<t},\; \text{encoder output}) \]
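The factorization above translates into a simple generation loop. The following is a sketch of greedy decoding only: `greedy_decode`, `toy_model`, and the token ids are hypothetical names invented for illustration, and the toy model stands in for a trained decoder (the encoder output is omitted for brevity).

```python
import torch

def greedy_decode(next_token_logits, bos_id, eos_id, max_len):
    """Generate one token at a time, always picking the most likely next token.

    next_token_logits is a hypothetical callable that maps the tokens
    generated so far (1-D tensor of ids) to logits over the vocabulary.
    """
    tokens = [bos_id]
    for _ in range(max_len):
        logits = next_token_logits(torch.tensor(tokens))
        next_id = int(logits.argmax())  # greedy choice for p(y_t | y_<t)
        tokens.append(next_id)
        if next_id == eos_id:           # stop at end-of-sequence
            break
    return tokens

VOCAB = 6

def toy_model(prefix):
    # Stand-in for a trained decoder: always predicts (last token + 1) mod VOCAB.
    logits = torch.zeros(VOCAB)
    logits[(int(prefix[-1]) + 1) % VOCAB] = 1.0
    return logits

out = greedy_decode(toy_model, bos_id=0, eos_id=4, max_len=10)
print(out)  # [0, 1, 2, 3, 4]
```

Real systems often sample from the distribution or use beam search instead of taking the argmax, but the loop structure is the same.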

To prevent token \(t\) from seeing future tokens during training, masked self-attention applies a causal mask:

Causal mask (n=4 tokens), added to the attention scores before the softmax:

       pos0  pos1  pos2  pos3
pos0 [   0   -inf  -inf  -inf ]   (sees only itself)
pos1 [   0     0   -inf  -inf ]   (sees pos0 and pos1)
pos2 [   0     0     0   -inf ]   (sees pos0 to pos2)
pos3 [   0     0     0     0  ]   (sees everything)
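
The mask above can be built in one line. This sketch assumes an additive mask (0 to allow, -inf to block) and uses uniform toy scores so the effect on the softmax is easy to read.

```python
import torch

n = 4
# Additive causal mask: 0 on and below the diagonal, -inf strictly above it.
mask = torch.full((n, n), float("-inf")).triu(diagonal=1)

scores = torch.zeros(n, n)  # toy attention scores (all equal)
weights = torch.softmax(scores + mask, dim=-1)

# Row t spreads its attention evenly over positions 0..t and gives
# exactly zero weight to every later position.
print(weights)
```

Because exp(-inf) = 0, the blocked positions drop out of the softmax entirely rather than just receiving a small weight.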

BERT vs. GPT — Encoder vs. Decoder

BERT (Encoder-only)

  • Bidirectional: sees left and right context
  • Pre-trained with Masked Language Model
  • Excellent for classification, NER, QA
  • Not generative

GPT (Decoder-only)

  • Causal: sees only left context
  • Pre-trained with Next Token Prediction
  • Excellent for text generation, chat, code
  • Scale leads to emergent capabilities
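
The two pre-training objectives differ mainly in how inputs and targets are built from the same token sequence. The sketch below is a simplification: the token ids, `MASK_ID`, and the fixed masked positions are made up for illustration (BERT chooses positions at random), while `-100` is the default `ignore_index` of `nn.CrossEntropyLoss`.

```python
import torch

tokens = torch.tensor([11, 12, 13, 14, 15])  # toy token ids

# GPT-style next token prediction: shift by one position.
# The model reads tokens[:t+1] and must predict tokens[t+1].
gpt_inputs = tokens[:-1]   # [11, 12, 13, 14]
gpt_targets = tokens[1:]   # [12, 13, 14, 15]

# BERT-style masked language modeling: hide some tokens and predict
# only those, using context from both sides.
MASK_ID = 0  # hypothetical id for the [MASK] token
masked = torch.tensor([False, True, False, False, True])
bert_inputs = tokens.masked_fill(masked, MASK_ID)  # [11, 0, 13, 14, 0]
bert_targets = tokens.masked_fill(~masked, -100)   # loss ignores -100

print(gpt_inputs.tolist(), gpt_targets.tolist())
print(bert_inputs.tolist(), bert_targets.tolist())
```

Every position contributes to the GPT loss, whereas only the masked positions contribute to the BERT loss.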

Vision Transformer (ViT)

In 2020, Dosovitskiy et al. [3] applied Transformers to images by dividing each image into \(16 \times 16\)-pixel patches, flattening each patch, and projecting it linearly into a token embedding:

\[ \text{Input} \in \mathbb{R}^{H \times W \times C} \xrightarrow{\text{patches}} \mathbb{R}^{N \times (P^2 C)} \xrightarrow{\text{linear}} \mathbb{R}^{N \times d} \]

where \(N = HW/P^2\) is the number of patches. A special [CLS] token is prepended and its final output is used for classification.
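
The patch pipeline can be sketched in a few lines of PyTorch. The image size of 224, the embedding width `d = 768`, and the zero-initialized class token are assumptions chosen for illustration; position embeddings are left out.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, C, H, W = 2, 3, 224, 224
P, d = 16, 768  # patch size from the text; d is an assumed embedding width

img = torch.randn(B, C, H, W)

# Cut the image into non-overlapping P x P patches:
# (B, C, H, W) -> (B, C, H/P, W/P, P, P)
grid = img.unfold(2, P, P).unfold(3, P, P)
# Flatten each patch into a vector of length P*P*C: (B, N, P^2 * C)
patches = grid.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)

proj = nn.Linear(C * P * P, d)  # the "linear" arrow in the formula
tokens = proj(patches)          # (B, N, d)

# Prepend a learnable [CLS] token to every sequence: (B, N + 1, d)
cls = nn.Parameter(torch.zeros(1, 1, d))
tokens = torch.cat([cls.expand(B, -1, -1), tokens], dim=1)

print(patches.shape, tokens.shape)
```

Here \(N = 224 \cdot 224 / 16^2 = 196\), so each image becomes a sequence of 197 tokens including [CLS].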

When pre-trained on sufficiently large datasets, ViT matched or surpassed state-of-the-art CNNs on ImageNet. This suggests that convolutional inductive biases (locality, translation equivariance) are not strictly necessary when enough data is available.


Scaling Laws and Large Language Models

Kaplan et al. (2020) [4] found that language model loss follows power laws in model size and dataset size. A simplified additive form of this relationship is:

\[ L(N, D) = \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty \]

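where \(N\) is the number of parameters, \(D\) the dataset size (in tokens), \(N_c\), \(D_c\), \(\alpha_N\), and \(\alpha_D\) are empirically fitted constants, and \(L_\infty\) is the irreducible loss.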
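
To get a feel for the formula, it can be evaluated directly. The constants below are placeholders chosen only for illustration, not fitted values from any paper, and `scaling_loss` is a hypothetical helper name.

```python
def scaling_loss(N, D, N_c, D_c, alpha_N, alpha_D, L_inf):
    """Evaluate the additive power-law form of the loss shown above."""
    return (N_c / N) ** alpha_N + (D_c / D) ** alpha_D + L_inf

# Placeholder constants (illustrative assumptions, not fitted values).
consts = dict(N_c=1e13, D_c=1e13, alpha_N=0.08, alpha_D=0.1, L_inf=1.5)

small = scaling_loss(N=1e8, D=1e10, **consts)
bigger_model = scaling_loss(N=1e9, D=1e10, **consts)
bigger_both = scaling_loss(N=1e9, D=1e11, **consts)

# Loss falls as either N or D grows, but never below L_inf.
print(small, bigger_model, bigger_both)
```

Because the exponents are small, each tenfold increase in parameters or data buys only a modest drop in loss, which is why progress along these curves requires very large models and datasets.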

This led to the LLM paradigm: training enormous models (billions of parameters) on trillions of tokens. The next class explores this world.


Quick Implementation Reference

import torch
import torch.nn as nn

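class MultiHeadAttention(nn.Module):
    """Multi-head scaled dot-product attention."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # dimension of each head
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)  # output projection

    def forward(self, q, k, v, mask=None):
        B, n, _ = q.shape
        # Project, then split into heads: (B, num_heads, seq_len, d_k)
        Q = self.W_q(q).view(B, n, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(k).view(B, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(v).view(B, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product scores: (B, num_heads, n_queries, n_keys)
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5
        if mask is not None:
            # mask must broadcast to the shape of scores; entries equal to 0
            # are blocked (1 = attend, 0 = masked)
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = scores.softmax(dim=-1)
        # Merge the heads back together: (B, n, d_model)
        out = (attn @ V).transpose(1, 2).reshape(B, n, -1)
        return self.W_o(out)


class TransformerBlock(nn.Module):
    """Post-norm block: each sub-layer is followed by a residual add and LayerNorm."""

    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),  # GELU is used here; the FFN equation in the text uses ReLU
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        x = self.norm1(x + self.drop(self.attn(x, x, x, mask)))  # self-attention sub-layer
        x = self.norm2(x + self.drop(self.ff(x)))  # feed-forward sub-layer
        return x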
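
For comparison, PyTorch ships a comparable post-norm block as `nn.TransformerEncoderLayer`. The sketch below is not identical to the reference code above in every detail (for example, dropout placement), and the sizes and input are illustrative. It passes an additive causal mask and checks that changing the last token leaves the outputs at earlier positions unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048,
    dropout=0.1, activation="gelu", batch_first=True,
)
layer.eval()  # disable dropout so the outputs are deterministic

x = torch.randn(2, 10, 512)  # (batch, sequence, d_model), toy input
# Additive causal mask: -inf above the diagonal, 0 elsewhere.
causal = torch.full((10, 10), float("-inf")).triu(diagonal=1)
y = layer(x, src_mask=causal)

# Perturb only the last token; earlier positions cannot see it.
x2 = x.clone()
x2[:, -1] += 1.0
y2 = layer(x2, src_mask=causal)
unchanged = torch.allclose(y[:, :-1], y2[:, :-1], atol=1e-5)
print(y.shape, unchanged)
```

The same check is a useful unit test for any hand-written causal attention implementation.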



  1. Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS.

  2. Devlin, J. et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL.

  3. Dosovitskiy, A. et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929.

  4. Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.