4. Transformers & Attention
Deadline and Submission
TBD
Commits until 23:59
Individual
GitHub Pages link via insper.blackboard.com.
Activity: Building Attention and Transformers from Scratch
This activity solidifies your understanding of attention mechanisms and Transformer architecture by implementing them from the ground up using only NumPy and Python (no PyTorch or TensorFlow for the core logic).
Exercise 1 β Scaled Dot-Product Attention
Implement the full scaled dot-product attention function:
Instructions
- Implement
softmax(x)β numerically stable version (subtract max before exponentiating) - Implement
scaled_dot_product_attention(Q, K, V, mask=None): - Compute raw scores:
scores = Q @ K.T / sqrt(d_k) - Apply mask if provided (set masked positions to
-infbefore softmax) - Apply softmax row-wise
- Return weighted sum of Values:
output = attn_weights @ V - Test with the following inputs:
import numpy as np
d_k = 4
Q = np.array([[1.0, 0.0, 1.0, 0.0], # token 1 query
[0.0, 1.0, 0.0, 1.0]]) # token 2 query
K = np.array([[1.0, 0.0, 1.0, 0.0], # token 1 key
[0.0, 1.0, 0.0, 1.0], # token 2 key
[1.0, 1.0, 0.0, 0.0]]) # token 3 key
V = np.array([[1.0, 0.0],
[0.0, 1.0],
[0.5, 0.5]])
-
Plot the attention weight matrix as a heatmap (use matplotlib). What pattern do you observe? Does token 1 attend more to token 1 or token 3? Why?
-
Apply a causal mask (lower-triangular) and re-run. Show how the attention weights change and explain why this mask is necessary for autoregressive generation.
Expected output
Report:
- The attention weight matrix (2Γ3) before and after masking
- The output matrix (2Γ2)
- Heatmap visualizations
- A brief explanation of why Q1 attends more to K1 than to K2
Exercise 2 β Multi-Head Attention from Scratch
Extend your implementation to Multi-Head Attention with \(h=2\) heads.
Architecture
where each head uses its own projection matrices.
Instructions
- Implement the class
MultiHeadAttentionwith: __init__(d_model, num_heads)β initialize random weight matrices \(W_Q^i, W_K^i, W_V^i \in \mathbb{R}^{d_{model} \times d_k}\) and \(W^O \in \mathbb{R}^{d_{model} \times d_{model}}\) for each head \(i\)forward(Q, K, V)β project, apply per-head attention, concatenate, project output- Use fixed random seed
np.random.seed(42)for reproducibility - Test with a sequence of 5 tokens, each with
d_model=8,num_heads=2(sod_k=4per head) - Verify the output shape is
(5, 8)β same as input
Questions to answer in your report
- Why does using \(h=2\) heads with \(d_k = d_{model}/h\) keep the total computation similar to \(h=1\)?
- If head 1 learns to attend to nearby tokens and head 2 to distant tokens, how does the concatenated output benefit from both?
Exercise 3 β Single-Layer Transformer Block
Combine your attention with a Feed-Forward Network to implement a Transformer Encoder Block:
Instructions
- Implement
layer_norm(x)β normalize per row (subtract mean, divide by std + Ξ΅), with learnable Ξ³=1, Ξ²=0 - Implement
ffn(x, W1, b1, W2, b2)β two linear layers with ReLU: \(\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2\), where \(d_{ff} = 4 \times d_{model}\) - Stack it all: build
transformer_encoder_block(x, mha, W1, b1, W2, b2)using your implementations above - Test with 5 tokens, d_model=8
Visualization
Plot the token representations before and after the block as a heatmap (tokens Γ dimensions). Do the representations become richer after the block? Compute and report the cosine similarity matrix between tokens before and after the block.
Exercise 4 β Positional Encoding
Implement sinusoidal positional encoding and visualize it.
Instructions
- Implement
positional_encoding(max_len, d_model)β returns matrix of shape(max_len, d_model) - Plot two visualizations:
- Heatmap: rows=positions (0β99), cols=dimensions, color=PE value
- Line plot: PE values for positions 0, 10, 50 across all dimensions
Questions
- Which dimensions encode high-frequency oscillations and which encode low-frequency?
- Why does adding PE to the token embedding allow the Transformer to distinguish position 1 from position 50?
- What happens if you add the same positional encoding to shuffled tokens?
Evaluation Criteria
Usage of Toolboxes
You may only use NumPy for matrix operations and Matplotlib/Seaborn for plots. PyTorch, TensorFlow, and other ML frameworks are strictly prohibited for the core implementation. Verify your results against PyTorch's nn.MultiheadAttention output as a sanity check only.
Failure to comply will result in your submission being rejected.
| Criteria | Points |
|---|---|
| 3 pts | Correct implementations of attention (Ex. 1) and Multi-Head Attention (Ex. 2) |
| 2 pts | Transformer block (Ex. 3): correct layer norm, FFN, and residual connections |
| 2 pts | Positional encoding (Ex. 4): correct implementation and visualizations |
| 2 pts | Visualizations: attention heatmaps, PE plots, token representation heatmaps |
| 1 pt | Report quality: clear explanations, mathematical notation, and discussion of results |
Submission format: GitHub Pages (using the course template). No other format accepted.
AI Collaboration: Allowed, but every student must be able to explain all code and analysis. Oral exams may be required.