12. Attention Mechanisms

The attention mechanism is one of the most influential innovations in deep learning. Introduced by Bahdanau et al. (2015)¹ for machine translation, it lets a neural network learn where to look in the input sequence at each decoding step, instead of compressing the whole input into a single fixed-length context vector.

Intuitively, attention resembles what you do when reading this sentence: you do not give every word equal weight. When interpreting "The cat sat on the mat because it was comfortable", resolving the pronoun "it" draws attention toward "mat" rather than "cat" or "sat". Networks with attention can learn this kind of weighting from data.

Intuition: Query, Key and Value

The attention mechanism is formalized by three concepts: Query (Q), Key (K) and Value (V).

Think of a database search analogy:

Query — what you are looking for (e.g., vector of the word "it")
Key — the index of each available item (e.g., vector of each word in the sentence)
Value — the actual content returned on a match (e.g., semantic representation of each word)

Attention computes a dot product between the Query and each Key, scales the scores by \(1/\sqrt{d_k}\), normalizes them with softmax, and uses the resulting weights to combine the Values:

\[ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \]

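Here \(d_k\) is the dimension of the query and key vectors. Dividing by \(\sqrt{d_k}\) keeps the scores at a roughly constant scale: for large \(d_k\) the raw dot products grow in magnitude and push softmax into saturated regions where gradients become very small.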
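The effect of the \(\sqrt{d_k}\) factor can be checked numerically. The sketch below assumes queries and keys with independent standard-normal entries (an idealization, not a property of trained models) and measures how the spread of the dot product \(q \cdot k\) grows with \(d_k\). The helper name `dot_product_std` is made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_std(d_k, n_samples=10_000):
    """Empirical standard deviation of q.k for random q, k of dimension d_k."""
    # Assumption: entries are i.i.d. standard normal (mean 0, variance 1).
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    return float(np.sum(q * k, axis=1).std())

for d_k in (4, 64, 512):
    raw = dot_product_std(d_k)
    scaled = raw / np.sqrt(d_k)  # std after dividing the scores by sqrt(d_k)
    print(f"d_k={d_k:4d}  std(q.k)={raw:7.2f}  std(q.k / sqrt(d_k))={scaled:.2f}")
```

Under this assumption the unscaled standard deviation grows like \(\sqrt{d_k}\), while the scaled one stays close to 1 regardless of dimension.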

Scaled Dot-Product Attention — Step by Step

Given input vectors \(X \in \mathbb{R}^{n \times d}\) (\(n\) tokens, each of dimension \(d\)), learned projection matrices \(W_Q, W_K \in \mathbb{R}^{d \times d_k}\) and \(W_V \in \mathbb{R}^{d \times d_v}\) produce:

\[ Q = X W_Q, \quad K = X W_K, \quad V = X W_V \]

Step 1 — Similarity scores:

\[ S = \frac{Q K^\top}{\sqrt{d_k}} \in \mathbb{R}^{n \times n} \]

Step 2 — Softmax normalization:

\[ A = \text{softmax}(S), \quad A_{ij} = \frac{e^{S_{ij}}}{\sum_k e^{S_{ik}}} \]

Step 3 — Weighted output:

\[ \text{Out} = A \cdot V \]

The matrix \(A\) is the attention matrix: each row sums to 1, and row \(i\) describes how much token \(i\) attends to every token in the sequence, itself included.
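The three steps above can be sketched in a few lines of NumPy. This is a minimal, unbatched illustration with random weights. The function names and the sizes `n`, `d`, `d_k` are arbitrary choices for the example, and for simplicity \(V\) uses the same width as \(Q\) and \(K\).

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # projections
    d_k = K.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)            # Step 1: similarity scores, shape (n, n)
    A = softmax(S, axis=-1)               # Step 2: row-wise softmax
    return A @ V, A                       # Step 3: weighted output

rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 3                       # illustrative sizes
X = rng.standard_normal((n, d))
W_Q, W_K, W_V = (rng.standard_normal((d, d_k)) for _ in range(3))

out, A = self_attention(X, W_Q, W_K, W_V)
print(A.shape, out.shape)                 # (4, 4) (4, 3)
print(A.sum(axis=1))                      # every row of A sums to 1
```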

Multi-Head Attention

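A single attention head produces one set of attention weights per token, so it tends to capture one kind of relationship at a time. Multi-Head Attention runs \(h\) heads in parallel, each with independent projections \(W_Q^i, W_K^i, W_V^i\), typically of reduced dimension \(d_k = d/h\) so that the total cost stays close to that of one full-width head: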

\[ \text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O \]

\[ \text{head}_i = \text{Attention}(Q W_Q^i,\; K W_K^i,\; V W_V^i) \]

Heads can specialize: for example, one may capture syntactic dependencies, another co-reference, and another positional patterns.
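As a rough sketch of the two formulas above, the following NumPy code loops over heads explicitly and concatenates their outputs before applying \(W^O\). It is written for self-attention (so \(Q\), \(K\) and \(V\) all come from the same `X`), uses random weights, and assumes \(d_k = d_v = d/h\). Real implementations usually fuse the heads into one batched matrix product.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V have shape (h, d, d_k); W_O has shape (h * d_k, d)."""
    heads = []
    for W_q, W_k, W_v in zip(W_Q, W_K, W_V):      # one iteration per head
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        heads.append(A @ V)                       # head_i, shape (n, d_k)
    return np.concatenate(heads, axis=-1) @ W_O   # Concat(head_1..head_h) W^O

rng = np.random.default_rng(0)
n, d, h = 5, 8, 2                                 # illustrative sizes
d_k = d // h                                      # assumption: d_k = d / h
X = rng.standard_normal((n, d))
W_Q, W_K, W_V = (rng.standard_normal((h, d, d_k)) for _ in range(3))
W_O = rng.standard_normal((h * d_k, d))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)                                  # (5, 8)
```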

Self-Attention vs. Cross-Attention

Type	Q from	K, V from	Typical use
Self-Attention	same sequence	same sequence	Transformer encoder, BERT
Cross-Attention	target sequence	source sequence	Decoder of an encoder-decoder Transformer (e.g., T5)
Causal Self-Attention	same seq (masked)	same seq	GPT, autoregressive decoders
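To make the cross-attention row of the table concrete, here is a small sketch in which the queries come from a target sequence of length `m` and the keys and values come from a source sequence of length `n`. The attention matrix is then \(m \times n\) rather than square. All names and sizes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(X_tgt, X_src, W_Q, W_K, W_V):
    Q = X_tgt @ W_Q                      # queries from the target sequence
    K = X_src @ W_K                      # keys from the source sequence
    V = X_src @ W_V                      # values from the source sequence
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return A @ V, A

rng = np.random.default_rng(0)
m, n, d, d_k = 2, 5, 8, 4                # illustrative sizes
X_tgt = rng.standard_normal((m, d))
X_src = rng.standard_normal((n, d))
W_Q, W_K, W_V = (rng.standard_normal((d, d_k)) for _ in range(3))

out, A = cross_attention(X_tgt, X_src, W_Q, W_K, W_V)
print(A.shape, out.shape)                # (2, 5) (2, 4)
```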

In causal self-attention, a mask \(M\) is added to the scores \(S\) before softmax, so that token \(i\) cannot attend to future tokens \(j > i\). The allowed positions form a lower-triangular pattern:

\[ M_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases} \]
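The mask can be built directly from its definition and added to the scores before softmax. The sketch below is a minimal NumPy version with hypothetical helper names. Because \(e^{-\infty} = 0\), masked positions receive exactly zero weight.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(n):
    # M[i, j] = 0 if j <= i, -inf if j > i (same definition as the formula).
    i, j = np.indices((n, n))
    return np.where(j <= i, 0.0, -np.inf)

def causal_self_attention(Q, K, V):
    n = Q.shape[0]
    S = Q @ K.T / np.sqrt(K.shape[-1])
    A = softmax(S + causal_mask(n))      # future positions get weight 0
    return A @ V, A

rng = np.random.default_rng(0)
n, d_k = 4, 3                            # illustrative sizes
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))

out, A = causal_self_attention(Q, K, V)
print(np.round(A, 2))                    # zeros above the diagonal
```

The first token can only attend to itself, so the first row of `A` is a one followed by zeros.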

Positional Encoding

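Self-attention on its own is permutation-equivariant: shuffling the input tokens only shuffles the output rows in the same way, so the mechanism has no notion of word order. To inject position information, the original Transformer² adds sinusoidal encodings to the token embeddings: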

\[ PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right) \]
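The formula translates almost directly into NumPy. The sketch below assumes an even dimension \(d\); the function name `sinusoidal_encoding` is chosen for this example.

```python
import numpy as np

def sinusoidal_encoding(n_positions, d):
    """Return an (n_positions, d) matrix of sinusoidal position encodings."""
    assert d % 2 == 0, "this sketch assumes an even dimension d"
    pos = np.arange(n_positions)[:, None]        # column of positions
    two_i = np.arange(0, d, 2)[None, :]          # even indices 2i = 0, 2, 4, ...
    angles = pos / (10000 ** (two_i / d))        # pos / 10000^(2i/d)
    pe = np.zeros((n_positions, d))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions use sin
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions use cos
    return pe

pe = sinusoidal_encoding(50, 16)
print(pe.shape)                                  # (50, 16)
print(pe[0])                                     # position 0: sin(0)=0, cos(0)=1 alternating
```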

Many recent models (e.g., LLaMA) use RoPE (Rotary Position Embedding), which rotates the Q and K vectors by position-dependent angles so that their dot product depends on the relative offset between tokens.
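The relative-position property of RoPE can be illustrated with a small sketch. It assumes one common convention, rotating consecutive pairs of dimensions \((x_{2i}, x_{2i+1})\) by the angle \(pos \cdot 10000^{-2i/d}\). Real implementations differ in how they pair dimensions, so treat the layout here as an assumption rather than the scheme used by any particular model.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) of a 1-D vector by pos * base**(-2i/d)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)    # one frequency per pair
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin       # 2-D rotation, first coordinate
    out[1::2] = x_even * sin + x_odd * cos       # 2-D rotation, second coordinate
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Same offset (query two positions after the key) at different absolute positions.
score_a = rope(q, 3) @ rope(k, 1)
score_b = rope(q, 10) @ rope(k, 8)
print(np.isclose(score_a, score_b))              # the scores agree
```

Because each pair is only rotated, the vector norms are unchanged and the score depends on the positions only through their difference.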

Complexity and Efficiency

Standard attention takes \(O(n^2 d)\) time and \(O(n^2)\) memory for the attention matrix, which is expensive for long sequences. Efficient variants:

Method	Complexity (in sequence length \(n\))	Idea
Softmax Attention (standard)	\(O(n^2)\)	Full attention matrix
Sparse Attention	\(O(n\sqrt{n})\)	Local + global attention
Linear Attention	\(O(n)\)	Kernel decomposition
FlashAttention	\(O(n^2)\) time, \(O(n)\) memory	IO-aware tiling on SRAM
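The "kernel decomposition" behind linear attention can be sketched as follows. Softmax similarity is replaced by \(\phi(q) \cdot \phi(k)\) for a positive feature map \(\phi\), which lets the product be regrouped as \(\phi(Q)\,(\phi(K)^\top V)\), so the \(n \times n\) matrix is never formed. The choice \(\phi(x) = \mathrm{elu}(x) + 1\) below is one option from the literature and is an assumption of this example. Note that the result is not equal to softmax attention, only to the quadratic form of the same kernel.

```python
import numpy as np

def phi(x):
    # elu(x) + 1: x + 1 for x > 0, exp(x) otherwise, so the result is always positive.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def kernel_attention_quadratic(Q, K, V):
    A = phi(Q) @ phi(K).T                        # explicit (n, n) matrix
    A = A / A.sum(axis=1, keepdims=True)         # row normalization
    return A @ V

def kernel_attention_linear(Q, K, V):
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                                # (d_k, d_v), independent of n
    z = Kf.sum(axis=0)                           # (d_k,), used for the normalizer
    return (Qf @ KV) / (Qf @ z)[:, None]         # no (n, n) matrix is built

rng = np.random.default_rng(0)
n, d_k, d_v = 6, 4, 3                            # illustrative sizes
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))

same = np.allclose(kernel_attention_quadratic(Q, K, V),
                   kernel_attention_linear(Q, K, V))
print(same)                                      # both orderings agree
```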

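FlashAttention³ is now widely used in practice: it computes exactly the same output as standard attention, but processes \(Q\), \(K\) and \(V\) in tiles that fit in on-chip SRAM. This avoids materializing the full \(n \times n\) matrix in HBM and minimizes HBM↔SRAM transfers on the GPU.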
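The reason tiling can give an exact result is the "online softmax" trick: a running row maximum and a running denominator are kept and rescaled as each new block of keys arrives. The NumPy sketch below only illustrates that arithmetic on the CPU, tiling over keys and values. It is not the actual GPU kernel, and the function names and block size are assumptions made for this example.

```python
import numpy as np

def attention_reference(Q, K, V):
    S = Q @ K.T / np.sqrt(K.shape[-1])
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V

def attention_tiled(Q, K, V, block=4):
    n, d_v = Q.shape[0], V.shape[1]
    scale = 1.0 / np.sqrt(K.shape[-1])
    m = np.full(n, -np.inf)                      # running row maximum
    l = np.zeros(n)                              # running softmax denominator
    acc = np.zeros((n, d_v))                     # running unnormalized output
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale                   # scores for this block only
        m_new = np.maximum(m, S.max(axis=1))
        correction = np.exp(m - m_new)           # rescales earlier blocks (0 on the first)
        P = np.exp(S - m_new[:, None])
        l = l * correction + P.sum(axis=1)
        acc = acc * correction[:, None] + P @ Vb
        m = m_new
    return acc / l[:, None]

rng = np.random.default_rng(0)
n, d_k, d_v = 10, 4, 3                           # n is deliberately not a multiple of block
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))

match = np.allclose(attention_reference(Q, K, V), attention_tiled(Q, K, V, block=4))
print(match)                                     # tiled and reference outputs agree
```

At no point does `attention_tiled` hold more than an `n × block` slice of the score matrix, which is the memory saving the table refers to.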

¹ Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate.
² Vaswani, A. et al. (2017). Attention Is All You Need.
³ Dao, T. et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.