12. Attention Mechanisms

The attention mechanism is one of the most influential ideas in deep learning. Introduced by Bahdanau et al. (2015) [1] for machine translation, it lets a neural network learn where to look in an input sequence instead of compressing the whole sequence into a single fixed-size context vector.

Intuitively, attention is what you do when reading this sentence: your eyes and brain don't process all words with equal weight. When interpreting "The cat sat on the mat because it was comfortable", the pronoun it has to be linked to its referent (here, most plausibly mat) rather than to unrelated words such as sat or on. Networks with attention can learn this kind of weighting from data instead of having it hand-coded.


Intuition: Query, Key and Value

The attention mechanism is formalized by three concepts: Query (Q), Key (K) and Value (V).

Think of a database search analogy:

  • Query — what you are looking for (e.g., vector of the word "it")
  • Key — the index of each available item (e.g., vector of each word in the sentence)
  • Value — the actual content returned on a match (e.g., semantic representation of each word)

Attention computes a dot product between the Query and each Key, normalizes with softmax, and uses the resulting weights to combine the Values:

\[ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \]

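The factor \(\sqrt{d_k}\) keeps the scores in a reasonable range when the dimension \(d_k\) is large. If the components of a query and a key are independent with zero mean and unit variance, their dot product has variance \(d_k\); dividing by \(\sqrt{d_k}\) brings the variance back to 1, so the softmax does not saturate and its gradients do not vanish.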
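The formula can be sketched for a single query in plain Python. This is a minimal illustration using only the standard library; the function names `softmax` and `attend` and the toy vectors are hypothetical, chosen for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    # Dot product of the query with each key, scaled by sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors, one output component at a time.
    dim_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values))
              for j in range(dim_v)]
    return weights, output

# Hypothetical toy data: one query, three keys, three values.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

weights, output = attend(query, keys, values)
print(weights)  # the key most aligned with the query gets the largest weight
print(output)
```

The first key points in the same direction as the query, so it receives the largest weight, and the output is pulled toward the first value vector.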


Interactive: Attention Over a Sentence

[Interactive demo: clicking a word in the example sentence highlights how strongly it attends to the other words. The weights are illustrative and pre-computed to demonstrate the concept.]

Scaled Dot-Product Attention — Step by Step

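Given a sequence of input vectors \(X \in \mathbb{R}^{n \times d}\) (\(n\) tokens, each of dimension \(d\)), the learned projection matrices \(W_Q, W_K \in \mathbb{R}^{d \times d_k}\) and \(W_V \in \mathbb{R}^{d \times d_v}\) produce: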

\[ Q = X W_Q, \quad K = X W_K, \quad V = X W_V \]

Step 1 — Similarity scores:

\[ S = \frac{Q K^\top}{\sqrt{d_k}} \in \mathbb{R}^{n \times n} \]

Step 2 — Softmax normalization:

\[ A = \text{softmax}(S), \quad A_{ij} = \frac{e^{S_{ij}}}{\sum_k e^{S_{ik}}} \]

Step 3 — Weighted output:

\[ \text{Out} = A \cdot V \]

The matrix \(A\) is the attention matrix: each row sums to 1, and row \(i\) gives how much token \(i\) attends to every token in the sequence, itself included.
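The three steps can be sketched end to end in plain Python, with matrices stored as lists of rows. This is a minimal sketch, not an optimized implementation; the helper names and the toy input are hypothetical, and identity projections are used only so the numbers are easy to follow.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(s):
    """Apply a numerically stable softmax to each row of s."""
    out = []
    for row in s:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(x, w_q, w_k, w_v):
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Step 1: similarity scores S = Q K^T / sqrt(d_k), shape n x n.
    s = [[val / math.sqrt(d_k) for val in row]
         for row in matmul(q, transpose(k))]
    # Step 2: row-wise softmax gives the attention matrix A.
    a = softmax_rows(s)
    # Step 3: weighted output Out = A V.
    return a, matmul(a, v)

# Hypothetical toy input: n = 3 tokens, d = d_k = 2.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

a, out = self_attention(x, identity, identity, identity)
for row in a:
    print([round(w, 3) for w in row], "sum =", round(sum(row), 3))
```

With identity projections, Q, K and V all equal X, so token 1 scores tokens 1 and 3 equally and gives token 2 a lower weight.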


Attention Weight Playground

[Interactive playground: computes attention weights for a 2-dimensional query Q = [q1, q2] against three 2-dimensional keys K1, K2, K3, with an adjustable key dimension d_k.]

Multi-Head Attention

A single attention head captures one type of relationship between tokens. Multi-Head Attention runs \(h\) heads in parallel, each with independent projections \(W_Q^i, W_K^i, W_V^i\):

\[ \text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O \]
\[ \text{head}_i = \text{Attention}(Q W_Q^i,\; K W_K^i,\; V W_V^i) \]

Heads can specialize: for example, one may track syntactic dependencies, another coreference, and another positional patterns.
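A minimal sketch of the multi-head computation in plain Python, for the self-attention case where Q, K and V are all the same input X. The function names, the random initialization and the sizes (n = 4, d = 8, h = 2) are assumptions made for this example; real implementations typically fuse the per-head projections into single large matrices.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def attention(q, k, v):
    """Scaled dot-product attention, one query row at a time."""
    d_k = len(k[0])
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k)
                  for k_row in k]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out

def multi_head_attention(x, heads, w_o):
    """heads is a list of (W_Q^i, W_K^i, W_V^i) tuples, one per head."""
    head_outputs = [attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
                    for w_q, w_k, w_v in heads]
    # Concatenate the head outputs along the feature axis, token by token.
    concat = [[value for head in head_outputs for value in head[i]]
              for i in range(len(x))]
    return matmul(concat, w_o)

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def random_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

n, d, h = 4, 8, 2
d_head = d // h  # each head works in a smaller subspace
x = random_matrix(n, d)
heads = [(random_matrix(d, d_head), random_matrix(d, d_head),
          random_matrix(d, d_head)) for _ in range(h)]
w_o = random_matrix(h * d_head, d)

out = multi_head_attention(x, heads, w_o)
print(len(out), len(out[0]))  # output has the same shape as the input
```

Choosing the per-head dimension as d / h keeps the total cost close to that of a single head of full dimension, which is the choice made in the original Transformer.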


Self-Attention vs. Cross-Attention

| Type | Q from | K, V from | Typical use |
|---|---|---|---|
| Self-Attention | same sequence | same sequence | Transformer encoder, BERT |
| Cross-Attention | target sequence | source sequence | Transformer decoder (encoder-decoder attention), text-conditioned diffusion models |
| Causal Self-Attention | same sequence (masked) | same sequence | GPT, autoregressive decoders |

In causal self-attention, a mask \(M\) is added to the scores \(S\) before the softmax. The allowed positions form a lower-triangular pattern, so token \(i\) cannot attend to future tokens \(j > i\):

\[ M_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases} \]
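A small sketch of how the mask acts on the scores, in plain Python. The all-zero score matrix is a hypothetical input chosen so the effect of the mask is easy to read; the function names are also assumptions for this example.

```python
import math

def causal_mask(n):
    """M[i][j] is 0 where j <= i and -inf where j > i."""
    return [[0.0 if j <= i else -math.inf for j in range(n)]
            for i in range(n)]

def masked_softmax(scores, mask):
    """Add the mask to the scores, then apply softmax row by row."""
    out = []
    for s_row, m_row in zip(scores, mask):
        row = [s + m for s, m in zip(s_row, m_row)]
        peak = max(row)  # finite, because the diagonal is never masked
        exps = [math.exp(v - peak) for v in row]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

n = 4
scores = [[0.0] * n for _ in range(n)]  # hypothetical uniform scores
weights = masked_softmax(scores, causal_mask(n))

for row in weights:
    print([round(w, 3) for w in row])
```

With uniform scores, row i spreads its weight evenly over positions 1 to i and gives exactly zero weight to every later position.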

Positional Encoding

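Without position information, self-attention is permutation-equivariant: shuffling the input tokens shuffles the outputs in the same way, and the score between any two tokens does not depend on where they sit in the sequence. To inject position information, the original Transformer adds sinusoidal encodings to the token embeddings: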

\[ PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right) \]
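The encoding can be computed directly from the formula. A minimal sketch in plain Python; the function name `positional_encoding` and the sizes are hypothetical.

```python
import math

def positional_encoding(n_positions, d):
    """Return an n_positions x d table of sinusoidal encodings."""
    pe = [[0.0] * d for _ in range(n_positions)]
    for pos in range(n_positions):
        # even_idx plays the role of 2i in the formula.
        for even_idx in range(0, d, 2):
            angle = pos / (10000 ** (even_idx / d))
            pe[pos][even_idx] = math.sin(angle)
            if even_idx + 1 < d:
                pe[pos][even_idx + 1] = math.cos(angle)
    return pe

pe = positional_encoding(n_positions=6, d=4)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

Low dimensions oscillate quickly with position and high dimensions slowly, so each position gets a distinct pattern across the d components.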

Many recent models (for example LLaMA and GPT-NeoX) use RoPE (Rotary Position Embedding), which rotates the Q and K vectors by position-dependent angles so that their dot product depends only on the relative offset between the two positions.
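The relative-position property of RoPE can be checked numerically. The sketch below rotates adjacent pairs of components, with a base of 10000 as in the sinusoidal formula; the pairing layout, the function names and the toy vectors are assumptions for illustration, and library implementations may arrange the pairs differently.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by an angle proportional to pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors of even dimension.
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Two position pairs with the same offset of 3.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
print(score_near, score_far)  # equal up to floating-point rounding
```

Because each pair is rotated by an angle linear in the position, the rotation applied to the query and the one applied to the key cancel except for their difference, which is why only the offset matters.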


Complexity and Efficiency

Standard attention takes \(O(n^2 d)\) time and \(O(n^2)\) memory for the attention matrix, which is expensive for long sequences. Efficient variants:

| Method | Complexity | Idea |
|---|---|---|
| Softmax Attention (standard) | \(O(n^2)\) | Full attention matrix |
| Sparse Attention | \(O(n\sqrt{n})\) | Local + global attention |
| Linear Attention | \(O(n)\) | Kernel decomposition |
| FlashAttention | \(O(n^2)\) time, \(O(n)\) memory | IO-aware tiling on SRAM |

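FlashAttention [3] is widely used in practice: it computes exact attention, giving the same result as the standard formula up to floating-point rounding, but reorders the computation into tiles to minimize HBM↔SRAM transfers on the GPU.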
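The key trick that makes tiling possible is an online softmax: keys and values are processed block by block while a running maximum, a running normalizer and a running weighted sum are kept up to date. The sketch below shows only that idea for a single query in plain Python. It is not the GPU kernel, it does no memory management, and the function names and toy data are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_reference(query, keys, values):
    """Standard attention for one query: all scores are held at once."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * dot(query, k) for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / total
            for j in range(len(values[0]))]

def attention_tiled(query, keys, values, block_size=2):
    """Same result, but only block_size scores are held at any time."""
    scale = 1.0 / math.sqrt(len(query))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        block_keys = keys[start:start + block_size]
        block_values = values[start:start + block_size]
        scores = [scale * dot(query, k) for k in block_keys]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        exps = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(exps)
        acc = [a * correction + sum(e * v[j] for e, v in zip(exps, block_values))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Hypothetical toy data: one query, five keys and values.
query = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

print(attention_reference(query, keys, values))
print(attention_tiled(query, keys, values, block_size=2))
```

Both functions print the same vector up to rounding, which illustrates why the tiled computation is exact and not an approximation.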