15. LLMs
Large Language Models (LLMs)
Large Language Models (LLMs) are Transformer neural networks trained at unprecedented scale (billions of parameters, trillions of tokens) with the goal of predicting the next token. This seemingly simple task, repeated over enough data, leads to emergent capabilities: reasoning, arithmetic, programming, and much more.
The Scale That Changes Everything
Pre-Training: Next Token Prediction
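LLMs are pre-trained with autoregressive language modeling: given text \(x_1, x_2, \ldots, x_T\), the model parameters \(\theta\) are chosen to minimize the cross-entropy loss:

\[\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_1, \ldots, x_{t-1})\]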
This is a self-supervised learning task: the labels are the text tokens themselves, so training data is extremely abundant (practically the entire internet).
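The objective can be checked by hand on a tiny example. The sketch below assumes a toy "model" whose predicted distributions are hard-coded dictionaries (the probabilities are made up for illustration) and sums the negative log-probability assigned to each token that actually came next. `next_token_loss` is a hypothetical helper name, not part of any library.

```python
import math

def next_token_loss(probs_per_step, targets):
    """Sum of negative log-probabilities of the observed next tokens.

    probs_per_step[t] maps token -> probability and stands in for the
    model's predicted distribution at step t (toy values, not a real model).
    targets[t] is the token that actually appears at step t.
    """
    return -sum(math.log(p[tok]) for p, tok in zip(probs_per_step, targets))

# Made-up predictions for the text "the cat sat".
toy_probs = [
    {"the": 0.5, "cat": 0.25, "sat": 0.25},  # distribution for x_1
    {"the": 0.1, "cat": 0.8, "sat": 0.1},    # distribution for x_2 given x_1
    {"the": 0.1, "cat": 0.1, "sat": 0.8},    # distribution for x_3 given x_1, x_2
]
tokens = ["the", "cat", "sat"]

loss = next_token_loss(toy_probs, tokens)
print(f"loss = {loss:.4f}")
```

The loss falls as the model puts more probability on the tokens that really occur, and reaches zero only if every observed token is predicted with probability 1.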
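The model learns a probability distribution over a vocabulary of roughly 30k to 100k tokens. At inference, it samples iteratively, appending each sampled token to the context before predicting the next:

\[x_{t+1} \sim P_\theta(\cdot \mid x_1, \ldots, x_t)\]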
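A minimal sketch of this sampling loop, assuming a hand-written lookup table in place of a real model. The names `toy_model`, `generate`, and the `<eos>` stop token are hypothetical and chosen only for illustration; a real LLM would condition on the whole context rather than just the last token.

```python
import random

def sample_next(dist, rng):
    """Draw one token from a dict mapping token -> probability."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

def generate(next_token_dist, prompt, max_new_tokens, rng, eos="<eos>"):
    """Autoregressive loop: sample a token, append it, repeat."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        tok = sample_next(next_token_dist(out), rng)
        if tok == eos:
            break
        out.append(tok)
    return out

def toy_model(context):
    """Stand-in for an LLM: looks only at the last token (an assumption)."""
    table = {
        "the": {"cat": 0.6, "dog": 0.4},
        "cat": {"sat": 0.7, "<eos>": 0.3},
        "dog": {"sat": 0.7, "<eos>": 0.3},
        "sat": {"<eos>": 1.0},
    }
    return table[context[-1]]

text = generate(toy_model, ["the"], max_new_tokens=10, rng=random.Random(0))
print(" ".join(text))
```

Passing a seeded `random.Random` keeps the run reproducible; production systems usually add temperature or top-p controls on top of this basic loop.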
Tokenization
Before training, text is converted to tokens by a tokenizer. The modern standard is Byte Pair Encoding (BPE):
- Starts with individual characters
- Repeatedly merges the most frequent adjacent pair of symbols into a new symbol
- Results in a subword vocabulary
"tokenization" → ["token", "iza", "tion"] (BPE)
"ChatGPT" → ["Chat", "G", "PT"]
"hello world" → ["hello", " world"]
This allows compact vocabularies that handle rare words and multiple languages without a separate tokenizer per language.
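The merge procedure described above can be sketched in a few lines. This is a simplified illustration, assuming whitespace-split words and a tiny made-up corpus; production tokenizers typically work on bytes, add pre-tokenization rules, and handle ties differently. The function names are hypothetical.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Start from characters and apply `num_merges` greedy merges."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair (first on ties)
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words

merges, words = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(words)
```

After two merges on this corpus the frequent stem has become a single symbol, while the rarer suffixes are still spelled out character by character.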
Emergent Capabilities
Upon crossing certain scale thresholds, LLMs exhibit capabilities that are absent in smaller models; they appear to emerge non-linearly:
English: "cat" → French: "chat"
English: "dog" → French: "chien"
English: "bird" → French: "oiseau"
A: First, x=7-3=4.
Then 2x=2×4=8.
The RLHF Pipeline
Base models predict text, but not necessarily helpful or safe text. RLHF (Reinforcement Learning from Human Feedback) [3] adapts the model to human preferences:
flowchart LR
    A[Pre-trained\nBase Model] --> B[SFT\nSupervised Fine-Tuning]
    B --> C[Reward Model\nTrained with human rankings]
    C --> D[PPO / DPO\nRL Optimization]
    D --> E[Aligned Model\nChatGPT / Claude]
    style A fill:#21262d,color:#8b949e
    style E fill:#1f3244,color:#58a6ff

- SFT: Supervised fine-tuning on high-quality demonstrations
- Reward Model: neural network that learns to rank responses according to human preference
- PPO/DPO: optimization against human preferences; PPO uses the Reward Model as its RL reward signal
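To make the Reward Model step concrete, here is a sketch of a pairwise ranking loss. It assumes a Bradley-Terry style objective, which is a common choice for this step but is not stated in the text above; the scalar rewards are made-up numbers standing in for the outputs of a trained network, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pairwise_ranking_loss(reward_chosen, reward_rejected):
    """Negative log-probability that the human-preferred response
    scores higher than the rejected one (assumed Bradley-Terry form)."""
    return -math.log(sigmoid(reward_chosen - reward_rejected))

# Made-up reward scores for one (chosen, rejected) pair.
agrees = pairwise_ranking_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees = pairwise_ranking_loss(reward_chosen=0.0, reward_rejected=2.0)
print(f"agrees with humans: {agrees:.3f}, disagrees: {disagrees:.3f}")
```

The loss is small when the reward model ranks the pair the same way the human annotator did and grows when it ranks them the other way round.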
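DPO (Direct Preference Optimization) [5] simplifies this: no explicit RL loop or separate reward model is needed, and the policy is trained directly on preference pairs:

\[\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]\]

where \(y_w\) is the preferred response and \(y_l\) the rejected one for prompt \(x\), \(\pi_\theta\) is the model being trained, \(\pi_{\text{ref}}\) is the frozen reference model (typically the SFT model), \(\sigma\) is the logistic sigmoid, and \(\beta\) controls how far the policy may drift from the reference.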
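A sketch of the DPO loss for a single preference pair, assuming the sequence log-probabilities under the trained policy and a frozen reference model are already available as plain numbers. The values below are made up, `beta=0.1` is an illustrative default rather than a recommended setting, and `dpo_loss` is a hypothetical helper name.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred, rejected) pair.

    logp_*     : log-probability of the response under the policy being trained
    ref_logp_* : log-probability under the frozen reference model
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # log(1 + exp(-margin)) equals -log(sigmoid(margin))
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: no preference has been learned yet.
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy has shifted probability toward the preferred response.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
print(f"neutral = {neutral:.4f}, improved = {improved:.4f}")
```

Only log-probability ratios against the reference appear in the loss, which is why no separate reward network or sampling loop is required.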
Mixture of Experts (MoE)
To scale beyond dense models, modern LLMs use Mixture of Experts [4]: each FFN layer is replaced by \(E\) independent "experts", with a router that activates only \(k\) of them per token:

\[y = \sum_{i \in \text{TopK}(G(x),\, k)} G(x)_i \, \text{FFN}_i(x)\]

where \(G(x) = \text{Softmax}(W_g x)\) are the router weights and \(\text{FFN}_i\) is the \(i\)-th expert.
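Advantage: a model with \(E\) experts has roughly \(E \times\) the FFN parameters of its dense counterpart, but each token passes through only \(k\) of them (a fraction \(k/E\) of the expert parameters). Per-token compute therefore stays close to that of a much smaller dense model while total capacity grows.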
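The routing step can be sketched with plain Python lists. This is a toy illustration under stated assumptions: the "experts" are simple scaling functions rather than FFNs, the router matrix `W_g` holds made-up values, and the selected gate weights are used as-is (some implementations renormalize them over the top-k experts instead).

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def moe_layer(x, W_g, experts, k):
    """Route x to the k experts with the largest gate values."""
    gates = softmax(matvec(W_g, x))  # G(x) = Softmax(W_g x)
    top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    y = [0.0] * len(x)
    for i in top:  # only k experts are evaluated
        out = experts[i](x)
        y = [acc + gates[i] * o for acc, o in zip(y, out)]
    return y, top

# Four toy experts that just scale their input (stand-ins for FFNs).
experts = [lambda v, s=s: [s * a for a in v] for s in (1.0, 2.0, 3.0, 4.0)]
W_g = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]

y, active = moe_layer([1.0, 2.0], W_g, experts, k=2)
print("active experts:", active)
print("output:", y)
```

Only the experts listed in `active` are ever called, which is where the compute saving comes from: the other experts' parameters exist but are skipped for this token.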
| Model | Total Parameters | Active Parameters | Experts |
|---|---|---|---|
| Mixtral 8×7B | 46.7B | 12.9B (28%) | 8, top-2 |
| DeepSeek-V3 | 671B | 37B (5.5%) | 256, top-8 |
| GPT-4 (speculated) | ~1.8T | ~110B | ~16 experts |
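The percentages in the table follow directly from the parameter counts, as the short check below shows. Note that the active share for Mixtral (about 28%) is higher than the naive \(k/E = 2/8 = 25\%\); a likely reason is that attention and embedding parameters are not split into experts and are used for every token.

```python
# (total, active) parameter counts in billions, taken from the table above.
models = {
    "Mixtral 8x7B": (46.7, 12.9),
    "DeepSeek-V3": (671.0, 37.0),
}

active_share = {
    name: round(100 * active / total, 1)
    for name, (total, active) in models.items()
}

for name, share in active_share.items():
    print(f"{name}: {share}% of parameters active per token")
```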
Advanced Prompting
LLM behavior is strongly influenced by the prompt:
| Technique | Description | When to use |
|---|---|---|
| Zero-shot | Direct instruction without examples | Simple tasks, large models |
| Few-shot | 3-5 input→output examples | Specific format, new tasks |
| Chain-of-Thought | Ask "let's think step by step" | Math, logic, reasoning |
| System Prompt | Defines model role/persona | Specialized assistants |
| RAG | Retrieves documents before generating | Up-to-date knowledge, factuality |
| Tool Use | Model calls external functions/APIs | Calculation, search, actions in the world |
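Two of the techniques in the table, few-shot and chain-of-thought, amount to building the prompt string differently. The sketch below assumes a simple text format; the helper names, the `->` separator, and the sample question are illustrative choices, not a standard API.

```python
def few_shot_prompt(examples, query):
    """Prepend solved examples so the model can infer the task format."""
    lines = [f'English: "{src}" -> French: "{tgt}"' for src, tgt in examples]
    lines.append(f'English: "{query}" -> French:')
    return "\n".join(lines)

def chain_of_thought_prompt(question):
    """Ask the model to reason before answering."""
    return f"Q: {question}\nA: Let's think step by step."

prompt = few_shot_prompt([("cat", "chat"), ("dog", "chien")], "bird")
print(prompt)

# The question below is a made-up example for illustration.
cot = chain_of_thought_prompt("If x + 3 = 7, what is 2x?")
print(cot)
```

The few-shot prompt ends on an unfinished line, so the most likely continuation is the missing translation; the chain-of-thought prompt ends on a cue that invites intermediate steps before the final answer.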
Challenges and Limitations
LLMs fabricate facts with confidence. RAG and grounding partially mitigate this.
Training data has a cutoff date. RAG, browsing, and tool use compensate.
Inference is expensive. Quantization, distillation, and caching reduce costs.
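As one illustration of how quantization cuts cost, the sketch below maps floating-point weights to 8-bit integers with a single scale factor. It assumes symmetric per-tensor quantization on a made-up weight list; real schemes are usually per-channel or per-group and considerably more elaborate. The function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric quantization: one float scale plus integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.0, 0.9]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print("int8 values:", q)
print("max error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

Each weight now needs one byte instead of four (for float32), at the price of a rounding error bounded by half the scale.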
Model Landscape (2025)
| Family | Organization | Open-source? | Specialty |
|---|---|---|---|
| GPT-4o / o3 | OpenAI | No | General SOTA, reasoning |
| Claude 3.7 | Anthropic | No | Long context window, safety |
| Gemini 2.5 | Google | No | Multimodal, Google integration |
| LLaMA 3.3 | Meta | Yes | Base for fine-tuning |
| Mistral / Mixtral | Mistral AI | Yes | Efficiency, MoE |
| DeepSeek-V3/R1 | DeepSeek | Yes | Reasoning, code |
| Qwen 2.5 | Alibaba | Yes | Multilingual |
| Gemma 3 | Google | Yes | Small and efficient |
1. Brown, T. et al. (2020). Language Models are Few-Shot Learners (GPT-3).
2. Wei, J. et al. (2022). Emergent Abilities of Large Language Models.
3. Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback (InstructGPT).
4. Shazeer, N. et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.
5. Rafailov, R. et al. (2023). Direct Preference Optimization.