16. LLMs

Large Language Models (LLMs)

Large Language Models (LLMs) are Transformer neural networks trained at very large scale (billions of parameters, trillions of tokens) with the goal of predicting the next token. This seemingly simple task, repeated over enough data, gives rise to capabilities the model was never explicitly trained for: reasoning, arithmetic, programming, and much more.

The Scale That Changes Everything

Pre-Training: Next Token Prediction

LLMs are pre-trained with autoregressive language modeling: given a token sequence \(x_1, x_2, \ldots, x_T\), training minimizes the cross-entropy (negative log-likelihood) loss:

\[ \mathcal{L} = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_1, \ldots, x_{t-1}) \]

This is a self-supervised learning task: the labels are the text tokens themselves, so no manual annotation is needed and training data is extremely abundant (a large fraction of the public web).
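To make the loss concrete, here is a minimal sketch in plain Python. The function name `sequence_nll` and the toy probabilities are illustrative assumptions; in a real model these probabilities would come from a softmax over the vocabulary at each position.

```python
import math

def sequence_nll(token_probs):
    """Sum of -log p(x_t | x_1..x_{t-1}) over one sequence.

    token_probs[t] is the probability the model assigned to the token
    that actually appeared at position t (hypothetical values here).
    """
    return -sum(math.log(p) for p in token_probs)

# Toy example: probabilities assigned to the 4 observed tokens.
probs = [0.5, 0.25, 0.8, 0.1]
loss = sequence_nll(probs)
mean_loss = loss / len(probs)  # per-token loss, the value usually reported
```

The loss only looks at the probability of the observed token, so raising that probability at any position lowers the total.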

The model learns a probability distribution over a fixed vocabulary, typically about 30k to 250k tokens. At inference, it generates text one token at a time, sampling the next token and appending it to the context:

\[ x_{t+1} \sim p_\theta(\cdot \mid x_1, \ldots, x_t) \]
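The sampling loop can be sketched as follows. `TOY_MODEL` is a hypothetical lookup table standing in for \(p_\theta\): it conditions only on the last token, whereas a real LLM conditions on the whole prefix. The `<s>` and `</s>` markers are assumed start and end tokens.

```python
import random

# Hypothetical next-token table standing in for p_theta.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sleeps": 0.7, "</s>": 0.3},
    "dog": {"sleeps": 0.7, "</s>": 0.3},
    "sleeps": {"</s>": 1.0},
}

def generate(model, max_tokens=10, seed=0):
    """Sample tokens one at a time until the end token or max_tokens."""
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(max_tokens):
        dist = model[tokens[-1]]  # distribution over the next token
        next_token = rng.choices(list(dist), weights=list(dist.values()))[0]
        if next_token == "</s>":
            break
        tokens.append(next_token)  # the sample becomes part of the context
    return tokens[1:]

sample = generate(TOY_MODEL)
```

Each sampled token is fed back as input for the next step, which is what makes generation autoregressive.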

Tokenization

Before training, text is converted to tokens by a tokenizer. The modern standard is Byte Pair Encoding (BPE):

Starts with individual characters (or raw bytes)
Iterates: merges the most frequent adjacent pair of symbols into a new symbol
Stops at a target vocabulary size, which results in a subword vocabulary

"tokenization" → ["token", "iza", "tion"]   (BPE)
"ChatGPT"      → ["Chat", "G", "PT"]
"hello world"  → ["hello", " world"]

This allows compact vocabularies that handle rare words and multiple languages without a separate tokenizer per language.
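The merge procedure described above can be sketched in a few lines. This is a toy character-level version under simplifying assumptions (no byte-level handling, no word-boundary markers, ties broken by first occurrence); `train_bpe` is an illustrative name, not a library function.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merges from a list of words (toy, character-level)."""
    # Each word is a tuple of symbols, stored with its frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first-seen pair wins
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = train_bpe(["low", "low", "lower", "lowest"], num_merges=2)
```

After two merges the frequent stem `low` has become a single symbol, while the rarer suffixes remain split into characters.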

Emergent Capabilities

Upon crossing certain scale thresholds, LLMs exhibit capabilities that are absent or near chance level in smaller models; rather than improving smoothly, they appear to emerge non-linearly:

🧮 Few-Shot Learning

Learns tasks from 3-5 in-context examples, without weight updates.

 Translate to French:
 English: "cat" → French: "chat"
 English: "dog" → French: "chien"
 English: "bird" → French: "oiseau" 

🔗 Chain-of-Thought

Generates step-by-step reasoning before answering, improving accuracy in math and logic.

 Q: If x+3=7, what is 2x?
 A: First, x=7-3=4.
 Then 2x=2×4=8. 

💻 Code Generation

Writes, explains, and debugs code in dozens of programming languages.

🌍 Multilingual

Translates, reasons, and generates in multiple languages without language-specific training.

The RLHF Pipeline

Base models predict text, but not necessarily helpful or safe text. RLHF (Reinforcement Learning from Human Feedback)³ adapts the model to human preferences:

flowchart LR
    A[Pre-trained\nBase Model] --> B[SFT\nSupervised Fine-Tuning]
    B --> C[Reward Model\nTrained with human rankings]
    C --> D[PPO / DPO\nRL Optimization]
    D --> E[Aligned Model\nChatGPT / Claude]

    style A fill:#21262d,color:#8b949e
    style E fill:#1f3244,color:#58a6ff

SFT: Supervised fine-tuning on high-quality demonstrations
Reward Model: neural network that learns to rank responses according to human preference
PPO: RL optimization that uses the Reward Model's score as the reward signal (DPO, described next, is an alternative that needs no separate Reward Model)
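For the Reward Model step, one common training objective is a pairwise ranking loss of the kind used in InstructGPT: it is small when the preferred response receives the higher score. A minimal sketch, with hypothetical scalar scores standing in for the output of a neural network:

```python
import math

def pairwise_ranking_loss(score_preferred, score_rejected):
    """-log sigmoid(r_w - r_l): small when the preferred response scores higher."""
    margin = score_preferred - score_rejected
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# Hypothetical reward scores for one human-ranked pair of responses.
good_ranking = pairwise_ranking_loss(2.0, -1.0)  # model agrees with the human
bad_ranking = pairwise_ranking_loss(-1.0, 2.0)   # model disagrees
```

Minimizing this loss over many ranked pairs teaches the network to assign scores that reproduce the human ordering.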

DPO (Direct Preference Optimization) simplifies this: it needs neither a separate reward model nor an explicit RL loop, and trains the policy directly on preference pairs:

\[ \mathcal{L}_{\text{DPO}} = -\mathbb{E}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right] \]

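where \(y_w\) is the preferred response and \(y_l\) the rejected one for prompt \(x\), \(\pi_\theta\) is the model being trained, \(\pi_{\text{ref}}\) is a frozen reference model (typically the SFT model), \(\sigma\) is the logistic sigmoid, and \(\beta\) sets how strongly deviations from the reference are penalized.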
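The loss for a single preference pair can be computed directly from four sequence log-probabilities. The sketch below assumes those log-probabilities are already available; the numbers and the default `beta=0.1` are illustrative choices, not values from the source.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred, rejected) pair.

    logp_* are log pi_theta(y|x) under the model being trained;
    ref_logp_* are log pi_ref(y|x) under the frozen reference.
    """
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-logits))  # equals -log(sigmoid(logits))

# Policy identical to the reference: both log-ratios are 0, loss is log(2).
at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy raised the preferred response and lowered the rejected one.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The loss falls only when the policy shifts probability toward \(y_w\) and away from \(y_l\) relative to the reference, which is how DPO works without a separate reward model.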

Mixture of Experts (MoE)

To scale beyond dense models, many modern LLMs use Mixture of Experts⁴: each FFN layer is replaced by \(E\) independent "experts", with a router that activates only \(k\) of them per token:

\[ \text{MoE}(x) = \sum_{i \in \text{Top-}k(G(x))} G(x)_i \cdot E_i(x) \]

where \(G(x) = \text{Softmax}(W_g x)\) are the router weights.

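Advantage: a model with \(E\) experts has roughly \(E \times\) the FFN parameters of its dense counterpart, but each token activates only \(k\) of the \(E\) experts (a fraction \(k/E\) of the expert parameters), so per-token compute grows with \(k\) rather than \(E\): more capacity at roughly the compute of a much smaller dense model.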
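A minimal sketch of top-\(k\) routing for a single token, following the formula above. The scalar "experts" and the router logits are hypothetical stand-ins (a real expert is an FFN and the logits come from \(W_g x\)). Some implementations also renormalize the selected gates so they sum to 1; this sketch does not.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits, experts, k):
    """Weighted sum of the top-k experts' outputs for one token."""
    gates = softmax(router_logits)
    top_k = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    # Only the selected experts are evaluated; the others are skipped.
    output = sum(gates[i] * experts[i](x) for i in top_k)
    return output, top_k

# Four toy scalar experts (hypothetical).
experts = [lambda x: x, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
output, chosen = moe_forward(3.0, [0.0, 2.0, -1.0, 1.0], experts, k=2)
```

With \(k = 2\), only two of the four expert functions are called, which is where the compute saving comes from.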

Model	Total Parameters	Active Parameters	Experts
Mixtral 8×7B	46.7B	12.9B (28%)	8, top-2
DeepSeek-V3	671B	37B (5.5%)	256, top-8
GPT-4 (speculated)	~1.8T	~110B	~16 experts
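The active-parameter percentages in the table follow from simple division. The short check below uses only the two rows with published figures; the dictionary layout is an illustrative choice.

```python
# (total, active) parameter counts in billions, taken from the table above.
models = {
    "Mixtral 8x7B": (46.7, 12.9),
    "DeepSeek-V3": (671.0, 37.0),
}

# Share of parameters used per token, in percent.
active_share = {
    name: round(100 * active / total, 1)
    for name, (total, active) in models.items()
}
```

Note that Mixtral's total is well below 8 × 7B = 56B because only the FFN layers are replicated per expert; attention layers are shared.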

Advanced Prompting

LLM behavior is strongly influenced by the prompt:

Technique	Description	When to use
Zero-shot	Direct instruction without examples	Simple tasks, large models
Few-shot	3-5 input→output examples	Specific format, new tasks
Chain-of-Thought	Ask "let's think step by step"	Math, logic, reasoning
System Prompt	Defines model role/persona	Specialized assistants
RAG	Retrieves documents before generating	Up-to-date knowledge, factuality
Tool Use	Model calls external functions/APIs	Calculation, search, actions in the world
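The first three techniques in the table differ only in how the prompt string is assembled. A small sketch, where the function names and the `Input:`/`Output:` layout are illustrative assumptions rather than a required format:

```python
def zero_shot(instruction, query):
    """Direct instruction, no examples."""
    return f"{instruction}\n{query}"

def few_shot(instruction, examples, query):
    """examples is a list of (input, output) pairs shown before the query."""
    shots = "\n".join(f"Input: {i}\nOutput: {o}" for i, o in examples)
    return f"{instruction}\n{shots}\nInput: {query}\nOutput:"

def chain_of_thought(question):
    """Append a cue that invites step-by-step reasoning."""
    return f"Q: {question}\nA: Let's think step by step."

prompt = few_shot(
    "Translate to French.",
    [("cat", "chat"), ("dog", "chien")],
    "bird",
)
```

The few-shot prompt ends with an open `Output:` so that the model's continuation is the answer for the final input.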

Challenges and Limitations

Hallucination
LLMs can state fabricated facts fluently and with apparent confidence. RAG and grounding reduce this but do not eliminate it.

Knowledge Cutoff
Training data has a cutoff date. RAG, browsing, and tool use compensate.
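As a rough illustration of how RAG supplies information the model may lack, the sketch below retrieves the best-matching document and places it in the prompt. The word-overlap scoring is a deliberate simplification (real systems typically use embedding similarity), and all names here are hypothetical.

```python
def retrieve(query, documents, top_n=1):
    """Rank documents by how many query words they share (toy scoring)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_n]

def build_rag_prompt(query, documents):
    """Put the retrieved text in front of the question."""
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Mixtral 8x7B routes each token to 2 of 8 experts.",
    "BPE merges the most frequent adjacent pairs.",
]
rag_prompt = build_rag_prompt("How many experts does Mixtral use?", docs)
```

The model then answers from the supplied context rather than from its parameters alone, so the document store can be updated without retraining.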

Computational Cost
Inference is expensive. Quantization, distillation, and caching reduce costs.
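To illustrate quantization, the sketch below maps floating-point weights to 8-bit integers with a single shared scale and then reconstructs them. It assumes the simplest symmetric, per-tensor scheme; production methods are more elaborate, and the weight values are made up.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in q_weights]

weights = [0.42, -1.27, 0.05, 0.9]  # hypothetical weight values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now needs one byte instead of four (for 32-bit floats), at the price of a rounding error of at most half the scale per weight.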

Model Landscape (2025)

Family	Organization	Open weights?	Specialty
GPT-4o / o3	OpenAI	❌	General SOTA, reasoning
Claude 3.7	Anthropic	❌	Long context window, safety
Gemini 2.5	Google	❌	Multimodal, Google integration
LLaMA 3.3	Meta	✅	Base for fine-tuning
Mistral / Mixtral	Mistral AI	✅	Efficiency, MoE
DeepSeek-V3/R1	DeepSeek	✅	Reasoning, code
Qwen 2.5	Alibaba	✅	Multilingual
Gemma 3	Google	✅	Small and efficient

1. Brown, T. et al. (2020). Language Models are Few-Shot Learners (GPT-3).
2. Wei, J. et al. (2022). Emergent Abilities of Large Language Models.
3. Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback (InstructGPT).
4. Shazeer, N. et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated MoE.
5. Rafailov, R. et al. (2023). Direct Preference Optimization.