Large Language Models (LLMs)

Large Language Models (LLMs) are Transformer neural networks trained at unprecedented scale (billions of parameters, trillions of tokens) with the goal of predicting the next token. This seemingly simple task, repeated over enough data, leads to emergent capabilities: reasoning, arithmetic, programming, and much more.


The Scale That Changes Everything


Pre-Training: Next Token Prediction

LLMs are pre-trained with autoregressive language modeling: given text \(x_1, x_2, \ldots, x_T\), the cross-entropy loss is minimized:

\[ \mathcal{L} = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_1, \ldots, x_{t-1}) \]

This is a self-supervised learning task: the labels are the text tokens themselves, so data is extremely abundant (practically the entire internet).
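
As a minimal sketch of the loss above, assume the model exposes raw scores (logits) over a tiny vocabulary at each position. The helper names `log_softmax` and `next_token_loss` and the example scores are illustrative assumptions, not taken from any library.

```python
import math

def log_softmax(logits):
    """Turn raw scores into log-probabilities (stable log-sum-exp form)."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_z for v in logits]

def next_token_loss(logits_per_step, targets):
    """Sum over positions of -log p(x_t | x_1..x_{t-1}).

    logits_per_step[t] holds the scores produced from the prefix before
    position t; targets[t] is the id of the token that actually came next.
    """
    total = 0.0
    for logits, target in zip(logits_per_step, targets):
        total -= log_softmax(logits)[target]
    return total

# Hypothetical scores for a 3-token sequence over a 4-token vocabulary.
logits_per_step = [
    [2.0, 0.5, 0.1, -1.0],
    [0.0, 3.0, 0.2, 0.1],
    [0.3, 0.3, 2.5, 0.0],
]
targets = [0, 1, 2]
print(f"loss = {next_token_loss(logits_per_step, targets):.4f}")
```

A model that spreads probability uniformly over a vocabulary of size V pays log V per token, which is a useful baseline when reading loss curves.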

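The model learns a probability distribution over a vocabulary of roughly 30k to 100k tokens (larger in some recent models). At inference time, it generates text by sampling one token at a time and appending each sample to the context: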

\[ x_{t+1} \sim p_\theta(\cdot \mid x_1, \ldots, x_t) \]
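
The sampling loop can be sketched as follows. The "model" here is a hypothetical hand-written lookup table (`toy_next_token_probs`) standing in for \(p_\theta\). Only the loop structure (predict, sample, append, repeat) is the point.

```python
import random

VOCAB = ["the", "cat", "sat", "<eos>"]

def toy_next_token_probs(context):
    """Hypothetical stand-in for p_theta(. | x_1..x_t), keyed on the last token."""
    table = {
        "the": [0.0, 1.0, 0.0, 0.0],
        "cat": [0.0, 0.0, 1.0, 0.0],
        "sat": [0.0, 0.0, 0.0, 1.0],
    }
    last = context[-1] if context else None
    # Unknown or empty context: fall back to a uniform distribution.
    return table.get(last, [0.25, 0.25, 0.25, 0.25])

def generate(prompt, max_new_tokens=10, seed=0):
    """Autoregressive sampling: draw x_{t+1}, append it, and condition on it next."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(tokens)
        next_token = rng.choices(VOCAB, weights=probs, k=1)[0]
        tokens.append(next_token)
        if next_token == "<eos>":
            break
    return tokens

print(generate(["the"]))
```

Because every sampled token is fed back as input, generation cost grows with output length. This is one reason inference is expensive.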

Tokenization

Before training, text is converted to tokens by a tokenizer. The modern standard is Byte Pair Encoding (BPE):

  1. Start with individual characters (or bytes)
  2. Iterate: merge the most frequent adjacent pair into a new symbol
  3. The result is a subword vocabulary

"tokenization" → ["token", "iza", "tion"]   (BPE)
"ChatGPT"      → ["Chat", "G", "PT"]
"hello world"  → ["hello", " world"]

This allows compact vocabularies that handle rare words and multiple languages without a separate tokenizer per language.
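
The merge procedure can be sketched on a toy corpus. This is a simplified character-level illustration under assumed word counts, not the implementation of any production tokenizer. The function names and the corpus are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus_counts, num_merges):
    """Start from characters and apply `num_merges` greedy merges."""
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

# Hypothetical word counts for illustration.
merges, segmented = learn_bpe({"abab": 3, "cab": 1}, num_merges=2)
print(merges)
print(segmented)
```

The learned merge list is applied in the same order at tokenization time, so frequent strings collapse into single tokens while rare ones stay split into smaller pieces.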


Emergent Capabilities

Upon crossing certain scale thresholds, LLMs exhibit capabilities that do not exist in smaller models; they appear to emerge non-linearly:

Few-Shot Learning
Learns tasks from 3-5 in-context examples, without weight updates.
Translate to French:
English: "cat" → French: "chat"
English: "dog" → French: "chien"
English: "bird" → French: "oiseau"
Chain-of-Thought
Generates step-by-step reasoning before answering, improving accuracy in math and logic.
Q: If x+3=7, what is 2x?
A: First, x=7-3=4.
Then 2x=2×4=8.
Code Generation
Writes, explains, and debugs code in dozens of programming languages.
Multilingual
Translates, reasons, and generates in multiple languages without language-specific training.

The RLHF Pipeline

Base models predict text, but not necessarily helpful or safe text. RLHF (Reinforcement Learning from Human Feedback) [3] adapts the model to human preferences:

flowchart LR
    A[Pre-trained\nBase Model] --> B[SFT\nSupervised Fine-Tuning]
    B --> C[Reward Model\nTrained with human rankings]
    C --> D[PPO / DPO\nRL Optimization]
    D --> E[Aligned Model\nChatGPT / Claude]

    style A fill:#21262d,color:#8b949e
    style E fill:#1f3244,color:#58a6ff
  1. SFT: supervised fine-tuning on high-quality demonstrations
  2. Reward Model: a neural network that learns to score responses according to human preference rankings
  3. PPO/DPO: optimization against human preferences; PPO uses the Reward Model as its RL signal, while DPO trains on preference pairs directly

DPO (Direct Preference Optimization) simplifies this pipeline: it needs no explicit reward model or RL loop and instead trains directly on preference pairs by minimizing:

\[ \mathcal{L}_{\text{DPO}} = -\mathbb{E}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right] \]

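where \(y_w\) is the preferred response, \(y_l\) the rejected one, \(\pi_\theta\) the policy being trained, \(\pi_{\text{ref}}\) a frozen reference policy (typically the SFT model), \(\sigma\) the logistic sigmoid, and \(\beta\) a coefficient that controls how strongly the policy is kept close to the reference.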
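
For a single preference pair, the DPO loss reduces to a few lines of arithmetic. The sketch below assumes the four sequence log-probabilities have already been computed. The function name, argument names, and the default `beta=0.1` are illustrative choices, not values from the source.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred, rejected) pair.

    logp_w, logp_l:         log pi_theta(y_w | x), log pi_theta(y_l | x)
    ref_logp_w, ref_logp_l: the same quantities under the frozen reference policy
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), written in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: the margin is 0 and the loss is log(2).
print(dpo_loss(-1.0, -2.0, -1.0, -2.0))

# Policy has shifted probability toward the preferred response: lower loss.
print(dpo_loss(-0.5, -3.0, -1.0, -2.0))
```

In training, this value is averaged over a batch of pairs and differentiated with respect to the policy parameters only. The reference log-probabilities are constants.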


Mixture of Experts (MoE)

To scale beyond dense models, many modern LLMs use Mixture of Experts [4]: each FFN layer is replaced by \(E\) independent "experts", with a router that activates only \(k\) of them per token:

\[ \text{MoE}(x) = \sum_{i \in \text{Top-}k(G(x))} G(x)_i \cdot E_i(x) \]

where \(G(x) = \text{Softmax}(W_g x)\) are the router weights.
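
A minimal sketch of the routing formula for one token follows, in plain Python. The experts here are hypothetical scaling functions rather than real FFNs, and the gate values are used exactly as in the equation above, with no renormalization over the selected experts (some implementations renormalize).

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_weights, experts, k):
    """Compute the sum over the top-k experts of G(x)_i * E_i(x)."""
    # Router logits W_g x, one per expert.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    gates = softmax(logits)
    # Indices of the k largest gate values.
    top_k = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    output = [0.0] * len(x)
    for i in top_k:
        expert_out = experts[i](x)  # only the selected experts are evaluated
        output = [o + gates[i] * y for o, y in zip(output, expert_out)]
    return output, top_k

def make_scaling_expert(factor):
    """Hypothetical expert that scales its input (stands in for an FFN)."""
    return lambda x: [factor * v for v in x]

experts = [make_scaling_expert(f) for f in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
output, chosen = moe_forward([1.0, 0.0], router_weights, experts, k=2)
print(chosen, output)
```

The experts that are not selected are never called, which is where the compute saving comes from.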

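Advantage: a model with \(E\) experts has roughly \(E \times\) more FFN parameters than its dense counterpart, but each token activates only \(k\) of the \(E\) experts (a fraction \(k/E\) of the expert parameters). Capacity grows while per-token compute stays close to that of a much smaller dense model.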
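
The active share can be checked against the published figures quoted in the model comparison of this section. The snippet below is plain arithmetic on those numbers; the dictionary layout and function names are illustrative.

```python
# Parameter counts in billions and routing configuration, as quoted in this section.
models = {
    "Mixtral 8x7B": {"total_b": 46.7, "active_b": 12.9, "experts": 8, "top_k": 2},
    "DeepSeek-V3": {"total_b": 671.0, "active_b": 37.0, "experts": 256, "top_k": 8},
}

def active_share(total_b, active_b):
    """Fraction of all parameters used for a single token."""
    return active_b / total_b

def routed_share(experts, top_k):
    """Fraction k/E of the expert parameters used for a single token."""
    return top_k / experts

for name, m in models.items():
    share = active_share(m["total_b"], m["active_b"])
    routed = routed_share(m["experts"], m["top_k"])
    print(f"{name}: active share {share:.1%}, k/E {routed:.1%}")
```

The active share is always at least \(k/E\): attention layers, embeddings, and any other non-routed parameters run for every token, so only the expert parameters are scaled by \(k/E\).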

| Model | Total Parameters | Active Parameters | Experts |
|---|---|---|---|
| Mixtral 8×7B | 46.7B | 12.9B (28%) | 8, top-2 |
| DeepSeek-V3 | 671B | 37B (5.5%) | 256, top-8 |
| GPT-4 (speculated) | ~1.8T | ~110B | ~16 experts |

Advanced Prompting

LLM behavior is strongly influenced by the prompt:

| Technique | Description | When to use |
|---|---|---|
| Zero-shot | Direct instruction without examples | Simple tasks, large models |
| Few-shot | 3-5 input→output examples | Specific format, new tasks |
| Chain-of-Thought | Ask "let's think step by step" | Math, logic, reasoning |
| System Prompt | Defines model role/persona | Specialized assistants |
| RAG | Retrieves documents before generating | Up-to-date knowledge, factuality |
| Tool Use | Model calls external functions/APIs | Calculation, search, actions in the world |
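
The first three techniques differ only in how the prompt string is assembled. The following sketch shows one possible layout. The `Input:`/`Output:` labels and the `build_prompt` helper are assumptions made for illustration, not a format required by any model.

```python
def build_prompt(instruction, query, examples=(), chain_of_thought=False):
    """Assemble a zero-shot, few-shot, or chain-of-thought prompt as plain text."""
    parts = [instruction.strip()]
    # Few-shot: each (input, output) pair becomes a worked example.
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}\nOutput: {example_output}")
    # Chain-of-thought: nudge the model to reason before answering.
    suffix = " Let's think step by step." if chain_of_thought else ""
    parts.append(f"Input: {query}\nOutput:{suffix}")
    return "\n\n".join(parts)

zero_shot = build_prompt("Translate to French.", "bird")
few_shot = build_prompt(
    "Translate to French.",
    "bird",
    examples=[("cat", "chat"), ("dog", "chien")],
)
cot = build_prompt("Solve the problem.", "If x+3=7, what is 2x?", chain_of_thought=True)
print(few_shot)
```

The resulting string is sent to the model unchanged, and the model's weights are not updated by any of these techniques.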

Challenges and Limitations

Hallucination
LLMs can fabricate facts and state them with confidence. RAG and grounding partially mitigate this.
Knowledge Cutoff
Training data has a cutoff date. RAG, browsing, and tool use compensate.
Computational Cost
Inference is expensive. Quantization, distillation, and caching reduce costs.
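
To make the quantization idea concrete, here is a minimal sketch of symmetric per-tensor int8 quantization on a short list of weights. It is one simple scheme among many, chosen here for illustration. The function names and example weights are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 if largest > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.8, -1.5, 0.02, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
print(quantized)
print([round(abs(w - r), 5) for w, r in zip(weights, restored)])
```

Each weight is stored in 8 bits instead of 16 or 32, at the price of a rounding error of at most half the scale per weight.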

Model Landscape (2025)

| Family | Organization | Open-source? | Specialty |
|---|---|---|---|
| GPT-4o / o3 | OpenAI | ❌ | General SOTA, reasoning |
| Claude 3.7 | Anthropic | ❌ | Long context window, safety |
| Gemini 2.5 | Google | ❌ | Multimodal, Google integration |
| LLaMA 3.3 | Meta | ✅ | Base for fine-tuning |
| Mistral / Mixtral | Mistral AI | ✅ | Efficiency, MoE |
| DeepSeek-V3/R1 | DeepSeek | ✅ | Reasoning, code |
| Qwen 2.5 | Alibaba | ✅ | Multilingual |
| Gemma 3 | Google | ✅ | Small and efficient |