14. Transfer Learning
Transfer Learning and Fine-Tuning
Training a deep neural network from scratch requires enormous amounts of labeled data and compute. Transfer learning reduces this cost: instead of initializing the weights randomly, we start from a model already trained on a rich task (usually at large scale) and adapt it to our specific task.
The intuition: the early layers of a CNN trained on ImageNet learn edge, texture, and shape detectors — useful for any vision task. The later layers specialize in ImageNet categories. We replace those final layers and fine-tune the model.
Taxonomy of Approaches
LoRA: Low-Rank Adaptation
LoRA [2] is one of the most widely used PEFT (Parameter-Efficient Fine-Tuning) techniques. The idea is simple:
For a frozen pre-trained weight matrix \(W_0 \in \mathbb{R}^{d \times k}\), we add a low-rank perturbation:
\[ W = W_0 + \Delta W = W_0 + BA \]
where \(B \in \mathbb{R}^{d \times r}\) and \(A \in \mathbb{R}^{r \times k}\), with \(r \ll \min(d, k)\).
- During the forward pass: \(h = W_0 x + BAx = W_0 x + \Delta W x\)
- During training: only \(A\) and \(B\) update (\(W_0\) is frozen)
- Trainable parameters: \(r(d + k)\) vs \(dk\) in full fine-tuning
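The three points above can be sketched in a few lines of dependency-free Python. This is a toy illustration, not a training implementation: the sizes `d`, `k`, `r` and the helper names `matvec` and `lora_forward` are made up for the example, and the initialization (small random \(A\), zero \(B\)) follows the convention described in the LoRA paper rather than anything stated in this section.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W0, A, B, x):
    """Compute h = W0 x + B (A x) without materializing the d x k product BA."""
    base = matvec(W0, x)              # frozen pre-trained path
    update = matvec(B, matvec(A, x))  # low-rank trainable path
    return [b + u for b, u in zip(base, update)]

random.seed(0)
d, k, r = 6, 5, 2  # toy sizes, chosen only for illustration

W0 = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]   # trainable
B = [[0.0] * r for _ in range(d)]  # trainable; zero init makes BA = 0 at the start
x = [random.gauss(0, 1) for _ in range(k)]

h = lora_forward(W0, A, B, x)

trainable = r * (d + k)  # entries of A plus entries of B
full = d * k             # entries of W0
print(f"trainable: {trainable}, full fine-tuning: {full}")
```

Because \(B\) starts at zero, the adapted model initially reproduces the pre-trained model exactly; the output only diverges once training moves \(A\) and \(B\).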
Example with GPT-3 (175B parameters):
| Configuration | Trainable Params |
|---|---|
| Full fine-tuning | 175 billion |
| LoRA (\(r=1\), \(W_q\) and \(W_v\)) | ~4.7 million (~0.003%) |
| LoRA (\(r=4\), \(W_q\) and \(W_v\)) | ~18.9 million (~0.011%) |
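As a sanity check, the \(r(d + k)\) formula can be applied directly to GPT-3. The sketch below assumes the commonly cited GPT-3 175B shape (96 layers, hidden size 12288) and that LoRA is applied to the query and value projections of every layer; these figures and the helper names are assumptions of the example, not values given in this section.

```python
def lora_params(d, k, r):
    """Trainable parameters LoRA adds to one d x k weight matrix."""
    return r * (d + k)

# Assumed GPT-3 175B shape (not stated in this section).
D_MODEL = 12288          # hidden size; W_q and W_v are D_MODEL x D_MODEL
N_LAYERS = 96            # transformer layers
ADAPTED_PER_LAYER = 2    # W_q and W_v
TOTAL_PARAMS = 175e9

def gpt3_lora_total(r):
    """Total LoRA parameters across all layers for rank r."""
    return N_LAYERS * ADAPTED_PER_LAYER * lora_params(D_MODEL, D_MODEL, r)

for r in (1, 4, 16):
    n = gpt3_lora_total(r)
    print(f"r={r:>2}: {n:,} trainable ({100 * n / TOTAL_PARAMS:.4f}% of 175B)")
```

Under these assumptions the count grows linearly with the rank: roughly 4.7 million parameters at \(r=1\), 18.9 million at \(r=4\), and 75.5 million at \(r=16\).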
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_config = LoraConfig(
    r=16,                                  # rank of the update matrices
    lora_alpha=32,                         # scaling: the update is multiplied by alpha / r
    target_modules=["q_proj", "v_proj"],   # where to apply LoRA
    lora_dropout=0.05,
    bias="none",                           # leave bias terms frozen
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 6,815,744 || all params: 8,037,076,992 || trainable%: 0.0848
```
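The trainable-parameter count printed for this configuration can be reproduced by hand with the \(r(d + k)\) formula. The sketch below assumes the published Llama-3.1-8B shape (hidden size 4096, 32 layers, grouped-query attention with 8 key/value heads of dimension 128, about 8.03 billion base parameters); none of these numbers appear in the text above, so treat them as assumptions of the example.

```python
# Assumed Llama-3.1-8B architecture values (not stated in this section).
HIDDEN = 4096
N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
BASE_PARAMS = 8_030_261_248   # assumed size of the frozen base model

r = 16

# q_proj maps HIDDEN -> HIDDEN.
q_lora = r * (HIDDEN + HIDDEN)
# v_proj maps HIDDEN -> N_KV_HEADS * HEAD_DIM, which is smaller under grouped-query attention.
v_lora = r * (HIDDEN + N_KV_HEADS * HEAD_DIM)

trainable = N_LAYERS * (q_lora + v_lora)
pct = 100 * trainable / (BASE_PARAMS + trainable)

print(f"trainable params: {trainable:,} ({pct:.4f}%)")
```

The value projection contributes less than the query projection here because its output dimension is 1024 rather than 4096.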
Other PEFT Techniques
| Technique | Idea | Parameters |
|---|---|---|
| LoRA | Low-rank update matrices added to selected weights (typically attention projections) | \(r(d+k)\) per adapted matrix |
| QLoRA | LoRA + model frozen in 4-bit (NF4) | ~LoRA, reduced VRAM |
| Prefix Tuning | Learns virtual tokens prepended to the sequence | \(\text{num\_prefix} \times d_{\text{model}}\) |
| Prompt Tuning | Only prompt embeddings are trainable | \(\text{num\_tokens} \times d_{\text{model}}\) |
| Adapter Layers | Inserts small FFN layers between existing ones | \(2 \times r \times d\) per layer |
| DoRA | Weights decomposed into magnitude and direction; LoRA updates the direction | Similar to LoRA (plus a magnitude vector) |
Domain Adaptation vs. Task Adaptation
```mermaid
flowchart LR
    A[Pre-trained\nBase Model] --> B{Adaptation Type}
    B -->|"Domain data\n(no labels)"| C[Continued Pre-Training\nDomain Adaptation]
    B -->|"Labeled task\ndata"| D[Supervised Fine-Tuning\nSFT]
    B -->|"Human feedback"| E[RLHF / DPO\nAlignment]
    C --> F[Domain Model]
    D --> G[Task Model]
    E --> H[Aligned Model]
    F -->|"Additional fine-tune"| G
```

Practical recipe for LLM fine-tuning (2025):
- Start with a suitable base model (LLaMA-3, Mistral, Gemma)
- Quantize to 4-bit (QLoRA) if VRAM is limited
- Apply LoRA with \(r \in \{8, 16, 32\}\) on Q and V projections
- Use TRL + `SFTTrainer` for supervised fine-tuning
- Optionally apply DPO for preference alignment
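To see why the recipe suggests 4-bit quantization when VRAM is limited, it helps to estimate the memory taken by the frozen base weights alone. The sketch below is a back-of-the-envelope calculation under stated assumptions: an 8.03-billion-parameter model, 16 bits per weight for bf16 and 4 bits per weight for NF4. It deliberately ignores quantization constants, activations, the LoRA adapters, and optimizer state, so real usage will be higher.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

N_PARAMS = 8.03e9  # assumed size of an 8B-class base model

bf16_gib = weight_memory_gib(N_PARAMS, 16)  # unquantized half-precision weights
nf4_gib = weight_memory_gib(N_PARAMS, 4)    # 4-bit quantized weights

print(f"bf16 weights: {bf16_gib:.1f} GiB")
print(f"4-bit weights: {nf4_gib:.1f} GiB")
print(f"reduction: {bf16_gib / nf4_gib:.0f}x")
```

Under these assumptions the base weights shrink from about 15 GiB to under 4 GiB, which is often the difference between needing a data-center GPU and fitting on a single consumer card.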
When to Use Each Approach
| Situation | Recommendation |
|---|---|
| Task data: < 1,000 samples | Feature Extraction or Prompt Tuning |
| Data: 1k–100k samples, limited hardware | LoRA/QLoRA |
| Data: > 100k samples, hardware available | Full Fine-Tuning |
| New domain (medical, legal, code) | Domain Adaptation → Fine-Tuning |
| Alignment with values/preferences | SFT → RLHF or DPO |
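The table can be read as a small decision procedure. The sketch below is one possible encoding, not a definitive rule: the function name `recommend`, its flags, and the order in which the rows are checked are assumptions of the example, and the case the table leaves open (more than 100k samples on limited hardware) is mapped to LoRA/QLoRA as a guess.

```python
def recommend(n_samples, ample_hardware=False, new_domain=False, needs_alignment=False):
    """Return the table's recommendation for a given situation (hypothetical helper)."""
    # Assumption: alignment and domain shift take priority over data size.
    if needs_alignment:
        return "SFT → RLHF or DPO"
    if new_domain:
        return "Domain Adaptation → Fine-Tuning"
    if n_samples < 1_000:
        return "Feature Extraction or Prompt Tuning"
    if n_samples > 100_000 and ample_hardware:
        return "Full Fine-Tuning"
    # Covers 1k-100k samples, plus larger datasets on limited hardware (assumption).
    return "LoRA/QLoRA"

print(recommend(500))
print(recommend(20_000))
print(recommend(500_000, ample_hardware=True))
print(recommend(20_000, new_domain=True))
```

In practice these criteria overlap (a new domain with very little labeled data, for instance), so the thresholds are better treated as starting points than as hard cut-offs.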
1. Pan, S. J., & Yang, Q. (2010). A Survey on Transfer Learning. IEEE TKDE.
2. Hu, E. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models.
3. Dettmers, T. et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs.
4. Rafailov, R. et al. (2023). Direct Preference Optimization.