15. Transfer Learning

Transfer Learning and Fine-Tuning

Training a deep neural network from scratch requires large amounts of labeled data and compute. Transfer learning reduces that cost: instead of initializing the weights randomly, we start from a model already trained on a data-rich task (usually at large scale) and adapt it to our specific task.

The intuition: the early layers of a CNN trained on ImageNet learn edge, texture, and shape detectors, which are useful for most vision tasks. The later layers specialize in ImageNet categories. We therefore keep the early layers, replace the final layers with new task-specific ones, and fine-tune the model on our data.
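As a minimal sketch of that idea, the NumPy toy below keeps a "backbone" fixed and trains only a new classification head on top of its features. The random `W_backbone` is a hypothetical stand-in for real pre-trained layers, and the data, sizes, and learning rate are assumptions chosen for illustration, not values from a real vision model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the frozen early layers (hypothetical random weights; in a real
# setting these would be loaded from a model pre-trained on e.g. ImageNet).
W_backbone = rng.normal(size=(64, 32)) / np.sqrt(64)
W_backbone_before = W_backbone.copy()

def features(x):
    """Frozen feature extractor: linear map + ReLU, never updated below."""
    return np.maximum(x @ W_backbone, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy 3-class task whose labels depend on the backbone features.
X = rng.normal(size=(200, 64))
F = features(X)                      # computed once, since the backbone is frozen
y = (F @ rng.normal(size=(32, 3))).argmax(axis=1)

def loss(W_head):
    p = softmax(F @ W_head)
    return -np.log(p[np.arange(len(y)), y]).mean()

# New task-specific head replacing the original classifier; only this trains.
W_head = np.zeros((32, 3))
loss_before = loss(W_head)
for _ in range(500):
    grad_logits = softmax(F @ W_head)
    grad_logits[np.arange(len(y)), y] -= 1.0      # d(loss)/d(logits)
    W_head -= 0.1 * (F.T @ grad_logits) / len(y)  # gradient step on the head only
loss_after = loss(W_head)

print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

Because the backbone never changes, its features can be computed once and cached, which is what makes this "feature extraction" variant cheap. Full fine-tuning would also update `W_backbone`.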

Taxonomy of Approaches

LoRA: Low-Rank Adaptation

LoRA² is one of the most widely used PEFT (Parameter-Efficient Fine-Tuning) techniques. The idea is simple and elegant:

For a frozen pre-trained weight matrix \(W_0 \in \mathbb{R}^{d \times k}\), we add a low-rank perturbation:

\[ W = W_0 + \Delta W = W_0 + BA \]

where \(B \in \mathbb{R}^{d \times r}\) and \(A \in \mathbb{R}^{r \times k}\), with \(r \ll \min(d, k)\).

During the forward pass: \(h = W_0 x + BAx = W_0 x + \Delta W x\)
During training: only \(A\) and \(B\) update (\(W_0\) is frozen)
Trainable parameters: \(r(d + k)\) vs \(dk\) in full fine-tuning
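The forward pass and the parameter count can be illustrated with a small NumPy sketch. The dimensions are arbitrary, and the initialization (small Gaussian `A`, zero `B`) follows the LoRA paper rather than anything stated above, so treat both as assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4                       # toy sizes with r << min(d, k)

W0 = rng.normal(size=(d, k))              # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, k))   # trainable, small random init (assumed)
B = np.zeros((d, r))                      # trainable, zero init so BA = 0 at the start

def lora_forward(x):
    # Compute B(Ax) instead of materializing the d x k product BA.
    return W0 @ x + B @ (A @ x)

x = rng.normal(size=k)
h = lora_forward(x)
same_at_init = np.allclose(h, W0 @ x)     # the adapter is a no-op before training

trainable = A.size + B.size               # r * (d + k)
full = W0.size                            # d * k
print(f"trainable: {trainable}, full fine-tuning: {full}, same at init: {same_at_init}")
```

With the zero-initialized `B`, the adapted layer starts out identical to the pre-trained one, so training begins from the base model's behavior.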

Example with GPT-3 (175B parameters):

Configuration	Trainable Params
Full fine-tuning	175 billion
LoRA (\(r=1\), \(W_q\) and \(W_v\))	4.7 million (~0.003%)
LoRA (\(r=4\), \(W_q\) and \(W_v\))	~18.9 million (~0.011%)
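These counts can be checked directly from \(r(d + k)\). The sketch below assumes the GPT-3 175B setup used in the LoRA paper: hidden size \(d = k = 12288\), 96 layers, and adapters on the query and value projections only. None of these values are stated in the table itself.

```python
def lora_trainable_params(r, d, k, n_matrices):
    """Trainable parameters when n_matrices weights of shape d x k get rank-r adapters."""
    return r * (d + k) * n_matrices

D_MODEL = 12288            # assumed GPT-3 175B hidden size
N_LAYERS = 96              # assumed number of transformer layers
N_ADAPTED = 2 * N_LAYERS   # W_q and W_v in every layer
TOTAL = 175e9

for r in (1, 4, 16):
    n = lora_trainable_params(r, D_MODEL, D_MODEL, N_ADAPTED)
    print(f"r={r:>2}: {n:,} trainable ({100 * n / TOTAL:.4f}% of 175B)")
```

Under these assumptions the count grows linearly with the rank: about 4.7 million at \(r=1\), 18.9 million at \(r=4\), and 75.5 million at \(r=16\).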

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_config = LoraConfig(
    r=16,                  # matrix rank
    lora_alpha=32,         # scaling (alpha/r = scale factor)
    target_modules=["q_proj", "v_proj"],  # where to apply LoRA
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 6,815,744 || all params: 8,037,269,504 || trainable%: 0.0848
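The trainable count in the printed line can be reproduced by hand. This sketch assumes the published Llama-3.1-8B shapes (32 layers, hidden size 4096, grouped-query attention with 8 key/value heads of dimension 128, so `v_proj` maps 4096 to 1024). These shapes are assumptions not stated in the snippet above.

```python
# (out_features, in_features) of each adapted projection, per layer (assumed shapes).
shapes = {"q_proj": (4096, 4096), "v_proj": (1024, 4096)}
n_layers = 32
r = 16

# Each adapted d x k matrix adds r * (d + k) trainable parameters.
per_layer = sum(r * (d + k) for d, k in shapes.values())
total_trainable = per_layer * n_layers
print(f"{total_trainable:,}")
```

The smaller `v_proj` contribution comes from grouped-query attention: the value projection has fewer output features than the query projection, so its `B` matrix is smaller.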

Other PEFT Techniques

Technique	Idea	Trainable parameters
LoRA	Low-rank matrices added to attention weights	\(r(d+k)\) per adapted matrix
QLoRA	LoRA on a base model frozen and quantized to 4-bit (NF4)	Same as LoRA, with reduced VRAM
Prefix Tuning	Learns virtual tokens prepended to the sequence	\(\text{num\_prefix} \times d_{\text{model}}\)
Prompt Tuning	Only prompt embeddings are trainable	\(\text{num\_tokens} \times d_{\text{model}}\)
Adapter Layers	Inserts small FFN layers between existing ones	\(2 \times r \times d\) per layer
DoRA	Decomposes weights into magnitude and direction, with LoRA applied to the direction	Similar to LoRA
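To get a feel for the relative sizes, the formulas in the table can be evaluated for one hypothetical configuration. All constants below (hidden size, layer count, rank, number of virtual tokens) are assumptions for illustration. LoRA is assumed to adapt two square \(d \times d\) projections per layer.

```python
D_MODEL = 4096    # assumed hidden size
N_LAYERS = 32     # assumed number of transformer layers
R = 16            # LoRA rank, reused as the adapter bottleneck width
N_VIRTUAL = 20    # assumed number of virtual tokens

counts = {
    # r(d + k) per adapted matrix, two matrices (q and v) per layer
    "LoRA (q and v)": 2 * R * (D_MODEL + D_MODEL) * N_LAYERS,
    "Prefix Tuning": N_VIRTUAL * D_MODEL,
    "Prompt Tuning": N_VIRTUAL * D_MODEL,
    # 2 * r * d per layer (down- and up-projection, biases ignored)
    "Adapter Layers": 2 * R * D_MODEL * N_LAYERS,
}

for name, n in counts.items():
    print(f"{name:<16} {n:>12,}")
```

With these numbers the prompt-based methods train tens of thousands of parameters, while LoRA and adapter layers train a few million. In every case that is a small fraction of a multi-billion-parameter base model.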

Domain Adaptation vs. Task Adaptation

flowchart LR
    A[Pre-trained\nBase Model] --> B{Adaptation Type}
    B -->|"Domain data\n(no labels)"| C[Continued Pre-Training\nDomain Adaptation]
    B -->|"Labeled task\ndata"| D[Supervised Fine-Tuning\nSFT]
    B -->|"Human feedback"| E[RLHF / DPO\nAlignment]
    C --> F[Domain Model]
    D --> G[Task Model]
    E --> H[Aligned Model]
    F -->|"Additional fine-tune"| G

Practical recipe for LLM fine-tuning (2025):

Start with a suitable base model (LLaMA-3, Mistral, Gemma)
Quantize to 4-bit (QLoRA) if VRAM is limited
Apply LoRA with \(r \in \{8, 16, 32\}\) on Q and V projections
Use TRL + SFTTrainer for supervised fine-tuning
Optionally apply DPO for preference alignment
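The reason for the quantization step in the recipe can be seen from a rough, weights-only memory estimate. The sketch below assumes an 8B-parameter base model and reuses the adapter size from the LoRA example earlier. It ignores activations, optimizer state, and quantization metadata, so real VRAM use will be higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_BASE = 8e9          # assumed: an 8B-parameter base model
N_LORA = 6_815_744    # adapter size from the r=16 LoRA example

base_16bit = weight_memory_gb(N_BASE, 16)   # base weights in 16-bit precision
base_4bit = weight_memory_gb(N_BASE, 4)     # base weights quantized to 4-bit
adapters = weight_memory_gb(N_LORA, 16)     # LoRA weights kept in 16-bit

print(f"base, 16-bit: {base_16bit:.1f} GB")
print(f"base, 4-bit:  {base_4bit:.1f} GB")
print(f"LoRA adapters: {adapters * 1000:.1f} MB")
```

On this estimate the frozen base model dominates memory, so quantizing it to 4-bit cuts the weight footprint by roughly a factor of four, while the trainable adapters stay negligible either way.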

When to Use Each Approach

Situation	Recommendation
Task data: < 1,000 samples	Feature Extraction or Prompt Tuning
Data: 1k–100k samples, limited hardware	LoRA/QLoRA
Data: > 100k samples, hardware available	Full Fine-Tuning
New domain (medical, legal, code)	Domain Adaptation → Fine-Tuning
Alignment with values/preferences	SFT → RLHF or DPO

1. Pan, S. J., & Yang, Q. (2010). A Survey on Transfer Learning. IEEE TKDE.
2. Hu, E. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models.
3. Dettmers, T. et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs.
4. Rafailov, R. et al. (2023). Direct Preference Optimization.