Transfer Learning and Fine-Tuning

Training a deep neural network from scratch requires large amounts of labeled data and compute. Transfer learning reduces that cost: instead of initializing the weights randomly, we start from a model already trained on a data-rich task (usually at large scale) and adapt it to our specific task.

The intuition: the early layers of a CNN trained on ImageNet learn edge, texture, and shape detectors, which are useful for most vision tasks. The later layers specialize in ImageNet categories. We therefore replace those final layers with a new task-specific head and fine-tune the model.


Taxonomy of Approaches


LoRA: Low-Rank Adaptation

LoRA [2] is one of the most widely used PEFT (Parameter-Efficient Fine-Tuning) techniques. The idea is simple:

For a frozen pre-trained weight matrix \(W_0 \in \mathbb{R}^{d \times k}\), we add a low-rank perturbation:

\[ W = W_0 + \Delta W = W_0 + BA \]

where \(B \in \mathbb{R}^{d \times r}\) and \(A \in \mathbb{R}^{r \times k}\), with \(r \ll \min(d, k)\).

  • During the forward pass: \(h = W_0 x + BAx = W_0 x + \Delta W x\)
  • During training: only \(A\) and \(B\) update (\(W_0\) is frozen)
  • Trainable parameters: \(r(d + k)\) vs \(dk\) in full fine-tuning
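The forward pass and the parameter count above can be sketched in a few lines of plain Python. This is only an illustration: the helper names (`matvec`, `lora_forward`, `count_params`) and the toy dimensions are made up for this example, and the initialization (small random \(A\), zero \(B\)) is an assumption taken from the LoRA paper rather than from the text above.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W0, A, B, x, scale=1.0):
    """Compute h = W0 x + scale * B (A x) without materializing BA."""
    base = matvec(W0, x)                 # frozen path, d values
    low_rank = matvec(B, matvec(A, x))   # A x has r values, B maps them back to d
    return [b + scale * l for b, l in zip(base, low_rank)]

def count_params(d, k, r):
    """Return (LoRA trainable parameters, full fine-tuning parameters) for one matrix."""
    return r * (d + k), d * k

random.seed(0)
d, k, r = 6, 5, 2                        # toy sizes, chosen only for illustration
W0 = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen weights
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]   # small random init
B = [[0.0] * r for _ in range(d)]        # zero init, so BA = 0 before training
x = [random.gauss(0, 1) for _ in range(k)]

h_start = lora_forward(W0, A, B, x)      # identical to W0 x while B is all zeros
lora_params, full_params = count_params(d, k, r)
print(lora_params, full_params)          # 22 30
```

Because \(B\) starts at zero, the adapted model initially reproduces the pre-trained model exactly. Computing \(B(Ax)\) instead of forming \(BA\) keeps the extra cost proportional to \(r(d + k)\).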

Example with GPT-3 (175B parameters):

| Configuration | Trainable parameters |
|---|---|
| Full fine-tuning | 175 billion |
| LoRA (\(r=1\), \(W_q\) and \(W_v\)) | ~4.7 million (~0.003%) |
| LoRA (\(r=4\), \(W_q\) and \(W_v\)) | ~18.9 million (~0.011%) |
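LoRA parameter counts for GPT-3 can be checked directly from \(r(d + k)\). The sketch below assumes the published GPT-3 175B shape (96 layers, \(d_{\text{model}} = 12288\)) and adapters on the query and value projections only; the function name `lora_trainable_params` is made up for this example. Under these assumptions, about 4.7 million trainable parameters corresponds to \(r=1\) and about 18.9 million to \(r=4\).

```python
def lora_trainable_params(n_layers, r, shapes):
    """Sum r * (d + k) over every adapted d x k matrix in every layer."""
    per_layer = sum(r * (d + k) for d, k in shapes)
    return n_layers * per_layer

# Assumed GPT-3 175B shape: 96 layers, d_model = 12288,
# with LoRA applied to the query and value projections only.
D_MODEL, N_LAYERS, TOTAL = 12288, 96, 175_000_000_000
qv = [(D_MODEL, D_MODEL), (D_MODEL, D_MODEL)]

for r in (1, 4):
    n = lora_trainable_params(N_LAYERS, r, qv)
    print(f"r={r}: {n:,} trainable ({100 * n / TOTAL:.4f}% of 175B)")
# r=1: 4,718,592 trainable (0.0027% of 175B)
# r=4: 18,874,368 trainable (0.0108% of 175B)
```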

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_config = LoraConfig(
    r=16,                  # matrix rank
    lora_alpha=32,         # scaling (alpha/r = scale factor)
    target_modules=["q_proj", "v_proj"],  # where to apply LoRA
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 6,815,744 || all params: 8,037,269,504 || trainable%: 0.0848
```
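
The trainable-parameter count of 6,815,744 can be reproduced by hand. The arithmetic below assumes the Llama-3.1-8B configuration (32 layers, hidden size 4096, 8 key/value heads of dimension 128), which is not stated in the text above. Because of grouped-query attention, `v_proj` is narrower than `q_proj`.

```python
# Assumed Llama-3.1-8B shape: 32 layers, hidden size 4096,
# 8 key/value heads of dimension 128 (grouped-query attention).
HIDDEN, N_LAYERS, R = 4096, 32, 16
KV_DIM = 8 * 128                      # v_proj maps 4096 -> 1024

q_proj = R * (HIDDEN + HIDDEN)        # A is r x 4096, B is 4096 x r
v_proj = R * (HIDDEN + KV_DIM)        # A is r x 4096, B is 1024 x r
trainable = N_LAYERS * (q_proj + v_proj)
print(f"{trainable:,}")               # 6,815,744
```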

Other PEFT Techniques

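| Technique | Idea | Parameters |
|---|---|---|
| LoRA | Low-rank matrices added to attention weights | \(r(d+k)\) per adapted matrix |
| QLoRA | LoRA on top of a frozen base model quantized to 4-bit (NF4) | Same as LoRA, with reduced VRAM |
| Prefix Tuning | Learns prefix vectors prepended to the keys and values of each attention layer | \(2 \times \text{num\_prefix} \times d_{\text{model}}\) per layer |
| Prompt Tuning | Only the prompt embeddings prepended to the input are trainable | \(\text{num\_tokens} \times d_{\text{model}}\) |
| Adapter Layers | Inserts small bottleneck FFN layers between existing ones | \(2 \times r \times d\) per adapter |
| DoRA | Decomposes each weight into magnitude and direction; LoRA updates the direction | Similar to LoRA, plus one magnitude vector per adapted matrix |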
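
To get a feel for the scale of these formulas, the sketch below evaluates three of them for one set of sizes. Every number in it (\(d_{\text{model}} = 4096\), rank 16, 20 prompt tokens) is a hypothetical choice for illustration, and the function names are made up for this example.

```python
def lora_params(d, k, r):
    """Trainable parameters LoRA adds to one d x k weight matrix."""
    return r * (d + k)

def prompt_tuning_params(num_tokens, d_model):
    """Trainable parameters for a learned prompt of num_tokens embeddings."""
    return num_tokens * d_model

def adapter_params(d, r):
    """Down- and up-projection of one bottleneck adapter (biases ignored)."""
    return 2 * r * d

D_MODEL = 4096  # hypothetical hidden size
counts = {
    "LoRA, one square matrix, r=16": lora_params(D_MODEL, D_MODEL, 16),
    "Prompt Tuning, 20 tokens": prompt_tuning_params(20, D_MODEL),
    "Adapter, bottleneck r=16": adapter_params(D_MODEL, 16),
}
for name, n in counts.items():
    print(f"{name}: {n:,}")
# LoRA, one square matrix, r=16: 131,072
# Prompt Tuning, 20 tokens: 81,920
# Adapter, bottleneck r=16: 131,072
```

Prompt tuning is counted once for the whole model, while the LoRA and adapter figures must be multiplied by the number of adapted matrices or inserted adapters.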

Domain Adaptation vs. Task Adaptation

```mermaid
flowchart LR
    A[Pre-trained\nBase Model] --> B{Adaptation Type}
    B -->|"Domain data\n(no labels)"| C[Continued Pre-Training\nDomain Adaptation]
    B -->|"Labeled task\ndata"| D[Supervised Fine-Tuning\nSFT]
    B -->|"Human feedback"| E[RLHF / DPO\nAlignment]
    C --> F[Domain Model]
    D --> G[Task Model]
    E --> H[Aligned Model]
    F -->|"Additional fine-tune"| G
```

Practical recipe for LLM fine-tuning (2025):

  1. Start with a suitable base model (LLaMA-3, Mistral, Gemma)
  2. Quantize to 4-bit (QLoRA) if VRAM is limited
  3. Apply LoRA with \(r \in \{8, 16, 32\}\) on Q and V projections
  4. Use the SFTTrainer from the TRL library for supervised fine-tuning
  5. Optionally apply DPO for preference alignment
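
Step 2 is easier to decide with a rough estimate of how much memory the frozen weights alone occupy. The sketch below is a back-of-the-envelope calculation for a hypothetical 8-billion-parameter model. It counts weights only and ignores activations, gradients, optimizer state for the LoRA parameters, and quantization metadata, so real usage will be higher.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory for the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

N_PARAMS = 8_000_000_000  # hypothetical 8B-parameter base model

for label, bits in (("16-bit", 16), ("8-bit", 8), ("4-bit", 4)):
    print(f"{label}: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
# 16-bit: 14.9 GiB
# 8-bit: 7.5 GiB
# 4-bit: 3.7 GiB
```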

When to Use Each Approach

| Situation | Recommendation |
|---|---|
| Task data: < 1,000 samples | Feature Extraction or Prompt Tuning |
| Data: 1k–100k samples, limited hardware | LoRA/QLoRA |
| Data: > 100k samples, hardware available | Full Fine-Tuning |
| New domain (medical, legal, code) | Domain Adaptation → Fine-Tuning |
| Alignment with values/preferences | SFT → RLHF or DPO |
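
These recommendations can be read as a small decision rule. The function below is one possible encoding, not a definitive policy. The name `recommend` and its arguments are made up for this example, and falling back to LoRA/QLoRA for combinations not listed above (for example, more than 100k samples on limited hardware) is an assumption.

```python
def recommend(n_samples, limited_hardware=True, new_domain=False, needs_alignment=False):
    """Return an ordered list of suggested adaptation steps."""
    steps = []
    if new_domain:
        # Adapt to the domain first, then fine-tune on the task.
        steps.append("Domain Adaptation")
    if n_samples < 1_000:
        steps.append("Feature Extraction or Prompt Tuning")
    elif n_samples > 100_000 and not limited_hardware:
        steps.append("Full Fine-Tuning")
    else:
        # Assumption: unlisted combinations default to LoRA/QLoRA.
        steps.append("LoRA/QLoRA")
    if needs_alignment:
        steps.append("RLHF or DPO")
    return steps

print(recommend(500))
# ['Feature Extraction or Prompt Tuning']
print(recommend(20_000, new_domain=True))
# ['Domain Adaptation', 'LoRA/QLoRA']
print(recommend(500_000, limited_hardware=False, needs_alignment=True))
# ['Full Fine-Tuning', 'RLHF or DPO']
```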



  1. Pan, S. J., & Yang, Q. (2010). A Survey on Transfer Learning. IEEE TKDE.

  2. Hu, E. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models.

  3. Dettmers, T. et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs.

  4. Rafailov, R. et al. (2023). Direct Preference Optimization.