6. LLM Fine-Tuning
Deadline and Submission
TBD
Commits until 23:59
Team (2-3 members)
GitHub Pages link via insper.blackboard.com.
Activity: Fine-Tuning a Large Language Model with LoRA
In this activity you will fine-tune a pre-trained LLM on a custom task using LoRA (Low-Rank Adaptation), one of the most widely used Parameter-Efficient Fine-Tuning (PEFT) techniques in the industry. You will use the Hugging Face ecosystem: transformers, peft, and trl.
Learning Objectives
By the end of this activity you will be able to:
- Load and inspect a pre-trained LLM and its tokenizer
- Configure and apply LoRA adapters to specific attention modules
- Fine-tune the model on a custom instruction dataset using SFTTrainer
- Evaluate the fine-tuned model qualitatively and quantitatively
- Understand the trade-off between trainable parameters, memory, and quality
Prerequisites
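Install the required packages with pip: transformers, peft, and trl, plus datasets (for loading data) and accelerate (needed for device_map="auto").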
You will need access to a GPU (Google Colab Pro or similar). For a minimal experiment, microsoft/Phi-3-mini-4k-instruct (3.8B parameters) or TinyLlama/TinyLlama-1.1B-Chat-v1.0 (1.1B) work on a T4 GPU.
Exercise 1: Model Inspection and Baseline
Instructions
- Load a pre-trained model and its tokenizer:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
- Count total parameters and report:
  - Total parameters
  - Parameters per layer type (embedding, attention, FFN)
  - Memory usage (use model.get_memory_footprint())
- Run a baseline inference with 5 prompts related to your chosen task (e.g., sentiment analysis, Q&A, code generation). Record the raw model responses.
- Evaluate the baseline using an appropriate metric for your task (accuracy for classification, BLEU/ROUGE for generation). This establishes the reference point before fine-tuning.
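The per-layer-type count from the first step can be sketched as a small grouping function. This is a minimal sketch, assuming Llama-style parameter names (embed_tokens, self_attn, mlp, lm_head), which is the naming TinyLlama uses; other architectures name their modules differently, so inspect model.named_parameters() before relying on it. The helper name count_params_by_group and the group labels are hypothetical.

```python
from collections import defaultdict

# Substring -> group label. ASSUMPTION: Llama-style module names;
# adjust after inspecting model.named_parameters() for your model.
GROUP_KEYWORDS = [
    ("embed_tokens", "embedding"),
    ("lm_head", "output head"),
    ("self_attn", "attention"),
    ("mlp", "FFN"),
]


def count_params_by_group(named_sizes):
    """Sum parameter counts per group from (name, numel) pairs."""
    totals = defaultdict(int)
    for name, numel in named_sizes:
        # First matching keyword wins; anything unmatched (e.g. norms) is "other".
        group = next((label for key, label in GROUP_KEYWORDS if key in name), "other")
        totals[group] += numel
    totals["total"] = sum(totals.values())
    return dict(totals)


if __name__ == "__main__":
    # Toy (name, numel) pairs standing in for a real model.
    demo = [
        ("model.embed_tokens.weight", 1000),
        ("model.layers.0.self_attn.q_proj.weight", 400),
        ("model.layers.0.mlp.gate_proj.weight", 800),
        ("model.layers.0.input_layernorm.weight", 20),
        ("lm_head.weight", 1000),
    ]
    print(count_params_by_group(demo))
```

With a loaded model, the same function can be fed from `((n, p.numel()) for n, p in model.named_parameters())`.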
Exercise 2: Dataset Preparation
Instructions
- Choose or create a task dataset with at least 500 instruction-following examples. Suggested sources:
  - Hugging Face Datasets (e.g., financial_phrasebank, medical_questions_pairs, code_x_glue_ct_code_to_code_trans)
  - Custom dataset relevant to your domain
  - Avoid: general chat datasets (too broad) and datasets already in the model's training data
- Convert each example to an instruction-following format compatible with the model's chat template:
def format_example(example):
    return {
        "text": tokenizer.apply_chat_template([
            {"role": "user", "content": example["input"]},
            {"role": "assistant", "content": example["output"]}
        ], tokenize=False, add_generation_prompt=False)
    }
- Split 80% train / 10% validation / 10% test. Report:
  - Dataset size, label distribution (if classification)
  - Average input/output length in tokens
  - Example of a formatted training sample
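The 80/10/10 split can be sketched in plain Python as follows. This is one possible approach, assuming the examples fit in memory as a list; the function name split_80_10_10 and the default seed are hypothetical choices, not part of the assignment.

```python
import random


def split_80_10_10(examples, seed=42):
    """Shuffle with a fixed seed, then cut into train/validation/test."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # local RNG, so global state is untouched
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder, so no example is dropped
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_80_10_10(range(500))
    print(len(train), len(val), len(test))  # 400 50 50
```

If the data is already a Hugging Face Dataset, calling its train_test_split method twice with a fixed seed should give an equivalent result. For classification tasks with imbalanced labels, a stratified split may be a better choice than this uniform shuffle.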
Exercise 3: LoRA Configuration and Fine-Tuning
Instructions
- Configure LoRA using peft:
from peft import LoraConfig, get_peft_model, TaskType
lora_config = LoraConfig(
    r=8,  # rank: experiment with 4, 8, 16
    lora_alpha=16,  # scaling factor (typically 2×r)
    target_modules=["q_proj", "v_proj"],  # modules to adapt
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
- Fine-tune with SFTTrainer:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    args=SFTConfig(
        output_dir="./results",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=2,  # effective batch size: 4 × 2 = 8
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,
        logging_steps=10,
        eval_strategy="epoch",  # older transformers releases call this evaluation_strategy
        save_strategy="epoch",  # must match the evaluation schedule for load_best_model_at_end
        load_best_model_at_end=True,
        fp16=True,
    ),
)
trainer.train()
- Run experiments with at least two different LoRA configurations and compare:
  - r=4 vs r=16
  - target_modules=["q_proj", "v_proj"] vs ["q_proj", "k_proj", "v_proj", "o_proj"]
- Plot training and validation loss curves for each configuration.
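To get the points for the loss curves, one option is to read them from trainer.state.log_history, the list of dictionaries the Trainer fills during training. The sketch below assumes the usual layout, where training logs carry a "loss" key and evaluation logs carry an "eval_loss" key; the function name split_loss_history is hypothetical and the numbers in the demo are made up for illustration only.

```python
def split_loss_history(log_history):
    """Separate training and validation loss points from Trainer logs."""
    train_points, eval_points = [], []
    for entry in log_history:
        step = entry.get("step")
        if "loss" in entry:  # periodic training log (every logging_steps)
            train_points.append((step, entry["loss"]))
        if "eval_loss" in entry:  # evaluation log (once per epoch here)
            eval_points.append((step, entry["eval_loss"]))
    return train_points, eval_points


if __name__ == "__main__":
    # Toy log with invented values, shaped like trainer.state.log_history.
    fake_log = [
        {"step": 10, "loss": 2.1, "learning_rate": 2e-4},
        {"step": 20, "loss": 1.7, "learning_rate": 1.5e-4},
        {"step": 20, "eval_loss": 1.8},
        {"step": 20, "train_runtime": 12.3, "train_loss": 1.9},
    ]
    train_points, eval_points = split_loss_history(fake_log)
    print(train_points)
    print(eval_points)
```

Each list of (step, loss) pairs can then be passed to a plotting library, one figure or one pair of lines per LoRA configuration.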
Required reporting
| Config | Rank r | Target modules | Trainable params | Train loss | Val loss | Test metric |
|---|---|---|---|---|---|---|
| A | 4 | q, v | ? | ? | ? | ? |
| B | 16 | q, k, v, o | ? | ? | ? | ? |
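As a sanity check for the "Trainable params" column, the count can be estimated by hand: for each adapted linear layer with input size d_in and output size d_out, LoRA adds two matrices with r × d_in and d_out × r entries. The sketch below assumes TinyLlama-1.1B shapes (hidden size 2048, key/value projection size 256 due to grouped-query attention, 22 layers); these values are assumptions to verify against model.config, and the helper name lora_trainable_params is hypothetical. The result should be compared with what model.print_trainable_parameters() reports, which remains the authoritative number.

```python
def lora_trainable_params(layer_shapes, r, num_layers):
    """LoRA adds A (r x d_in) and B (d_out x r) per adapted linear layer."""
    per_block = sum(r * (d_in + d_out) for d_in, d_out in layer_shapes)
    return per_block * num_layers


# ASSUMED TinyLlama-1.1B shapes as (d_in, d_out); verify against model.config.
HIDDEN, KV_DIM, NUM_LAYERS = 2048, 256, 22
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

if __name__ == "__main__":
    config_a = [SHAPES[m] for m in ("q_proj", "v_proj")]
    config_b = [SHAPES[m] for m in ("q_proj", "k_proj", "v_proj", "o_proj")]
    print("A (r=4, q/v):", lora_trainable_params(config_a, 4, NUM_LAYERS))
    print("B (r=16, q/k/v/o):", lora_trainable_params(config_b, 16, NUM_LAYERS))
```

This estimate assumes bias="none", as in the configuration above, so no bias terms are counted.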
Exercise 4: Evaluation and Analysis
Instructions
- Quantitative evaluation on the test set:
  - Classification: accuracy, F1
  - Generation: BLEU-4, ROUGE-L, or task-specific metric
  - Compare baseline vs. each LoRA configuration
- Qualitative evaluation: run 10 test prompts through the baseline and the best fine-tuned model. Show 3 examples where fine-tuning clearly improved the output and 1-2 where it did not.
- Error analysis: what types of inputs does the fine-tuned model still struggle with? Is it a data problem, prompt problem, or capacity problem?
- Ablation (rank analysis): if time allows, test r ∈ {2, 4, 8, 16, 32} and plot test metric vs. trainable parameter count. At what rank does performance plateau?
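For classification tasks, the quantitative metrics above can be computed without extra dependencies. The following is a minimal sketch, assuming the model's generated text has already been mapped to one label per example; the function names are hypothetical, and macro-averaging is one of several reasonable F1 variants (state in the report which one you use).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy labels for illustration only.
    y_true = ["pos", "neg", "pos", "neg"]
    y_pred = ["pos", "neg", "neg", "neg"]
    print("accuracy:", accuracy(y_true, y_pred))
    print("macro F1:", round(macro_f1(y_true, y_pred), 4))
```

Running the same functions on the baseline predictions and on each LoRA configuration gives directly comparable rows for the results table.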
Exercise 5: Reflection
Answer the following questions in your report (1 paragraph each):
- How many parameters did LoRA actually train? What percentage of the full model? Why is this enough to adapt the model to your task?
- What would full fine-tuning have required in terms of memory? Why is that prohibitive for most teams?
- Did the fine-tuned model "forget" any general capabilities of the base model? Give evidence from your qualitative evaluation.
- If you were deploying this fine-tuned model in production, what additional steps would you take before launching?
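For the memory question, a rough back-of-the-envelope estimate can serve as a starting point. The sketch below assumes mixed-precision training with the Adam optimizer, using the commonly cited figure of about 16 bytes per parameter; it ignores activations, which depend on batch size and sequence length, so real usage is higher. The byte breakdown and the function name are assumptions for illustration, not measurements.

```python
# ASSUMED per-parameter cost for mixed-precision Adam training (bytes).
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "Adam first moment (fp32)": 4,
    "Adam second moment (fp32)": 4,
}


def full_finetune_memory_gb(num_params, bytes_per_param=None):
    """Estimate weight + gradient + optimizer memory in GB (activations excluded)."""
    if bytes_per_param is None:
        bytes_per_param = sum(BYTES_PER_PARAM.values())
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for name, n in [("1.1B model", 1.1e9), ("3.8B model", 3.8e9)]:
        print(f"{name}: about {full_finetune_memory_gb(n):.1f} GB before activations")
```

Comparing this estimate with the memory of the GPU you actually used is one way to ground the answer in your own setup.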
Evaluation Criteria
Important Constraints
- Use only open-source models (no GPT-4 API, no Claude API). The model must be loadable from Hugging Face Hub.
- Report GPU hours used and estimated cost (Google Colab compute units or equivalent).
- All code must be reproducible: set random seeds, pin library versions.
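The reproducibility constraint can be approached with a single seeding helper called at the top of the notebook. This is a minimal sketch: the function name set_all_seeds is hypothetical, NumPy and PyTorch are seeded only if they are installed, and full determinism on GPU may need additional settings that are not covered here.

```python
import random


def set_all_seeds(seed=42):
    """Seed Python, and NumPy / PyTorch when they are available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


if __name__ == "__main__":
    set_all_seeds(123)
    first = random.random()
    set_all_seeds(123)
    print(first == random.random())  # True
```

The transformers library also provides a set_seed function with a similar purpose. To pin library versions, record the exact versions installed in your environment (for example, the output of pip freeze) alongside the notebook.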
| Points | Criteria |
|---|---|
| 2 pts | Dataset selection, preparation, and formatting |
| 2 pts | LoRA configuration and successful fine-tuning |
| 2 pts | Quantitative evaluation and comparison of configurations |
| 2 pts | Qualitative evaluation and error analysis |
| 2 pts | Reflection questions and report quality |
Submission format: GitHub Pages report + link to training notebook (Google Colab or similar). Include all plots and tables.
AI Collaboration: Allowed, but you must understand every configuration parameter. The report must be your own analysis.