3. Generative
This is an open-themed project in which you explore modern generative models. You must use at least one architecture from the list below and build a complete generation pipeline. The focus is on understanding the underlying models, not just running demos: you must explain each architecture, how the components connect, and the design choices behind them.
Eligible Architectures
Choose at least one of the following as your primary model:
| Track | Model family | Examples |
|---|---|---|
| A | Latent Diffusion + U-Net | Stable Diffusion 1.5/XL |
| B | Flow Matching + DiT | FLUX.1-dev, SD3 |
| C | Autoregressive / Masked-Token Image Generation | LlamaGen, MaskGIT |
| D | Video Generation | CogVideoX, AnimateDiff |
| E | Audio Generation | Stable Audio, AudioCraft |
| F | Any-to-any Multimodal | Open-source Chameleon variants |
Recommended starting point
Track B (FLUX.1) and Track A (Stable Diffusion XL via ComfyUI) are the most accessible options and cover the most course content. Tracks D and E are excellent choices if your team wants to go further.
Pipeline Requirements
Your pipeline must chain at least two model components. Examples:
- Text → CLIP encoder → FLUX DiT (Flow Matching) → VAE decoder → Image
- Image → Depth estimator → ControlNet + SD → Styled image
- Text → LLM (enhanced prompt) → FLUX → Image → BLIP captioner → refined prompt
- Audio → Whisper → LLM → TTS → new audio
- Text → LLM story → SD image per scene → assembled video
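As a rough illustration of what "chaining" means here, the sketch below wires three placeholder stages into one pipeline and records what each stage produced. Everything in it is an assumption made for illustration: `Stage`, `Pipeline`, and the three stub functions are hypothetical names, and the stubs stand in for real model calls (an LLM, an image generator, a captioner) that you would supply yourself.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class Stage:
    name: str                  # label you would reuse in the report and diagram
    fn: Callable[[Any], Any]   # the component call (a stub in this sketch)


@dataclass
class Pipeline:
    stages: List[Stage]
    trace: List[Tuple[str, str]] = field(default_factory=list)

    def run(self, x: Any) -> Any:
        """Feed the output of each stage into the next and log output types."""
        self.trace.clear()
        for stage in self.stages:
            x = stage.fn(x)
            self.trace.append((stage.name, type(x).__name__))
        return x


# Hypothetical stand-ins for real models; replace with actual inference calls.
def enhance_prompt(prompt: str) -> str:
    return prompt + ", detailed, soft lighting"


def generate_image(prompt: str) -> dict:
    return {"prompt": prompt, "pixels": None}  # no real image is produced here


def caption_image(image: dict) -> str:
    return "caption of: " + image["prompt"]


pipeline = Pipeline([
    Stage("LLM prompt enhancer", enhance_prompt),
    Stage("Image generator", generate_image),
    Stage("Captioner", caption_image),
])
result = pipeline.run("a red fox in snow")
```

The `trace` list gives you the input/output type of every hop, which is the same information the "Role in the pipeline" part of the report asks for.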
What You Must Explain
For each model component in your pipeline, your report must describe:
- Architecture: what type of network (U-Net, DiT, Transformer, VAR…), number of parameters, key design choices
- Training objective: diffusion loss, flow matching, contrastive, autoregressive, etc.
- Role in the pipeline: what input it receives, what output it produces, why this component is here
- Connection to course content: explicitly link to the relevant lecture (e.g., "This U-Net uses cross-attention as described in the Attention lecture")
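For the "Training objective" item, it can help to write the loss down explicitly. The block below contrasts a standard noise-prediction diffusion loss with a flow matching loss on a straight-line path, which is one common formulation. The notation is an assumption chosen for this sketch, not taken from any specific model card: $z_1$ is a clean latent, $\epsilon$ is Gaussian noise, $c$ is the conditioning (for example a text embedding), and $\bar\alpha_t$ is the cumulative noise schedule. Check the paper for your chosen model, since parameterizations and time conventions differ.

```latex
% Diffusion (noise prediction), discrete timestep t:
z^{(t)} = \sqrt{\bar\alpha_t}\, z_1 + \sqrt{1 - \bar\alpha_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)
\\
\mathcal{L}_{\text{diff}}
  = \mathbb{E}_{z_1,\, \epsilon,\, t}
    \left\| \epsilon - \epsilon_\theta\!\left(z^{(t)}, t, c\right) \right\|_2^2

% Flow matching on a straight path, continuous time t \in [0, 1]:
z_t = (1 - t)\, \epsilon + t\, z_1
\\
\mathcal{L}_{\text{FM}}
  = \mathbb{E}_{z_1,\, \epsilon,\, t}
    \left\| v_\theta\!\left(z_t, t, c\right) - \left(z_1 - \epsilon\right) \right\|_2^2
```

In the first case the network predicts the noise that was added; in the second it predicts the constant velocity $z_1 - \epsilon$ that carries a noise sample to the data sample.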
No Free Lunch
Use only open-source models and free compute (Google Colab, Kaggle, Hugging Face Spaces). Do not use paid APIs (OpenAI, Midjourney, Adobe Firefly). Document the GPU hours you used.
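One lightweight way to document GPU hours is to log wall-clock time for every run made while a GPU runtime is attached. The sketch below is only a suggestion: `GpuHourLog` and its methods are hypothetical names, the GPU label is typed in by hand, and wall-clock time is an approximation of actual GPU usage.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GpuHourLog:
    entries: List[Dict[str, object]] = field(default_factory=list)

    def add(self, label: str, seconds: float, gpu: str = "unknown") -> None:
        """Record a finished run, converting seconds to hours."""
        self.entries.append({"label": label, "gpu": gpu, "hours": seconds / 3600.0})

    @contextmanager
    def track(self, label: str, gpu: str = "unknown"):
        """Time the enclosed block with a wall clock and record it."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.add(label, time.perf_counter() - start, gpu)

    def total_hours(self) -> float:
        return sum(e["hours"] for e in self.entries)

    def to_markdown(self) -> str:
        """Render the log as a table that can be pasted into the report."""
        lines = ["| Run | GPU | Hours |", "|---|---|---|"]
        for e in self.entries:
            lines.append(f"| {e['label']} | {e['gpu']} | {e['hours']:.2f} |")
        lines.append(f"| Total | | {self.total_hours():.2f} |")
        return "\n".join(lines)


log = GpuHourLog()
log.add("parameter sweep", 5400, gpu="T4")   # 1.5 h, entered after the fact
log.add("LoRA fine-tune", 1800, gpu="T4")    # 0.5 h
report_table = log.to_markdown()
```

For live timing, wrap the inference or training call in `with log.track("my run", gpu="T4"):` instead of calling `add` by hand.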
Examples of Input–Output Pairs
Provide at least 8 examples showing:
- Different text prompts / input styles
- Different inference parameters (CFG scale, number of steps, seed)
- At least 2 failure cases with analysis of why they failed
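A small parameter grid is an easy way to reach eight examples while keeping every setting reproducible. The sketch below only builds the run manifest; it does not call any model. The function name `build_sweep`, the dictionary keys, and the example prompts are all assumptions made for illustration.

```python
import itertools
import json


def build_sweep(prompts, cfg_scales, step_counts, seeds):
    """Return one record per combination of prompt and inference parameters."""
    runs = []
    grid = itertools.product(prompts, cfg_scales, step_counts, seeds)
    for i, (prompt, cfg, steps, seed) in enumerate(grid):
        runs.append({
            "id": f"run_{i:03d}",
            "prompt": prompt,
            "cfg_scale": cfg,
            "steps": steps,
            "seed": seed,
        })
    return runs


runs = build_sweep(
    prompts=["a lighthouse at dusk", "a hand holding seven coins"],
    cfg_scales=[3.5, 7.5],
    step_counts=[20, 50],
    seeds=[0],
)

# Serialize the manifest so the gallery can cite exact settings per image.
manifest_json = json.dumps(runs, indent=2)
```

With two prompts, two CFG scales, two step counts, and one seed, the grid yields eight runs. Prompts that stress known weak spots (counting, hands, text rendering) are a reasonable place to look for the two failure cases.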
Criteria
| Grade | Requirements |
|---|---|
| I | Incomplete delivery or no architecture explanation. |
| D | Basic working pipeline with errors; architecture explanation missing or superficial. |
| C | One working pipeline (Track A or B) with full architecture explanation for each component. At least 8 input-output examples with varied parameters. |
| B | Two working pipelines or one pipeline with advanced techniques (ControlNet, IP-Adapter, LoRA fine-tuning, or video generation). Full architecture documentation. |
| A | Grade B plus: custom fine-tuning (LoRA/DreamBooth), original pipeline combining ≥3 components, or a Track D/E implementation. Benchmarked results (FID, CLIP Score, or domain-specific metric). |
A half-grade will be added or subtracted based on report quality, creativity, and depth of architectural analysis.
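For the benchmarked results mentioned under grade A, the two named metrics have the standard definitions sketched below. The symbols are assumptions introduced here: $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of Inception features for real and generated images, $E_I$ and $E_T$ are CLIP image and text embeddings, and $w$ is a scaling constant that differs between implementations, so report which one you used.

```latex
% Frechet Inception Distance (lower is better):
\mathrm{FID}
  = \left\| \mu_r - \mu_g \right\|_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g
      - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)

% CLIP Score for one image-text pair (higher is better):
\mathrm{CLIPScore}(I, T)
  = w \cdot \max\!\left(
      \frac{E_I \cdot E_T}{\left\| E_I \right\| \, \left\| E_T \right\|},\; 0
    \right)
```

FID compares two sets of images and is noisy for small samples, so state how many images each set contained.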
Report Structure
Your GitHub Pages report must include:
- Introduction: what pipeline you built and why you chose it
- Architecture diagrams: flowcharts showing how data moves between components (use Mermaid or draw.io)
- Component deep-dives: one section per component with architecture description and math where relevant
- Results gallery: annotated input-output pairs with parameter settings
- Failure analysis: what does not work and why
- Reflection: what you learned, what surprised you, what you would do differently
Architecture Diagram Example
flowchart LR
A["Text Prompt"] --> B["CLIP Text Encoder\n(ViT-L/14, 123M params)"]
N["Gaussian Noise\nz~N(0,I)"] --> C
B --> C["FLUX DiT\n(12B params, Flow Matching)"]
C -->|"ODE: 20 steps"| D["Clean Latent z₁"]
D --> E["VAE Decoder\n(83M params)"]
E --> F["Output Image\n1024×1024px"]
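The "ODE: 20 steps" edge in the diagram refers to numerically integrating a learned velocity field from noise to a clean latent. A minimal sketch of that idea with a fixed-step Euler solver is shown below. It is a toy under stated assumptions: the velocity function here is a hand-written stand-in for the DiT, latents are plain Python lists, and real samplers use their own timestep schedules and conventions.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    z0: Vector,
    num_steps: int = 20,
) -> Vector:
    """Integrate dz/dt = velocity(z, t) from t=0 to t=1 with fixed Euler steps."""
    z = list(z0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(z, t)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


# Toy setup: a "noise" start point and a "clean latent" target.
noise = [0.5, -1.2, 0.3]
target = [1.0, 2.0, -1.0]


def toy_velocity(z: Vector, t: float) -> Vector:
    # Stand-in for the network: the constant velocity of the straight path
    # from noise to target. A trained model would predict this from (z, t, text).
    return [b - a for a, b in zip(noise, target)]


z1 = euler_sample(toy_velocity, noise, num_steps=20)
```

Because the toy velocity is constant along a straight path, Euler integration lands on the target up to floating-point error regardless of the step count. A learned velocity field is only approximately straight, which is why the number of steps affects output quality in practice.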