
9.3. Generative

Metrics for Generative AI

Generative AI models, such as those for text (e.g., the GPT series), images (e.g., DALL-E), or audio, are evaluated with a mix of automated quantitative metrics and qualitative human assessments. These metrics cover aspects such as quality, coherence, diversity, fidelity to the input, and ethical considerations. The tables below summarize key metrics commonly used across generative tasks, with descriptions and primary use cases. No single metric is perfect, so combinations of metrics (including human evaluation) are generally recommended.


1. Text Generation & Language Modeling

| Metric | Description | Key Use Cases |
|--------|-------------|---------------|
| Perplexity | Measures how well a probability model predicts a sample; lower = better fluency & coherence | Language modeling, next-word prediction |
| BLEU | N-gram precision overlap with reference(s); penalizes short outputs | Machine translation, dialogue, text generation |
| ROUGE | Recall-oriented n-gram / LCS overlap | Summarization, headline generation |
| METEOR | Aligns unigrams with synonyms, stemming, and word order | Translation, paraphrasing |
| BERTScore | Cosine similarity of contextual BERT embeddings (semantic) | Any text: faithfulness, QA, summarization |
| Self-BLEU / Unique n-grams | Measures diversity by treating one output as the "reference" for the others | Story generation, open-ended chat |
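
Perplexity is simply the exponential of the average per-token negative log-likelihood, so it can be computed directly from the log-probabilities a model assigns to a held-out text. A minimal sketch (the log-probabilities below are made-up numbers, not output from a real model):

```python
import math

def perplexity(token_logprobs):
    """Corpus perplexity from per-token log-probabilities (natural log).

    Lower perplexity means the model assigns higher probability to the
    observed tokens, i.e. it is less "surprised" by the text.
    """
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Hypothetical log-probabilities a language model assigned to 5 tokens.
logprobs = [-0.9, -2.3, -0.4, -1.7, -1.1]
print(round(perplexity(logprobs), 2))   # ~3.6
```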

2. Image & Visual Generation

| Metric | Description | Key Use Cases |
|--------|-------------|---------------|
| FID (Fréchet Inception Distance) | Compares feature distributions of real vs. generated images | GANs, diffusion models (e.g., Stable Diffusion) |
| Inception Score (IS) | Quality + diversity via classifier confidence & entropy | GAN evaluation (legacy; less used now) |
| Precision & Recall for Distributions | Separately measures realism (precision) and coverage (recall) | High-resolution image synthesis |
| CLIP Score | Cosine similarity between image and text-prompt embeddings | Text-to-image alignment (DALL·E, Midjourney) |
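
FID fits a Gaussian to the feature embeddings of each image set and measures the Fréchet distance between the two: ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_rΣ_g)^½). A NumPy/SciPy sketch, with random arrays standing in for real Inception features:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, gen_feats):
    """Fréchet Inception Distance between two sets of image features.

    Both inputs are (n_samples, feat_dim) arrays, normally InceptionV3
    pool activations; random features are used below purely to
    illustrate the computation.
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(512, 64))   # stand-in for real-image features
fake = rng.normal(0.3, 1.1, size=(512, 64))   # stand-in for generated-image features
print(fid(real, fake))                        # lower = distributions are closer
```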

3. Multimodal & Cross-Modal Tasks

| Metric | Description | Key Use Cases |
|--------|-------------|---------------|
| CLIP Score / T5 Score | Text-image or text-text semantic alignment | Image captioning, visual QA, retrieval |
| R@K (Recall at K) | Retrieval accuracy in a joint embedding space | Image-text retrieval |
| Human Preference (Elo, A/B) | Pairwise human judgments | Text-to-image, video, music |
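
Recall at K can be computed directly from paired image and text embeddings: a query counts as a hit if its true match appears among its K nearest neighbours in the joint space. A sketch using random vectors in place of real CLIP-style embeddings:

```python
import numpy as np

def recall_at_k(image_embs, text_embs, k=5):
    """Text-to-image Recall@K in a shared embedding space.

    Row i of each matrix is assumed to describe the same item, so the
    correct match for text i is image i. Embeddings are L2-normalized
    so the dot product equals cosine similarity.
    """
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = text_embs @ image_embs.T               # (n_texts, n_images)
    topk = np.argsort(-sims, axis=1)[:, :k]       # indices of the k most similar images
    hits = np.any(topk == np.arange(len(text_embs))[:, None], axis=1)
    return hits.mean()

rng = np.random.default_rng(1)
imgs, txts = rng.normal(size=(100, 32)), rng.normal(size=(100, 32))
print(recall_at_k(imgs, txts, k=5))   # random embeddings -> roughly 5/100
```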

4. Safety, Ethics & Fairness

| Metric | Description | Key Use Cases |
|--------|-------------|---------------|
| Toxicity Score (Perspective API, RealToxicityPrompts) | Probability of harmful content | Chatbots, content generation |
| Bias Metrics (WEAT, CrowS-Pairs, Bias-in-Bios) | Measures stereotyping in embeddings or outputs | Fairness in hiring, gender/race bias |
| Regard / Honesty Scores | Evaluates respectfulness or truthfulness | Dialogue systems, factuality |
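
To make the embedding-bias metrics concrete, the WEAT effect size measures whether one set of target words (e.g., male vs. female names) is more strongly associated with one attribute set (e.g., career vs. family terms) than the other. A sketch with random vectors standing in for real word embeddings:

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def weat_effect_size(X, Y, A, B):
    """WEAT effect size: differential association of two target word sets
    (X, Y) with two attribute word sets (A, B), all given as embedding arrays.

    s(w) = mean cos(w, a) - mean cos(w, b); the effect size is the gap in
    mean s(w) between X and Y, normalized by the std over X and Y combined.
    """
    def s(w):
        return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])
    sx = np.array([s(x) for x in X])
    sy = np.array([s(y) for y in Y])
    return (sx.mean() - sy.mean()) / np.concatenate([sx, sy]).std(ddof=1)

# Toy 8-dimensional embeddings; in practice these come from a real
# embedding model (word2vec, GloVe, or a transformer encoder).
rng = np.random.default_rng(2)
X, Y, A, B = (rng.normal(size=(5, 8)) for _ in range(4))
print(weat_effect_size(X, Y, A, B))   # near 0 for random vectors; large |d| signals bias
```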

5. General / Human-Centric Evaluation

| Metric | Description | Key Use Cases |
|--------|-------------|---------------|
| Human Evaluation (Likert, ranking, fluency/coherence) | Crowdsourced ratings on multiple axes | All domains; the gold standard |
| LLM-as-a-Judge (GPT-4 eval, reward models) | Uses a strong LLM to score outputs vs. references | Scalable alternative to human evaluation |
| HELM / BIG-bench / MMLU-style probes | Holistic benchmark suites | General capability assessment |
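
A pairwise LLM-as-a-judge setup can be as simple as the sketch below. The prompt wording is illustrative, and `call_llm` is a hypothetical stand-in for whatever chat-completion client is available:

```python
JUDGE_PROMPT = """You are an impartial judge. Given a user prompt and two
candidate responses, decide which response is better overall (helpfulness,
correctness, clarity). Answer with exactly "A" or "B".

Prompt: {prompt}

Response A: {answer_a}

Response B: {answer_b}
"""

def judge_pair(prompt, answer_a, answer_b, call_llm):
    """Pairwise LLM-as-a-judge: ask a strong model to pick the better answer.

    `call_llm` is a hypothetical callable that sends a prompt string to an
    LLM API and returns its text completion.
    """
    verdict = call_llm(JUDGE_PROMPT.format(
        prompt=prompt, answer_a=answer_a, answer_b=answer_b)).strip()
    return "A" if verdict.upper().startswith("A") else "B"
```

In practice the two responses are usually also judged in swapped order, since LLM judges tend to show position bias.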

Quick Reference by Task

| Task | Recommended Metrics |
|------|---------------------|
| Machine Translation | BLEU, METEOR, BERTScore, chrF |
| Summarization | ROUGE, BERTScore, factuality (e.g., QAGS) |
| Text-to-Image | FID, CLIP Score, human preference |
| Dialogue / Chat | Perplexity, diversity, toxicity, human rating |
| Creative Writing | Self-BLEU, MAUVE, human creativity score |
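
The diversity metrics recommended above for dialogue and creative writing are cheap to compute; distinct-n (the "Unique n-grams" idea from the first table) is the share of n-grams that are unique across a set of generations. A sketch assuming plain whitespace tokenization:

```python
def distinct_n(texts, n=2):
    """Distinct-n: unique n-grams / total n-grams across a set of generations.

    Higher values mean more lexical diversity; a model that repeats itself
    across samples scores low.
    """
    total, unique = 0, set()
    for text in texts:
        tokens = text.split()
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

samples = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "a dog slept by the door",
]
print(round(distinct_n(samples, n=2), 3))   # 11 unique / 15 total bigrams = 0.733
```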