9. Reliability
Reliability is a critical aspect of modern software systems, ensuring that services remain available, performant, and trustworthy for users. Central to managing reliability are three key concepts: Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs). This document provides a comprehensive overview of these concepts, their definitions, interrelationships, and practical applications.
1. Core Definitions
| Term | Full Name | Definition |
|---|---|---|
| SLI | Service Level Indicator | A measurable metric that reflects the actual performance or quality of a service. |
| SLO | Service Level Objective | A target value or range for an SLI that the team commits to achieve. |
| SLA | Service Level Agreement | A formal contract between a service provider and a customer that defines the expected level of service, usually expressed via SLOs and consequences for missing them. |
2. In-Depth Breakdown
SLI – The Raw Measurement
- What it is: A quantitative metric pulled from logs, monitoring systems, or instrumentation.
- Examples:
  - Availability: % of successful HTTP requests (e.g., 200 OK responses)
  - Latency: 95th percentile request latency < 200 ms
  - Error rate: Errors per second / Total requests per second
  - Throughput: Requests per second (RPS)
  - Durability: % of data writes acknowledged by 3+ replicas
- Key properties:
  - Objective (not subjective)
  - Directly measurable
  - Relevant to user experience
- How it's collected:
  - Prometheus exporters, OpenTelemetry, Cloud monitoring, application logs, synthetic probes.
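The SLI examples above can be sketched in a few lines of Python. This is a minimal illustration, not a production collector: the `Request` record shape and the sample data are hypothetical, "error" is assumed to mean a 5xx response, and the percentile uses the simple nearest-rank method rather than whatever interpolation a given monitoring system applies.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical record shape for illustration)."""
    status: int        # HTTP status code
    latency_ms: float  # time to serve the request, in milliseconds


def availability(requests: list[Request]) -> float:
    """Fraction of requests answered with 200 OK."""
    ok = sum(1 for r in requests if r.status == 200)
    return ok / len(requests)


def error_rate(requests: list[Request]) -> float:
    """Fraction of requests answered with a 5xx status (assumed definition of 'error')."""
    errors = sum(1 for r in requests if r.status >= 500)
    return errors / len(requests)


def percentile_latency(requests: list[Request], pct: float) -> float:
    """Nearest-rank percentile of latency, with pct in (0, 100]."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up sample: 18 fast successes, 1 slow success, 1 server error.
sample = [Request(200, 80.0)] * 18 + [Request(200, 450.0), Request(500, 30.0)]
print(f"availability: {availability(sample):.2%}")
print(f"error rate:   {error_rate(sample):.2%}")
print(f"p95 latency:  {percentile_latency(sample, 95):.0f} ms")
```

In practice these values usually come from a metrics backend rather than from raw request lists, but the arithmetic is the same: a ratio of good events to total events, or a percentile over a distribution.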
SLO – The Target You Commit To
- What it is: A specific, time-bound target for an SLI.
- Format: SLI ≥ threshold over a rolling window (e.g., 28 days).
- Examples:
  - Availability SLO: 99.9% of requests return 200 OK in a 28-day window.
  - Latency SLO: 95th percentile latency ≤ 150 ms over 28 days.
  - Error Budget: Allowed error = 100% – 99.9% = 0.1% → ~43 minutes of downtime per 30-day month (about 40 minutes over a 28-day window).
- Error Budget = 1 – SLO target
  - Drives reliability decisions: If budget is consumed → stop features, focus on stability.
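The error budget arithmetic can be checked with a short sketch. The function names here are illustrative, and the conversion to minutes assumes the whole budget is spent on full outages, which is a simplification when the SLI is request-based.

```python
from datetime import timedelta


def error_budget(slo_target: float) -> float:
    """Error budget as a fraction: 1 - SLO target."""
    return 1.0 - slo_target


def allowed_downtime(slo_target: float, window: timedelta) -> timedelta:
    """Downtime the budget permits if it is spent entirely on full outages."""
    return window * error_budget(slo_target)


# Compare a 28-day rolling window with a 30-day month for a 99.9% SLO.
for days in (28, 30):
    budget = allowed_downtime(0.999, timedelta(days=days))
    print(f"{days}-day window: {budget.total_seconds() / 60:.1f} min")
```

For a 99.9% target this works out to roughly 40 minutes over 28 days and roughly 43 minutes over 30 days, which is where the "~43 minutes" figure comes from.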
SLA – The Contractual Commitment
- What it is: A legal/business agreement with penalties (credits, refunds) if SLOs are breached.
- Structure:
  - List of SLOs
  - Measurement period
  - Exclusions (maintenance, force majeure)
  - Remedies (service credits)
- Example: "Provider guarantees 99.95% monthly uptime. If below, customer receives 10% credit for that month."
- SLA ≠ SLO: The SLA is an external commitment; the SLO is an internal target (often stricter than the SLA to provide a buffer).
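As a rough sketch of how the example clause might be evaluated, the following computes measured uptime for a month and the resulting credit. It assumes a 30-day measurement period, a single flat credit tier, and that excluded downtime (e.g., scheduled maintenance) is simply subtracted; real contracts define each of these differently, and all names are hypothetical.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 (assumed measurement period)


def measured_uptime(downtime_min: float, excluded_min: float = 0.0,
                    period_min: float = MINUTES_PER_30_DAY_MONTH) -> float:
    """Uptime fraction after removing excluded downtime (e.g., maintenance)."""
    counted = max(0.0, downtime_min - excluded_min)
    return 1.0 - counted / period_min


def service_credit(uptime: float, sla_target: float = 0.9995,
                   credit: float = 0.10) -> float:
    """Fraction of the monthly bill credited when uptime falls below the SLA."""
    return credit if uptime < sla_target else 0.0


# 30 minutes of downtime breaches a 99.95% SLA (which allows 21.6 minutes).
print(service_credit(measured_uptime(downtime_min=30)))
# The same outage with 20 minutes excluded as maintenance does not.
print(service_credit(measured_uptime(downtime_min=30, excluded_min=20)))
```

The exclusions matter as much as the headline percentage: the same incident can be a breach or not depending on what the contract counts as downtime.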
3. How SLI → SLO → SLA Connect (The Reliability Stack)
flowchart TD
A("SLI: Measurable Metric<br><small>e.g., % successful requests</small>") --> B("SLO: Target Value<br><small>Internal target (e.g., 99.99% uptime)</small>")
B --> C("SLA: Formal Agreement<br><small>External contract (e.g., 99.95% uptime)</small>")
C --> D[Customer Expectations]
B --> E[Error Budget Management]
A --> F[Monitoring & Alerting]
Flow:
- Choose SLIs → What matters to users?
- Set SLOs → What can we realistically achieve?
- Define SLA → What do we promise customers? (usually looser than SLO)
Best Practice:
SLO > SLA (the internal target is stricter than the contractual one) → gives an error budget buffer to avoid penalties.
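To make the buffer concrete, here is a small sketch using the figures from the diagram above (99.99% internal SLO, 99.95% external SLA). The function name and the 30-day period are assumptions for illustration.

```python
def sla_buffer_minutes(slo_target: float, sla_target: float,
                       period_days: int = 30) -> float:
    """Extra downtime (minutes) between breaching the SLO and breaching the SLA."""
    if slo_target <= sla_target:
        raise ValueError("SLO should be stricter than the SLA (SLO > SLA)")
    period_min = period_days * 24 * 60
    return (slo_target - sla_target) * period_min


buffer_min = sla_buffer_minutes(slo_target=0.9999, sla_target=0.9995)
print(f"buffer before SLA breach: {buffer_min:.2f} min per 30 days")
```

With these numbers the SLO is exhausted after about 4.3 minutes of downtime while the SLA tolerates 21.6 minutes, leaving about 17.3 minutes in which the team has already been alerted but no penalty applies. A check like this could also run in CI to reject a configuration where the SLO is not stricter than the SLA.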
4. Roles & Responsibilities
| Role | Works with SLI? | Works with SLO? | Works with SLA? | Primary Tasks |
|---|---|---|---|---|
| SRE (Site Reliability Engineer) | ✅ Yes | ✅ Yes | ⚠️ Indirectly | Define & monitor SLIs; propose & defend SLOs; manage error budgets; alerting & incident response |
| Product Manager | ⚠️ Reviews | ✅ Yes | ⚠️ Reviews | Align SLOs with user needs; balance features vs reliability; approve risk during error budget spend |
| Engineering / Dev Team | ✅ Yes | ✅ Yes | No | Instrument code for SLIs; fix issues consuming error budget; ship features within budget |
| DevOps / Platform Team | ✅ Yes | ⚠️ Indirectly | No | Build monitoring pipelines; export SLIs to dashboards; automate SLO compliance checks |
| Customer Success / Account Manager | No | No | ✅ Yes | Explain SLA to customers; handle credit requests; report SLA compliance |
| Legal / Contracts Team | No | No | ✅ Yes | Draft & negotiate SLA terms; define exclusions, remedies |
| Leadership (CTO, VP Eng) | ⚠️ Reviews | ✅ Yes | ✅ Yes | Approve SLO targets; sign off on SLA commitments; strategic reliability goals |
5. Real-World Example (Web Service)
| Component | Value |
|---|---|
| SLI | % of HTTP requests returning 2xx/3xx in < 500 ms |
| SLO | 99.9% over 28-day rolling window |
| Error Budget | 0.1% → ~43 min of allowed failure per 30-day month (~40 min per 28-day window) |
| SLA | 99.5% monthly → if missed, 15% credit |
→ Team monitors SLI daily, uses error budget to decide:
"We have 10 min left this month → no new deploys until stabilized."
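A decision like this can be automated. The sketch below works on request counts rather than minutes, since the SLI in this example is request-based. The `WindowCounts` type and the 25% freeze threshold are hypothetical; a team would pick its own policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowCounts:
    """Request counts over the 28-day window (hypothetical shape)."""
    total: int  # all requests in the window
    good: int   # requests that returned 2xx/3xx in under 500 ms


def budget_remaining(counts: WindowCounts, slo_target: float = 0.999) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if counts.total == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_bad = (1.0 - slo_target) * counts.total
    actual_bad = counts.total - counts.good
    return 1.0 - actual_bad / allowed_bad


def can_deploy(counts: WindowCounts, freeze_below: float = 0.25) -> bool:
    """Allow feature deploys only while enough budget remains (assumed policy)."""
    return budget_remaining(counts) >= freeze_below


# 770 bad requests out of 1,000,000 uses 77% of a 0.1% budget.
window = WindowCounts(total=1_000_000, good=999_230)
print(f"budget remaining: {budget_remaining(window):.0%}")
print("deploys allowed" if can_deploy(window) else "deploy freeze")
```

Having about 23% of the budget left is comparable to the "10 min left" situation quoted above, and under the assumed threshold it results in a freeze.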
6. Key Best Practices
| Practice | Why |
|---|---|
| SLIs must reflect user experience | Avoid vanity metrics (e.g., CPU ≠ happiness) |
| SLOs should be ambitious but achievable | Too tight → constant alerts; too loose → no pressure |
| Error budgets drive product decisions | Google SRE: "If no budget left, reliability > features" |
| SLA < SLO | Buffer against penalties |
| Automate SLI/SLO dashboards | Real-time visibility |
7. Summary Table
| | SLI | SLO | SLA |
|---|---|---|---|
| Nature | Metric | Target | Contract |
| Audience | Engineers | Team | Customer |
| Example | 99.95% success | 99.9% | 99.5% + credit |
| Owner | SRE / Dev | SRE + PM | Legal + CS |
| Consequence | Alert | Pause features | Refund |
Final Takeaway
SLI is what you measure
SLO is what you promise internally
SLA is what you promise externally (with money on the line)
They form a hierarchy of reliability: SLI → SLO → SLA.
Used together, they enable data-driven reliability engineering, the foundation of modern SRE practices [1].