SLI / SLO / SLA

Measuring reliability requires a shared vocabulary that connects raw telemetry to engineering decisions to customer commitments. Google's Site Reliability Engineering practice popularised a three-tier model for this purpose¹, which is now standard across the industry.

Term	Full name	Nature
SLI	Service Level Indicator	A measurable metric
SLO	Service Level Objective	An internal target for that metric
SLA	Service Level Agreement	A contractual commitment to customers

flowchart LR
    telemetry["Telemetry\n(logs, metrics, traces)"]
    sli["SLI\nRaw measurement"]
    slo["SLO\nInternal target"]
    sla["SLA\nExternal contract"]
    budget["Error Budget\nEngineering lever"]

    telemetry e1@-->|aggregated into| sli
    sli e2@-->|evaluated against| slo
    slo e3@-->|informs| sla
    slo e4@-->|remainder becomes| budget

    e1@{ animate: true }
    e2@{ animate: true }
    e3@{ animate: true }
    e4@{ animate: true }

SLI — The Raw Measurement

An SLI is a quantitative metric pulled directly from instrumentation that reflects user-perceived system quality. Not every metric is a good SLI: only metrics that correlate with whether the user is getting value from the service qualify.

Canonical SLI categories

Category	What it measures	Example measurement
Availability	Fraction of requests served successfully	`HTTP 2xx or 3xx / total HTTP requests`
Latency	Time to serve a request	`95th percentile of request duration (ms)`
Error rate	Fraction of requests resulting in errors	`HTTP 5xx / total HTTP requests`
Throughput	Volume of work the system processes	`Requests per second`
Durability	Fraction of data that can be retrieved	`Successful reads / total attempted reads`
Freshness	Age of the most recent successful write	`Time since last successful cache refresh`
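The ratio-style and percentile-style measurements in the table can be sketched in a few lines. This is a minimal illustration, not a production implementation: the `Request` record, the sample data, and the nearest-rank percentile method are assumptions made for the example, and the functions assume a non-empty list of requests.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int          # HTTP status code returned to the caller
    duration_ms: float   # time taken to serve the request


def availability(requests: list[Request]) -> float:
    """Fraction of requests answered with a 2xx or 3xx status."""
    good = sum(1 for r in requests if 200 <= r.status < 400)
    return good / len(requests)


def error_rate(requests: list[Request]) -> float:
    """Fraction of requests answered with a 5xx status."""
    bad = sum(1 for r in requests if 500 <= r.status < 600)
    return bad / len(requests)


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 18 fast successes, 1 slow success, 1 server error.
sample = [Request(200, 50.0)] * 18 + [Request(200, 400.0), Request(503, 30.0)]

print(availability(sample))                              # 0.95
print(error_rate(sample))                                # 0.05
print(percentile([r.duration_ms for r in sample], 95))   # 50.0
```

In practice a metrics backend usually computes these from counters and histograms rather than from raw request lists, so the percentile it reports is an estimate whose precision depends on the histogram buckets.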

Avoid vanity metrics

CPU utilisation, JVM heap size, and garbage collection pause time are not good SLIs on their own. A service can run at 95% CPU and still serve every request correctly; conversely, a service can sit at 10% CPU and be completely down. SLIs must reflect what the user experiences, not what the infrastructure experiences.

How SLIs are collected

Application-level instrumentation: request counters and latency histograms exported via Spring Boot Actuator / Micrometer.
Synthetic monitoring: probes that send real requests from outside the system on a schedule (e.g. Prometheus Blackbox Exporter).
Log-based: parsing access logs to count 2xx vs 5xx responses.
Client-side: measuring from the browser or mobile client for true end-user experience.
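As an illustration of the log-based approach, the sketch below counts responses per status class from access-log lines. It assumes the logs follow the Common Log Format, where the status code comes right after the quoted request line; the sample lines and the `status_classes` helper are hypothetical.

```python
import re
from collections import Counter

# Matches the three-digit status code that follows the quoted request line,
# e.g. ... "GET /orders HTTP/1.1" 200 512
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def status_classes(lines):
    """Count responses per status class ('2xx', '5xx', ...), skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)[0] + "xx"] += 1
    return counts


# Hypothetical log excerpt.
log_lines = [
    '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /orders HTTP/1.1" 200 512',
    '203.0.113.7 - - [10/Oct/2024:13:55:37 +0000] "GET /orders/42 HTTP/1.1" 200 230',
    '203.0.113.9 - - [10/Oct/2024:13:55:38 +0000] "POST /orders HTTP/1.1" 503 0',
    '203.0.113.9 - - [10/Oct/2024:13:55:39 +0000] "GET /missing HTTP/1.1" 404 0',
]

counts = status_classes(log_lines)
total = sum(counts.values())
print(counts["2xx"] / total)   # 0.5
print(counts["5xx"] / total)   # 0.25
```

Whether 4xx responses count as good, bad, or are excluded from the denominator is a policy decision that the SLI definition should state explicitly.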

SLO — The Target You Commit To

An SLO is a specific, time-bound target for an SLI. It answers: "How good does this metric need to be for us to consider the service healthy?"

Format:

SLI ≥ threshold (or ≤ threshold, for metrics such as latency and error rate where lower is better) over a rolling window

Examples:

SLO	Meaning
`Availability ≥ 99.9%` over 28 days	At most 0.1% of requests may fail in any 28-day period
`P95 latency ≤ 200 ms` over 7 days	95% of requests complete within 200 ms in any 7-day window
`Error rate ≤ 0.5%` over 24 hours	No more than 5 in 1000 requests return 5xx in any day
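The three examples differ only in the direction of the comparison. A minimal sketch of that check follows; the `Slo` class and its field names are hypothetical, and the SLI value passed in is assumed to have been aggregated over the SLO window already.

```python
import operator
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    threshold: float
    higher_is_better: bool  # True for availability, False for latency or error rate

    def is_met(self, sli_value: float) -> bool:
        """Compare the windowed SLI value against the threshold in the right direction."""
        compare = operator.ge if self.higher_is_better else operator.le
        return compare(sli_value, self.threshold)


availability_slo = Slo("availability", 0.999, higher_is_better=True)
latency_slo = Slo("p95 latency (ms)", 200.0, higher_is_better=False)
error_rate_slo = Slo("error rate", 0.005, higher_is_better=False)

print(availability_slo.is_met(0.9995))  # True
print(latency_slo.is_met(240.0))        # False
print(error_rate_slo.is_met(0.005))     # True
```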

Choosing the right SLO

SLOs must be ambitious but achievable. An SLO that is never breached is not driving reliability work — it is too loose. An SLO that is constantly breached is noise — the team stops trusting it.

Start by measuring the current SLI over 30–90 days.
Set the initial SLO at the observed baseline minus a small buffer.
Tighten it over time as the system improves.
Make SLOs public within the engineering org so all teams understand the targets they are building toward.
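The first two steps can be sketched as a small calculation. The function name, the default buffer of 0.05 percentage points, and the sample counts are assumptions chosen for illustration; a suitable buffer depends on how noisy the service's baseline is.

```python
def propose_initial_slo(good_events: int, total_events: int, buffer: float = 0.0005) -> float:
    """Observed baseline over the measurement period, minus a small buffer."""
    baseline = good_events / total_events
    return round(baseline - buffer, 4)


# Hypothetical 60-day measurement: 9,996,000 good requests out of 10,000,000.
print(propose_initial_slo(9_996_000, 10_000_000))  # 0.9991
```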

Error Budget

The error budget is the quantity of unreliability the SLO permits. It is the engineering team's licence to take risk.

\[ \text{Error Budget} = 1 - \text{SLO target} \]

For a 99.9% availability SLO over 28 days:

\[ \text{Error Budget} = 0.1\% \times 28 \text{ days} \times 24 \text{ h} \times 60 \text{ min} \approx 40.3 \text{ minutes} \]
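The same arithmetic generalises to other targets and windows. A short sketch, assuming a time-based SLO where the budget is expressed as minutes of full unavailability (the function name is hypothetical):

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of full unavailability that a time-based SLO tolerates within the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} over 28 days: {error_budget_minutes(target, 28):.1f} min")
# 99.00% over 28 days: 403.2 min
# 99.90% over 28 days: 40.3 min
# 99.99% over 28 days: 4.0 min
```

Each additional nine cuts the budget by a factor of ten, which is why tightening a target is a much larger commitment than it looks.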

If the error budget is intact, the team can:

Deploy frequently and accept some deployment risk
Run experiments on production systems
Take on more technical debt

If the error budget is exhausted, the team must:

Freeze non-essential deployments
Prioritise reliability work over feature work
Investigate and fix the root causes consuming the budget

Error budget as a negotiation tool

When a product manager wants to ship a risky feature and the SRE team is reluctant, the error budget provides an objective answer: "We have 12 minutes of budget remaining in the current window. If we deploy and it causes 30 minutes of degradation, we will miss our SLO. Let's wait until the budget has recovered."
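The reasoning in that conversation reduces to a subtraction. A minimal sketch using the 12-minute and 30-minute figures from the example; the function names are hypothetical, and the expected impact is an estimate the team has to supply.

```python
def budget_after_change(remaining_min: float, expected_impact_min: float) -> float:
    """Budget left if the change degrades service for the expected duration."""
    return remaining_min - expected_impact_min


def can_ship(remaining_min: float, expected_impact_min: float) -> bool:
    """A change fits only if its worst expected impact stays within the remaining budget."""
    return budget_after_change(remaining_min, expected_impact_min) >= 0


print(can_ship(12, 30))             # False
print(budget_after_change(12, 30))  # -18
```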

Burn rate

Burn rate measures how fast the error budget is being consumed relative to the SLO window. A burn rate of 1 means the budget will be exactly exhausted by the end of the window. A burn rate of 2 means the budget will run out in half the window.

Burn rate	Budget exhausted in	Severity
1×	End of SLO window (28 days)	Normal
2×	14 days	Warning
6×	~5 days	High
14.4×	~47 hours	Critical: page immediately

High burn-rate alerts catch fast-moving incidents early, while low burn-rate alerts catch slow degradation that would otherwise be invisible².
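Burn rate and time to exhaustion follow directly from the observed error rate and the SLO target. The sketch below also shows a two-window paging check, where a long window and a short window must both exceed the threshold so that an alert stops firing soon after the problem ends. The function names, the sample error rates, and the one-hour and five-minute windows are assumptions made for illustration.

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than the sustainable rate the budget is being spent."""
    return observed_error_rate / (1 - slo_target)


def days_until_exhausted(rate: float, window_days: float = 28.0) -> float:
    """Time for a full budget to run out if the burn rate stays constant."""
    return window_days / rate


def should_page(long_window_rate: float, short_window_rate: float, threshold: float = 14.4) -> bool:
    """Page only when both the long and the short window burn faster than the threshold."""
    return long_window_rate >= threshold and short_window_rate >= threshold


slo = 0.999
rate_1h = burn_rate(0.02, slo)   # hypothetical: 2% of requests failed over the last hour
rate_5m = burn_rate(0.03, slo)   # hypothetical: 3% failed over the last five minutes

print(round(rate_1h, 1))                        # 20.0
print(round(days_until_exhausted(rate_1h), 2))  # 1.4
print(should_page(rate_1h, rate_5m))            # True
```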

SLA — The Contractual Commitment

An SLA is a formal agreement between service provider and customer that defines the expected level of service and the remedies (typically service credits) if that level is not met. SLAs are legal documents; missing them has financial consequences.

Structure of an SLA

Covered services — which products and regions are in scope
SLO targets — the availability or latency commitments
Measurement methodology — how uptime is calculated, which incidents count
Exclusions — scheduled maintenance, force majeure, customer-caused outages
Remedies — service credit schedule for each tier of breach
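The measurement methodology, exclusions, and remedies interact as a small calculation. The sketch below uses an entirely hypothetical credit schedule and a simple exclusion rule (excluded downtime is subtracted from counted downtime); real SLAs define both differently, so the numbers here do not describe any provider's terms.

```python
# Hypothetical credit schedule: (minimum uptime, credit as a fraction of charges),
# ordered from the best tier to the worst.
CREDIT_TIERS = [(0.999, 0.0), (0.99, 0.10), (0.95, 0.25), (0.0, 1.00)]


def monthly_uptime(total_minutes: float, downtime_minutes: float, excluded_minutes: float = 0) -> float:
    """Uptime after removing excluded downtime such as scheduled maintenance."""
    counted = max(downtime_minutes - excluded_minutes, 0)
    return 1 - counted / total_minutes


def service_credit(uptime: float) -> float:
    """Return the credit for the first tier whose uptime floor is met."""
    for floor, credit in CREDIT_TIERS:
        if uptime >= floor:
            return credit
    return CREDIT_TIERS[-1][1]


# 30-day month (43,200 minutes), 120 minutes down, 60 of them scheduled maintenance.
uptime = monthly_uptime(43_200, 120, excluded_minutes=60)
print(f"{uptime:.4%}")          # 99.8611%
print(service_credit(uptime))   # 0.1
```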

Real-world SLA examples

Provider	Service	Committed uptime	Credit for breach
AWS	EC2	99.99% monthly	10–30% of affected charges
AWS	S3	99.9% monthly	10–25% of affected charges
AWS	RDS Multi-AZ	99.95% monthly	10–100% of affected charges
Google Cloud	GCE	99.99% monthly	10–50% of charges
Google Cloud	Cloud Storage	99.9% monthly	10–25% of charges
Azure	Virtual Machines (2+ instances)	99.99% monthly	10–25% of monthly charges

SLA vs SLO

SLA and SLO are often confused. The critical distinction:

	SLO	SLA
Audience	Engineering team	Customer / business
Nature	Internal target	Legal contract
Strictness	Higher (stricter)	Lower (more lenient)
Consequence of breach	Reliability work, feature freeze	Service credits, penalties

Best practice: set your SLO stricter than your SLA. The buffer between SLO and SLA is the safety margin that prevents a reliability incident from triggering contractual penalties. If the SLA is 99.9%, the SLO might be 99.95%.
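To see how much room that buffer provides, the two targets can be converted to allowed downtime. The figures below assume, for illustration only, that both targets are measured over the same 28-day window (40,320 minutes); an actual SLA will define its own measurement period.

\[ \text{SLA allowance} = (1 - 0.999) \times 40\,320 \text{ min} = 40.32 \text{ min} \]

\[ \text{SLO allowance} = (1 - 0.9995) \times 40\,320 \text{ min} = 20.16 \text{ min} \]

\[ \text{Safety margin} = 40.32 \text{ min} - 20.16 \text{ min} = 20.16 \text{ min} \]

Under these assumptions the team exhausts its internal budget after about 20 minutes of downtime and still has roughly 20 more minutes before the contractual threshold is crossed.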

Best practices

Practice	Rationale
Fewer, more meaningful SLIs	Two or three well-chosen indicators outperform ten noisy ones. Start with availability and P99 latency.
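28-day rolling windows	A 28-day window always spans exactly four weeks, so every window contains the same mix of weekdays and weekends, and it is close enough to a calendar month to sit alongside monthly billing cycles.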
Automate SLO dashboards	Manual calculation is error-prone and too slow. Grafana + Prometheus or Cloud Monitoring should show current SLO compliance in real time.
Review SLOs quarterly	User expectations and system capabilities evolve. SLOs that were correct at launch may need adjustment after a year of scale.
Never set SLO = 100%	100% is unachievable and leaves no room for planned maintenance or safe deployments.

¹ BEYER, B. et al. Site Reliability Engineering. Google / O'Reilly, 2016. Chapters 4–5.
² MURPHY, N. et al. The Site Reliability Workbook. Google / O'Reilly, 2018. Chapter 5: Alerting on SLOs.
³ AWS Service Level Agreements. Amazon Web Services.
⁴ Google Cloud Service Level Agreements. Google LLC.