Observability

A system you cannot observe is a system you cannot operate. Monitoring tells you that something is wrong; observability tells you why. The distinction matters in microservices: a request that touches eight services and three databases cannot be diagnosed from a single metric on a single dashboard. Observability is the property of a system that makes its internal state inferable from external outputs.

The Three Pillars

Observability is built on three complementary data types. Each answers a different question about system behaviour.

Pillar	Answers	Toolchain	Retention
Metrics	Is the system healthy right now? Is it trending up or down?	Prometheus → Grafana	Weeks to months (aggregated)
Logs	What exactly happened, and in what sequence?	ELK / EFK / Loki	Days to weeks
Traces	Which service in this request was slow? Where did the error originate?	Jaeger / Tempo / Zipkin	Days

Observability ≠ Monitoring

Monitoring asks pre-defined questions: "Is the error rate below 1%?" Observability lets you ask questions you didn't think of at design time: "Why are exactly 7% of requests from mobile clients in São Paulo timing out on a Tuesday?" Observability requires all three pillars working together.

Metrics

Prometheus Data Model

Every time series has a metric name and a set of key-value labels:

http_requests_total{method="GET", status="200", service="order"} 4823
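As a rough illustration of this data model, the sketch below renders one sample in the same text form as the line above. `ExpositionLine` and `render` are hypothetical names chosen for this example, not part of any Prometheus client library; a real service would rely on Micrometer or a client library instead.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration; not part of any Prometheus client library. */
final class ExpositionLine {

    /** Renders one sample as name{label="value",...} value. */
    static String render(String name, Map<String, String> labels, long value) {
        // Sort labels so the same label set always produces the same text.
        String labelPart = new TreeMap<>(labels).entrySet().stream()
                .map(e -> e.getKey() + "=\"" + escape(e.getValue()) + "\"")
                .collect(Collectors.joining(","));
        return name + (labelPart.isEmpty() ? "" : "{" + labelPart + "}") + " " + value;
    }

    /** Escapes backslash, double quote and newline inside a label value. */
    private static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }
}
```

The metric name plus the full label set identifies the series, so changing any label value (for example `status="500"`) yields a different time series.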

Type	What it measures	Example
Counter	Monotonically increasing value; never decreases	`http_requests_total`, `errors_total`
Gauge	Current value that can go up or down	`memory_bytes`, `active_connections`
Histogram	Distribution of values in configurable buckets	`http_request_duration_seconds_bucket{le="0.1"}`
Summary	Pre-calculated quantiles (less flexible than histogram)	`http_request_duration_seconds{quantile="0.99"}`
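To make the histogram row concrete, here is a minimal sketch of how a quantile can be estimated from cumulative `le` buckets by linear interpolation inside the bucket that contains the target rank. The class and method names are hypothetical, and the sketch assumes the lowest bucket starts at 0; it illustrates the idea rather than reproducing Prometheus's implementation.

```java
/** Hypothetical helper: estimates a quantile from cumulative histogram buckets. */
final class HistogramQuantile {

    /**
     * @param q                quantile between 0 and 1, e.g. 0.99
     * @param upperBounds      bucket upper bounds ("le" values), ascending; the last may be +Infinity
     * @param cumulativeCounts number of observations less than or equal to each upper bound
     */
    static double estimate(double q, double[] upperBounds, double[] cumulativeCounts) {
        double total = cumulativeCounts[cumulativeCounts.length - 1];
        if (total == 0) {
            return Double.NaN; // no observations, nothing to estimate
        }
        double rank = q * total;
        for (int i = 0; i < upperBounds.length; i++) {
            if (cumulativeCounts[i] < rank) {
                continue; // target rank lies in a higher bucket
            }
            if (Double.isInfinite(upperBounds[i])) {
                // Cannot interpolate in an unbounded bucket: fall back to the highest finite bound.
                return i == 0 ? Double.NaN : upperBounds[i - 1];
            }
            double lowerBound = i == 0 ? 0.0 : upperBounds[i - 1];
            double countBelow = i == 0 ? 0.0 : cumulativeCounts[i - 1];
            double inBucket = cumulativeCounts[i] - countBelow;
            if (inBucket == 0) {
                return lowerBound;
            }
            // Assume observations are spread evenly within the bucket.
            return lowerBound + (upperBounds[i] - lowerBound) * (rank - countBelow) / inBucket;
        }
        return Double.NaN;
    }
}
```

Because the result is interpolated, its accuracy depends on how well the bucket boundaries match the real latency distribution.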

PromQL Basics

Query	Meaning
`rate(http_requests_total[5m])`	Requests per second over 5-minute window
`sum by (service) (rate(http_errors_total[5m]))`	Errors per second, grouped by service
`histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))`	P99 latency aggregated across all series
`increase(http_requests_total[1h])`	Total requests in last hour
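The idea behind `rate()` on a counter can be sketched in a few lines: sum the increases between consecutive samples, treat any drop as a counter reset, and divide by the elapsed time. This is a simplified illustration with hypothetical names; it deliberately omits the extrapolation to the window boundaries that Prometheus performs.

```java
/** Hypothetical helper: simplified per-second rate over a window of counter samples. */
final class CounterRate {

    /**
     * @param timestampsSeconds sample timestamps in seconds, ascending
     * @param values            counter values at those timestamps
     */
    static double rate(long[] timestampsSeconds, double[] values) {
        if (values.length < 2) {
            return Double.NaN; // a rate needs at least two samples
        }
        double increase = 0.0;
        for (int i = 1; i < values.length; i++) {
            double delta = values[i] - values[i - 1];
            // A counter never decreases, so a drop means the process restarted from zero.
            increase += delta >= 0 ? delta : values[i];
        }
        long elapsed = timestampsSeconds[timestampsSeconds.length - 1] - timestampsSeconds[0];
        return elapsed > 0 ? increase / elapsed : Double.NaN;
    }
}
```

The reset handling is why counters should be queried through `rate()` or `increase()` rather than plotted raw: a restart would otherwise look like a large negative spike.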

RED and USE Methods

Method	For	Metrics
RED (Tom Wilkie)	Services / APIs	Rate (requests/sec), Errors (error rate), Duration (latency percentiles)
USE (Brendan Gregg)	Infrastructure / Resources	Utilisation (% busy), Saturation (queue depth), Errors

Which method to use?

Use RED for service-level dashboards (what users experience), USE for infrastructure dashboards (what resources experience). Both together give complete coverage.

Metrics Pipeline

flowchart LR
    app["Spring Boot\n(Actuator + Micrometer)"]
    prom["Prometheus"]
    grafana["Grafana\n(dashboards + alerts)"]
    app -->|"/actuator/prometheus, scraped every 15 s"| prom
    prom -->|"PromQL queries"| grafana

Spring Boot Actuator configuration to expose the Prometheus endpoint:

management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  metrics:
    tags:
      application: ${spring.application.name}

Logs

Structured vs Unstructured

	Unstructured	Structured (JSON)
Example	`2024-01-15 ERROR Order 42 failed`	`{"level":"ERROR","orderId":42,"msg":"failed","traceId":"abc123"}`
Searchable	grep / regex only	Full field queries
Aggregatable	Hard	Easy (Kibana, Loki)
Machine-parseable	Brittle	Reliable
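A minimal sketch of what producing the structured form involves is shown below. `JsonLogLine` is a hypothetical helper written with the JDK only, and its escaping covers just backslash, double quote and newline; in a real service a JSON encoder plugged into the logging framework would normally do this work.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper: renders one log event as a single-line JSON object. */
final class JsonLogLine {

    static String format(String level, String msg, Map<String, Object> fields) {
        // LinkedHashMap keeps insertion order so the output is stable.
        Map<String, Object> all = new LinkedHashMap<>();
        all.put("level", level);
        all.put("msg", msg);
        all.putAll(fields);
        return all.entrySet().stream()
                .map(e -> quote(e.getKey()) + ":" + render(e.getValue()))
                .collect(Collectors.joining(",", "{", "}"));
    }

    /** Numbers and booleans stay unquoted so the log store can aggregate on them. */
    private static String render(Object value) {
        return (value instanceof Number || value instanceof Boolean)
                ? value.toString()
                : quote(String.valueOf(value));
    }

    /** Simplified escaping: other control characters are not handled in this sketch. */
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n") + "\"";
    }
}
```

Keeping `orderId` as a number and `traceId` as its own field is what makes queries such as "all ERROR lines for order 42" a field match rather than a regex.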

Log Levels

Level	Use
TRACE	Extremely detailed — every method entry/exit. Never in production.
DEBUG	Diagnostic — variable values, decision points. Dev/staging only.
INFO	Significant events — startup, config loaded, request received
WARN	Something unexpected but handled — retry triggered, fallback used
ERROR	Failure requiring attention — exception caught, downstream unreachable

Correlation IDs and MDC

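When a request touches multiple services, a shared traceId in every log line lets you reconstruct the full call sequence. The Mapped Diagnostic Context (MDC), provided by SLF4J and Logback rather than by Spring itself, holds per-thread key-value pairs that the log pattern can reference. The OpenTelemetry Java agent can populate trace identifiers in MDC automatically (by default under keys such as trace_id and span_id). To do it by hand, place the trace ID in MDC at the entry point, clear it when the request finishes, and reference it in the Logback pattern: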

MDC.put("traceId", traceId);
try {
    // ... process request
} finally {
    MDC.remove("traceId"); // pooled threads are reused, so always clear the entry
}

Logback pattern to include the trace ID:

%d{ISO8601} [%X{traceId}] %-5level %logger{36} - %msg%n

Log Aggregation Pipeline

flowchart LR
    svc1["Service A\n(JSON logs → stdout)"]
    svc2["Service B\n(JSON logs → stdout)"]
    svc3["Service C\n(JSON logs → stdout)"]
    collector["Filebeat / Fluentd\n(collector)"]
    elastic["Elasticsearch\n(index + store)"]
    kibana["Kibana\n(search + dashboards)"]
    svc1 --> collector
    svc2 --> collector
    svc3 --> collector
    collector --> elastic
    elastic --> kibana

Loki as a lighter alternative

Loki (Grafana Labs) indexes only labels, not full text, making it much cheaper to operate than Elasticsearch. If you already run Prometheus and Grafana, adding Loki gives you log aggregation with zero new UI to learn — all signals live in the same Grafana dashboards.

Distributed Traces

Anatomy of a Trace

A trace represents a single end-to-end request. It is composed of spans — one per operation (service call, DB query, cache lookup, etc.).

Field	Description	Example
`traceId`	Globally unique 128-bit ID for the entire request	`4bf92f3577b34da6a3ce929d0e0e4736`
`spanId`	Unique ID for this operation	`00f067aa0ba902b7`
`parentSpanId`	Span that triggered this one	`a2fb4a1d1a96d312`
`name`	Operation name	`GET /orders/{id}`
`startTime` / `duration`	When it started and how long it took	`2024-01-15T10:30:00Z, 45ms`
`status`	OK / ERROR	`ERROR`
`attributes`	Key-value tags	`http.status_code=500, db.type=postgresql`

W3C `traceparent` Header

The traceparent header propagates trace context across service boundaries using the format version-traceId-parentSpanId-flags:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

Field	Value	Meaning
`version`	`00`	W3C spec version
`traceId`	`4bf92f3577b34da6a3ce929d0e0e4736`	128-bit globally unique trace ID
`parentSpanId`	`00f067aa0ba902b7`	64-bit ID of the span that sent this request
`flags`	`01`	Sampling flag (`01` = sampled)

Each service forwards the header on its downstream calls, keeping version, traceId and flags as received and replacing parentSpanId with the ID of its own current span. This is how trace context crosses service boundaries without any service needing to know the full trace.
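The propagation rule can be sketched as a small value class that parses an incoming header and builds the outgoing one. `TraceParent` and its methods are hypothetical names for illustration, and the sketch accepts only version `00`; in practice the OpenTelemetry propagators handle this for you.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical value class for a version-00 traceparent header. */
final class TraceParent {
    final String traceId;
    final String parentSpanId;
    final String flags;

    TraceParent(String traceId, String parentSpanId, String flags) {
        this.traceId = traceId;
        this.parentSpanId = parentSpanId;
        this.flags = flags;
    }

    /** Parses "00-<32 hex>-<16 hex>-<2 hex>" and rejects anything else, including all-zero IDs. */
    static TraceParent parse(String header) {
        String[] parts = header.trim().split("-", -1);
        boolean valid = parts.length == 4
                && parts[0].equals("00")
                && parts[1].matches("[0-9a-f]{32}") && !parts[1].matches("0+")
                && parts[2].matches("[0-9a-f]{16}") && !parts[2].matches("0+")
                && parts[3].matches("[0-9a-f]{2}");
        if (!valid) {
            throw new IllegalArgumentException("Invalid traceparent: " + header);
        }
        return new TraceParent(parts[1], parts[2], parts[3]);
    }

    /** Header for a downstream call: same trace ID and flags, new parent span ID. */
    TraceParent withParent(String currentSpanId) {
        return new TraceParent(traceId, currentSpanId, flags);
    }

    /** The lowest bit of the flags byte is the sampled flag. */
    boolean isSampled() {
        return (Integer.parseInt(flags, 16) & 0x01) == 0x01;
    }

    String toHeader() {
        return "00-" + traceId + "-" + parentSpanId + "-" + flags;
    }

    /** Random non-zero 64-bit span ID as 16 lowercase hex characters. */
    static String newSpanId() {
        long id;
        do {
            id = ThreadLocalRandom.current().nextLong();
        } while (id == 0L);
        return String.format("%016x", id);
    }
}
```

A service would call `TraceParent.parse(incoming).withParent(TraceParent.newSpanId()).toHeader()` before each outbound request, which is exactly the "keep the trace ID, swap the parent" rule described above.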

Request Flow with Spans

sequenceDiagram
    autonumber
    actor User
    participant GW as Gateway [span: root, 120ms]
    participant OS as OrderService [span: child, 95ms]
    participant DB as PostgreSQL [span: leaf, 40ms]

    User->>+GW: GET /orders/42<br/>traceparent: 00-abc...-root-01
    GW->>+OS: GET /orders/42<br/>traceparent: 00-abc...-gw_span-01
    OS->>+DB: SELECT * FROM orders WHERE id=42
    DB-->>-OS: row data
    OS-->>-GW: 200 OK (95ms)
    GW-->>-User: 200 OK (120ms)
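To answer "which service was slow?" from spans like these, one common approach is to compute each span's self time: its duration minus the time spent in its direct children. The sketch below uses hypothetical names and assumes child spans run one after another without overlapping; parallel children would need interval arithmetic instead.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: attributes time in a trace to the span that actually spent it. */
final class SpanSelfTime {

    /** Minimal span model; parentSpanId is null for the root span. */
    record Span(String spanId, String parentSpanId, String name, long durationMs) {}

    /** Returns span name -> self time in milliseconds. */
    static Map<String, Long> selfTimes(List<Span> spans) {
        // Sum the durations of the direct children of every span.
        Map<String, Long> childTotals = new HashMap<>();
        for (Span s : spans) {
            if (s.parentSpanId() != null) {
                childTotals.merge(s.parentSpanId(), s.durationMs(), Long::sum);
            }
        }
        Map<String, Long> result = new HashMap<>();
        for (Span s : spans) {
            result.put(s.name(), s.durationMs() - childTotals.getOrDefault(s.spanId(), 0L));
        }
        return result;
    }
}
```

With the durations from the diagram (Gateway 120 ms, OrderService 95 ms, PostgreSQL 40 ms), the self times are 25 ms, 55 ms and 40 ms, which points at OrderService's own processing rather than the database as the largest contributor.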

Sampling Strategies

Strategy	How	Tradeoff
Head sampling	Decision made at the first span	Fast, low overhead, but misses rare errors
Tail sampling	Decision made after trace is complete	Can prioritise errors/slow traces, but higher overhead
Rate-based	Keep X% of all traces	Simple, predictable cost

Production recommendation

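Head and tail sampling do not simply stack: a trace dropped at the head never reaches the tail sampler, so head sampling at 1–10% also discards 90–99% of error traces. To ensure every incident has full trace data, export all spans to the OTel Collector and apply a tail-sampling policy there that keeps 100% of traces containing at least one ERROR span plus a 1–10% probabilistic sample of the rest. If that span volume is too costly to ship, fall back to head sampling at 1–10% and accept that some rare errors will be missed.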
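The two decision points can be sketched in plain Java. `SamplingSketch` is a hypothetical class: the head decision is derived deterministically from the trace ID so that every service reaches the same verdict, and the tail decision additionally sees the span statuses of the finished trace. Real deployments would configure the SDK sampler and the Collector instead of hand-writing this.

```java
import java.util.List;

/** Hypothetical sketch of head and tail sampling decisions. */
final class SamplingSketch {

    /**
     * Head decision: made when the trace starts, using only the 32-hex-character trace ID.
     * The same ID always gives the same answer, so all services agree.
     */
    static boolean headSample(String traceIdHex, double ratio) {
        if (ratio >= 1.0) return true;
        if (ratio <= 0.0) return false;
        // Take the low 64 bits of the 128-bit trace ID and clear the sign bit.
        long randomPart = Long.parseUnsignedLong(traceIdHex.substring(16), 16) & Long.MAX_VALUE;
        return randomPart < (long) (ratio * Long.MAX_VALUE);
    }

    /**
     * Tail decision: made after the trace is complete, so it can keep every trace
     * with an ERROR span and only a baseline fraction of the healthy ones.
     */
    static boolean tailSample(String traceIdHex, List<String> spanStatuses, double baselineRatio) {
        return spanStatuses.contains("ERROR") || headSample(traceIdHex, baselineRatio);
    }
}
```

Note that `tailSample` can only keep error traces whose spans were actually recorded and exported, which is why the tail decision needs to see unsampled traffic.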

OpenTelemetry (OTel)

OpenTelemetry is the CNCF standard for generating, collecting, and exporting telemetry. It replaces vendor-specific SDKs (Zipkin client, Jaeger client, etc.) with a single neutral API, so switching backends requires only a configuration change, not a code change.

Component	Role
API	Language-specific interfaces — Tracer, Meter, Logger
SDK	Implementation of the API; includes sampling, batching, exporters
Collector	Standalone process that receives, processes, and exports telemetry
Exporter	Protocol adapter (OTLP, Jaeger, Zipkin, Prometheus)

flowchart LR
    app["Application\n(OTel SDK)"]
    collector["OTel Collector"]
    prom["Prometheus\n(metrics)"]
    jaeger["Jaeger\n(traces)"]
    loki["Loki\n(logs)"]
    app -->|"OTLP (gRPC/HTTP)"| collector
    collector --> prom
    collector --> jaeger
    collector --> loki

Java Zero-Code Instrumentation

The OTel Java agent instruments Spring Boot automatically via bytecode injection — no code changes required.

ADD https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar /app/opentelemetry-javaagent.jar

ENTRYPOINT ["java", \
  "-javaagent:/app/opentelemetry-javaagent.jar", \
  "-jar", "/app/app.jar"]

Configure the agent via environment variables in Docker Compose:

environment:
  OTEL_SERVICE_NAME: order-service
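  OTEL_EXPORTER_OTLP_ENDPOINT: http://jaeger:4317
  OTEL_EXPORTER_OTLP_PROTOCOL: grpc  # 4317 is the OTLP/gRPC port; recent agent versions default to http/protobuf (port 4318)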
  OTEL_METRICS_EXPORTER: none
  OTEL_LOGS_EXPORTER: none

Why disable metrics/logs exporters?

The Java agent can export all three signals via OTLP. We disable metrics and logs here because Prometheus already scrapes metrics via Actuator and Loki (or ELK) handles logs separately. Enabling all three from the agent would duplicate data and inflate storage costs.

Manual Spans

When automatic instrumentation is not granular enough, add custom spans programmatically. Add the OTel API dependency:

<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-api</artifactId>
</dependency>

Wrap business logic in a span to capture timing, attributes, and exceptions:

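import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

// Assumes a Tracer bean is defined, for example one that returns
// GlobalOpenTelemetry.getTracer("order-service"); the Java agent alone does not register one.
@Autowired
private Tracer tracer;

public Order processOrder(OrderRequest request) {
    Span span = tracer.spanBuilder("processOrder").startSpan();
    // makeCurrent() puts the span in the active context, so nested spans become its children
    try (Scope scope = span.makeCurrent()) {
        span.setAttribute("order.customerId", request.customerId());
        // ... business logic
        return order;
    } catch (Exception e) {
        // Record the failure on the span, then let the exception propagate
        span.recordException(e);
        span.setStatus(StatusCode.ERROR, e.getMessage());
        throw e;
    } finally {
        span.end();
    }
}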

Grafana Unified Stack (LGTM)

Loki + Grafana + Tempo + Mimir/Prometheus — the LGTM stack — puts all three observability signals into a single UI. The key advantage is signal correlation: you can jump from a metric spike to the traces that caused it to the log lines that explain it, all without leaving Grafana.

flowchart LR
    app["Application"]
    prom["Prometheus\n(metrics)"]
    loki["Loki\n(logs)"]
    tempo["Tempo\n(traces)"]
    grafana["Grafana"]
    app --> prom
    app --> loki
    app --> tempo
    prom --> grafana
    loki --> grafana
    tempo --> grafana

A typical drill-down workflow: a metric alert fires in Grafana → click "View Traces" to open the correlated trace in Tempo → click a span to see the correlated log lines from Loki. This cross-signal navigation is what separates an observability platform from three isolated monitoring tools.

DORA Metrics

DORA (DevOps Research and Assessment) identified four metrics that predict software delivery performance and organisational performance. Elite-performing teams score well on all four simultaneously.

Metric	Measures	Elite benchmark (approximate; varies by report year)
Deployment Frequency	How often code reaches production	Multiple times/day
Lead Time for Changes	Commit to production time	< 1 hour
Change Failure Rate	% of deployments causing incidents	< 5%
Mean Time to Recovery (MTTR)	How long to recover from an incident	< 1 hour

DORA and observability

DORA metrics are themselves telemetry. Deployment frequency comes from CI/CD logs. MTTR requires precise incident start and end timestamps from your monitoring system. Change Failure Rate requires correlating deployment events with error-rate spikes. You cannot improve what you cannot measure — and you cannot measure it without observability infrastructure.
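As a rough sketch of how these four numbers fall out of telemetry, the class below derives them from deployment and incident records. The record shapes and names are assumptions made for this example; in practice the timestamps would come from your CI/CD system and incident tooling.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

/** Hypothetical sketch: computes the four DORA metrics from event records. */
final class DoraMetrics {

    record Deployment(Instant committedAt, Instant deployedAt, boolean causedIncident) {}

    record Incident(Instant startedAt, Instant resolvedAt) {}

    /** Deployment Frequency: deployments per day over the observation window. */
    static double deploymentsPerDay(List<Deployment> deployments, Duration window) {
        return deployments.size() / (window.getSeconds() / 86_400.0);
    }

    /** Lead Time for Changes: mean time from commit to production. */
    static Duration meanLeadTime(List<Deployment> deployments) {
        double seconds = deployments.stream()
                .mapToLong(d -> Duration.between(d.committedAt(), d.deployedAt()).getSeconds())
                .average().orElse(0.0);
        return Duration.ofSeconds(Math.round(seconds));
    }

    /** Change Failure Rate: share of deployments that caused an incident. */
    static double changeFailureRate(List<Deployment> deployments) {
        if (deployments.isEmpty()) return 0.0;
        long failed = deployments.stream().filter(Deployment::causedIncident).count();
        return (double) failed / deployments.size();
    }

    /** MTTR: mean time from incident start to resolution. */
    static Duration meanTimeToRecovery(List<Incident> incidents) {
        double seconds = incidents.stream()
                .mapToLong(i -> Duration.between(i.startedAt(), i.resolvedAt()).getSeconds())
                .average().orElse(0.0);
        return Duration.ofSeconds(Math.round(seconds));
    }
}
```

The `causedIncident` flag is the hard part in practice: setting it requires correlating each deployment event with the error-rate data, which is exactly where the observability stack comes in.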

Observability Snapshot: Spike Simulation

The chart below simulates what the three observability pillars look like during a traffic spike. Panel 1 shows the request rate rising sharply at steps 10–12; Panel 2 shows error events clustering in the same window; Panel 3 shows trace span durations stretching out under load.

The three panels share a timeline: the spike visible in Panel 1 directly explains the error cluster in Panel 2, and both explain the stretched span durations in Panel 3. Without all three pillars, you would see only one piece of the picture.

BEYER, B. et al. Site Reliability Engineering: How Google Runs Production Systems. O'Reilly, 2016.
MAJORS, C.; FONG-JONES, L.; MIRANDA, G. Observability Engineering. O'Reilly, 2022.
OPENTELEMETRY. opentelemetry.io: specification, SDKs, Collector.
FORSGREN, N.; HUMBLE, J.; KIM, G. Accelerate: The Science of Lean Software and DevOps. IT Revolution, 2018.