Skip to content

Observability

A system you cannot observe is a system you cannot operate. Monitoring tells you that something is wrong; observability tells you why. The distinction matters in microservices: a request that touches eight services and three databases cannot be diagnosed from a single metric on a single dashboard. Observability is the property of a system that makes its internal state inferable from external outputs.


The Three Pillars

Observability is built on three complementary data types. Each answers a different question about system behaviour.

Pillar Answers Toolchain Retention
Metrics Is the system healthy right now? Is it trending up or down? Prometheus → Grafana Weeks to months (aggregated)
Logs What exactly happened, and in what sequence? ELK / EFK / Loki Days to weeks
Traces Which service in this request was slow? Where did the error originate? Jaeger / Tempo / Zipkin Days

Observability ≠ Monitoring

Monitoring asks pre-defined questions: "Is the error rate below 1%?" Observability lets you ask questions you didn't think of at design time: "Why are exactly 7% of requests from mobile clients in São Paulo timing out on a Tuesday?" Observability requires all three pillars working together.


Metrics

Prometheus Data Model

Every time series has a metric name and a set of key-value labels:

http_requests_total{method="GET", status="200", service="order"} 4823
Type What it measures Example
Counter Monotonically increasing value; never decreases http_requests_total, errors_total
Gauge Current value that can go up or down memory_bytes, active_connections
Histogram Distribution of values in configurable buckets http_request_duration_seconds{le="0.1"}
Summary Pre-calculated quantiles (less flexible than histogram) http_request_duration_seconds{quantile="0.99"}

PromQL Basics

Query Meaning
rate(http_requests_total[5m]) Requests per second over 5-minute window
sum by (service) (rate(http_errors_total[5m])) Error rate grouped by service
histogram_quantile(0.99, rate(http_duration_bucket[5m])) P99 latency
increase(http_requests_total[1h]) Total requests in last hour

RED and USE Methods

Method For Metrics
RED (Brendan Gregg + Tom Wilkie) Services / APIs Rate (requests/sec), Errors (error rate), Duration (latency percentiles)
USE (Brendan Gregg) Infrastructure / Resources Utilisation (% busy), Saturation (queue depth), Errors

Which method to use?

Use RED for service-level dashboards (what users experience), USE for infrastructure dashboards (what resources experience). Both together give complete coverage.

Metrics Pipeline

flowchart LR
    app["Spring Boot\n(Actuator + Micrometer)"]
    prom["Prometheus"]
    grafana["Grafana\n(dashboards + alerts)"]
    app -->|"exposes /actuator/prometheus"| prom
    prom -->|"scrape every 15 s"| prom
    prom -->|"PromQL queries"| grafana

Spring Boot Actuator configuration to expose the Prometheus endpoint:

management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  metrics:
    tags:
      application: ${spring.application.name}

Logs

Structured vs Unstructured

Unstructured Structured (JSON)
Example 2024-01-15 ERROR Order 42 failed {"level":"ERROR","orderId":42,"msg":"failed","traceId":"abc123"}
Searchable grep / regex only Full field queries
Aggregatable Hard Easy (Kibana, Loki)
Machine-parseable Brittle Reliable

Log Levels

Level Use
TRACE Extremely detailed — every method entry/exit. Never in production.
DEBUG Diagnostic — variable values, decision points. Dev/staging only.
INFO Significant events — startup, config loaded, request received
WARN Something unexpected but handled — retry triggered, fallback used
ERROR Failure requiring attention — exception caught, downstream unreachable

Correlation IDs and MDC

When a request touches multiple services, a shared traceId in every log line allows reconstructing the full call sequence. Spring's Mapped Diagnostic Context (MDC) propagates this automatically with OpenTelemetry. Add the trace ID to every log line by placing it in MDC at the entry point and referencing it in the Logback pattern:

MDC.put("traceId", traceId);
// ... process request
MDC.remove("traceId");

Logback pattern to include the trace ID:

%d{ISO8601} [%X{traceId}] %-5level %logger{36} - %msg%n

Log Aggregation Pipeline

flowchart LR
    svc1["Service A\n(JSON logs → stdout)"]
    svc2["Service B\n(JSON logs → stdout)"]
    svc3["Service C\n(JSON logs → stdout)"]
    collector["Filebeat / Fluentd\n(collector)"]
    elastic["Elasticsearch\n(index + store)"]
    kibana["Kibana\n(search + dashboards)"]
    svc1 --> collector
    svc2 --> collector
    svc3 --> collector
    collector --> elastic
    elastic --> kibana

Loki as a lighter alternative

Loki (Grafana Labs) indexes only labels, not full text, making it much cheaper to operate than Elasticsearch. If you already run Prometheus and Grafana, adding Loki gives you log aggregation with zero new UI to learn — all signals live in the same Grafana dashboards.


Distributed Traces

Anatomy of a Trace

A trace represents a single end-to-end request. It is composed of spans — one per operation (service call, DB query, cache lookup, etc.).

Field Description Example
traceId Globally unique ID for the entire request 4bf92f3577b34da6
spanId Unique ID for this operation 00f067aa0ba902b7
parentSpanId Span that triggered this one a2fb4a1d1a96d312
name Operation name GET /orders/{id}
startTime / duration When it started and how long it took 2024-01-15T10:30:00Z, 45ms
status OK / ERROR ERROR
attributes Key-value tags http.status_code=500, db.type=postgresql

W3C traceparent Header

The traceparent header propagates trace context across service boundaries using the format version-traceId-parentSpanId-flags:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
Field Value Meaning
version 00 W3C spec version
traceId 4bf92f3577b34da6a3ce929d0e0e4736 128-bit globally unique trace ID
parentSpanId 00f067aa0ba902b7 64-bit ID of the span that sent this request
flags 01 Sampling flag (01 = sampled)

Every service propagates this header unchanged on downstream calls — only the parentSpanId changes to the current span's ID. This is how trace context crosses service boundaries without any service needing to understand the full trace.

Request Flow with Spans

sequenceDiagram
    autonumber
    actor User
    participant GW as Gateway [span: root, 120ms]
    participant OS as OrderService [span: child, 95ms]
    participant DB as PostgreSQL [span: leaf, 40ms]

    User->>+GW: GET /orders/42<br/>traceparent: 00-abc...-root-01
    GW->>+OS: GET /orders/42<br/>traceparent: 00-abc...-gw_span-01
    OS->>+DB: SELECT * FROM orders WHERE id=42
    DB-->>-OS: row data
    OS-->>-GW: 200 OK (95ms)
    GW-->>-User: 200 OK (120ms)

Sampling Strategies

Strategy How Tradeoff
Head sampling Decision made at the first span Fast, low overhead, but misses rare errors
Tail sampling Decision made after trace is complete Can prioritise errors/slow traces, but higher overhead
Rate-based Keep X% of all traces Simple, predictable cost

Production recommendation

Use head sampling at 1–10% for normal traffic, combined with a tail-sampling rule that keeps 100% of traces containing at least one ERROR span. This keeps storage costs low while ensuring every incident has full trace data.


OpenTelemetry (OTel)

OpenTelemetry is the CNCF standard for generating, collecting, and exporting telemetry. It replaces vendor-specific SDKs (Zipkin client, Jaeger client, etc.) with a single neutral API, so switching backends requires only a configuration change, not a code change.

Component Role
API Language-specific interfaces — Tracer, Meter, Logger
SDK Implementation of the API; includes sampling, batching, exporters
Collector Standalone process that receives, processes, and exports telemetry
Exporter Protocol adapter (OTLP, Jaeger, Zipkin, Prometheus)
flowchart LR
    app["Application\n(OTel SDK)"]
    collector["OTel Collector"]
    prom["Prometheus\n(metrics)"]
    jaeger["Jaeger\n(traces)"]
    loki["Loki\n(logs)"]
    app -->|"OTLP (gRPC/HTTP)"| collector
    collector --> prom
    collector --> jaeger
    collector --> loki

Java Zero-Code Instrumentation

The OTel Java agent instruments Spring Boot automatically via bytecode injection — no code changes required.

ADD https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar /app/opentelemetry-javaagent.jar

ENTRYPOINT ["java", \
  "-javaagent:/app/opentelemetry-javaagent.jar", \
  "-jar", "/app/app.jar"]

Configure the agent via environment variables in Docker Compose:

environment:
  OTEL_SERVICE_NAME: order-service
  OTEL_EXPORTER_OTLP_ENDPOINT: http://jaeger:4317
  OTEL_METRICS_EXPORTER: none
  OTEL_LOGS_EXPORTER: none

Why disable metrics/logs exporters?

The Java agent can export all three signals via OTLP. We disable metrics and logs here because Prometheus already scrapes metrics via Actuator and Loki (or ELK) handles logs separately. Enabling all three from the agent would duplicate data and inflate storage costs.

Manual Spans

When automatic instrumentation is not granular enough, add custom spans programmatically. Add the OTel API dependency:

<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-api</artifactId>
</dependency>

Wrap business logic in a span to capture timing, attributes, and exceptions:

@Autowired
private Tracer tracer;

public Order processOrder(OrderRequest request) {
    Span span = tracer.spanBuilder("processOrder").startSpan();
    try (Scope scope = span.makeCurrent()) {
        span.setAttribute("order.customerId", request.customerId());
        // ... business logic
        return order;
    } catch (Exception e) {
        span.recordException(e);
        span.setStatus(StatusCode.ERROR, e.getMessage());
        throw e;
    } finally {
        span.end();
    }
}

Grafana Unified Stack (LGTM)

Loki + Grafana + Tempo + Mimir/Prometheus — the LGTM stack — puts all three observability signals into a single UI. The key advantage is signal correlation: you can jump from a metric spike to the traces that caused it to the log lines that explain it, all without leaving Grafana.

flowchart LR
    app["Application"]
    prom["Prometheus\n(metrics)"]
    loki["Loki\n(logs)"]
    tempo["Tempo\n(traces)"]
    grafana["Grafana"]
    app --> prom
    app --> loki
    app --> tempo
    prom --> grafana
    loki --> grafana
    tempo --> grafana

A typical drill-down workflow: a metric alert fires in Grafana → click "View Traces" to open the correlated trace in Tempo → click a span to see the correlated log lines from Loki. This cross-signal navigation is what separates an observability platform from three isolated monitoring tools.


DORA Metrics

DORA (DevOps Research and Assessment) identified four metrics that predict software delivery performance and organisational performance. Elite-performing teams score well on all four simultaneously.

Metric Measures Elite benchmark
Deployment Frequency How often code reaches production Multiple times/day
Lead Time for Changes Commit to production time < 1 hour
Change Failure Rate % of deployments causing incidents < 5%
Mean Time to Recovery (MTTR) How long to recover from an incident < 1 hour

DORA and observability

DORA metrics are themselves telemetry. Deployment frequency comes from CI/CD logs. MTTR requires precise incident start and end timestamps from your monitoring system. Change Failure Rate requires correlating deployment events with error-rate spikes. You cannot improve what you cannot measure — and you cannot measure it without observability infrastructure.


Observability Snapshot: Spike Simulation

The chart below simulates what the three observability pillars look like during a traffic spike. Panel 1 shows the request rate rising sharply at steps 10–12; Panel 2 shows error events clustering in the same window; Panel 3 shows trace span durations stretching out under load.

2026-06-11T18:45:45.570543 image/svg+xml Matplotlib v3.10.9, https://matplotlib.org/

The three panels share a timeline: the spike visible in Panel 1 directly explains the error cluster in Panel 2, and both explain the stretched span durations in Panel 3. Without all three pillars, you would see only one piece of the picture.



  1. BEYER, B. et al. Site Reliability Engineering: How Google Runs Production Systems. O'Reilly, 2016. 

  2. MAJORS, C.; FONG-JONES, L.; MIRANDA, G. Observability Engineering. O'Reilly, 2022. 

  3. OPENTELEMETRY. opentelemetry.io — specification, SDKs, Collector. 

  4. FORSGREN, N.; HUMBLE, J.; KIM, G. Accelerate: The Science of Lean Software and DevOps. IT Revolution, 2018.