Observability, SLOs & alerts
Core details
Triad:
| Signal | Job | Pitfall |
|---|---|---|
| Logs | causal narrative | PII leaks, unbounded cardinality labels |
| Metrics | aggregates + SLO math | dashboard noise without owner |
| Traces | latency localization | runaway cost from always-on 100% sampling |
RED (rate, errors, duration) on critical paths; USE (utilization, saturation, errors) on constrained resources (pools, disks, brokers).
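A minimal sketch of the RED calculation for one time window, assuming request records are already collected in memory. The `Request` type and `red_summary` function are hypothetical names for illustration; a real service would normally get these numbers from its metrics library rather than compute them by hand.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    ok: bool


def red_summary(requests, window_s):
    """Return rate (req/s), error ratio, and p95 duration for one window."""
    if not requests:
        return {"rate": 0.0, "error_ratio": 0.0, "p95_ms": None}
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank p95 with integer math: ceil(0.95 * n) without float rounding.
    rank = (95 * len(durations) + 99) // 100
    return {
        "rate": len(requests) / window_s,
        "error_ratio": sum(not r.ok for r in requests) / len(requests),
        "p95_ms": durations[rank - 1],
    }
```

USE metrics follow the same idea but describe a resource rather than a request path, for example pool connections in use (utilization) and callers waiting for a connection (saturation).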
SLO = target + error budget. The remaining budget gives the team a concrete basis for deciding how much change risk to accept.
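The budget arithmetic can be sketched as follows, assuming a request-based SLO where each request is either good or bad. The function name `error_budget` and the sample numbers are illustrative assumptions, not part of any specific tool.

```python
def error_budget(slo_target, total_requests, bad_requests):
    """Return (allowed bad requests, fraction of the budget still remaining).

    slo_target is a fraction such as 0.995. A negative remaining fraction
    means the budget for the window is already overspent.
    """
    if not 0 < slo_target < 1 or total_requests <= 0:
        raise ValueError("need 0 < slo_target < 1 and total_requests > 0")
    allowed_bad = (1 - slo_target) * total_requests
    remaining_fraction = (allowed_bad - bad_requests) / allowed_bad
    return allowed_bad, remaining_fraction
```

With a 99.5% target and 1,000,000 requests in the window, about 5,000 bad requests are allowed; after 2,000 bad requests, roughly 60% of the budget remains. A team might ship riskier changes while the remaining fraction is high and slow down as it approaches zero.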
Problem this solves: production systems fail in partial, delayed, and user-specific ways. Without useful signals, teams debate anecdotes instead of evidence.
Simple mental model: logs explain one request, metrics explain populations, traces explain where time went.
Understanding
Observability succeeds when signals tie to actions. An alert without a runbook or an owning team is notification spam that erodes on-call trust. High-cardinality tags (such as a per-user ID in metric labels) blow up cost and threaten the stability of the metrics backend, so engineers must model dimensions consciously.
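A rough way to reason about label cardinality is to multiply the number of distinct values per label. This is a worst-case sketch that assumes every combination can occur; the label names and counts below are made up for illustration.

```python
from math import prod


def max_series(label_cardinalities):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


bounded = {"route": 20, "method": 4, "status_class": 5}
with_user_id = {**bounded, "user_id": 100_000}
```

Here `max_series(bounded)` is 400, while `max_series(with_user_id)` is 40,000,000. The per-user detail belongs in logs or trace attributes, where one record per request is the expected cost model.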
Traces without propagated context waste money: spans from different services cannot be stitched into one request, so you pay to store disconnected fragments.
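Context propagation can be illustrated with the W3C Trace Context `traceparent` header, which has the form `00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>`. The hand-rolled helper below is only a sketch under that assumption; `child_traceparent` is a hypothetical name, and production code would normally rely on a tracing SDK instead.

```python
import re
import secrets

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def child_traceparent(incoming):
    """Return (outgoing traceparent header, parent span id or None).

    A valid incoming header keeps its trace ID so downstream spans join the
    same trace. Anything else starts a new trace.
    """
    match = _TRACEPARENT.match(incoming or "")
    if match and match.group(1) != "0" * 32 and match.group(2) != "0" * 16:
        trace_id, parent_id, flags = match.groups()
    else:
        # Missing or malformed header: start a fresh trace, marked as sampled.
        trace_id, parent_id, flags = secrets.token_hex(16), None, "01"
    span_id = secrets.token_hex(8)  # new span ID for this hop
    return f"00-{trace_id}-{span_id}-{flags}", parent_id
```

Each service calls this once per inbound request and sends the returned header on every outbound call, so the trace ID stays constant while the span ID changes per hop.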
Example SLO shape
| Field | Example |
|---|---|
| User journey | checkout submit |
| SLI | percentage of checkout requests that succeed in under 900 ms |
| SLO | 99.5% over 30 days |
| Error budget use | failed or slow requests consume budget |
| Page condition | burn rate sustained at a level that would exhaust the budget before the 30-day window ends |
| Runbook first step | inspect checkout trace by dependency and pool wait |
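The page condition in the table can be sketched as a burn-rate check over two windows. Burn rate here means the observed bad-request ratio divided by the ratio the SLO allows, so a burn rate of 1 spends the budget exactly over the full window. The two-window shape, the 2% budget fraction, and the 1-hour alert window are assumptions chosen for illustration, not values from the table.

```python
def burn_rate(bad, total, slo_target):
    """Observed bad ratio divided by the bad ratio the SLO allows."""
    if total == 0:
        return 0.0
    return (bad / total) / (1 - slo_target)


def threshold_for(budget_fraction, window_days, alert_hours):
    """Burn rate that spends budget_fraction of the window's budget in alert_hours."""
    return budget_fraction * window_days * 24 / alert_hours


def should_page(long_window, short_window, slo_target, threshold):
    """Page only if both the long and the short window burn above the threshold.

    Each window is a (bad, total) pair. The short window confirms the problem
    is still happening, which avoids paging for a spike that already ended.
    """
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )
```

For the 99.5% over 30 days example, `threshold_for(0.02, 30, 1)` gives 14.4: at that rate, 2% of the monthly budget is gone in one hour. A long window of `(800, 10000)` and a short window of `(90, 1000)` both exceed it, so `should_page` returns `True`; if the short window drops to `(2, 1000)`, it returns `False`.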
Start from a user journey before adding service dashboards. CPU can be high without user pain, and user pain can exist while CPU looks normal.
Senior understanding
Tell a near-miss improvement story: you moved an alert from “CPU > 80%” to “p95 checkout dependency latency above the SLO threshold for a sustained window, with an approximation of user impact,” which materially reduced false positives. Quantify the reduction briefly and honestly.
Relate post-incident action items to missing signals. Detection-gap narratives impress staff-level panels when they are humble and concrete.
Common mistakes
- Alerting on raw CPU without customer impact or saturation context.
- Putting user IDs, request IDs, or full URLs into metric labels.
- Logging PII because “debugging might need it.”
- Sampling traces so aggressively that rare failure paths vanish.
- Creating dashboards with no owner, no runbook, and no decision tied to them.
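One way to avoid the sampling mistake above is to decide after a trace completes: keep every trace that errored or was slow, and sample only the healthy remainder. The sketch below assumes the error flag and total duration are known at decision time, which usually means the decision runs in a collector rather than in the service. `keep_trace`, the 900 ms cutoff, and the 1% base rate are illustrative assumptions.

```python
import hashlib


def keep_trace(trace_id, had_error, duration_ms, slow_ms=900, base_rate=0.01):
    """Keep all failed or slow traces; keep a fixed fraction of healthy ones."""
    if had_error or duration_ms >= slow_ms:
        return True
    # Hash the trace ID into [0, 1) so the same trace always gets the same decision.
    digest = hashlib.sha256(trace_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < base_rate
```

Because the healthy-trace decision depends only on the trace ID, replaying the same trace gives the same result, which keeps sampled traces complete rather than partially dropped.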
Interview answer structure
“I define the user journey and SLO first, then instrument RED metrics for the service path and USE metrics for constrained resources. Logs carry a trace ID and safe business identifiers, traces show the critical path, and alerts page only when user impact or error-budget burn needs human action.”
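The "trace ID and safe business identifiers" part of that answer can be sketched as an allow-list applied before a log line is written. The field names in `SAFE_FIELDS` and the `log_line` helper are hypothetical; the point is that unknown fields are dropped by default instead of logged by default.

```python
import json

# Hypothetical allow-list: identifiers that are useful for debugging and not PII.
SAFE_FIELDS = {"trace_id", "order_id", "route", "status", "duration_ms"}


def log_line(event, **fields):
    """Build a JSON log line that contains only allow-listed fields."""
    safe = {k: v for k, v in fields.items() if k in SAFE_FIELDS}
    dropped = sorted(set(fields) - SAFE_FIELDS)
    record = {"event": event, **safe}
    if dropped:
        # Record the names, never the values, so missing context is visible.
        record["dropped_fields"] = dropped
    return json.dumps(record, sort_keys=True)
```

Calling `log_line("checkout_failed", trace_id="abc", email="a@b.c", status=502)` produces a line with `trace_id` and `status`, while `email` appears only by name under `dropped_fields`.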
Follow-ups to expect:
- What should be a metric label versus a log field?
- How do you reduce false pages?
- How do you debug a slow request using traces?
- What action item would you create after a missed incident?