Observability, SLOs & alerts
Core details
Triad:
| Signal | Job | Pitfall |
|---|---|---|
| Logs | causal narrative | PII leaks, unbounded cardinality labels |
| Metrics | aggregates + SLO math | dashboard noise without an owner |
| Traces | latency localization | cost blow-up from always-on 100% sampling |
RED (rate, errors, duration) on critical paths; USE (utilization, saturation, errors) on constrained resources (pools, disks, brokers).
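A minimal sketch of what RED tracking means for one critical path; `RedTracker` is a toy illustration, not a metrics library, and the nearest-rank p95 is one of several valid percentile definitions:

```python
# Toy RED (rate, errors, duration) tracker for a single critical path.
class RedTracker:
    def __init__(self) -> None:
        self.requests = 0       # Rate: request count per scrape interval
        self.errors = 0         # Errors: failed requests
        self.durations = []     # Duration: latency samples in seconds

    def observe(self, duration_s: float, ok: bool) -> None:
        self.requests += 1
        if not ok:
            self.errors += 1
        self.durations.append(duration_s)

    def p95(self) -> float:
        """Nearest-rank 95th-percentile latency."""
        xs = sorted(self.durations)
        return xs[max(0, int(0.95 * len(xs)) - 1)]

red = RedTracker()
for i in range(100):
    # Simulate: every 50th request fails; the last 5 are slow.
    red.observe(0.1 + (0.9 if i >= 95 else 0.0), ok=(i % 50 != 0))
```

In a real system these three numbers come from your metrics library per endpoint; USE applies the same discipline to utilization/saturation/errors of a pool or disk instead of a request path.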
SLO = target + error budget → the budget makes change-risk acceptance tangible: ship while budget remains, slow down when it burns out.
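The error-budget arithmetic is simple enough to sketch; the 99.9%/30-day numbers below are hypothetical:

```python
# Error-budget math for a hypothetical 99.9% availability SLO over 30 days.
SLO_TARGET = 0.999
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in the 30-day window

# Minutes of "allowed" failure before the SLO is breached.
error_budget_minutes = (1 - SLO_TARGET) * WINDOW_MINUTES  # 43.2 minutes

def budget_remaining(bad_minutes: float) -> float:
    """Fraction of the window's error budget still unspent."""
    return 1 - bad_minutes / error_budget_minutes

# 10.8 bad minutes spent -> 75% of the budget left; changes can still ship.
print(budget_remaining(10.8))
```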
Understanding
Observability succeeds when signals tie to actions. An alert without a runbook or an owning team is notification spam that erodes on-call trust. High-cardinality tags (e.g. a per-user ID as a metric label) explode cost and destabilize the metrics backend—engineers must consciously model dimensions.
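The cardinality blow-up is just multiplication: every unique label combination is a separate time series the backend must store and index. A back-of-envelope estimate (illustrative cardinalities):

```python
# Series-count estimate: the product of each label's cardinality.
def series_count(*label_cardinalities: int) -> int:
    n = 1
    for c in label_cardinalities:
        n *= c
    return n

# Bounded dimensions: endpoint (20) x method (4) x status_class (5).
bounded = series_count(20, 4, 5)               # 400 series: cheap

# Add a per-user label (say 1M users) and the same metric explodes.
exploded = series_count(20, 4, 5, 1_000_000)   # 400 million series
```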
Traces without propagated context (trace/span IDs carried across service boundaries) yield disconnected spans: you pay to store them but cannot stitch them into a request narrative.
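A minimal sketch of what "propagated context" means, using a W3C Trace Context-style `traceparent` header; the `inject`/`extract` helpers are hypothetical, stand-ins for what an instrumentation library does at each service boundary:

```python
import uuid

TRACEPARENT = "traceparent"  # W3C Trace Context header name

def inject(headers: dict, trace_id: str, span_id: str) -> None:
    """Attach trace context to an outgoing request's headers."""
    headers[TRACEPARENT] = f"00-{trace_id}-{span_id}-01"

def extract(headers: dict):
    """Recover trace context on the receiving side, or None if absent."""
    raw = headers.get(TRACEPARENT)
    if raw is None:
        return None  # no context: this span can never be stitched
    _version, trace_id, span_id, _flags = raw.split("-")
    return trace_id, span_id

outgoing = {}
inject(outgoing, uuid.uuid4().hex, uuid.uuid4().hex[:16])
```

If any hop in the call chain drops the header, the downstream spans land in storage as orphans.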
Senior understanding
Tell a near-miss improvement story: e.g. moving an alert from “CPU > 80%” to “p95 of the checkout dependency breaching SLO over a sustained window, with an approximation of user impact,” materially reducing false positives—quantify briefly and honestly.
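One way such an alert rewrite can look in code: a multi-window burn-rate condition instead of a raw resource threshold. The 14.4x threshold and window fractions below are illustrative assumptions, not a prescription:

```python
# Sketch of a sustained-window, SLO-based paging condition (hypothetical
# thresholds), replacing a raw "CPU > 80%" trigger.
def burn_rate(bad_fraction: float, slo_target: float) -> float:
    """How many times faster than budget-neutral the error budget burns."""
    return bad_fraction / (1 - slo_target)

def should_page(short_window_bad: float, long_window_bad: float,
                slo_target: float = 0.999) -> bool:
    # Page only when BOTH a short and a long window burn fast, i.e. the
    # user impact is sustained; brief spikes no longer wake anyone up.
    return (burn_rate(short_window_bad, slo_target) > 14.4 and
            burn_rate(long_window_bad, slo_target) > 14.4)

print(should_page(0.02, 0.016))   # sustained fast burn -> page
print(should_page(0.05, 0.0005))  # brief spike, long window healthy -> no page
```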
Relate post-incident action items to missing signals—detection-gap narratives impress staff panels when they are humble and concrete.
Diagram
See also