Observability, SLOs & alerts

Core details

Triad:

Signal	Job	Pitfall
Logs	causal narrative	PII leaks, unbounded cardinality labels
Metrics	aggregates + SLO math	dashboard noise without owner
Traces	latency localization	always-on 100% sampling makes cost unmanageable

RED (rate, errors, duration) on critical paths; USE (utilization, saturation, errors) on constrained resources (pools, disks, brokers).
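A minimal sketch of what the RED numbers look like when computed from raw request records for one window. The `Request` type, the `red_summary` helper, and the nearest-rank quantile are illustrative assumptions; a real metrics backend would typically derive these from counters and histograms instead.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    ok: bool


def red_summary(requests, window_seconds, quantile=0.95):
    """Return rate (req/s), error ratio, and a nearest-rank duration quantile."""
    if not requests:
        return {"rate": 0.0, "error_ratio": 0.0, "duration_ms": None}
    durations = sorted(r.duration_ms for r in requests)
    # Nearest rank: the smallest sample with at least `quantile` of samples at or below it.
    rank = max(1, math.ceil(quantile * len(durations)))
    errors = sum(1 for r in requests if not r.ok)
    return {
        "rate": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "duration_ms": durations[rank - 1],
    }


# Hypothetical window: nine fast successes and one slow failure in 10 seconds.
sample = [Request(100.0, True)] * 9 + [Request(1000.0, False)]
print(red_summary(sample, window_seconds=10))
```

USE metrics follow the same idea but are read from the resource itself (pool occupancy, queue depth, disk busy time) rather than from request records.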

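SLO = target + error budget. The budget makes risk tolerance explicit: while budget remains, the team can accept riskier changes; when it is nearly spent, reliability work takes priority.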
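The budget arithmetic can be sketched as follows. The 99.5% target and the request counts are illustrative assumptions, and `error_budget` is a hypothetical helper rather than any library's API.

```python
def error_budget(slo_target, total_requests, bad_requests):
    """Return (allowed_bad, remaining, fraction_consumed) for a request-based SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # The budget is the share of requests the SLO allows to be bad.
    allowed_bad = (1.0 - slo_target) * total_requests
    remaining = allowed_bad - bad_requests
    consumed = bad_requests / allowed_bad if allowed_bad > 0 else 0.0
    return allowed_bad, remaining, consumed


# Assumed figures: 1,000,000 requests in the window, 1,250 of them bad.
allowed, remaining, consumed = error_budget(0.995, 1_000_000, 1_250)
print(f"allowed={allowed:.0f} remaining={remaining:.0f} consumed={consumed:.0%}")
```

With these assumed numbers the budget is about 5,000 bad requests, of which roughly a quarter is spent. Expressed in time, a 99.5% target over 30 days allows about 216 minutes of total unavailability.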

Problem this solves: production systems fail in partial, delayed, and user-specific ways. Without useful signals, teams debate anecdotes instead of evidence.

Simple mental model: logs explain one request, metrics explain populations, traces explain where time went.

Observability succeeds when signals tie to actions. An alert without a runbook or an owning team is notification spam that erodes on-call trust. High-cardinality tags (for example, a per-user ID in metric labels) inflate cost and destabilize the metrics backend, so engineers must model dimensions deliberately.
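To see why a per-user label is dangerous, a rough upper bound on the number of time series for one metric is the product of the distinct values of each label. The label names and counts below are made-up assumptions for illustration.

```python
import math


def series_upper_bound(distinct_values_per_label):
    """Worst-case series count for one metric: product of distinct values per label."""
    return math.prod(distinct_values_per_label.values())


# Hypothetical label sets for a request-duration metric.
bounded = {"route": 20, "method": 4, "status_class": 5}
with_user_id = {**bounded, "user_id": 100_000}

print(series_upper_bound(bounded))       # 400
print(series_upper_bound(with_user_id))  # 40000000
```

The bounded set stays in the hundreds, while one unbounded label multiplies it by the size of the user base. Identifiers like that usually belong in logs or trace attributes instead.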

Traces without propagated context fragment into unrelated spans: you pay to collect and store them but cannot stitch them into one request path.
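A simplified sketch of context propagation using the W3C Trace Context `traceparent` header (version `00`, a 32-hex trace ID, a 16-hex parent span ID, 2-hex flags). The `outgoing_headers` helper is hypothetical, it assumes lowercase header keys, and it ignores `tracestate`; in practice a tracing SDK would normally handle this.

```python
import re
import secrets

# version 00, trace-id (32 hex), parent-id (16 hex), flags (2 hex)
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def outgoing_headers(incoming_headers):
    """Continue the caller's trace if a valid traceparent exists, else start a new one."""
    match = _TRACEPARENT.match(incoming_headers.get("traceparent", ""))
    # All-zero trace or parent IDs are invalid and must not be continued.
    if match and match.group(1) != "0" * 32 and match.group(2) != "0" * 16:
        trace_id, flags = match.group(1), match.group(3)
    else:
        trace_id, flags = secrets.token_hex(16), "01"
    span_id = secrets.token_hex(8)  # this service's own span becomes the new parent
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
print(outgoing_headers(incoming))
```

The trace ID stays constant across hops while each service contributes a fresh span ID, which is what lets the backend assemble one request path.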

Example SLO shape

Field	Example
User journey	checkout submit
SLI	percentage of successful checkout requests under 900 ms
SLO	99.5% over 30 days
Error budget use	failed or slow requests consume budget
Page condition	sustained burn rate high enough to threaten the window
Runbook first step	inspect checkout trace by dependency and pool wait
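The page condition in the table can be sketched as a two-window burn-rate check. Burn rate is the observed bad-request ratio divided by the ratio the SLO allows; a value of 1 spends the budget exactly over the window. The 14.4 threshold and the long/short window pairing are assumptions borrowed from commonly cited multi-window alerting practice, not values from these notes, and "bad" here means failed or slower than 900 ms.

```python
def burn_rate(bad, total, slo_target):
    """Observed bad ratio divided by the bad ratio the SLO allows."""
    if total == 0:
        return 0.0
    return (bad / total) / (1.0 - slo_target)


def should_page(long_window, short_window, slo_target=0.995, threshold=14.4):
    """Page only when both windows burn fast.

    The long window shows the burn is sustained; the short window shows it is
    still happening. Each window is a (bad, total) tuple.
    """
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )


# Hypothetical counts: 8% bad in the last hour and in the last five minutes.
print(should_page(long_window=(800, 10_000), short_window=(80, 1_000)))
# Same hour, but the last five minutes have recovered.
print(should_page(long_window=(800, 10_000), short_window=(1, 1_000)))
```

Requiring both windows avoids paging for a spike that has already ended while still catching a burn that threatens the 30-day budget.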

Start from a user journey before adding service dashboards. CPU can be high without user pain, and user pain can exist while CPU looks normal.

Senior understanding

Tell a near-miss improvement story: you moved an alert from “CPU > 80%” to “p95 checkout dependency latency above the SLO threshold for a sustained window, plus an approximation of user impact,” which cut false positives materially. Quantify the result briefly and honestly.

Relate post-incident action items to missing signals. Detection-gap narratives land well with staff-level interview panels when they are humble and concrete.

Common mistakes

Alerting on raw CPU without customer impact or saturation context.
Putting user IDs, request IDs, or full URLs into metric labels.
Logging PII because “debugging might need it.”
Sampling traces so aggressively that rare failure paths vanish.
Creating dashboards with no owner, no runbook, and no decision tied to them.
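One way to avoid the sampling mistake above is to decide after the request finishes, always keeping errors and slow requests and sampling only the healthy remainder. The sketch below assumes a 900 ms slow threshold and a 5% base rate as illustrative values; `keep_trace` is a hypothetical helper, not a tracing library API.

```python
import random


def keep_trace(is_error, duration_ms, slow_ms=900, base_rate=0.05, rng=random.random):
    """Tail-style sampling decision made once the outcome of the request is known."""
    # Rare failure paths and slow requests are always retained.
    if is_error or duration_ms >= slow_ms:
        return True
    # Healthy, fast requests are kept at the base rate only.
    return rng() < base_rate


print(keep_trace(is_error=True, duration_ms=40))     # kept: error
print(keep_trace(is_error=False, duration_ms=1200))  # kept: slow
```

The trade-off is that the decision needs the full outcome, so spans must be buffered until the request completes.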

Interview answer structure

“I define the user journey and SLO first, then instrument RED metrics for the service path and USE metrics for constrained resources. Logs carry a trace ID and safe business identifiers, traces show the critical path, and alerts page only when user impact or error-budget burn needs human action.”
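As a small illustration of "logs carry a trace ID and safe business identifiers", here is a sketch using Python's standard `logging` module. The `JsonFormatter` class and the `order_id` field are assumptions chosen for the example; the point is that the log line carries join keys, not personal data.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line with a trace ID and a safe business ID."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields passed via `extra=` become attributes on the record.
            "trace_id": getattr(record, "trace_id", None),
            "order_id": getattr(record, "order_id", None),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "payment dependency timed out",
    extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "order_id": "ord-1042"},
)
```

The trace ID lets an engineer jump from this line to the matching trace, while the order ID identifies the business object without exposing user details.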

Follow-ups to expect:

What should be a metric label versus a log field?
How do you reduce false pages?
How do you debug a slow request using traces?
What action item would you create after a missed incident?
