THN Interview Prep

Observability, SLOs & alerts

Core details

Triad:

Signal | Job | Pitfall
--- | --- | ---
Logs | causal narrative | PII leaks, unbounded-cardinality labels
Metrics | aggregates + SLO math | dashboard noise without owner
Traces | latency localization | always-on 100% sampling cost disaster

RED (rate, errors, duration) on critical paths; USE (utilization, saturation, errors) on constrained resources (pools, disks, brokers).
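As a rough illustration of the RED half, the sketch below summarizes one window of requests into rate, error ratio, and p95 duration. The `Request` record and `red_summary` function are hypothetical names for this example only; a real service would normally get these numbers from its metrics library rather than from raw request lists.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    ok: bool


def red_summary(requests, window_seconds):
    """Return rate (req/s), error ratio, and p95 duration for one window."""
    if not requests:
        return {"rate": 0.0, "error_ratio": 0.0, "p95_ms": None}
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank p95: the smallest sample with at least 95% of samples
    # at or below it. Integer arithmetic avoids float rounding in the rank.
    rank = (95 * len(durations) + 99) // 100
    errors = sum(1 for r in requests if not r.ok)
    return {
        "rate": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p95_ms": durations[rank - 1],
    }


# Illustrative data: 100 requests in a 50 s window, every 10th one failing.
sample = [Request(float(i), i % 10 != 0) for i in range(1, 101)]
summary = red_summary(sample, window_seconds=50)
```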

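SLO = target + error budget. The remaining budget makes change-risk decisions tangible: with budget left, a team can accept riskier changes; once it is spent, reliability work takes priority.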
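The error-budget arithmetic is easy to work through by hand. A minimal sketch, assuming a request-based SLO and made-up traffic numbers; `error_budget` is a hypothetical helper, not part of any library:

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return (allowed failures, fraction of the budget consumed).

    slo_target is the success-ratio target, e.g. 0.999 for "three nines".
    """
    budget = (1.0 - slo_target) * total_requests
    if budget == 0:
        # A 100% target leaves no budget: any failure overspends it.
        return 0.0, float("inf") if failed_requests else 0.0
    return budget, failed_requests / budget


# Illustrative numbers: 99.9% target, 1,000,000 requests in the SLO window,
# 250 of them failed. About 1,000 failures are allowed, so roughly a
# quarter of the budget is gone.
budget, consumed = error_budget(0.999, 1_000_000, 250)
```

A team might pair the consumed fraction with a policy such as "pause risky launches above some threshold", but the threshold itself is a local decision rather than something the formula provides.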

Understanding

Observability succeeds when signals tie to actions. An alert without a runbook or an owning team is notification spam that erodes on-call trust. High-cardinality tags (for example, a per-user id in metric labels) blow up cost and destabilize the metrics backend, so engineers must model dimensions deliberately.
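To see why a per-user label is dangerous, it helps to estimate the worst-case number of time series a metric can produce: the product of the distinct values of each label. The sketch below uses invented label names and counts purely for illustration.

```python
from math import prod


def series_count(label_cardinalities):
    """Upper bound on time series for one metric: product of label sizes."""
    return prod(label_cardinalities.values())


# Hypothetical dimensions for a request-duration metric.
bounded = {"route": 20, "status": 5, "region": 3}
with_user_id = {**bounded, "user_id": 100_000}

bounded_series = series_count(bounded)          # 20 * 5 * 3
exploded_series = series_count(with_user_id)    # the same, times 100,000
```

The bound is pessimistic because not every combination occurs in practice, but it shows how a single unbounded label multiplies everything else.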

Traces without propagated context waste money: spans that do not share a trace id cannot be stitched into one request path, so you pay to store unrelated fragments.
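Context propagation can be sketched with the W3C Trace Context `traceparent` header, whose layout is `version-traceid-parentid-flags`. The helpers below are a simplified illustration with hypothetical names and no validation; in practice an instrumentation library such as OpenTelemetry handles this.

```python
import secrets


def new_traceparent():
    """Start a trace: fresh 16-byte trace id and 8-byte span id, sampled."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(incoming):
    """Build the header for an outgoing call from an incoming one.

    The trace id and flags are carried over unchanged; only the span id
    is replaced, which is what lets a backend link both spans together.
    """
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


root = new_traceparent()
downstream = child_traceparent(root)
```

If a service drops the header and calls `new_traceparent()` again, its spans land in a separate trace and the latency picture for the request is broken at that hop.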

Senior understanding

Tell a near-miss improvement story: for example, moving an alert from “CPU > 80%” to “p95 latency of a checkout dependency above SLO over a sustained window, with an approximation of user impact”, which materially reduced false positives. Quantify the result briefly and honestly.
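One way to picture a symptom-based alert with a sustained window is the rule below. It is a minimal sketch under stated assumptions: each evaluation window yields a p95 latency and an estimated fraction of affected users, and the function name, thresholds, and data are all hypothetical.

```python
def should_page(windows, slo_ms, sustain, min_impact):
    """Page only if the last `sustain` windows all breach the SLO threshold
    and each shows at least `min_impact` estimated user impact.

    windows: list of (p95_ms, affected_user_ratio), oldest first.
    sustain: number of consecutive recent windows required (assumed >= 1).
    """
    recent = windows[-sustain:]
    if len(recent) < sustain:
        return False  # not enough history to call the breach sustained
    return all(p95 > slo_ms and impact >= min_impact for p95, impact in recent)


# A single spike does not page; three breaching windows in a row do.
spike = [(300, 0.00), (950, 0.04), (310, 0.00)]
sustained = [(300, 0.00), (900, 0.03), (920, 0.05), (980, 0.06)]

page_on_spike = should_page(spike, slo_ms=800, sustain=3, min_impact=0.02)
page_on_sustained = should_page(sustained, slo_ms=800, sustain=3, min_impact=0.02)
```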

Relate post-incident action items to missing signals. Detection-gap narratives impress staff-level panels when they are humble and concrete.
