THN Interview Prep

Observability, incidents & safe rollouts

Core details

Three pillars: logs (events), metrics (aggregates), traces (request paths).

RED (service): rate, errors, duration. USE (resource): utilization, saturation, errors.
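A minimal sketch of how the three RED numbers could be derived from raw request records. The `Request` type, the field names, and the choice of p95 with nearest-rank percentiles are assumptions for illustration; a real service would normally let its metrics library do this.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP-style status code (assumption)


def red_summary(requests, window_seconds):
    """Return rate, error ratio, and p95 duration for one time window."""
    total = len(requests)
    # Count only server-side failures (5xx) as errors; this is a convention, not a rule.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    if durations:
        # Nearest-rank p95 using integer math to avoid float rounding surprises.
        rank = max(1, (95 * total + 99) // 100)
        p95 = durations[rank - 1]
    else:
        p95 = None
    return {
        "rate_rps": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "p95_ms": p95,
    }
```

Duration is reported as a percentile rather than a mean because tail latency is what users notice.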

Structured logs + a correlation ID propagated across services, so one request can be followed end to end. Dashboards should answer SLO questions, not display vanity graphs.
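One way to sketch structured logging with a correlation ID, using only the Python standard library. The header name `X-Correlation-ID`, the JSON field names, and the helpers `build_logger` and `handle_request` are assumptions for illustration, not a standard.

```python
import contextvars
import json
import logging
import uuid

# Holds the ID of the request currently being handled; "-" means no request.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object so log tooling can index the fields."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def build_logger(name, stream):
    """Hypothetical helper: a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def handle_request(headers, logger):
    """Reuse the caller's ID if present, otherwise mint one, and log with it."""
    cid = headers.get("X-Correlation-ID") or uuid.uuid4().hex
    token = correlation_id.set(cid)
    try:
        logger.info("request received")
        # Headers to attach to any downstream call so the ID keeps flowing.
        return {"X-Correlation-ID": cid}
    finally:
        correlation_id.reset(token)
```

Searching logs for a single `correlation_id` value then returns every line that request produced, in every service that forwarded the header.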

Incident loop (spoken habit)

  1. Detect (alert routes to owner)
  2. Mitigate (rollback, feature flag off, scale, degrade)
  3. Fix root cause
  4. Postmortem with action items
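The "feature flag off" and "degrade" mitigations in step 2 can be sketched as a kill switch around a risky code path. `FlagStore`, the flag name `new_ranker`, and both ranking functions are hypothetical; real systems usually read flags from a configuration service rather than memory.

```python
class FlagStore:
    """Hypothetical in-memory flag store, standing in for a real flag service."""

    def __init__(self):
        self._flags = {}

    def set(self, name, enabled):
        self._flags[name] = enabled

    def is_enabled(self, name, default=False):
        # Unknown flags fall back to the default, so a missing flag is safe.
        return self._flags.get(name, default)


def personalized_ranking(user_id):
    # Placeholder for the new, riskier path.
    return f"personalized:{user_id}"


def popular_items(user_id):
    # Placeholder for the cheap, degraded path.
    return "popular"


def recommendations(user_id, flags):
    """Serve the new path only while its flag is on; otherwise degrade."""
    if flags.is_enabled("new_ranker"):
        return personalized_ranking(user_id)
    return popular_items(user_id)
```

During an incident, calling `flags.set("new_ranker", False)` restores the old behaviour without a redeploy, which buys time for the root-cause fix in step 3.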

Rollout strategies

Strategy       Trade-off
Big bang       simple; high blast radius
Canary         small % traffic; needs metrics gate
Blue/green     instant switch; double capacity cost
Feature flags  turn off bad path without full redeploy
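A canary's metrics gate can be sketched as a comparison of the canary's error ratio against the baseline fleet. The function name, the thresholds, and the three outcomes below are assumptions for illustration; production gates typically also check latency and use statistical tests.

```python
def canary_gate(baseline_errors, baseline_total,
                canary_errors, canary_total,
                max_ratio_increase=0.01, min_requests=100):
    """Decide whether to promote, roll back, or keep waiting on a canary.

    max_ratio_increase and min_requests are illustrative defaults.
    """
    # Too little canary traffic means the error ratio is mostly noise.
    if canary_total < min_requests:
        return "wait"
    baseline_ratio = baseline_errors / baseline_total if baseline_total else 0.0
    canary_ratio = canary_errors / canary_total
    # Roll back if the canary is worse than baseline by more than the allowance.
    if canary_ratio - baseline_ratio > max_ratio_increase:
        return "rollback"
    return "promote"
```

Comparing against the live baseline, rather than a fixed threshold, keeps the gate from blaming the canary for an outage that is already affecting every instance.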

Understanding

Alerts should fire on customer-visible symptoms (error rate, latency) where possible, not only on causes such as high CPU. Runbooks reduce MTTR (mean time to recovery).

Senior understanding

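Error budgets link reliability to feature velocity (see Tail latency & SLOs): budget remaining means room to ship, budget exhausted means reliability work comes first. Chaos experiments and game days validate assumptions about how the system fails.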
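The error-budget arithmetic is small enough to sketch directly. The function name, the request-based (rather than time-based) budget, and the "ship only while budget remains" rule are assumptions for illustration; teams set their own policies.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Compute how much of a request-based error budget has been spent.

    slo_target is a fraction, e.g. 0.999 for a 99.9% success SLO.
    """
    # The budget is the number of failures the SLO tolerates in this window.
    allowed = (1 - slo_target) * total_requests
    remaining = allowed - failed_requests
    if allowed > 0:
        consumed = failed_requests / allowed
    else:
        # A 100% SLO leaves no budget: any failure overspends it.
        consumed = float("inf") if failed_requests else 0.0
    return {
        "allowed_failures": allowed,
        "remaining_failures": remaining,
        "consumed_fraction": consumed,
        # Illustrative policy: risky launches pause once the budget is gone.
        "can_ship": remaining > 0,
    }
```

For example, a 99.9% SLO over 1,000,000 requests tolerates roughly 1,000 failures, so 250 failures consume about a quarter of the budget.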
