Observability, incidents & safe rollouts
Core details
Three pillars: logs (events), metrics (aggregates), traces (request paths).
RED (service): rate, errors, duration. USE (resource): utilization, saturation, errors.
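As a rough illustration of RED, the sketch below summarizes one window of request samples. The `Request` type, the `red_summary` name, the choice to count only 5xx statuses as errors, and the nearest-rank p95 are all assumptions made for this example, not part of any particular metrics library.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """Summarize one window of requests as rate, error ratio, and p95 duration."""
    total = len(requests)
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank p95: smallest sample with at least 95% of samples at or below it.
    p95 = durations[max(0, math.ceil(0.95 * total) - 1)] if total else 0.0
    return {
        "rate_per_s": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "p95_ms": p95,
    }
```

In practice a metrics system would compute these from counters and histograms; the point here is only which three numbers a service dashboard starts from.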
Use structured logs and carry a correlation ID across services so a single request can be followed end to end. Dashboards should answer SLO questions, not show vanity graphs.
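A minimal sketch of structured logging with a correlation ID, using only the standard library. The header name `X-Correlation-ID`, the field names, and both helper functions are assumptions for illustration; use whatever convention your services already share.

```python
import json
import uuid
from datetime import datetime, timezone

# Assumed header name; any name works as long as every service agrees on it.
CORRELATION_HEADER = "X-Correlation-ID"


def correlation_id_from(headers: dict[str, str]) -> str:
    """Reuse the caller's correlation ID if present, otherwise start a new one."""
    return headers.get(CORRELATION_HEADER) or uuid.uuid4().hex


def log_line(service: str, event: str, correlation_id: str, **fields) -> str:
    """Build one JSON log line; extra keyword arguments become top-level fields."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "service": service,
        "event": event,
        "correlation_id": correlation_id,
        **fields,
    }
    return json.dumps(record)
```

Each service would read the ID from the incoming request, include it in every log line, and forward it on outgoing calls, so searching logs for one ID returns the whole request path.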
Incident loop (spoken habit)
- Detect (alert routes to owner)
- Mitigate (rollback, feature flag off, scale, degrade)
- Fix root cause
- Postmortem with action items
Rollout strategies
| Strategy | Trade-off |
|---|---|
| Big bang | simple; high blast radius |
| Canary | small % traffic; needs metrics gate |
| Blue/green | instant switch; double capacity cost |
| Feature flags | turn off a bad code path without a full redeploy |
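
The "metrics gate" a canary needs can be sketched as a small decision function. The function name, the thresholds (`min_requests`, `max_ratio_increase`), and the three outcomes are hypothetical choices for this example; real gates usually also compare latency and run over several evaluation intervals.

```python
def canary_decision(
    baseline_error_ratio: float,
    canary_error_ratio: float,
    canary_requests: int,
    min_requests: int = 500,           # assumed minimum sample size
    max_ratio_increase: float = 0.01,  # assumed tolerated absolute increase
) -> str:
    """Return "wait", "rollback", or "promote" for one canary evaluation."""
    # Too little traffic: the error ratio is not yet meaningful.
    if canary_requests < min_requests:
        return "wait"
    # Canary is clearly worse than the baseline: stop the rollout.
    if canary_error_ratio > baseline_error_ratio + max_ratio_increase:
        return "rollback"
    return "promote"
```

Comparing against the live baseline rather than a fixed threshold keeps the gate from failing a canary for errors the old version is producing too.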
Understanding
Where possible, alert on customer-visible symptoms (error rate, latency), not only on resource signals such as CPU. Runbooks reduce MTTR (mean time to recovery).
Senior understanding
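Error budgets link reliability to feature velocity: while budget remains, teams can keep shipping; once it is spent, reliability work takes priority (see Tail latency & SLOs). Chaos experiments and game days validate assumptions about how the system fails.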
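To make the error-budget idea concrete, here is a small worked calculation, assuming a request-based availability SLO. The function name and the request-count framing are assumptions for this sketch; time-based SLOs work the same way with minutes in place of requests.

```python
def error_budget_remaining(
    slo_target: float, total_requests: int, failed_requests: int
) -> float:
    """Fraction of the error budget left in the window (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0  # no traffic, so nothing has been spent
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures
```

With a 99.9% target over 1,000,000 requests the budget is about 1,000 failed requests, so 250 failures leaves roughly three quarters of it, and 2,000 failures puts the service over budget.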