Observability, incidents & safe rollouts

Observability and safe rollout loop showing logs, metrics, traces, customer-visible alerts, canary metric gates, rollback or promotion, incident mitigation, learning, and guardrail updates.

Core details

Problem this solves: knowing when users are harmed, finding the responsible path quickly, mitigating before the root cause is fully understood, and turning the lesson into a guardrail.

Three pillars: logs (events), metrics (aggregates), traces (request paths).

RED (service): rate, errors, duration. USE (resource): utilization, saturation, errors.
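
As a rough illustration of the RED signals, the sketch below derives rate, error ratio, and duration percentiles from one window of request records. The `Request` shape, the "5xx counts as an error" rule, and the nearest-rank percentile are assumptions made for the example, not a prescribed implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code (assumed field for this sketch)


def percentile(values, pct):
    """Nearest-rank percentile for pct in (0, 100]; values must be non-empty."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def red_summary(requests, window_seconds):
    """Summarize one time window as Rate, Errors, Duration."""
    total = len(requests)
    # Assumption: only 5xx responses count as service errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = [r.duration_ms for r in requests]
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "p50_ms": percentile(durations, 50) if total else None,
        "p99_ms": percentile(durations, 99) if total else None,
    }


# 97 fast successes, one slow success, two slow server errors in a 10 s window.
window = [Request(20, 200)] * 97 + [
    Request(400, 200),
    Request(900, 500),
    Request(1200, 503),
]
summary = red_summary(window, window_seconds=10)
print(summary)
```

In this window the median stays at 20 ms while p99 lands on a 900 ms request, which is why duration is usually tracked as a tail percentile rather than an average.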

Use structured logs with a correlation ID propagated across services. Dashboards should answer SLO questions, not display vanity graphs.
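
One way to picture structured logs with a correlation ID is the standard-library sketch below. The header name `X-Correlation-ID`, the logger name `checkout`, and the `handle_request` function are hypothetical choices for illustration; real services often rely on a tracing library to carry this context.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the current request's correlation ID for any code running in its context.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object so fields can be queried."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def build_logger(stream):
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, headers):
    # Reuse the caller's ID when present; otherwise start a new one.
    cid = headers.get("X-Correlation-ID") or uuid.uuid4().hex
    token = correlation_id.set(cid)
    try:
        logger.info("payment authorized")
        # Headers to forward to downstream services so their logs share the ID.
        return {"X-Correlation-ID": cid}
    finally:
        correlation_id.reset(token)


stream = io.StringIO()
log = build_logger(stream)
outgoing = handle_request(log, {"X-Correlation-ID": "req-123"})
record = json.loads(stream.getvalue().strip())
print(record)
```

Because the ID lives in a context variable, log calls deep in the call stack pick it up without every function taking an extra parameter.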

Incident loop (spoken habit)

Detect (alert routes to owner)
Mitigate (rollback, feature flag off, scale, degrade)
Fix root cause
Postmortem with action items

Rollout strategies

Strategy	Trade-off
Big bang	simple; high blast radius
Canary	small % traffic; needs metrics gate
Blue/green	instant switch; double capacity cost
Feature flags	turn off a bad path without a full redeploy
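
The "small % traffic" and "off without redeploy" rows can be sketched together as a percentage flag with a kill switch. The `Flag` class, the salt, and the user IDs are hypothetical; a real system would typically read flag state from a configuration service rather than hold it in memory.

```python
import hashlib


def bucket(user_id: str, salt: str = "new-checkout") -> int:
    """Map a user to a stable bucket in [0, 100) so exposure does not flap."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


class Flag:
    """In-memory stand-in for a flag held in a config service (assumption)."""

    def __init__(self, enabled: bool, percent: int):
        self.enabled = enabled
        self.percent = percent

    def is_on(self, user_id: str) -> bool:
        return self.enabled and bucket(user_id) < self.percent


def checkout(user_id: str, flag: Flag) -> str:
    # The new path runs only for users inside the exposed percentage.
    return "new-path" if flag.is_on(user_id) else "stable-path"


flag = Flag(enabled=True, percent=5)
users = [f"user-{i}" for i in range(10_000)]
exposed = sum(flag.is_on(u) for u in users)  # roughly 5% of users

flag.enabled = False  # kill switch: no redeploy, every user returns to stable
after_kill = sum(flag.is_on(u) for u in users)
print(exposed, after_kill)
```

Hashing the user ID, instead of drawing a random number per request, keeps each user on one path for the whole rollout, which makes canary and baseline cohorts comparable.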

Understanding

Alerts should fire on customer-visible symptoms where possible, not only on resource signals such as CPU. Runbooks reduce MTTR.

Observability is not a dashboard collection. It is a debugging contract:

Metrics say that something is wrong.
Traces say where time or errors entered the path.
Logs say why one request or job made a decision.
Deploy markers say what changed near the symptom.
Runbooks say what to do now.

Safe rollout uses the same contract before full exposure. A canary should compare the new path against a stable baseline for the metrics that matter: error rate, latency, saturation, business success rate, and dependency pressure.
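
A minimal sketch of such a gate follows, assuming metric snapshots have already been collected for both cohorts. The `Snapshot` fields and every threshold are illustrative values, and dependency pressure is left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    error_rate: float        # fraction of failed requests
    p99_ms: float            # tail latency
    cpu_utilization: float   # 0..1, standing in for saturation
    checkout_success: float  # business success rate, 0..1


# Hypothetical thresholds; real values depend on the service's SLOs.
MAX_ERROR_RATE_DELTA = 0.005
MAX_P99_RATIO = 1.10
MAX_CPU_DELTA = 0.10
MAX_SUCCESS_DROP = 0.01


def evaluate_canary(baseline: Snapshot, canary: Snapshot):
    """Return ("promote" | "rollback", list of violated gates)."""
    violations = []
    if canary.error_rate - baseline.error_rate > MAX_ERROR_RATE_DELTA:
        violations.append("error_rate")
    if canary.p99_ms > baseline.p99_ms * MAX_P99_RATIO:
        violations.append("p99_latency")
    if canary.cpu_utilization - baseline.cpu_utilization > MAX_CPU_DELTA:
        violations.append("saturation")
    if baseline.checkout_success - canary.checkout_success > MAX_SUCCESS_DROP:
        violations.append("business_success")
    return ("rollback" if violations else "promote", violations)


baseline = Snapshot(error_rate=0.002, p99_ms=420.0,
                    cpu_utilization=0.55, checkout_success=0.97)
# Error rate is flat, but tail latency regressed well past the 10% allowance.
canary = Snapshot(error_rate=0.002, p99_ms=610.0,
                  cpu_utilization=0.58, checkout_success=0.968)

decision, reasons = evaluate_canary(baseline, canary)
print(decision, reasons)
```

Comparing against a baseline measured over the same period, not a fixed number, keeps time-of-day traffic effects from masking or faking a regression.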

Senior understanding

Error budgets link reliability to feature velocity (see Tail latency & SLOs). Chaos experiments and game days validate assumptions about how the system fails.
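
The arithmetic behind an error budget can be sketched as follows, assuming an availability SLO measured over request counts. The function names, the 30-day window, and the release policy at the end are hypothetical illustrations of how a budget might feed a shipping decision.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of events allowed to fail, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error ratio divided by the budgeted ratio.

    A burn rate of 1 spends the budget exactly over the SLO window;
    higher values spend it proportionally faster.
    """
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / error_budget(slo_target)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone if the current burn rate holds."""
    return float("inf") if rate == 0 else window_days * 24 / rate


def budget_remaining(bad_events: int, total_events: int, slo_target: float) -> float:
    """Unspent fraction of the window's budget; negative means overspent."""
    if total_events == 0:
        return 1.0
    allowed_bad = total_events * error_budget(slo_target)
    return 1.0 - bad_events / allowed_bad


def release_policy(remaining: float) -> str:
    # Hypothetical policy linking the budget to feature velocity.
    return "ship features" if remaining > 0 else "prioritize reliability work"


# Last hour: 5,000 failures out of 1,000,000 requests against a 99.9% SLO.
rate = burn_rate(5_000, 1_000_000, slo_target=0.999)   # about 5x budget pace
print(round(rate, 2), round(hours_to_exhaustion(rate), 1))

# Month to date: 250 failures out of 1,000,000 requests.
remaining = budget_remaining(250, 1_000_000, slo_target=0.999)
print(round(remaining, 2), release_policy(remaining))
```

A burn rate near 5 would exhaust a 30-day budget in roughly six days, which is the kind of number that turns "reliability versus features" into a concrete scheduling decision.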

Probe	Strong answer
"What pages someone?"	SLO burn, customer-visible failures, and exhausted recovery budget
"High CPU alert?"	useful only if tied to saturation or user impact; otherwise route to dashboard, not pager
"Canary metric?"	compare new vs baseline for error rate, p95/p99, saturation, and core business conversion
"Postmortem quality?"	timeline, contributing factors, missed detection, action owners, and guardrail due dates

Failure modes

Alerting on host CPU while checkout errors go unnoticed.
Logs lack request/user/tenant/correlation identifiers, so incidents become search exercises.
Traces sample away the exact error path during low-volume but high-impact failures.
Canary watches only infrastructure health and misses business regressions.
Postmortems assign blame instead of changing tests, dashboards, rollout gates, or ownership.
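
For the trace-sampling failure mode above, one common mitigation is to decide what to keep after a trace completes, so errors and slow requests are never sampled away. The sketch below assumes a hypothetical `Trace` record and illustrative thresholds; production tracing systems implement this decision in a collector rather than in application code.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: str
    duration_ms: float
    has_error: bool


def keep_trace(trace: Trace, base_rate: float = 0.01,
               slow_ms: float = 1000.0, rng=random.random) -> bool:
    """Decide after completion whether to retain a trace.

    Errors and slow traces are always kept; the remaining healthy
    traffic is sampled at base_rate to control storage cost.
    """
    if trace.has_error:
        return True
    if trace.duration_ms >= slow_ms:
        return True
    return rng() < base_rate


traces = [
    Trace("t1", 35.0, has_error=False),
    Trace("t2", 40.0, has_error=True),     # rare but high-impact failure
    Trace("t3", 2400.0, has_error=False),  # tail-latency outlier
]
# With base_rate=0.0, only the error and the slow trace survive.
kept = [t.trace_id for t in traces if keep_trace(t, base_rate=0.0)]
print(kept)
```

The trade-off is that the whole trace must be buffered until it finishes, which costs memory compared with deciding at the first span.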

Interview drill

Question: "A canary deploy increases p99 latency but error rate is flat. What do you do?"

Model answer structure:

Stop or hold the canary; do not promote while the named SLO is regressing.
Compare traces between baseline and canary for dependency latency, pool waits, retries, payload size, and CPU time.
Check saturation metrics: event loop lag, connection pool wait, queue depth, memory/GC, throttling.
Decide mitigation: rollback, feature flag off, reduce traffic, or apply safe config change.
Add a rollout gate for the discovered leading indicator.

Follow-ups to expect:

"What if p99 only regresses for one tenant?"
"When do you page versus create a ticket?"
"How do you handle an incident with no clear owner?"
