Tail latency & SLOs
Core details
SLI = how you measure good (availability, success ratio, latency threshold). SLO = target for that SLI over a window (e.g. “99% of Checkout API requests < 400 ms monthly”). Error budget = allowed “bad” fraction before you freeze features for reliability work (100% − SLO over the window).
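The budget arithmetic above can be sketched in a few lines (a minimal sketch; the function name and the example volumes are illustrative, not from any library):

```python
# Error budget = (100% - SLO) applied to the window's request volume.
def error_budget(slo: float, total_requests: int) -> int:
    """Allowed 'bad' requests in the window for a ratio-based SLO."""
    return round(total_requests * (1.0 - slo))

# Example: 99% monthly SLO on 10M checkout requests
budget = error_budget(0.99, 10_000_000)   # 100,000 bad requests allowed
remaining = budget - 37_500               # budget left after 37,500 failures
```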
| Percentile | Typical use |
|---|---|
| p50 | typical experience, capacity shorthand |
| p90–p95 | UX + many product SLOs |
| p99+ | worst users, retry storms, abandonment |
Why means lie: Arithmetic mean hides bimodal distributions (fast cache hits vs slow misses) and outliers that dominate user frustration.
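A minimal stdlib sketch of the cache-hit/cache-miss case: the sample below is synthetic, but it shows a mean that looks healthy while p99 exposes the slow path.

```python
import statistics

# Bimodal sample: 95% fast cache hits, 5% slow misses (synthetic data)
latencies_ms = [20] * 95 + [2000] * 5

mean = statistics.mean(latencies_ms)                 # 119 ms: looks "fine"
p50 = statistics.quantiles(latencies_ms, n=100)[49]  # 20 ms: typical user
p99 = statistics.quantiles(latencies_ms, n=100)[98]  # 2000 ms: the real tail
```

Every miss user waits 2 s, yet the mean sits near 119 ms; only the high percentiles surface the problem.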
Little’s Law (qualitative): average number of requests in the system ≈ arrival rate × average time in system (L = λW). Backlog growth often signals saturation before CPU is pegged.
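Little’s Law makes a quick saturation check possible: multiply arrival rate by observed latency and compare against your concurrency limit (numbers below are illustrative):

```python
# L = lambda * W: in-flight work implied by arrival rate and latency.
arrival_rate = 500         # requests/sec
avg_time_in_system = 0.8   # seconds, including queue wait

in_flight = arrival_rate * avg_time_in_system   # ~400 concurrent requests
worker_slots = 300                              # pool capacity
saturated = in_flight > worker_slots            # True: backlog is growing
```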
Queueing intuition: as utilization → 100%, wait time explodes non-linearly—tails get fat.
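A hedged illustration of that blow-up, assuming the simplest M/M/1 model where mean time in system is W = (1/μ) / (1 − ρ):

```python
service_time = 0.010  # 10 ms mean service time (illustrative)

def time_in_system(utilization: float) -> float:
    """Mean time in an M/M/1 system at utilization rho < 1."""
    return service_time / (1.0 - utilization)

w50 = time_in_system(0.50)   # ~0.02 s: 2x the service time
w90 = time_in_system(0.90)   # ~0.10 s: 10x
w99 = time_in_system(0.99)   # ~1.00 s: 100x, for the same service time
```

Real systems aren't M/M/1, but the shape holds: the last few points of utilization cost orders of magnitude in wait time.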
Understanding
Teams that optimize mean latency while p99 regresses ship “fast on average, broken for real users.” Retries and fan-out make tails worse: one slow hop becomes many concurrent waits unless budgeted.
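The fan-out effect is just probability: if each hop exceeds its p99 with probability p, a request that waits on N parallel hops is slow with probability 1 − (1 − p)^N (the function below is a sketch, not a library API):

```python
def fanout_slow_probability(p_slow: float, n_backends: int) -> float:
    """P(at least one of n parallel calls is slow) when each is slow w.p. p."""
    return 1.0 - (1.0 - p_slow) ** n_backends

# A 1%-per-hop tail becomes roughly a quarter of all requests at 30-way fan-out
p = fanout_slow_probability(0.01, 30)   # ~0.26
```

This is why per-dependency tails must be far tighter than the end-to-end SLO.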
SLOs are product contracts—not dashboards. If you don’t allocate error budget consciously, every release erodes reliability by a thousand paper cuts.
Senior understanding
| Probe | Strong answer |
|---|---|
| “Which percentile for SLO?” | Named percentile matches user journey + retry behavior; explain trade vs cost |
| “Multi-region?” | Regional SLO vs global; replica lag and routing affect measured SLI |
| “Error budget exhausted?” | Feature freeze, tighten rollouts, incident-prevention narrative |
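A hedged sketch of the multi-window burn-rate idea behind "error budget exhausted" responses (the 14.4x threshold is a commonly cited paging threshold for a 1-hour window; the function name and numbers are illustrative):

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'budget-neutral' the budget is burning."""
    return error_ratio / (1.0 - slo)

# 99% SLO: budget-neutral error ratio is 1%. A sustained 15x burn
# exhausts a 30-day budget in about two days.
rate = burn_rate(error_ratio=0.15, slo=0.99)   # ~15x burn
page = rate >= 14.4                            # True: page, then freeze
```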