Tail latency & SLOs
Core details
SLI = how you measure good (availability, success ratio, latency threshold). SLO = target for that SLI over a window (e.g. “99% of Checkout API requests < 400 ms monthly”). Error budget = allowed “bad” fraction before you freeze features for reliability work (100% − SLO over the window).
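The budget arithmetic above can be sketched in a few lines (a minimal sketch; the function name and the example volumes are illustrative, not from any library):

```python
# Error budget = (100% - SLO) applied to the window's request volume.
def error_budget(slo: float, total_requests: int) -> int:
    """Allowed 'bad' requests in the window for a ratio-based SLO."""
    return round(total_requests * (1.0 - slo))

# Example: 99% monthly SLO on 10M checkout requests
budget = error_budget(0.99, 10_000_000)   # 100,000 bad requests allowed
remaining = budget - 37_500               # budget left after 37,500 failures
```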
| Percentile | Typical use |
|---|---|
| p50 | typical experience, capacity shorthand |
| p90–p95 | UX + many product SLOs |
| p99+ | worst users, retry storms, abandonment |
Why means lie: Arithmetic mean hides bimodal distributions (fast cache hits vs slow misses) and outliers that dominate user frustration.
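A minimal stdlib sketch of the cache-hit/cache-miss case: the sample below is synthetic, but it shows a mean that looks healthy while p99 exposes the slow path.

```python
import statistics

# Bimodal sample: 95% fast cache hits, 5% slow misses (synthetic data)
latencies_ms = [20] * 95 + [2000] * 5

mean = statistics.mean(latencies_ms)                 # 119 ms: looks "fine"
p50 = statistics.quantiles(latencies_ms, n=100)[49]  # 20 ms: typical user
p99 = statistics.quantiles(latencies_ms, n=100)[98]  # 2000 ms: the real tail
```

Every miss user waits 2 s, yet the mean sits near 119 ms; only the high percentiles surface the problem.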
Little’s Law (qualitative): average number of requests in the system ≈ arrival rate × average time in system (L = λW). Backlog growth often signals saturation before CPU is pegged.
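Little’s Law makes a quick saturation check possible: multiply arrival rate by observed latency and compare against your concurrency limit (numbers below are illustrative):

```python
# L = lambda * W: in-flight work implied by arrival rate and latency.
arrival_rate = 500         # requests/sec
avg_time_in_system = 0.8   # seconds, including queue wait

in_flight = arrival_rate * avg_time_in_system   # ~400 concurrent requests
worker_slots = 300                              # pool capacity
saturated = in_flight > worker_slots            # True: backlog is growing
```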
Queueing intuition: as utilization → 100%, wait time explodes non-linearly—tails get fat.
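A hedged illustration of that blow-up, assuming the simplest M/M/1 model where mean time in system is W = (1/μ) / (1 − ρ):

```python
service_time = 0.010  # 10 ms mean service time (illustrative)

def time_in_system(utilization: float) -> float:
    """Mean time in an M/M/1 system at utilization rho < 1."""
    return service_time / (1.0 - utilization)

w50 = time_in_system(0.50)   # ~0.02 s: 2x the service time
w90 = time_in_system(0.90)   # ~0.10 s: 10x
w99 = time_in_system(0.99)   # ~1.00 s: 100x, for the same service time
```

Real systems aren't M/M/1, but the shape holds: the last few points of utilization cost orders of magnitude in wait time.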
Understanding
Teams that optimize mean latency while p99 regresses ship “fast on average, broken for real users.” Retries and fan-out make tails worse: one slow hop becomes many concurrent waits unless budgeted.
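The fan-out effect is just probability: if each hop exceeds its p99 with probability p, a request that waits on N parallel hops is slow with probability 1 − (1 − p)^N (the function below is a sketch, not a library API):

```python
def fanout_slow_probability(p_slow: float, n_backends: int) -> float:
    """P(at least one of n parallel calls is slow) when each is slow w.p. p."""
    return 1.0 - (1.0 - p_slow) ** n_backends

# A 1%-per-hop tail becomes roughly a quarter of all requests at 30-way fan-out
p = fanout_slow_probability(0.01, 30)   # ~0.26
```

This is why per-dependency tails must be far tighter than the end-to-end SLO.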
SLOs are product contracts—not dashboards. If you don’t allocate error budget consciously, every release erodes reliability by a thousand paper cuts.
Senior understanding
| Probe | Strong answer |
|---|---|
| “Which percentile for SLO?” | Named percentile matches user journey + retry behavior; explain trade vs cost |
| “Multi-region?” | Regional SLO vs global; replica lag and routing affect measured SLI |
| “Error budget exhausted?” | Feature freeze, tighten rollouts, incident-prevention narrative |
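A hedged sketch of the multi-window burn-rate idea behind "error budget exhausted" responses (the 14.4x threshold is a commonly cited paging threshold for a 1-hour window; the function name and numbers are illustrative):

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'budget-neutral' the budget is burning."""
    return error_ratio / (1.0 - slo)

# 99% SLO: budget-neutral error ratio is 1%. A sustained 15x burn
# exhausts a 30-day budget in about two days.
rate = burn_rate(error_ratio=0.15, slo=0.99)   # ~15x burn
page = rate >= 14.4                            # True: page, then freeze
```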