THN Interview Prep

Tail latency & SLOs

Core details

SLI = how you measure good (availability, success ratio, latency threshold). SLO = target for that SLI over a window (e.g. “99% of Checkout API requests < 400 ms monthly”). Error budget = allowed “bad” fraction before you freeze features for reliability work (100% − SLO over the window).
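The budget arithmetic for that 99% example can be sketched as follows (the monthly request volume is an assumed figure, not from the source):

```python
# Error budget for a 99% monthly SLO (traffic volume is hypothetical).
slo = 0.99
monthly_requests = 10_000_000           # assumed volume, for illustration
budget_fraction = 1 - slo               # 1% of requests may breach the SLI
budget_requests = round(monthly_requests * budget_fraction)

# The same budget expressed as downtime-equivalent over a 30-day window:
budget_hours = 30 * 24 * budget_fraction  # ~7.2 hours of "everything bad" time

print(budget_requests, round(budget_hours, 1))  # 100000 7.2
```

Once those ~100k bad requests (or ~7.2 hours) are spent, the error budget is exhausted for the window.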

| Percentile | Typical use |
| --- | --- |
| p50 | typical experience, capacity shorthand |
| p90–p95 | UX + many product SLOs |
| p99+ | worst users, retry storms, abandonment |

Why means lie: Arithmetic mean hides bimodal distributions (fast cache hits vs slow misses) and outliers that dominate user frustration.
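A minimal illustration of the cache-hit/miss case, with made-up latencies (90% hits at ~20 ms, 10% misses at ~900 ms):

```python
import statistics

# Hypothetical bimodal latency sample: 90 fast cache hits, 10 slow misses (ms).
latencies = [20] * 90 + [900] * 10

mean = statistics.mean(latencies)                  # 108.0 ms -- looks "fine"
p50 = statistics.quantiles(latencies, n=100)[49]   # 20 ms  -- typical user is fast
p99 = statistics.quantiles(latencies, n=100)[98]   # 900 ms -- tail user is 45x slower
```

The mean (108 ms) describes nobody's experience: real requests take either ~20 ms or ~900 ms.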

Little’s Law (qualitative): average work in system = arrival rate × average time in system (L = λW). Backlog growth often means saturation before CPU is pegged.
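The arithmetic is one multiplication; the numbers below are illustrative, not from the source:

```python
# Little's Law: L = lambda * W (items in system = arrival rate * time in system).
# Hypothetical service: 200 req/s arriving, each spending 50 ms in flight.
arrival_rate = 200       # requests per second (lambda)
time_in_system = 0.050   # average seconds per request (W)

in_flight = arrival_rate * time_in_system  # L = 10 concurrent requests on average
```

If measured concurrency climbs above what λ × W predicts, requests are queueing somewhere.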

Queueing intuition: as utilization → 100%, wait time explodes non-linearly—tails get fat.

Understanding

Teams that optimize mean latency while p99 regresses ship “fast on average, broken for real users.” Retries and fan-out make tails worse: one slow hop becomes many concurrent waits unless budgeted.
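The fan-out effect is just probability arithmetic (per-hop slow probability is an assumed number):

```python
# If one backend call misses its latency target with probability p, a request
# that fans out to n backends and waits for all of them is slow with
# probability 1 - (1 - p)**n.
p_slow = 0.01  # hypothetical: each hop is "slow" 1% of the time

for n in (1, 10, 50):
    p_request_slow = 1 - (1 - p_slow) ** n
    print(f"fan-out {n:2d}: {p_request_slow:.1%} of requests hit the tail")
```

With 50-way fan-out, a 1% per-hop tail becomes roughly a 40% per-request tail: a p99 problem on one hop is a median problem for the whole request.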

SLOs are product contracts—not dashboards. If you don’t allocate error budget consciously, every release erodes reliability by a thousand paper cuts.

Senior understanding

| Probe | Strong answer |
| --- | --- |
| “Which percentile for SLO?” | Name a percentile that matches the user journey + retry behavior; explain the trade-off vs cost |
| “Multi-region?” | Regional SLO vs global; replica lag and routing affect the measured SLI |
| “Error budget exhausted?” | Feature freeze, tighten rollouts, incident-prevention narrative |
