Latency vs Throughput: Little's Law, Percentiles, Tails, Batching
Definition
Latency is time to complete one unit of work (request, message, job). Throughput is work completed per unit time (RPS, jobs/sec). They are related but not interchangeable: a system can have high throughput and terrible tail latency if a few requests stall.
Little's Law (stable system, long-run averages): L = λ × W
- L — average number of items in the system (queue + service)
- λ — average arrival rate
- W — average time in system (waiting + processing)
Intuition: if items arrive faster than they leave, queues grow and latency explodes—non-linear pain near saturation.
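In code, the relationship is one multiplication; a minimal sketch with illustrative numbers:

```python
def littles_law_items_in_system(arrival_rate_rps: float,
                                avg_time_in_system_s: float) -> float:
    """L = lambda * W for a stable system (long-run averages)."""
    return arrival_rate_rps * avg_time_in_system_s

# 200 requests/sec arriving, each spending 0.05 s in the system
# => on average 10 requests in flight (queued + in service).
in_flight = littles_law_items_in_system(200, 0.05)
```

Run it backwards, too: observed in-flight count divided by arrival rate estimates W without instrumenting every request.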
Percentiles — p50 (median), p95, p99: the latency below which 50%, 95%, 99% of requests fall. Tail latency — behavior at high percentiles; often dominates user-perceived experience when requests fan out to many dependencies (tail amplification).
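Percentiles come straight from a sorted sample; a minimal sketch using the nearest-rank method (no interpolation), with made-up latencies:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample such that
    at least p% of values fall at or below it."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

latencies_ms = [12, 15, 14, 13, 200, 16, 15, 14, 13, 500]
p50 = percentile(latencies_ms, 50)  # 14 ms -- looks healthy
p99 = percentile(latencies_ms, 99)  # 500 ms -- the tail tells the real story
```

The average of this sample (~81 ms) describes no request anyone actually experienced, which is why SLOs are stated in percentiles.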
Batching — combining work to improve throughput (fewer syscalls, better disk write patterns) at the cost of per-item latency (wait to fill a batch).
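The latency cost of batching is easy to bound. Assuming uniform arrivals and no flush timeout (both assumptions of this sketch), the first item in a batch waits for the rest to arrive:

```python
def worst_case_batch_wait_s(batch_size: int, arrival_rate_rps: float) -> float:
    """The first item in a batch waits for the remaining
    batch_size - 1 arrivals before the batch is flushed."""
    return (batch_size - 1) / arrival_rate_rps

# At 1000 req/s, filling a batch of 64 makes the first item wait ~63 ms:
# throughput gained on the write path, per-item latency paid up front.
wait = worst_case_batch_wait_s(64, 1000)
```

This is why real batchers pair a size cap with a flush interval: whichever fires first bounds the latency cost.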
Why it matters in interviews
Optimizing only average latency is a classic miss. Interviewers want SLOs in p99, discussion of fan-out to microservices, and backpressure when Little's Law says the queue is blowing up. Batching appears in Kafka, DB commit logs, and GPU inference—know the tradeoff.
Tradeoffs
- Lower latency — Smaller batches and more connections; pays for it in wasted CPU and lower efficiency.
- Higher throughput — Larger batches, more buffering, risk of head-of-line blocking.
- Chasing p99 — Expensive: timeouts, retries, hedged requests can hurt if misapplied.
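The hedged-request lever from the last bullet can be sketched with Python's `concurrent.futures`; the `hedge_after_s` delay and two-worker pool are illustrative choices, not a standard API:

```python
import concurrent.futures as cf

def hedged_call(fn, hedge_after_s=0.05):
    """Call fn; if no answer within hedge_after_s, fire a backup copy
    and return whichever finishes first. Hedging can double load,
    so production systems hedge only the slowest few percent."""
    pool = cf.ThreadPoolExecutor(max_workers=2)
    try:
        first = pool.submit(fn)
        try:
            return first.result(timeout=hedge_after_s)
        except cf.TimeoutError:
            backup = pool.submit(fn)
            done, _ = cf.wait({first, backup}, return_when=cf.FIRST_COMPLETED)
            return done.pop().result()
    finally:
        pool.shutdown(wait=False)  # don't block on the losing copy
```

The design point this glosses over is abandoning the loser: here the slow copy simply finishes in the background, where a real system would cancel it to reclaim capacity.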
Concrete examples
- Search service — Each query touches 20 shards; one slow shard sets p99 unless you use hedged reads or deadline propagation—tail dominates.
- Log ingestion — Batching 8–64 KB writes improves throughput; flush interval adds milliseconds of latency acceptable for logs.
- Checkout API — A p95 SLO of 300 ms means dependency budgets must sum to under that with margin. Little's Law warns you cannot keep raising λ without W growing unless you add capacity (scaling out drops per-server λ, which keeps W in check).
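The capacity point in the checkout example can be made concrete: Little's Law says λ × W requests are in flight at any moment, so dividing by per-server concurrency gives a floor on fleet size (the numbers below are illustrative):

```python
import math

def servers_needed(arrival_rate_rps: float, time_in_system_s: float,
                   concurrency_per_server: int) -> int:
    """L = lambda * W requests are in flight; divide by per-server capacity."""
    in_flight = arrival_rate_rps * time_in_system_s
    return math.ceil(in_flight / concurrency_per_server)

# 2000 RPS at 0.3 s per request = 600 concurrent requests;
# with 100 in-flight requests per server, 6 servers is the floor.
fleet = servers_needed(2000, 0.3, 100)
```

Treat the result as a floor, not a target: near saturation W grows nonlinearly, so real fleets run with headroom below this bound.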
How to say it in 30 seconds
"I separate median from p99—SLIs should be percentile-based when UX depends on slow paths. Little's Law reminds me that rising queue depth means latency rises unless I add service capacity or cut arrival rate. Batching buys throughput but pays latency; I size batches against SLOs."
Common follow-up questions
- Why does fan-out make p99 worse? The parent's latency is the max of its children's latencies, and the chance that at least one child is slow grows with child count.
- What is head-of-line blocking? One slow request blocks others behind it in the same TCP stream or single-threaded pipeline—HTTP/2 multiplexing and separate queues mitigate.
- Coordinated omission — When a load generator waits for a slow response before issuing its next request, it silently drops the samples that would have queued during the stall, so measured latency under overload looks far better than what users experienced.
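The fan-out answer above is just independence arithmetic: if each of N children beats the deadline with probability p, the parent beats it with probability p^N (independence is the assumption here):

```python
def parent_success_prob(child_p: float, fan_out: int) -> float:
    """P(all children answer in time), assuming independent children."""
    return child_p ** fan_out

# A shard that is fast 99% of the time looks fine alone, but fan out
# to 20 shards and only ~82% of parent requests beat the deadline:
# the 1% child tail became an ~18% parent tail.
prob = parent_success_prob(0.99, 20)
```

This is why hedged reads and deadline propagation target the children's tail rather than their median.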
Cross-links (building blocks)
- Load balancers, message queues, and rate limiting are primary levers for protecting tail latency under load—see System design curriculum overview.