Latency vs Throughput: Little's Law, Percentiles, Tails, Batching
Definition
Latency is time to complete one unit of work (request, message, job). Throughput is work completed per unit time (RPS, jobs/sec). They are related but not interchangeable: a system can have high throughput and terrible tail latency if a few requests stall.
Little's Law (stable system, long-run averages): L = λ × W
- L — average number of items in the system (queue + service)
- λ — average arrival rate
- W — average time in system (waiting + processing)
Intuition: if items arrive faster than they leave, queues grow and latency explodes—non-linear pain near saturation.
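In code, the relationship is one multiplication; a minimal sketch with illustrative numbers:

```python
def littles_law_items_in_system(arrival_rate_rps: float,
                                avg_time_in_system_s: float) -> float:
    """L = lambda * W for a stable system (long-run averages)."""
    return arrival_rate_rps * avg_time_in_system_s

# 200 requests/sec arriving, each spending 0.05 s in the system
# => on average 10 requests in flight (queued + in service).
in_flight = littles_law_items_in_system(200, 0.05)
```

Run it backwards, too: observed in-flight count divided by arrival rate estimates W without instrumenting every request.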
Percentiles — p50 (median), p95, p99: the latency below which 50%, 95%, 99% of requests fall. Tail latency — behavior at high percentiles; often dominates user-perceived experience when requests fan out to many dependencies (tail amplification).
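Percentiles come straight from a sorted sample; a minimal sketch using the nearest-rank method (no interpolation), with made-up latencies:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample such that
    at least p% of values fall at or below it."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

latencies_ms = [12, 15, 14, 13, 200, 16, 15, 14, 13, 500]
p50 = percentile(latencies_ms, 50)  # 14 ms -- looks healthy
p99 = percentile(latencies_ms, 99)  # 500 ms -- the tail tells the real story
```

The average of this sample (~81 ms) describes no request anyone actually experienced, which is why SLOs are stated in percentiles.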
Batching — combining work to improve throughput (fewer syscalls, better disk write patterns) at the cost of per-item latency (wait to fill a batch).
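The latency cost of batching is easy to bound. Assuming uniform arrivals and no flush timeout (both assumptions of this sketch), the first item in a batch waits for the rest to arrive:

```python
def worst_case_batch_wait_s(batch_size: int, arrival_rate_rps: float) -> float:
    """The first item in a batch waits for the remaining
    batch_size - 1 arrivals before the batch is flushed."""
    return (batch_size - 1) / arrival_rate_rps

# At 1000 req/s, filling a batch of 64 makes the first item wait ~63 ms:
# throughput gained on the write path, per-item latency paid up front.
wait = worst_case_batch_wait_s(64, 1000)
```

This is why real batchers pair a size cap with a flush interval: whichever fires first bounds the latency cost.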
Why it matters in interviews
Optimizing only average latency is a classic miss. Interviewers want SLOs in p99, discussion of fan-out to microservices, and backpressure when Little's Law says the queue is blowing up. Batching appears in Kafka, DB commit logs, and GPU inference—know the tradeoff.
Tradeoffs
- Lower latency — Smaller batches and more connections; pays for it in wasted CPU and lower efficiency.
- Higher throughput — Larger batches, more buffering, risk of head-of-line blocking.
- Chasing p99 — Expensive: timeouts, retries, hedged requests can hurt if misapplied.
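The hedged-request lever from the last bullet can be sketched with Python's `concurrent.futures`; the `hedge_after_s` delay and two-worker pool are illustrative choices, not a standard API:

```python
import concurrent.futures as cf

def hedged_call(fn, hedge_after_s=0.05):
    """Call fn; if no answer within hedge_after_s, fire a backup copy
    and return whichever finishes first. Hedging can double load,
    so production systems hedge only the slowest few percent."""
    pool = cf.ThreadPoolExecutor(max_workers=2)
    try:
        first = pool.submit(fn)
        try:
            return first.result(timeout=hedge_after_s)
        except cf.TimeoutError:
            backup = pool.submit(fn)
            done, _ = cf.wait({first, backup}, return_when=cf.FIRST_COMPLETED)
            return done.pop().result()
    finally:
        pool.shutdown(wait=False)  # don't block on the losing copy
```

The design point this glosses over is abandoning the loser: here the slow copy simply finishes in the background, where a real system would cancel it to reclaim capacity.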
Concrete examples
- Search service — Each query touches 20 shards; one slow shard sets p99 unless you use hedged reads or deadline propagation—tail dominates.
- Log ingestion — Batching 8–64 KB writes improves throughput; flush interval adds milliseconds of latency acceptable for logs.
- Checkout API — A p95 SLO of 300 ms means dependency budgets must sum to under that with margin. Little's Law warns you cannot keep raising λ without W growing unless you add capacity (scaling out drops per-server λ, which keeps W in check).
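The capacity point in the checkout example can be made concrete: Little's Law says λ × W requests are in flight at any moment, so dividing by per-server concurrency gives a floor on fleet size (the numbers below are illustrative):

```python
import math

def servers_needed(arrival_rate_rps: float, time_in_system_s: float,
                   concurrency_per_server: int) -> int:
    """L = lambda * W requests are in flight; divide by per-server capacity."""
    in_flight = arrival_rate_rps * time_in_system_s
    return math.ceil(in_flight / concurrency_per_server)

# 2000 RPS at 0.3 s per request = 600 concurrent requests;
# with 100 in-flight requests per server, 6 servers is the floor.
fleet = servers_needed(2000, 0.3, 100)
```

Treat the result as a floor, not a target: near saturation W grows nonlinearly, so real fleets run with headroom below this bound.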
How to say it in 30 seconds
"I separate median from p99—SLIs should be percentile-based when UX depends on slow paths. Little's Law reminds me that rising queue depth means latency rises unless I add service capacity or cut arrival rate. Batching buys throughput but pays latency; I size batches against SLOs."
Common follow-up questions
- Why does fan-out make p99 worse? The parent's latency is the max of its children's latencies, and the chance that at least one child is slow grows with child count.
- What is head-of-line blocking? One slow request blocks others behind it in the same TCP stream or single-threaded pipeline—HTTP/2 multiplexing and separate queues mitigate.
- Coordinated omission — When a load generator waits for a slow response before issuing its next request, it silently drops the samples that would have queued during the stall, so measured latency under overload looks far better than what users experienced.
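The fan-out answer above is just independence arithmetic: if each of N children beats the deadline with probability p, the parent beats it with probability p^N (independence is the assumption here):

```python
def parent_success_prob(child_p: float, fan_out: int) -> float:
    """P(all children answer in time), assuming independent children."""
    return child_p ** fan_out

# A shard that is fast 99% of the time looks fine alone, but fan out
# to 20 shards and only ~82% of parent requests beat the deadline:
# the 1% child tail became an ~18% parent tail.
prob = parent_success_prob(0.99, 20)
```

This is why hedged reads and deadline propagation target the children's tail rather than their median.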
Cross-links (building blocks)
- Load balancers, message queues, and rate limiting are primary levers for protecting tail latency under load—see System design curriculum overview.