THN Interview Prep

Latency vs Throughput: Little's Law, Percentiles, Tails, Batching

Definition

Latency is time to complete one unit of work (request, message, job). Throughput is work completed per unit time (RPS, jobs/sec). They are related but not interchangeable: a system can have high throughput and terrible tail latency if a few requests stall.

Little's Law (stable system, long-run averages): L = λ × W

  • L — average number of items in the system (queue + service)
  • λ — average arrival rate
  • W — average time in system (waiting + processing)

Intuition: if items arrive faster than they leave, queues grow and latency explodes—non-linear pain near saturation.
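A back-of-the-envelope sketch of Little's Law in Python (the rates are illustrative, not from any real system):

```python
# Little's Law: L = lambda * W (long-run averages in a stable system).
arrival_rate = 200.0    # lambda: requests per second
time_in_system = 0.05   # W: seconds per request (waiting + processing)

items_in_system = arrival_rate * time_in_system  # L = 10 requests in flight

# Rearranged: if concurrency is capped (say, a pool of 10 workers) and
# lambda keeps rising, W must rise too -- queues grow, latency explodes.
max_in_flight = 10
max_sustainable_rate = max_in_flight / time_in_system  # ~200 req/s ceiling
```

Past that ceiling the system is unstable: the queue grows without bound and W is no longer a constant you can plan around.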

Percentiles — p50 (median), p95, p99: the latency below which 50%, 95%, 99% of requests fall. Tail latency — behavior at high percentiles; often dominates user-perceived experience when requests fan out to many dependencies (tail amplification).
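A minimal percentile calculation over hypothetical latency samples (nearest-rank method; the numbers are made up to show how one stall separates the mean from the median):

```python
import math

def percentile(samples, p):
    """Latency below which p% of samples fall (nearest-rank method)."""
    ranked = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[idx]

# Hypothetical latencies in milliseconds: mostly fast, one stall.
latencies = [12, 11, 13, 12, 14, 11, 12, 13, 250, 12]
p50 = percentile(latencies, 50)         # 12 ms: the typical request
p99 = percentile(latencies, 99)         # 250 ms: the stall owns the tail
mean = sum(latencies) / len(latencies)  # 36 ms: describes no real request
```

This is why averaging hides tails: one 250 ms stall triples the mean while leaving the median untouched.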

Batching — combining work to improve throughput (fewer syscalls, better disk write patterns) at the cost of per-item latency (wait to fill a batch).
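The batching tradeoff in arithmetic, with assumed costs (arrival rate and write cost are illustrative):

```python
# Hypothetical ingestion pipeline: items arrive at a steady rate and are
# flushed either one at a time or in fixed-size batches.
arrival_interval_ms = 1.0   # one item per millisecond
write_cost_ms = 5.0         # fixed cost per flush (syscall + fsync)

# Per-item flush: every item pays the full write cost.
per_item_throughput = 1000.0 / write_cost_ms        # 200 items/s ceiling

# Batch of 64: one write amortized across 64 items...
batch_size = 64
batched_cost_per_item = write_cost_ms / batch_size  # 0.078 ms/item

# ...but the first item in each batch waits for the batch to fill.
worst_case_added_latency = (batch_size - 1) * arrival_interval_ms  # 63 ms
```

Throughput improves ~64x while the unlucky first item eats 63 ms of queueing delay; a flush timer caps that wait at the cost of smaller, less efficient batches.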

Why it matters in interviews

Optimizing only average latency is a classic miss. Interviewers want SLOs in p99, discussion of fan-out to microservices, and backpressure when Little's Law says the queue is blowing up. Batching appears in Kafka, DB commit logs, and GPU inference—know the tradeoff.

Tradeoffs

  • Lower latency — Smaller batches, more connections, more CPU waste, less efficiency.
  • Higher throughput — Larger batches, more buffering, risk of head-of-line blocking.
  • Chasing p99 — Expensive: timeouts, retries, hedged requests can hurt if misapplied.
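As a sketch of one of those p99 tools, a hedged request fires a backup copy of a call if the primary is slow to answer, and takes whichever finishes first. This is an illustrative implementation on `concurrent.futures`, not a standard API, and it assumes the call is idempotent and safe to duplicate:

```python
import concurrent.futures

def hedged_call(fn, hedge_after, executor):
    """Run fn; if it hasn't answered within hedge_after seconds,
    launch a backup copy and return whichever result lands first.
    Sketch only: fn must be idempotent and safe to run twice."""
    primary = executor.submit(fn)
    try:
        return primary.result(timeout=hedge_after)
    except concurrent.futures.TimeoutError:
        backup = executor.submit(fn)
        done, _ = concurrent.futures.wait(
            [primary, backup],
            return_when=concurrent.futures.FIRST_COMPLETED)
        return done.pop().result()
```

Note the cost: every hedge is extra load, so misapplied hedging can push a saturated system further past its Little's Law ceiling.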

Concrete examples

  1. Search service — Each query touches 20 shards; one slow shard sets p99 unless you use hedged reads or deadline propagation—tail dominates.
  2. Log ingestion — Batching 8–64 KB writes improves throughput; the flush interval adds milliseconds of latency, acceptable for logs.
  3. Checkout API — A p95 SLO of 300 ms means dependency latency budgets must sum under that with margin. Little's Law warns that you cannot keep raising λ without W growing unless you add capacity (per-server λ, and hence W, drops when you scale out correctly).

How to say it in 30 seconds

"I separate median from p99—SLIs should be percentile-based when UX depends on slow paths. Little's Law reminds me that rising queue depth means latency rises unless I add service capacity or cut arrival rate. Batching buys throughput but pays latency; I size batches against SLOs."

Common follow-up questions

  • Why does fan-out make p99 worse? Probability that any child is slow rises with child count; max of latencies approximates tail.
  • What is head-of-line blocking? One slow request blocks others behind it in the same TCP stream or single-threaded pipeline—HTTP/2 multiplexing and separate queues mitigate.
  • Coordinated omission — A load generator that pauses while waiting on a slow response silently skips the requests it would have sent, so measured percentiles understate real latency; measure against a fixed intended rate, including during overload.
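The fan-out answer above reduces to one line of probability. With illustrative numbers (20 children, each in its own tail 1% of the time):

```python
# A fan-out request is slow whenever ANY child is slow:
#   P(request slow) = 1 - (1 - p)^n
n_children = 20
p_child_slow = 0.01  # each child exceeds its budget 1% of the time (its p99)

p_request_slow = 1 - (1 - p_child_slow) ** n_children
# ~0.18: roughly 18% of fan-out requests hit some child's tail, so the
# parent's p99 is governed by child behavior far below the children's p99.
```

This is why taming a fan-out service's p99 usually means taming each child's p99.9, or cutting the request short with hedging and deadline propagation.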

See also: System design curriculum overview
