Performance & Optimization
Performance engineering is measurement → hypothesis → single change → proof, tied to a user- or revenue-facing metric. Without that anchor, “optimization” becomes guesswork and resume bullet theater.
How to use this page
- Use Core basics as a vocabulary and mental model checklist.
- Use Profiling playbooks as literal tool menus (what you open first).
- Use Recognition cues to avoid mis-classifying a queueing problem as “CPU.”
- Run Study sessions with real traces or realistic invented waterfalls.
Topic study plan (deep pages)
Topic notes: /performance/topics/... — Core details → Understanding → Senior understanding → Diagram.
| Topic | Focus |
|---|---|
| Tail latency & SLOs | Percentiles, error budgets, queueing intuition |
| Profiling the browser | Long tasks, layout, network waterfall |
| Profiling services & async | CPU vs wait, pools, traces |
| Database query path | Plans, stats, locks, replicas |
| Caching & retry amplification | Stampede, jitter, hedging cautions |
Authoring template (not in sidebar): `content/core-docs/performance/topics/topic-page-template.mdx` (`publishDocs: false`).
Core basics — decompose wall time
For any hop (browser frame, service request, DB query), break latency into:
| Bucket | What it usually is | First instruments |
|---|---|---|
| Queueing | waiting for thread, pool, partition lock, event-loop turn | pool depth, sched latency, queue depth |
| Compute | hot code, regex, (de)serialization | CPU flame / sample |
| IO | remote calls, disk, replication | span waterfall, iostat class tools |
| Serialization | JSON/protobuf size, encoding | payload metrics, alloc profile |
| Coordination | locks, barriers, chatty fan-out | contention profiles, trace spans |
Interview line: “I’d prove which bucket dominates before tuning—otherwise you ‘speed up’ the wrong stage.”
Little’s Law (qualitative)
In stable systems: average items in flight ≈ arrival rate × average time in system.
Sudden backlog growth often means saturation before CPU hits 100%.
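A quick worked example of the relation above, as a minimal sketch. The arrival rate, latencies, and `pool_size` are invented numbers for illustration, not measurements.

```python
def items_in_flight(arrival_rate_per_s: float, avg_time_in_system_s: float) -> float:
    """Little's Law for a stable system: L = lambda * W (averages only)."""
    return arrival_rate_per_s * avg_time_in_system_s


# Invented numbers: 200 req/s at 50 ms average latency -> about 10 in flight.
baseline = items_in_flight(200, 0.050)

# Same arrival rate, but a slow dependency pushes latency to 250 ms -> 50 in flight.
degraded = items_in_flight(200, 0.250)

pool_size = 32  # hypothetical worker/connection pool size
saturated = degraded > pool_size  # True: requests now queue even if CPU is idle
```

The point of the arithmetic: latency growth alone multiplies in-flight work, so a fixed-size pool can saturate while CPU utilization still looks healthy.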
Percentiles matter more than averages
| Stat | Good for |
|---|---|
| p50 / median | typical feel, capacity planning shorthand |
| p90–p95 | UX + many SLOs |
| high tails (p99+) | catastrophic retries / user abandonment |
Optimize the percentile your product SLO names—and say explicitly when you sacrificed tail for median intentionally (rare—but must be deliberate).
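To make the mean-versus-tail gap concrete, here is a small sketch using the nearest-rank percentile definition (one of several definitions in use; monitoring tools may interpolate differently). The latency samples are invented.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Invented data: 95 fast requests (20 ms) and 5 slow ones (2000 ms).
latencies_ms = [20] * 95 + [2000] * 5

mean_ms = statistics.mean(latencies_ms)  # 119 ms: matches no real request
p50 = percentile(latencies_ms, 50)       # 20 ms
p95 = percentile(latencies_ms, 95)       # 20 ms: still hides the slow 5%
p99 = percentile(latencies_ms, 99)       # 2000 ms: the tail users actually hit
```

With this shape, both the mean and p95 look tolerable while one request in twenty takes two seconds, which is why the percentile named in the SLO matters.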
Frontend performance (what to open first)
Chrome / Chromium Performance workflow
- Start recording → reproduce interaction once cleanly.
- Look for long tasks (main-thread blocks).
- Bottom-up: sort JS functions by self time.
- Layout markers: flagged reflow/layout events → map to offending read/write geometry pattern.
- Network: check critical request priority, LCP candidate discovery.
Heap memory investigations:
| Symptom | Tooling move |
|---|---|
| Detached DOM after route changes | Memory snapshot diff dominator paths |
| Steady climb | Allocation timeline / sampling profiler |
Synthetic vs field:
- Lighthouse (lab): reproducible and CI-friendly, good for catching regressions.
- CrUX (field): validates low-end phones and flaky networks. You need both narratives.
Synergy: revisit /frontend for rendering & UX contracts.
Backend / services performance playbooks
CPU vs wall time split
Wall ≫ CPU ⇒ waiting (locks, pools, downstream). Profile async traces before micro-optimizing functions.
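A minimal sketch of the split, using Python's `time.perf_counter` (wall clock) and `time.process_time` (CPU time of the current process). The two workloads are stand-ins: the sleep represents a lock, pool, or downstream wait.

```python
import time


def measure(fn):
    """Return (wall_seconds, cpu_seconds) for a single call of fn."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


def waiting_work():
    time.sleep(0.2)  # stands in for a lock, pool, or downstream wait


def compute_work():
    sum(i * i for i in range(300_000))  # stands in for hot code


wait_wall, wait_cpu = measure(waiting_work)  # wall far above CPU: waiting
comp_wall, comp_cpu = measure(compute_work)  # wall close to CPU: compute
```

When wall time is far above CPU time, a flame graph of on-CPU samples will mostly show you the wrong thing; reach for traces or off-CPU views first.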
Classic mistakes:
| Mistake | What it looks like | Fix direction |
|---|---|---|
| Blocking the event loop (EL) | intermittent latency spikes | offload / non-blocking libs |
| Unbounded parallelism | downstream brownout | semaphores, bulkheads |
| Retry storm | cascading 429/503 | backoff + jitter + budgets |
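For the "Unbounded parallelism" row, a minimal sketch of a semaphore-bounded fan-out using `asyncio.Semaphore`. `fake_downstream_call` and the limit of 4 are hypothetical; a real limit should come from what the downstream can absorb.

```python
import asyncio


async def run_bounded(tasks, limit):
    """Run zero-argument coroutine functions with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(task):
        async with semaphore:
            return await task()

    return await asyncio.gather(*(guarded(t) for t in tasks))


async def demo(n_calls=20, limit=4):
    in_flight = 0
    peak = 0

    async def fake_downstream_call():
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stands in for a remote call
        in_flight -= 1
        return "ok"

    results = await run_bounded([fake_downstream_call] * n_calls, limit)
    return peak, len(results)


peak, completed = asyncio.run(demo())  # peak never exceeds the limit of 4
```

All 20 calls complete, but the downstream never sees more than 4 at once. The cost is added queueing on the caller side, which should be visible in your own latency metrics.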
Pools & saturation signals
Monitor the time spent waiting to acquire DB connections, not only query duration. An exhausted pool looks like a “slow DB” even when the database itself is fine.
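One way to expose acquire wait as its own metric, sketched with a toy pool built on `queue.Queue`. `TimedPool` is a hypothetical illustration, not any driver's API; real pools usually expose a comparable wait or pending metric.

```python
import queue
import threading
import time


class TimedPool:
    """Toy connection pool that records how long each acquire waited."""

    def __init__(self, connections):
        self._free = queue.Queue()
        for conn in connections:
            self._free.put(conn)
        self.acquire_waits_s = []

    def acquire(self, timeout_s):
        start = time.perf_counter()
        conn = self._free.get(timeout=timeout_s)  # raises queue.Empty when exhausted
        self.acquire_waits_s.append(time.perf_counter() - start)
        return conn

    def release(self, conn):
        self._free.put(conn)


pool = TimedPool(["conn-1"])          # a single hypothetical connection
held = pool.acquire(timeout_s=1.0)    # first caller gets it immediately
threading.Timer(0.1, pool.release, args=[held]).start()  # released after ~100 ms
pool.acquire(timeout_s=1.0)           # second caller queues until the release
first_wait, second_wait = pool.acquire_waits_s
```

The second caller's roughly 100 ms is pure queueing. If it were folded into "query duration", the database would take the blame for a pool sizing problem.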
Traces must carry the budget: propagate the parent deadline to children so they do not start work that can no longer finish in time.
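A minimal sketch of deadline propagation, assuming an absolute deadline object passed down the call chain. `Deadline`, `DeadlineExceeded`, `call_child`, and the `send(timeout_s=...)` client signature are all hypothetical names for illustration; RPC frameworks that support deadlines have their own mechanisms.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when no budget is left for further downstream work."""


class Deadline:
    """Absolute deadline shared by a request and every child call it makes."""

    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self):
        return self._expires_at - self._clock()

    def child_timeout(self, reserve_s=0.0):
        """Timeout to hand to a child call, keeping reserve_s for our own work."""
        left = self.remaining() - reserve_s
        if left <= 0:
            raise DeadlineExceeded("budget spent; skip the child call")
        return left


def call_child(deadline, send, reserve_s=0.05):
    # `send` is a hypothetical client function that accepts a timeout in seconds.
    return send(timeout_s=deadline.child_timeout(reserve_s))
```

Usage idea: create `Deadline(1.0)` at the edge and pass it through every hop. A child called 700 ms in gets about 250 ms (after the 50 ms reserve) rather than a fresh default timeout, and a child called after the budget is spent is never started.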
Synergy: /backend, /databases.
Database performance fundamentals
Read path
Steps an interviewer expects you to articulate:
| Step | Investigation |
|---|---|
| Find query text | pg_stat_statements class metrics / slow log |
| Get plan shape | EXPLAIN ANALYZE (vendor equivalent) |
| Check row estimates | cardinality / stats freshness story |
| Check access path | sequential vs index scan vs bitmap—why chosen |
Staff nuance: sometimes the index isn’t “wrong”; the query needs a rewrite (covering projection, narrower SELECT, lateral batching patterns), not index soup.
Write path & amplification
Every index raises write cost; hot updates on rows covered by wide indexes cause bloat and contention. State that trade-off explicitly when you propose an index.
Replication & consistency illusion
Stale read replicas surface as intermittent “bugs” unless user-visible freshness cues are engineered. This classification error has commonly been mistaken for a raw performance issue.
Caching layers & invalidation realism
| Cache | Typical failure mode | Interview mitigation story |
|---|---|---|
| Browser HTTP | leaking auth’d HTML/CDN mishap | keyed URLs, surrogate keys |
| App local | stampedes after TTL | jitter + coalesce |
| Shared remote | stale business decisions | versioning + negative caching caution |
| ORM/session | phantom staleness layering | TTL + explicit invalidation events |
Thundering herd: be able to explain probabilistic early refresh conceptually, even if the vendor-specific implementation is deferred.
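A conceptual sketch of probabilistic early refresh, assuming one commonly cited form: each reader draws an exponentially distributed head start scaled by the recompute cost, and refreshes if that head start reaches the expiry time. The function name, parameters, and `beta` default are illustrative assumptions, not a specific cache's API.

```python
import math
import random


def should_refresh_early(now, expires_at, recompute_s, beta=1.0, rng=random.random):
    """Return True when this caller should recompute before the TTL expires.

    recompute_s: how long a recompute typically takes (seconds).
    beta: values above 1 refresh earlier, values below 1 refresh later.
    """
    u = 1.0 - rng()  # uniform in (0, 1], avoids log(0)
    head_start = -recompute_s * beta * math.log(u)  # exponentially distributed
    return now + head_start >= expires_at
```

Far from expiry almost no caller refreshes; close to expiry the chance rises steeply, so one early caller usually repopulates the entry before the whole fleet misses at once. Pairing this with request coalescing covers the case where several callers still decide to refresh together.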
Load shapes & amplification
| Pattern | Hazard | Controls |
|---|---|---|
| Retries without jitter | harmonic spikes | capped retries + backoff |
| Global fan-out timeouts | herd release | concurrency limits |
| Periodic cron alignment | spikes | jitter scheduling / rate smoothing |
| Autoscale lag | cold queue growth | provisioning / queue absorb strategy |
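For the "Retries without jitter" row, a minimal sketch of capped retries with exponential backoff and full jitter. `TransientError` and the default values are hypothetical; `sleep` and `rng` are injectable only so the behavior can be tested without real delays.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (for example 429/503)."""


def call_with_retries(fn, max_retries=3, base_s=0.1, cap_s=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn, retrying transient failures with capped, jittered backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise  # retry cap reached: surface the failure
            ceiling = min(cap_s, base_s * 2 ** attempt)
            sleep(rng() * ceiling)  # full jitter: uniform in [0, ceiling)
```

Randomizing the whole delay, rather than adding a small offset to a fixed schedule, is what spreads synchronized clients apart. A per-client or per-service retry budget is a separate control and is not shown here.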
Cold starts (serverless / JVM warmup): quantify how the cold penalty affects the tail, and tie it to the cost of provisioned concurrency.
Reliability interplay (error budgets)
Shrinking tails often trades against cost, complexity, or correctness windows. Reconcile that trade explicitly with the reliability error budget when relevant; not every perf win is unconditionally “good ops.”
Recognition cues (symptoms → drills)
| Symptom | Split first | Drill |
|---|---|---|
| High CPU but low QPS | user vs syscall vs GC | segregated profiler views |
| Memory climb steady | leak vs caching | snapshot diff timelines |
| API slow intermittent | tails vs saturation | percentile trace overlays |
| DB CPU low but latency high | waiting / locks / replicas | waits / locks / replication dashboard |
| Hot key / skew | uneven shard load | resharding narrative + cache pad |
Staff follow-ups: “What dashboard would pre-detect recurrence?” “What synthetic probe fails CI next time?”
Memory hooks
- USE per bottlenecked resource (Utilization, Saturation, Errors).
- One change hypothesis—no multi-variable mystery PRs pretending clarity.
- Little’s Law backlog intuition: queue depth ties together arrival rate and time in system.
Study sessions (timed)
Session R — Incident replay (35 min)
Reconstruct chronologically: symptom metrics → narrowing experiment → causal commit → remediation → preventative guard (new instrumentation).
Session P — Flame/waterfall reading (25 min)
Use a sanitized internal capture or invent plausible shape: annotate three hypotheses + validation step each.
Session S — Constraint swap (12 min verbal)
Alternate the characterization (CPU-bound ↔ IO-bound ↔ memory-bound) and, for each, state the diagnostic ordering and mitigations verbally without notes.
Diagrams
[Diagram: Bottleneck narrowing]
[Diagram: Retry amplification cartoon]
Pitfalls
- Reporting mean latency only while p95/p99 regress—retries hide the user story until production screams.
- Index theater: adding indexes without measuring write amplification, bloat, and maintenance windows.
- Tuning against micro-benchmarks on a laptop while production traffic is skewed, bursty, and multi-tenant.
- Shipping caches without staleness UX, especially for money, quotas, and entitlement counts.
- Fixing CPU while latency is dominated by queueing—you win flame graphs but lose users.
Related
- /frontend: rendering + main-thread discipline.
- /backend: timeouts, retries, pools, saturation.
- /databases (if present in nav): plans, isolation, replication lag.
- /dsa: algorithmic hotspots when profiling points to asymptotics.