# Profiling services & async

## Core details
Wall time ≫ CPU time means the service is waiting (IO, lock contention, connection-pool acquisition, downstream calls, a blocked event loop), not running "slow functions."
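A quick way to see this split is to measure wall time and CPU time around the same unit of work. The sketch below uses Node's built-in `process.hrtime.bigint()` and `process.cpuUsage()`; the `measure` helper is an illustrative name, not a library API.

```typescript
// Sketch: compare wall time vs CPU time around an async handler.
// If wall >> cpu, the handler was waiting (IO, pool, downstream), not computing.
async function measure<T>(
  fn: () => Promise<T>
): Promise<{ result: T; wallMs: number; cpuMs: number }> {
  const wallStart = process.hrtime.bigint();   // nanoseconds of wall-clock time
  const cpuStart = process.cpuUsage();         // user+system CPU so far
  const result = await fn();
  const cpu = process.cpuUsage(cpuStart);      // CPU microseconds spent since cpuStart
  const wallMs = Number(process.hrtime.bigint() - wallStart) / 1e6;
  const cpuMs = (cpu.user + cpu.system) / 1000;
  return { result, wallMs, cpuMs };
}

// Usage: a handler that mostly sleeps, simulating a slow downstream call.
measure(() => new Promise<void>((r) => setTimeout(r, 200))).then(({ wallMs, cpuMs }) => {
  console.log(`wall=${wallMs.toFixed(0)}ms cpu=${cpuMs.toFixed(1)}ms`);
});
```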
## First instruments
| Signal | Tool class |
|---|---|
| Per-request waterfall | distributed tracing (OpenTelemetry) |
| Pool wait | metrics on acquire time, not only query duration |
| Event-loop lag (Node) | perf_hooks, APM lag histograms |
| Saturation | queue depth, thread-pool utilization, goroutine scheduler latency (runtime-specific) |
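For the event-loop lag row, Node ships a ready-made histogram in `perf_hooks`: `monitorEventLoopDelay` samples how late timers fire, which is a direct proxy for a blocked loop. A minimal sketch (the reporting interval and thresholds are illustrative choices):

```typescript
import { monitorEventLoopDelay } from "node:perf_hooks";

// Sample event-loop delay; sustained p99 lag means something is blocking the loop.
const histogram = monitorEventLoopDelay({ resolution: 20 }); // sample every 20ms
histogram.enable();

// Report once per second, e.g. into your metrics pipeline. Values are nanoseconds.
setInterval(() => {
  const toMs = (ns: number) => ns / 1e6;
  console.log(
    `event-loop lag p50=${toMs(histogram.percentile(50)).toFixed(1)}ms ` +
      `p99=${toMs(histogram.percentile(99)).toFixed(1)}ms ` +
      `max=${toMs(histogram.max).toFixed(1)}ms`
  );
  histogram.reset();
}, 1000).unref(); // unref so the reporter doesn't keep the process alive
```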
## Classic patterns
| Pattern | What you see | Direction |
|---|---|---|
| Blocking event loop | CPU spikes + lag under load | non-blocking libraries, offload to workers |
| N+1 downstream | many short spans to same dependency | batch, cache |
| Retry storm | error rate + latency spike together | backoff, jitter, budgets |
| Cold pool | timeouts right after a deploy | pool sizing, warm-up; proxy-class fixes such as RDS Proxy |
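The "Retry storm" row's direction can be sketched concretely: capped exponential backoff with full jitter, plus an attempt budget so failures stay bounded. `withRetries` and its parameter names are illustrative, not a specific library's API.

```typescript
// Sketch: capped exponential backoff with full jitter and a bounded attempt budget.
async function withRetries<T>(
  fn: () => Promise<T>,
  { attempts = 4, baseMs = 100, capMs = 2000 } = {}
): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i + 1 >= attempts) throw err; // budget exhausted: give up, don't storm
      // Full jitter: uniform delay in [0, min(cap, base * 2^i)] de-correlates retries
      const delayMs = Math.random() * Math.min(capMs, baseMs * 2 ** i);
      await new Promise((r) => setTimeout(r, delayMs));
    }
  }
}
```

Without jitter, synchronized clients retry in lockstep and the latency spike and error rate feed each other, which is exactly the pattern the table describes.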
## Understanding
Async does not mean "free": each await creates a continuation, so under load memory and scheduling overhead matter. Propagate the parent's deadline so children stop doing useless work once the client has already timed out.
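Deadline propagation can be sketched with `AbortSignal`, assuming a recent Node runtime (`AbortSignal.timeout` and `AbortSignal.any` are standard in Node ≥ 20). `handleRequest` and `doDownstreamCall` are illustrative names, not a framework's API.

```typescript
// Sketch: a child inherits the parent's deadline combined with its own tighter budget.
function doDownstreamCall(signal: AbortSignal): Promise<string> {
  if (signal.aborted) return Promise.reject(signal.reason); // client already timed out
  return new Promise((resolve, reject) => {
    const t = setTimeout(() => resolve("done"), 50); // simulated downstream latency
    signal.addEventListener("abort", () => {
      clearTimeout(t); // stop the useless work as soon as the deadline passes
      reject(signal.reason);
    });
  });
}

async function handleRequest(parent: AbortSignal): Promise<string> {
  // Child aborts when EITHER the parent's deadline or its own 500ms budget fires.
  const child = AbortSignal.any([parent, AbortSignal.timeout(500)]);
  return doDownstreamCall(child);
}
```

The key property is that the signal flows down the call tree: when the top-level request aborts, every in-flight child cancels instead of completing work nobody will read.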
## Senior understanding

Staff narrative: "I'd split queueing time from compute time with a trace before changing any algorithms." Tie SLOs to tail percentiles (see Tail latency & SLOs).