# Profiling services & async

## Core details
Wall time ≫ CPU time means the service is waiting (IO, lock contention, connection-pool acquisition, downstream calls, a blocked event loop), not running "slow functions."
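A quick way to see this split is to measure wall time and CPU time around the same unit of work. The sketch below uses Node's built-in `process.hrtime.bigint()` and `process.cpuUsage()`; the `measure` helper is an illustrative name, not a library API.

```typescript
// Sketch: compare wall time vs CPU time around an async handler.
// If wall >> cpu, the handler was waiting (IO, pool, downstream), not computing.
async function measure<T>(
  fn: () => Promise<T>
): Promise<{ result: T; wallMs: number; cpuMs: number }> {
  const wallStart = process.hrtime.bigint();   // nanoseconds of wall-clock time
  const cpuStart = process.cpuUsage();         // user+system CPU so far
  const result = await fn();
  const cpu = process.cpuUsage(cpuStart);      // CPU microseconds spent since cpuStart
  const wallMs = Number(process.hrtime.bigint() - wallStart) / 1e6;
  const cpuMs = (cpu.user + cpu.system) / 1000;
  return { result, wallMs, cpuMs };
}

// Usage: a handler that mostly sleeps, simulating a slow downstream call.
measure(() => new Promise<void>((r) => setTimeout(r, 200))).then(({ wallMs, cpuMs }) => {
  console.log(`wall=${wallMs.toFixed(0)}ms cpu=${cpuMs.toFixed(1)}ms`);
});
```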
## First instruments
| Signal | Tool class |
|---|---|
| Per-request waterfall | distributed tracing (OpenTelemetry) |
| Pool wait | metrics on acquire time, not only query duration |
| Event-loop lag (Node) | perf_hooks, APM lag histograms |
| Saturation | queue depth, thread-pool utilization, goroutine scheduler latency (runtime-specific) |
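For the event-loop lag row, Node ships a ready-made histogram in `perf_hooks`: `monitorEventLoopDelay` samples how late timers fire, which is a direct proxy for a blocked loop. A minimal sketch (the reporting interval and thresholds are illustrative choices):

```typescript
import { monitorEventLoopDelay } from "node:perf_hooks";

// Sample event-loop delay; sustained p99 lag means something is blocking the loop.
const histogram = monitorEventLoopDelay({ resolution: 20 }); // sample every 20ms
histogram.enable();

// Report once per second, e.g. into your metrics pipeline. Values are nanoseconds.
setInterval(() => {
  const toMs = (ns: number) => ns / 1e6;
  console.log(
    `event-loop lag p50=${toMs(histogram.percentile(50)).toFixed(1)}ms ` +
      `p99=${toMs(histogram.percentile(99)).toFixed(1)}ms ` +
      `max=${toMs(histogram.max).toFixed(1)}ms`
  );
  histogram.reset();
}, 1000).unref(); // unref so the reporter doesn't keep the process alive
```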
## Classic patterns
| Pattern | What you see | Direction |
|---|---|---|
| Blocking event loop | CPU spikes + lag under load | non-blocking libraries, offload to workers |
| N+1 downstream | many short spans to same dependency | batch, cache |
| Retry storm | error rate + latency spike together | backoff, jitter, budgets |
| Cold pool | timeouts right after a deploy | pool sizing, warm-up; proxy-class fixes such as RDS Proxy |
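The "Retry storm" row's direction can be sketched concretely: capped exponential backoff with full jitter, plus an attempt budget so failures stay bounded. `withRetries` and its parameter names are illustrative, not a specific library's API.

```typescript
// Sketch: capped exponential backoff with full jitter and a bounded attempt budget.
async function withRetries<T>(
  fn: () => Promise<T>,
  { attempts = 4, baseMs = 100, capMs = 2000 } = {}
): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i + 1 >= attempts) throw err; // budget exhausted: give up, don't storm
      // Full jitter: uniform delay in [0, min(cap, base * 2^i)] de-correlates retries
      const delayMs = Math.random() * Math.min(capMs, baseMs * 2 ** i);
      await new Promise((r) => setTimeout(r, delayMs));
    }
  }
}
```

Without jitter, synchronized clients retry in lockstep and the latency spike and error rate feed each other, which is exactly the pattern the table describes.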
## Understanding
Async does not mean "free": each await creates a continuation, so under load memory and scheduling overhead matter. Propagate the parent's deadline so children stop doing useless work once the client has already timed out.
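Deadline propagation can be sketched with `AbortSignal`, assuming a recent Node runtime (`AbortSignal.timeout` and `AbortSignal.any` are standard in Node ≥ 20). `handleRequest` and `doDownstreamCall` are illustrative names, not a framework's API.

```typescript
// Sketch: a child inherits the parent's deadline combined with its own tighter budget.
function doDownstreamCall(signal: AbortSignal): Promise<string> {
  if (signal.aborted) return Promise.reject(signal.reason); // client already timed out
  return new Promise((resolve, reject) => {
    const t = setTimeout(() => resolve("done"), 50); // simulated downstream latency
    signal.addEventListener("abort", () => {
      clearTimeout(t); // stop the useless work as soon as the deadline passes
      reject(signal.reason);
    });
  });
}

async function handleRequest(parent: AbortSignal): Promise<string> {
  // Child aborts when EITHER the parent's deadline or its own 500ms budget fires.
  const child = AbortSignal.any([parent, AbortSignal.timeout(500)]);
  return doDownstreamCall(child);
}
```

The key property is that the signal flows down the call tree: when the top-level request aborts, every in-flight child cancels instead of completing work nobody will read.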
## Senior understanding

Staff narrative: "I'd split queueing time from compute time with a trace before changing any algorithms." Tie SLOs to tail percentiles (see Tail latency & SLOs).