Cost, latency & routing
GenAI cost and latency come from more than the final model call: retrieval, embeddings, reranking, tool IO, retries, context size, and observability all matter. Senior designs surface cost and latency budgets early rather than discovering them in production.
Latency path
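To make the latency path concrete, here is a minimal budget sketch in Python. The stage names and millisecond figures are illustrative assumptions, not measurements from any real deployment; the point is that retrieval and reranking stages add up before the model ever runs, and that parallelizing independent stages only saves the difference between them.

```python
# Rough per-stage latency budget for one RAG request.
# Stage names and figures are illustrative assumptions, not measurements.
STAGES_MS = {
    "query_embedding": 30,
    "vector_search": 60,
    "keyword_search": 40,      # can run in parallel with vector search
    "rerank": 120,
    "prompt_assembly": 5,
    "model_time_to_first_token": 400,
    "model_streaming_remainder": 1500,
}

def total_latency_ms(stages, parallel_groups=()):
    """Sum stage latencies; stages in a parallel group count once at their max."""
    total = sum(stages.values())
    for group in parallel_groups:
        group_times = [stages[s] for s in group]
        total -= sum(group_times) - max(group_times)
    return total

serial = total_latency_ms(STAGES_MS)
with_parallel_retrieval = total_latency_ms(
    STAGES_MS, parallel_groups=[("vector_search", "keyword_search")]
)
print(serial, with_parallel_retrieval)
```

Note how little parallel retrieval helps when the model call dominates: trimming context (which shrinks both prompt assembly and generation time) often moves the total more than overlapping two fast searches.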
Cost levers
| Lever | Helps | Risk |
|---|---|---|
| Smaller model router | Sends easy tasks to cheaper models. | Misrouting hard tasks to weak models. |
| Context trimming | Reduces input tokens and latency. | Dropping required evidence. |
| Embedding cache | Avoids repeated query/document embedding. | Stale keys or tenant leakage. |
| Answer cache | Fast for repeated public FAQs. | Poisoning or personalized leakage. |
| Parallel retrieval | Reduces wall time. | Higher infra cost if overused. |
| Streaming | Improves perceived latency. | Does not reduce total compute. |
| Step budget | Prevents runaway agents. | May cut off legitimate long tasks. |
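The first lever in the table can be sketched as a heuristic router. Everything here is an assumption for illustration — the model names, the word-count threshold, and the keyword list are placeholders a real deployment would replace with a learned or evaluated policy.

```python
# Heuristic smaller-model router (illustrative; model names, thresholds,
# and keywords are assumptions, not a production policy).
CHEAP_MODEL = "small-fast"
STRONG_MODEL = "large-accurate"

def route(prompt: str, tools_requested: bool) -> str:
    """Send easy-looking tasks to the cheap path, everything else upstream."""
    hard_signals = (
        tools_requested,                  # tool use usually needs stronger reasoning
        len(prompt.split()) > 400,        # long inputs correlate with harder tasks
        any(k in prompt.lower() for k in ("prove", "legal", "diagnose")),
    )
    return STRONG_MODEL if any(hard_signals) else CHEAP_MODEL

print(route("What are your opening hours?", tools_requested=False))
```

The table's risk column applies directly: a hard task phrased simply slips past every heuristic here, which is why routers need quality evaluation on the cheap path, not just cost tracking.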
Routing states
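An escalation-style routing flow can be sketched as a small state machine: try the cheap model first, escalate on low confidence, and stop at a step budget with a terminal fallback. The confidence threshold, the scoring interface, and the fallback message are assumptions for illustration.

```python
# Minimal escalation state machine: cheap model first, escalate on low
# confidence, terminal fallback once the step budget is exhausted.
# The (answer, confidence) interface and threshold are assumptions.
from typing import Callable

def answer_with_escalation(
    query: str,
    models: list[Callable[[str], tuple[str, float]]],  # each returns (answer, confidence)
    confidence_threshold: float = 0.7,
    step_budget: int = 3,
) -> str:
    steps = 0
    for model in models:
        if steps >= step_budget:
            break  # step budget prevents runaway escalation chains
        answer, confidence = model(query)
        steps += 1
        if confidence >= confidence_threshold:
            return answer
    return "Escalated to human review."  # terminal fallback state

cheap = lambda q: ("cheap answer", 0.5)
strong = lambda q: ("strong answer", 0.9)
print(answer_with_escalation("hard question", [cheap, strong]))
```

The step budget from the cost-levers table appears here as a hard cap: legitimate long tasks may get cut off, which is the trade-off the table calls out.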
Interview questions
1. How do you lower GenAI cost without hurting quality?
- Route easy tasks to cheaper models, trim context, cache safely, use retrieval only when needed, cap repairs, and evaluate quality deltas.
2. What causes tail latency in RAG?
- Slow vector/keyword search, reranker calls, serial IO, large context, provider latency, and retries.
3. Why is answer caching risky?
- Personalized or tenant-specific responses can leak across users unless keys include identity, permissions, version, and source freshness.
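The cache-key point in the last answer can be made concrete. The field names below are assumptions about what a deployment might track; the principle is that anything that changes the correct answer (identity, permissions, prompt/model version, source freshness) must be part of the key, or entries leak across those boundaries.

```python
# Build an answer-cache key scoped by identity, permissions, prompt/model
# version, and source freshness. Field names are illustrative assumptions.
import hashlib
import json

def answer_cache_key(query, tenant_id, user_role, prompt_version, index_snapshot):
    payload = json.dumps(
        {
            "q": query.strip().lower(),   # normalize trivially different queries
            "tenant": tenant_id,          # prevents cross-tenant leakage
            "role": user_role,            # permission-aware caching
            "prompt_v": prompt_version,   # invalidates on prompt/model changes
            "index": index_snapshot,      # invalidates when sources refresh
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

k_acme = answer_cache_key("What is the refund policy?", "acme", "viewer", "v3", "2024-06-01")
k_globex = answer_cache_key("What is the refund policy?", "globex", "viewer", "v3", "2024-06-01")
print(k_acme != k_globex)
```

Omitting any one field reintroduces a risk from the table: drop the tenant and answers leak across customers; drop the snapshot and the cache serves stale evidence after a reindex.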