THN Interview Prep

Cost, latency & routing

GenAI cost and latency come from more than the final model call: retrieval, embeddings, reranking, tool IO, retries, context size, and observability all contribute. Senior designs surface cost and latency budgets early.
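To make that concrete, a per-request cost estimate should sum every stage, not just the LLM call. A minimal sketch; the stage names and per-1K-token prices below are illustrative assumptions, not real provider pricing:

```python
# Illustrative per-1K-token prices (assumptions, not real rates).
PRICE_PER_1K = {"embed": 0.0001, "rerank": 0.001, "llm_in": 0.003, "llm_out": 0.015}

def request_cost(tokens_by_stage: dict) -> float:
    """Sum cost across every pipeline stage, not only the final model call."""
    return sum(PRICE_PER_1K[stage] * tokens / 1000
               for stage, tokens in tokens_by_stage.items())

# One RAG request: embedding the query, reranking candidates, then the LLM call.
cost = request_cost({"embed": 200, "rerank": 4000, "llm_in": 3000, "llm_out": 500})
```

Budgeting this way makes it obvious when reranking or oversized context, rather than the model itself, dominates spend.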


Latency path

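The latency path can be sketched as serial stages whose times add up, with streaming improving perceived (not total) latency. Stage names and timings below are illustrative assumptions:

```python
# Illustrative serial latency path for a RAG request (timings are assumptions).
STAGES_MS = {
    "embed_query": 30,
    "vector_search": 80,
    "rerank": 120,
    "llm_first_token": 400,
    "llm_stream": 900,   # remaining tokens after the first
}

def total_latency_ms(stages: dict) -> int:
    """Wall time to the last token: every serial stage adds up."""
    return sum(stages.values())

def perceived_latency_ms(stages: dict) -> int:
    """With streaming, users see output at the first token, not the last."""
    return total_latency_ms(stages) - stages.get("llm_stream", 0)
```

Note how streaming shrinks perceived latency while total compute is unchanged, matching the "Streaming" lever below.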

Cost levers

| Lever | Helps | Risk |
| --- | --- | --- |
| Smaller model router | Cheaply sends easy tasks to cheaper paths. | Misrouting hard tasks. |
| Context trimming | Reduces input tokens and latency. | Dropping required evidence. |
| Embedding cache | Avoids repeated query/document embedding. | Stale keys or tenant leakage. |
| Answer cache | Fast for repeated public FAQs. | Poisoning or personalized leakage. |
| Parallel retrieval | Reduces wall time. | Higher infra cost if overused. |
| Streaming | Improves perceived latency. | Does not reduce total compute. |
| Step budget | Prevents runaway agents. | May cut off legitimate long tasks. |
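The "smaller model router" lever can be sketched as a cheap heuristic gate in front of two model tiers. The heuristic (length plus hard-task markers) and the model names are illustrative assumptions; production routers often use a trained classifier instead:

```python
# Sketch of a smaller-model router: send easy tasks to the cheap path.
# Markers and thresholds are illustrative assumptions.
HARD_MARKERS = ("prove", "multi-step", "reconcile", "legal analysis")

def route(task: str) -> str:
    """Return which model tier should handle this task."""
    looks_hard = len(task) > 500 or any(m in task.lower() for m in HARD_MARKERS)
    return "large-model" if looks_hard else "small-model"
```

The table's risk column applies directly: a miss here sends a hard task down the cheap path, so routing decisions should be logged and evaluated against quality deltas.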

Routing states

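One common routing-state pattern is a generate/verify loop with a capped repair count, which combines the "Step budget" lever with a quality check. The states and the escalation fallback below are illustrative assumptions:

```python
# Sketch of routing states: generate -> verify, with a capped repair loop.
# State names and the escalate fallback are illustrative assumptions.
def run_pipeline(passes_verify, max_repairs: int = 2) -> str:
    """Return 'done' if verification passes, 'escalate' once repairs are exhausted."""
    repairs = 0
    state = "generate"
    while True:
        if state == "generate":
            state = "verify"
        elif state == "verify":
            if passes_verify(repairs):
                return "done"
            if repairs >= max_repairs:
                return "escalate"   # cap repairs instead of looping forever
            repairs += 1
            state = "generate"
```

The cap is what prevents a runaway agent; the trade-off, as the levers table notes, is that a legitimately long task may be cut off and escalated.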

Interview questions

1. How do you lower GenAI cost without hurting quality?

  • Route easy tasks to cheaper models, trim context, cache safely, use retrieval only when needed, cap repairs, and evaluate quality deltas.

2. What causes tail latency in RAG?

  • Slow vector/keyword search, reranker calls, serial IO, large context, provider latency, and retries.
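Serial IO is the most fixable of these: running vector and keyword search concurrently with a deadline stops one slow backend from dominating tail latency. A sketch using the standard library; the backends here are simulated stubs and the timeout value is an assumption:

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

# Simulated retrieval backends (stubs standing in for real search services).
def vector_search(query):
    time.sleep(0.05)
    return ["v1", "v2"]

def keyword_search(query):
    time.sleep(0.05)
    return ["k1"]

def retrieve(query, timeout=0.5):
    """Fan out to both backends in parallel; drop stragglers past the deadline."""
    with ThreadPoolExecutor() as ex:
        futures = [ex.submit(f, query) for f in (vector_search, keyword_search)]
        done, not_done = wait(futures, timeout=timeout)
        for f in not_done:
            f.cancel()  # best-effort: running work cannot be interrupted
        return [doc for f in done for doc in f.result()]
```

The levers table's caveat applies: fan-out trims wall time but multiplies backend load, so it raises infra cost if overused.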

3. Why is answer caching risky?

  • Personalized or tenant-specific responses can leak across users unless keys include identity, permissions, version, and source freshness.
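The safe version of that key can be sketched directly: hash the query together with identity, permissions, version, and source freshness, so a hit can never cross users or survive a content update. Field names below are illustrative assumptions:

```python
import hashlib

def answer_cache_key(query: str, tenant: str, user: str, perms: list,
                     prompt_version: str, source_etag: str) -> str:
    """Cache key that scopes answers by identity, permissions, version, and freshness."""
    parts = [query, tenant, user, ",".join(sorted(perms)), prompt_version, source_etag]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()
```

Because permissions and the source etag are part of the key, revoking access or refreshing a document naturally invalidates the cached answer instead of leaking it.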

