Agentic production & serving
Shipping agents means combining latency SLOs, financial guardrails, trust architecture, and recoverable execution: authZ per tool, quotas, human escalation, deterministic tests, trace redaction (for example in LangSmith), and rollback hooks. If concepts are fuzzy, skim fundamentals first.
Process — hardened request lifecycle
Scoped tool gateway: a central choke point that verifies arguments, quotas, tenancy, and cryptographic proof of the invocation context.
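A minimal sketch of such a gateway, assuming an in-memory registry and a fixed per-tenant call quota. `ToolSpec`, `ToolGateway`, and `GatewayError` are hypothetical names used only for illustration, and the cryptographic proof of invocation context is left out to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


class GatewayError(Exception):
    """Raised when a tool call is refused before the handler runs."""


@dataclass
class ToolSpec:
    handler: Callable[..., Any]
    required_args: dict[str, type]   # argument name -> expected type
    allowed_tenants: set[str]        # tenants permitted to call this tool
    quota_per_tenant: int            # assumed: simple lifetime call cap


@dataclass
class ToolGateway:
    tools: dict[str, ToolSpec]
    usage: dict[tuple[str, str], int] = field(default_factory=dict)

    def call(self, tenant: str, tool_name: str, args: dict[str, Any]) -> Any:
        spec = self.tools.get(tool_name)
        if spec is None:
            raise GatewayError(f"unknown tool: {tool_name}")
        # Tenancy check: the caller's identity decides access, not the model.
        if tenant not in spec.allowed_tenants:
            raise GatewayError(f"tenant {tenant} not authorized for {tool_name}")
        # Argument check: reject missing, extra, or wrongly typed arguments.
        if set(args) != set(spec.required_args):
            raise GatewayError("argument names do not match schema")
        for name, expected in spec.required_args.items():
            if not isinstance(args[name], expected):
                raise GatewayError(f"argument {name} must be {expected.__name__}")
        # Quota check: count calls per (tenant, tool) pair.
        key = (tenant, tool_name)
        if self.usage.get(key, 0) >= spec.quota_per_tenant:
            raise GatewayError("quota exceeded")
        self.usage[key] = self.usage.get(key, 0) + 1
        return spec.handler(**args)
```

Every refusal raises the same exception type, so the orchestrator can route all of them to one deterministic refusal path.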
Runtime state transitions
Operational meaning: production code needs metrics and handlers for every terminal path, not only the happy Completed path.
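One way to make this concrete is to enumerate the terminal states and refuse to start if any of them lacks a handler. The sketch below assumes a hypothetical set of states; only the Completed path is named in the text, and the other state names and handler strings are illustrative.

```python
from collections import Counter
from enum import Enum


class RunState(Enum):
    RUNNING = "running"
    COMPLETED = "completed"
    REFUSED = "refused"
    BUDGET_EXCEEDED = "budget_exceeded"
    TOOL_FAILED = "tool_failed"
    ESCALATED = "escalated"


# Hypothetical user-facing handling for each terminal state.
HANDLERS = {
    RunState.COMPLETED: "return answer",
    RunState.REFUSED: "explain refusal",
    RunState.BUDGET_EXCEEDED: "return partial result with a limit notice",
    RunState.TOOL_FAILED: "serve degraded response",
    RunState.ESCALATED: "hand off to a human",
}

TERMINAL = {s for s in RunState if s is not RunState.RUNNING}
# Fail at import time if a terminal state is added without a handler.
assert set(HANDLERS) == TERMINAL

# Stand-in for a real metrics client: one counter per terminal state.
terminal_counts = Counter()


def finish(state: RunState) -> str:
    """Count the terminal state and return how the run should be handled."""
    if state not in TERMINAL:
        raise ValueError(f"{state.value} is not a terminal state")
    terminal_counts[state.value] += 1
    return HANDLERS[state]
```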
Defense-in-depth layering
Best practices recap
| Area | Principle |
|---|---|
| Budget | Caps on model turns, retries, parallelism; degrade gracefully. |
| Side effects | Idempotency tokens; never trust raw model strings for destructive ops; map them to deterministic handles first. |
| Schema | Invalid tool args → deterministic refusal path; capped repair attempts only. |
| Testing | Golden multi-hop traces that assert tool calls and args, not parity with free-form prose, which is flaky. |
| Observability | Correlate orchestration traces with infra metrics (rate-limit spikes, queue depth). |
| Data policy | Minimize payloads sent to LangSmith; mask PII consistently. |
| Identity | Carry tenant/user identity into tools; never let the model impersonate a broader service account. |
| Memory | Separate user memory, task scratchpad, and audit logs; apply retention and deletion policy explicitly. |
| Rollout | Use canaries, shadow evals, prompt/model versioning, and rollback flags. |
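
The "Side effects" row can be illustrated with a small idempotency wrapper. This is a sketch that assumes an in-memory result store and a key derived from a run ID, tool name, and arguments; `IdempotentExecutor` and `key_for` are hypothetical names, and a real deployment would need a durable, shared store.

```python
import hashlib
import json
from typing import Any, Callable


class IdempotentExecutor:
    """Runs each side effect at most once per idempotency key."""

    def __init__(self) -> None:
        self._results: dict[str, Any] = {}

    @staticmethod
    def key_for(run_id: str, tool_name: str, args: dict[str, Any]) -> str:
        # Same run, tool, and arguments always yield the same key.
        payload = json.dumps([run_id, tool_name, args], sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def execute(self, key: str, action: Callable[[], Any]) -> Any:
        # A retried call returns the stored result instead of repeating
        # the side effect.
        if key in self._results:
            return self._results[key]
        result = action()
        self._results[key] = result
        return result
```

With this in place, a retry after a timeout replays the stored result, so a ticket or email is not created twice.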
Security controls mapped to agent risk
| Risk | Control |
|---|---|
| Prompt injection | Treat retrieved text and tool output as untrusted; instruction hierarchy must not be overridden by documents. |
| Sensitive data disclosure | Redact traces, minimize context, enforce ACL before retrieval, and filter final output. |
| Improper output handling | Validate model output before passing it to HTML, SQL, shell, email, workflow engines, or APIs. |
| Excessive agency | Reduce tool count, permissions, and autonomy; require approval for irreversible actions. |
| Vector weakness | ACL-filter before ranking, dedupe chunks, detect stale indexes, and monitor citation quality. |
| Misinformation | Require evidence for factual claims, expose uncertainty, and degrade when sources are missing. |
| Unbounded consumption | Enforce per-user and per-tenant step, token, wall-clock, concurrency, and spend limits. |
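
The "Unbounded consumption" control can be sketched as a per-run budget object that is charged on every model or tool step. `RunBudget` and `BudgetExceeded` are hypothetical names, and the limits would come from per-user or per-tenant configuration, which is not shown here.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a run crosses one of its limits."""


@dataclass
class RunBudget:
    max_steps: int
    max_tokens: int
    max_seconds: float
    steps: int = 0
    tokens: int = 0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int) -> None:
        """Record one step and its token cost, then enforce every limit."""
        self.steps += 1
        self.tokens += tokens
        if self.steps > self.max_steps:
            raise BudgetExceeded("step limit")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token limit")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("wall-clock limit")
```

The orchestrator would catch `BudgetExceeded` and move the run to a terminal state that returns a partial result or escalates, rather than looping.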
Failure playbook (elevator-ready)
| Symptom | Likely systemic cause | Immediate mitigation |
|---|---|---|
| Cost spike | Recursive tool chatter | tighten step clamp + escalate |
| Stale factual answers | RAG ingestion drift | reindex + degrade message |
| Weird tool payloads | Injection via documents | sanitization boundary + verifier model |
| Slow tail latency on all responses | Sequential waterfall | parallelize independent I/O + prefetch |
| Wrong tenant data | Retrieval/tool authZ gap | disable tool, rotate traces, audit access, add ACL prefilter |
| Users see partial actions | Missing idempotency/rollback | stop writes, reconcile side effects, add transaction boundary |
| High refusals after deploy | Prompt/model/evaluator regression | rollback version, compare traces, inspect input classifier |
For a deeper security discussion, see /security; for Gen AI ingestion notes, see /gen-ai.
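The "sequential waterfall" row in the playbook can be illustrated with `asyncio.gather`. This sketch assumes two lookups that do not depend on each other; `fetch` is a hypothetical stand-in for a real I/O call, and the delays are simulated.

```python
import asyncio


async def fetch(name: str, delay: float) -> str:
    # Stand-in for an independent network or tool call.
    await asyncio.sleep(delay)
    return name


async def sequential() -> list[str]:
    # Waterfall: total latency is the sum of both calls.
    profile = await fetch("profile", 0.1)
    orders = await fetch("orders", 0.1)
    return [profile, orders]


async def parallel() -> list[str]:
    # Independent calls run concurrently: total latency is close to the
    # slowest single call. gather preserves argument order in its result.
    results = await asyncio.gather(fetch("profile", 0.1), fetch("orders", 0.1))
    return list(results)
```

Only calls whose inputs do not depend on each other's outputs are safe to run this way; dependent steps still have to stay sequential.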
Incident scenario
Scenario: Agent tool spend triples in one hour.
Strong on-call response:
- Flip feature flag to reduce autonomy: lower step cap, disable expensive tools, or route to retrieval-only mode.
- Query traces by tenant, route, prompt version, model version, and tool name.
- Check for recursive tool chatter, provider retries, injected instructions, or a router regression.
- Roll back the prompt/model/router if the spike started with a deploy.
- Add a regression eval for the trace cluster that caused the spike.
- Communicate user-visible degradation honestly if features are reduced.
Weak response: "Switch to a cheaper model." That may reduce one line item but leaves runaway loops and unsafe tool behavior in place.
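The first step of the strong response, reducing autonomy with a flag, might look like the sketch below. The autonomy levels, tool names, and step caps are all hypothetical; the point is that one flag value maps to a complete, pre-tested policy.

```python
from dataclasses import dataclass
from enum import Enum


class AutonomyLevel(Enum):
    FULL = "full"
    REDUCED = "reduced"
    RETRIEVAL_ONLY = "retrieval_only"


@dataclass(frozen=True)
class RunPolicy:
    step_cap: int
    allowed_tools: frozenset[str]


# Hypothetical tool inventory for illustration.
ALL_TOOLS = frozenset({"search", "sql_report", "send_email"})
EXPENSIVE_TOOLS = frozenset({"sql_report"})


def policy_for(level: AutonomyLevel) -> RunPolicy:
    """Map a feature-flag value to the limits the orchestrator enforces."""
    if level is AutonomyLevel.FULL:
        return RunPolicy(step_cap=12, allowed_tools=ALL_TOOLS)
    if level is AutonomyLevel.REDUCED:
        # Lower step cap and no expensive tools.
        return RunPolicy(step_cap=4, allowed_tools=ALL_TOOLS - EXPENSIVE_TOOLS)
    # Retrieval-only mode: a single step and read-only search.
    return RunPolicy(step_cap=1, allowed_tools=frozenset({"search"}))
```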
TypeScript parity
Agents on LangGraph.js use the same graph patterns (START/END, ToolNode equivalents, streaming). The operational concerns are identical even though the runtime differs (Node.js async ergonomics versus Python Gunicorn workers).
Interview questions — production
1. Outline an on-call playbook for exploding tool spend.
- Freeze feature flags, tighten budgets, pinpoint the anomalous trace cluster, and roll back the prompt/model/router version if the spike maps to a deploy.
2. Why enforce tenancy separately at the orchestration layer and at the embeddings index?
- They are different threat surfaces: even with orchestration checks in place, retrieval leakage can still surface through ranking. Drive both from one policy engine.
3. Describe how a circuit breaker triggers when a provider flaps.
- Short-circuit synchronous calls and serve cached canned responses with a freshness disclaimer.
4. What do contract tests validate before deploy?
- Schema conformance, tool latency percentiles on replayed shadow traffic, and evaluator delta budgets.
5. Describe an incident where logs looked green but users were angry.
- Only the happy path was instrumented; you need negative-path spans and refusal-distribution sampling.
6. How do you safely let an agent send emails or create tickets?
- Draft first, show recipient/body/action summary, require approval for external sends, use idempotency keys, and audit the final payload.
7. What is the difference between a guardrail and an evaluator?
- Guardrail acts during runtime to block or transform a request/output. Evaluator usually runs offline or in CI to decide whether a model/prompt/tool change should ship.
8. What should be in a production trace?
- User/tenant hash, model version, prompt/tool version, state transition, tool name, validated args, result status, latency, token/spend, guardrail decisions, and final outcome.
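The trace fields from question 8 can be sketched as a typed record with hashing and redaction applied before anything leaves the process. This is an illustration only: `TraceEvent`, `record`, the salt handling, and the email-only redaction pattern are assumptions, and a real pipeline would cover more PII types.

```python
import hashlib
import re
from dataclasses import asdict, dataclass

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def hash_id(value: str, salt: str) -> str:
    # Salted, truncated hash so traces can be grouped by tenant
    # without storing the raw identifier.
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]


def redact(text: str) -> str:
    return EMAIL.sub("[EMAIL]", text)


@dataclass
class TraceEvent:
    tenant_hash: str
    model_version: str
    prompt_version: str
    state: str
    tool_name: str
    validated_args: dict
    result_status: str
    latency_ms: float
    tokens: int
    guardrail_decision: str


def record(tenant_id: str, salt: str, args: dict, **rest) -> dict:
    """Build a trace event with the tenant hashed and string args redacted."""
    clean_args = {k: redact(v) if isinstance(v, str) else v for k, v in args.items()}
    event = TraceEvent(tenant_hash=hash_id(tenant_id, salt),
                       validated_args=clean_args, **rest)
    return asdict(event)
```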
Interview answer template
For "How do you ship an agent safely?", answer:
- Start with a deterministic workflow and add autonomy only where observations change the path.
- Put tools behind a gateway with user-scoped auth, schemas, quotas, and idempotency.
- Add step, token, wall-clock, retry, and spend budgets.
- Use human interrupts for irreversible or high-risk actions.
- Trace state transitions, tool calls, guardrail decisions, and outcomes.
- Gate releases with offline evals, canaries, feature flags, and rollback.
- Prepare degraded modes for provider/tool failures.
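The last point, degraded modes for provider failures, is often implemented with a circuit breaker. The sketch below is a simplified version under stated assumptions: a consecutive-failure threshold, a fixed cool-down, and a caller-supplied fallback. `CircuitBreaker` and its parameters are hypothetical names.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(self, failure_threshold: int, reset_after: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, primary: Callable[[], str], fallback: Callable[[], str]) -> str:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: skip the provider and serve the degraded response.
                return fallback()
            # Cool-down elapsed: allow one trial call (half-open).
            self.opened_at = None
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        return result
```

The fallback is where a cached or canned response with a freshness disclaimer would be returned.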
Common bad answers
| Bad answer | Why it is weak |
|---|---|
| "Ship the demo and monitor user feedback." | Production needs eval gates, traces, budgets, rollback, and incident playbooks before launch. |
| "Use a bigger model for reliability." | Reliability comes from controls around the model, not only model size. |
| "Let the agent retry until it succeeds." | Unbounded retries create cost spikes, duplicate side effects, and outages. |
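
As a contrast to "retry until it succeeds", here is a sketch of a bounded retry with exponential backoff and jitter. `retry_bounded` and `RetriesExhausted` are hypothetical names, and the attempt count and delays are illustrative defaults.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetriesExhausted(Exception):
    """Raised when every allowed attempt has failed."""


def retry_bounded(action: Callable[[], T], max_attempts: int = 3,
                  base_delay: float = 0.5,
                  sleep: Callable[[float], None] = time.sleep) -> T:
    last_error = None
    for attempt in range(max_attempts):
        try:
            return action()
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                # Exponential backoff with full jitter between attempts.
                sleep(random.uniform(0, base_delay * 2 ** attempt))
    # Give up explicitly so the caller can degrade or escalate.
    raise RetriesExhausted(f"gave up after {max_attempts} attempts") from last_error
```

For actions with side effects, bounded retries should be combined with idempotency keys so that a retried call cannot duplicate the effect.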
Self-check
You are ready if you can explain:
- How to ship with canaries and rollback flags.
- What goes in a production trace.
- How to respond to tool spend spikes.
- Which actions require human approval.
- How degraded modes work when providers or tools fail.