Agentic production & serving

Shipping agents means combining latency SLOs, financial guardrails, trust architecture, and recoverable execution. Concretely, that means per-tool authorization (authZ), quotas, human escalation, deterministic tests, redaction of traces sent to LangSmith, and rollback hooks. If any of these concepts are fuzzy, skim the fundamentals first.

Process — hardened request lifecycle


Scoped tool gateway: a central choke point that verifies arguments, quotas, tenancy, and cryptographic proof of the invocation context before any tool runs.
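A minimal sketch of such a gateway, assuming an HMAC-signed invocation context and a required-argument set standing in for a real schema. All names here (ToolGateway, InvocationContext, sign_context, GATEWAY_KEY) are hypothetical and not part of LangGraph or LangSmith; the sketch also assumes every tool takes a tenant_id argument.

```python
import hashlib
import hmac
from dataclasses import dataclass, field

# Hypothetical signing key; in practice this would come from a secret manager.
GATEWAY_KEY = b"demo-key-not-for-production"


@dataclass(frozen=True)
class InvocationContext:
    tenant_id: str
    user_id: str
    run_id: str
    signature: str  # HMAC over the fields above, minted by the orchestrator


def sign_context(tenant_id: str, user_id: str, run_id: str) -> str:
    message = f"{tenant_id}|{user_id}|{run_id}".encode()
    return hmac.new(GATEWAY_KEY, message, hashlib.sha256).hexdigest()


@dataclass
class ToolGateway:
    # tool name -> required argument names (a stand-in for a real schema)
    schemas: dict[str, set[str]]
    # tenant -> calls remaining in the current window
    quotas: dict[str, int] = field(default_factory=dict)

    def authorize(self, ctx: InvocationContext, tool: str, args: dict) -> tuple[bool, str]:
        expected = sign_context(ctx.tenant_id, ctx.user_id, ctx.run_id)
        if not hmac.compare_digest(expected, ctx.signature):
            return False, "bad_signature"
        if tool not in self.schemas:
            return False, "unknown_tool"
        if set(args) != self.schemas[tool]:
            return False, "invalid_args"
        # Tenancy: the tool may only act on the caller's own tenant.
        if args.get("tenant_id") != ctx.tenant_id:
            return False, "tenant_mismatch"
        if self.quotas.get(ctx.tenant_id, 0) <= 0:
            return False, "quota_exhausted"
        self.quotas[ctx.tenant_id] -= 1
        return True, "ok"
```

Every rejection returns a fixed reason code rather than free text, so each refusal path can be counted and alerted on.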

Runtime state transitions


Operational meaning: production code needs metrics and handlers for every terminal path, not only the happy Completed path.
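One way to enforce this is to fail at startup when a terminal state has no handler, and to count every terminal outcome. The sketch below is illustrative: only Completed is named in the text, so the other states are assumed examples, and the Counter stands in for a real metrics client.

```python
from collections import Counter
from enum import Enum


class Terminal(Enum):
    COMPLETED = "completed"  # the only state named in the text
    REFUSED = "refused"  # assumed
    BUDGET_EXHAUSTED = "budget_exhausted"  # assumed
    TOOL_FAILED = "tool_failed"  # assumed
    ESCALATED = "escalated"  # assumed


terminal_counts = Counter()  # stand-in for a real metrics client

HANDLERS = {
    Terminal.COMPLETED: lambda run_id: f"{run_id}: deliver answer",
    Terminal.REFUSED: lambda run_id: f"{run_id}: explain refusal",
    Terminal.BUDGET_EXHAUSTED: lambda run_id: f"{run_id}: return partial result",
    Terminal.TOOL_FAILED: lambda run_id: f"{run_id}: serve degraded response",
    Terminal.ESCALATED: lambda run_id: f"{run_id}: hand off to a human",
}

# Fail at import time if any terminal state lacks a handler.
missing = set(Terminal) - set(HANDLERS)
if missing:
    raise RuntimeError(f"terminal states without handlers: {missing}")


def finish(run_id: str, state: Terminal) -> str:
    """Record the terminal outcome, then dispatch to its handler."""
    terminal_counts[state.value] += 1
    return HANDLERS[state](run_id)
```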

Defense-in-depth layering


Best practices recap

Area	Principle
Budget	Caps on model turns, retries, parallelism; degrade gracefully.
Side effects	Idempotency tokens; never trust model raw strings for destructive ops—map to deterministic handles first.
Schema	Invalid tool args → deterministic refusal path; capped repair attempts only.
Testing	Golden multi-hop traces asserting calls + args, not flaky creative prose parity.
Observability	Correlate orchestration traces with infra metrics (rate-limit spikes, queue depth).
Data policy	Minimize payloads to LangSmith—mask PII consistently.
Identity	Carry tenant/user identity into tools; never let the model impersonate a broader service account.
Memory	Separate user memory, task scratchpad, and audit logs; apply retention and deletion policy explicitly.
Rollout	Use canaries, shadow evals, prompt/model versioning, and rollback flags.
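The "Side effects" row can be sketched as follows, assuming an in-memory store for idempotency keys and a hypothetical allow-list of handles. The model only ever picks a handle; the raw path is resolved by deterministic code. The function names and the example path are illustrative.

```python
import hashlib
import json

# Hypothetical allow-list: the model chooses a handle, never a raw path or SQL string.
DELETABLE = {"draft_report": "/workspace/drafts/report.md"}

_completed: dict[str, str] = {}  # idempotency key -> recorded result


def idempotency_key(run_id: str, tool: str, args: dict) -> str:
    payload = json.dumps({"run": run_id, "tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def delete_resource(run_id: str, handle: str, deleted: list[str]) -> str:
    if handle not in DELETABLE:
        return "refused: unknown handle"
    key = idempotency_key(run_id, "delete_resource", {"handle": handle})
    if key in _completed:
        # Replay of the same call: return the prior result, no second side effect.
        return _completed[key]
    deleted.append(DELETABLE[handle])  # stand-in for the real destructive call
    _completed[key] = f"deleted {handle}"
    return _completed[key]
```

A retried call with the same run, tool, and arguments returns the recorded result, so a model that repeats itself cannot delete twice.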

Security controls mapped to agent risk

Risk	Control
Prompt injection	Treat retrieved text and tool output as untrusted; instruction hierarchy must not be overridden by documents.
Sensitive data disclosure	Redact traces, minimize context, enforce ACL before retrieval, and filter final output.
Improper output handling	Validate model output before passing it to HTML, SQL, shell, email, workflow engines, or APIs.
Excessive agency	Reduce tool count, permissions, and autonomy; require approval for irreversible actions.
Vector weakness	ACL-filter before ranking, dedupe chunks, detect stale indexes, and monitor citation quality.
Misinformation	Require evidence for factual claims, expose uncertainty, and degrade when sources are missing.
Unbounded consumption	Enforce per-user and per-tenant step, token, wall-clock, concurrency, and spend limits.
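As an illustration of the "Vector weakness" row, the sketch below filters by tenant and group before ranking, then dedupes. The Chunk shape and the retrieve function are assumptions for this example, not a specific vector store API; a real system would push the filter into the index query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    tenant_id: str
    allowed_groups: frozenset[str]
    score: float  # similarity score from the vector store
    text: str


def retrieve(chunks: list[Chunk], tenant_id: str, user_groups: frozenset[str], k: int = 3) -> list[Chunk]:
    # 1. ACL filter first, so forbidden chunks never compete in ranking.
    visible = [
        c for c in chunks
        if c.tenant_id == tenant_id and c.allowed_groups & user_groups
    ]
    # 2. Rank, then drop exact-duplicate text, keeping the best-scoring copy.
    seen: set[str] = set()
    unique: list[Chunk] = []
    for c in sorted(visible, key=lambda c: c.score, reverse=True):
        if c.text not in seen:
            seen.add(c.text)
            unique.append(c)
    return unique[:k]
```

Filtering after ranking would let a high-scoring chunk from another tenant displace legitimate results, or leak if the post-filter is ever skipped.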

Failure playbook (elevator-ready)

Symptom	Likely systemic cause	Immediate mitigation
Cost spike	Recursive tool chatter	Tighten step clamp and escalate
Stale factual answers	RAG ingestion drift	Reindex and show a degraded-mode message
Weird tool payloads	Injection via documents	Add a sanitization boundary and a verifier model
Slow tail on all responses	Sequential waterfall	Parallelize independent IO and prefetch
Wrong tenant data	Retrieval/tool authZ gap	Disable tool, rotate traces, audit access, add ACL prefilter
Users see partial actions	Missing idempotency/rollback	Stop writes, reconcile side effects, add transaction boundary
High refusals after deploy	Prompt/model/evaluator regression	Roll back version, compare traces, inspect input classifier
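The sequential-waterfall row is the easiest to show in code. The sketch below uses asyncio.gather to run independent lookups concurrently; the fetch function and its delays are placeholders for real IO calls.

```python
import asyncio
import time


async def fetch(name: str, delay_s: float) -> str:
    """Placeholder for an independent IO call (API, database, retriever)."""
    await asyncio.sleep(delay_s)
    return f"{name}:ok"


async def sequential() -> list[str]:
    # Waterfall: each call waits for the previous one to finish.
    return [
        await fetch("profile", 0.05),
        await fetch("orders", 0.05),
        await fetch("docs", 0.05),
    ]


async def parallel() -> list[str]:
    # Independent calls run concurrently; gather preserves argument order.
    return list(await asyncio.gather(
        fetch("profile", 0.05),
        fetch("orders", 0.05),
        fetch("docs", 0.05),
    ))


def timed(coro_fn):
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start
```

This only helps when the calls do not depend on each other's results; dependent steps still have to run in order.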

For a deeper security discussion, see /security; for Gen AI ingestion notes, see /gen-ai.

Incident scenario

Scenario: Agent tool spend triples in one hour.

Strong on-call response:

Flip feature flag to reduce autonomy: lower step cap, disable expensive tools, or route to retrieval-only mode.
Query traces by tenant, route, prompt version, model version, and tool name.
Check for recursive tool chatter, provider retries, injected instructions, or a router regression.
Roll back the prompt/model/router if the spike started with a deploy.
Add a regression eval for the trace cluster that caused the spike.
Communicate user-visible degradation honestly if features are reduced.

Weak response: "Switch to a cheaper model." That may reduce one line item but leaves runaway loops and unsafe tool behavior in place.
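The trace-query step of the strong response can be sketched as a simple group-by. The trace field names (tenant, prompt_version, tool, spend_usd) are assumptions for illustration; a real tracing backend would have its own query interface.

```python
from collections import defaultdict


def spend_by_cluster(traces, keys=("tenant", "prompt_version", "tool")):
    """Sum spend per (tenant, prompt version, tool) and return the biggest clusters first."""
    totals = defaultdict(float)
    for trace in traces:
        cluster = tuple(trace[k] for k in keys)
        totals[cluster] += trace["spend_usd"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

If one cluster dominates and its prompt version matches a recent deploy, that points to a rollback; if spend is spread evenly across clusters, provider retries or a router regression become more likely suspects.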

TypeScript parity

Agents on LangGraph.js use the same graph patterns (START/END, ToolNode equivalents, streaming). Operational concerns remain identical even though the runtime differs (JavaScript async ergonomics versus Python Gunicorn workers).

Interview questions — production

1. Outline an on-call playbook for exploding tool spend.

Freeze feature flags, tighten budgets, pinpoint the anomalous trace cluster, and roll back the prompt/model/router version if the spike maps to a deploy.

2. Why separate tenancy at orchestration vs embeddings index?

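They are different threat surfaces: even with tenant checks in orchestration, another tenant's chunks can still leak through ranking if the index is not filtered. Enforce both layers from a single policy engine.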

3. Describe how a circuit breaker should trigger when a provider flaps.

After repeated failures, short-circuit synchronous calls to the provider, serve cached or canned responses, and disclose that the answer may not be fresh.
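A minimal circuit breaker along these lines is sketched below. The thresholds, the injectable clock, and the class itself are assumptions for this example rather than a specific library's API.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()  # open: skip the provider entirely
            # Half-open: allow one trial call; a single failure reopens the circuit.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

The fallback is where the cached or canned response lives, and it should carry the freshness disclaimer so users know they are in a degraded mode.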

4. What contract tests validate before deploy?

Schema conformance, tool latency percentiles on replayed shadow traffic, and evaluator score deltas that stay within budget.
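A golden-trace check is one concrete form of such a contract test. The sketch below compares only the sequence of tool calls and their arguments, ignoring model prose; the trace shape, tool names, and argument values are hypothetical.

```python
def tool_calls(trace):
    """Extract (tool, args) pairs in order, ignoring free-text model steps."""
    return [
        (step["tool"], step["args"])
        for step in trace
        if step["type"] == "tool_call"
    ]


# Hypothetical golden expectation for a two-hop refund flow.
GOLDEN = [
    ("search_orders", {"customer_id": "c-17"}),
    ("refund", {"order_id": "o-9", "amount_cents": 1200}),
]


def matches_golden(trace, golden=GOLDEN):
    return tool_calls(trace) == golden
```

Because the assertion is on calls and arguments, a harmless rewording of the final answer does not fail the test, while a changed refund amount does.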

5. Describe an incident where the logs looked green but users were angry.

Only the happy path was instrumented. Add spans for negative paths and sample the refusal distribution.

6. How do you safely let an agent send emails or create tickets?

Draft first, show recipient/body/action summary, require approval for external sends, use idempotency keys, and audit the final payload.

7. What is the difference between a guardrail and an evaluator?

A guardrail acts at runtime to block or transform a request or output. An evaluator usually runs offline or in CI to decide whether a model, prompt, or tool change should ship.

8. What should be in a production trace?

User/tenant hash, model version, prompt/tool version, state transition, tool name, validated args, result status, latency, token/spend, guardrail decisions, and final outcome.
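The fields listed above could be captured in a record like the following. This is a sketch: the TraceEvent class, the pseudonymize helper, and the salt are hypothetical, and a real deployment would keep the salt in a secret store and follow its own tracing schema.

```python
import hashlib
from dataclasses import asdict, dataclass


def pseudonymize(value: str, salt: str = "hypothetical-salt") -> str:
    """Salted hash so traces can be joined per tenant/user without storing raw IDs."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]


@dataclass(frozen=True)
class TraceEvent:
    tenant_hash: str
    user_hash: str
    model_version: str
    prompt_version: str
    state_transition: str
    tool_name: str
    validated_args: dict
    result_status: str
    latency_ms: float
    tokens: int
    spend_usd: float
    guardrail_decisions: tuple
    outcome: str


event = TraceEvent(
    tenant_hash=pseudonymize("acme-corp"),
    user_hash=pseudonymize("alice@example.com"),
    model_version="model-2024-06",
    prompt_version="v7",
    state_transition="plan->tool_call",
    tool_name="search_orders",
    validated_args={"customer_id": "c-17"},
    result_status="ok",
    latency_ms=182.0,
    tokens=640,
    spend_usd=0.004,
    guardrail_decisions=("pii_filter:pass",),
    outcome="completed",
)
record = asdict(event)  # plain dict, ready to serialize
```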

Interview answer template

For "How do you ship an agent safely?", answer:

Start with a deterministic workflow and add autonomy only where observations change the path.
Put tools behind a gateway with user-scoped auth, schemas, quotas, and idempotency.
Add step, token, wall-clock, retry, and spend budgets.
Use human interrupts for irreversible or high-risk actions.
Trace state transitions, tool calls, guardrail decisions, and outcomes.
Gate releases with offline evals, canaries, feature flags, and rollback.
Prepare degraded modes for provider/tool failures.
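The budget item in the list above can be sketched as a single object charged once per model turn. The default limits and the RunBudget and BudgetExceeded names are illustrative assumptions, not recommended values.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised with the name of the limit that was hit."""


@dataclass
class RunBudget:
    max_steps: int = 8
    max_tokens: int = 20_000
    max_spend_usd: float = 0.50
    max_wall_s: float = 60.0
    steps: int = 0
    tokens: int = 0
    spend_usd: float = 0.0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int, spend_usd: float) -> None:
        """Call once per model turn or tool call, before doing more work."""
        self.steps += 1
        self.tokens += tokens
        self.spend_usd += spend_usd
        if self.steps > self.max_steps:
            raise BudgetExceeded("steps")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("tokens")
        if self.spend_usd > self.max_spend_usd:
            raise BudgetExceeded("spend")
        if time.monotonic() - self.started > self.max_wall_s:
            raise BudgetExceeded("wall_clock")
```

The caller catches BudgetExceeded and routes to a degraded mode or human escalation instead of letting the loop continue.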

Common bad answers

Bad answer	Why it is weak
"Ship the demo and monitor user feedback."	Production needs eval gates, traces, budgets, rollback, and incident playbooks before launch.
"Use a bigger model for reliability."	Reliability comes from controls around the model, not only model size.
"Let the agent retry until it succeeds."	Unbounded retries create cost spikes, duplicate side effects, and outages.

Self-check

You are ready if you can explain:

How to ship with canaries and rollback flags.
What goes in a production trace.
How to respond to tool spend spikes.
Which actions require human approval.
How degraded modes work when providers or tools fail.
