LangSmith for agents
LangSmith captures runs from LangChain/LangGraph: nested spans for model calls, retrieval, tools, prompts, latency, tokens, and failures. The production promotion path is trace -> label -> dataset -> evaluator -> release gate -> monitor.
Context: fundamentals (Agentic fundamentals) and shipped surfaces (Production).
Process — from trace to guarded release
Anatomy of observability plumbing
Red flag: dumping raw PII into traces violates policy. Hash or mask sensitive fields before they are recorded, and keep each environment in its own project.
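One way to keep raw PII out of traces is to redact payloads before they are recorded. The sketch below is a minimal illustration using only the standard library. The function names (`hash_id`, `mask_text`, `redact_payload`), the salt handling, and the two regex patterns are assumptions made for this example, not LangSmith APIs, and real deployments need a reviewed, broader pattern set.

```python
import hashlib
import re

# Illustrative patterns only; production redaction needs a reviewed, broader set.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def hash_id(value: str, salt: str) -> str:
    """Return a short, stable, salted hash so runs can be joined without the raw id."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]


def mask_text(text: str) -> str:
    """Replace emails and phone-like numbers with fixed placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def redact_payload(payload: dict, salt: str, id_keys=("user_id", "tenant_id")) -> dict:
    """Hash identifier fields and mask free-text string fields; leave other values as-is."""
    redacted = {}
    for key, value in payload.items():
        if key in id_keys and isinstance(value, str):
            redacted[key] = hash_id(value, salt)
        elif isinstance(value, str):
            redacted[key] = mask_text(value)
        else:
            redacted[key] = value
    return redacted
```

Applying `redact_payload` at the app boundary means the tracing client only ever sees hashed ids and masked text. The LangSmith SDK also documents client-side hooks for hiding or transforming inputs and outputs; check the current SDK reference for the exact parameter names before relying on them.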
What to trace for agents
| Signal | Why it matters |
|---|---|
| State transition | Shows why the agent moved from plan to retrieval, tool, approval, final, refusal, or escalation. |
| Tool call + validated args | Lets you debug wrong action selection separately from wrong tool implementation. |
| Guardrail decision | Explains blocked requests, policy failures, and high-risk approvals. |
| Evidence ids | Connects final claims to retrieval chunks without storing full sensitive payloads. |
| Prompt/model/tool versions | Makes rollback possible when a regression starts after a deploy. |
| Cost, latency, tokens | Separates quality regressions from provider or infra saturation. |
| Outcome label | Enables evaluation datasets from real failed and successful runs. |
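The signals in the table can travel as structured metadata on each step of a run. The sketch below assumes a hypothetical `AgentSpanRecord` shape. Every field name is illustrative and not part of any LangSmith schema.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class AgentSpanRecord:
    """Hypothetical per-step record; field names are illustrative."""
    from_state: str                               # e.g. "plan"
    to_state: str                                 # e.g. "tool", "refusal", "final"
    tool_name: Optional[str] = None
    tool_args_valid: Optional[bool] = None
    guardrail_decision: Optional[str] = None      # e.g. "allow", "block"
    evidence_ids: list = field(default_factory=list)   # ids only, not chunk text
    versions: dict = field(default_factory=dict)       # prompt/model/tool versions
    latency_ms: Optional[float] = None
    total_tokens: Optional[int] = None
    outcome_label: Optional[str] = None           # set later by a reviewer or evaluator

    def to_metadata(self) -> dict:
        """Flatten to a plain dict, dropping unset fields, for attaching to a trace."""
        return {k: v for k, v in asdict(self).items() if v not in (None, [], {})}
```

Storing evidence ids rather than chunk text keeps the record small and avoids copying sensitive payloads into the trace.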
Enable tracing locally
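    export LANGCHAIN_TRACING_V2=true
    export LANGCHAIN_API_KEY="lsv2_..."
    export LANGCHAIN_PROJECT="agentic-demo"

    # pip install langsmith
    from langsmith import Client

    client = Client()  # reads the API key from the environment
    recent = client.list_runs(project_name="agentic-demo", limit=5)
    for r in recent:
        # Duration is derived from timestamps; end_time is None while a run is in progress.
        duration = (r.end_time - r.start_time) if r.end_time else None
        print(r.id, r.status, r.total_tokens, duration)

Use a separate project per environment (for example staging and prod-readonly-sample) and set trace retention (TTL) deliberately.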
Datasets & evaluators (mental model)
- Capture representative conversation trajectories: user message sequence, eventual tool_calls, citations.
- Build evaluation examples referencing expected tools or grounding spans.
- Run automated scorers (exact tool match, cosine grounding, an LLM judge calibrated weekly against a human-anchored subset).
- Gate merges on Δ-metric thresholds, with manual review of flagged regressions.
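The scoring and gating steps above can be sketched in plain Python. This is a minimal illustration, not a LangSmith evaluator: the function names and the `max_drop` threshold are assumptions, and the vectors passed to `cosine_similarity` are assumed to come from an embedding model chosen elsewhere.

```python
import math


def exact_tool_match(expected: list, actual: list) -> float:
    """1.0 if the ordered tool names are identical, else 0.0."""
    return 1.0 if expected == actual else 0.0


def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def gate(baseline: dict, candidate: dict, max_drop: float = 0.02) -> list:
    """Return the metrics whose score dropped by more than max_drop versus baseline."""
    return [
        name
        for name, base_score in baseline.items()
        if base_score - candidate.get(name, 0.0) > max_drop
    ]
```

An empty list from `gate` means the candidate can merge. Any metric it returns is a flagged regression that goes to manual review.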
Evaluation matrix
| Test type | Good assertion |
|---|---|
| Tool routing | Expected tool name and required argument fields match. |
| Policy refusal | Dangerous request is refused before tool execution. |
| Grounding | Final answer cites allowed evidence ids; no unsupported factual claim. |
| Recovery | Provider timeout degrades gracefully instead of looping. |
| Cost | Step/token budget stays under threshold for common tasks. |
| Regression | New prompt/model beats or matches baseline on critical examples. |
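Several rows of the matrix reduce to simple checks over a recorded trajectory. The sketch below assumes a hypothetical trajectory format: a list of step dicts with a `type` key and an optional `tokens` key, plus tool calls shaped as `{"name": ..., "args": {...}}`. Adapt the field names to whatever the traces actually contain.

```python
def refused_before_tools(steps: list) -> bool:
    """True if the trajectory contains a refusal and no tool step occurs before it."""
    for step in steps:
        if step["type"] == "tool":
            return False
        if step["type"] == "refusal":
            return True
    return False  # the agent never refused


def required_args_present(call: dict, expected_tool: str, required: set) -> bool:
    """True if the call names the expected tool and supplies every required argument."""
    return call.get("name") == expected_tool and required <= set(call.get("args", {}))


def within_budget(steps: list, max_steps: int, max_tokens: int) -> bool:
    """True if both the step count and the summed token count stay under the limits."""
    total_tokens = sum(step.get("tokens", 0) for step in steps)
    return len(steps) <= max_steps and total_tokens <= max_tokens
```

These checks are deterministic, so they suit CI runs against recorded trajectories where a natural-language comparison would be flaky.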
Example debugging workflow
Symptom: "The agent got worse after yesterday's deploy."
Debug in this order:
- Filter traces by prompt/model/tool version around the deploy window.
- Compare route decisions: did more requests go to tools, refusal, or escalation?
- Inspect retrieval spans: top-k ids, scores, corpus version, and empty retrieval rate.
- Inspect tool spans: selected tool, validated args, result status, latency.
- Compare final answers against the previous version on the same golden dataset.
- Roll back the prompt/model/router if critical tasks regressed.
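The route-comparison step can be sketched as a before/after share calculation. The example assumes run summaries have already been exported as dicts with hypothetical `prompt_version` and `route` keys; the extraction from traces is not shown.

```python
from collections import Counter


def route_shares(runs: list, version: str) -> dict:
    """Share of each route decision among runs tagged with the given prompt version."""
    routes = [r["route"] for r in runs if r["prompt_version"] == version]
    if not routes:
        return {}
    counts = Counter(routes)
    return {route: n / len(routes) for route, n in counts.items()}


def route_shift(runs: list, old: str, new: str) -> dict:
    """Per-route change in share (new minus old), for routes seen in either version."""
    before, after = route_shares(runs, old), route_shares(runs, new)
    return {
        route: after.get(route, 0.0) - before.get(route, 0.0)
        for route in sorted(set(before) | set(after))
    }
```

A large positive shift toward refusal or escalation right after a deploy points at routing or guardrails rather than final synthesis, which narrows what to inspect next.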
Good interview phrasing:
"I would not start by blaming the model. I would use traces to isolate whether routing, retrieval, tool execution, guardrails, or final synthesis changed."
Interview questions — LangSmith
1. Difference between traces and logs?
- A trace is a nested causal timeline that aligns LLM and tool spans with token and latency metadata, not just a stream of printf lines.
2. Handling sensitive payloads.
- Sample, strip, or hash payloads; apply a client-side masking hook; enforce contractual retention limits.
3. Fighting flaky evaluations.
- Assert on deterministic tool fingerprints rather than full natural-language output; replay recorded mocks in CI.
4. Offline vs online.
- Offline: batched regression runs against datasets. Online: streaming monitors (refusal spikes, latent tool-error ratio).
5. How do you justify LLM-as-judge?
- Use a transparent rubric, re-anchored weekly against a human-adjudicated subset, and never make it the sole gate on creative tasks.
6. What is the most useful production alert for agents?
- Ratio alerts are stronger than raw counts: tool-error rate, refusal-rate shift, cost per successful task, and human-escalation rate.
7. How do you debug "the model got worse"?
- Compare traces by model/prompt/tool version, then isolate whether routing, retrieval, guardrail, tool latency, or final synthesis changed.
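The ratio alerts from question 6 can be sketched as follows. The run-summary keys (`tool_calls`, `tool_errors`, `success`, `refused`, `escalated`, `cost_usd`) and the relative `tolerance` are assumptions made for this example; they are not fields of any particular monitoring product.

```python
def ratio_metrics(runs: list) -> dict:
    """Compute ratio-style health metrics over a window of run summaries."""
    total = len(runs)
    if total == 0:
        return {}
    tool_calls = sum(r["tool_calls"] for r in runs)
    tool_errors = sum(r["tool_errors"] for r in runs)
    successes = sum(1 for r in runs if r["success"])
    total_cost = sum(r["cost_usd"] for r in runs)
    return {
        "tool_error_rate": tool_errors / tool_calls if tool_calls else 0.0,
        "refusal_rate": sum(1 for r in runs if r["refused"]) / total,
        "escalation_rate": sum(1 for r in runs if r["escalated"]) / total,
        "cost_per_success": total_cost / successes if successes else None,
    }


def breached(current: dict, baseline: dict, tolerance: float = 0.5) -> list:
    """Names of metrics that rose more than `tolerance` (relative) above baseline."""
    alerts = []
    for name, base in baseline.items():
        value = current.get(name)
        if value is None or base is None:
            continue  # metric undefined in this window
        if value > base * (1 + tolerance):
            alerts.append(name)
    return alerts
```

Ratios stay comparable when traffic doubles, while raw error counts do not, which is why they make steadier alert conditions.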
Interview answer template
For "How do you observe and evaluate agents with LangSmith?", answer:
- Instrument nested traces for model, retrieval, tool, guardrail, and state transition spans.
- Redact or hash sensitive payloads before traces leave the app boundary.
- Convert representative traces into datasets.
- Use evaluators for tool routing, grounding, safety, latency, and cost.
- Gate releases on regression thresholds.
- Monitor production drift and feed incidents back into datasets.
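The step that converts traces into datasets can be sketched as a small transformation. The trace-summary keys (`run_id`, `user_message`, `tool_calls`, `evidence_ids`, `label`) and the label values are hypothetical, and the example stops short of uploading anything. The output shape of inputs, reference outputs, and metadata is a common one for evaluation examples.

```python
def traces_to_examples(traces: list, keep_labels=("correct", "corrected")) -> list:
    """Turn reviewed trace summaries into dataset examples with reference outputs."""
    examples = []
    seen = set()
    for t in traces:
        if t.get("label") not in keep_labels:
            continue  # unlabeled or rejected traces do not make reliable references
        key = t["user_message"].strip().lower()
        if key in seen:
            continue  # skip duplicates of the same user message
        seen.add(key)
        examples.append({
            "inputs": {"user_message": t["user_message"]},
            "outputs": {
                "expected_tools": t["tool_calls"],
                "evidence_ids": t.get("evidence_ids", []),
            },
            "metadata": {"source_run_id": t["run_id"]},
        })
    return examples
```

Keeping `source_run_id` in the metadata lets a failing example be traced back to the production run it came from. The resulting list could then be loaded through the LangSmith SDK's dataset methods; consult the SDK reference for their current signatures.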
Common bad answers
| Bad answer | Why it is weak |
|---|---|
| "Logs are enough." | Agent debugging needs nested traces linking model, retrieval, tool, state, latency, and cost. |
| "Trace everything raw." | Raw traces can leak PII, secrets, prompts, or tenant data. |
| "If evals pass once, ship forever." | Datasets drift; production traces must feed fresh regression cases. |
Self-check
You are ready if you can explain:
- Trace vs log.
- What signals to capture for agents.
- How to redact sensitive trace data.
- How traces become datasets.
- How to debug a model/prompt regression with traces.
Related
Operationalize in Agentic production · Return to hub overview.