LangSmith for agents
LangSmith captures runs from LangChain/LangGraph: nested spans for model calls, retrieval, tools, prompts, latency, tokens, and failures. The production promotion path is trace -> label -> dataset -> evaluator -> release gate -> monitor.
Context: fundamentals (Agentic fundamentals) and shipped surfaces (Production).
Process — from trace to guarded release
Anatomy of observability plumbing
Red flag: dumping raw PII into traces violates retention policy. Use hashing, masking, and segmentation into per-environment projects.
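A minimal masking sketch, assuming the langsmith Client's hide_inputs/hide_outputs hooks (callables that receive the payload dict and return what gets stored); the sensitive field names are hypothetical:

```python
import hashlib
from langsmith import Client

SENSITIVE_KEYS = {"email", "ssn", "phone"}  # hypothetical field names

def mask(payload: dict) -> dict:
    # Replace sensitive values with a truncated stable hash so runs stay
    # joinable for debugging without storing the raw value.
    return {
        k: hashlib.sha256(str(v).encode()).hexdigest()[:12]
        if k in SENSITIVE_KEYS else v
        for k, v in payload.items()
    }

client = Client(hide_inputs=mask, hide_outputs=mask)
```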
What to trace for agents
| Signal | Why it matters |
|---|---|
| State transition | Shows why the agent moved from plan to retrieval, tool, approval, final, refusal, or escalation. |
| Tool call + validated args | Lets you debug wrong action selection separately from wrong tool implementation. |
| Guardrail decision | Explains blocked requests, policy failures, and high-risk approvals. |
| Evidence ids | Connects final claims to retrieval chunks without storing full sensitive payloads. |
| Prompt/model/tool versions | Makes rollback possible when a regression starts after a deploy. |
| Cost/latency tokens | Separates quality regressions from provider or infra saturation. |
| Outcome label | Enables evaluation datasets from real failed and successful runs. |
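Several of these rows (versions, evidence ids) are just metadata you stamp onto spans so regressions can be sliced by deploy. A sketch using the langsmith @traceable decorator, where the version strings, tag, and return values are example data:

```python
from langsmith import traceable

@traceable(
    run_type="chain",
    name="plan_step",
    metadata={"prompt_version": "planner-v7", "model": "gpt-4o-2024-08-06"},
    tags=["agent:support"],
)
def plan_step(question: str) -> dict:
    # ... call the model, decide the next action ...
    # Returning evidence ids (not full payloads) keeps claims traceable
    # without persisting sensitive chunk contents.
    return {"next_action": "retrieve", "evidence_ids": ["doc-123#c4"]}
```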
Enable tracing locally
```bash
# Point the LangChain/LangGraph SDKs at LangSmith.
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="lsv2_..."
export LANGCHAIN_PROJECT="agentic-demo"
```

```python
# pip install langsmith
from langsmith import Client

client = Client()

# Pull the five most recent runs from the project.
recent = client.list_runs(project_name="agentic-demo", limit=5)
for r in recent:
    # Derive latency from the run's timestamps (None while still running).
    latency = (r.end_time - r.start_time).total_seconds() if r.end_time else None
    print(r.id, r.status, r.total_tokens, latency)
```

Use separate projects per environment (staging, prod-readonly-sample) plus TTL/retention hygiene.
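The label step of the promotion path can be scripted as well as done in the UI; a sketch assuming the Client's create_feedback method, with "outcome" as a naming convention rather than a built-in key:

```python
# Label recent failed runs so they can be promoted into a dataset.
for run in client.list_runs(project_name="agentic-demo", error=True, limit=20):
    client.create_feedback(
        run.id,
        key="outcome",
        score=0,  # 0 = bad outcome, 1 = good
        comment="triage: candidate for regression dataset",
    )
```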
Datasets & evaluators (mental model)
- Capture representative conversation trajectories: the user message sequence, resulting tool_calls, and citations.
- Build evaluation examples referencing expected tools or grounding spans.
- Run automated scorers: exact tool match, cosine-similarity grounding, and a calibrated LLM judge anchored weekly against a human-labeled subset.
- Gate merges on Δ-metric thresholds, with manual review for flagged regressions.
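A sketch of that pipeline with the langsmith SDK: create_dataset, create_example, and evaluate are real APIs (in some SDK versions evaluate lives under langsmith.evaluation); the dataset name, target stub, and example values are hypothetical:

```python
from langsmith import Client, evaluate

client = Client()

# Promote labeled runs into a dataset (names here are hypothetical).
dataset = client.create_dataset(dataset_name="agentic-demo-tool-routing")
client.create_example(
    inputs={"question": "refund order 123"},
    outputs={"expected_tool": "issue_refund"},
    dataset_id=dataset.id,
)

def exact_tool_match(run, example) -> dict:
    # Deterministic scorer: did the agent pick the expected tool?
    chosen = (run.outputs or {}).get("tool")
    expected = example.outputs["expected_tool"]
    return {"key": "tool_match", "score": int(chosen == expected)}

def target(inputs: dict) -> dict:
    # Stand-in for your agent entry point.
    return {"tool": "issue_refund", "answer": "Refund issued."}

results = evaluate(
    target,
    data="agentic-demo-tool-routing",
    evaluators=[exact_tool_match],
    experiment_prefix="tool-routing",
)
```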
Evaluation matrix
| Test type | Good assertion |
|---|---|
| Tool routing | Expected tool name and required argument fields match. |
| Policy refusal | Dangerous request is refused before tool execution. |
| Grounding | Final answer cites allowed evidence ids; no unsupported factual claim. |
| Recovery | Provider timeout degrades gracefully instead of looping. |
| Cost | Step/token budget stays under threshold for common tasks. |
| Regression | New prompt/model beats or matches baseline on critical examples. |
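Wired into CI, the regression row reduces to a Δ-metric comparison between baseline and candidate experiments; a framework-agnostic sketch where the metric names, scores, and tolerances are illustrative:

```python
# Minimal Δ-metric release gate: fail CI if the candidate regresses
# beyond tolerance on any critical metric. Scores would come from the
# experiment summaries; the numbers here are illustrative.
BASELINE = {"tool_match": 0.94, "grounded": 0.91, "refusal_correct": 0.99}
CANDIDATE = {"tool_match": 0.95, "grounded": 0.88, "refusal_correct": 0.99}
TOLERANCE = {"tool_match": 0.01, "grounded": 0.02, "refusal_correct": 0.0}

def gate(baseline: dict, candidate: dict, tolerance: dict) -> list[str]:
    # Return the metrics that regressed past their tolerance.
    return [
        m for m in baseline
        if candidate.get(m, 0.0) < baseline[m] - tolerance[m]
    ]

regressions = gate(BASELINE, CANDIDATE, TOLERANCE)
assert not regressions, f"Release blocked; manual review needed for: {regressions}"
```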
Interview questions — LangSmith
1. Difference between traces and logs?
- A trace is a nested, causal timeline that aligns LLM/tool spans with token and latency metadata; logs are flat printf lines without that structure.
2. Handling sensitive payloads.
- Sampling, stripping, and hashing; a client-side masking hook; contractual retention limits.
3. Fighting flaky evaluations.
- Assert on deterministic tool fingerprints rather than full natural-language output; replay recorded mocks in CI.
4. Offline vs online.
- Offline = batched dataset regressions; online = streaming monitors (refusal spikes, latent tool-error ratio).
5. How do you justify LLM-as-judge?
- A transparent rubric anchored weekly against a human-adjudicated subset; never the sole gate on creative tasks.
6. What is the most useful production alert for agents?
- Ratio alerts are stronger than raw counts: tool-error rate, refusal-rate shift, cost per successful task, and human-escalation rate (see the sketch after this list).
7. How do you debug "the model got worse"?
- Compare traces by model/prompt/tool version, then isolate whether routing, retrieval, guardrail, tool latency, or final synthesis changed.
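A minimal sketch of the ratio alert from question 6, assuming the list_runs run_type and error filters; the window size, threshold, and alert sink are placeholders:

```python
from langsmith import Client

client = Client()

# Tool-error rate over the last N tool spans (window size is illustrative).
N = 500
runs = list(client.list_runs(project_name="agentic-demo", run_type="tool", limit=N))
errors = sum(1 for r in runs if r.status == "error")
rate = errors / max(len(runs), 1)

if rate > 0.05:  # illustrative threshold; wire to your real alert sink
    print(f"ALERT: tool-error rate {rate:.1%} over last {len(runs)} runs")
```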
Next
Operationalize in Agentic production · Return to hub overview.