LangSmith for agents

LangSmith captures runs from LangChain/LangGraph: nested spans for model calls, retrieval, tools, prompts, latency, tokens, and failures. The production promotion path is trace -> label -> dataset -> evaluator -> release gate -> monitor.

Context: fundamentals (Agentic fundamentals) and shipped surfaces (Production).

Process — from trace to guarded release

[Diagram: process from trace to guarded release]

Anatomy of observability plumbing

[Diagram: anatomy of observability plumbing]

Red flag: dumping raw PII into traces violates policy. Hash or mask sensitive fields before they are traced, and segment traces into a separate project per environment.
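A minimal sketch of client-side masking, assuming email addresses and a few hypothetical field names (`user_id`, `email`, `phone`) are the sensitive values. `hash_value` and `mask_payload` are illustrative helpers, not part of LangSmith. The LangSmith `Client` accepts `hide_inputs` and `hide_outputs` callables, which is one place a function like this could be plugged in; verify that wiring against your SDK version.

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SENSITIVE_KEYS = {"user_id", "email", "phone"}  # hypothetical field names


def hash_value(value: str, salt: str = "demo-salt") -> str:
    """Return a short, stable digest so runs stay joinable without raw PII.

    The default salt is a placeholder; a real deployment would load a secret.
    """
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "sha256:" + digest[:12]


def mask_payload(payload):
    """Recursively hash sensitive keys and mask email addresses in free text."""
    if isinstance(payload, dict):
        return {
            key: hash_value(str(value)) if key in SENSITIVE_KEYS else mask_payload(value)
            for key, value in payload.items()
        }
    if isinstance(payload, list):
        return [mask_payload(item) for item in payload]
    if isinstance(payload, str):
        return EMAIL_RE.sub("[EMAIL]", payload)
    return payload  # numbers, booleans, None pass through unchanged


raw = {"user_id": "u-123", "messages": ["Reach me at jane@example.com"]}
print(mask_payload(raw))
```

Hashing rather than dropping the identifier keeps runs from the same user groupable during debugging, while the raw value never leaves the app boundary.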

What to trace for agents

Signal	Why it matters
State transition	Shows why the agent moved from plan to retrieval, tool, approval, final, refusal, or escalation.
Tool call + validated args	Lets you debug wrong action selection separately from wrong tool implementation.
Guardrail decision	Explains blocked requests, policy failures, and high-risk approvals.
Evidence ids	Connects final claims to retrieval chunks without storing full sensitive payloads.
Prompt/model/tool versions	Makes rollback possible when a regression starts after a deploy.
Cost, latency, tokens	Separates quality regressions from provider or infra saturation.
Outcome label	Enables evaluation datasets from real failed and successful runs.
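One way to keep these signals together is a small record attached to each span as metadata. The sketch below is an assumption about shape, not a LangSmith schema: `AgentSpanRecord` and all of its field names are hypothetical.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class AgentSpanRecord:
    """Hypothetical per-step metadata mirroring the signals in the table above."""
    from_state: str
    to_state: str
    tool_name: Optional[str] = None
    tool_args: dict = field(default_factory=dict)      # validated args only
    guardrail_decision: str = "allow"
    evidence_ids: list = field(default_factory=list)   # chunk ids, not chunk text
    prompt_version: str = "unknown"
    model_version: str = "unknown"
    tool_version: str = "unknown"
    total_tokens: int = 0
    latency_s: float = 0.0
    outcome_label: Optional[str] = None                # filled in later, during labeling

    def to_metadata(self) -> dict:
        """Flatten to a plain dict suitable for attaching to a trace span."""
        return asdict(self)


record = AgentSpanRecord(
    from_state="plan",
    to_state="tool",
    tool_name="search_orders",           # hypothetical tool
    tool_args={"order_id": "A-1001"},
    evidence_ids=["chunk-17"],
    prompt_version="p-2024-06-01",
)
print(record.to_metadata())
```

Storing evidence ids and versions as plain fields makes them filterable later, which is what the rollback and dataset steps depend on.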

Enable tracing locally

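# Shell: enable tracing and point the SDK at a project before starting the app.
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="lsv2_..."
export LANGCHAIN_PROJECT="agentic-demo"

# Python (pip install langsmith): list the five most recent runs in the project.
from langsmith import Client

client = Client()  # picks up the API key from the environment
recent = client.list_runs(project_name="agentic-demo", limit=5)
for r in recent:
    print(r.id, r.status, r.total_tokens, r.latency)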

Use a separate project per environment (for example staging and prod-readonly-sample), and set a retention period (TTL) so old traces expire.

Datasets & evaluators (mental model)

1. Capture representative conversation trajectories: the user message sequence, the resulting tool_calls, and citations.
2. Build evaluation examples that reference the expected tools or grounding spans.
3. Run automated scorers: exact tool match, cosine-similarity grounding, and an LLM judge calibrated weekly against a human-labeled anchor subset.
4. Gate merges on metric-delta (Δ) thresholds, with manual review of any flagged regressions.
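The scoring and gating steps above can be sketched in plain Python. This is an illustration under assumptions, not LangSmith's evaluator API: `tool_match_score`, `gate`, the example dictionary shape, and the `max_drop` default are all hypothetical.

```python
from statistics import mean


def tool_match_score(expected: dict, actual: dict) -> float:
    """1.0 when the tool name matches and every required argument is present."""
    if expected["tool"] != actual.get("tool"):
        return 0.0
    required = expected.get("required_args", [])
    args = actual.get("args", {})
    return 1.0 if all(name in args for name in required) else 0.0


def gate(baseline_scores, candidate_scores, max_drop=0.02):
    """Return (passed, delta); fail when the mean score drops by more than max_drop."""
    delta = mean(candidate_scores) - mean(baseline_scores)
    return delta >= -max_drop, delta


expected = {"tool": "search_orders", "required_args": ["order_id"]}  # hypothetical tool
good = {"tool": "search_orders", "args": {"order_id": "A-1001"}}
bad = {"tool": "refund_order", "args": {"order_id": "A-1001"}}

baseline = [tool_match_score(expected, good)] * 4
candidate = [tool_match_score(expected, good)] * 3 + [tool_match_score(expected, bad)]
passed, delta = gate(baseline, candidate)
print(passed, delta)  # the gate fails: the mean dropped from 1.0 to 0.75
```

Gating on the delta against a baseline, rather than on an absolute score, keeps the check meaningful when the dataset itself gets harder over time.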

Evaluation matrix

Test type	Good assertion
Tool routing	Expected tool name and required argument fields match.
Policy refusal	Dangerous request is refused before tool execution.
Grounding	Final answer cites allowed evidence ids; no unsupported factual claim.
Recovery	Provider timeout degrades gracefully instead of looping.
Cost	Step/token budget stays under threshold for common tasks.
Regression	New prompt/model beats or matches baseline on critical examples.
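Three rows of the matrix (policy refusal, grounding, cost) reduce to simple checks over a recorded run. The sketch below assumes a hypothetical run format: a list of event names, a list of cited evidence ids, and step/token counters. The id comparison only verifies that citations are allowed; detecting unsupported factual claims still needs a separate scorer.

```python
def check_refusal_before_tool(events):
    """Pass if a refusal occurs and no tool call precedes it."""
    for event in events:
        if event == "tool_call":
            return False
        if event == "refusal":
            return True
    return False  # no refusal at all


def check_grounding(cited_ids, allowed_ids):
    """Pass only if at least one id is cited and every cited id is allowed."""
    cited = set(cited_ids)
    return bool(cited) and cited <= set(allowed_ids)


def check_budget(steps, total_tokens, max_steps=8, max_tokens=6000):
    """Pass only if the run stayed within both the step and the token budget."""
    return steps <= max_steps and total_tokens <= max_tokens


# Thresholds and event names are placeholders, not recommended values.
print(check_refusal_before_tool(["plan", "refusal"]))
print(check_grounding(["chunk-17"], ["chunk-17", "chunk-22"]))
print(check_budget(steps=5, total_tokens=3200))
```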

Example debugging workflow

Symptom: "The agent got worse after yesterday's deploy."

Debug in this order:

1. Filter traces by prompt/model/tool version around the deploy window.
2. Compare route decisions: did more requests go to tools, refusal, or escalation?
3. Inspect retrieval spans: top-k ids, scores, corpus version, and empty retrieval rate.
4. Inspect tool spans: selected tool, validated args, result status, latency.
5. Compare final answers against the previous version on the same golden dataset.
6. Roll back the prompt/model/router if critical tasks regressed.
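The route-comparison step can be sketched as a diff of route shares between two versions. This assumes runs have already been exported as dictionaries with a hypothetical `route` key; `route_shares` and `route_shift` are illustrative names.

```python
from collections import Counter


def route_shares(runs):
    """Map each route (tool, refusal, escalation, ...) to its share of runs."""
    counts = Counter(run["route"] for run in runs)
    total = sum(counts.values())
    return {route: n / total for route, n in counts.items()} if total else {}


def route_shift(before, after):
    """Per-route change in share between two versions, largest absolute change first."""
    b, a = route_shares(before), route_shares(after)
    deltas = {route: a.get(route, 0.0) - b.get(route, 0.0) for route in set(a) | set(b)}
    return sorted(deltas.items(), key=lambda item: abs(item[1]), reverse=True)


before = [{"route": "tool"}] * 3 + [{"route": "refusal"}]
after = [{"route": "tool"}] + [{"route": "refusal"}] * 3
for route, delta in route_shift(before, after):
    print(f"{route}: {delta:+.2f}")
```

A large shift toward refusal or escalation right after a deploy points at the router or guardrails rather than at final synthesis.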

Good interview phrasing:

"I would not start by blaming the model. I would use traces to isolate whether routing, retrieval, tool execution, guardrails, or final synthesis changed."

Interview questions — LangSmith

1. Difference between traces and logs?

A trace is a nested causal timeline that aligns LLM and tool spans with token and latency metadata; a log is just a flat stream of printf-style lines.

2. Handling sensitive payloads.

Sample traces, strip or hash sensitive fields, apply a client-side masking hook before upload, and enforce contractual retention limits.

3. Fighting flaky evaluations.

Assert on deterministic tool fingerprints rather than full natural-language output, and replay recorded mocks in CI.
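A tool fingerprint can be as simple as a hash over the ordered tool names and canonicalized arguments. The sketch below is one possible definition, assuming tool calls are recorded as dictionaries with hypothetical `tool` and `args` keys; free-text fields such as the tool output are deliberately ignored.

```python
import hashlib
import json


def tool_fingerprint(tool_calls):
    """Hash the ordered tool names and canonicalized args, ignoring free-text output."""
    canonical = [
        {"tool": call["tool"], "args": call.get("args", {})}
        for call in tool_calls
    ]
    # sort_keys makes the hash independent of argument ordering
    blob = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:16]


run_a = [{"tool": "search_orders", "args": {"order_id": "A-1", "limit": 5}, "output": "3 hits"}]
run_b = [{"tool": "search_orders", "args": {"limit": 5, "order_id": "A-1"}, "output": "found three"}]
print(tool_fingerprint(run_a) == tool_fingerprint(run_b))
```

Because wording in the final answer varies between runs while the sequence of actions usually does not, asserting on the fingerprint removes most of the nondeterminism from CI.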

4. Offline vs online.

Offline means batched regression runs over datasets. Online means streaming monitors on live traffic (refusal spikes, latent tool-error ratio).

5. How do you justify LLM-as-judge?

Use a transparent rubric, anchored weekly against a human-adjudicated subset, and never make it the sole gate on creative tasks.
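Anchoring against human labels can start as a plain agreement rate over the adjudicated subset. This is a rough sketch: `agreement_rate`, `judge_is_trusted`, and the 0.9 threshold are assumptions for illustration, and a chance-corrected statistic may be a better fit when labels are imbalanced.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of anchor examples where the LLM judge matches the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        raise ValueError("anchor subset is empty")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def judge_is_trusted(judge_labels, human_labels, min_agreement=0.9):
    """True when the judge agrees with humans often enough to keep it in the gate."""
    return agreement_rate(judge_labels, human_labels) >= min_agreement


human = ["pass", "fail", "pass", "pass"]
judge = ["pass", "fail", "fail", "pass"]
print(agreement_rate(judge, human), judge_is_trusted(judge, human))
```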

6. What is the most useful production alert for agents?

Ratio alerts are stronger than raw counts: tool-error rate, refusal-rate shift, cost per successful task, and human-escalation rate.
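A minimal sketch of ratio-based alerting, assuming a window of aggregated counters and a stored baseline. The counter names, the per-run denominators, and the `max_relative_increase` default are hypothetical. This version only flags increases; a refusal-rate drop would need a symmetric check.

```python
def ratio_alerts(window, baseline, max_relative_increase=0.5):
    """Return the ratio metrics that rose more than the allowed fraction over baseline."""
    total = window["runs"]
    if total == 0:
        return []  # nothing to compare in an empty window
    successes = window["successes"]
    current = {
        "tool_error_rate": window["tool_errors"] / total,
        "refusal_rate": window["refusals"] / total,
        "escalation_rate": window["escalations"] / total,
        "cost_per_success": window["cost_usd"] / successes if successes else float("inf"),
    }
    return sorted(
        name for name, value in current.items()
        if value > baseline[name] * (1 + max_relative_increase)
    )


window = {"runs": 100, "tool_errors": 10, "refusals": 5, "escalations": 2,
          "successes": 80, "cost_usd": 8.0}
baseline = {"tool_error_rate": 0.05, "refusal_rate": 0.05,
            "escalation_rate": 0.02, "cost_per_success": 0.10}
print(ratio_alerts(window, baseline))
```

Dividing by runs or by successful tasks keeps the alert stable when traffic volume changes, which a raw error count would not.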

7. How do you debug "the model got worse"?

Compare traces by model/prompt/tool version, then isolate whether routing, retrieval, guardrail, tool latency, or final synthesis changed.

Interview answer template

For "How do you observe and evaluate agents with LangSmith?", answer:

Instrument nested traces for model, retrieval, tool, guardrail, and state transition spans.
Redact or hash sensitive payloads before traces leave the app boundary.
Convert representative traces into datasets.
Use evaluators for tool routing, grounding, safety, latency, and cost.
Gate releases on regression thresholds.
Monitor production drift and feed incidents back into datasets.

Common bad answers

Bad answer	Why it is weak
"Logs are enough."	Agent debugging needs nested traces linking model, retrieval, tool, state, latency, and cost.
"Trace everything raw."	Raw traces can leak PII, secrets, prompts, or tenant data.
"If evals pass once, ship forever."	Datasets drift; production traces must feed fresh regression cases.

Self-check

You are ready if you can explain:

Trace vs log.
What signals to capture for agents.
How to redact sensitive trace data.
How traces become datasets.
How to debug a model/prompt regression with traces.
