THN Interview Prep

LangSmith for agents

LangSmith captures runs from LangChain/LangGraph: nested spans for model calls, retrieval, tools, prompts, latency, tokens, and failures. The production promotion path is trace -> label -> dataset -> evaluator -> release gate -> monitor.

Context: fundamentals are covered in Agentic fundamentals; shipped surfaces in Production.


Process — from trace to guarded release

Diagram: trace -> label -> dataset -> evaluator -> release gate -> monitor.

Anatomy of observability plumbing


Red flag: dumping raw PII into traces violates policy. Use hashing, masking, and per-environment project segmentation.
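
One hedged way to keep raw payloads out of traces, assuming a langsmith SDK version that exposes the hide_inputs / hide_outputs hooks on Client; the _mask helper and its email heuristic are purely illustrative:

import hashlib
from langsmith import Client

def _mask(payload: dict) -> dict:
    # Illustrative masking: hash anything that looks like an email before it leaves the process.
    masked = {}
    for key, value in payload.items():
        if isinstance(value, str) and "@" in value:
            masked[key] = "sha256:" + hashlib.sha256(value.encode()).hexdigest()[:12]
        else:
            masked[key] = value
    return masked

# hide_inputs / hide_outputs run client-side, so raw values never reach LangSmith.
client = Client(hide_inputs=_mask, hide_outputs=_mask)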


What to trace for agents

Signal | Why it matters
State transition | Shows why the agent moved from plan to retrieval, tool, approval, final, refusal, or escalation.
Tool call + validated args | Lets you debug wrong action selection separately from wrong tool implementation.
Guardrail decision | Explains blocked requests, policy failures, and high-risk approvals.
Evidence ids | Connects final claims to retrieval chunks without storing full sensitive payloads.
Prompt/model/tool versions | Makes rollback possible when a regression starts after a deploy.
Cost/latency/tokens | Separates quality regressions from provider or infra saturation.
Outcome label | Enables evaluation datasets from real failed and successful runs.
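
A minimal sketch of attaching version metadata to a tool span with the langsmith @traceable decorator, so a regression can be pinned to a deploy; search_orders and the version values are placeholders:

from langsmith import traceable

@traceable(
    run_type="tool",
    name="search_orders",
    metadata={"prompt_version": "2024-06-01", "model": "gpt-4o-mini", "tool_version": "1.3.0"},
    tags=["agentic-demo"],
)
def search_orders(customer_id: str) -> list[dict]:
    # Tool body elided; the decorator records inputs, outputs, errors, and timing as a span.
    return []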

Enable tracing locally

# Shell: point LangChain/LangGraph tracing at a LangSmith project.
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="lsv2_..."
export LANGCHAIN_PROJECT="agentic-demo"

# Python: inspect recent runs (pip install langsmith).
from langsmith import Client

client = Client()
recent = client.list_runs(project_name="agentic-demo", limit=5)
for r in recent:
    # end_time is None while a run is still in flight.
    latency = (r.end_time - r.start_time).total_seconds() if r.end_time else None
    print(r.id, r.status, r.total_tokens, latency)

Use separate projects per environment (for example staging and prod-readonly-sample) and apply retention (TTL) hygiene so traces expire on schedule.
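
A small sketch of routing traces into per-environment projects; APP_ENV and the project names are illustrative:

import os

# Route traces by environment so production data can get stricter retention and access controls.
env = os.getenv("APP_ENV", "staging")  # e.g. staging | prod
os.environ["LANGCHAIN_PROJECT"] = f"agentic-demo-{env}"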


Datasets & evaluators (mental model)

  1. Capture representative conversation trajectories: the user message sequence, the eventual tool_calls, and citations.
  2. Build evaluation examples that reference expected tools or grounding spans.
  3. Run automated scorers: exact tool match, grounding similarity, and an LLM judge calibrated weekly against a human-anchored subset.
  4. Gate merges on metric-delta thresholds, with manual review of flagged regressions (see the sketch after this list).
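
A rough sketch of steps 2-3 with the langsmith SDK, assuming a recent version that exposes evaluate and the (run, example) evaluator signature; the dataset name, field names, and agent_under_test are illustrative:

from langsmith import Client, evaluate

client = Client()

ds = client.create_dataset(dataset_name="agentic-demo-tool-routing")
client.create_examples(
    inputs=[{"question": "Refund order 1234"}],
    outputs=[{"expected_tool": "issue_refund"}],
    dataset_id=ds.id,
)

def exact_tool_match(run, example):
    # Deterministic scorer: did the agent call the tool the label expects?
    called = (run.outputs or {}).get("tool")
    expected = (example.outputs or {}).get("expected_tool")
    return {"key": "tool_match", "score": int(called == expected)}

def agent_under_test(inputs: dict) -> dict:
    # Placeholder target; in practice this invokes the real agent and returns its tool choice.
    return {"tool": "issue_refund"}

evaluate(agent_under_test, data="agentic-demo-tool-routing", evaluators=[exact_tool_match])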

Evaluation matrix

Test type | Good assertion
Tool routing | Expected tool name and required argument fields match.
Policy refusal | Dangerous request is refused before tool execution.
Grounding | Final answer cites allowed evidence ids; no unsupported factual claim.
Recovery | Provider timeout degrades gracefully instead of looping.
Cost | Step/token budget stays under threshold for common tasks.
Regression | New prompt/model beats or matches baseline on critical examples.
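
A hedged example of the regression gate as plain Python in CI; the baseline scores and tolerance are made up:

# Hypothetical merge gate: fail CI when a candidate score drops below baseline beyond tolerance.
BASELINES = {"tool_match": 0.92, "grounding": 0.88}
TOLERANCE = 0.02  # allowed absolute drop before a human must review

def gate(candidate: dict[str, float]) -> None:
    for metric, baseline in BASELINES.items():
        drop = baseline - candidate.get(metric, 0.0)
        if drop > TOLERANCE:
            raise SystemExit(f"Regression on {metric}: dropped {drop:.3f} below baseline; manual review required")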

Interview questions — LangSmith

1. Difference between traces and logs?

  • A trace is a nested causal timeline that aligns LLM and tool spans with token/latency metadata; logs are just flat printf lines.

2. Handling sensitive payloads.

  • Sample, strip, or hash sensitive fields; add a client-side masking hook; agree contractual retention limits.

3. Fighting flaky evaluations.

  • Assert on deterministic tool fingerprints rather than full natural-language output; replay recorded mocks in CI.

4. Offline vs online.

  • Offline = batched dataset regressions; online = streaming monitors (refusal spikes, tool-error ratio, latency drift).

5. How do you justify LLM-as-judge?

  • With a transparent rubric anchored weekly against a human-adjudicated subset; never the sole gate on creative tasks.

6. What is the most useful production alert for agents?

  • Ratio alerts are stronger than raw counts: tool-error rate, refusal-rate shift, cost per successful task, and human-escalation rate.
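
A sketch of one such ratio monitor over the last hour using Client.list_runs; the 5% threshold is illustrative, and it assumes tool runs carry an error field when they fail:

from datetime import datetime, timedelta, timezone
from langsmith import Client

client = Client()
since = datetime.now(timezone.utc) - timedelta(hours=1)

# Pull recent tool spans and compute the error ratio rather than a raw count.
tool_runs = list(client.list_runs(project_name="agentic-demo", run_type="tool", start_time=since))
errors = sum(1 for r in tool_runs if r.error)
rate = errors / max(len(tool_runs), 1)

if rate > 0.05:  # illustrative alert threshold
    print(f"ALERT: tool-error rate {rate:.1%} over the last hour")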

7. How do you debug "the model got worse"?

  • Compare traces by model/prompt/tool version, then isolate whether routing, retrieval, guardrail, tool latency, or final synthesis changed.

Next

Operationalize in Agentic production · Return to hub overview.
