THN Interview Prep

Evaluations for LLM and agent systems

You cannot make GenAI reliable with demos. You need offline datasets, online telemetry, and human calibration. For agents, evaluate the whole trajectory: retrieval, state transitions, tool calls, guardrails, and final answer.
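A minimal sketch of what "evaluate the whole trajectory" can look like in practice: replay a recorded agent run through per-step checks instead of grading only the final answer. The `Trajectory` and `Case` structures and field names below are illustrative assumptions, not any particular framework's API.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    # One recorded agent run: what was retrieved, which tools were called,
    # whether a guardrail fired, and the final answer.
    retrieved_ids: list[str]
    tool_calls: list[dict]            # e.g. {"name": "lookup_order", "args": {...}}
    guardrail_triggered: bool
    final_answer: str

@dataclass
class Case:
    # Offline test case with an expectation for each stage of the trajectory.
    expected_evidence: set[str]
    expected_tool: str
    must_refuse: bool
    answer_must_contain: str

def evaluate(traj: Trajectory, case: Case) -> dict[str, bool]:
    """Score every stage of the run, not only the final answer."""
    return {
        "retrieval": case.expected_evidence <= set(traj.retrieved_ids),
        "tool_routing": any(c["name"] == case.expected_tool for c in traj.tool_calls),
        "policy": traj.guardrail_triggered == case.must_refuse,
        "final_answer": case.answer_must_contain.lower() in traj.final_answer.lower(),
    }
```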


Evaluation loop

(Diagram: the evaluation loop.)

What to test

| System part | Example assertion |
| --- | --- |
| Retrieval | Expected evidence ids appear in top-k. |
| Packing | Unauthorized/stale chunks are excluded. |
| Tool routing | Correct tool is selected for the task. |
| Tool args | Required ids, enums, and limits are valid. |
| Policy | Disallowed requests refuse before side effects. |
| Final answer | Claims are supported by evidence or tool output. |
| Cost/latency | Common tasks stay under thresholds. |
| Recovery | Timeouts, empty retrieval, and invalid JSON degrade safely. |
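The rows above map naturally onto small, deterministic assertions. A sketch under the assumption that each run is logged as a plain dict; the field names (`top_k_ids`, `tool_args`, `side_effects`, and so on) are made up for illustration.

```python
def check_retrieval(run: dict, expected_ids: set[str], k: int = 5) -> bool:
    # Expected evidence ids appear in the top-k retrieved chunks.
    return expected_ids <= set(run["top_k_ids"][:k])

def check_tool_args(run: dict) -> bool:
    # Required ids are present and enum/limit constraints hold.
    args = run["tool_args"]
    return (
        bool(args.get("customer_id"))
        and args.get("status") in {"open", "closed", "pending"}
        and 1 <= args.get("limit", 0) <= 100
    )

def check_policy(run: dict) -> bool:
    # Disallowed requests must be refused before any side effect executes.
    if run["request_disallowed"]:
        return run["refused"] and not run["side_effects"]
    return True

def check_latency(run: dict, budget_ms: int = 3000) -> bool:
    # Common tasks stay under the latency threshold.
    return run["latency_ms"] <= budget_ms
```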

Offline vs online vs human

| Mode | Use for | Weakness |
| --- | --- | --- |
| Offline eval | Regression tests before release. | Dataset can become stale or too easy. |
| Online monitor | Drift, refusal spikes, cost, tool errors. | Usually detects symptoms after users are affected. |
| Human review | Ambiguous quality, safety, judgment, high-risk actions. | Expensive and slower. |
| LLM judge | Scalable rubric scoring and triage. | Needs calibration against human labels. |
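One concrete way to calibrate an LLM judge: compare its pass/fail decisions against a sample of human labels and track agreement before trusting it for automated triage. A minimal sketch; the labels and the 0.9 threshold are assumptions to be tuned per task.

```python
def judge_agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of examples where the LLM judge matches the human label."""
    assert len(judge_labels) == len(human_labels) and human_labels
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

# Example: only let the judge gate releases once agreement clears a bar.
judge = [True, True, False, True, False, True]
human = [True, False, False, True, False, True]
agreement = judge_agreement(judge, human)   # 5/6 ~= 0.83
judge_is_trusted = agreement >= 0.9         # assumed threshold
```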

Interview questions

1. What is the difference between evaluating an LLM answer and an agent?

  • An answer eval grades a single output; an agent eval inspects the whole trajectory: state transitions, retrieval, tool calls, approvals, and the final answer.

2. What metrics catch hallucination?

  • Unsupported-claim rate, citation precision/recall, evidence coverage, and human factuality labels (see the precision/recall sketch after this list).

3. How do you prevent eval leakage?

  • Keep holdout sets, avoid tuning repeatedly on the same examples, and version datasets with prompts/models.

4. What blocks release?

  • Regression on critical tasks, unsafe refusals/actions, cost blowup, latency SLO breach, or reduced grounding.
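For question 2, citation precision and recall can be computed directly from the cited ids versus the gold evidence ids. A minimal sketch, assuming each answer is logged with the ids it cites; the id values are made up.

```python
def citation_precision_recall(cited: set[str], gold: set[str]) -> tuple[float, float]:
    # Precision: fraction of cited chunks that are genuine evidence.
    # Recall: fraction of the gold evidence that the answer cited.
    if not cited or not gold:
        return 0.0, 0.0
    hits = len(cited & gold)
    return hits / len(cited), hits / len(gold)

precision, recall = citation_precision_recall(
    cited={"doc-12", "doc-31", "doc-90"},
    gold={"doc-12", "doc-31"},
)
# precision = 2/3, recall = 2/2
```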

