THN Interview Prep

Evaluations for LLM and agent systems

You cannot make GenAI reliable with demos. You need offline datasets, online telemetry, and human calibration. For agents, evaluate the whole trajectory: retrieval, state transitions, tool calls, guardrails, and final answer.

Interview principle:

Evaluate the workflow that creates the answer, not only the final paragraph.


Evaluation loop


The loop closes when real failures become labeled examples, then those examples block future regressions.


What to test

| System part | Example assertion |
| --- | --- |
| Retrieval | Expected evidence ids appear in top-k for representative queries. |
| Packing | Unauthorized, stale, duplicate, or irrelevant chunks are excluded. |
| Tool routing | Correct tool is selected only when the task requires it. |
| Tool args | Required ids, enums, scopes, and limits are valid. |
| Policy | Disallowed requests refuse before retrieval or side effects. |
| Final answer | Claims are supported by evidence or tool output. |
| Cost/latency | Common tasks stay under cost and latency thresholds. |
| Recovery | Timeouts, empty retrieval, and invalid JSON degrade safely. |
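
Two of these assertions, retrieval and tool args, can be sketched as plain functions. This is a minimal illustration, not a prescribed harness: the names `recall_at_k` and `validate_tool_args`, the argument shapes, and the default `max_limit` of 100 are all assumptions made for the example.

```python
def recall_at_k(retrieved_ids, expected_ids, k):
    """Fraction of expected evidence ids that appear in the top-k results."""
    if not expected_ids:
        return 1.0  # nothing was expected, so nothing can be missed
    top_k = set(retrieved_ids[:k])
    hits = sum(1 for evidence_id in expected_ids if evidence_id in top_k)
    return hits / len(expected_ids)


def validate_tool_args(args, required, enums=None, max_limit=100):
    """Return a list of problems with a tool call's arguments (empty means valid)."""
    problems = []
    for name in required:
        if name not in args:
            problems.append(f"missing required arg: {name}")
    for name, allowed in (enums or {}).items():
        if name in args and args[name] not in allowed:
            problems.append(f"invalid value for {name}: {args[name]!r}")
    # Hypothetical convention: a "limit" arg must be between 1 and max_limit.
    if "limit" in args and not (1 <= args["limit"] <= max_limit):
        problems.append(f"limit out of range: {args['limit']}")
    return problems
```

In a regression suite, each representative query would assert a minimum `recall_at_k`, and each recorded tool call would assert that `validate_tool_args` returns an empty list.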

Offline vs online vs human

| Mode | Use for | Weakness | Good signal |
| --- | --- | --- | --- |
| Offline eval | Regression tests before release. | Dataset can become stale or too easy. | Pass/fail by task family and risk tier. |
| Online monitor | Drift, refusal spikes, cost, tool errors. | Usually detects symptoms after users are affected. | Rate shifts and threshold alerts. |
| Human review | Ambiguous quality, safety, judgment, high-risk actions. | Expensive and slower. | Calibrated labels and disagreement notes. |
| LLM judge | Scalable rubric scoring and triage. | Needs calibration against human labels. | Agreement with human anchor set. |
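
"Agreement with human anchor set" can be made concrete with raw agreement plus Cohen's kappa, which corrects for agreement expected by chance. The sketch below assumes both label lists are aligned by example; the function name `agreement_and_kappa` is made up for illustration, and kappa is one reasonable choice of statistic rather than the only one.

```python
from collections import Counter


def agreement_and_kappa(human_labels, judge_labels):
    """Return (raw agreement, Cohen's kappa) for two aligned label lists."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human_labels)
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    human_counts = Counter(human_labels)
    judge_counts = Counter(judge_labels)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # both raters used one identical label throughout
    return observed, (observed - expected) / (1.0 - expected)
```

High raw agreement with low kappa usually means one label dominates the anchor set, so the judge looks accurate without discriminating between cases.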

Dataset design

A useful GenAI eval dataset is not just "100 random prompts." It should cover:

| Category | Examples |
| --- | --- |
| Happy path | Common questions with clear evidence. |
| Hard retrieval | Synonyms, identifiers, stale docs, similar policies. |
| Negative cases | No evidence, forbidden data, unsupported requests. |
| Safety cases | Prompt injection, exfiltration, tool abuse, PII. |
| Edge cases | Multilingual input, tables, long docs, conflicting sources. |
| Regression cases | Real incidents and recent user complaints. |

Keep a holdout set. If you tune repeatedly against every eval example, the benchmark becomes a memorized worksheet and stops predicting real-world quality.
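
One way to enforce both ideas, category coverage and a stable holdout, is to tag every case with a category and assign holdout membership by hashing the case id, so the split does not change between runs. This is a sketch under assumptions: the `EvalCase` fields, the category slugs, and the 20% default are illustrative choices, not a required schema.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical slugs for the six categories in the table above.
CATEGORIES = {"happy_path", "hard_retrieval", "negative", "safety", "edge", "regression"}


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    category: str
    prompt: str
    expected_evidence_ids: tuple = ()

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")


def is_holdout(case_id, holdout_fraction=0.2):
    """Deterministically assign a case to the holdout set from its id."""
    digest = hashlib.sha256(case_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return bucket < holdout_fraction


def coverage_gaps(cases):
    """Return the categories that have no cases at all, sorted by name."""
    return sorted(CATEGORIES - {case.category for case in cases})
```

Because membership depends only on `case_id`, adding new cases never moves existing ones between the tuning set and the holdout set.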


Metrics by system type

| System | Better metrics |
| --- | --- |
| RAG QA | Top-k evidence recall, citation precision, unsupported-claim rate, answer helpfulness. |
| Extraction | Exact match, field-level F1, schema validity, semantic validation pass rate. |
| Tool agent | Correct tool route, valid args, side-effect success, step count, escalation correctness. |
| Support assistant | Resolution rate, safe refusal rate, handoff quality, cost per resolved ticket. |
| High-risk workflow | Human approval precision, blocked unsafe action rate, audit completeness. |
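
Two of these metrics are simple enough to sketch directly. The example assumes claims have already been labeled as supported or not (by a human or a calibrated judge) and that extraction output is a flat dict of field names to values; both function names and input shapes are illustrative.

```python
def unsupported_claim_rate(claims):
    """Share of claims labeled as not supported by evidence or tool output."""
    if not claims:
        return 0.0
    return sum(1 for claim in claims if not claim["supported"]) / len(claims)


def field_f1(predicted, expected):
    """Field-level F1: a field counts as correct only if name and value both match."""
    true_positives = sum(
        1 for name, value in predicted.items()
        if name in expected and expected[name] == value
    )
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, predicting three fields of which two match, against four expected fields, gives precision 2/3, recall 1/2, and F1 of 4/7.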

Example release gate

Before shipping a prompt/model/retriever change:

  1. Offline eval must not regress critical tasks.
  2. Unsafe-action tests must pass.
  3. Unsupported-claim rate must stay below threshold.
  4. p95 latency and cost per success must stay within budget.
  5. Canary traffic must not show refusal or escalation spikes.

If a metric fails, do not average it away with easy examples. Fix the failing slice or explicitly accept the risk.
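
The offline parts of this gate (steps 1 to 4) can be sketched as a function that returns blocking reasons. Pass rates are compared per slice so that easy slices cannot mask a regression on a critical one. The dictionary keys, the `limits` names, and the function name `release_gate` are assumptions for illustration; the canary check in step 5 is an online signal and is not covered here.

```python
def release_gate(baseline, candidate, limits):
    """Return a list of blocking reasons; an empty list means the gate passes."""
    reasons = []
    # 1. Compare pass rates slice by slice, never as one global average.
    for name, base_rate in baseline["slice_pass_rate"].items():
        cand_rate = candidate["slice_pass_rate"].get(name, 0.0)
        if cand_rate < base_rate - limits["max_slice_regression"]:
            reasons.append(f"regression on slice {name}: {base_rate:.2f} -> {cand_rate:.2f}")
    # 2. Any failing unsafe-action test blocks the release.
    if candidate["unsafe_action_failures"] > 0:
        reasons.append("unsafe-action tests failed")
    # 3. Unsupported-claim rate must stay strictly below the threshold.
    if candidate["unsupported_claim_rate"] >= limits["max_unsupported_claim_rate"]:
        reasons.append("unsupported-claim rate at or above threshold")
    # 4. Latency and cost budgets.
    if candidate["p95_latency_s"] > limits["max_p95_latency_s"]:
        reasons.append("p95 latency over budget")
    if candidate["cost_per_success"] > limits["max_cost_per_success"]:
        reasons.append("cost per success over budget")
    return reasons
```

Returning reasons instead of a boolean makes the "explicitly accept the risk" path auditable: whoever overrides the gate can see and record exactly which check they are waiving.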


Interview answer template

For "How would you evaluate this GenAI system?", answer:

  1. Define task and risk level.
  2. List components to evaluate: retrieval, packing, model output, tools, guardrails, final UX.
  3. Create offline golden sets with positive, negative, safety, and edge cases.
  4. Use targeted metrics, not only thumbs-up/down.
  5. Calibrate judges with humans.
  6. Monitor production for drift, cost, latency, refusals, tool errors.
  7. Gate releases and feed incidents back into the dataset.

Interview questions

1. What is the difference between evaluating an LLM answer and an agent?

  • LLM answer evals score a single output, for example correctness, grounding, and format. Agent evals also inspect the trajectory: state transitions, retrieval, tool calls, approvals, guardrail decisions, and final answer.

Follow-up: Why is final-answer scoring insufficient?

  • The final answer may look good while the agent used the wrong tool, leaked data, skipped approval, or consumed too much budget.
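
A trajectory check can catch exactly these failures even when the final answer reads well. The sketch below assumes a trace is a list of step dicts with a `type` of `tool_call`, `approval`, or `final_answer`; that trace shape, the tool names in the usage note, and the function name `check_trajectory` are hypothetical.

```python
def check_trajectory(steps, expected_tools, approval_required, max_steps):
    """Return a list of trajectory problems; an empty list means the trace is clean."""
    problems = []
    approved = set()
    for step in steps:
        kind = step.get("type")
        if kind == "approval":
            approved.add(step["tool"])
        elif kind == "tool_call":
            tool = step["tool"]
            if tool not in expected_tools:
                problems.append(f"unexpected tool: {tool}")
            # Approval must come before the call, not after it.
            if tool in approval_required and tool not in approved:
                problems.append(f"{tool} called without prior approval")
    called = {step["tool"] for step in steps if step.get("type") == "tool_call"}
    for tool in sorted(set(expected_tools) - called):
        problems.append(f"expected tool never called: {tool}")
    if len(steps) > max_steps:
        problems.append(f"too many steps: {len(steps)} > {max_steps}")
    return problems
```

A trace that calls a hypothetical `issue_refund` tool with no earlier approval step is flagged regardless of what the final answer says.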

2. What metrics catch hallucination?

  • Unsupported-claim rate, citation precision/recall, evidence coverage, contradiction checks, and human factuality labels.

Follow-up: Can citations be faked?

  • Yes. Evaluate whether cited evidence actually supports the claim, not only whether a citation exists.
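
The distinction between "a citation exists" and "the citation supports the claim" can be sketched by sorting claims into four buckets. The `supports` callable is a stand-in for whatever support check you trust (human labels or a calibrated judge); it, the claim shape, and the name `citation_report` are assumptions for this example.

```python
def citation_report(claims, retrieved_ids, supports):
    """Count claims by citation status.

    claims: list of {"text": str, "cited_id": str or None}
    retrieved_ids: ids of evidence actually available to the model
    supports: callable(claim_text, evidence_id) -> bool, supplied by the caller
    """
    report = {"supported": 0, "unsupported": 0, "fabricated": 0, "uncited": 0}
    for claim in claims:
        cited_id = claim.get("cited_id")
        if cited_id is None:
            report["uncited"] += 1
        elif cited_id not in retrieved_ids:
            report["fabricated"] += 1  # cites something that was never retrieved
        elif supports(claim["text"], cited_id):
            report["supported"] += 1
        else:
            report["unsupported"] += 1  # real citation, but it does not back the claim
    return report
```

A check that only looks for the presence of a citation would count the `unsupported` and `fabricated` buckets as passes, which is the failure this question is probing.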

3. How do you prevent eval leakage?

  • Keep holdout sets, version datasets with prompts/models/corpora, avoid repeated tuning on the same examples, and add fresh production failures.

4. What blocks release?

  • Regression on critical tasks, unsafe actions or missed refusals, cost blowup, latency SLO breach, reduced grounding, or trace redaction failure.

5. When is LLM-as-judge acceptable?

  • For scalable triage with a clear rubric and human-calibrated anchor set. It should not be the only gate for high-risk or highly subjective tasks.

Common bad answers

| Bad answer | Why it is weak |
| --- | --- |
| "We test by trying a few prompts manually." | Demos do not catch regressions, safety failures, or edge cases. |
| "Use thumbs-up/down as the main metric." | Too coarse to diagnose retrieval, grounding, tools, refusal, latency, or cost. |
| "Use an LLM judge for everything." | Judges need calibration and are risky as the only gate for high-stakes tasks. |

Self-check

You are ready if you can explain:

  • What belongs in a golden dataset.
  • Why evals need negative and safety cases.
  • How to evaluate a RAG answer differently from an agent trajectory.
  • Why cited answers still need support checks.
  • What blocks a release.
