Evaluations for LLM and agent systems

You cannot make GenAI reliable with demos. You need offline datasets, online telemetry, and human calibration. For agents, evaluate the whole trajectory: retrieval, state transitions, tool calls, guardrails, and final answer.

Interview principle:

Evaluate the workflow that creates the answer, not only the final paragraph.

Evaluation loop

Loading diagram…

The loop closes when real failures become labeled examples, then those examples block future regressions.

What to test

System part	Example assertion
Retrieval	Expected evidence ids appear in top-k for representative queries.
Packing	Unauthorized, stale, duplicate, or irrelevant chunks are excluded.
Tool routing	Correct tool is selected only when the task requires it.
Tool args	Required ids, enums, scopes, and limits are valid.
Policy	Disallowed requests refuse before retrieval or side effects.
Final answer	Claims are supported by evidence or tool output.
Cost/latency	Common tasks stay under cost and latency thresholds.
Recovery	Timeouts, empty retrieval, and invalid JSON degrade safely.

Offline vs online vs human

Mode	Use for	Weakness	Good signal
Offline eval	Regression tests before release.	Dataset can become stale or too easy.	Pass/fail by task family and risk tier.
Online monitor	Drift, refusal spikes, cost, tool errors.	Usually detects symptoms after users are affected.	Rate shifts and threshold alerts.
Human review	Ambiguous quality, safety, judgment, high-risk actions.	Expensive and slower.	Calibrated labels and disagreement notes.
LLM judge	Scalable rubric scoring and triage.	Needs calibration against human labels.	Agreement with human anchor set.

Dataset design

A useful GenAI eval dataset is not just "100 random prompts." It should cover:

Category	Examples
Happy path	Common questions with clear evidence.
Hard retrieval	Synonyms, identifiers, stale docs, similar policies.
Negative cases	No evidence, forbidden data, unsupported requests.
Safety cases	Prompt injection, exfiltration, tool abuse, PII.
Edge cases	Multilingual input, tables, long docs, conflicting sources.
Regression cases	Real incidents and recent user complaints.

Keep a holdout set. Do not tune repeatedly on every eval example until the benchmark becomes a memorized worksheet.

Metrics by system type

System	Better metrics
RAG QA	top-k evidence recall, citation precision, unsupported-claim rate, answer helpfulness.
Extraction	exact match, field-level F1, schema validity, semantic validation pass rate.
Tool agent	correct tool route, valid args, side-effect success, step count, escalation correctness.
Support assistant	resolution rate, safe refusal rate, handoff quality, cost per resolved ticket.
High-risk workflow	human approval precision, blocked unsafe action rate, audit completeness.

Example release gate

Before shipping a prompt/model/retriever change:

Offline eval must not regress critical tasks.
Unsafe-action tests must pass.
Unsupported-claim rate must stay below threshold.
p95 latency and cost per success must stay within budget.
Canary traffic must not show refusal or escalation spikes.

If a metric fails, do not average it away with easy examples. Fix the failing slice or explicitly accept the risk.

Interview answer template

For "How would you evaluate this GenAI system?", answer:

Define task and risk level.
List components to evaluate: retrieval, packing, model output, tools, guardrails, final UX.
Create offline golden sets with positive, negative, safety, and edge cases.
Use targeted metrics, not only thumbs-up/down.
Calibrate judges with humans.
Monitor production for drift, cost, latency, refusals, tool errors.
Gate releases and feed incidents back into the dataset.

Interview questions

1. What is the difference between evaluating an LLM answer and an agent?

Agent evals inspect the trajectory: state transitions, retrieval, tool calls, approvals, guardrail decisions, and final answer.

Follow-up: Why is final-answer scoring insufficient?

The final answer may look good while the agent used the wrong tool, leaked data, skipped approval, or consumed too much budget.

2. What metrics catch hallucination?

Unsupported-claim rate, citation precision/recall, evidence coverage, contradiction checks, and human factuality labels.

Follow-up: Can citations be faked?

Yes. Evaluate whether cited evidence actually supports the claim, not only whether a citation exists.

3. How do you prevent eval leakage?

Keep holdout sets, version datasets with prompts/models/corpora, avoid repeated tuning on the same examples, and add fresh production failures.

4. What blocks release?

Regression on critical tasks, unsafe refusals/actions, cost blowup, latency SLO breach, reduced grounding, or trace redaction failure.

5. When is LLM-as-judge acceptable?

For scalable triage with a clear rubric and human-calibrated anchor set. It should not be the only gate for high-risk or highly subjective tasks.

Common bad answers

Bad answer	Why it is weak
"We test by trying a few prompts manually."	Demos do not catch regressions, safety failures, or edge cases.
"Use thumbs-up/down as the main metric."	Too coarse to diagnose retrieval, grounding, tools, refusal, latency, or cost.
"Use an LLM judge for everything."	Judges need calibration and are risky as the only gate for high-stakes tasks.

Self-check

You are ready if you can explain:

What belongs in a golden dataset.
Why evals need negative and safety cases.
How to evaluate a RAG answer differently from an agent trajectory.
Why cited answers still need support checks.
What blocks a release.

LangSmith for agents · RAG · Agentic production

On this page