Evaluations for LLM and agent systems
You cannot make a GenAI system reliable with demos alone. You need offline evaluation datasets, online telemetry, and human calibration. For agents, evaluate the whole trajectory, not just the output: retrieval, state transitions, tool calls, guardrails, and the final answer.
Evaluation loop
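A minimal sketch of the offline half of the loop: run each dataset case against the system, apply its checks, and report a pass rate. Everything named here (`run_agent`, the `EvalCase` fields, the trace shape) is an illustrative assumption, not a specific framework's API.

```python
# Minimal offline eval loop: run each case, apply its checks, report a pass rate.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    checks: list[Callable[[dict], bool]] = field(default_factory=list)


def run_agent(prompt: str) -> dict:
    # Placeholder for the real system under test: returns a trajectory-like dict.
    return {"answer": "", "retrieved_ids": [], "tool_calls": []}


def run_suite(cases: list[EvalCase]) -> float:
    passed = 0
    for case in cases:
        result = run_agent(case.prompt)
        if all(check(result) for check in case.checks):
            passed += 1
        else:
            print(f"FAIL: {case.name}")
    return passed / len(cases)


cases = [
    EvalCase(
        name="refund-policy-lookup",
        prompt="What is the refund window for annual plans?",
        checks=[lambda r: "doc-42" in r["retrieved_ids"]],  # expected evidence id (illustrative)
    ),
]
print(f"pass rate: {run_suite(cases):.0%}")
```

Run the same suite before every release and track the pass rate over time; a drop on previously passing cases is a regression signal, independent of any new features.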
What to test
| System part | Example assertion |
|---|---|
| Retrieval | Expected evidence ids appear in top-k. |
| Packing | Unauthorized/stale chunks are excluded. |
| Tool routing | Correct tool is selected for the task. |
| Tool args | Required ids, enums, and limits are valid. |
| Policy | Disallowed requests are refused before any side effect executes. |
| Final answer | Claims are supported by evidence or tool output. |
| Cost/latency | Common tasks stay under thresholds. |
| Recovery | Timeouts, empty retrieval, and invalid JSON degrade safely. |
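A few of the assertions from the table above, sketched as code over one recorded trajectory. The trace field names (`retrieved_ids`, `tool_calls`, `answer`) and the limits are assumptions about your own trace schema, not a standard format.

```python
# Illustrative assertions over one recorded agent trajectory.
trace = {
    "retrieved_ids": ["doc-42", "doc-7", "doc-13"],
    "tool_calls": [{"name": "create_refund", "args": {"order_id": "A-1001", "amount": 40.0}}],
    "answer": "Refunds are available within 30 days [doc-42].",
}

# Retrieval: expected evidence id appears in the top-k results.
assert "doc-42" in trace["retrieved_ids"][:5]

# Tool routing: the correct tool was selected.
assert trace["tool_calls"][0]["name"] == "create_refund"

# Tool args: required ids are present and limits hold.
args = trace["tool_calls"][0]["args"]
assert args["order_id"].startswith("A-") and 0 < args["amount"] <= 500

# Final answer: the cited evidence id was actually retrieved.
assert "[doc-42]" in trace["answer"] and "doc-42" in trace["retrieved_ids"]
```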
Offline vs online vs human vs LLM judge
| Mode | Use for | Weakness |
|---|---|---|
| Offline eval | Regression tests before release. | Dataset can become stale or too easy. |
| Online monitor | Drift, refusal spikes, cost, tool errors. | Usually detects symptoms after users are affected. |
| Human review | Ambiguous quality, safety, judgment, high-risk actions. | Expensive and slower. |
| LLM judge | Scalable rubric scoring and triage. | Needs calibration against human labels. |
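The last row is where calibration matters: only trust a judge for gating once it agrees with humans on a recent labeled sample. A hedged sketch, with made-up labels, of measuring raw agreement and chance-corrected agreement (Cohen's kappa):

```python
# Calibrating an LLM judge: compare judge pass/fail verdicts against human labels.
from collections import Counter

human = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]   # 1 = acceptable, 0 = not (illustrative)
judge = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]   # LLM judge verdicts on the same items

agreement = sum(h == j for h, j in zip(human, judge)) / len(human)

# Cohen's kappa corrects raw agreement for chance agreement.
p_h, p_j, n = Counter(human), Counter(judge), len(human)
expected = sum((p_h[k] / n) * (p_j[k] / n) for k in (0, 1))
kappa = (agreement - expected) / (1 - expected)

print(f"raw agreement: {agreement:.2f}, kappa: {kappa:.2f}")
```

Re-check the calibration whenever the judge prompt, judge model, or traffic mix changes.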
Interview questions
1. What is the difference between evaluating an LLM answer and an agent?
- Answer evals score a single output against a reference or rubric; agent evals inspect the whole trajectory: retrieval, state transitions, tool calls, approvals, and the final answer.
2. What metrics catch hallucination?
- Unsupported-claim rate, citation precision/recall, evidence coverage, and human factuality labels (see the sketch after this list).
3. How do you prevent eval leakage?
- Keep holdout sets, avoid tuning repeatedly on the same examples, and version datasets with prompts/models.
4. What blocks release?
- Regression on critical tasks, unsafe refusals/actions, cost blowup, latency SLO breach, or reduced grounding.
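A sketch of the hallucination metrics from question 2, assuming claims have already been extracted and labeled as supported or not (by humans or a calibrated judge); the field names and gold evidence ids are illustrative.

```python
# Unsupported-claim rate and citation precision/recall for one answer.
claims = [
    {"text": "Refunds are available within 30 days.", "cited": ["doc-42"], "supported": True},
    {"text": "Refunds are processed in 24 hours.", "cited": [], "supported": False},
]
gold_evidence = {"doc-42", "doc-7"}  # evidence ids a correct answer should cite

unsupported_rate = sum(not c["supported"] for c in claims) / len(claims)

cited = {doc for c in claims for doc in c["cited"]}
citation_precision = len(cited & gold_evidence) / len(cited) if cited else 0.0
citation_recall = len(cited & gold_evidence) / len(gold_evidence)

print(f"unsupported-claim rate: {unsupported_rate:.0%}")
print(f"citation precision: {citation_precision:.0%}, recall: {citation_recall:.0%}")
```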