LLM contracts, context & tools
An LLM is a probabilistic next-token model wrapped by an application contract. It receives tokens, attends to context, predicts useful continuations, and returns text or structured actions. The engineering job is to turn that flexible generator into a reliable system with instructions, retrieval, schemas, validators, tools, and evaluations.
Do not explain LLM systems as "the model knows the answer." Explain them as: model prior + supplied context + decoding constraints + application checks.
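A minimal sketch of that framing, assuming a hypothetical `call_model` chat client (any completion API fits the shape): instructions, retrieved context, decoding settings, and output checks are all owned by the application, not the model.

```python
import json

SYSTEM_CONTRACT = (
    "Answer only from the provided evidence. If the evidence is insufficient, say so. "
    'Return JSON: {"answer": str, "evidence_ids": [str]}'
)

def answer(question: str, evidence: list[dict], call_model) -> dict:
    """Compose the contract: instructions + evidence + user message, then validate."""
    context = "\n".join(f"[{e['id']}] {e['text']}" for e in evidence)
    messages = [
        {"role": "system", "content": SYSTEM_CONTRACT},
        {"role": "user", "content": f"Evidence:\n{context}\n\nQuestion: {question}"},
    ]
    # call_model is a stand-in for any chat-completion client; temperature is kept
    # low for consistency, which (as noted above) does not guarantee correctness.
    raw = call_model(messages, temperature=0.0)
    result = json.loads(raw)  # application check: output must be valid JSON
    if not isinstance(result.get("answer"), str) or not result.get("evidence_ids"):
        raise ValueError("Model output failed the application contract")
    return result  # shape checks passed; semantic checks are a separate layer
```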
Simple mental model
The model does not store your runtime database; it predicts from learned parameters plus the context you provide. If that context is missing, stale, poisoned, or too long to attend to well, the answer can be wrong even when the model is strong.
Core concepts
| Concept | Meaning | Production implication |
|---|---|---|
| Token | Unit the model reads/writes. It may be a word, word piece, punctuation, or bytes. | Cost, latency, and context limits are token-based. |
| Context window | Maximum tokens the request can carry. | You need ranking, truncation, memory compression, and source selection. |
| System/developer instructions | High-priority behavior contract. | Keep short, versioned, and testable. |
| User message | Current task/request. | Keep close to the final model call so intent is not buried. |
| Retrieved context | External facts injected at runtime. | Must be permission-filtered and cited. |
| Decoding | How next tokens are selected (see the sampling sketch after this table). | Low randomness helps consistency but does not guarantee correctness. |
| Structured output | Model response constrained to schema. | Makes integration safer, but values can still be logically wrong. |
| Tool call | Model asks the app to run a named function. | The app validates and executes; the model should not directly touch systems. |
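To make the Decoding row concrete, here is a minimal sketch of temperature scaling plus nucleus (top-p) sampling over raw logits. It is pure Python with no real model behind it; the toy logits are illustrative.

```python
import math
import random

def sample_next_token(logits: dict[str, float], temperature: float = 0.7, top_p: float = 0.9) -> str:
    """Pick a next token from {token: logit}. Lower temperature sharpens the
    distribution; top_p keeps only the most likely tokens whose cumulative
    probability reaches top_p (nucleus sampling)."""
    # Temperature scaling, then a numerically stable softmax.
    scaled = {t: l / max(temperature, 1e-6) for t, l in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(v - m) for t, v in scaled.items()}
    total = sum(exps.values())
    probs = sorted(((t, e / total) for t, e in exps.items()), key=lambda x: -x[1])

    # Keep the nucleus: highest-probability tokens up to cumulative mass top_p.
    nucleus, cum = [], 0.0
    for t, p in probs:
        nucleus.append((t, p))
        cum += p
        if cum >= top_p:
            break

    # Renormalize and sample within the nucleus.
    z = sum(p for _, p in nucleus)
    r, acc = random.random() * z, 0.0
    for t, p in nucleus:
        acc += p
        if acc >= r:
            return t
    return nucleus[-1][0]

# Low randomness helps consistency but does not guarantee correctness:
print(sample_next_token({"Paris": 5.0, "Lyon": 2.0, "banana": -3.0}, temperature=0.2))
```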
Context packing order
Rule: include the smallest context that can answer the task. More context can dilute attention, increase cost, leak data, and make prompt injection harder to reason about.
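A minimal packing sketch under that rule: filter by permission, rank candidate chunks, and admit them until a token budget is spent. Token counting here uses tiktoken as one option; any tokenizer that matches your model works, and the chunk fields are illustrative.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # pick the encoding that matches your model

def pack_context(chunks: list[dict], budget_tokens: int) -> list[dict]:
    """Admit the highest-ranked, permission-filtered chunks until the budget is spent.
    Each chunk: {"id", "text", "score", "allowed"} — "allowed" is set upstream by ACL checks."""
    packed, spent = [], 0
    for chunk in sorted(chunks, key=lambda c: -c["score"]):
        if not chunk["allowed"]:  # permission filtering is non-negotiable
            continue
        cost = len(enc.encode(chunk["text"]))
        if spent + cost > budget_tokens:  # smallest context that can answer the task
            continue
        packed.append(chunk)
        spent += cost
    return packed

chunks = [
    {"id": "doc-1", "text": "Refunds are issued within 14 days.", "score": 0.91, "allowed": True},
    {"id": "doc-2", "text": "Internal pricing memo.", "score": 0.88, "allowed": False},
]
print([c["id"] for c in pack_context(chunks, budget_tokens=200)])  # ['doc-1']
```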
Hallucination
Hallucination is unsupported output: the model states something not grounded in reliable context or tool results. It is not only a model issue; it is often an architecture issue.
| Cause | Example | Mitigation |
|---|---|---|
| Missing evidence | User asks about a private invoice; no retrieval ran. | Retrieve by tenant, cite evidence ids, say when data is unavailable. |
| Bad retrieval | Similar but wrong policy document ranked first. | Hybrid search, rerank, freshness checks, chunk lineage. |
| Over-broad prompt | "Answer confidently" encourages guessing. | Require uncertainty and evidence-backed claims. |
| Schema-only confidence | JSON is valid but value is wrong. | Validate business rules and cross-check with tools. |
| Tool observation mismatch | Tool returns partial data; model fills gaps. | Return explicit status, missing fields, and refusal/degrade path. |
Interview phrase: "Structured output prevents shape errors, not truth errors."
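One way to enforce the evidence-backed mitigations above is a post-generation check that rejects or degrades any answer whose citations do not resolve to retrieved evidence. A minimal sketch; the `[doc-123]` citation format and the ids are assumptions, not a standard.

```python
import re

def check_grounding(answer: str, known_evidence_ids: set[str]) -> tuple[bool, str]:
    """Pass only answers whose citations resolve to retrieved evidence.
    The citation format is whatever your prompt contract enforces."""
    cited = set(re.findall(r"\[(doc-\d+)\]", answer))
    if not cited:
        return False, "No citations: treat as unsupported; fall back to 'data unavailable'."
    unknown = cited - known_evidence_ids
    if unknown:
        return False, f"Cites evidence that was never retrieved: {sorted(unknown)}"
    return True, "All claims cite retrieved evidence."

ok, reason = check_grounding("Refunds take 14 days [doc-1].", known_evidence_ids={"doc-1", "doc-7"})
print(ok, reason)  # True All claims cite retrieved evidence.
```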
Tool calling lifecycle
The model requests a tool call. The application owns validation, authorization, execution, retries, idempotency, and audit.
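A minimal dispatch sketch of that lifecycle. The registry, the per-identity allow-list, and the audit logger are illustrative names, not a specific framework's API; retries and idempotency keys are omitted for brevity.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("tool-audit")

def get_invoice(user_id: str, invoice_id: str) -> dict:
    # A real implementation would query the billing system with the caller's identity.
    return {"invoice_id": invoice_id, "status": "paid"}

TOOLS = {"get_invoice": {"fn": get_invoice, "required_args": {"invoice_id"}}}
ALLOWED = {"user-42": {"get_invoice"}}  # per-identity authorization, enforced server-side

def dispatch(user_id: str, tool_call: dict) -> str:
    """Validate, authorize, execute, and audit one model-requested tool call.
    The model only names the tool; it never touches systems directly."""
    name, args = tool_call["name"], tool_call.get("arguments", {})
    spec = TOOLS.get(name)
    if spec is None:
        return json.dumps({"status": "error", "reason": f"unknown tool {name!r}"})
    if name not in ALLOWED.get(user_id, set()):
        return json.dumps({"status": "error", "reason": "not authorized"})
    missing = spec["required_args"] - args.keys()
    if missing:
        # Explicit status + missing fields, so the model cannot silently fill gaps.
        return json.dumps({"status": "error", "missing_fields": sorted(missing)})
    result = spec["fn"](user_id, **args)
    log.info("tool=%s user=%s args=%s", name, user_id, args)  # audit trail
    return json.dumps({"status": "ok", "result": result})

print(dispatch("user-42", {"name": "get_invoice", "arguments": {"invoice_id": "inv-9"}}))
```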
Interview questions
1. How does an LLM work at inference time?
- Text becomes tokens, tokens pass through the transformer, the model scores likely next tokens, decoding selects output tokens, and the application validates the result.
2. Why does a larger context window not solve RAG?
- You still need permission filtering, ranking, dedupe, freshness, and context packing. A bigger window can carry more noise and more injected instructions.
3. What is the difference between JSON mode and structured outputs?
- JSON mode targets valid JSON. Structured outputs target a supplied schema. Both still require semantic validation for business correctness; see the sketch after this list.
4. Why is tool calling safer than asking the model to produce SQL or shell commands?
- Typed tools expose narrow capabilities. The server can enforce identity, arguments, limits, and audit before touching real systems.
5. How do you reduce hallucination in production?
- Ground answers in retrieved evidence or tools, require citations, expose uncertainty, block unsupported claims where possible, and evaluate with negative cases.
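To illustrate answer 3, a sketch of the two layers: pydantic for the shape check (one option among several schema validators) and a plain business rule for semantics. The invoice fields are assumptions.

```python
from datetime import date
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    invoice_id: str
    amount_cents: int
    due: date

raw = '{"invoice_id": "inv-9", "amount_cents": -500, "due": "2024-03-01"}'

try:
    inv = Invoice.model_validate_json(raw)  # shape check: JSON that matches the schema
except ValidationError as err:
    raise SystemExit(f"Shape error: {err}")

# Schema passed, but the value is still logically wrong — semantic validation
# (business rules, cross-checks against tools) is a separate layer.
if inv.amount_cents <= 0:
    print("Truth error: schema-valid invoice with a non-positive amount; reject or re-ask.")
```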
Related
RAG: ingest -> retrieve -> pack · Structured outputs & guardrails · Agentic architecture workflow