LLM contracts, context & tools
An LLM is a probabilistic next-token model wrapped by an application contract. It receives tokens, attends to context, predicts likely continuations, and returns text or structured actions. The engineering job is to turn that flexible generator into a reliable system by adding instructions, retrieval, schemas, validators, tools, and evaluations.
Do not explain LLM systems as "the model knows the answer." Explain them as: model prior + supplied context + decoding constraints + application checks.
Simple mental model
Explanation: the model does not store your runtime database. It predicts from learned parameters plus the context you provide. If the context is missing, stale, poisoned, or too long to fit well, the answer can be wrong even when the model is strong.
Core concepts
| Concept | Meaning | Production implication |
|---|---|---|
| Token | Unit the model reads/writes. It may be a word, word piece, punctuation, or bytes. | Cost, latency, and context limits are token-based. |
| Context window | Maximum tokens the model can attend to in one request, typically prompt plus generated output. | You need ranking, truncation, memory compression, and source selection. |
| System/developer instructions | High-priority behavior contract. | Keep short, versioned, and testable. |
| User message | Current task/request. | Place it close to the end of the packed prompt so intent is not buried under context. |
| Retrieved context | External facts injected at runtime. | Must be permission-filtered and cited. |
| Decoding | How next tokens are selected. | Low randomness helps consistency but does not guarantee correctness. |
| Structured output | Model response constrained to schema. | Makes integration safer, but values can still be logically wrong. |
| Tool call | Model asks the app to run a named function. | The app validates and executes; the model should not directly touch systems. |
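The Decoding row can be made concrete with a small sketch of temperature sampling. The logits and the three-word vocabulary here are made up for illustration; a real model scores tens of thousands of tokens. The sketch shows that lowering temperature concentrates probability on the top-scoring token. That helps consistency, but if the top-scoring token is wrong, the model is simply wrong more consistently.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Sample one token index from the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical scores for three candidate next tokens.
vocab = ["Paris", "Lyon", "Berlin"]
logits = [2.0, 1.0, 0.5]

sharp = softmax_with_temperature(logits, 0.2)  # almost all mass on "Paris"
flat = softmax_with_temperature(logits, 2.0)   # mass spread across candidates
choice = vocab[sample_token(logits, 0.2, random.Random(0))]
```

Comparing `sharp[0]` with `flat[0]` shows how much the same scores can be sharpened or flattened without the model learning anything new.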
Context packing order
Rule: include the smallest context that can answer the task. More context can dilute attention, increase cost, leak data, and make prompt injection harder to reason about.
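One way to apply this rule is a greedy packer that keeps only the highest-scoring chunks that fit a token budget. The sketch below is illustrative only: `Chunk`, `estimate_tokens`, `pack_context`, and the `min_score` cutoff are hypothetical names, the relevance scores are assumed to come from a retriever or reranker, and the word-count token estimate is a crude stand-in for the model's real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    text: str
    score: float  # relevance score from a retriever/reranker (assumed)


def estimate_tokens(text):
    # Crude stand-in: one token per whitespace-separated word.
    # A real system should count with the model's own tokenizer.
    return len(text.split())


def pack_context(chunks, budget_tokens, min_score=0.0):
    """Greedily keep the highest-scoring chunks that fit the token budget."""
    packed, used = [], 0
    seen_texts = set()
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if chunk.score < min_score:
            break  # everything after this is even less relevant
        if chunk.text in seen_texts:
            continue  # drop exact duplicates
        cost = estimate_tokens(chunk.text)
        if used + cost > budget_tokens:
            continue  # does not fit; a smaller chunk later still might
        packed.append(chunk)
        seen_texts.add(chunk.text)
        used += cost
    return packed, used


chunks = [
    Chunk("policy-v2", "Refunds are allowed within 30 days of purchase", 0.92),
    Chunk("policy-copy", "Refunds are allowed within 30 days of purchase", 0.80),
    Chunk("faq", "Shipping takes three to five business days in most regions", 0.40),
    Chunk("blog", "We love our customers", 0.10),
]
packed, used = pack_context(chunks, budget_tokens=12, min_score=0.3)
```

The relevance cutoff matters as much as the budget: without it, the packer would fill leftover space with the low-scoring `blog` chunk just because it fits.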
Hallucination
Hallucination is unsupported output: the model states something not grounded in reliable context or tool results. It is not only a model issue; it is often an architecture issue.
| Cause | Example | Mitigation |
|---|---|---|
| Missing evidence | User asks about a private invoice; no retrieval ran. | Retrieve by tenant, cite evidence ids, say when data is unavailable. |
| Bad retrieval | Similar but wrong policy document ranked first. | Hybrid search, rerank, freshness checks, chunk lineage. |
| Over-broad prompt | "Answer confidently" encourages guessing. | Require uncertainty and evidence-backed claims. |
| Schema-only confidence | JSON is valid but value is wrong. | Validate business rules and cross-check with tools. |
| Tool observation mismatch | Tool returns partial data; model fills gaps. | Return explicit status, missing fields, and refusal/degrade path. |
Interview phrase: "Structured output prevents shape errors, not truth errors."
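A minimal sketch of that distinction, using only the standard library. The field names, the `ledger` dictionary, and both helper functions are hypothetical; the ledger stands in for whatever source-of-truth tool the application would query. The model output passes the shape check but fails the truth check.

```python
import json

# Expected shape of the model's answer (hypothetical schema).
REQUIRED_FIELDS = {"invoice_id": str, "amount_cents": int, "currency": str}


def check_shape(raw):
    """Schema-level check: valid JSON object with the right field types."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"field {name!r} missing or not {expected.__name__}")
    return data


def check_truth(data, ledger):
    """Semantic check: compare values against a source of truth."""
    record = ledger.get(data["invoice_id"])
    if record is None:
        return ["unknown invoice_id"]
    problems = []
    if data["amount_cents"] != record["amount_cents"]:
        problems.append("amount_cents does not match ledger")
    if data["currency"] != record["currency"]:
        problems.append("currency does not match ledger")
    return problems


# Well-formed output with an extra zero in the amount.
model_output = '{"invoice_id": "INV-7", "amount_cents": 125000, "currency": "USD"}'
ledger = {"INV-7": {"amount_cents": 12500, "currency": "USD"}}

data = check_shape(model_output)     # passes: the shape is fine
problems = check_truth(data, ledger)  # fails: the value is wrong
```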
Tool calling lifecycle
The model requests a tool call. The application owns validation, authorization, execution, retries, idempotency, and audit.
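The split of responsibilities can be sketched as a small runner that treats the model's tool call as an untrusted request. Everything here is hypothetical: `ToolCall`, `ToolRunner`, the `get_invoice_total` tool, and the role names. Argument validation is reduced to checking argument names, and retries are left out to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    call_id: str     # idempotency key that arrives with the request
    name: str
    arguments: dict  # arguments proposed by the model (untrusted)


def get_invoice_total(tenant_id, invoice_id):
    # Hypothetical read-only tool backed by an in-memory store.
    store = {("acme", "INV-7"): 12500}
    total = store.get((tenant_id, invoice_id))
    if total is None:
        return {"status": "not_found"}
    return {"status": "ok", "amount_cents": total}


TOOLS = {
    "get_invoice_total": {
        "fn": get_invoice_total,
        "required_args": {"invoice_id"},
        "allowed_roles": {"billing"},
    },
}


@dataclass
class ToolRunner:
    audit_log: list = field(default_factory=list)
    results: dict = field(default_factory=dict)  # (tenant_id, call_id) -> result

    def run(self, call, tenant_id, role):
        key = (tenant_id, call.call_id)
        if key in self.results:
            return self.results[key]  # idempotent replay: do not execute twice
        result = self._execute(call, tenant_id, role)
        self.results[key] = result
        self.audit_log.append((call.call_id, call.name, role, result["status"]))
        return result

    def _execute(self, call, tenant_id, role):
        spec = TOOLS.get(call.name)
        if spec is None:
            return {"status": "rejected", "reason": "unknown tool"}
        if set(call.arguments) != spec["required_args"]:
            return {"status": "rejected", "reason": "bad arguments"}
        if role not in spec["allowed_roles"]:
            return {"status": "rejected", "reason": "not authorized"}
        # tenant_id comes from the authenticated session, never from the model.
        return spec["fn"](tenant_id=tenant_id, **call.arguments)


runner = ToolRunner()
request = ToolCall("c1", "get_invoice_total", {"invoice_id": "INV-7"})
ok = runner.run(request, tenant_id="acme", role="billing")
denied = runner.run(ToolCall("c2", "get_invoice_total", {"invoice_id": "INV-7"}),
                    tenant_id="acme", role="support")
```

Because the tenant is injected by the application, a model that tries to pass its own `tenant_id` argument is rejected at the argument check and never reaches the tool.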
Interview questions
1. How does an LLM work at inference time?
- Text becomes tokens, the tokens pass through the transformer, and the model assigns probabilities to candidate next tokens. Decoding selects one token, appends it, and repeats until a stop condition is met; the application then validates the result.
Follow-up: Where do facts come from?
- From the model prior, supplied context, or tools. Private/current facts must come from retrieval or tools, not from assumed model memory.
2. Why does a larger context window not solve RAG?
- You still need permission filtering, ranking, dedupe, freshness, and context packing. A bigger window can carry more noise and more injected instructions.
3. What is the difference between JSON mode and structured outputs?
- JSON mode targets valid JSON. Structured outputs target a supplied schema. Both still require semantic validation for business correctness.
4. Why is tool calling safer than asking the model to produce SQL or shell commands?
- Typed tools expose narrow capabilities. The server can enforce identity, arguments, limits, and audit before touching real systems.
5. How do you reduce hallucination in production?
- Ground answers in retrieved evidence or tools, require citations, expose uncertainty, block unsupported claims where possible, and evaluate with negative cases.
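The "require citations" and "block unsupported claims" steps from the last answer can be sketched as a post-generation check. The bracketed citation format such as `[doc-12]`, the naive sentence splitting, and the `check_citations` helper are assumptions for illustration. The check only confirms that each sentence cites an id that was actually retrieved; it does not confirm that the cited text supports the claim, which needs a separate evaluation.

```python
import re

# Assumed citation format: an evidence id in square brackets, e.g. [doc-12].
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def check_citations(answer, evidence_ids):
    """Flag sentences without citations and citations to ids that were never retrieved."""
    # Naive sentence split on ., !, or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited = []
    unknown = set()
    for sentence in sentences:
        cited = set(CITATION.findall(sentence))
        if not cited:
            uncited.append(sentence)
        unknown |= cited - evidence_ids
    return {
        "uncited": uncited,
        "unknown_ids": sorted(unknown),
        "supported": not uncited and not unknown,
    }


retrieved = {"doc-12", "doc-40"}

# Negative case 1: the second sentence has no evidence behind it.
report = check_citations(
    "Refunds are allowed within 30 days [doc-12]. Store credit is issued instantly.",
    retrieved,
)

# Negative case 2: the citation points at an id that was never retrieved.
fabricated = check_citations("Refunds take 5 days [doc-99].", retrieved)
```

An application could refuse, regenerate, or strip the flagged sentences when `supported` is false; which path to take is a product decision.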
Interview answer template
For "Explain how an LLM app works end to end", answer:
- Tokenize input and provide instructions/context.
- Model predicts likely output or tool calls.
- Application validates schema, policy, permissions, and grounding.
- Tools/retrieval provide external facts or actions.
- Final response is checked, traced, and evaluated.
Strong phrase:
"The LLM is flexible generation inside an application contract; the contract makes it usable."
Common bad answers
| Bad answer | Why it is weak |
|---|---|
| "The model knows the answer." | It hides the difference between learned prior, supplied context, and source-of-truth tools. |
| "Use low temperature to stop hallucinations." | Lower randomness improves consistency, not factual grounding. |
| "JSON output means it is safe." | JSON shape can still contain wrong or unauthorized values. |
Self-check
You are ready if you can explain:
- How input text becomes tokens and how output tokens are generated.
- Why context packing order matters.
- Why tool calls are requests, not execution.
- Why schema validation and semantic validation are different.
- Why hallucination is often an architecture boundary failure, not only a model failure.
Related
RAG: ingest -> retrieve -> pack · Structured outputs & guardrails · Agentic architecture workflow