Generative AI for Engineers
Generative AI systems are not "a model call with a prompt." A production system is a regular software system wrapped around a probabilistic model: it controls context, retrieval, tools, schemas, safety, evaluation, latency, cost, and fallback behavior.
The practical mindset is:
Treat the model as useful but untrusted. Ground it with evidence, constrain it with contracts, validate its outputs, and measure the whole workflow.
How to study this section
Use this path if you are learning the topic for interviews or production design:
| Step | Page | What you should be able to explain |
|---|---|---|
| 1 | LLM contracts, context & tools | How tokens, context, hallucination, structured outputs, and tool calls fit together. |
| 2 | RAG: ingest -> retrieve -> pack | How external knowledge flows from documents into answers. |
| 3 | Structured outputs & guardrails | How to turn model output into safe application behavior. |
| 4 | Evaluations | How to know whether a prompt, model, retriever, or agent change is better. |
| 5 | Safety & injection | Why prompt injection is a privilege-boundary problem. |
| 6 | Cost & latency routing | How to keep GenAI systems economically and operationally realistic. |
| 7 | Agentic AI track | When to use agents, how to design stateful workflows, and how to ship them safely. |
Interview cadence companion: /dsa/interview-prep/generative-ai.
Core mental model
Read the diagram as a control system:
- The application owns identity, permissions, budgets, routing, retries, and user experience.
- The model proposes language, structured data, or tool calls.
- The retrieval/tool layers provide evidence and actions.
- The validation/eval layers decide whether the result is usable.
The seven interview pillars
1. LLM contract
An LLM predicts output from learned parameters plus the request context. It does not automatically know your private database, current policies, or tenant permissions.
Good interview answer:
"I separate the model prior from runtime context. The application supplies instructions, retrieved evidence, tool schemas, and validation rules. The model can propose an answer or tool call, but the app decides what is allowed."
Common mistake: saying "the model knows" instead of explaining context, grounding, and validation.
2. Context and token budget
Context is expensive and limited. More context can increase cost, latency, and privacy risk, and irrelevant text can distract the model from the evidence that matters.
Useful packing order:
- Stable system/developer policy.
- Task-relevant memory.
- Permission-filtered retrieved evidence.
- Few-shot examples only when they change behavior.
- Current user request near the end.
Interview follow-up to expect: "Why not use a larger context window?" Answer: because ranking, ACL filtering, freshness, dedupe, and injection risk still matter.
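A minimal sketch of budgeted packing, assuming segments arrive already sorted in the packing order listed above. The `Segment` type, the `required` flag, and the four-characters-per-token estimate are illustrative assumptions, not a real tokenizer or library API.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    label: str
    text: str
    required: bool  # required segments (policy, user request) are never dropped


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    # A real system would use the model provider's tokenizer.
    return max(1, len(text) // 4)


def pack_context(segments: list[Segment], budget: int) -> list[Segment]:
    """Keep every required segment, then admit optional ones in order while they fit."""
    required_cost = sum(estimate_tokens(s.text) for s in segments if s.required)
    if required_cost > budget:
        raise ValueError("required segments exceed the token budget")

    remaining = budget - required_cost
    kept = []
    for seg in segments:
        if seg.required:
            kept.append(seg)
            continue
        cost = estimate_tokens(seg.text)
        if cost <= remaining:  # optional segment fits, so keep it
            kept.append(seg)
            remaining -= cost
    return kept


segments = [
    Segment("policy", "p" * 40, required=True),
    Segment("memory", "m" * 40, required=False),
    Segment("evidence", "e" * 80, required=False),
    Segment("few_shot", "f" * 80, required=False),
    Segment("user_request", "u" * 40, required=True),
]
packed = pack_context(segments, budget=50)
print([s.label for s in packed])
```

With this budget the few-shot examples are the segment that gets dropped, which matches the advice to include them only when they change behavior.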
3. RAG
RAG is an evidence pipeline, not just a vector database.
The full path is:
ingest -> parse -> chunk -> label -> embed/index -> retrieve -> rerank -> pack -> answer -> evaluate
Strong answer points:
- Attach tenant, ACL, source version, and freshness metadata before indexing.
- Use hybrid retrieval when exact identifiers matter.
- Preserve chunk lineage so wrong answers can be debugged.
- Refuse or degrade when no evidence is found.
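The points above can be sketched as a tiny in-memory retriever. Everything here is an assumption for illustration: the `Chunk` fields, the keyword-overlap score (a stand-in for real hybrid retrieval and reranking), and the `min_score` threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    source_id: str          # lineage: which document the chunk came from
    source_version: str     # lineage: which version of that document
    tenant: str
    allowed_groups: frozenset
    text: str


def keyword_score(query: str, text: str) -> float:
    # Fraction of query terms that appear in the chunk text.
    terms = set(query.lower().split())
    words = set(text.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0


def retrieve(query, chunks, tenant, user_groups, top_k=3, min_score=0.5):
    # Permission filter runs BEFORE scoring, so hidden chunks never reach the prompt.
    visible = [c for c in chunks if c.tenant == tenant and c.allowed_groups & user_groups]
    scored = [(keyword_score(query, c.text), c) for c in visible]
    hits = [pair for pair in scored if pair[0] >= min_score]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in hits[:top_k]]


def answer_or_refuse(query, chunks, tenant, user_groups):
    evidence = retrieve(query, chunks, tenant, user_groups)
    if not evidence:
        return {"status": "refused", "reason": "no permitted evidence found", "citations": []}
    citations = [f"{c.source_id}@{c.source_version}#{c.chunk_id}" for c in evidence]
    return {"status": "grounded", "citations": citations}


corpus = [
    Chunk("c1", "refund-policy", "v3", "acme", frozenset({"support"}), "refund window is 30 days"),
    Chunk("c2", "refund-policy", "v3", "globex", frozenset({"support"}), "refund window is 90 days"),
]
print(answer_or_refuse("refund window", corpus, "acme", {"support"}))
print(answer_or_refuse("refund window", corpus, "acme", {"sales"}))
```

The citation string carries source, version, and chunk id, so a wrong answer can be traced back to the exact chunk that produced it.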
4. Structured outputs and tools
Structured output reduces shape errors. It does not guarantee truth.
Use:
- Structured outputs when the model must return typed data.
- Tool calling when the system needs external data or side effects.
- Guardrails when policy must block, transform, or escalate behavior.
Rule for tools:
The model requests a tool call. The server validates identity, schema, quotas, idempotency, and authorization before executing anything.
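A minimal sketch of that rule, assuming a hypothetical `ToolGateway` that sits between the model and the real handlers. The class name, the in-memory dictionaries, and the rejection reasons are illustrative; a production system would back them with real identity, quota, and storage services.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolGateway:
    schemas: dict   # tool name -> set of required argument names
    grants: dict    # user id -> set of tool names that user may invoke
    max_calls_per_user: int = 5
    _call_counts: dict = field(default_factory=dict)
    _results_by_key: dict = field(default_factory=dict)

    def execute(self, user_id: str, tool: str, args: dict,
                idempotency_key: str, handler: Callable) -> dict:
        # 1. Only registered tools exist as far as the model is concerned.
        if tool not in self.schemas:
            return {"status": "rejected", "reason": "unknown tool"}
        # 2. Schema check: exactly the expected argument names.
        if set(args) != self.schemas[tool]:
            return {"status": "rejected", "reason": "schema mismatch"}
        # 3. Authorization is decided by the server, never by the prompt.
        if tool not in self.grants.get(user_id, set()):
            return {"status": "rejected", "reason": "not authorized"}
        # 4. Idempotency: a replayed request returns the stored result.
        key = (user_id, tool, idempotency_key)
        if key in self._results_by_key:
            return self._results_by_key[key]
        # 5. Quota check before any side effect happens.
        if self._call_counts.get(user_id, 0) >= self.max_calls_per_user:
            return {"status": "rejected", "reason": "quota exceeded"}
        self._call_counts[user_id] = self._call_counts.get(user_id, 0) + 1
        result = {"status": "ok", "result": handler(**args)}
        self._results_by_key[key] = result
        return result
```

The handler runs only after every check passes, so a model-proposed call that fails validation has no side effect at all.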
5. Evaluation
Demos do not prove reliability. You need:
| Layer | Purpose |
|---|---|
| Offline evals | Catch regressions before release. |
| Online telemetry | Detect drift, latency, cost, refusal, and tool errors. |
| Human review | Calibrate ambiguous quality and high-risk misses. |
For agents, evaluate the whole trajectory: retrieval, state transitions, tool calls, approvals, guardrails, and final answer.
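As a rough illustration, an offline eval can score both the final answer and the tool calls along the way, then gate a release on the result. The dataset shape, the `run_workflow` callable, and the exact-match scoring below are assumptions; real evals usually need richer graders.

```python
def case_passes(output: dict, case: dict) -> bool:
    # A case passes only if the answer AND the tool trajectory both match.
    return (
        output["answer"] == case["expected_answer"]
        and output["tools"] == case["expected_tools"]
    )


def pass_rate(run_workflow, dataset) -> float:
    if not dataset:
        raise ValueError("empty eval dataset")
    passed = sum(1 for case in dataset if case_passes(run_workflow(case["input"]), case))
    return passed / len(dataset)


def release_gate(baseline_rate: float, candidate_rate: float,
                 max_regression: float = 0.0) -> bool:
    # Block the release if the candidate regresses by more than the allowance.
    return candidate_rate >= baseline_rate - max_regression


dataset = [
    {"input": "What is the refund window?", "expected_answer": "30 days", "expected_tools": []},
    {"input": "Refund order o-9", "expected_answer": "ticket created",
     "expected_tools": ["create_refund_ticket"]},
]


def baseline(question):
    # Toy stand-in for the current production workflow.
    if question.startswith("Refund"):
        return {"answer": "ticket created", "tools": ["create_refund_ticket"]}
    return {"answer": "30 days", "tools": []}


def candidate(question):
    # Toy stand-in for a changed prompt that calls the refund tool too eagerly.
    if question.startswith("Refund"):
        return {"answer": "ticket created", "tools": ["create_refund_ticket"]}
    return {"answer": "30 days", "tools": ["create_refund_ticket"]}


print(release_gate(pass_rate(baseline, dataset), pass_rate(candidate, dataset)))
```

The candidate gives the right final answers in both cases, yet fails the gate because one trajectory includes a side-effecting tool call that should not have happened.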
6. Safety
Prompt injection is untrusted input trying to cross a privilege boundary.
Important distinction:
- Direct injection: user says "ignore instructions."
- Indirect injection: a PDF, email, webpage, ticket, or tool output contains hidden instructions.
- Tool abuse: injected content tries to cause an action, not just a wrong answer.
Prompt-only defense is insufficient. Use narrow tools, server-side authorization, output checks, trace review, and human approval for high-risk actions.
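One way to picture a defense that does not depend on the prompt is a server-side policy function keyed on tool risk and content provenance. The tool tiers, the `add_internal_note` and `search_docs` tool names, and the rule that untrusted context escalates write tools to approval are all hypothetical choices for this sketch.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


# Hypothetical risk tiers; a real system would load these from reviewed config.
READ_ONLY_TOOLS = frozenset({"search_docs"})
LOW_RISK_WRITE_TOOLS = frozenset({"add_internal_note"})
HIGH_RISK_TOOLS = frozenset({"create_refund_ticket"})


def decide(tool: str, context_has_untrusted_content: bool) -> Decision:
    """Decide on a model-proposed tool call without consulting the prompt text."""
    if tool in READ_ONLY_TOOLS:
        return Decision.ALLOW
    if tool in HIGH_RISK_TOOLS:
        # High-risk side effects always go through a human.
        return Decision.NEEDS_APPROVAL
    if tool in LOW_RISK_WRITE_TOOLS:
        # Writes are escalated when the context includes documents, emails,
        # web pages, or tool output that could carry injected instructions.
        if context_has_untrusted_content:
            return Decision.NEEDS_APPROVAL
        return Decision.ALLOW
    # Anything not explicitly listed is denied by default.
    return Decision.DENY


print(decide("add_internal_note", context_has_untrusted_content=True))
```

Because the decision ignores what the prompt or retrieved text says, an injected instruction can change what the model proposes but not what the server permits.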
7. Cost, latency, and rollout
Strong designs address economics early. Cost and latency come from model tokens, retrieval, reranking, tool IO, retries, tracing, and agent loops.
Levers:
- Route simple tasks to cheaper paths.
- Trim context without dropping required evidence.
- Cache with tenant-scoped keys so one tenant's answer is never served to another.
- Parallelize independent retrieval/tool calls.
- Stream for perceived latency.
- Cap steps, retries, and repair loops.
- Roll out with canaries, eval gates, and rollback flags.
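Two of these levers, routing and tenant-scoped caching, are small enough to sketch directly. The route names, the word-count heuristic, and the cache key layout are assumptions for illustration, not a recommended production policy.

```python
import hashlib


def cache_key(tenant_id: str, prompt_version: str, query: str) -> str:
    # Tenant and prompt version are part of the key, so a cached answer is
    # never shared across tenants or across prompt releases.
    normalized = " ".join(query.lower().split())
    raw = f"{tenant_id}\x1f{prompt_version}\x1f{normalized}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def choose_route(query: str, needs_tools: bool, max_simple_words: int = 12) -> str:
    # Hypothetical heuristic: tool use goes to the agent path, short questions
    # go to a cheaper model, and everything else goes to the larger model.
    if needs_tools:
        return "agent"
    if len(query.split()) <= max_simple_words:
        return "small_model"
    return "large_model"


print(choose_route("What is the refund window?", needs_tools=False))
print(cache_key("acme", "v7", "Refund  window?") == cache_key("acme", "v7", "refund window?"))
print(cache_key("acme", "v7", "refund window?") == cache_key("globex", "v7", "refund window?"))
```

In practice the routing signal would more likely come from a classifier or from eval data than from word count; the point is that the application, not the model, picks the path.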
Agentic AI track
Study agents after the basics above. An agent is a stateful workflow where the model may choose the next action under application control.
Recommended order:
| Page | Use it for |
|---|---|
| Agentic AI overview | Track map and high-level lifecycle. |
| Agentic fundamentals | Agent, tool, state, observation, transition, termination. |
| Agentic architecture workflow | End-to-end production architecture. |
| Agent memory, state & storage | Checkpoints, memory, source-of-truth separation. |
| LangChain | Primitives and tool schemas. |
| LangGraph | Explicit graphs, routers, checkpoints, interrupts. |
| LangSmith | Tracing, datasets, evals, release gates. |
| Agentic production | Shipping, operations, security, and incident response. |
Interview sound bite:
"Agents are state machines with stochastic transitions. I define legal states, allowed tools, stop conditions, budgets, and observability before I add autonomy."
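A minimal sketch of that idea, assuming a hypothetical four-state agent. The state names and the `choose_next` callable (standing in for the model's proposal) are illustrative and do not reflect any particular framework's API.

```python
# Legal transitions, owned by the application. "answer" is terminal.
TRANSITIONS = {
    "plan": {"retrieve", "call_tool", "answer"},
    "retrieve": {"plan", "answer"},
    "call_tool": {"plan", "answer"},
    "answer": set(),
}


def run_agent(choose_next, max_steps: int = 5) -> dict:
    """Run until the agent answers, proposes an illegal move, or spends its budget."""
    state = "plan"
    trace = [state]  # every visited state is recorded for observability
    for _ in range(max_steps):
        if state == "answer":
            return {"status": "done", "trace": trace}
        proposed = choose_next(state, trace)  # the model proposes...
        if proposed not in TRANSITIONS[state]:  # ...the application decides
            return {"status": "illegal_transition", "trace": trace, "proposed": proposed}
        state = proposed
        trace.append(state)
    if state == "answer":
        return {"status": "done", "trace": trace}
    return {"status": "budget_exhausted", "trace": trace}


script = iter(["retrieve", "plan", "answer"])
print(run_agent(lambda state, trace: next(script)))
```

A chooser that bounces between `plan` and `retrieve` forever ends with `budget_exhausted` instead of an unbounded loop, which is the stop condition doing its job.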
Interview answer structure
When asked to design a GenAI system, answer in this order:
- Clarify task and risk: QA, summarization, extraction, routing, support, code, legal, finance, medical, internal tooling.
- Define grounding contract: internal corpus, web, tools, citations, refusal behavior.
- Sketch request path: ingress -> auth -> retrieval/tools -> model -> validation -> response.
- Explain data lifecycle: ingest, chunking, metadata, ACLs, freshness, deletion.
- Add safety: prompt injection, tool scope, PII, audit, human review.
- Add evals: offline dataset, online monitors, human calibration.
- Add cost/latency: routing, caching, parallelism, streaming, budgets.
- Add rollout: canary, shadow eval, prompt/model versioning, rollback.
Example interview prompt
Prompt: Design a customer-support assistant that answers from company docs and can create refund tickets.
Strong answer outline:
| Area | Good answer |
|---|---|
| Risk | Refund creation is a side effect, so answers and actions need different controls. |
| RAG | Ingest help docs, policies, and tickets; attach tenant/product/version metadata; use hybrid retrieval; cite evidence. |
| Tooling | Expose create_refund_ticket, not arbitrary API access. Validate user, order id, refund policy, and idempotency key. |
| Safety | Treat uploaded screenshots and retrieved docs as untrusted. Block instructions inside documents from changing tool policy. |
| Eval | Test top-k evidence, citation support, correct refusal, tool args, and refund policy edge cases. |
| Ops | Track p95 latency, retrieval empty rate, tool error rate, cost per resolved ticket, refund escalation rate. |
Common bad answer:
"Put all support docs in a vector DB and ask the LLM to answer and call APIs."
Why it is weak: it skips ACLs, freshness, hybrid retrieval, citations, tool authorization, idempotency, evals, and escalation.
Debugging map
| Symptom | First place to inspect | Likely fix |
|---|---|---|
| Confident wrong answer | Retrieved chunks and packed prompt | Improve retrieval, citation rules, refusal threshold. |
| Correct source, wrong field | Structured output and semantic validation | Add business validation or tool cross-check. |
| Slow p95 | Reranker, provider latency, serial IO | Parallelize independent work, route, cache, trim context. |
| Cost spike | Agent loops, retries, long context | Add step budgets, cap repairs, route cheaper tasks. |
| Data leak | ACL filters, trace payloads, answer cache | Disable path, audit access, tighten tenant-scoped keys. |
| High refusal rate | Policy classifier or prompt version | Compare traces, then roll back or recalibrate. |
Memory hooks
- Evidence first, generation second.
- Schema checks shape; tools and evidence check truth.
- Prompt safety helps; server-side policy enforces.
- Evaluate the workflow, not only the final sentence.
- Every agent needs stop conditions, budgets, and traces.
Self-check
You are ready to move through the topic pages if you can answer:
- What does the application control that the model should not control?
- What is the difference between model prior, retrieved context, and tool output?
- Why does a larger context window not remove the need for RAG?
- Which failure is worse: a wrong answer or a wrong side effect?
- Which metrics prove quality, safety, latency, and cost are acceptable?
Related
- /dsa/interview-prep/generative-ai for timed interview drills.
- /security for broader threat modeling.
- /performance for latency, queues, and saturation thinking.
- GenAI mock interview drill for an integrated practice prompt.