RAG ingest, retrieve & pack
RAG gives an LLM external knowledge at request time. Instead of hoping the model memorized facts, your system retrieves relevant, authorized evidence and packs it into the prompt.
RAG is not "add a vector database." It is an end-to-end pipeline: ingest -> chunk -> embed -> index -> retrieve -> rerank -> pack -> answer -> evaluate.
The pipeline splits into three responsibilities: the ingest plane creates trusted chunks, the index plane keeps complementary retrieval signals, and the serving plane filters, reranks, and packs only safe evidence for the model.

Full RAG workflow
Explanation: vector search finds semantic neighbors, keyword search catches exact identifiers, object storage keeps original content, and metadata enforces access and freshness.
What each storage layer stores
| Storage | Stores | Why it exists |
|---|---|---|
| Object/blob storage | Original files, parsed text, OCR output, snapshots. | Reproducibility, re-indexing, audit, source download. |
| Relational/document DB | Document metadata, tenants, ACLs, chunk lineage, ingest status. | Filtering, governance, delete/update workflows. |
| Vector DB / ANN index | Embedding vectors plus metadata pointers. | Semantic similarity search at scale. |
| Keyword index | Tokens, terms, exact ids, BM25-style inverted index. | Names, SKUs, error codes, proper nouns. |
| Cache | Recent retrievals, embeddings, rerank results. | Latency/cost reduction with tenant-safe keys. |
| Trace/eval store | Queries, retrieved ids, answer outcome, ratings. | Debugging and regression testing. |
Do not store secrets in vectors or traces. Embeddings are not a privacy boundary.
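The "tenant-safe keys" requirement for the cache row can be sketched as follows. This is a minimal illustration, not a prescribed design: `cache_key` and its fields (`tenant_id`, `acl_groups`, `index_version`) are hypothetical names, and it assumes that a user's group membership decides which results they may see.

```python
import hashlib
import json


def cache_key(tenant_id: str, acl_groups: list[str], query: str, index_version: str) -> str:
    """Build a retrieval-cache key that cannot collide across tenants or ACL scopes.

    Hypothetical helper: the field names are assumptions for illustration.
    """
    payload = {
        "tenant": tenant_id,
        # Sort and dedupe groups so the same permission set always yields the same key.
        "groups": sorted(set(acl_groups)),
        # Normalise whitespace and case so trivially different queries share an entry.
        "query": " ".join(query.lower().split()),
        # Changing this value (new index or embedding model) invalidates old entries.
        "index_version": index_version,
    }
    digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode("utf-8")).hexdigest()
    return f"rag:{tenant_id}:{digest}"
```

Because the tenant and the permission set are part of the hashed payload, a cached result for one tenant or group set is never served to another, even when the query text is identical.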
Chunking choices
| Content | Good chunk strategy | Failure mode |
|---|---|---|
| Policies/prose | Paragraph or section chunks with heading path. | Losing the exception clause in a different chunk. |
| Tables | Preserve rows, headers, units, and surrounding explanation. | Numeric answers hallucinate because headers were removed. |
| Code | Function/class-level chunks with repo path and symbol. | Splitting imports from function body. |
| Tickets/chats | Thread-aware chunks with time and resolution status. | Retrieving complaint but not final fix. |
| PDF/OCR | Clean layout noise, keep page numbers and confidence. | Navigation/footer text dominates embeddings. |
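As one illustration of the prose row above, the sketch below splits a document into section chunks and keeps the heading path with each chunk. It assumes the parsed text uses Markdown-style `#` headings and that no body line starts with `#`; `Chunk` and `chunk_by_heading` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    heading_path: list[str]  # e.g. ["Travel policy", "Exceptions"]
    text: str


def chunk_by_heading(markdown_text: str) -> list[Chunk]:
    """Split prose into one chunk per section, keeping the enclosing headings."""
    chunks: list[Chunk] = []
    path: list[str] = []    # current stack of headings, outermost first
    buffer: list[str] = []  # body lines collected for the current section

    def flush() -> None:
        body = "\n".join(buffer).strip()
        if body:
            chunks.append(Chunk(heading_path=list(path), text=body))
        buffer.clear()

    for line in markdown_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("#"):
            flush()
            level = len(stripped) - len(stripped.lstrip("#"))
            title = stripped[level:].strip()
            # Drop headings at the same or a deeper level, then push the new one.
            del path[level - 1:]
            path.append(title)
        else:
            buffer.append(line)
    flush()
    return chunks
```

Keeping the heading path means a chunk that holds only an exception clause still records which policy it belongs to, which can be prepended to the text before embedding.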
Vector database basics
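Vector search at scale is usually approximate nearest-neighbor (ANN) search: it trades a small amount of recall for speed. It is good at finding semantically similar text, but similarity does not guarantee factual correctness. Always combine it with filters, reranking, citations, and evaluation.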
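To make the idea concrete, here is a minimal sketch of filtered similarity search. It uses an exact brute-force scan with cosine similarity as a stand-in for an ANN index, and the record layout (`id`, `tenant`, `vector`) and the `search` function are assumptions for illustration, not the API of any particular vector database.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec: list[float], records: list[dict], tenant_id: str, k: int = 3) -> list[str]:
    """Exact top-k by cosine similarity, restricted to one tenant's records."""
    # Apply the metadata filter first so other tenants' vectors are never scored.
    allowed = [r for r in records if r["tenant"] == tenant_id]
    scored = sorted(allowed, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return [r["id"] for r in scored[:k]]
```

A production index would replace the linear scan with an approximate structure, but the filter-then-score shape stays the same.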
Context packing
Pack evidence in a deterministic order:
- Drop chunks the user cannot access.
- Drop stale or superseded versions.
- Merge duplicate/overlapping chunks.
- Prefer chunks with exact identifiers when the query contains identifiers.
- Keep source ids and short titles with each chunk.
- Reserve token budget for the user's question and final answer.
- Instruct the model to make factual claims only when the packed evidence supports them.
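The packing steps above can be sketched roughly as follows. Everything here is an assumption for illustration: the `Candidate` fields, the one-token-per-word estimate in `estimate_tokens`, treating query terms that contain a digit as identifiers, and merging only exact duplicates rather than overlapping chunks.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    source_id: str
    title: str
    text: str
    score: float
    allowed: bool     # result of the ACL check for this user
    superseded: bool  # True if a newer version of the source exists


def estimate_tokens(text: str) -> int:
    # Crude assumption: about one token per whitespace-separated word.
    return len(text.split())


def pack_context(candidates, query, context_budget, reserved_for_answer):
    """Return evidence blocks in a deterministic order that fit the token budget."""
    # Reserve room for the final answer and the user's question.
    budget = context_budget - reserved_for_answer - estimate_tokens(query)
    query_terms = set(query.lower().split())

    # Drop chunks the user cannot access and stale or superseded versions.
    usable = [c for c in candidates if c.allowed and not c.superseded]

    # Deduplicate on normalised text, keeping the best copy (ties broken by id).
    best = {}
    for c in usable:
        key = " ".join(c.text.lower().split())
        if key not in best or (-c.score, c.source_id) < (-best[key].score, best[key].source_id):
            best[key] = c

    def has_exact_id(c):
        # Treat query terms containing a digit as identifiers (assumption).
        ids = [t for t in query_terms if any(ch.isdigit() for ch in t)]
        return any(t in c.text.lower() for t in ids)

    # Identifier matches first, then by score, then by id for a stable order.
    ordered = sorted(best.values(), key=lambda c: (not has_exact_id(c), -c.score, c.source_id))

    packed, used = [], 0
    for c in ordered:
        # Keep the source id and short title with each chunk.
        block = f"[{c.source_id}] {c.title}\n{c.text}"
        cost = estimate_tokens(block)
        if used + cost > budget:
            continue  # skip chunks that do not fit; smaller ones may still fit
        packed.append(block)
        used += cost
    return packed
```

The instruction to answer only from evidence belongs in the prompt that wraps these blocks, so it is not part of this function.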
RAG failure states
Interview questions
1. Why use hybrid search instead of only vectors?
- Vectors are strong for semantics. Keyword search is stronger for exact ids, product names, codes, and rare terms. Fusing the two result lists usually gives better recall than either one alone.
Follow-up: When can vectors fail badly?
- Proper nouns, SKUs, error codes, table values, short acronyms, and near-duplicate policies.
2. Where do ACL checks happen?
- Ideally as a filter during retrieval, before ranking; at the latest before packing. Never rely on the model to ignore unauthorized chunks.
3. How do you debug a hallucinated RAG answer?
- Inspect retrieved ids, ranking scores, source versions, packed context, final prompt, and whether the answer contained unsupported claims.
4. What happens when retrieval returns nothing?
- The assistant should say it lacks evidence, ask a clarifying question, or route to another source. It should not invent.
5. How do you handle document updates?
- Version documents, checksum content at ingest so changes are detected, mark old chunks superseded, dual-write to the old and new indexes during embedding migrations, and rerun regression queries before switching over.
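For question 1, one common way to fuse vector and keyword results is reciprocal rank fusion (RRF). The sketch below is one possible fusion method, not the only one; the function name is hypothetical, and `k = 60` is a frequently used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked id lists; each list contributes 1 / (k + rank) per id."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["a", "b", "c"]   # ids ranked by the vector index
keyword_hits = ["c", "a", "d"]  # ids ranked by the keyword index
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# fused == ["a", "c", "b", "d"]
```

RRF uses only rank positions, so it avoids having to normalise cosine scores against BM25-style scores, which are on different scales.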
Interview answer template
For "Design RAG for internal docs", answer:
- Ingest sources with parsing, checksums, metadata, ACLs, and source versions.
- Chunk by content type: prose, tables, code, tickets, PDFs.
- Build vector and keyword indexes plus object storage for originals.
- Retrieve with tenant/ACL/freshness filters and rerank when useful.
- Pack minimal evidence with ids and citations.
- Instruct the model to answer only from evidence and degrade when evidence is missing.
- Evaluate top-k recall, citation support, unsupported claims, latency, and cost.
Common bad answer:
"Embed documents into a vector DB and ask the model."
That misses ingestion quality, metadata, ACLs, hybrid search, source lineage, context packing, and evals.
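The top-k recall metric mentioned in the template can be computed with a small helper like the one below. It is a sketch that assumes a hand-labelled evaluation set of `(retrieved_ids, relevant_ids)` pairs; the function names are hypothetical.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def mean_recall_at_k(eval_set: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over labelled queries; 0.0 for an empty evaluation set."""
    if not eval_set:
        return 0.0
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in eval_set) / len(eval_set)
```

Tracking this number per release makes retrieval regressions visible before they show up as unsupported claims in answers.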
Self-check
You are ready if you can explain:
- Why chunking depends on content type.
- Why vector search and keyword search complement each other.
- Where ACL and freshness checks belong.
- What to inspect when a RAG answer hallucinates.
- What the assistant should do when retrieval returns nothing.
Related
LLM contracts, context & tools · Agent memory, state & storage · Evaluations