RAG ingest, retrieve & pack

RAG gives an LLM external knowledge at request time. Instead of hoping the model memorized facts, your system retrieves relevant, authorized evidence and packs it into the prompt.

RAG is not "add a vector database." It is an end-to-end pipeline: ingest -> chunk -> embed -> index -> retrieve -> rerank -> pack -> answer -> evaluate.

Use the visual model below to separate the three responsibilities: the ingest plane creates trusted chunks, the index plane keeps complementary retrieval signals, and the serving plane filters, reranks, and packs only safe evidence for the model.

RAG evidence pipeline showing ingest, index, and serving planes with lineage, hybrid retrieval, ACL and freshness filters, reranking, context packing, citations, and runtime controls.

Full RAG workflow

Explanation: vector search finds semantic neighbors, keyword search catches exact identifiers, object storage keeps original content, and metadata enforces access and freshness.

What each storage layer stores

Storage	Stores	Why it exists
Object/blob storage	Original files, parsed text, OCR output, snapshots.	Reproducibility, re-indexing, audit, source download.
Relational/document DB	Document metadata, tenants, ACLs, chunk lineage, ingest status.	Filtering, governance, delete/update workflows.
Vector DB / ANN index	Embedding vectors plus metadata pointers.	Semantic similarity search at scale.
Keyword index	Tokens, terms, exact ids, BM25-style inverted index.	Names, SKUs, error codes, proper nouns.
Cache	Recent retrievals, embeddings, rerank results.	Latency/cost reduction with tenant-safe keys.
Trace/eval store	Queries, retrieved ids, answer outcome, ratings.	Debugging and regression testing.

Do not store secrets in vectors or traces. Embeddings are not a privacy boundary.
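
To make the split between layers concrete, here is a minimal sketch of a metadata row and a tenant-safe cache key. The names `ChunkRecord` and `cache_key` and all field names are hypothetical, chosen only for illustration; a real schema will differ.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRecord:
    """Metadata row for one chunk (hypothetical schema).

    The vector index would hold only chunk_id plus the embedding;
    everything used for filtering and lineage lives here.
    """
    chunk_id: str
    doc_id: str
    doc_version: int
    tenant_id: str
    allowed_groups: frozenset[str]
    blob_uri: str  # pointer to the original file in object storage
    heading_path: tuple[str, ...] = ()
    superseded: bool = False


def cache_key(tenant_id: str, user_groups: frozenset[str], query: str) -> str:
    """Build a retrieval-cache key scoped to tenant and permission set."""
    parts = [tenant_id, ",".join(sorted(user_groups)), query.strip().lower()]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()
```

Because the tenant and the sorted group list are part of the key, a cached result for one tenant or permission set is never served to another.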

Chunking choices

Content	Good chunk strategy	Failure mode
Policies/prose	Paragraph or section chunks with heading path.	Losing the exception clause in a different chunk.
Tables	Preserve rows, headers, units, and surrounding explanation.	The model hallucinates numeric answers because headers were removed.
Code	Function/class-level chunks with repo path and symbol.	Splitting imports from function body.
Tickets/chats	Thread-aware chunks with time and resolution status.	Retrieving the complaint but not the final fix.
PDF/OCR	Clean layout noise, keep page numbers and confidence.	Navigation/footer text dominates embeddings.
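
As an illustration of the prose row above, the following sketch splits text into paragraph chunks and tags each with its heading path. It assumes Markdown-style `#` headings and blank-line paragraph breaks; `chunk_by_heading` is a hypothetical helper, not a library function.

```python
def chunk_by_heading(text: str) -> list[dict]:
    """Split markdown-like text into paragraph chunks tagged with a heading path."""
    path: list[str] = []
    chunks: list[dict] = []
    para: list[str] = []

    def flush() -> None:
        # Emit the paragraph collected so far, if any.
        if para:
            chunks.append({"heading_path": tuple(path), "text": " ".join(para)})
            para.clear()

    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            flush()
            level = len(stripped) - len(stripped.lstrip("#"))
            title = stripped.lstrip("#").strip()
            del path[level - 1:]  # drop headings at this level or deeper
            path.append(title)
        elif not stripped:
            flush()  # blank line ends the paragraph
        else:
            para.append(stripped)
    flush()
    return chunks
```

Keeping the heading path on every chunk means an exception clause that lands in its own chunk still carries the name of the policy section it belongs to.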

Vector database basics

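Vector search at scale is usually approximate nearest-neighbor (ANN) search: it returns chunks whose embeddings are close to the query embedding, trading a little recall for speed. Closeness means semantic similarity, not factual correctness, so always combine it with filters, reranking, citations, and evaluation.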
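
As a point of reference, exact (brute-force) nearest-neighbor search with a metadata filter can be sketched in a few lines. An ANN index approximates this ranking without scanning every vector. The `search` function, the tuple layout of `index`, and the `where` predicate are assumptions made for this example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(query_vec, index, k=3, where=None):
    """Exact top-k search over (chunk_id, vector, metadata) tuples.

    `where` is an optional predicate on metadata, applied before scoring
    so filtered-out chunks never enter the ranking.
    """
    scored = [
        (chunk_id, cosine(query_vec, vec))
        for chunk_id, vec, meta in index
        if where is None or where(meta)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

Applying the filter before scoring matters: filtering after taking the top k can leave fewer than k results, or none at all.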

Context packing

Pack evidence in a deterministic order:

1. Drop chunks the user cannot access.
2. Drop stale or superseded versions.
3. Merge duplicate/overlapping chunks.
4. Prefer chunks with exact identifiers when the query contains identifiers.
5. Keep source ids and short titles with each chunk.
6. Reserve token budget for the user's question and final answer.
7. Tell the model to answer only from evidence for factual claims.
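
The packing steps above can be sketched as a single function. This is a simplified illustration under stated assumptions: chunk dicts use hypothetical keys, duplicates are only dropped when their normalized text is identical (no overlap merging), identifiers are guessed as query tokens containing a digit, and tokens are estimated by whitespace splitting. `budget_tokens` stands for what is left after reserving room for the question and the answer; the instruction to answer only from evidence belongs in the prompt, not in this function.

```python
import re


def pack_context(chunks, user_groups, query, budget_tokens=200):
    """Return a prompt-ready evidence string built from retrieved chunks.

    Each chunk is a dict with the hypothetical keys: id, title, text,
    allowed_groups (set), superseded (bool), score (float).
    """
    # Drop chunks the user cannot access, then stale or superseded versions.
    visible = [
        c for c in chunks
        if c["allowed_groups"] & user_groups and not c["superseded"]
    ]

    # Drop exact duplicates (after whitespace and case normalization).
    seen, unique = set(), []
    for c in visible:
        key = " ".join(c["text"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(c)

    # Treat query tokens containing a digit (e.g. "ERR-4012") as identifiers.
    idents = {t.lower() for t in re.findall(r"[\w-]*\d[\w-]*", query)}

    def priority(c):
        has_ident = any(i in c["text"].lower() for i in idents)
        # Identifier matches first, then higher score, then id as a tie-break.
        return (not has_ident, -c["score"], c["id"])

    packed, used = [], 0
    for c in sorted(unique, key=priority):
        block = f"[{c['id']}] {c['title']}\n{c['text']}"
        cost = len(block.split())  # crude whitespace token estimate
        if used + cost <= budget_tokens:  # skip chunks that would overflow
            packed.append(block)
            used += cost
    return "\n\n".join(packed)
```

The explicit tie-break on `id` keeps the output identical across runs for the same input, which makes packed prompts easier to diff when debugging.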

RAG failure states

Interview questions

1. Why use hybrid search instead of only vectors?

Vectors are strong for semantics. Keyword search is stronger for exact ids, product names, codes, and rare terms. Fusing the two ranked lists gives better recall than either one alone.

Follow-up: When can vectors fail badly?

On proper nouns, SKUs, error codes, table values, short acronyms, and near-duplicate policies, where the exact token or version matters more than overall meaning.
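
One common way to fuse a vector ranking and a keyword ranking is reciprocal rank fusion (RRF), which needs only ranks, not comparable scores. The sketch below is a minimal version; the function name `rrf` is illustrative, and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per id."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, then by id so ties resolve deterministically.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["a", "b", "c"]
keyword_hits = ["c", "a", "d"]
fused = rrf([vector_hits, keyword_hits])
```

Ids that appear in both lists rise to the top, while an id found by only one retriever (such as an exact SKU match from keyword search) still makes the fused list.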

2. Where do ACL checks happen?

Ideally as a filter at retrieval time, before ranking; at the latest, before packing. Never rely on the model to ignore unauthorized chunks.

3. How do you debug a hallucinated RAG answer?

Inspect retrieved ids, ranking scores, source versions, packed context, final prompt, and whether the answer contained unsupported claims.
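
One of these checks can be automated cheaply: comparing the ids an answer cites with the ids that were actually packed. The sketch below assumes citations appear in the answer as `[chunk-id]`; that format and the function name are assumptions for illustration.

```python
import re


def unsupported_citations(answer: str, packed_ids: set[str]) -> set[str]:
    """Return ids cited as [id] in the answer that were never in the packed context."""
    cited = set(re.findall(r"\[([\w-]+)\]", answer))
    return cited - packed_ids
```

A non-empty result means the model cited a source it was never shown. An empty result does not prove every claim is supported, so claim-level checks are still needed.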

4. What happens when retrieval returns nothing?

The assistant should say it lacks evidence, ask a clarifying question, or route to another source. It should not invent.
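
A minimal sketch of that behavior is a guard in front of generation. The `generate` callable, the `min_score` threshold, and the response shape are all hypothetical; the point is only that the no-evidence branch never reaches the model.

```python
def answer_or_abstain(evidence: list[dict], generate, min_score: float = 0.3) -> dict:
    """Call the model only when usable evidence exists; otherwise abstain."""
    usable = [e for e in evidence if e["score"] >= min_score]
    if not usable:
        return {
            "status": "no_evidence",
            "message": "I could not find a source for that. "
                       "Can you add a product name, id, or date?",
        }
    return {"status": "answered", "message": generate(usable)}
```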

5. How do you handle document updates?

Version documents, checksum content at ingest to detect changes, mark old chunks superseded, dual-write to the old and new indexes during embedding migrations, and rerun regression queries after each change.
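
The checksum and supersede steps can be sketched with an in-memory dict standing in for the metadata database. The `ingest` function and the record fields are hypothetical; chunking, embedding, and index writes are omitted.

```python
import hashlib


def ingest(store: dict, doc_id: str, text: str) -> str:
    """Record a new document version only when the content actually changed.

    `store` maps doc_id to a list of version dicts (oldest first).
    """
    checksum = hashlib.sha256(text.encode("utf-8")).hexdigest()
    versions = store.setdefault(doc_id, [])
    if versions and versions[-1]["checksum"] == checksum:
        return "unchanged"  # skip re-chunking and re-embedding
    for version in versions:
        version["superseded"] = True  # older versions stay for audit
    versions.append(
        {"version": len(versions) + 1, "checksum": checksum, "superseded": False}
    )
    return "indexed"
```

Marking old versions superseded instead of deleting them keeps lineage for audits while letting the serving path filter them out.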

Interview answer template

For "Design RAG for internal docs", answer:

1. Ingest sources with parsing, checksums, metadata, ACLs, and source versions.
2. Chunk by content type: prose, tables, code, tickets, PDFs.
3. Build vector and keyword indexes plus object storage for originals.
4. Retrieve with tenant/ACL/freshness filters and rerank when useful.
5. Pack minimal evidence with ids and citations.
6. Instruct the model to answer only from evidence and degrade when evidence is missing.
7. Evaluate top-k recall, citation support, unsupported claims, latency, and cost.
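
Of the evaluation metrics listed, top-k recall is the simplest to compute once you have labeled queries. The sketch below assumes each test case is a pair of retrieved ids (in rank order) and the set of ids a human marked relevant; the function names are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant ids that appear in the top k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(cases: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved, relevant) test cases."""
    return sum(recall_at_k(r, rel, k) for r, rel in cases) / len(cases)
```

Tracking this number across index or chunking changes gives an early regression signal before any answer quality metric moves.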

Common bad answer:

"Embed documents into a vector DB and ask the model."

That misses ingestion quality, metadata, ACLs, hybrid search, source lineage, context packing, and evals.

Self-check

You are ready if you can explain:

Why chunking depends on content type.
Why vector search and keyword search complement each other.
Where ACL and freshness checks belong.
What to inspect when a RAG answer hallucinates.
What the assistant should do when retrieval returns nothing.
