RAG ingest, retrieve & pack
RAG gives an LLM external knowledge at request time. Instead of hoping the model memorized facts, your system retrieves relevant, authorized evidence and packs it into the prompt.
RAG is not "add a vector database." It is an end-to-end pipeline: ingest -> chunk -> embed -> index -> retrieve -> rerank -> pack -> answer -> evaluate.
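The stages above can be sketched as a thin pipeline. This is a minimal illustration, not a real library: the chunker splits on blank lines and the "embedding" is a toy character-frequency vector, both stand-ins for structure-aware splitting and a real embedding model.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Chunk:
    doc_id: str
    text: str
    embedding: Optional[list] = None

def ingest(raw: str, doc_id: str) -> list:
    # Toy chunking: split on blank lines. Real systems split on structure.
    return [Chunk(doc_id, p.strip()) for p in raw.split("\n\n") if p.strip()]

def embed(chunks: list) -> list:
    # Toy embedding: letter-frequency vector. Swap for a real model.
    for c in chunks:
        c.embedding = [c.text.lower().count(ch) / max(len(c.text), 1)
                       for ch in "abcdefghijklmnopqrstuvwxyz"]
    return chunks

def retrieve(query: str, index: list, k: int = 2) -> list:
    # Score every chunk by dot product and return the top k.
    q = embed([Chunk("query", query)])[0].embedding
    scored = sorted(index,
                    key=lambda c: -sum(a * b for a, b in zip(q, c.embedding)))
    return scored[:k]
```

Usage: `index = embed(ingest(raw_text, "doc1"))`, then `retrieve(question, index)`. The remaining stages (rerank, pack, answer, evaluate) layer on top of this core loop.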
Full RAG workflow
In this pipeline, vector search finds semantic neighbors, keyword search catches exact identifiers, object storage keeps the original content, and metadata enforces access control and freshness.
What each storage layer stores
| Storage | Stores | Why it exists |
|---|---|---|
| Object/blob storage | Original files, parsed text, OCR output, snapshots. | Reproducibility, re-indexing, audit, source download. |
| Relational/document DB | Document metadata, tenants, ACLs, chunk lineage, ingest status. | Filtering, governance, delete/update workflows. |
| Vector DB / ANN index | Embedding vectors plus metadata pointers. | Semantic similarity search at scale. |
| Keyword index | Tokens, terms, exact ids, BM25-style inverted index. | Names, SKUs, error codes, proper nouns. |
| Cache | Recent retrievals, embeddings, rerank results. | Latency/cost reduction with tenant-safe keys. |
| Trace/eval store | Queries, retrieved ids, answer outcome, ratings. | Debugging and regression testing. |
Do not store secrets in vectors or traces. Embeddings are not a privacy boundary.
Chunking choices
| Content | Good chunk strategy | Failure mode |
|---|---|---|
| Policies/prose | Paragraph or section chunks with heading path. | Losing the exception clause in a different chunk. |
| Tables | Preserve rows, headers, units, and surrounding explanation. | Numeric answers hallucinate because headers were removed. |
| Code | Function/class-level chunks with repo path and symbol. | Splitting imports from function body. |
| Tickets/chats | Thread-aware chunks with time and resolution status. | Retrieving complaint but not final fix. |
| PDF/OCR | Strip layout noise; keep page numbers and OCR confidence. | Navigation/footer text dominates embeddings. |
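For prose, the "paragraph chunks with heading path" strategy can be sketched as below. This is a simplification that assumes markdown input: a real parser would also handle code fences, tables, and malformed nesting.

```python
def chunk_with_headings(markdown: str) -> list:
    """Split markdown into paragraph chunks, each tagged with the path of
    headings above it, so an exception clause retrieved alone still carries
    the section it belongs to."""
    chunks, path = [], []
    for block in markdown.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            # Heading: update the path at this heading's depth.
            level = len(block) - len(block.lstrip("#"))
            title = block.lstrip("#").strip()
            path = path[: level - 1] + [title]
        else:
            chunks.append({"heading_path": " > ".join(path), "text": block})
    return chunks
```

A paragraph under `## Refunds` inside `# Policy` gets the path `Policy > Refunds`, which can be embedded along with the text or shown to the model at pack time.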
Vector database basics
Vector search is approximate nearest-neighbor search. It is good for semantic similarity, not guaranteed factual correctness. Always combine it with filters, reranking, citations, and evaluation.
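As a concrete illustration of combining similarity scoring with metadata filters, here is a brute-force cosine search sketch. The `tenant` field and dict layout are assumptions for the example; production vector DBs push such filters into the ANN index rather than scanning.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def filtered_search(query_vec, items, allowed_tenants, k=3):
    # Apply the metadata filter BEFORE ranking so unauthorized
    # chunks can never appear in the top-k.
    candidates = [it for it in items if it["tenant"] in allowed_tenants]
    candidates.sort(key=lambda it: cosine(query_vec, it["vec"]), reverse=True)
    return candidates[:k]
```

Filtering before ranking matters: post-filtering a top-k result can silently return fewer than k authorized chunks, or none at all.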
Context packing
Pack evidence deterministically, applying these rules in order:
- Drop chunks the user cannot access.
- Drop stale or superseded versions.
- Merge duplicate/overlapping chunks.
- Prefer chunks with exact identifiers when the query contains identifiers.
- Keep source ids and short titles with each chunk.
- Reserve token budget for the user's question and final answer.
- Tell the model to answer only from evidence for factual claims.
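The packing rules above can be sketched as a single function. Field names (`score`, `acl`, `source_id`, `superseded`) are assumptions for the example, and whitespace word counts stand in for a real tokenizer.

```python
def pack_context(chunks, user_acl, token_budget, reserved=512):
    """Deterministic packing sketch: ACL filter, staleness filter,
    dedupe, and a token budget with headroom reserved for the
    question and answer."""
    def tokens(text):
        return len(text.split())  # stand-in for a real tokenizer

    budget = token_budget - reserved
    seen, packed, used = set(), [], 0
    # Highest score first, so truncation drops the weakest evidence.
    for c in sorted(chunks, key=lambda c: -c["score"]):
        if c["acl"] not in user_acl or c.get("superseded"):
            continue
        if c["text"] in seen:  # merge exact duplicates
            continue
        cost = tokens(c["text"])
        if used + cost > budget:
            break
        seen.add(c["text"])
        packed.append(f'[{c["source_id"]}] {c["title"]}: {c["text"]}')
        used += cost
    return "\n".join(packed)
```

Keeping `[source_id] title:` in front of each chunk gives the model something concrete to cite, and makes packed prompts diffable in traces.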
Interview questions
1. Why use hybrid search instead of only vectors?
- Vectors are strong for semantics. Keyword search is stronger for exact ids, product names, codes, and rare terms. Fusion gives better recall.
2. Where do ACL checks happen?
- Ideally before ranking or at least before packing. Never rely on the model to ignore unauthorized chunks.
3. How do you debug a hallucinated RAG answer?
- Inspect retrieved ids, ranking scores, source versions, packed context, final prompt, and whether the answer contained unsupported claims.
4. What happens when retrieval returns nothing?
- The assistant should say it lacks evidence, ask a clarifying question, or route to another source. It should not invent.
5. How do you handle document updates?
- Version documents, checksum ingest, mark old chunks superseded, dual-write during embedding migrations, and test query regressions.
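The fusion mentioned in question 1 is commonly done with reciprocal rank fusion (RRF), which combines ranked id lists without needing comparable scores. A minimal sketch, using the conventional constant k = 60:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1/(k + rank) per
    document; documents ranked well by multiple retrievers win."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only uses ranks, it sidesteps the problem that cosine similarities and BM25 scores live on different scales.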
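The update workflow in question 5 can be sketched with a checksum gate. Here `store` is a plain dict standing in for the metadata DB, and the record fields are assumptions for the example:

```python
import hashlib

def ingest_version(store, doc_id, content):
    """Checksum-based re-ingest sketch: skip unchanged documents,
    mark prior versions superseded when content changes."""
    checksum = hashlib.sha256(content.encode()).hexdigest()
    versions = store.setdefault(doc_id, [])
    if versions and versions[-1]["checksum"] == checksum:
        return "unchanged"            # identical content: skip re-embedding
    for v in versions:
        v["superseded"] = True        # old chunks filtered out at retrieval
    versions.append({"checksum": checksum,
                     "version": len(versions) + 1,
                     "superseded": False})
    return f"indexed v{len(versions)}"
```

Superseded versions stay in the store rather than being deleted, which supports audits and lets retrieval filters exclude them cheaply.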
Related
LLM contracts, context & tools · Agent memory, state & storage · Evaluations