
LLM & RAG: The 'Smart Librarian' Mental Model

Why do LLMs hallucinate? A mastery guide to Retrieval-Augmented Generation (RAG) — the architecture powering every serious AI product in 2026.

You deploy a ChatGPT-powered chatbot for your company.

A user asks: “What’s our refund policy?” The AI confidently answers with a made-up policy. The user complains. You lose a customer.

This is called hallucination — the original sin of Large Language Models. The fix is RAG (Retrieval-Augmented Generation).

This is the Mastery Guide to building production AI systems that don’t lie.


Part 1: Foundations (The Mental Model)

The Overconfident Intern

An LLM (like GPT-4, Claude, Gemini) is a brilliant intern who has read the entire internet up to a certain date. They are excellent at reasoning, writing, and summarizing.

The problem: They have no access to your internal documents, Notion pages, or last week’s customer data. When asked, they confidently invent an answer rather than saying “I don’t know.” This is hallucination.

RAG = The Smart Librarian

RAG gives the intern a librarian who can fetch relevant documents before they answer.

User: "What's our pricing for Enterprise?"
[1. Retriever]: Search your knowledge base
             → finds pricing_2026.pdf, enterprise_FAQ.md
[2. Augment]: Inject documents into the LLM prompt:
    "Here is relevant context: [pricing_2026.pdf content]...
     Now answer: What's our pricing for Enterprise?"
[3. Generate]: LLM answers BASED ON the real document
Answer: "Enterprise pricing starts at $999/month (Source: pricing_2026.pdf)"

Result: Grounded answer. Citable source. No hallucination.


Part 2: The Investigation (RAG Pipeline Components)

1. The Chunking Problem

You can’t feed a 500-page PDF directly to an LLM (context window limit). You chunk it first.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,        # max characters per chunk (this splitter counts characters, not tokens)
    chunk_overlap=50,      # 50-character overlap so context carries across chunk boundaries
    separators=["\n\n", "\n", ". ", " "]  # Split at natural boundaries
)

chunks = splitter.split_text(document_text)
# Result: ["The refund policy states...", "For enterprise plans...", ...]

Chunking is one of the most impactful factors in RAG quality. Too large, and irrelevant context pollutes the answer. Too small, and paragraph-level meaning is lost.
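The trade-off is easier to see in a minimal character-level sketch of the sliding-window idea. This is not LangChain's actual algorithm (which also prefers the natural separators shown above); it only illustrates how overlap prevents boundary loss:

```python
def chunk_with_overlap(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    # Naive sliding window over characters; each chunk repeats the last
    # `overlap` characters of the previous one, so nothing sits invisible
    # on a chunk boundary.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

With these parameters, consecutive chunks share 50 characters at the boundary, so a sentence cut at the end of one chunk reappears at the start of the next.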

2. The Embedding (Turning Text into Math)

Each chunk is converted into a vector — a list of numbers that represents its semantic meaning.

from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding  # [0.23, -0.87, 0.12, ...] (1536 dimensions)

The key insight: “Dog” and “Canine” will have similar vectors. “Dog” and “Database” will have very different vectors. This is semantic similarity.
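Semantic similarity is typically measured as the cosine of the angle between two vectors. A self-contained sketch with toy 3-dimensional vectors (real embeddings have 1536 dimensions, but the math is identical):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine of the angle between two vectors: 1.0 = pointing the same way,
    # 0.0 = unrelated directions.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hand-picked values for illustration only)
dog      = [0.90, 0.80, 0.10]
canine   = [0.85, 0.75, 0.15]
database = [0.10, 0.20, 0.95]

print(cosine_similarity(dog, canine))    # high (~0.99)
print(cosine_similarity(dog, database))  # much lower (~0.29)
```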

3. The Vector Database (Storing and Searching Embeddings)

You can’t search through 1 million embeddings with a simple WHERE clause. You need a Vector DB (see the dedicated post on Vector Databases).

import chromadb

client = chromadb.Client()
collection = client.create_collection("company_docs")

# Index your documents
collection.add(
    documents=chunks,
    embeddings=[embed(c) for c in chunks],
    ids=[f"chunk_{i}" for i in range(len(chunks))]
)

# Query: find the most relevant chunks for a question
results = collection.query(
    query_embeddings=[embed("What is the refund policy?")],
    n_results=3  # Get top 3 most semantically similar chunks
)
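Conceptually, what collection.query does is a nearest-neighbor search. At toy scale it is just a scored sort — a sketch (real vector DBs use approximate indexes such as HNSW precisely to avoid this O(n) scan over a million vectors):

```python
import math

def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 3) -> list[str]:
    # Brute-force nearest neighbors: score every stored vector against the query
    # by cosine similarity and keep the k best.
    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

    scored = sorted(index.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# Toy index: chunk id -> embedding (hypothetical values for illustration)
index = {
    "chunk_refunds":  [0.9, 0.1, 0.0],
    "chunk_pricing":  [0.1, 0.9, 0.1],
    "chunk_security": [0.0, 0.1, 0.9],
}
top_k([0.88, 0.15, 0.05], index, k=1)  # -> ["chunk_refunds"]
```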

Part 3: The Diagnosis (RAG Failure Modes)

| Problem | Symptom | Fix |
| --- | --- | --- |
| Bad chunking | Answer uses the wrong section of a document | Tune chunk_size + add metadata (page number, section title) |
| Wrong embedding model | Retrieval finds irrelevant chunks | Use a domain-specific model (e.g., text-embedding-3-large or a fine-tuned model) |
| Context window overflow | LLM ignores later chunks (the "Lost in the Middle" problem) | Reduce n_results. Rerank results. Put the most relevant chunk first. |
| Stale knowledge base | Documents updated but not re-indexed | Implement an ingestion pipeline triggered on document changes |
| LLM ignores context | Still hallucinates despite good retrieval | Strengthen the system prompt: "Answer ONLY using the provided context. If not in context, say 'I don't know'." |

Part 4: The Resolution (Production RAG Stack)

The Minimal Stack (Python)

from openai import OpenAI
import chromadb

client = OpenAI()
db = chromadb.Client()
collection = db.get_collection("company_docs")

SYSTEM_PROMPT = """You are a helpful assistant for Acme Corp.
Answer questions ONLY using the provided context.
If the answer is not in the context, say "I don't have that information."
Always cite your source."""

def rag_query(question: str) -> str:
    # 1. Retrieve relevant chunks
    results = collection.query(
        query_embeddings=[embed(question)],
        n_results=3
    )
    context = "\n\n---\n\n".join(results["documents"][0])
    
    # 2. Augment prompt with context
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
    ]
    
    # 3. Generate grounded answer
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0  # Low temp = more factual, less creative
    )
    return response.choices[0].message.content

# Usage
answer = rag_query("What's the refund policy for annual plans?")

Advanced: Re-Ranking

After retrieval, use a Cross-Encoder to re-rank chunks by relevance before feeding to the LLM:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    pairs = [(question, chunk) for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]

Final Mental Model

LLM alone    -> The Overconfident Intern. Smart but makes things up.
RAG          -> Intern + Librarian. Grounded, citable, accurate.

Chunking     -> Breaking the library into flashcards.
Embedding    -> Turning flashcard text into a map coordinate.
Vector DB    -> The map. "Find all coordinates near this question."
Reranking    -> A second expert who reads the 10 candidates and picks the best 2.

temperature=0 -> "Be a lawyer. Facts only."
temperature=1 -> "Be a poet. Be creative."

RAG is not optional for production AI products. Without it, you are shipping an AI that confidently makes things up about your own business.
