Why RAG? The Problem It Solves

LLMs are trained on data up to a cutoff date and know nothing about your internal documents: your network diagrams, runbooks, product docs, or research papers. You could fine-tune a model on your data, but that's expensive, slow, and the model might still "forget" or confabulate.

RAG solves this elegantly: retrieve relevant documents at query time and inject them into the prompt so the LLM can answer from actual context rather than from memory. The architecture is:

Document → Chunks → Embeddings → Vector DB (offline)
Query → Embed → Retrieve → Augment Prompt → LLM → Answer (online)

Prerequisites

pip install langchain langchain-openai langchain-community chromadb pypdf tiktoken
export OPENAI_API_KEY="sk-your-key-here"

Step 1: Load and Chunk Your Documents

Chunking strategy matters enormously. Too small and you lose context; too large and retrieval gets diluted and you risk hitting token limits. A common sweet spot is roughly 500–1000 tokens with 10–20% overlap (note that RecursiveCharacterTextSplitter measures characters by default, at roughly 4 characters per token).
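To build intuition for how size and overlap interact before reaching for a library, here is a minimal sliding-window splitter in plain Python. This is not what RecursiveCharacterTextSplitter does internally (it splits on separators first); it only illustrates the size/overlap arithmetic:

```python
def sliding_chunks(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "x" * 2000
chunks = sliding_chunks(doc)
print(len(chunks))                          # 3 windows: [0:800], [700:1500], [1400:2000]
print(chunks[0][-100:] == chunks[1][:100])  # True - consecutive chunks share 100 chars
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, which is exactly why 10–20% overlap improves retrieval.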

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load all PDFs from a directory
loader = DirectoryLoader('./docs/', glob="**/*.pdf", loader_cls=PyPDFLoader)
raw_docs = loader.load()
print(f"Loaded {len(raw_docs)} pages")

# Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,          # characters (~200 tokens)
    chunk_overlap=100,       # overlap between chunks
    separators=["\n\n", "\n", ". ", " ", ""],  # try larger separators first
    length_function=len
)
chunks = splitter.split_documents(raw_docs)
print(f"Split into {len(chunks)} chunks")

Step 2: Create Embeddings and Store in ChromaDB

Embeddings convert text into numerical vectors in a high-dimensional space where semantically similar texts are geometrically close. We store these in ChromaDB for fast similarity search.
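"Geometrically close" usually means cosine similarity: the cosine of the angle between two vectors. A minimal version with made-up toy vectors (real embeddings have ~1500 dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0 - same direction, magnitude ignored
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0 - orthogonal, i.e. unrelated
```

Because only the angle matters, a short query can still score highly against a long chunk that talks about the same thing.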

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Initialize embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create vector store (persisted to disk)
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
print(f"Stored {vectorstore._collection.count()} vectors")

For production, consider alternatives: text-embedding-3-large for higher quality, or a local model like nomic-embed-text via Ollama for zero API cost.

Step 3: Build the Retrieval Chain

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Custom prompt - critical for quality answers
PROMPT_TEMPLATE = """You are a helpful assistant answering questions based on the provided context.
If the answer isn't in the context, say "I don't have enough information to answer that."
Never make up information.

Context:
{context}

Question: {question}

Answer:"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=PROMPT_TEMPLATE
)

# Load existing vector store
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small")
)

# Create retriever - MMR gives more diverse results than cosine similarity alone
retriever = vectorstore.as_retriever(
    search_type="mmr",         # Maximal Marginal Relevance
    search_kwargs={
        "k": 5,                # Retrieve 5 chunks
        "fetch_k": 20,         # From top-20 candidates
        "lambda_mult": 0.7     # Balance relevance vs. diversity
    }
)

# Build the chain
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",        # Stuffs all retrieved chunks into one prompt
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt}
)
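To see what those MMR knobs control, here is a sketch of the selection loop in plain Python with hypothetical precomputed similarity scores (this is an illustration of the algorithm, not LangChain's actual implementation). MMR repeatedly picks the candidate maximizing `lambda_mult * relevance - (1 - lambda_mult) * redundancy`:

```python
def mmr_select(query_sims, doc_sims, k=2, lambda_mult=0.5):
    """query_sims[i]: similarity of candidate i to the query.
    doc_sims[i][j]: similarity between candidates i and j."""
    selected, candidates = [], list(range(len(query_sims)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # Redundancy = how similar candidate i is to anything already picked
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sims[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Candidates 0 and 1 are near-duplicates (similarity 0.95); candidate 2 is distinct.
query_sims = [0.9, 0.85, 0.3]
doc_sims = [[1.0, 0.95, 0.1], [0.95, 1.0, 0.1], [0.1, 0.1, 1.0]]
print(mmr_select(query_sims, doc_sims))  # [0, 2] - the near-duplicate is skipped
```

With `lambda_mult` closer to 1.0 the score is dominated by relevance and the near-duplicate would be selected instead; that is the relevance-vs-diversity trade-off the retriever config above tunes.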

Step 4: Query Your Documents

def ask(question: str):
    result = qa_chain.invoke({"query": question})
    print(f"\n❓ Question: {question}")
    print(f"\n💬 Answer: {result['result']}")
    print("\n📚 Sources:")
    seen = set()
    for doc in result['source_documents']:
        source = doc.metadata.get('source', 'Unknown')
        page = doc.metadata.get('page', '?')
        key = f"{source}:{page}"
        if key not in seen:
            seen.add(key)
            print(f"   - {source} (page {page})")

# Example queries
ask("What is the BGP holddown timer in our runbook?")
ask("How do we handle a VLAN mismatch between sites?")
ask("What are the escalation steps for P1 incidents?")

Step 5: Evaluate and Improve

RAG quality depends on three things: retrieval recall (did we get the right chunks?), retrieval precision (did we get too many irrelevant ones?), and generation quality (did the LLM use the context correctly?). To evaluate:

from langchain.evaluation import load_evaluator

# "faithfulness" is not one of LangChain's built-in criteria, so define it as a
# custom criterion. A "labeled_criteria" evaluator (unlike plain "criteria")
# compares the prediction against a reference text.
evaluator = load_evaluator(
    "labeled_criteria",
    criteria={"faithfulness": "Is every claim in the answer supported by the reference context?"},
    llm=llm                            # reuse the judge LLM from Step 3
)
result = evaluator.evaluate_strings(
    prediction=answer,                 # the chain's answer from Step 4
    input=question,
    reference=retrieved_context        # the concatenated retrieved chunks
)
print(result)                          # e.g. {'reasoning': ..., 'value': 'Y', 'score': 1}
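For retrieval recall, no LLM judge is needed: a plain metric over chunk IDs works, given a small hand-labeled set of questions and their known-relevant chunks. A minimal sketch (the document IDs below are hypothetical):

```python
def recall_at_k(relevant_ids: set, retrieved_ids: list, k: int = 5) -> float:
    """Fraction of known-relevant chunks that appear in the top-k retrieved."""
    top_k = set(retrieved_ids[:k])
    return sum(1 for r in relevant_ids if r in top_k) / len(relevant_ids)

# Hand-labeled ground truth: the answer lives on runbook.pdf pages 3 and 7
relevant = {"runbook.pdf:3", "runbook.pdf:7"}
retrieved = ["runbook.pdf:3", "faq.pdf:1", "runbook.pdf:7", "intro.pdf:2"]
print(recall_at_k(relevant, retrieved))  # 1.0 - both relevant chunks were retrieved
```

Tracking this number while you vary chunk size, k, and lambda_mult tells you whether a quality problem is in retrieval or in generation.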

Production Considerations

  • Chunking strategy: For code, chunk by function. For legal docs, by paragraph. For conversations, by turn. Don't use one-size-fits-all.
  • Metadata filtering: Add metadata (department, date, document type) to chunks and pre-filter before semantic search to improve precision.
  • Hybrid search: Combine BM25 (keyword) + vector search with a reranker for best results. LangChain's EnsembleRetriever handles this.
  • Caching: Cache embeddings and repeated query results. OpenAI charges per token, so caching saves real money at scale.
  • Observability: Add LangSmith or Langfuse tracing to monitor retrieval quality and LLM responses in production.
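As a sketch of the caching point above: wrap the embedding call in a content-addressed cache so identical texts never hit the API twice. The lambda below is a stand-in for a real embedding API call, and the in-memory dict would be Redis or SQLite in production:

```python
import hashlib

class CachedEmbedder:
    """In-memory embedding cache keyed by a hash of the text."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache = {}
        self.api_calls = 0

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.api_calls += 1          # only cache misses cost money
            self.cache[key] = self.embed_fn(text)
        return self.cache[key]

embedder = CachedEmbedder(lambda t: [float(len(t))])  # stand-in for a real API call
for _ in range(3):
    embedder.embed("What is the BGP holddown timer?")
print(embedder.api_calls)  # 1 - two of the three calls were served from cache
```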

Complete Project Structure

rag-project/
├── ingest.py          ← Load docs, chunk, embed, store
├── query.py           ← Query interface
├── evaluate.py        ← Evaluation pipeline
├── docs/              ← Your PDF/text documents
├── chroma_db/         ← Persisted vector database
└── requirements.txt

Key Takeaways

  • RAG = Retrieval + Augmentation + Generation. Documents live in a vector DB; relevant chunks are retrieved per query.
  • Chunking strategy and overlap significantly impact quality; experiment with sizes for your document type.
  • MMR retrieval gives more diverse, less redundant results than pure cosine similarity.
  • Always include an "I don't know" escape hatch in your prompt to prevent confabulation.
  • Evaluate with faithfulness, relevance, and correctness metrics before deploying to production.