Why RAG? The Problem It Solves
LLMs are trained on data up to a cutoff date and know nothing about your internal documents: your network diagrams, runbooks, product docs, or research papers. You could fine-tune a model on your data, but that's expensive, slow, and the model might still "forget" or confabulate.
RAG solves this elegantly: retrieve relevant documents at query time and inject them into the prompt so the LLM can answer from actual context rather than from memory. The architecture is:
Document → Chunks → Embeddings → Vector DB (offline)
Query → Embed → Retrieve → Augment Prompt → LLM → Answer (online)
Prerequisites
pip install langchain langchain-openai langchain-community chromadb pypdf tiktoken
export OPENAI_API_KEY="sk-your-key-here"
Step 1: Load and Chunk Your Documents
Chunking strategy matters enormously. Too small = loss of context. Too large = diluted retrieval and token limit issues. The sweet spot is usually 500-1000 tokens with 10-20% overlap.
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load all PDFs from a directory
loader = DirectoryLoader('./docs/', glob="**/*.pdf", loader_cls=PyPDFLoader)
raw_docs = loader.load()
print(f"Loaded {len(raw_docs)} pages")
# Split into chunks
splitter = RecursiveCharacterTextSplitter(
chunk_size=800, # characters (~200 tokens)
chunk_overlap=100, # overlap between chunks
separators=["\n\n", "\n", ". ", " ", ""], # try larger separators first
length_function=len
)
chunks = splitter.split_documents(raw_docs)
print(f"Split into {len(chunks)} chunks")
Step 2: Create Embeddings and Store in ChromaDB
Embeddings convert text into numerical vectors in a high-dimensional space where semantically similar texts are geometrically close. We store these in ChromaDB for fast similarity search.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
# Initialize embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Create vector store (persisted to disk)
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db"
)
print(f"Stored {vectorstore._collection.count()} vectors")
For production, consider alternatives: text-embedding-3-large for higher quality, or a local model like nomic-embed-text via Ollama for zero API cost.
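As a concrete sketch of the local option (assuming Ollama is installed, running locally, and nomic-embed-text has already been pulled), only the embedding object changes; the rest of the pipeline stays the same. The separate persist directory is my own choice here, since local embedding dimensions differ from OpenAI's.
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
# Local embedding model served by Ollama (run `ollama pull nomic-embed-text` first)
local_embeddings = OllamaEmbeddings(model="nomic-embed-text")
# Re-ingest into a separate store: vector dimensions differ from text-embedding-3-small
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=local_embeddings,
    persist_directory="./chroma_db_local"
)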
Step 3: Build the Retrieval Chain
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
# Custom prompt: critical for quality answers
PROMPT_TEMPLATE = """You are a helpful assistant answering questions based on the provided context.
If the answer isn't in the context, say "I don't have enough information to answer that."
Never make up information.
Context:
{context}
Question: {question}
Answer:"""
prompt = PromptTemplate(
input_variables=["context", "question"],
template=PROMPT_TEMPLATE
)
# Load existing vector store
vectorstore = Chroma(
persist_directory="./chroma_db",
embedding_function=OpenAIEmbeddings(model="text-embedding-3-small")
)
# Create retriever: MMR gives more diverse results than cosine similarity alone
retriever = vectorstore.as_retriever(
search_type="mmr", # Maximal Marginal Relevance
search_kwargs={
"k": 5, # Retrieve 5 chunks
"fetch_k": 20, # From top-20 candidates
"lambda_mult": 0.7 # Balance relevance vs. diversity
}
)
# Build the chain
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff", # Stuffs all retrieved chunks into one prompt
retriever=retriever,
return_source_documents=True,
chain_type_kwargs={"prompt": prompt}
)
Step 4: Query Your Documents
def ask(question: str):
    result = qa_chain.invoke({"query": question})
    print(f"\nQuestion: {question}")
    print(f"\nAnswer: {result['result']}")
    print("\nSources:")
    seen = set()
    for doc in result['source_documents']:
        source = doc.metadata.get('source', 'Unknown')
        page = doc.metadata.get('page', '?')
        key = f"{source}:{page}"
        if key not in seen:
            seen.add(key)
            print(f"  - {source} (page {page})")
# Example queries
ask("What is the BGP holddown timer in our runbook?")
ask("How do we handle a VLAN mismatch between sites?")
ask("What are the escalation steps for P1 incidents?")
Step 5: Evaluate and Improve
RAG quality depends on three things: retrieval recall (did we get the right chunks?), retrieval precision (did we get too many irrelevant ones?), and generation quality (did the LLM use the context correctly?). To evaluate:
from langchain.evaluation import load_evaluator
# Evaluate faithfulness: does the answer stick to the retrieved context?
# "labeled_criteria" grades a prediction against a reference; faithfulness is supplied as a custom criterion
evaluator = load_evaluator(
    "labeled_criteria",
    criteria={"faithfulness": "Is every claim in the answer supported by the reference context?"}
)
result = evaluator.evaluate_strings(
    prediction=answer,            # the chain's generated answer
    input=question,               # the original question
    reference=retrieved_context   # the retrieved chunks, concatenated
)
print(result)
Production Considerations
- Chunking strategy: For code, chunk by function. For legal docs, by paragraph. For conversations, by turn. Don't use one-size-fits-all (see the first sketch after this list).
- Metadata filtering: Add metadata (department, date, document type) to chunks and pre-filter before semantic search to improve precision (second sketch below).
- Hybrid search: Combine BM25 (keyword) + vector search for best results, optionally with a reranker on top. LangChain's EnsembleRetriever handles the combination (third sketch below).
- Caching: Cache embeddings and repeated query results. OpenAI charges per token, so caching saves real money at scale.
- Observability: Add LangSmith or Langfuse tracing to monitor retrieval quality and LLM responses in production.
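To illustrate the chunking point, here is a minimal sketch of chunking source code by function with LangChain's language-aware splitter. Language.PYTHON, the sizes, and the raw_code_docs variable are illustrative assumptions, not values from this guide.
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
# Splits on class/def boundaries first, so whole functions tend to stay intact
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=800,
    chunk_overlap=100
)
code_chunks = code_splitter.split_documents(raw_code_docs)  # raw_code_docs: hypothetical list of loaded code files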
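For metadata filtering, a sketch assuming the chunks carry a hypothetical doc_type metadata field added at ingest time; Chroma can apply a simple equality filter before the similarity search runs.
# Pre-filter by metadata, then run semantic search only over the matching chunks
runbook_retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 5,
        "filter": {"doc_type": "runbook"}  # "doc_type" is a hypothetical metadata key set during ingestion
    }
)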
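And for hybrid search, a minimal sketch assuming the chunks list and vectorstore from the earlier steps are still in scope and the rank_bm25 package is installed; the weights are illustrative, and a dedicated reranker could still be layered on top of the fused results.
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
# Keyword retriever over the same chunks (pip install rank_bm25)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5
# Dense retriever from the existing Chroma store
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# Weighted fusion of both ranked result lists
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]  # illustrative; tune per corpus
)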
Complete Project Structure
rag-project/
├── ingest.py        # Load docs, chunk, embed, store
├── query.py         # Query interface
├── evaluate.py      # Evaluation pipeline
├── docs/            # Your PDF/text documents
├── chroma_db/       # Persisted vector database
└── requirements.txt
Key Takeaways
- RAG = Retrieval + Augmentation + Generation. Documents live in a vector DB; relevant chunks are retrieved per query.
- Chunking strategy and overlap significantly impact quality; experiment with sizes for your document type
- MMR retrieval gives more diverse, less redundant results than pure cosine similarity
- Always include an "I don't know" escape hatch in your prompt to prevent confabulation
- Evaluate with faithfulness, relevance, and correctness metrics before deploying to production