Why RAG? The Problem It Solves
LLMs are trained on data up to a cutoff date and know nothing about your internal documents: your network diagrams, runbooks, product docs, or research papers. You could fine-tune a model on your data, but that's expensive, slow, and the model might still "forget" or confabulate.
RAG solves this elegantly: retrieve relevant documents at query time and inject them into the prompt so the LLM can answer from actual context rather than from memory. The architecture is:
Document → Chunks → Embeddings → Vector DB (offline)
Query → Embed → Retrieve → Augment Prompt → LLM → Answer (online)
Prerequisites
pip install langchain langchain-openai langchain-community chromadb pypdf tiktoken
export OPENAI_API_KEY="sk-your-key-here"
Step 1: Load and Chunk Your Documents
Chunking strategy matters enormously. Too small = loss of context. Too large = diluted retrieval and token limit issues. The sweet spot is usually 500-1000 tokens with 10-20% overlap.
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load all PDFs from a directory
loader = DirectoryLoader('./docs/', glob="**/*.pdf", loader_cls=PyPDFLoader)
raw_docs = loader.load()
print(f"Loaded {len(raw_docs)} pages")
# Split into chunks
splitter = RecursiveCharacterTextSplitter(
chunk_size=800, # characters (~200 tokens)
chunk_overlap=100, # overlap between chunks
separators=["\n\n", "\n", ". ", " ", ""], # try larger separators first
length_function=len
)
chunks = splitter.split_documents(raw_docs)
print(f"Split into {len(chunks)} chunks")
Step 2: Create Embeddings and Store in ChromaDB
Embeddings convert text into numerical vectors in a high-dimensional space where semantically similar texts are geometrically close. We store these in ChromaDB for fast similarity search.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
# Initialize embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Create vector store (persisted to disk)
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db"
)
print(f"Stored {vectorstore._collection.count()} vectors")
For production, consider alternatives: text-embedding-3-large for higher quality, or a local model like nomic-embed-text via Ollama for zero API cost.
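As a concrete sketch of the local option (assuming Ollama is installed, running locally, and nomic-embed-text has already been pulled), only the embedding object changes; the rest of the pipeline stays the same. The separate persist directory is my own choice here, since local embedding dimensions differ from OpenAI's.
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
# Local embedding model served by Ollama (run `ollama pull nomic-embed-text` first)
local_embeddings = OllamaEmbeddings(model="nomic-embed-text")
# Re-ingest into a separate store: vector dimensions differ from text-embedding-3-small
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=local_embeddings,
    persist_directory="./chroma_db_local"
)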
Step 3: Build the Retrieval Chain
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
# Custom prompt: critical for quality answers
PROMPT_TEMPLATE = """You are a helpful assistant answering questions based on the provided context.
If the answer isn't in the context, say "I don't have enough information to answer that."
Never make up information.
Context:
{context}
Question: {question}
Answer:"""
prompt = PromptTemplate(
input_variables=["context", "question"],
template=PROMPT_TEMPLATE
)
# Load existing vector store
vectorstore = Chroma(
persist_directory="./chroma_db",
embedding_function=OpenAIEmbeddings(model="text-embedding-3-small")
)
# Create retriever: MMR gives more diverse results than cosine similarity alone
retriever = vectorstore.as_retriever(
search_type="mmr", # Maximal Marginal Relevance
search_kwargs={
"k": 5, # Retrieve 5 chunks
"fetch_k": 20, # From top-20 candidates
"lambda_mult": 0.7 # Balance relevance vs. diversity
}
)
# Build the chain
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff", # Stuffs all retrieved chunks into one prompt
retriever=retriever,
return_source_documents=True,
chain_type_kwargs={"prompt": prompt}
)
Step 4: Query Your Documents
def ask(question: str):
    result = qa_chain.invoke({"query": question})
    print(f"\nQuestion: {question}")
    print(f"\nAnswer: {result['result']}")
    print("\nSources:")
    seen = set()
    for doc in result['source_documents']:
        source = doc.metadata.get('source', 'Unknown')
        page = doc.metadata.get('page', '?')
        key = f"{source}:{page}"
        if key not in seen:
            seen.add(key)
            print(f"  - {source} (page {page})")
# Example queries
ask("What is the BGP holddown timer in our runbook?")
ask("How do we handle a VLAN mismatch between sites?")
ask("What are the escalation steps for P1 incidents?")
Step 5: Evaluate and Improve
RAG quality depends on three things: retrieval recall (did we get the right chunks?), retrieval precision (did we get too many irrelevant ones?), and generation quality (did the LLM use the context correctly?). To evaluate:
from langchain.evaluation import load_evaluator
# Evaluate faithfulness: does the answer stick to the retrieved context?
# "labeled_criteria" grades a prediction against a reference; faithfulness is supplied as a custom criterion
evaluator = load_evaluator(
    "labeled_criteria",
    criteria={"faithfulness": "Is every claim in the answer supported by the reference context?"}
)
result = evaluator.evaluate_strings(
    prediction=answer,            # the chain's generated answer
    input=question,               # the original question
    reference=retrieved_context   # the retrieved chunks, concatenated
)
print(result)
Production Considerations
- Chunking strategy: For code, chunk by function. For legal docs, by paragraph. For conversations, by turn. Don't use one-size-fits-all (see the first sketch after this list).
- Metadata filtering: Add metadata (department, date, document type) to chunks and pre-filter before semantic search to improve precision (second sketch below).
- Hybrid search: Combine BM25 (keyword) + vector search for best results, optionally with a reranker on top. LangChain's EnsembleRetriever handles the combination (third sketch below).
- Caching: Cache embeddings and repeated query results. OpenAI charges per token, so caching saves real money at scale.
- Observability: Add LangSmith or Langfuse tracing to monitor retrieval quality and LLM responses in production.
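To illustrate the chunking point, here is a minimal sketch of chunking source code by function with LangChain's language-aware splitter. Language.PYTHON, the sizes, and the raw_code_docs variable are illustrative assumptions, not values from this guide.
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
# Splits on class/def boundaries first, so whole functions tend to stay intact
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=800,
    chunk_overlap=100
)
code_chunks = code_splitter.split_documents(raw_code_docs)  # raw_code_docs: hypothetical list of loaded code files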
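For metadata filtering, a sketch assuming the chunks carry a hypothetical doc_type metadata field added at ingest time; Chroma can apply a simple equality filter before the similarity search runs.
# Pre-filter by metadata, then run semantic search only over the matching chunks
runbook_retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 5,
        "filter": {"doc_type": "runbook"}  # "doc_type" is a hypothetical metadata key set during ingestion
    }
)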
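And for hybrid search, a minimal sketch assuming the chunks list and vectorstore from the earlier steps are still in scope and the rank_bm25 package is installed; the weights are illustrative, and a dedicated reranker could still be layered on top of the fused results.
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
# Keyword retriever over the same chunks (pip install rank_bm25)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5
# Dense retriever from the existing Chroma store
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# Weighted fusion of both ranked result lists
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]  # illustrative; tune per corpus
)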
Complete Project Structure
rag-project/
├── ingest.py        # Load docs, chunk, embed, store
├── query.py         # Query interface
├── evaluate.py      # Evaluation pipeline
├── docs/            # Your PDF/text documents
├── chroma_db/       # Persisted vector database
└── requirements.txt
Key Takeaways
- RAG = Retrieval + Augmentation + Generation. Documents live in a vector DB; relevant chunks are retrieved per query.
- Chunking strategy and overlap significantly impact quality; experiment with sizes for your document type
- MMR retrieval gives more diverse, less redundant results than pure cosine similarity
- Always include an "I don't know" escape hatch in your prompt to prevent confabulation
- Evaluate with faithfulness, relevance, and correctness metrics before deploying to production