What Is Local RAG and Why You Should Care
Retrieval-Augmented Generation (RAG) lets you feed your own documents to an LLM so it grounds its answers in your data, which greatly reduces hallucination. The problem? Most RAG tutorials assume you're using OpenAI's API or Anthropic's Claude, which means sending your data to the cloud and paying per token.
There's a better way. You can build a fully local RAG system that runs on your own machine, keeps your data private, and costs nothing after setup. This guide walks you through building one using Ollama for local LLM inference and LangChain for orchestration.
Prerequisites
Before starting, make sure you have:
- Python 3.10+ installed
- Ollama installed from ollama.ai
- 8GB+ RAM (16GB recommended for larger models)
- A GPU (optional, but it speeds up inference significantly)
Step 1: Install Dependencies and Pull Models
First, install the required Python libraries:
pip install langchain langchain-community langchain-ollama chromadb pypdf flask
Next, pull two models: one for embeddings and one for generation:
ollama pull nomic-embed-text
ollama pull llama3.1:latest
The nomic-embed-text model creates vector embeddings of your documents. The llama3.1 model generates answers based on retrieved context. Verify both are installed by running ollama list.
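Before moving on, it can help to confirm that the Ollama server is actually reachable from Python. Ollama exposes a local HTTP API, and GET /api/tags returns the installed models as JSON. A minimal stdlib-only sketch (check_ollama assumes the default port 11434; missing_models is a helper name invented here):

```python
import json
import urllib.request

REQUIRED = {"nomic-embed-text:latest", "llama3.1:latest"}

def missing_models(tags_payload: dict, required: set) -> set:
    """Return the required model names absent from an /api/tags response."""
    installed = {m["name"] for m in tags_payload.get("models", [])}
    return required - installed

def check_ollama(host: str = "http://localhost:11434") -> set:
    """Ask the local Ollama server which models are installed."""
    with urllib.request.urlopen(f"{host}/api/tags") as resp:
        return missing_models(json.load(resp), REQUIRED)

# Offline demo with a payload shaped like Ollama's response:
sample = {"models": [{"name": "nomic-embed-text:latest"}]}
print(missing_models(sample, REQUIRED))  # {'llama3.1:latest'}
```

Calling check_ollama() with the server running should return an empty set; anything it returns still needs an ollama pull.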
Step 2: Ingest and Index Your Documents
Create a Python script called embed.py to load your PDFs and create a searchable vector database:
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Load PDFs from a directory
loader = PyPDFDirectoryLoader("your_documents/")
docs = loader.load()

# Split into chunks of 1000 characters with 200 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
# Create embeddings and store in ChromaDB
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
print(f"Indexed {len(splits)} document chunks")
Run this script to create your local vector database. The chunks are stored in ./chroma_db for reuse.
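To see why overlap matters, here's what overlapping chunks look like in miniature. RecursiveCharacterTextSplitter is smarter than this (it prefers paragraph and sentence boundaries before falling back to raw character counts), but a naive fixed-window chunker shows the idea:

```python
def chunk(text: str, size: int, overlap: int) -> list[str]:
    """Naive fixed-window chunker: each chunk repeats the last
    `overlap` characters of the previous chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

doc = "abcdefghij" * 3  # 30 characters
pieces = chunk(doc, size=10, overlap=4)
print(pieces[0])  # 'abcdefghij'
print(pieces[1])  # 'ghijabcdef': begins 4 characters before the previous chunk ended
```

Without that shared window, a sentence cut at a chunk boundary would be invisible to retrieval, since neither half would embed to anything close to the full sentence.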
Step 3: Build the Query Chain
Now create query.py to retrieve relevant context and generate answers:
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
# Load the vector database
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OllamaEmbeddings(model="nomic-embed-text")
)
llm = ChatOllama(model="llama3.1:latest")
# Multi-query retriever improves recall by generating variations of your question
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template=(
        "Generate five reworded versions of this question, "
        "one per line, without numbering: {question}"
    )
)
retriever = MultiQueryRetriever.from_llm(
    vectorstore.as_retriever(),
    llm,
    prompt=QUERY_PROMPT
)
# Prompt template that forces the LLM to use only the provided context
template = """Answer based ONLY on this context:
{context}
Question: {question}
Answer:"""
prompt = ChatPromptTemplate.from_template(template)
# Join retrieved Documents into one context string (passing the raw
# Document list would stringify it, metadata and all, into the prompt)
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
# Ask questions
result = chain.invoke("What is this document about?")
print(result)
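What does as_retriever() actually do? It embeds your question and ranks the stored chunks by vector similarity, typically cosine similarity. A stdlib sketch with made-up 3-dimensional vectors (real nomic-embed-text vectors have 768 dimensions, and the chunk names here are invented):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product of a and b over the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for three indexed chunks
chunks = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.8, 0.2],
    "warranty terms": [0.7, 0.2, 0.1],
}
query = [0.85, 0.15, 0.05]  # toy embedding of the user's question

ranked = sorted(chunks, key=lambda c: cosine(query, chunks[c]), reverse=True)
print(ranked[0])  # 'refund policy'
```

The top-ranked chunks are what ends up in the {context} slot of the prompt above.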
Step 4: Run It as a Web Service (Optional)
Want to expose your RAG system as an API? Install Flask (pip install flask) and append this to query.py:
from flask import Flask, request, jsonify
app = Flask(__name__)
@app.route('/query', methods=['POST'])
def handle_query():
    question = request.json['question']
    answer = chain.invoke(question)
    return jsonify({'answer': answer})

if __name__ == '__main__':
    app.run(debug=True)
Now you can POST questions to http://localhost:5000/query from any client.
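Any HTTP client works, including curl. A stdlib Python client is sketched below; build_request and query_rag are names invented here, matching the Flask route above:

```python
import json
import urllib.request

def build_request(question: str, url: str = "http://localhost:5000/query"):
    """Build the JSON POST request the /query route expects."""
    return urllib.request.Request(
        url,
        data=json.dumps({"question": question}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def query_rag(question: str) -> str:
    """Send a question to the local RAG service and return its answer."""
    with urllib.request.urlopen(build_request(question)) as resp:
        return json.load(resp)["answer"]

# With the Flask app running: print(query_rag("What is this document about?"))
```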
Key Tuning Parameters
Getting subpar results? Try adjusting these:
- chunk_size: Smaller chunks (500-800 characters) improve precision. Larger chunks (1000-1500) improve recall but may introduce noise.
- chunk_overlap: An overlap of 100-200 characters ensures context isn't lost at chunk boundaries.
- top_k: Retrieve more chunks (k=5-8) for complex questions, fewer (k=2-3) for simple ones.
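These knobs live in two places: the splitter in embed.py and the retriever in query.py. A sketch of where each change goes (the values shown are illustrative, not recommendations for every corpus):

```python
# embed.py: smaller chunks with modest overlap, tilted toward precision
text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=150)

# query.py: fetch the 5 most similar chunks per generated query instead of
# Chroma's default of 4, then hand the tuned retriever to MultiQueryRetriever
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
```

Re-run embed.py after changing the splitter settings; the stored chunks in ./chroma_db keep whatever size they were indexed with.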
What's Next?
This basic setup gives you a fully functional local RAG system. From here, you can explore hybrid search (combining keyword and vector search), reranking models for better accuracy, or swapping in other open models available through Ollama, such as Mistral or Qwen.
The entire pipeline runs offline, keeps your data on your machine, and costs nothing to operate once the initial setup is complete.