RAG for Solo Builders: Making AI Remember Your Business
Every solo builder hits the same wall with AI. You paste context into a prompt, get a useful response, and then the next conversation starts from zero. The AI has no memory. It doesn't know your clients, your conventions, your past decisions. You're doing the same onboarding every single session.
Retrieval-Augmented Generation fixes this. And you can run the entire thing locally.
What RAG Actually Is
RAG is a simple idea that gets buried under vendor marketing. Here's the core: instead of fine-tuning a model on your data (expensive, slow, requires expertise), you store your documents in a searchable format, retrieve the relevant ones at query time, and inject them into the prompt as context. The model reads your documents on the fly instead of memorizing them during training.
The practical difference is enormous. Fine-tuning a model costs hundreds to thousands of dollars and goes stale the moment your data changes. RAG costs almost nothing to update: adding a document takes one embedding pass. Drop a new document into the system and it's available immediately. Delete an obsolete one and it's gone. Your AI's knowledge stays in sync with your actual business.
For a solo builder, this means your AI can reference your past proposals, your client communication style, your technical standards, your pricing decisions, and your domain-specific terminology. Not because it was trained on them. Because it looked them up.
pgvector and Embeddings
The retrieval part of RAG depends on embeddings: numeric representations of text that capture semantic meaning. Similar documents produce similar vectors. A query about "client onboarding process" will match a document titled "New Client Setup Checklist" even though they share almost no words.
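To make "similar vectors" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made-up toy values chosen for illustration; real embeddings have hundreds or thousands of dimensions and come from the model, not from hand-picked numbers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (invented for this example, not real model output)
onboarding_query = [0.9, 0.1, 0.3]   # "client onboarding process"
setup_checklist = [0.8, 0.2, 0.4]    # "New Client Setup Checklist"
invoice_template = [0.1, 0.9, 0.2]   # an unrelated document

related = cosine_similarity(onboarding_query, setup_checklist)
unrelated = cosine_similarity(onboarding_query, invoice_template)
print(f"related: {related:.2f}, unrelated: {unrelated:.2f}")
```

Vectors pointing in nearly the same direction score close to 1, and vectors pointing in different directions score much lower. That score is what the database ranks on.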
You need somewhere to store these vectors and search them efficiently. If you're already running PostgreSQL, pgvector adds this capability to the database you already have. No new infrastructure. No vector database subscription.
-- Install the extension (one time)
CREATE EXTENSION IF NOT EXISTS vector;

-- Create a table for your documents
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    content TEXT NOT NULL,
    source TEXT,
    created_at TIMESTAMPTZ DEFAULT now(),
    embedding vector(1024)
);

-- Create an index for fast similarity search
CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
That's the entire infrastructure. A table with a vector column and an index. The vector(1024) dimension matches BGE-large embeddings, which you can generate locally with Ollama. If you're using OpenAI's text-embedding-3-small, change it to vector(1536). The dimension has to match your embedding model.
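Because a dimension mismatch only surfaces as a database error at insert time, it can help to check the length in application code first. The sketch below is one way to do that; `EXPECTED_DIM` and `to_vector_literal` are hypothetical names introduced for this example, not part of pgvector or Ollama.

```python
EXPECTED_DIM = 1024  # assumption: must equal N in the column's vector(N)

def to_vector_literal(embedding, expected_dim=EXPECTED_DIM):
    """Validate the embedding length and format it as pgvector text input."""
    if len(embedding) != expected_dim:
        raise ValueError(
            f"embedding has {len(embedding)} dimensions, "
            f"but the column expects {expected_dim}"
        )
    # pgvector accepts a bracketed, comma-separated list of numbers
    return "[" + ",".join(str(float(x)) for x in embedding) + "]"
```

If you switch embedding models, this fails loudly on the first document instead of partway through a batch import.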
Generating embeddings and inserting documents looks like this:
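import ollama
import psycopg2

def store_document(title, content, source=None):
    """Embed one document (or chunk) and insert it into the documents table."""
    # Generate the embedding locally, so there are no API costs
    response = ollama.embed(
        model="bge-large",
        input=content
    )
    embedding = response["embeddings"][0]

    conn = psycopg2.connect("dbname=myapp")
    try:
        # "with conn" commits on success and rolls back on error
        with conn, conn.cursor() as cur:
            cur.execute(
                """INSERT INTO documents (title, content, source, embedding)
                   VALUES (%s, %s, %s, %s::vector)""",
                # pgvector parses the list's text form, e.g. "[0.1, 0.2, ...]"
                (title, content, source, str(embedding))
            )
    finally:
        # Close the connection even if the insert fails
        conn.close()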
Retrieval is a single SQL query:
-- Find the 5 most relevant documents for a query embedding.
-- query_embedding stands for the embedded query text, passed in as a parameter.
SELECT title, content,
       1 - (embedding <=> query_embedding::vector) AS similarity
FROM documents
WHERE 1 - (embedding <=> query_embedding::vector) > 0.5
ORDER BY embedding <=> query_embedding::vector
LIMIT 5;
The <=> operator computes cosine distance. Subtracting from 1 converts it to cosine similarity, where higher means more relevant. The WHERE clause filters out weak matches. That 0.5 threshold is a starting point; you'll tune it based on your data.
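Here is a sketch of how that query might be called from Python. It assumes a psycopg2 connection and takes the embedding function as an argument; `retrieve`, `RETRIEVE_SQL`, and the `embed` callable are illustrative names, and the threshold and limit become parameters so you can tune them without editing SQL.

```python
RETRIEVE_SQL = """
    SELECT title, content,
           1 - (embedding <=> %(q)s::vector) AS similarity
    FROM documents
    WHERE 1 - (embedding <=> %(q)s::vector) > %(min_sim)s
    ORDER BY embedding <=> %(q)s::vector
    LIMIT %(k)s
"""

def retrieve(conn, embed, question, k=5, min_similarity=0.5):
    """Return (title, content, similarity) rows for the closest documents.

    conn  -- an open psycopg2 connection (assumed)
    embed -- hypothetical callable: text -> list of floats, using the same
             model that embedded the stored documents
    """
    params = {
        "q": str(list(embed(question))),  # text form pgvector can cast
        "min_sim": min_similarity,
        "k": k,
    }
    with conn.cursor() as cur:
        cur.execute(RETRIEVE_SQL, params)
        return cur.fetchall()
```

With Ollama, `embed` could be a small wrapper around the same `ollama.embed` call used at ingest time. The one hard requirement is that queries and documents go through the same model.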
Chunking Strategies That Actually Matter
Before you can embed a document, you have to decide how to split it. A 20-page proposal doesn't work as a single embedding. The vector tries to represent too many ideas at once and ends up representing none of them well. Chunking is how you break documents into pieces that each carry a coherent idea.
Three approaches, with real tradeoffs:
- Fixed-size chunks (400-600 tokens): Split every N tokens with some overlap. Dead simple to implement. Works acceptably for homogeneous documents like logs or transcripts where there's no strong structure. Falls apart on anything with headings, sections, or mixed content because the splits land in arbitrary places, cutting ideas in half.
- Semantic chunking: Split on paragraph boundaries, heading boundaries, or topic shifts. Requires more preprocessing but produces chunks that each represent a complete thought. A section titled "Pricing Methodology" stays intact instead of getting sliced across two fixed-size windows. This is the right default for most business documents.
- Document-level embedding: Embed the entire document as one vector, typically using the first 512 tokens or a generated summary. Works when your documents are already short and focused: individual emails, meeting notes, support tickets. Fails on long documents for the same reason oversized chunks fail.
I use semantic chunking for everything longer than ~800 tokens and document-level for everything shorter. The split is simple: if the document has headings, split on headings. If it doesn't, split on double newlines. Add 50-100 tokens of overlap between chunks so you don't lose context at the boundaries. Store both the chunk and a reference back to the source document so you can pull up the full original when needed.
def semantic_chunk(text, max_tokens=500, overlap=75):
    """Split text on paragraph boundaries, respecting max size.

    Sizes are counted in whitespace-separated words, which undercounts
    real tokens. A single paragraph longer than max_tokens is kept whole.
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = []
    current_len = 0
    for para in paragraphs:
        para_len = len(para.split())
        if current_len + para_len > max_tokens and current:
            chunks.append("\n\n".join(current))
            # Carry the last `overlap` words of the previous chunk forward
            # so context is not lost at the boundary
            tail = " ".join(current[-1].split()[-overlap:]) if overlap > 0 else ""
            current = [tail] if tail else []
            current_len = len(tail.split())
        current.append(para)
        current_len += para_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks
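The routing rule described above (keep short documents whole, split long ones on headings, fall back to double newlines) can be sketched like this. It assumes Markdown-style `#` headings, and it counts whitespace-separated words as a rough stand-in for tokens; `split_on_headings` and `chunk_document` are hypothetical names for this example.

```python
import re

# Assumption: headings look like Markdown ("# Title" through "###### Title")
HEADING = re.compile(r"^#{1,6} .+$", re.MULTILINE)

def split_on_headings(text):
    """Return one section per heading, plus any text before the first heading."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts:
        return []
    sections = []
    preamble = text[:starts[0]].strip()
    if preamble:
        sections.append(preamble)
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        sections.append(text[begin:end].strip())
    return sections

def chunk_document(text, short_limit=800):
    """Short documents stay whole; long ones split on headings, else paragraphs."""
    if len(text.split()) <= short_limit:
        return [text.strip()]
    sections = split_on_headings(text)
    if sections:
        return sections
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```

A section that is still too large after a heading split can be passed through `semantic_chunk` to break it down further on paragraph boundaries.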
The Retrieval Quality Trap
This is the thing nobody tells you about RAG: the hard problem isn't the infrastructure. pgvector works. Embeddings work. Cosine similarity works. The hard problem is what you put into the system.
Garbage in, hallucinations out. And the failure mode is subtle. The AI won't tell you it retrieved a bad document. It will confidently weave irrelevant or outdated information into its response, and the result will read perfectly well. You'll trust it because the prose is fluent, and you'll catch the error only when a client points out that the pricing is wrong or the process description is two versions old.
Three things that matter more than model size or embedding dimensions:
- Document freshness: If your knowledge base has three versions of a pricing document, the AI might retrieve any of them. Stale documents don't just take up space; they actively poison results. When you update a document, delete or archive the old version. A smaller, current knowledge base outperforms a large one full of historical artifacts.
- Document quality: A meeting transcript full of "um, yeah, so we were thinking maybe" produces terrible embeddings. Clean, structured documents with clear statements retrieve well. Messy, rambling ones don't. Spend time curating what goes in. A document that wouldn't be useful to a smart new hire isn't going to be useful to your AI either.
- Retrieval evaluation: You need to know whether your system is actually finding the right documents. The simplest approach: keep a running list of 20-30 test queries with expected results. Run them periodically. When retrieval quality drifts, you'll catch it before your outputs degrade.
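The test-query list from the last point can be as simple as the sketch below. It assumes you already have some search function that returns document titles for a query; `evaluate_retrieval` and the `search` callable are illustrative names, and the "hit rate" here just means the fraction of queries whose expected document showed up in the top k.

```python
def evaluate_retrieval(test_cases, search, k=5):
    """Run each test query and report how often the expected document appears.

    test_cases -- list of (query, expected_title) pairs you maintain by hand
    search     -- hypothetical callable: (query, k) -> list of retrieved titles
    """
    misses = []
    for query, expected_title in test_cases:
        retrieved = search(query, k)
        if expected_title not in retrieved:
            misses.append((query, expected_title, retrieved))
    if not test_cases:
        return 0.0, misses
    hit_rate = (len(test_cases) - len(misses)) / len(test_cases)
    return hit_rate, misses
```

The list of misses is usually more useful than the number: each one tells you which document failed to surface and what came back instead.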
Curation matters more than model size. A well-curated knowledge base of 200 documents with a mid-tier embedding model will produce better results than 10,000 unorganized documents with the most expensive embeddings available.
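One way to keep stale versions out is to make every update a replacement. The sketch below assumes a psycopg2 connection, and it assumes the `source` column uniquely identifies a logical document (for example, its file path). That is a convention you would have to adopt, not something the schema enforces. `replace_document` is a hypothetical helper.

```python
def replace_document(conn, source, title, content, embedding):
    """Delete every stored version under `source`, then insert the new one.

    Both statements run in one transaction, so a failed insert does not
    leave the knowledge base without any version of the document.
    """
    with conn:  # psycopg2: commit on success, roll back on exception
        with conn.cursor() as cur:
            cur.execute("DELETE FROM documents WHERE source = %s", (source,))
            removed = cur.rowcount  # how many old rows were dropped
            cur.execute(
                """INSERT INTO documents (title, content, source, embedding)
                   VALUES (%s, %s, %s, %s::vector)""",
                (title, content, source, str(list(embedding))),
            )
    return removed
```

If you would rather archive than delete, the same shape works with an UPDATE that flags old rows, as long as the retrieval query filters them out.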
The Complete Pipeline
Putting it together, a working RAG pipeline for a solo builder has four steps:
- Ingest: New documents get chunked, embedded (locally via Ollama, ~1-2 seconds per document), and stored in PostgreSQL with pgvector. Old versions get archived or deleted.
- Retrieve: When you query the system, your question gets embedded with the same model, and pgvector returns the top 5-10 most similar chunks.
- Augment: The retrieved chunks get injected into the prompt as context, along with your query. The model reads them as if you'd pasted them in manually.
- Generate: The model responds using both its general knowledge and the specific context from your documents.
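The augment step is mostly string assembly. Here is a minimal sketch, assuming the retrieved chunks arrive as (title, content) pairs with the best match first; `build_prompt` and the instruction wording are illustrative choices, not a prescribed format.

```python
def build_prompt(question, chunks, max_chunks=5):
    """Assemble a prompt from retrieved chunks and the user's question.

    chunks -- list of (title, content) tuples, best match first
    """
    blocks = []
    for number, (title, content) in enumerate(chunks[:max_chunks], start=1):
        # Number each chunk so the model can point back to its source
        blocks.append(f"[{number}] {title}\n{content}")
    context = "\n\n".join(blocks) if blocks else "(no relevant documents found)"
    return (
        "Answer using the context below. If the context does not cover "
        "the question, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The resulting string goes to whichever model you use for generation. Telling the model to admit when the context falls short is a cheap guard against the fluent-but-wrong failure mode described earlier.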
The total cost of running this locally: zero dollars per month. PostgreSQL is free. pgvector is free. Ollama runs embedding models locally on any Mac with 8GB+ of RAM. The only cost is your time setting it up, which is a few hours at most, and the occasional curation pass to keep the knowledge base clean.
If you're already running Postgres for your application, you're adding one extension and one table. Not a new service. Not a new subscription. Not a new vendor whose pricing page you'll need to check quarterly.
What to Put in the Knowledge Base
Start with the documents you find yourself re-reading before doing work. These are the ones your AI needs too:
- Client briefs and preferences: Communication style, terminology they use, past feedback. The context that turns generic output into something that sounds like it was written for this specific client.
- Your past deliverables: Proposals, reports, code patterns, email templates. Not as things to copy, but as reference material that informs new work.
- Process documentation: How you handle onboarding, how you structure a project, what your review checklist looks like. The tribal knowledge that lives in your head.
- Domain-specific reference material: Industry standards, regulatory requirements, technical specifications that your work depends on.
Don't dump everything in on day one. Start with 20-30 high-value documents. Use the system for a week. Notice which queries return poor results and add the documents that would fix them. The knowledge base grows best when it's shaped by actual usage rather than speculative completeness.
The Catch
RAG is not a magic memory transplant. There are real limitations.
Embedding models have a context window: 512 tokens for older models like BGE-large, up to 8,192 for newer ones. Text beyond that limit is typically truncated before embedding, so it never influences the vector. Long documents need chunking, and chunking severs connections between chunks and between documents. If the answer to a question depends on synthesizing information from three different documents, retrieval might find one or two of them but miss the third.
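A quick way to spot chunks at risk of truncation is to estimate their token count before embedding. The sketch below uses a rough heuristic of about 1.3 tokens per English word. That ratio is an assumption, not a property of any particular tokenizer, so treat the output as a warning list rather than an exact measurement. `flag_oversized` is a hypothetical helper.

```python
def flag_oversized(chunks, token_limit=512, tokens_per_word=1.3):
    """Return (index, estimated_tokens) for chunks likely to exceed the limit.

    tokens_per_word is a rough guess for English prose; measure with your
    model's real tokenizer if you need exact counts.
    """
    flagged = []
    for index, chunk in enumerate(chunks):
        estimate = len(chunk.split()) * tokens_per_word
        if estimate > token_limit:
            flagged.append((index, round(estimate)))
    return flagged
```

Note that a chunk of 500 words, which looks safely under a 512-token limit, is flagged here because words and tokens are not the same unit.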
Cosine similarity is not understanding. A document about "cancellation policies" might not match a query about "how to handle a client who wants to stop the project" even though they're about the same thing, because the language is different enough to produce distant embeddings. You'll need to experiment with how you phrase both your documents and your queries.
And the curation problem never goes away. Your knowledge base needs maintenance the same way any system of record does. Documents go stale. New patterns emerge that need to be captured. The flywheel only works if you keep feeding it.
These are solvable problems. But they require ongoing attention, not a one-time setup.
Why This Matters for Solo Builders
A consultant at a firm has colleagues to remind them of past approaches, institutional knowledge bases maintained by dedicated staff, and partners who remember the client's preferences from three years ago. A solo builder has none of that. Every piece of context lives in their head or scattered across files they might not find in time.
RAG replaces the institutional memory you don't have. Not perfectly. But well enough that the AI stops being a generic tool and starts being a tool that knows your business. The difference between a generic AI response and one grounded in your actual documents, your actual clients, and your actual standards is the difference between a first draft you throw away and a first draft you edit.
The infrastructure is a Postgres extension and an embedding model. The ongoing cost is zero. The ongoing work is curation. And the curation is the same work you'd be doing anyway if you were keeping your knowledge organized for yourself.
Your AI is only as good as what it can remember. Give it something worth remembering.