Most RAG demos die in production. They work on the 10-question demo deck and break the moment users ask anything novel. This guide is the architecture and patterns that hold up: chunking that respects semantic boundaries, hybrid retrieval that catches both fuzzy and literal queries, prompt structure that prevents hallucination, and cost discipline that keeps the bill flat as you scale.

The five decisions that matter

Chunking strategy — affects retrieval quality more than the model
Embedding model — picks the recall ceiling
Retrieval strategy — vector alone is not enough
Prompt structure — decides whether the model hallucinates
Generation model + caching — sets the unit economics

1. Chunking — boring beats clever

Don't overthink this. Recursive character splitter with 1000-1500 characters per chunk and 100-200 character overlap. Tries to break on paragraph boundaries first, then sentences, then words. Stable across domains:

chunk.py

# Recursive character splitter with overlap — the boring choice that wins
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document)
# 1200 chars ≈ 300-400 tokens — fits 5+ chunks in a Sonnet context window
# with room for system prompt + question + answer

Things that look smarter but rarely help: sentence-aware models, markdown-aware splitters, sliding window with embeddings. They're not bad — they're marginally better at huge effort. Spend that effort on retrieval instead.

2. Embedding model — text-embedding-3-large is the default

text-embedding-3-large hits the multilingual + accuracy sweet spot for most production use cases. On Kunavo it's about $0.10/1M tokens — embedding 1M chunks costs ~$25, a rounding error compared to the LLM bill. Re-embed when you change the chunking strategy or the model itself; not when content updates (just add new chunks).

Exceptions where smaller embeddings work: pure English short-text retrieval (FAQs, product cards). For those text-embedding-3-smallis 4x cheaper with negligible quality loss. For multilingual or technical content, stay on large.

3. Retrieval — pure vector loses to hybrid

Vector similarity is great at fuzzy semantic matching ("how do I cancel" → "subscription cancellation policy"). It's terrible at literal terms ("SKU-A92837" or "order #4729"). The fix is hybrid: BM25 keyword search in parallel, then reciprocal rank fusion:

retrieve.py

# Hybrid retrieval: vector similarity + BM25 keyword scoring
# Pure vector misses literal keywords (product SKUs, IDs, dates)
def retrieve(question: str, k: int = 5) -> list[dict]:
    q_embed = embed([question])[0]
    vector_hits = vector_db.similarity_search(q_embed, k=k * 2)
    keyword_hits = bm25_search(question, k=k * 2)

    # Reciprocal rank fusion — simple, robust
    scores: dict[str, float] = {}
    for rank, hit in enumerate(vector_hits):
        scores[hit["id"]] = scores.get(hit["id"], 0) + 1 / (rank + 60)
    for rank, hit in enumerate(keyword_hits):
        scores[hit["id"]] = scores.get(hit["id"], 0) + 1 / (rank + 60)

    top_ids = sorted(scores, key=scores.get, reverse=True)[:k]
    return [chunk_by_id[i] for i in top_ids]

This is the single biggest quality lift after picking a decent embedding model. Track recall@5 on a held-out 100-question eval — if it goes from 70% to 90% after adding BM25, you just removed a third of the "I don't know" failures.

4. Prompt structure — three patterns that prevent hallucination

answer.py

# Final RAG prompt structure — three sections, citations enforced
SYSTEM_PROMPT = """You answer based exclusively on the supplied Context.
- Cite the [doc_id] for each factual claim.
- If the context doesn't answer the question, say "I don't have that
  information" — do not extrapolate or use general knowledge.
- Be concise. No throat-clearing."""

def answer(question: str) -> dict:
    chunks = retrieve(question, k=5)
    context = "\n\n---\n\n".join(
        f"[doc:{c['id']}] {c['text']}" for c in chunks
    )
    resp = client.chat.completions.create(
        model="claude-sonnet-4-6",
        messages=[
            {"role": "system", "content": [{
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }]},
            {"role": "user", "content": f"# Context\n{context}\n\n# Question\n{question}"},
        ],
        max_tokens=600,
    )
    return {
        "text": resp.choices[0].message.content,
        "sources": [c["id"] for c in chunks],
    }

Three non-negotiables in the system prompt:

Cite sources — make the model attach [doc:42]to each claim. If the cited ID isn't real, you caught a hallucination
Refuse explicitly — "if context doesn't answer, say I don't have that information." Without this the model completes from world knowledge
Concise output — short answers correlate with accurate answers. Max 600 tokens is a good production default

5. Generation model + caching — the cost layer

Claude Sonnet 4.6 is the default. Gemini 2.5 Flash is the cheap fallback if you're cost-pressed. Always wrap the system prompt in cache_control — the system prompt is the same every call, so caching cuts that portion to 10% of the input rate (Anthropic) or 20% (Gemini).

Realistic per-query cost at production scale (5K context, 500 output, caching on):

Claude Sonnet 4.6: ~$0.007 per query
Claude Haiku 4.5: ~$0.001 per query (4x cheaper, ~85% quality)
Gemini 2.5 Flash: ~$0.0008 per query (5x cheaper, similar quality)

At 10,000 queries/day → $70/day on Sonnet, $10/day on Haiku. Pick Sonnet for high-stakes use cases (customer-facing, legal, medical), Haiku for internal/bulk processing.

What breaks in production (and how to catch it)

Distribution drift: training corpus stops matching real user questions. Sample 100 queries/week, check recall@5 manually
Stale embeddings: source docs updated but embeddings not refreshed. Track index size vs source size weekly
Fake citations: model invents doc IDs that look real. Validate every [doc:N] against the actual list of retrieved IDs before rendering
Latency cliffs: vector DB blows up past a million vectors. Use HNSW indexing, partition by tenant if multi-tenant

A 4-week production roadmap

Week 1: prototype with 100 docs, 10 test questions, manual eval
Week 2: scale to full corpus, build hybrid retrieval, write 100-question eval set
Week 3: tune chunking + system prompt iterating on the eval, ship to internal users
Week 4: monitoring, cost budget, public release with cite-source UI

RAG implementation guide — production retrieval-augmented generation with Claude and Gemini