The default modern KB chatbot architecture
Traditional KB search returns links; the user reads them. Retrieval-augmented generation (RAG) turns that into a one-shot conversational answer with citations. Done well, users get answers in 2 seconds instead of digging through 5 documents. Done badly, the chatbot hallucinates and your CTO bans the project for a year. This page covers what separates the two.
The minimum viable production stack
- Vector store: pgvector if you already have Postgres, Pinecone or Qdrant if you want a managed service
- Embeddings: text-embedding-3-large via Kunavo ($0.10/1M tokens)
- Retrieval: hybrid (vector + BM25 with reciprocal rank fusion)
- Generation: Claude Sonnet 4.6 with cache_control on the system prompt
- UI: streaming responses, citation rendering, "I don't know" fallback
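The Retrieval item above names reciprocal rank fusion (RRF) as the way to merge vector and BM25 results. A minimal sketch of that merge step follows, assuming each retriever returns a ranked list of document ids. The function name, the sample ids, and the constant `k=60` (the value conventionally used for RRF) are illustrative assumptions, not part of the stack above.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60, top_n: int = 5) -> list[str]:
    """Fuse several ranked lists of doc ids into a single ranking.

    Each document scores 1 / (k + rank) in every list it appears in;
    the scores are summed across lists.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so the order is deterministic.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:top_n]


# Hypothetical results from the two retrievers (ids are made up).
vector_hits = ["a", "b", "c", "d"]
bm25_hits = ["c", "a", "e", "f"]
print(reciprocal_rank_fusion([vector_hits, bm25_hits], top_n=3))  # ['a', 'c', 'b']
```

Documents that rank well in both lists rise to the top, which is why RRF needs no score normalization between the vector and BM25 sides.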
```python
from openai import OpenAI

# The stock OpenAI client, pointed at the Kunavo endpoint via base_url.
client = OpenAI(api_key="sk-kunavo-...", base_url="https://api.kunavo.com/v1")


def chat(question: str, history: list[dict]) -> dict:
    # hybrid_retrieve and parse_citations are your own helpers; they are not defined here.
    chunks = hybrid_retrieve(question, k=5)  # vector + BM25
    # Prefix each chunk with its id so the model can cite it.
    context = "\n\n---\n\n".join(
        f"[doc:{c['id']}] {c['text']}" for c in chunks
    )
    resp = client.chat.completions.create(
        model="claude-sonnet-4-6",
        messages=[
            {"role": "system", "content": [{
                "type": "text",
                "text": (
                    "Answer only from Context. Cite [doc:N] for each claim. "
                    "If Context doesn't answer, say 'I don't have that.' "
                    "Be concise, no throat-clearing."
                ),
                # Mark the static system prompt as cacheable.
                "cache_control": {"type": "ephemeral"},
            }]},
            *history,
            {"role": "user", "content": f"# Context\n{context}\n\n# Q\n{question}"},
        ],
        max_tokens=600,
    )
    answer = resp.choices[0].message.content
    cited_ids = parse_citations(answer)  # extract [doc:N] references
    # Assumes parse_citations returns ids of the same type as c["id"].
    return {"answer": answer, "sources": [c for c in chunks if c["id"] in cited_ids]}
```
Cost at production scale
- Initial indexing: ~$0.25 for 5,000 docs of 500 tokens each
- 1,000 queries/day: ~$210/month with caching
- 10,000 queries/day: ~$2,100/month
- With Haiku 4.5 instead of Sonnet: ~4x cheaper, at ~85% of the answer quality
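The figures above can be checked with simple arithmetic. The sketch below reproduces the indexing estimate from the $0.10/1M-token embedding price and derives the per-query cost implied by the monthly figures. The function names and the 30-day month are assumptions made for illustration.

```python
EMBED_PRICE_PER_M_TOKENS = 0.10  # USD per 1M tokens, from the stack list


def indexing_cost(num_docs: int, tokens_per_doc: int,
                  price_per_m: float = EMBED_PRICE_PER_M_TOKENS) -> float:
    """One-off embedding cost in USD for the whole corpus."""
    return num_docs * tokens_per_doc / 1_000_000 * price_per_m


def implied_cost_per_query(monthly_cost: float, queries_per_day: int, days: int = 30) -> float:
    """Per-query cost implied by a monthly bill (assumes a 30-day month)."""
    return monthly_cost / (queries_per_day * days)


print(round(indexing_cost(5_000, 500), 2))            # 0.25
print(round(implied_cost_per_query(210, 1_000), 4))   # 0.007
print(round(implied_cost_per_query(2_100, 10_000), 4))  # 0.007
```

Both monthly figures work out to the same cost per query, so the estimate scales linearly with traffic.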
Full architectural breakdown in the RAG implementation guide. Language-specific tuning notes in the Japanese RAG deep dive and Spanish RAG guide.
The three patterns that actually prevent hallucination
- Cite every claim: `[doc:42]` tags in the model output. If the cited id isn't in the retrieved set, the model hallucinated it; block and log
- Explicit refusal in system prompt: "If Context doesn't answer, say 'I don't have that.'" Without this, the model fills in from world knowledge
- Output cap of 600 tokens: short answers are usually accurate answers. Longer outputs are where extra inventions sneak in
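The citation check in the first pattern can be sketched as follows. This is one possible shape for the `parse_citations` helper used in the `chat` function, plus a hypothetical `unknown_citations` check; the regex and the choice to compare ids as strings are assumptions, not the original implementation.

```python
import re

# Matches tags such as [doc:42] or [doc:kb-7] and captures the id.
CITATION_RE = re.compile(r"\[doc:([^\]\s]+)\]")


def parse_citations(answer: str) -> set[str]:
    """Return the set of doc ids cited in the answer, as strings."""
    return set(CITATION_RE.findall(answer))


def unknown_citations(answer: str, chunks: list[dict]) -> set[str]:
    """Ids the model cited that were never retrieved; non-empty means block and log."""
    retrieved = {str(c["id"]) for c in chunks}
    return parse_citations(answer) - retrieved


# Made-up example: doc 99 was never retrieved, so it is flagged.
chunks = [{"id": 42, "text": "..."}, {"id": 7, "text": "..."}]
answer = "Resets take 24 hours [doc:42]. Refunds are instant [doc:99]."
print(unknown_citations(answer, chunks))  # {'99'}
```

A refusal such as "I don't have that." contains no tags, so it passes this check with an empty set.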
What to ship in week 1 vs week 4
| Week | Milestone |
|---|---|
| 1 | 100 docs, single chunking strategy, basic vector search, 10-question eval set, ~70% recall@5 |
| 2 | Full corpus, hybrid retrieval, 100-question eval set, citations enforced, internal beta |
| 3 | Tune chunking based on failed questions, ship to internal users, measure CSAT |
| 4 | Monitoring + cost dashboard, daily spend cap, public beta or production launch |
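The recall@5 target in week 1 can be measured with a small harness like the one below. It is a sketch under two assumptions: each eval question has exactly one expected document, and the retriever returns dicts with an `id` key, as `hybrid_retrieve` does in the `chat` function. The `expected_doc_id` field, the stand-in retriever, and the sample ids are hypothetical.

```python
from typing import Callable


def recall_at_k(eval_set: list[dict], retrieve: Callable[..., list[dict]], k: int = 5) -> float:
    """Fraction of questions whose expected doc appears in the top-k results."""
    if not eval_set:
        return 0.0
    hits = 0
    for item in eval_set:
        retrieved_ids = [c["id"] for c in retrieve(item["question"], k=k)]
        if item["expected_doc_id"] in retrieved_ids:
            hits += 1
    return hits / len(eval_set)


def fake_retrieve(question: str, k: int = 5) -> list[dict]:
    """Stand-in retriever with canned results, for demonstration only."""
    index = {
        "How do I reset my password?": ["kb-12", "kb-3"],
        "What is the refund window?": ["kb-8"],
    }
    return [{"id": d} for d in index.get(question, [])][:k]


eval_set = [
    {"question": "How do I reset my password?", "expected_doc_id": "kb-12"},
    {"question": "What is the refund window?", "expected_doc_id": "kb-40"},
]
print(recall_at_k(eval_set, fake_retrieve))  # 0.5
```

The questions that miss are the ones to review when tuning chunking in week 3.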
Start at /app/signup
$2 free credit covers initial indexing of ~50K docs and 300 test queries — more than enough for a functional prototype. Then read the complete RAG guide for production patterns.