Docs
Prompt caching
Mark a stable prompt prefix once with cache_control. Every subsequent call that reuses the prefix bills it back at a fraction of input price — 10% on Anthropic, 20% on OpenAI and Gemini. Kunavo's routing layer keeps the cache warm by pinning your conversation to the same upstream key.
On a 30K-token system prompt with Claude Sonnet 4.6, a cache read costs $0.13 / 1M instead of $1.28 / 1M. For an agent that re-sends the same context 50 times a session, the cache saves ~90% of the input bill.
How to enable it
Add a cache_control: { type: "ephemeral" } breakpoint on any system block, message, or tool definition. The model caches the span up to that breakpoint on the first call; subsequent calls that share the same prefix bytes hit the cache.
from openai import OpenAI
client = OpenAI(
api_key="sk-kunavo-...",
base_url="https://api.kunavo.com/v1",
)
# Mark the large, stable prefix once with cache_control.
# Every subsequent call that sends the same prefix reads it back
# from cache at ~10% of the input rate on Anthropic models.
SYSTEM = [
{
"type": "text",
"text": open("long_system_prompt.txt").read(),
"cache_control": {"type": "ephemeral"},
},
]
resp = client.chat.completions.create(
model="claude-sonnet-4-6",
messages=[
{"role": "system", "content": SYSTEM},
{"role": "user", "content": "Summarize the doc above in 3 bullets."},
],
)
u = resp.usage
# OpenAI-compat fields populated by Kunavo:
# - prompt_tokens_details.cached_tokens — read from cache
# - cache_creation_input_tokens — written to cache this call
print(u.prompt_tokens_details.cached_tokens,
getattr(u, "cache_creation_input_tokens", 0))Identical wire-format on the native Messages API — cache_control passes through to Claude untranslated:
# Native Anthropic Messages API — cache_control passes through untranslated.
curl https://api.kunavo.com/v1/messages \
-H "Authorization: Bearer $KUNAVO_API_KEY" \
-H "anthropic-version: 2023-06-01" \
-H "Content-Type: application/json" \
-d '{
"model": "claude-sonnet-4-6",
"system": [
{
"type": "text",
"text": "<your large stable system prompt>",
"cache_control": { "type": "ephemeral" }
}
],
"messages": [
{ "role": "user", "content": "Summarize the doc above in 3 bullets." }
],
"max_tokens": 1024
}'Pricing
Cache rates derive from each model's input price. Anthropic cache reads bill at 0.10× input. OpenAI and Gemini cache reads bill at 0.20× input. Kunavo does not pass through Anthropic's upstream cache-write surcharge — cache writes are billed to you at the plain input rate, so there's no penalty for warming up a new cache entry.
Anthropic — 0.10× cache read
| Model | Input | Cache read |
|---|---|---|
| Claude Opus 4.7 | $3.5 | $0.35 |
| Claude Opus 4.6 | $3.5 | $0.35 |
| Claude Sonnet 4.6 | $2.1 | $0.21 |
| Claude Sonnet 4.5 | $2.1 | $0.21 |
| Claude Haiku 4.5 | $0.7 | $0.07 |
OpenAI / Gemini — 0.20× cache read
| Model | Input | Cache read |
|---|---|---|
| Gemini 3 Pro | $1.4 | $0.28 |
| Gemini 3.1 Pro | $1.4 | $0.28 |
| Gemini 3 Flash | $0.35 | $0.07 |
| Gemini 2.5 Pro | $0.875 | $0.175 |
| Gemini 2.5 Flash | $0.21 | $0.042 |
max(catalog_cost, upstream × markup) per call — the floor is computed against the cached-input rate, not the fresh rate, so the savings reach you instead of being flattened.Affinity routing keeps the cache warm
Upstream prompt caches are per-API-key. If every request lands on a random upstream key, the cache hit rate is ~1/N. Kunavo derives a stable hash from your system prompt + first user message and routes all calls with the same prefix to the same upstream key — using weighted rendezvous hashing, so only ~1/N keys are remapped when the pool changes.
You don't need to configure anything. Affinity routing applies automatically to all chat calls. (Image / video / TTS use random routing — there is no per-key cache for those modalities.)
Dashboard metrics
Once your calls start hitting the cache, /app/usage surfaces two rolled-up metrics for the selected window:
- Hit rate — cached input tokens ÷ total input tokens. Exact, computed from per-call token counts.
- Saved — USD saved this window, computed per model with its real cache-read rate (no flat-average approximations). This is an estimate at catalog rates — since real bills are floored at catalog and cached calls can land below the floor, the number is a tight upper bound on actual savings. For typical chat workloads the two are within a few percent.
Each call's detail page (/app/usage/<id>) shows the cached + cache-write token counts when they're non-zero, and the actual billed amount.
Response fields
Kunavo surfaces cache tokens in both response shapes. On the OpenAI-compatible /v1/chat/completions endpoint:
| Field | Meaning |
|---|---|
usage.prompt_tokens | Total input tokens (includes cached + cache-write). |
usage.prompt_tokens_details.cached_tokens | Subset of input served from cache. |
usage.cache_creation_input_tokens | Tokens written to cache this call (billed as fresh input). |
On the native Messages API at /v1/messages, the Anthropic-original fields pass through:
| Field | Meaning |
|---|---|
input_tokens | Fresh (uncached) input. |
cache_read_input_tokens | Served from cache (billed 0.10× input). |
cache_creation_input_tokens | Written to cache this call (billed at the plain input rate — Kunavo doesn't pass through Anthropic's upstream write surcharge). |
Where to go next
- Messages API reference — native Anthropic shape with worked cache examples.
- Billing & the ledger — how the cache-aware max() floor interacts with upstream cost.
- Full pricing table — fresh input / output prices for every chat model, with the cache sub-table at the bottom.