Docs

Prompt caching

Mark a stable prompt prefix once with cache_control. Every subsequent call that reuses the prefix bills it back at a fraction of input price — 10% on Anthropic, 20% on OpenAI and Gemini. Kunavo's routing layer keeps the cache warm by pinning your conversation to the same upstream key.

Esta documentação está em inglês. Para um guia de início rápido em português, veja:Guia em português — Claude API no Brasil →

On a 30K-token system prompt with Claude Sonnet 4.6, a cache read costs $0.27 / 1M instead of $2.70 / 1M. For an agent that re-sends the same context 50 times a session, the cache saves ~90% of the input bill.

How to enable it

Add a cache_control: { type: "ephemeral" } breakpoint on any system block, message, or tool definition. The model caches the span up to that breakpoint on the first call; subsequent calls that share the same prefix bytes hit the cache.

cache_example.py

from openai import OpenAI

client = OpenAI(
    api_key="sk-kunavo-...",
    base_url="https://api.kunavo.com/v1",
)

# Mark the large, stable prefix once with cache_control.
# Every subsequent call that sends the same prefix reads it back
# from cache at ~10% of the input rate on Anthropic models.
SYSTEM = [
    {
        "type": "text",
        "text": open("long_system_prompt.txt").read(),
        "cache_control": {"type": "ephemeral"},
    },
]

resp = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "Summarize the doc above in 3 bullets."},
    ],
)

u = resp.usage
# OpenAI-compat fields populated by Kunavo:
# - prompt_tokens_details.cached_tokens — read from cache
# - cache_creation_input_tokens          — written to cache this call
print(u.prompt_tokens_details.cached_tokens,
      getattr(u, "cache_creation_input_tokens", 0))

Identical wire-format on the native Messages API — cache_control passes through to Claude untranslated:

messages_api.sh

# Native Anthropic Messages API — cache_control passes through untranslated.
curl https://api.kunavo.com/v1/messages \
  -H "Authorization: Bearer $KUNAVO_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-6",
    "system": [
      {
        "type": "text",
        "text": "<your large stable system prompt>",
        "cache_control": { "type": "ephemeral" }
      }
    ],
    "messages": [
      { "role": "user", "content": "Summarize the doc above in 3 bullets." }
    ],
    "max_tokens": 1024
  }'

Stable bytes matter. The cache key is content-addressed. A single different character anywhere in the prefix invalidates the cache for that span. Put dynamic content (timestamps, user names, per-call IDs) after the cached block, not inside it.

Pricing

Cache rates derive from each model's input price. Anthropic cache reads bill at 0.10× input. OpenAI and Gemini cache reads bill at 0.20× input. Kunavo does not pass through Anthropic's upstream cache-write surcharge — cache writes are billed to you at the plain input rate, so there's no penalty for warming up a new cache entry.

Anthropic — 0.10× cache read

Model	Input	Cache read
Claude Fable 5	$7	$0.7
Claude Opus 4.7	$2	$0.2
Claude Opus 4.6	$2	$0.2
Claude Sonnet 5	$2.1	$0.21
Claude Sonnet 4.6	$1.2	$0.12
Claude Haiku 4.5	$0.4	$0.04

OpenAI / Gemini — 0.20× cache read

Model	Input	Cache read
Gemini 2.5 Pro	$0.375	$0.075
Gemini 2.5 Flash	$0.09	$0.018
GPT-5.4	$1	$0.2
GPT-5.5	$2	$0.4
GPT-5.4 Mini	$0.225	$0.045
GPT-5.3 Codex	$0.7	$0.14
GPT-5.5 Pro	$12	$2.4

The catalog floor is cache-aware. Kunavo bills max(catalog_cost, upstream × markup) per call — the floor is computed against the cached-input rate, not the fresh rate, so the savings reach you instead of being flattened.

The rates above are catalog prices. The real per-call bill is computed from the upstream credits the model actually consumed (with a markup), then floored at the catalog rate. On cache hits the upstream charges fewer credits — so the upstream-times-markup branch usually drops below catalog and the catalog branch wins. In short: the catalog rate is your worst-case price; cached calls may land below it but never above.

Affinity routing keeps the cache warm

Upstream prompt caches are per-API-key. If every request lands on a random upstream key, the cache hit rate is ~1/N. Kunavo derives a stable hash from your system prompt + first user message and routes all calls with the same prefix to the same upstream key — using weighted rendezvous hashing, so only ~1/N keys are remapped when the pool changes.

You don't need to configure anything. Affinity routing applies automatically to all chat calls. (Image / video / TTS use random routing — there is no per-key cache for those modalities.)

Dashboard metrics

Once your calls start hitting the cache, /app/usage surfaces two rolled-up metrics for the selected window:

Hit rate — cached input tokens ÷ total input tokens. Exact, computed from per-call token counts.
Saved — USD saved this window, computed per model with its real cache-read rate (no flat-average approximations). This is an estimate at catalog rates — since real bills are floored at catalog and cached calls can land below the floor, the number is a tight upper bound on actual savings. For typical chat workloads the two are within a few percent.

Each call's detail page (/app/usage/<id>) shows the cached + cache-write token counts when they're non-zero, and the actual billed amount.

Response fields

Kunavo surfaces cache tokens in both response shapes. On the OpenAI-compatible /v1/chat/completions endpoint:

Field	Meaning
`usage.prompt_tokens`	Total input tokens (includes cached + cache-write).
`usage.prompt_tokens_details.cached_tokens`	Subset of input served from cache.
`usage.cache_creation_input_tokens`	Tokens written to cache this call (billed as fresh input).

On the native Messages API at /v1/messages, the Anthropic-original fields pass through:

Field	Meaning
`input_tokens`	Fresh (uncached) input.
`cache_read_input_tokens`	Served from cache (billed 0.10× input).
`cache_creation_input_tokens`	Written to cache this call (billed at the plain input rate — Kunavo doesn't pass through Anthropic's upstream write surcharge).

Where to go next

Messages API reference — native Anthropic shape with worked cache examples.
Billing & the ledger — how the cache-aware max() floor interacts with upstream cost.
Full pricing table — fresh input / output prices for every chat model, with the cache sub-table at the bottom.