Docs

Prompt caching

Mark a stable prompt prefix once with cache_control. Every subsequent call that reuses the prefix bills it back at a fraction of input price — 10% on Anthropic, 20% on OpenAI and Gemini. Kunavo's routing layer keeps the cache warm by pinning your conversation to the same upstream key.

Bu doküman İngilizce'dir. Türkçe hızlı başlangıç kılavuzu için:Türkçe rehber — Türkiye'den Claude API

On a 30K-token system prompt with Claude Sonnet 4.6, a cache read costs $0.13 / 1M instead of $1.28 / 1M. For an agent that re-sends the same context 50 times a session, the cache saves ~90% of the input bill.

How to enable it

Add a cache_control: { type: "ephemeral" } breakpoint on any system block, message, or tool definition. The model caches the span up to that breakpoint on the first call; subsequent calls that share the same prefix bytes hit the cache.

cache_example.py
from openai import OpenAI

client = OpenAI(
    api_key="sk-kunavo-...",
    base_url="https://api.kunavo.com/v1",
)

# Mark the large, stable prefix once with cache_control.
# Every subsequent call that sends the same prefix reads it back
# from cache at ~10% of the input rate on Anthropic models.
SYSTEM = [
    {
        "type": "text",
        "text": open("long_system_prompt.txt").read(),
        "cache_control": {"type": "ephemeral"},
    },
]

resp = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "Summarize the doc above in 3 bullets."},
    ],
)

u = resp.usage
# OpenAI-compat fields populated by Kunavo:
# - prompt_tokens_details.cached_tokens — read from cache
# - cache_creation_input_tokens          — written to cache this call
print(u.prompt_tokens_details.cached_tokens,
      getattr(u, "cache_creation_input_tokens", 0))

Identical wire-format on the native Messages API — cache_control passes through to Claude untranslated:

messages_api.sh
# Native Anthropic Messages API — cache_control passes through untranslated.
curl https://api.kunavo.com/v1/messages \
  -H "Authorization: Bearer $KUNAVO_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-6",
    "system": [
      {
        "type": "text",
        "text": "<your large stable system prompt>",
        "cache_control": { "type": "ephemeral" }
      }
    ],
    "messages": [
      { "role": "user", "content": "Summarize the doc above in 3 bullets." }
    ],
    "max_tokens": 1024
  }'
Stable bytes matter. The cache key is content-addressed. A single different character anywhere in the prefix invalidates the cache for that span. Put dynamic content (timestamps, user names, per-call IDs) after the cached block, not inside it.

Pricing

Cache rates derive from each model's input price. Anthropic cache reads bill at 0.10× input. OpenAI and Gemini cache reads bill at 0.20× input. Kunavo does not pass through Anthropic's upstream cache-write surcharge — cache writes are billed to you at the plain input rate, so there's no penalty for warming up a new cache entry.

Anthropic — 0.10× cache read

ModelInputCache read
Claude Opus 4.7$3.5$0.35
Claude Opus 4.6$3.5$0.35
Claude Sonnet 4.6$2.1$0.21
Claude Sonnet 4.5$2.1$0.21
Claude Haiku 4.5$0.7$0.07

OpenAI / Gemini — 0.20× cache read

ModelInputCache read
Gemini 3 Pro$1.4$0.28
Gemini 3.1 Pro$1.4$0.28
Gemini 3 Flash$0.35$0.07
Gemini 2.5 Pro$0.875$0.175
Gemini 2.5 Flash$0.21$0.042
The catalog floor is cache-aware. Kunavo bills max(catalog_cost, upstream × markup) per call — the floor is computed against the cached-input rate, not the fresh rate, so the savings reach you instead of being flattened.
The rates above are catalog prices. The real per-call bill is computed from the upstream credits the model actually consumed (with a markup), then floored at the catalog rate. On cache hits the upstream charges fewer credits — so the upstream-times-markup branch usually drops below catalog and the catalog branch wins. In short: the catalog rate is your worst-case price; cached calls may land below it but never above.

Affinity routing keeps the cache warm

Upstream prompt caches are per-API-key. If every request lands on a random upstream key, the cache hit rate is ~1/N. Kunavo derives a stable hash from your system prompt + first user message and routes all calls with the same prefix to the same upstream key — using weighted rendezvous hashing, so only ~1/N keys are remapped when the pool changes.

You don't need to configure anything. Affinity routing applies automatically to all chat calls. (Image / video / TTS use random routing — there is no per-key cache for those modalities.)

Dashboard metrics

Once your calls start hitting the cache, /app/usage surfaces two rolled-up metrics for the selected window:

  • Hit rate — cached input tokens ÷ total input tokens. Exact, computed from per-call token counts.
  • Saved — USD saved this window, computed per model with its real cache-read rate (no flat-average approximations). This is an estimate at catalog rates — since real bills are floored at catalog and cached calls can land below the floor, the number is a tight upper bound on actual savings. For typical chat workloads the two are within a few percent.

Each call's detail page (/app/usage/<id>) shows the cached + cache-write token counts when they're non-zero, and the actual billed amount.

Response fields

Kunavo surfaces cache tokens in both response shapes. On the OpenAI-compatible /v1/chat/completions endpoint:

FieldMeaning
usage.prompt_tokensTotal input tokens (includes cached + cache-write).
usage.prompt_tokens_details.cached_tokensSubset of input served from cache.
usage.cache_creation_input_tokensTokens written to cache this call (billed as fresh input).

On the native Messages API at /v1/messages, the Anthropic-original fields pass through:

FieldMeaning
input_tokensFresh (uncached) input.
cache_read_input_tokensServed from cache (billed 0.10× input).
cache_creation_input_tokensWritten to cache this call (billed at the plain input rate — Kunavo doesn't pass through Anthropic's upstream write surcharge).

Where to go next