返回博客
Guide·2026年5月24日·8 min read

Anthropic prompt caching — cut 90% off your input bill in 30 minutes

The full picture: how cache_control works, OpenAI-style automatic caching, the 4-breakpoint pattern for agent loops, what silently breaks caching, and how to verify your hit rate is non-zero. Includes Kunavo cache rates per model.

If you're sending a long system prompt — RAG context, a tool catalog, agent rules, examples — you're probably paying full input price every call. Anthropic's prompt caching cuts that to 10% of the rate on the cached portion. OpenAI does the same thing implicitly. Most teams find it worth 30 minutes of work because it reliably knocks 60–90% off their input cost line.

Kunavo passes both flavors through unchanged. Anthropic-style cache_control works on the Messages API; OpenAI-style automatic caching works on the Chat Completions API. This post walks through both, the gotchas that quietly invalidate caches, and how to verify your hit rate.

The before-and-after

A naive loop that sends the same 18K-token system prompt ten times:

naive.py
# What most people start with: every call re-pays for the whole prompt.
import anthropic

client = anthropic.Anthropic(
    api_key="sk-kunavo-...",
    base_url="https://api.kunavo.com",
)

SYSTEM = open("system-prompt.md").read()        # 18,000 tokens of rules + examples

# 10 user questions in a session. Every call sends the 18K-token system block.
for question in questions:
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=600,
        system=SYSTEM,
        messages=[{"role": "user", "content": question}],
    )

# Cost per call (input only): 18,000 × $3 / 1M = $0.054
# 10 calls: $0.54 in input alone.

Now mark the system block cacheable — one extra field:

cached.py
# The fix: mark the static prefix as cacheable. After the first call,
# subsequent calls within ~5 minutes pay 10% the input rate on the cached
# portion. Same answer, 89% cheaper.
resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=600,
    system=[
        {
            "type": "text",
            "text": SYSTEM,
            "cache_control": {"type": "ephemeral"},  # mark cacheable
        }
    ],
    messages=[{"role": "user", "content": question}],
)

# First call (cache write):  18,000 × $3   / 1M = $0.054
# Calls 2–10 (cache hit):    18,000 × $0.30 / 1M = $0.0054 each
# Total: $0.054 + 9 × $0.0054 = $0.103  (was $0.54 — 81% saved)

You just saved 81% of input cost on this session. The first call is actually slightly more expensive than the naive version (~1.25× input rate to write the cache). Calls 2 through 10 pay 10% the rate. The crossover is at call 2 — by call 3 you're ahead. By call 10 it's a landslide.

OpenAI-style: nothing to do

If you're using the OpenAI SDK, Chat Completions caches the prefix of any prompt over 1,024 tokens automatically. No flag, no config. The same model behind Kunavo gets the discount whether you used the Anthropic or OpenAI route in:

openai_style.py
# OpenAI Chat Completions style — caching is automatic for prompts ≥1024 tokens.
# No flag to set. The "usage" object tells you what was cached.
from openai import OpenAI

client = OpenAI(
    api_key="sk-kunavo-...",
    base_url="https://api.kunavo.com/v1",
)

resp = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},  # >1024 tokens
        {"role": "user", "content": "Latest question…"},
    ],
)

# In the response:
#   resp.usage.prompt_tokens_details.cached_tokens  →  17,800
#   resp.usage.prompt_tokens                        →  18,200
# 17.8K/18.2K = 98% of input came from cache. Bill reflects that automatically.

Read usage.prompt_tokens_details.cached_tokens to see how much was served from cache. Bigger fixed prefix = bigger savings. The rule of thumb: if your system prompt is shorter than your variable user content, caching isn't doing much. Reshape the prompt so the static parts are large and at the front.

Multi-layer caching — up to 4 breakpoints

For agent loops where some layers change faster than others, set multiple cache_control breakpoints. Each one is a snapshot of everything up to that point:

breakpoints.py
# Anthropic supports up to 4 cache breakpoints per request — use them
# to keep the cache hot even as later layers change.
client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=600,
    system=[
        {"type": "text",
         "text": ROLE_AND_RULES,                      # ~3,000 tokens
         "cache_control": {"type": "ephemeral"}},     # breakpoint 1
        {"type": "text",
         "text": LARGE_KNOWLEDGE_BASE,                # ~15,000 tokens, rarely changes
         "cache_control": {"type": "ephemeral"}},     # breakpoint 2
    ],
    messages=[
        {"role": "user",
         "content": [
             {"type": "text",
              "text": CONVERSATION_HISTORY,           # grows each turn
              "cache_control": {"type": "ephemeral"}}, # breakpoint 3
             {"type": "text", "text": new_question},
         ]},
    ],
)

# When CONVERSATION_HISTORY changes, breakpoints 1+2 still hit cache.
# Only breakpoint 3 + the new question pay full input rate.

The cache key is the entire prefix. Add a token at position N and every breakpoint at or after N is invalidated. Order matters: most stable content first. The rules layer should rarely change; the knowledge base updates weekly; the conversation grows every turn.

Common ways to silently break caching

  • Putting the current date or a request_id in the prompt. Every call is a new prefix, cache hit rate is 0%. Hash your prompt inputs and compare across calls.
  • Non-deterministic system prompt assembly. If you build the system from a dict, dict iteration order matters in some Python versions. Sort keys explicitly.
  • Cache lifetime is ~5 minutes. Sparse traffic patterns (one call every 10 minutes) get zero hits. Either batch calls, or accept the loss.
  • The 1,024-token minimum. Under 1K tokens, OpenAI-style caching doesn't engage. Combine small static fragments into a single longer prefix.
  • Tools / function definitions are part of the prefix. A new tool added to the catalog invalidates the cache for everyone. Keep the tool catalog stable; version it.

Verifying the hit rate

Caching that you can't see isn't engineering — it's hoping. Log usage on every call:

observe.py
# Always read usage. If cached_tokens is 0 when you expected a hit,
# something's wrong — usually a non-deterministic prefix.
resp = client.messages.create(...)
u = resp.usage
print({
    "input_uncached":  u.input_tokens,
    "input_cache_read": u.cache_read_input_tokens,
    "input_cache_write": u.cache_creation_input_tokens,
    "output": u.output_tokens,
})

# A common gotcha: putting today's date or a request_id in the system prompt
# silently invalidates the cache. Hash your inputs; verify cache_read_input_tokens
# is non-zero on the 2nd identical call.

In the Kunavo dashboard, the Usage page shows the cache split per model per day. If you see cache_read_input_tokens growing as a percentage of total input, caching is working. If it stays at 0 or fluctuates wildly, walk through the gotcha list above.

What it actually costs on Kunavo

Every model's cache rate is published on the pricing page:

  • Anthropic models: cache reads are 10% of the input rate. Cache writes are billed at the input rate — no upstream surcharge.
  • OpenAI / Gemini models: cache reads are 20% of the input rate (the providers' published ratio).
  • All cached pricing already includes Kunavo's 30% discount off the upstream sticker. So a Sonnet 4.6 cache read on Kunavo is $3 × 0.70 × 0.10 = $0.21 per 1M tokens. About 14× cheaper than the uncached upstream rate.

When caching isn't the answer

A few cases where the work isn't worth it:

  • Short prompts (<1K tokens total). The overhead dominates; just don't bother.
  • One-shot tasks with no repeat traffic. The first call is slightly more expensive — caching only pays back on call 2+.
  • High output, low input tasks (creative writing, code generation). Input is already a small share of the bill. Focus on output budget caps instead.

For everything else — RAG chatbots, agents with a fixed tool catalog, classifiers running over a static rubric, structured-extraction pipelines with consistent few-shots — caching is the highest-ROI optimization you can ship in a single afternoon. Pair it with the four other techniques in our cost-optimization playbook and 70% cost reduction is realistic without a single tradeoff in output quality.

Already on Kunavo? Open /app/usage and check the cache column for your largest model. If it's zero, you're leaving money on the table. Full guide: /docs/caching.