返回博客
Playbook·2026年5月23日·8 min read

AI API cost optimization — five techniques that actually cut the bill

Prompt caching, model tiering, output caps, parallelism, retry hygiene — with runnable code for each and realistic per-technique savings ranges. Stack them and you cut 70–80%.

AI-feature cost lines on financial dashboards have a fun property: they grow non-linearly with traffic and they keep getting worse as your product gets popular. Here are the five highest- leverage techniques we've seen actually move the needle, with runnable code for each.

Worked example: a chat product doing 100K calls/day at an average of 3K input / 500 output tokens on Claude Sonnet 4.6 spends about $4,500/month direct. The same load with the five techniques below usually settles around $900–$1,200/month — a 70–80% cut without meaningfully degrading output quality.

1. Prompt caching — the single biggest lever

If your system prompt or context block is more than ~2K tokens and is reused across calls, you're overpaying. Anthropic and Google both bill cached input at roughly 0.1×–0.2× of normal input — a 5–10× saving on every call after the cache warms. Kunavo passes the entire saving through and runs an affinity router that keeps repeat prompts hitting the same upstream node, so the cache stays alive even at low concurrency.

See /docs/caching for the full guide.

cache.py
import anthropic

client = anthropic.Anthropic(
    api_key="sk-kunavo-...",
    base_url="https://api.kunavo.com",
)

# Mark the long, static parts of the prompt as cacheable. The next call
# with the same prefix hits Anthropic's cache — 10% the cost.
resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "<<your 20K-token system prompt — schema, rules, examples>>",
            "cache_control": {"type": "ephemeral"},   # mark as cacheable
        },
    ],
    messages=[{"role": "user", "content": "Latest question…"}],
)

Expected savings: 40–70% on input cost for long-system-prompt workloads. Often the largest single improvement.

2. Model tiering — don't pay for capability you're not using

The cheapest reasonable model that still produces acceptable output wins. Most LLM workloads have a long-tail distribution of difficulty — the average classification or summarization task does not need Opus.

Build a simple difficulty estimate (heuristic, classifier, or Haiku itself triaging) and route accordingly. Across 50+ Kunavo accounts we've seen this alone cut chat costs by 40–60%.

tiering.py
def pick_model(task_difficulty: int) -> str:
    """
    Cheapest model that's still good enough wins.
    Pick by perceived difficulty, then validate by output quality.
    """
    if task_difficulty <= 2:
        return "claude-haiku-4-5"     # ~10x cheaper than Opus
    if task_difficulty <= 4:
        return "gemini-3-flash"       # very strong, even cheaper on output
    if task_difficulty <= 7:
        return "claude-sonnet-4-6"
    return "claude-opus-4-7"          # only when truly needed

Expected savings: 40–60% on chat-style workloads once a meaningful share routes to Haiku / Gemini Flash.

3. Output caps and spending limits

Output tokens cost roughly 5× more than input. A bug that lets a prompt request a 16K-token response inside a tight loop is how people accidentally spend $400 in an afternoon. Defend both layers — the per-call max_tokens, and the wallet-level spending limit in the dashboard.

caps.py
# Two-layer cap: per-call max_tokens, plus a hard wallet cap on the key.
resp = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[...],
    max_tokens=800,       # cap output per call
    stop=["</answer>"],   # cut at a known terminator
)

# In the Kunavo dashboard: /app/keys → set "Spending limit" on each key.
# Production key: $50/day. Experiment key: $5/day. CI key: $1/day.

On Kunavo: /app/keys lets you set a per-key per-day cap. CI/test keys at $1/day prevent runaway scripts; production keys can stay generous.

4. Parallelism — concurrent != more expensive

Sequential calls don't save money compared to parallel ones — but they do waste wall-clock time, which often gets "solved" by giving the LLM more context or a bigger model. Fire independent calls in parallel; finish faster; tier down.

parallel.py
# Don't pay for sequential reasoning when N tasks are independent.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key=os.environ["KUNAVO_API_KEY"],
    base_url="https://api.kunavo.com/v1",
)

async def classify_one(text: str) -> str:
    resp = await client.chat.completions.create(
        model="claude-haiku-4-5",     # parallel + cheap = win
        messages=[{"role": "user", "content": f"Classify: {text}"}],
        max_tokens=20,
    )
    return resp.choices[0].message.content

results = await asyncio.gather(*(classify_one(t) for t in batch))

Expected savings: indirect — usually 10–20% by avoiding the "upgrade the model to be faster" trap.

5. Retries that don't double-bill

On Kunavo, failed requests (4xx, 5xx) are never billed — but only the call that fails is free. If your retry logic fires on every transient blip without backoff, you can multiply cost by 2–3× on a bad upstream day. Honor Retry-After, distinguish 4xx (your bug) from 5xx (transient), cap attempts.

retry.py
from openai import APIError, RateLimitError
import time, random

def call_with_backoff(client, **kwargs):
    for attempt in range(5):
        try:
            return client.chat.completions.create(**kwargs)
        except RateLimitError as e:
            # Honor Retry-After when present.
            ra = getattr(e, "retry_after", None) or (2 ** attempt + random.random())
            time.sleep(min(ra, 30))
        except APIError as e:
            if 500 <= e.status_code < 600:
                time.sleep(2 ** attempt + random.random())
                continue
            raise           # 4xx — your bug, not a transient one
    raise RuntimeError("exhausted retries")

See /docs/errors for the full retry rules.

The compound effect

Each technique applied alone gives a modest cut. Stacked, they compound: 50% off from caching × 50% off from tiering × 10% off from capping = 22% of original cost = 78% saved. That's the difference between AI being a budget item and AI being a production margin contributor.

Kunavo's own pricing — every model 30% under the official list — is a flat multiplier on top of all five techniques. The cheapest way to run frontier AI in 2026 is: choose the right provider, then squeeze the right techniques out of that provider. Both layers matter.

Want a free read of your current usage to see where the leverage is? Email contact@kunavo.com with a rough description and we'll do the analysis.