AI-feature cost lines on financial dashboards have a fun property: they grow non-linearly with traffic, and they keep getting worse as your product gets popular. Here are the five highest-leverage techniques we've seen actually move the needle, with runnable code for each.
Worked example: a chat product doing 100K calls/day at an average of 3K input / 500 output tokens on Claude Sonnet 4.6 spends about $4,500/month direct. The same load with the five techniques below usually settles around $900–$1,200/month — a 70–80% cut without meaningfully degrading output quality.
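The arithmetic behind an estimate like this is easy to script against your own traffic. A minimal sketch, assuming you know your average tokens per call: the function name `monthly_cost` is hypothetical, and the prices in the example call are placeholders, not figures from any price list or from the workload above.

```python
def monthly_cost(calls_per_day, input_tokens, output_tokens,
                 input_price_per_mtok, output_price_per_mtok, days=30):
    """Estimate monthly spend in dollars from per-call token averages.

    Prices are dollars per million tokens. All values are caller-supplied;
    nothing here is looked up from a real price list.
    """
    calls = calls_per_day * days
    token_dollars = (input_tokens * input_price_per_mtok
                     + output_tokens * output_price_per_mtok)
    return calls * token_dollars / 1_000_000


# Placeholder numbers purely to show the call shape.
example = monthly_cost(
    calls_per_day=1_000,
    input_tokens=2_000,
    output_tokens=400,
    input_price_per_mtok=1.0,   # hypothetical price
    output_price_per_mtok=5.0,  # hypothetical price
)
```

Plugging in your real call volume and your provider's current per-token prices gives the baseline that the five techniques are measured against.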
1. Prompt caching — the single biggest lever
If your system prompt or context block is more than ~2K tokens and is reused across calls, you're overpaying. Anthropic and Google both bill cached input at roughly 0.1×–0.2× of normal input — a 5–10× saving on every call after the cache warms. Kunavo passes the entire saving through and runs an affinity router that keeps repeat prompts hitting the same upstream node, so the cache stays alive even at low concurrency.
See /docs/caching for the full guide.
import anthropic

client = anthropic.Anthropic(
    api_key="sk-kunavo-...",
    base_url="https://api.kunavo.com",
)

# Mark the long, static parts of the prompt as cacheable. The next call
# with the same prefix hits Anthropic's cache at about 10% of the cost.
resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "<<your 20K-token system prompt: schema, rules, examples>>",
            "cache_control": {"type": "ephemeral"},  # mark as cacheable
        },
    ],
    messages=[{"role": "user", "content": "Latest question…"}],
)

Expected savings: 40–70% on input cost for long-system-prompt workloads. Often the largest single improvement.
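It is worth checking that the cache is actually being hit. Anthropic's Messages API reports `input_tokens`, `cache_read_input_tokens`, and `cache_creation_input_tokens` in the response's `usage` object. The sketch below turns those three counts into a billed-versus-uncached ratio. The helper name is hypothetical. The 0.1 read multiplier echoes the figure quoted above, and the 1.25 write multiplier is an assumption to verify against your provider's current pricing.

```python
def cached_input_ratio(input_tokens, cache_read_tokens, cache_write_tokens,
                       read_multiplier=0.1, write_multiplier=1.25):
    """Return billed input cost as a fraction of the fully uncached cost.

    The multipliers are assumptions, not quoted prices. A value near
    read_multiplier means the cache is warm; a value above 1.0 means this
    call paid to write the cache.
    """
    total = input_tokens + cache_read_tokens + cache_write_tokens
    if total == 0:
        return 1.0
    billed = (input_tokens
              + cache_read_tokens * read_multiplier
              + cache_write_tokens * write_multiplier)
    return billed / total


# With a real response object, the inputs would come from resp.usage, e.g.:
#   u = resp.usage
#   cached_input_ratio(u.input_tokens,
#                      u.cache_read_input_tokens or 0,
#                      u.cache_creation_input_tokens or 0)
```

If the ratio stays near or above 1.0 across repeated calls, the cached prefix is probably changing between requests, for example because a timestamp or user ID sits inside the block marked cacheable.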
2. Model tiering — don't pay for capability you're not using
The cheapest reasonable model that still produces acceptable output wins. Most LLM workloads have a long-tail distribution of difficulty — the average classification or summarization task does not need Opus.
Build a simple difficulty estimate (heuristic, classifier, or Haiku itself triaging) and route accordingly. Across 50+ Kunavo accounts we've seen this alone cut chat costs by 40–60%.
def pick_model(task_difficulty: int) -> str:
    """
    Cheapest model that's still good enough wins.
    Pick by perceived difficulty, then validate by output quality.
    """
    if task_difficulty <= 2:
        return "claude-haiku-4-5"   # ~10x cheaper than Opus
    if task_difficulty <= 4:
        return "gemini-3-flash"     # very strong, even cheaper on output
    if task_difficulty <= 7:
        return "claude-sonnet-4-6"
    return "claude-opus-4-7"        # only when truly needed

Expected savings: 40–60% on chat-style workloads once a meaningful share routes to Haiku / Gemini Flash.
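The router needs a difficulty score to work from. One possible starting point is a plain heuristic on prompt length and a few keywords. This is a rough sketch, not a tuned classifier: the function name, keyword list, and thresholds are all assumptions you would calibrate against your own traffic.

```python
import re

# Hypothetical keyword list: words that tend to signal multi-step reasoning.
REASONING_HINTS = re.compile(
    r"\b(prove|derive|debug|refactor|architecture|step by step)\b",
    re.IGNORECASE,
)


def estimate_difficulty(prompt: str) -> int:
    """Crude 1-10 difficulty score from length and keyword signals."""
    score = 1
    words = len(prompt.split())
    if words > 50:       # more than a one-liner
        score += 2
    if words > 400:      # long context to digest
        score += 2
    # Each reasoning keyword adds 2, capped at two keywords.
    score += 2 * min(len(REASONING_HINTS.findall(prompt)), 2)
    return min(score, 10)
```

The score can be passed straight to `pick_model`. Because a heuristic like this will misroute some requests, it pairs best with spot checks of output quality on the cheaper tiers.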
3. Output caps and spending limits
Output tokens cost roughly 5× more than input. A bug that lets a prompt request a 16K-token response inside a tight loop is how people accidentally spend $400 in an afternoon. Defend both layers — the per-call max_tokens, and the wallet-level spending limit in the dashboard.
# Two-layer cap: per-call max_tokens, plus a hard wallet cap on the key.
# `client` here is an OpenAI-compatible client pointed at Kunavo.
resp = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[...],
    max_tokens=800,         # cap output per call
    stop=["</answer>"],     # cut at a known terminator
)

# In the Kunavo dashboard: /app/keys → set "Spending limit" on each key.
# Production key: $50/day. Experiment key: $5/day. CI key: $1/day.

On Kunavo: /app/keys lets you set a per-key per-day cap. CI/test keys at $1/day prevent runaway scripts; production keys can stay generous.
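To pick sensible values for both layers, it helps to work out the worst case: what a loop costs if every call runs to `max_tokens`, and how many such calls fit under a daily limit. A small sketch follows. The function names are hypothetical and the output price is a parameter you supply, so nothing here reflects a real price list.

```python
def worst_case_output_spend(calls, max_tokens, output_price_per_mtok):
    """Upper bound on output spend (dollars) if every call emits max_tokens."""
    return calls * max_tokens * output_price_per_mtok / 1_000_000


def calls_until_limit(daily_limit, max_tokens, output_price_per_mtok):
    """How many max-length responses fit under a daily spending limit.

    Counts output tokens only, so the real number of calls is lower once
    input tokens are included.
    """
    return int(daily_limit * 1_000_000 // (max_tokens * output_price_per_mtok))


# Placeholder price of 10 dollars per million output tokens.
runaway = worst_case_output_spend(calls=1_000, max_tokens=16_000,
                                  output_price_per_mtok=10)
capped = calls_until_limit(daily_limit=5, max_tokens=800,
                           output_price_per_mtok=10)
```

Running the same numbers with a tighter `max_tokens` shows how much headroom the per-call cap buys before the wallet-level limit has to step in.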
4. Parallelism — concurrent != more expensive
Sequential calls don't save money compared to parallel ones — but they do waste wall-clock time, which often gets "solved" by giving the LLM more context or a bigger model. Fire independent calls in parallel; finish faster; tier down.
# Don't pay for sequential reasoning when N tasks are independent.
import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key=os.environ["KUNAVO_API_KEY"],
    base_url="https://api.kunavo.com/v1",
)


async def classify_one(text: str) -> str:
    resp = await client.chat.completions.create(
        model="claude-haiku-4-5",  # parallel + cheap = win
        messages=[{"role": "user", "content": f"Classify: {text}"}],
        max_tokens=20,
    )
    return resp.choices[0].message.content


async def classify_batch(batch: list[str]) -> list[str]:
    # gather() runs the calls concurrently and preserves input order.
    return await asyncio.gather(*(classify_one(t) for t in batch))


# From synchronous code: results = asyncio.run(classify_batch(batch))

Expected savings: indirect, usually 10–20% by avoiding the "upgrade the model to be faster" trap.
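An unbounded `gather` over a large batch can open thousands of requests at once and run into rate limits. One common way to keep the parallelism while bounding it is an `asyncio.Semaphore`. The sketch below assumes a limit of 8 in-flight calls, which is an arbitrary starting value, and uses a stand-in coroutine (`fake_classify`, hypothetical) so it runs without network access.

```python
import asyncio


async def gather_bounded(coro_fn, items, limit=8):
    """Run coro_fn over items with at most `limit` calls in flight.

    Results come back in the same order as `items`.
    """
    sem = asyncio.Semaphore(limit)

    async def run_one(item):
        async with sem:
            return await coro_fn(item)

    return await asyncio.gather(*(run_one(item) for item in items))


async def fake_classify(text: str) -> str:
    # Stand-in for a real API call so the sketch runs offline.
    await asyncio.sleep(0)
    return "long" if len(text) > 10 else "short"
```

In real use you would pass your own API-calling coroutine in place of `fake_classify`, for example `asyncio.run(gather_bounded(classify_one, batch))`.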
5. Retries that don't double-bill
On Kunavo, failed requests (4xx, 5xx) are never billed — but only the call that fails is free. If your retry logic fires on every transient blip without backoff, you can multiply cost by 2–3× on a bad upstream day. Honor Retry-After, distinguish 4xx (your bug) from 5xx (transient), cap attempts.
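import random
import time

from openai import APIConnectionError, APIStatusError, RateLimitError


def call_with_backoff(client, max_attempts=5, **kwargs):
    for attempt in range(max_attempts):
        backoff = 2 ** attempt + random.random()  # exponential with jitter
        try:
            return client.chat.completions.create(**kwargs)
        except RateLimitError as e:
            # Honor Retry-After when present (seconds form only).
            header = e.response.headers.get("retry-after")
            try:
                delay = float(header)
            except (TypeError, ValueError):
                delay = backoff
            time.sleep(min(delay, 30))
        except APIStatusError as e:
            if e.status_code < 500:
                raise  # other 4xx: your bug, not a transient one
            time.sleep(backoff)  # 5xx: transient, back off and retry
        except APIConnectionError:
            time.sleep(backoff)  # network blip: also transient
    raise RuntimeError("exhausted retries")

See /docs/errors for the full retry rules.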
The compound effect
Each technique applied alone gives a partial cut. Stacked, they compound multiplicatively: a 50% cut from caching, a 50% cut from tiering, and a 10% cut from capping leave 0.5 × 0.5 × 0.9 ≈ 22% of the original cost, or about 78% saved. That's the difference between AI being a budget item and AI being a production margin contributor.
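The stacking arithmetic is easy to rerun with your own measured cuts. A minimal sketch, assuming the cuts are independent and each applies to the whole bill, which is a simplification (caching, for instance, only affects input tokens); the function name is hypothetical.

```python
import math


def remaining_cost_fraction(cuts):
    """Multiply out fractional cuts; returns the fraction of cost left.

    Each cut is a fraction in [0, 1], e.g. 0.5 for a 50% reduction.
    """
    return math.prod(1.0 - cut for cut in cuts)


# Caching 50%, tiering 50%, capping 10%.
left = remaining_cost_fraction([0.50, 0.50, 0.10])
saved = 1.0 - left
```

Here `left` comes out at about 0.225, matching the roughly 22% remaining cost in the text. Swapping in more conservative cuts shows how sensitive the headline figure is to each technique.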
Kunavo's own pricing — every model 30% under the official list — is a flat multiplier on top of all five techniques. The cheapest way to run frontier AI in 2026 is: choose the right provider, then squeeze the right techniques out of that provider. Both layers matter.
Want a free read of your current usage to see where the leverage is? Email contact@kunavo.com with a rough description and we'll do the analysis.