Back to guides
Cost·May 25, 2026·11 min read

AI cost optimization — the complete guide to cutting 70-90% off your LLM bill

If you're paying more than $1,000/month on LLM APIs, this guide will save you at least half. Every technique is published with measured savings and runnable code. Stack 3-5 of them and 70% reduction is realistic without quality tradeoff.

Five techniques compound. Each one cuts a different slice of the bill — stacking three of them realistically gets you to 70% reduction without any quality loss; all five gets you closer to 85-90%. This guide is the full menu, with measured savings and runnable code for each technique. Each technique links to a focused deep dive at the bottom.

The five techniques, ranked by ROI

  1. Prompt caching (60-90% off input on repeat prompts)
  2. Model tiering (10x cheaper for simple tasks)
  3. Output caps (caps the dominant cost on creative tasks)
  4. Parallelism + batching (cuts wall-clock, not per-token, but enables tighter timeouts and fewer retries)
  5. Retry hygiene (don't pay for runaway retries)

1. Prompt caching — the highest-ROI technique by far

If your system prompt is over 1,000 tokens and you call the same prompt more than twice in a 5-minute window, prompt caching is free money. Anthropic charges cached input at 10% of the input rate; OpenAI does the equivalent automatically at 20%. One field added:

cache_control.py
# Prompt caching: 1 extra field = 10% rate on cached input
resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=600,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},   # cache this part
    }],
    messages=[{"role": "user", "content": question}],
)

Realistic savings: 60-90% off your input cost line if your system prompt is large and stable. Look at your /app/usage dashboard — if cache_read_input_tokens is 0, you have 30 minutes of work ahead that pays back forever. Deep dive.

2. Model tiering — use the cheapest model that's good enough

Claude Haiku 4.5 costs roughly 10% of Claude Opus 4.7. For classification, formatting, routing decisions, simple summaries — Haiku is more than enough. Reserve Opus for the genuinely hard reasoning calls.

tiering.py
# Pick the cheapest model that's still good enough for the task.
# Cost order (Kunavo prices): Haiku 4.5 < Gemini 3 Flash < Sonnet 4.6 < Opus 4.7
def pick_model(difficulty: int) -> str:
    if difficulty <= 2: return "claude-haiku-4-5"     # ~10x cheaper than Opus
    if difficulty <= 4: return "gemini-3-flash"       # very strong, cheap output
    if difficulty <= 7: return "claude-sonnet-4-6"
    return "claude-opus-4-7"

How to build the tiering decision: a small classifier first (Haiku itself can do this in 100 input tokens), then route. Measure output quality with held-out evals. Don't let intuition pick — bench it.

  • Tier 0 (Haiku 4.5): classification, extraction, translation, summarization of <2K input — cost-per-call cents
  • Tier 1 (Gemini 3 Flash): longer summarization, code completion, simple tool use — also cents but better at long context
  • Tier 2 (Sonnet 4.6): real reasoning, multi-step agents, vision tasks — the production workhorse
  • Tier 3 (Opus 4.7): only when you've confirmed Sonnet doesn't get it right

3. Output caps — limit what they generate, not just what they read

Output tokens are typically 4-5x more expensive than input. A loose max_tokens=4096 default invites the model to ramble. Cap aggressively, and use stop sequences to cut at the right marker:

caps.py
# Two-layer caps: per-call max_tokens + per-key wallet cap
client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[...],
    max_tokens=800,             # cap each call
    stop=["</answer>"],         # cut at terminator
)
# /app/keys → set "Spending limit" per key:
#   Production: $50/day · Experiment: $5/day · CI: $1/day

Pair the per-call max_tokens with a per-key wallet spending cap in /app/keys — a runaway loop or bug can't exceed your daily ceiling. Two layers, both important.

4. Parallelism — for batch-shaped workloads

If you're processing 1,000 independent items (classifying feedback, scoring leads, generating descriptions), serial calls take an hour. Async with AsyncOpenAI and a concurrency limit gets you there in under a minute — and tighter timeouts let you fail-fast on slow upstreams. Parallelism doesn't cut per-token cost, but it enables architectural patterns (idempotent processing, tighter retry budgets) that do.

5. Retry hygiene — don't pay for runaway retries

Naive code retries on every error including 400s. That's paying for failures (4xx requests at Kunavo aren't billed, but you still burn rate-limit budget and wall-clock). Correct policy: retry only 408/429/5xx, exponential backoff with jitter, max 5 attempts and 30 seconds total wait. Anything else, fail fast and surface the error.

What the savings look like in practice

Realistic measured savings on a Kunavo-using SaaS with $5,000/month spend:

  • Prompt caching added: $2,000 saved (−40%)
  • Haiku tiering for classification: $800 saved (−16%)
  • Tighter max_tokens: $400 saved (−8%)
  • Combined: $3,200/month saved, $1,800 spent (−64% total)

These compound — each next technique works on the already-reduced baseline. Without caching applied first, the others save less. Order matters: start with #1, then #2, then #3.

What does not work as advertised

  • "Embedding for everything" — embeddings save on text search, not on generation. They're cheap but they don't reduce LLM-call cost
  • "Local fine-tunes for cost" — only justifiable if you're processing 10M+ tokens/day on a narrow task. For most teams, prompt caching + Haiku tiering beats the operational complexity of self-hosting
  • "Cheaper aggregators" — Kunavo is already 30% under official upstream pricing. Cheaper aggregators usually win by silently swapping your requested model to a cheaper one

Where to go next

Apply caching first — it's the highest ROI per hour invested. Then instrument your /app/usage dashboard breakdown by model and tier. Then tier. Track monthly cost-per-active-user as your north star metric — if it's flat-to-down while engagement grows, you're winning.