Kunavo's public SLO is 99.95% monthly availability on the chat API. That sounds like a marketing number, but it's a real one: it caps our error budget at roughly 21.6 minutes per 30-day month, and we've met the SLO every month since launch. This post covers what's actually behind it: the upstream failover, the first-byte watchdog, the billing abacus, and the boring operational things that matter more than any of the above.
The reality: every single upstream has bad days
We talk to ~12 upstreams behind the scenes. In any given month we see each of them have at least one 5-to-30 minute partial degradation. Sometimes it's a region outage (Anthropic's us-east-1 had two incidents this quarter). Sometimes it's a quiet token-bucket squeeze (Vertex AI's rate limiter is ungenerous on Mondays). Sometimes it's a full provider outage (kie.ai went dark for 47 minutes in March).
If a single upstream goes down and you have no failover, your SLO is whatever their SLO is — minus your transit overhead. That puts a hard ceiling around 99.5%. Cracking 99.9% requires that no single upstream failure can take you down, and 99.95% requires that two simultaneous failures also can't.
Layer 1: Weighted candidate routing
For every model we host, there are 1 to 4 upstream paths. The dispatcher walks them in weighted order, with weights that decay in real-time based on rolling success rate (the last 100 calls per upstream per region). When us-east-1 starts 5xx-ing, its weight drops below us-west-2 within ~5 seconds, and new requests stop going there before a customer notices.
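A minimal sketch of how a rolling success rate could drive those weights, assuming a fixed window of the last 100 outcomes per upstream and an effective weight equal to the base weight times the success rate. The names `RollingSuccessWindow`, `RoutedUpstream`, `effectiveWeight`, and `orderByWeight` are hypothetical, and the real decay function may differ.

```typescript
// Hypothetical sketch: class names and the weight formula are assumptions.
class RollingSuccessWindow {
  private readonly outcomes: boolean[] = [];
  private readonly size: number;

  constructor(size = 100) {
    this.size = size;
  }

  // Record one call outcome and drop the oldest once the window is full.
  record(ok: boolean): void {
    this.outcomes.push(ok);
    if (this.outcomes.length > this.size) this.outcomes.shift();
  }

  // Success rate over the window; an empty window is treated as healthy.
  successRate(): number {
    if (this.outcomes.length === 0) return 1;
    const successes = this.outcomes.filter(Boolean).length;
    return successes / this.outcomes.length;
  }
}

interface RoutedUpstream {
  provider: string;
  baseWeight: number;
  window: RollingSuccessWindow;
}

// Assumed formula: scale the configured weight by the recent success rate.
function effectiveWeight(u: RoutedUpstream): number {
  return u.baseWeight * u.window.successRate();
}

// Highest effective weight first; the input array is not mutated.
function orderByWeight(upstreams: RoutedUpstream[]): RoutedUpstream[] {
  return [...upstreams].sort((a, b) => effectiveWeight(b) - effectiveWeight(a));
}
```

With a base weight of 100 against a backup at 80, the primary drops behind the backup once its windowed success rate falls below 80%.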
    // Pseudocode for the upstream dispatcher. The real implementation
    // (TypeScript, server-only) is shaped close to this.
    async function dispatch(req: ChatRequest): Promise<ChatResponse> {
      const candidates = pickCandidates(req.model);
      // candidates = [{provider: "anthropic-direct", weight: 100},
      //               {provider: "google-vertex", weight: 80},
      //               {provider: "kie-wholesale", weight: 60}]
      // Lower weight = backup. Initial weights come from rolling success rates.
      const deadline = Date.now() + 540_000; // hard 540s wall
      let lastError: Error | null = null;
      for (const candidate of candidates) {
        const budgetMs = chunkBudget(deadline, candidates.length);
        try {
          // Whichever settles first wins; timeout() is expected to reject
          // once budgetMs has elapsed.
          return await Promise.race([
            callUpstream(candidate, req),
            timeout(budgetMs),
          ]);
        } catch (err) {
          // Non-retryable 4xx → no failover (408 and 429 are retryable).
          if (!isRetryable(err)) throw err;
          recordFailure(candidate, err); // weight decays in real time
          lastError = err instanceof Error ? err : new Error(String(err));
        }
      }
      throw lastError ?? new Error("all upstreams exhausted");
    }

Two things matter here that are easy to get wrong:
- Don't fail over on 4xx. A 400 from the upstream means the request was bad — retrying it against another upstream will produce the same 400, just slower. Only 408, 429, 5xx, and network errors trigger failover.
- Bound the per-candidate budget. If a 540s deadline is split across 3 candidates, an even share is 180s each; we give each one ~150s to either start streaming or die. Don't give the first one the full 540: if it hangs, you have no time left for backup.
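The two rules above can be sketched as a pair of small helpers. The names mirror `isRetryable` and `chunkBudget` from the pseudocode, but the bodies here are assumptions: this version classifies by HTTP status only (network errors would need separate handling) and splits the time still remaining evenly across the candidates not yet tried. An even split of 540s over 3 candidates gives 180s, so reaching the ~150s figure would need an extra reserve that this sketch does not model.

```typescript
// Hypothetical helpers; the bodies are assumptions, not the production code.

// Only 408, 429, and 5xx are worth retrying against another upstream.
function isRetryableStatus(status: number): boolean {
  return status === 408 || status === 429 || (status >= 500 && status <= 599);
}

// Split the time left before the deadline evenly across the candidates that
// have not been tried yet, so a hung first candidate cannot eat the whole wall.
function chunkBudget(
  deadlineMs: number,
  remainingCandidates: number,
  nowMs: number = Date.now(),
): number {
  if (remainingCandidates < 1) {
    throw new RangeError("need at least one remaining candidate");
  }
  const remainingMs = Math.max(0, deadlineMs - nowMs);
  return Math.floor(remainingMs / remainingCandidates);
}
```

Passing the count of candidates still untried, rather than the total, means that time left unused by a fast failure is handed to the remaining candidates.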
Layer 2: First-byte watchdog (50ms)
Half of upstream incidents aren't hard failures — the connection opens, the request sends, and then nothing. No response, no error. Just silence. A naive retry waits the full timeout, then tries the next one, doubling the user-visible latency.
Our dispatcher arms an aggressive 50ms first-byte timer once the request is sent. If we don't see a single byte of response in that window, we abort and fall through to the next candidate. On the happy path the timer is simply cleared, so the customer sees no added delay. On the failure path, the request fails over to the backup upstream after 50ms, which is below the threshold most people perceive as a delay.
    // First-byte detection: fail FAST if upstream is hung.
    async function callUpstream(c: Candidate, req: ChatRequest) {
      const ac = new AbortController();
      // If we don't see the start of a response within 50ms of issuing the
      // request, treat the upstream as silent and fall through to the next.
      const firstByteWatchdog = setTimeout(() => ac.abort("ttfb-50ms"), 50);
      let res: Response;
      try {
        // fetch() resolves once the response headers arrive.
        res = await fetch(c.url, {
          method: "POST",
          body: JSON.stringify(req),
          signal: ac.signal,
        });
      } finally {
        clearTimeout(firstByteWatchdog); // also runs when fetch rejects
      }
      // Surface HTTP errors so the dispatcher can classify them. Attaching
      // `status` assumes isRetryable() reads it from the error.
      if (!res.ok) {
        throw Object.assign(new Error("upstream-http-error"), { status: res.status });
      }
      if (!res.body) throw new Error("no-body");
      // The response has started. Hand back the stream; the client gets every
      // chunk as it arrives, with no buffering.
      return new Response(res.body, {
        headers: { "content-type": "text/event-stream" },
      });
    }

We arrived at 50ms by measuring our own upstream TTFB P99: even the slowest upstream produces its first response byte under 40ms in 99% of cases. Anything above 50ms is signal, not noise. Tune this number to your own upstream P99.
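One way to derive such a threshold from your own measurements is sketched below, assuming you already collect TTFB samples in milliseconds. The helper names and the 1.25 headroom factor are hypothetical; the factor is chosen only because it maps a 40ms P99 to a 50ms cutoff.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new RangeError("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// Hypothetical tuning rule: P99 of observed TTFB plus some headroom.
function watchdogThresholdMs(ttfbSamplesMs: number[], headroom = 1.25): number {
  return Math.ceil(percentile(ttfbSamplesMs, 99) * headroom);
}
```

Recomputing the threshold per upstream, rather than sharing one global value, keeps a naturally slower provider from being treated as permanently hung.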
Layer 3: The billing abacus
Reliability isn't just about uptime — it's about whether billing stays correct under load. Early on we had a race where two concurrent requests both passed the balance check, then both committed, and the user's account went negative by $0.12. The fix was a write-time reservation:
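    // The "billing abacus": reserve estimated cost up front so two concurrent
    // big requests can't both pass the balance check and then overdraw.
    async function reserveBalance(userId: string, estCents: number) {
      // The read-then-update below only serializes concurrent callers if the
      // transaction locks the balance row (e.g. SELECT ... FOR UPDATE) or runs
      // at SERIALIZABLE isolation; at READ COMMITTED both can pass the check.
      return await db.transaction(async (tx) => {
        const bal = await tx.balance.findUnique({ where: { userId } });
        // A missing wallet row is treated the same as insufficient funds.
        if (!bal || bal.balanceCents - bal.reservedCents < estCents) {
          throw new InsufficientQuotaError();
        }
        await tx.balance.update({
          where: { userId },
          data: { reservedCents: { increment: estCents } },
        });
        return { release: () => releaseReservation(userId, estCents) };
      });
    }

Every call estimates its max cost up front (input tokens × input rate + max_tokens × output rate), reserves that amount on the wallet in a single transaction, and releases the unused portion after the actual usage is known. Concurrent requests serialize at the transaction layer, provided the transaction locks the balance row or runs at serializable isolation, so two requests can never both pass the check against the same funds. No overdrafts, ever.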
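The reserve-then-settle arithmetic can be illustrated with an in-memory model. This is a sketch under stated assumptions, not the production wallet: `Rates`, `estimateMaxCostCents`, and `Wallet` are hypothetical names, and the single-threaded class stands in for what a real database would enforce with row locking or an atomic conditional update.

```typescript
interface Rates {
  inputCentsPerToken: number;
  outputCentsPerToken: number;
}

// Worst case: every input token is billed and the model emits max_tokens.
function estimateMaxCostCents(inputTokens: number, maxTokens: number, rates: Rates): number {
  return Math.ceil(
    inputTokens * rates.inputCentsPerToken + maxTokens * rates.outputCentsPerToken,
  );
}

// Hypothetical in-memory wallet; a real one lives in the database.
class Wallet {
  balanceCents: number;
  reservedCents = 0;

  constructor(balanceCents: number) {
    this.balanceCents = balanceCents;
  }

  // Reserve only if the unreserved balance covers the estimate.
  reserve(estCents: number): boolean {
    if (this.balanceCents - this.reservedCents < estCents) return false;
    this.reservedCents += estCents;
    return true;
  }

  // Release the whole reservation and charge what was actually used,
  // never more than the amount that was reserved.
  settle(estCents: number, actualCents: number): void {
    this.reservedCents -= estCents;
    this.balanceCents -= Math.min(actualCents, estCents);
  }
}
```

Because the second request is checked against the balance minus outstanding reservations, it is rejected while the first is still in flight and admitted again once the first settles for less than its estimate.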
The boring stuff that matters more
Every engineer reading this is nodding at the dispatcher diagrams and thinking "cool, weighted retry, first-byte watchdog, I got it." But the things that have actually made our SLO hold up month after month are less exciting:
- Heartbeats on every cron. Token rotation, balance sweeps, usage rollups — every one of them pings a Healthchecks.io endpoint when it completes. A missed heartbeat alerts within 5 minutes. We've caught more outages from missed heartbeats than from explicit alarms.
- A status page that's actually accurate. kunavo.com/status runs synthetic checks against every model every 90 seconds. When an upstream regresses, it goes yellow before customers notice.
- Runbooks before runbooks are needed. Every incident we've had, including the ones we caught before customer impact, produced a runbook entry. The 4am pager has a checklist; the engineer doesn't have to think.
- Sentry on everything, profiling on hot paths. The dispatcher logs every failover decision with the candidate, the reason, and the budget remaining. If a regression sneaks in, the search is "show me failovers with reason=ttfb-50ms grouped by upstream this hour."
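The heartbeat pattern from the first bullet can be sketched as a wrapper around any cron job. This assumes the Healthchecks.io convention of pinging the check URL on success and appending `/start` or `/fail` for the other signals; `heartbeatUrl`, `withHeartbeat`, and the injectable `ping` parameter are hypothetical names, and the check URL you pass in is your own.

```typescript
type Ping = (url: string) => Promise<unknown>;

// Build the ping URL for each phase of a job run.
function heartbeatUrl(baseUrl: string, phase: "start" | "success" | "fail"): string {
  return phase === "success" ? baseUrl : `${baseUrl}/${phase}`;
}

// Run a job and report its lifecycle. Ping failures are swallowed so that a
// monitoring outage never breaks the job itself.
async function withHeartbeat<T>(
  baseUrl: string,
  job: () => Promise<T>,
  ping: Ping = (url) => fetch(url),
): Promise<T> {
  await ping(heartbeatUrl(baseUrl, "start")).catch(() => undefined);
  try {
    const result = await job();
    await ping(heartbeatUrl(baseUrl, "success")).catch(() => undefined);
    return result;
  } catch (err) {
    await ping(heartbeatUrl(baseUrl, "fail")).catch(() => undefined);
    throw err; // the job's own error still propagates
  }
}
```

The alert comes from the monitoring side noticing a missing success ping, so a job that crashes, hangs, or never gets scheduled is caught without any alarm code inside the job.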
What we don't do
A few things we've deliberately avoided, that you might expect a 99.95% gateway to do:
- Multi-region active-active database. Our database is single-region (iad). Replication lag would create billing inconsistencies we don't want to debug. If iad goes hard down, we fail to a read-only mode — customers can still call models (dispatcher is stateless at the edge) but can't see usage history for an hour. That tradeoff is right for us.
- Silently swapping models. Some aggregators, when they can't reach Claude Opus, will route to Haiku. We don't — we'd rather you get a 503 than a different model you didn't pay for. The dispatcher only fails over to the same model on a different upstream.
- Pre-emptively retrying. Anthropic's 429s mean something. Retrying immediately makes the problem worse for everyone. Our backoff is exponential with jitter, capped at 5 attempts and 30 seconds total wait — the same policy we recommend in our error guide.
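The backoff policy in the last bullet can be sketched as a pure function that produces the wait schedule. The 5-attempt and 30-second caps come from the text; the 1-second base delay, the choice of full jitter, and the reading that the 5 attempts include the first try are assumptions, as are the names `BackoffPolicy` and `backoffDelays`.

```typescript
interface BackoffPolicy {
  maxAttempts: number;    // total tries, including the first one (assumed)
  baseDelayMs: number;    // assumed; the text does not state a base delay
  maxTotalWaitMs: number; // cap on the sum of all waits
}

const DEFAULT_POLICY: BackoffPolicy = {
  maxAttempts: 5,
  baseDelayMs: 1_000,
  maxTotalWaitMs: 30_000,
};

// Full jitter: each wait is uniform in [0, base * 2^retry), then clipped so
// the running total never exceeds maxTotalWaitMs.
function backoffDelays(
  policy: BackoffPolicy = DEFAULT_POLICY,
  random: () => number = Math.random,
): number[] {
  const delays: number[] = [];
  let waited = 0;
  // maxAttempts tries means at most maxAttempts - 1 waits between them.
  for (let retry = 0; retry < policy.maxAttempts - 1; retry++) {
    const ceiling = policy.baseDelayMs * 2 ** retry;
    const jittered = Math.floor(random() * ceiling);
    const delay = Math.min(jittered, policy.maxTotalWaitMs - waited);
    delays.push(delay);
    waited += delay;
  }
  return delays;
}
```

Injecting `random` keeps the schedule deterministic in tests, while the default spreads retries from many clients so they do not hit a rate-limited upstream in lockstep.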
Where the budget actually goes
Our error budget for May 2026 was 21 minutes 36 seconds, which is 0.05% of a 30-day window. We used:
- 3 minutes 12 seconds: us-east-1 Anthropic blip during a deploy. Failover triggered, no requests dropped, but P99 latency spiked above the target for the window.
- 1 minute 50 seconds: a kie.ai 5xx burst on a video model. All retries succeeded, but the per-call latency violated SLO.
- 0 hard downtime. No requests returned 5xx with no retry option.
That's 5:02 of budget burn against a 21:36 budget — 23% used, well within target. The other 77% is what lets us deploy aggressively and absorb the next surprise.
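The arithmetic behind those figures, as a small sketch. The function names are hypothetical; the inputs are the SLO, a 30-day window (the window length that yields 21:36), and the two burn entries listed above.

```typescript
// Allowed seconds of SLO violation for a given availability target and window.
function errorBudgetSeconds(sloPercent: number, windowDays: number): number {
  const windowSeconds = windowDays * 24 * 60 * 60;
  return Math.round((windowSeconds * (100 - sloPercent)) / 100);
}

// Format whole seconds as m:ss.
function formatMinSec(totalSeconds: number): string {
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return `${minutes}:${seconds < 10 ? "0" : ""}${seconds}`;
}

const budget = errorBudgetSeconds(99.95, 30);           // 1296 s = 21:36
const burned = (3 * 60 + 12) + (1 * 60 + 50);           // 302 s = 5:02
const usedPercent = Math.round((burned / budget) * 100); // 23
```

The same function applied to a 31-day calendar month gives 1339 seconds, about 22:19, so the budget depends on which window definition the SLO uses.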
What this means for you
If you're building on Kunavo, you don't need to implement your own provider failover, your own TTFB watchdog, or your own Anthropic-vs-Vertex routing. That's the whole point: one base URL, one bearer token, the SLO is ours to defend.
If you're building another gateway and reading this for ideas: the dispatcher is 600 lines of TypeScript. It's not the complicated part. The complicated parts are the failure-mode runbooks, the heartbeats, the status page, and the months of small operational scars that compound into a reliable system. Plan for those before you ship.
See the current uptime numbers and per-region breakdown at kunavo.com/status, or the full error taxonomy in our error docs.