
Chat completions

Kunavo's /v1/chat/completions endpoint is OpenAI-compatible across the Claude and Gemini families. Streaming, tools, vision and reasoning all work through the same SDK.

This documentation is in English. For a quick guide in Indonesian, see the Indonesian guide on how to get a Gemini API key.

Endpoint: POST /v1/chat/completions. Request and response shape match OpenAI's chat completions API exactly — including streaming and the optional tool_calls / reasoning_tokens fields.

Basic call

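import os

from openai import OpenAI

# Point the OpenAI SDK at Kunavo; the key is read from the environment.
client = OpenAI(
    base_url="https://api.kunavo.com/v1",
    api_key=os.environ["KUNAVO_API_KEY"],
)

resp = client.chat.completions.create(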
    model="claude-sonnet-4-6",
    messages=[
        {"role": "system", "content": "You are a senior staff engineer."},
        {"role": "user", "content": "Pros and cons of postgres LISTEN/NOTIFY for a job queue?"},
    ],
    temperature=0.4,
    max_tokens=800,
)
print(resp.choices[0].message.content)

Parameters

Every standard OpenAI parameter is accepted. Some only make sense for certain providers — the translator passes through what each upstream supports.

Param	Type	Notes
`model`	string (required)	Any enabled slug from /v1/models.
`messages`	array (required)	Standard OpenAI message format.
`temperature`	0..2	Sampling temperature. Default 1.
`top_p`	0..1	Nucleus sampling. Mutually exclusive with temperature in some models.
`max_tokens`	int	Output cap. Reasoning tokens count separately.
`stream`	bool	Stream chunks as SSE. See below.
`tools`	array	Function/tool definitions. Claude and Gemini both support tool use.
`tool_choice`	auto | none | named	Force a specific tool or let the model decide.
`response_format`	object	Set {"type": "json_object"} for guaranteed JSON output.
`seed`	int	Deterministic sampling where supported.
`stop`	string | array	Hard stop sequences.
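As an illustration of how a couple of these parameters combine, here is a minimal sketch of a helper that requests JSON mode and parses the reply. The helper name ask_json is hypothetical, and it assumes the model returns a single JSON document in message.content when response_format is set as described in the table.

```python
import json


def ask_json(client, model, prompt):
    """Request JSON mode and parse the reply into a Python object.

    `client` is any OpenAI-SDK-style client; this helper is illustrative,
    not part of any SDK.
    """
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for JSON output
        temperature=0,                             # keep sampling tight for structured data
    )
    # message.content is a string; json.loads raises ValueError if it is not valid JSON.
    return json.loads(resp.choices[0].message.content)
```

Calling ask_json(client, "claude-sonnet-4-6", "Return a JSON object with keys name and year.") would then hand back a dict rather than a string. Wrapping the json.loads call in a try/except is a reasonable addition if you want to retry on malformed output.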

Streaming

Set stream=True. Kunavo emits server-sent events in OpenAI's exact format: each chunk is a chat.completion.chunk with choices[0].delta.content. The final chunk carries the usage payload, and the stream then ends with data: [DONE].

stream = client.chat.completions.create(
    model="gemini-2-5-pro",
    messages=[{"role": "user", "content": "Explain B-trees in one paragraph."}],
    stream=True,
)

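for chunk in stream:
    # A usage-only chunk may carry no choices, so guard before indexing.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)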

Streaming works for every chat model. For Claude (Anthropic-protocol upstream), we translate the Anthropic event stream into OpenAI deltas on the fly — your SDK doesn't notice the difference.
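To make the wire format concrete, the sketch below parses raw SSE lines by hand and joins the delta.content pieces, which is roughly what the SDK does for you. The sample events are hand-written for illustration rather than captured from a real response, and the usage-only final chunk with an empty choices list is an assumption modeled on OpenAI's format.

```python
import json


def collect_text(sse_lines):
    """Concatenate delta.content from raw SSE lines until data: [DONE]."""
    parts = []
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data field
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                parts.append(text)
    return "".join(parts)


# Hand-written sample events (illustrative only).
sample = [
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "Hello"}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": " world"}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [], "usage": {"prompt_tokens": 5, "completion_tokens": 2, "total_tokens": 7}}',
    "",
    "data: [DONE]",
]

text = collect_text(sample)
print(text)
```

In normal use there is no reason to parse the stream yourself; the sketch is only meant to show what each event contains.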

Tool / function calling

Tool calling works across providers. Define tools as JSON schema; the model returns tool_calls in its message; you execute and feed results back as role: "tool" messages.

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["c", "f"]},
            },
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",
)

# Inspect tool calls the model wants to make
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
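
The second half of the loop, executing the calls and feeding results back, might look like the sketch below. The weather function is a hypothetical stub, TOOL_REGISTRY and run_tool_calls are illustrative names, and the fake tool call stands in for what resp.choices[0].message.tool_calls would contain.

```python
import json
from types import SimpleNamespace


def get_current_weather(city, unit="c"):
    # Hypothetical stub; a real implementation would query a weather service.
    return {"city": city, "temp": 21, "unit": unit}


# Maps tool names from the schema to local Python callables.
TOOL_REGISTRY = {"get_current_weather": get_current_weather}


def run_tool_calls(tool_calls):
    """Execute each requested tool and return role: "tool" messages."""
    messages = []
    for call in tool_calls or []:
        func = TOOL_REGISTRY[call.function.name]
        args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        result = func(**args)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,          # ties the result to the call that asked for it
            "content": json.dumps(result),
        })
    return messages


# Stand-in for resp.choices[0].message.tool_calls.
fake_calls = [SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_current_weather", arguments='{"city": "Tokyo"}'),
)]

tool_messages = run_tool_calls(fake_calls)
print(tool_messages)
```

To finish the round trip, append the assistant message that contained the tool_calls and then these tool messages to the conversation, and call client.chat.completions.create again with the same tools.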

Vision / multimodal input

Models with the vision capability accept image content blocks. Use either an HTTPS URL or a data: base64 URI.

# Pass an image URL or a base64 data URI as part of a multimodal message
resp = client.chat.completions.create(
    model="gemini-2-5-pro",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {
                "url": "https://example.com/cat.jpg"
            }},
        ],
    }],
)
print(resp.choices[0].message.content)

Vision-capable models in the catalog: Claude Opus / Sonnet / Haiku 4.x, Gemini 2.5 Pro / 2.5 Flash.
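
For the base64 route, a small sketch of building the data: URI from raw image bytes follows. The helper names to_data_uri and image_block are illustrative, and the eight bytes used here are just the PNG file signature rather than a complete image.

```python
import base64


def to_data_uri(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a data: URI suitable for an image_url block."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_block(url_or_data_uri):
    """Wrap an HTTPS URL or data URI in the image_url content block shape."""
    return {"type": "image_url", "image_url": {"url": url_or_data_uri}}


# In practice, read the bytes from disk: open("cat.png", "rb").read()
png_signature = b"\x89PNG\r\n\x1a\n"
uri = to_data_uri(png_signature)
block = image_block(uri)
print(block["image_url"]["url"])
```

The resulting block drops into the content list in place of the HTTPS URL variant shown above.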

Reasoning tokens

Models with extended thinking (Claude Opus 4.7, Sonnet 4.6, Gemini 2.5 Pro) emit reasoning tokens in addition to visible output. They're billed at the output rate. Inspect them via usage.completion_tokens_details.reasoning_tokens.

# Claude's "extended thinking" mode + Gemini "thinking" mode both surface
# reasoning_tokens in usage. You're billed for them at the output rate.
resp = client.chat.completions.create(
    model="claude-opus-4-7",
    messages=[{"role": "user", "content": "Plan a 3-week MLOps migration."}],
)
print(resp.usage.completion_tokens_details.reasoning_tokens)

Prompt caching

A long system prompt, a reference document, a few-shot block — any stable prefix can be cached upstream and replayed on later calls at a fraction of the input price. Cache hits surface in the usage object as prompt_tokens_details.cached_tokens.

Gemini and GPT models cache automatically — no request change needed. cached_tokens is a subset of prompt_tokens and is billed at a reduced cache-read rate.

Claude caches only the prefix you mark with a cache_control breakpoint. Through this OpenAI-compatible endpoint, attach it to a content block:

# Claude caches prefixes you mark with cache_control. Attach it to a content
# block; later calls reusing that prefix read it at ~10% of the input price.
resp = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": LONG_DOCUMENT,            # the stable, reused prefix
                "cache_control": {"type": "ephemeral"},
            },
            {"type": "text", "text": "Summarize the document above."},
        ],
    }],
)
print(resp.usage.prompt_tokens_details.cached_tokens)

For full control over Claude prompt caching — caching the system prompt and tool definitions, multiple breakpoints, and the native cache_creation_input_tokens / cache_read_input_tokens usage fields — use the native Messages API.
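
To check whether caching is actually kicking in, one option is to track the share of prompt tokens served from cache. The sketch below assumes only the usage fields named in this section; the helper name cache_hit_ratio and the sample numbers are illustrative.

```python
from types import SimpleNamespace


def cache_hit_ratio(usage):
    """Return cached_tokens / prompt_tokens, or 0.0 when nothing was cached."""
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0  # field may be absent or None
    if not usage.prompt_tokens:
        return 0.0
    return cached / usage.prompt_tokens


# Stand-in for resp.usage on a call that reused a cached prefix.
sample_usage = SimpleNamespace(
    prompt_tokens=1000,
    prompt_tokens_details=SimpleNamespace(cached_tokens=750),
)
ratio = cache_hit_ratio(sample_usage)
print(ratio)
```

A ratio near zero on the second and later calls with the same prefix suggests the prefix is changing between requests or, for Claude, that the breakpoint is missing.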

Usage object

Every response (and the final streaming chunk) includes a usage object:

Field	Meaning
`prompt_tokens`	Input tokens we billed — cached tokens included.
`prompt_tokens_details.cached_tokens`	Cached input — a subset of prompt_tokens, billed at a reduced cache-read rate.
`completion_tokens`	Visible output tokens.
`completion_tokens_details.reasoning_tokens`	Reasoning tokens (billed at output rate).
`total_tokens`	Sum of input + output + reasoning.
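
Putting the table together, a rough cost estimate could be computed as in the sketch below. The per-million-token rates are made-up placeholders, not Kunavo prices, and the formula assumes the field semantics exactly as listed: cached tokens are a subset of prompt_tokens, and reasoning tokens are counted apart from completion_tokens.

```python
def estimate_cost(usage, input_rate, cache_read_rate, output_rate):
    """Estimate request cost in currency units; rates are per million tokens."""
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    reasoning = usage.get("completion_tokens_details", {}).get("reasoning_tokens", 0)
    uncached = usage["prompt_tokens"] - cached       # cached tokens are a subset of prompt_tokens
    output = usage["completion_tokens"] + reasoning  # reasoning billed at the output rate
    total = uncached * input_rate + cached * cache_read_rate + output * output_rate
    return total / 1_000_000


# Illustrative usage payload; numbers are invented.
sample_usage = {
    "prompt_tokens": 1000,
    "prompt_tokens_details": {"cached_tokens": 800},
    "completion_tokens": 200,
    "completion_tokens_details": {"reasoning_tokens": 300},
    "total_tokens": 1500,
}

# Placeholder rates per million tokens (hypothetical).
cost = estimate_cost(sample_usage, input_rate=3.0, cache_read_rate=0.3, output_rate=15.0)
print(f"{cost:.5f}")
```

Substitute the real rates for the model you call; the structure of the calculation is the part worth keeping.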

Ollama compatibility

Ollama serves its local models through an OpenAI-compatible API at http://localhost:11434/v1, implementing the same /chat/completions contract documented on this page. That makes the two interchangeable at the client: anything written against Ollama's v1 API runs against Kunavo's by changing the base URL and the model.

import os

from openai import OpenAI

# Before — Ollama's OpenAI-compatible API, local models
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",                       # ignored locally
)
resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Summarize this changelog."}],
)

# After — the same call against hosted frontier models.
# Two lines changed. Everything below them is untouched.
client = OpenAI(
    base_url="https://api.kunavo.com/v1",   # <- was localhost:11434/v1
    api_key=os.environ["KUNAVO_API_KEY"],
)
resp = client.chat.completions.create(
    model="claude-sonnet-4-6",              # <- was llama3.2
    messages=[{"role": "user", "content": "Summarize this changelog."}],
)

The pieces that usually break on a provider switch do not break here, because the wire format is the same one:

Feature	Ollama v1	Kunavo v1
Endpoint path	`/v1/chat/completions`	`/v1/chat/completions`
Streaming	SSE, chat.completion.chunk	Identical — same delta shape
Tool calling	tools / tool_calls	Identical across Claude, Gemini and GPT
Vision input	image_url blocks on multimodal local models	Same blocks; URL or base64 data URI
api_key	Placeholder, ignored	A real Kunavo key
model	A pulled tag (llama3.2)	A hosted slug (claude-sonnet-4-6)
usage extras	Token counts only	Adds cached_tokens and reasoning_tokens
Beyond chat	completions, embeddings	Plus images, video and audio endpoints

Keeping both? Read base_url, model and the key from environment variables and switch per environment — local models for offline iteration, hosted models where quality matters, with one code path. The full walkthrough, including what Ollama's OpenAI surface does and does not implement, is in the Ollama OpenAI-compatible API guide.
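
A minimal sketch of that environment-driven setup is shown below. The variable names LLM_BASE_URL, LLM_API_KEY and LLM_MODEL are hypothetical, and the defaults fall back to the local Ollama values used earlier on this page.

```python
import os


def llm_settings(env=None):
    """Resolve base URL, key and model from environment variables.

    Falls back to local Ollama defaults when a variable is unset.
    """
    env = os.environ if env is None else env
    return {
        "base_url": env.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        "api_key": env.get("LLM_API_KEY", "ollama"),  # placeholder, ignored by Ollama
        "model": env.get("LLM_MODEL", "llama3.2"),
    }


# Development: nothing set, so local defaults apply.
local = llm_settings({})

# Production: the same code path with hosted values injected.
hosted = llm_settings({
    "LLM_BASE_URL": "https://api.kunavo.com/v1",
    "LLM_API_KEY": "example-key",
    "LLM_MODEL": "claude-sonnet-4-6",
})
print(local["model"], hosted["model"])
```

The settings then feed the client unchanged: OpenAI(base_url=s["base_url"], api_key=s["api_key"]) followed by client.chat.completions.create(model=s["model"], ...).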

Is Kunavo's /v1/chat/completions compatible with Ollama's OpenAI API?

Yes. Both implement the same OpenAI chat completions contract, so a client written against Ollama's http://localhost:11434/v1 works against https://api.kunavo.com/v1 once the base URL, model and key are swapped. Request fields (messages, temperature, max_tokens, stream, tools, tool_choice, response_format, stop, seed) and response fields (choices[].message, choices[].delta on streams, usage) match. The remaining differences are the ones you would expect from hosted models: the api_key is real rather than a placeholder, model takes a hosted slug such as claude-sonnet-4-6 instead of a pulled tag like llama3.2, and usage carries cached_tokens and reasoning_tokens that local models do not report.

How do I migrate an Ollama app to hosted Claude, Gemini or GPT?

Change two lines: set base_url from http://localhost:11434/v1 to https://api.kunavo.com/v1, and set model from the local tag to a hosted slug. Pass a real Kunavo key instead of the placeholder Ollama ignores. Streaming, tool calling and vision code paths need no changes because the SSE and tool_calls shapes are identical.

Can I keep Ollama for development and hosted models in production?

Yes, and it is a common setup — read the base URL, model and key from environment variables and switch per environment. Since both ends speak the same OpenAI protocol there is no second code path to maintain: local Llama or Qwen for offline iteration, hosted Claude or Gemini where quality matters. Tool definitions and streaming handlers are shared verbatim.

Which OpenAI endpoints does Kunavo add over Ollama's compatible surface?

Ollama's OpenAI-compatible surface covers /chat/completions, /completions, /embeddings and /models. Kunavo serves those plus /v1/images/generations, /v1/images/edits, /v1/video/generations, /v1/videos, /v1/audio/speech, /v1/audio/transcriptions and /v1/audio/music, and additionally exposes Claude on the native Anthropic /v1/messages endpoint — all on the same key.