
Chat completions

Kunavo's /v1/chat/completions endpoint is OpenAI-compatible across the Claude and Gemini families. Streaming, tools, vision, and reasoning all work through the same SDK.

This documentation is in English. For a quick-start guide in Indonesian, see: Panduan Bahasa Indonesia, cara mendapatkan Gemini API key (how to get a Gemini API key).

Endpoint: POST /v1/chat/completions. Request and response shapes match OpenAI's chat completions API, including streaming and the optional tool_calls / reasoning_tokens fields. The usage object additionally carries a Kunavo-specific credits_consumed field.

Basic call

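import os
from openai import OpenAI

# Assumption: the environment variable names are illustrative; supply your
# own Kunavo base URL and API key however your deployment stores them.
client = OpenAI(base_url=os.environ["KUNAVO_BASE_URL"], api_key=os.environ["KUNAVO_API_KEY"])

resp = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[
        {"role": "system", "content": "You are a senior staff engineer."},
        {"role": "user", "content": "Pros and cons of postgres LISTEN/NOTIFY for a job queue?"},
    ],
    temperature=0.4,
    max_tokens=800,
)
print(resp.choices[0].message.content)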

Parameters

Every standard OpenAI parameter is accepted. Some only make sense for certain providers — the translator passes through what each upstream supports.

| Param | Type | Notes |
| --- | --- | --- |
| model | string (required) | Any enabled slug from /v1/models. |
| messages | array (required) | Standard OpenAI message format. |
| temperature | 0..2 | Sampling temperature. Default 1. |
| top_p | 0..1 | Nucleus sampling. Mutually exclusive with temperature in some models. |
| max_tokens | int | Output cap. Reasoning tokens count separately. |
| stream | bool | Stream chunks as SSE. See the Streaming section. |
| tools | array | Function/tool definitions. Claude and Gemini both support tool use. |
| tool_choice | auto, none, or named | Force a specific tool or let the model decide. |
| response_format | object | Set {"type": "json_object"} for guaranteed JSON output. |
| seed | int | Deterministic sampling where supported. |
| stop | string or array | Hard stop sequences. |
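
The ranges in the table can be checked on the client before a request goes out. The following is a minimal sketch, assuming a hypothetical helper named build_request (it is not part of Kunavo or the OpenAI SDK) that enforces only the bounds listed above; the server remains the authority on what each model accepts.

```python
def build_request(model, messages, **options):
    """Return kwargs for chat.completions.create after basic range checks.

    Hypothetical helper: the bounds mirror the parameter table above.
    """
    if not model:
        raise ValueError("model is required")
    if not messages:
        raise ValueError("messages is required")

    temperature = options.get("temperature")
    if temperature is not None and not 0 <= temperature <= 2:
        raise ValueError("temperature must be in 0..2")

    top_p = options.get("top_p")
    if top_p is not None and not 0 <= top_p <= 1:
        raise ValueError("top_p must be in 0..1")

    stop = options.get("stop")
    if stop is not None and not isinstance(stop, (str, list)):
        raise ValueError("stop must be a string or a list of strings")

    return {"model": model, "messages": messages, **options}


request = build_request(
    "claude-sonnet-4-6",
    [{"role": "user", "content": "Reply with a JSON object."}],
    temperature=0.2,
    response_format={"type": "json_object"},
    seed=7,
)
print(sorted(request))
```

Passing the result as client.chat.completions.create(**request) would send it; the sketch itself makes no network call.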

Streaming

Set stream=True. Kunavo emits server-sent events in OpenAI's format: each chunk is a chat.completion.chunk object with the new text in choices[0].delta.content. The final chunk carries the usage payload, and the stream ends with data: [DONE].

stream = client.chat.completions.create(
    model="gemini-2-5-pro",
    messages=[{"role": "user", "content": "Explain B-trees in one paragraph."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

Streaming works for every chat model. For Claude (Anthropic-protocol upstream), we translate the Anthropic event stream into OpenAI deltas on the fly, so your SDK doesn't notice the difference.
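
For clients that read the HTTP stream directly instead of going through the SDK, the wire format can be handled roughly as follows. This is a sketch under assumptions: iter_delta_text is a hypothetical helper, and the sample lines are hand-written to match the OpenAI chunk shape, not captured from Kunavo.

```python
import json


def iter_delta_text(sse_lines):
    """Yield delta.content strings from OpenAI-style SSE lines until [DONE]."""
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue  # a chunk with no choices carries no text
        text = (choices[0].get("delta") or {}).get("content")
        if text:
            yield text


# Hand-written sample lines for illustration, not real server output.
sample = [
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "B-trees "}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "stay balanced."}}]}',
    "",
    "data: [DONE]",
]
text = "".join(iter_delta_text(sample))
print(text)
```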

Tool / function calling

Tool calling works across providers. Define tools as JSON schema; the model returns tool_calls in its message; you execute and feed results back as role: "tool" messages.

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["c", "f"]},
            },
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",
)

# Inspect tool calls the model wants to make
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
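
The second half of the loop, executing the requested tools and returning results as role: "tool" messages, might look like the sketch below. TOOL_REGISTRY and run_tool_calls are hypothetical names, the weather function is a stand-in that returns a fixed value, and the fake tool call exists only so the example runs without a network request.

```python
import json
from types import SimpleNamespace


def get_current_weather(city, unit="c"):
    # Stand-in implementation; a real tool would query a weather service.
    return {"city": city, "unit": unit, "temperature": 21}


TOOL_REGISTRY = {"get_current_weather": get_current_weather}


def run_tool_calls(tool_calls, registry):
    """Execute each requested tool and build the role: "tool" reply messages."""
    replies = []
    for call in tool_calls or []:
        # The model sends arguments as a JSON-encoded string.
        args = json.loads(call.function.arguments or "{}")
        result = registry[call.function.name](**args)
        replies.append({
            "role": "tool",
            "tool_call_id": call.id,  # ties the result to the model's request
            "content": json.dumps(result),
        })
    return replies


# Fake tool call shaped like an SDK object, so the sketch runs offline.
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_current_weather", arguments='{"city": "Tokyo"}'),
)
tool_messages = run_tool_calls([fake_call], TOOL_REGISTRY)
print(tool_messages)
```

In a real exchange you would append resp.choices[0].message and then these tool messages to the conversation, and call client.chat.completions.create again with the same tools so the model can write its final answer.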

Vision / multimodal input

Models with the vision capability accept image content blocks. Use either an HTTPS URL or a data: base64 URI.

# Pass an image URL or a base64 data URI as part of a multimodal message
resp = client.chat.completions.create(
    model="gemini-2-5-pro",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {
                "url": "https://example.com/cat.jpg"
            }},
        ],
    }],
)
print(resp.choices[0].message.content)

Vision-capable models in the catalog: Claude Opus / Sonnet / Haiku 4.x, Gemini 2.5 Pro / 2.5 Flash.
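
For local files, the data: URI variant mentioned above can be built with the standard library. A minimal sketch, assuming hypothetical helper names to_data_uri and image_block, with placeholder bytes standing in for a real image:

```python
import base64


def to_data_uri(image_bytes, mime_type="image/jpeg"):
    """Encode raw image bytes as a data: URI usable in an image_url block."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_block(image_bytes, mime_type="image/jpeg"):
    """Build the image_url content block for a multimodal message."""
    return {"type": "image_url", "image_url": {"url": to_data_uri(image_bytes, mime_type)}}


# Placeholder bytes; in practice read them with open(path, "rb").read().
block = image_block(b"\xff\xd8\xff", "image/jpeg")
print(block["image_url"]["url"])
```

The resulting block goes in the content list next to the text block, in place of the HTTPS URL entry.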

Reasoning tokens

Models with extended thinking (Claude Opus 4.7, Sonnet 4.6, Gemini 2.5 Pro) emit reasoning tokens in addition to visible output. They're billed at the output rate. Inspect them via usage.completion_tokens_details.reasoning_tokens.

# Claude's "extended thinking" mode + Gemini "thinking" mode both surface
# reasoning_tokens in usage. You're billed for them at the output rate.
resp = client.chat.completions.create(
    model="claude-opus-4-7",
    messages=[{"role": "user", "content": "Plan a 3-week MLOps migration."}],
)
print(resp.usage.completion_tokens_details.reasoning_tokens)
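
Code that handles several models may want to read this field defensively. The sketch below assumes, without confirmation from this page, that completion_tokens_details can be missing or None for models that do not think; reasoning_tokens is a hypothetical helper, and the usage objects are stand-ins with made-up numbers.

```python
from types import SimpleNamespace


def reasoning_tokens(usage):
    """Return reasoning tokens from a usage object, or 0 when absent."""
    details = getattr(usage, "completion_tokens_details", None)
    return getattr(details, "reasoning_tokens", None) or 0


# Stand-in usage objects with made-up numbers, shaped like the SDK's.
with_thinking = SimpleNamespace(
    completion_tokens=300,
    completion_tokens_details=SimpleNamespace(reasoning_tokens=1200),
)
without_thinking = SimpleNamespace(completion_tokens=300, completion_tokens_details=None)

print(reasoning_tokens(with_thinking), reasoning_tokens(without_thinking))
```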

Prompt caching

A long system prompt, a reference document, a few-shot block — any stable prefix can be cached upstream and replayed on later calls at a fraction of the input price. Cache hits surface in the usage object as prompt_tokens_details.cached_tokens.

Gemini and GPT models cache automatically — no request change needed. cached_tokens is a subset of prompt_tokens and is billed at a reduced cache-read rate.

Claude caches only the prefix you mark with a cache_control breakpoint. Through this OpenAI-compatible endpoint, attach it to a content block:

# Claude caches prefixes you mark with cache_control. Attach it to a content
# block; later calls reusing that prefix read it at ~10% of the input price.
resp = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": LONG_DOCUMENT,            # the stable, reused prefix
                "cache_control": {"type": "ephemeral"},
            },
            {"type": "text", "text": "Summarize the document above."},
        ],
    }],
)
print(resp.usage.prompt_tokens_details.cached_tokens)

For full control over Claude prompt caching (caching the system prompt and tool definitions, multiple breakpoints, and the native cache_creation_input_tokens / cache_read_input_tokens usage fields), use the native Messages API.
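
Because cached_tokens is a subset of prompt_tokens, the share of a prompt served from cache is a simple ratio. A sketch, assuming a hypothetical helper cache_hit_ratio and a stand-in usage object with made-up numbers:

```python
from types import SimpleNamespace


def cache_hit_ratio(usage):
    """Fraction of prompt tokens read from cache, between 0.0 and 1.0."""
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", None) or 0
    if not usage.prompt_tokens:
        return 0.0
    return cached / usage.prompt_tokens


# Stand-in usage object; the numbers are illustrative only.
usage = SimpleNamespace(
    prompt_tokens=10_000,
    prompt_tokens_details=SimpleNamespace(cached_tokens=9_000),
)
print(cache_hit_ratio(usage))
```

A ratio near zero on repeated calls suggests the prefix is changing between requests or, for Claude, that no cache_control breakpoint was set.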

Usage object

Every response (and the final streaming chunk) includes a usage object:

| Field | Meaning |
| --- | --- |
| prompt_tokens | Input tokens we billed, cached tokens included. |
| prompt_tokens_details.cached_tokens | Cached input: a subset of prompt_tokens, billed at a reduced cache-read rate. |
| completion_tokens | Visible output tokens. |
| completion_tokens_details.reasoning_tokens | Reasoning tokens (billed at output rate). |
| total_tokens | Sum of input + output + reasoning. |
| credits_consumed | Kunavo addition. Raw cost in kie credits (1 credit = $0.005). |
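
The credits_consumed field converts to dollars with the rate given in the table. A small sketch, assuming a hypothetical helper credits_to_usd; Decimal is used so the result does not pick up floating-point rounding noise.

```python
from decimal import Decimal

USD_PER_CREDIT = Decimal("0.005")  # from the table above: 1 credit = $0.005


def credits_to_usd(credits_consumed):
    """Convert the credits_consumed usage field to US dollars."""
    # Going through str() avoids carrying binary float error into Decimal.
    return Decimal(str(credits_consumed)) * USD_PER_CREDIT


# Illustrative value; a real one would come from resp.usage.credits_consumed.
print(credits_to_usd(200))
```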

Where to go next