Chat completions

Kunavo's /v1/chat/completions endpoint is bit-for-bit OpenAI-compatible across the Claude and Gemini families. Streaming, tools, vision, reasoning — all work via the same SDK.

This documentation is in English. For a quick-start guide in Portuguese, see: Guia em português - Claude API no Brasil

Endpoint: POST /v1/chat/completions. Request and response shape match OpenAI's chat completions API exactly — including streaming and the optional tool_calls / reasoning_tokens fields.

Basic call

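import os
from openai import OpenAI

# Assumption: the environment variable names below are placeholders for your
# Kunavo base URL and API key.
client = OpenAI(base_url=os.environ["KUNAVO_BASE_URL"], api_key=os.environ["KUNAVO_API_KEY"])

resp = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[
        {"role": "system", "content": "You are a senior staff engineer."},
        {"role": "user", "content": "Pros and cons of postgres LISTEN/NOTIFY for a job queue?"},
    ],
    temperature=0.4,
    max_tokens=800,
)
print(resp.choices[0].message.content)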

Parameters

Every standard OpenAI parameter is accepted. Some only make sense for certain providers — the translator passes through what each upstream supports.

| Param | Type | Notes |
| --- | --- | --- |
| model | string (required) | Any enabled slug from /v1/models. |
| messages | array (required) | Standard OpenAI message format. |
| temperature | 0..2 | Sampling temperature. Default 1. |
| top_p | 0..1 | Nucleus sampling. Mutually exclusive with temperature in some models. |
| max_tokens | int | Output cap. Reasoning tokens count separately. |
| stream | bool | Stream chunks as SSE. See below. |
| tools | array | Function/tool definitions. Claude and Gemini both support tool use. |
| tool_choice | auto, none, or a named tool | Force a specific tool or let the model decide. |
| response_format | object | Set {type: "json_object"} for guaranteed JSON output. |
| seed | int | Deterministic sampling where supported. |
| stop | string or array | Hard stop sequences. |
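
As a rough sketch of how several of these parameters combine, the helpers below assemble keyword arguments for client.chat.completions.create and parse a JSON-mode reply. The names build_json_request and parse_json_reply are hypothetical, not part of Kunavo or the OpenAI SDK, and the temperature and max_tokens values are arbitrary illustrations.

```python
import json


def build_json_request(model, prompt, seed=None, stop=None):
    """Assemble kwargs for client.chat.completions.create (hypothetical helper)."""
    kwargs = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
        "max_tokens": 400,
        # JSON mode, as listed in the parameter table.
        "response_format": {"type": "json_object"},
    }
    # Optional parameters are only sent when the caller sets them.
    if seed is not None:
        kwargs["seed"] = seed
    if stop is not None:
        kwargs["stop"] = stop
    return kwargs


def parse_json_reply(content):
    """Parse the message content of a JSON-mode reply; raises ValueError if invalid."""
    return json.loads(content)


request = build_json_request(
    "claude-sonnet-4-6", "List three B-tree properties as JSON.", seed=7
)
# With a configured client, the call would look like:
# resp = client.chat.completions.create(**request)
# data = parse_json_reply(resp.choices[0].message.content)
```

Leaving unset optional parameters out of the request, rather than sending null values, keeps the payload close to what each upstream expects.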

Streaming

Set stream=True. Kunavo emits server-sent events in OpenAI's exact format: each chunk is a chat.completion.chunk with choices[0].delta.content. The usage payload arrives in the final chunk, immediately before the data: [DONE] terminator.

stream = client.chat.completions.create(
    model="gemini-3-pro",
    messages=[{"role": "user", "content": "Explain B-trees in one paragraph."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

Streaming works for every chat model. For Claude (Anthropic-protocol upstream), we translate the Anthropic event stream into OpenAI deltas on the fly, so your SDK doesn't notice the difference.
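
If you want the complete reply as a string rather than only printing it, one option is to accumulate the deltas. The sketch below assumes chunks shaped like chat.completion.chunk; collect_stream and _fake_chunk are hypothetical names, and the stand-in chunks exist only so the example runs without a network call. The check for empty choices is defensive, since a usage-only chunk may carry none.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text deltas of a chunk stream and keep the last usage object seen."""
    parts = []
    usage = None
    for chunk in chunks:
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
        # Defensive: skip chunks that carry no choices (for example usage-only chunks).
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts), usage


def _fake_chunk(text=None, usage=None):
    """Build a stand-in chunk object for offline illustration."""
    choices = [] if text is None else [SimpleNamespace(delta=SimpleNamespace(content=text))]
    return SimpleNamespace(choices=choices, usage=usage)


text, usage = collect_stream([
    _fake_chunk("B-trees "),
    _fake_chunk("stay balanced."),
    _fake_chunk(usage=SimpleNamespace(total_tokens=42)),  # illustrative number
])
print(text)
```

With a real stream, the same helper would be called as collect_stream(stream).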

Tool / function calling

Tool calling works across providers. Define tools as JSON schema; the model returns tool_calls in its message; you execute and feed results back as role: "tool" messages.

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["c", "f"]},
            },
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",
)

# Inspect tool calls the model wants to make
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
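
The second half of the loop, executing each call and returning the result as a role: "tool" message, could be sketched as follows. The weather function body, TOOL_REGISTRY, and run_tool_calls are hypothetical illustrations; the stand-in tool call mimics the shape of resp.choices[0].message.tool_calls so the example runs offline.

```python
import json
from types import SimpleNamespace


def get_current_weather(city, unit="c"):
    """Placeholder implementation; a real one would query a weather service."""
    return {"city": city, "unit": unit, "temperature": 21}


# Maps tool names from the tools definition to local callables.
TOOL_REGISTRY = {"get_current_weather": get_current_weather}


def run_tool_calls(tool_calls, registry=TOOL_REGISTRY):
    """Execute each requested tool and build the role: "tool" messages to send back."""
    messages = []
    for call in tool_calls or []:
        # Arguments arrive as a JSON-encoded string.
        args = json.loads(call.function.arguments)
        result = registry[call.function.name](**args)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
    return messages


# Stand-in for resp.choices[0].message.tool_calls.
fake_calls = [SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_current_weather", arguments='{"city": "Tokyo"}'),
)]
tool_messages = run_tool_calls(fake_calls)
print(tool_messages)
```

In a real exchange you would append the assistant message that contained the tool calls, then these tool messages, to the conversation and call client.chat.completions.create again so the model can write its final answer.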

Vision / multimodal input

Models with the vision capability accept image content blocks. Use either an HTTPS URL or a data: base64 URI.

# Pass an image URL or a base64 data URI as part of a multimodal message
resp = client.chat.completions.create(
    model="gemini-3-pro",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {
                "url": "https://example.com/cat.jpg"
            }},
        ],
    }],
)
print(resp.choices[0].message.content)

Vision-capable models in the catalog: Claude Opus / Sonnet / Haiku 4.x, Gemini 3 Pro / 3 Flash, Gemini 2.5 Pro.
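
For local images, the data: URI variant can be built with the standard library. This is a minimal sketch; to_data_uri and image_block are hypothetical helper names, and the three bytes used here are just the JPEG magic number standing in for real file contents.

```python
import base64


def to_data_uri(image_bytes, mime_type="image/jpeg"):
    """Encode raw image bytes as a data: URI usable in an image_url block."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_block(url_or_data_uri):
    """Wrap an HTTPS URL or data URI in the image_url content-block shape."""
    return {"type": "image_url", "image_url": {"url": url_or_data_uri}}


# In practice the bytes would come from a file, e.g. open("cat.jpg", "rb").read()
block = image_block(to_data_uri(b"\xff\xd8\xff", "image/jpeg"))
print(block["image_url"]["url"])
```

The resulting block drops into the content list in place of the HTTPS image_url entry.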

Reasoning tokens

Models with extended thinking (Claude Opus 4.7, Sonnet 4.6, Gemini 3 Pro) emit reasoning tokens in addition to visible output. They're billed at the output rate. Inspect them via usage.completion_tokens_details.reasoning_tokens.

# Claude's "extended thinking" mode + Gemini "thinking" mode both surface
# reasoning_tokens in usage. You're billed for them at the output rate.
resp = client.chat.completions.create(
    model="claude-opus-4-7",
    messages=[{"role": "user", "content": "Plan a 3-week MLOps migration."}],
)
print(resp.usage.completion_tokens_details.reasoning_tokens)
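
Because reasoning_tokens is an optional field, code that runs against several models may want to read it defensively. The sketch below assumes the accounting described on this page (completion_tokens counts visible output, and reasoning tokens are billed at the output rate); both helper names are hypothetical and the token counts are made up for illustration.

```python
from types import SimpleNamespace


def reasoning_tokens(usage):
    """Return reasoning tokens from a usage object, or 0 when the field is absent."""
    details = getattr(usage, "completion_tokens_details", None)
    return getattr(details, "reasoning_tokens", None) or 0


def output_billed_tokens(usage):
    """Tokens billed at the output rate: visible output plus reasoning."""
    return usage.completion_tokens + reasoning_tokens(usage)


# Stand-ins for resp.usage, with illustrative numbers.
with_thinking = SimpleNamespace(
    completion_tokens=300,
    completion_tokens_details=SimpleNamespace(reasoning_tokens=1200),
)
without_thinking = SimpleNamespace(completion_tokens=300, completion_tokens_details=None)

print(output_billed_tokens(with_thinking), output_billed_tokens(without_thinking))
```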

Prompt caching

A long system prompt, a reference document, a few-shot block — any stable prefix can be cached upstream and replayed on later calls at a fraction of the input price. Cache hits surface in the usage object as prompt_tokens_details.cached_tokens.

Gemini and GPT models cache automatically — no request change needed. cached_tokens is a subset of prompt_tokens and is billed at a reduced cache-read rate.

Claude caches only the prefix you mark with a cache_control breakpoint. Through this OpenAI-compatible endpoint, attach it to a content block:

# Claude caches prefixes you mark with cache_control. Attach it to a content
# block; later calls reusing that prefix read it at ~10% of the input price.
resp = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": LONG_DOCUMENT,            # the stable, reused prefix
                "cache_control": {"type": "ephemeral"},
            },
            {"type": "text", "text": "Summarize the document above."},
        ],
    }],
)
print(resp.usage.prompt_tokens_details.cached_tokens)

For full control over Claude prompt caching (caching the system prompt and tool definitions, multiple breakpoints, and the native cache_creation_input_tokens / cache_read_input_tokens usage fields), use the native Messages API.
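
To check whether a prefix is actually being reused, one simple measure is the share of prompt tokens served from cache. Since cached_tokens is a subset of prompt_tokens, the ratio falls between 0 and 1. The helper name cache_hit_ratio is hypothetical and the token counts are invented for illustration.

```python
from types import SimpleNamespace


def cache_hit_ratio(usage):
    """Fraction of prompt tokens read from cache (0.0 when nothing was cached)."""
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", None) or 0
    if not usage.prompt_tokens:
        return 0.0
    return cached / usage.prompt_tokens


# Stand-ins for resp.usage on a first call (cache write) and a repeat call (cache read).
first_call = SimpleNamespace(
    prompt_tokens=10_000,
    prompt_tokens_details=SimpleNamespace(cached_tokens=0),
)
repeat_call = SimpleNamespace(
    prompt_tokens=10_000,
    prompt_tokens_details=SimpleNamespace(cached_tokens=9_600),
)

print(cache_hit_ratio(first_call), cache_hit_ratio(repeat_call))
```

A ratio that stays at zero across repeated calls suggests the prefix is changing between requests or the breakpoint is not where you expect.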

Usage object

Every response (and the final streaming chunk) includes a usage object:

| Field | Meaning |
| --- | --- |
| prompt_tokens | Input tokens we billed, cached tokens included. |
| prompt_tokens_details.cached_tokens | Cached input: a subset of prompt_tokens, billed at a reduced cache-read rate. |
| completion_tokens | Visible output tokens. |
| completion_tokens_details.reasoning_tokens | Reasoning tokens (billed at output rate). |
| total_tokens | Sum of input + output + reasoning. |
| credits_consumed | Kunavo addition. Raw cost in kie credits (1 credit = $0.005). |
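
The arithmetic behind these fields can be sketched as below, using the stated rate of 1 credit = $0.005 and the total_tokens definition from the table. The helper names and sample numbers are hypothetical; Decimal is used so the dollar amounts avoid floating-point rounding.

```python
from decimal import Decimal

USD_PER_CREDIT = Decimal("0.005")  # rate stated in the usage table


def credits_to_usd(credits_consumed):
    """Convert a credits_consumed value to US dollars at the stated rate."""
    return Decimal(str(credits_consumed)) * USD_PER_CREDIT


def expected_total(prompt_tokens, completion_tokens, reasoning_tokens=0):
    """total_tokens as defined in the table: input + output + reasoning."""
    return prompt_tokens + completion_tokens + reasoning_tokens


# Illustrative values only.
cost = credits_to_usd(240)
total = expected_total(prompt_tokens=1_000, completion_tokens=300, reasoning_tokens=1_200)
print(cost, total)
```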

Where to go next