Chat completions

Kunavo's /v1/chat/completions endpoint is bit-for-bit OpenAI-compatible across the Claude and Gemini families. Streaming, tools, vision, reasoning — all work via the same SDK.

This documentation is in English. For a quick-start guide in Portuguese, see "Guia em português: Claude API no Brasil".

Endpoint: POST /v1/chat/completions. Request and response shapes match OpenAI's chat completions API exactly, including streaming and the optional tool_calls / reasoning_tokens fields.

Basic call

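from openai import OpenAI

# Placeholders: substitute Kunavo's base URL and your own API key.
client = OpenAI(base_url="https://<kunavo-host>/v1", api_key="YOUR_API_KEY")

resp = client.chat.completions.create(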
    model="claude-sonnet-4-6",
    messages=[
        {"role": "system", "content": "You are a senior staff engineer."},
        {"role": "user", "content": "Pros and cons of postgres LISTEN/NOTIFY for a job queue?"},
    ],
    temperature=0.4,
    max_tokens=800,
)
print(resp.choices[0].message.content)

Parameters

Every standard OpenAI parameter is accepted. Some only make sense for certain providers — the translator passes through what each upstream supports.

Param	Type	Notes
`model`	string (required)	Any enabled slug from /v1/models.
`messages`	array (required)	Standard OpenAI message format.
`temperature`	0..2	Sampling temperature. Default 1.
`top_p`	0..1	Nucleus sampling. Mutually exclusive with temperature in some models.
`max_tokens`	int	Output cap. Reasoning tokens count separately.
`stream`	bool	Stream chunks as SSE. See below.
`tools`	array	Function/tool definitions. Claude and Gemini both support tool use.
`tool_choice`	auto|none|named	Force a specific tool or let the model decide.
`response_format`	object	Set `{"type": "json_object"}` for guaranteed JSON output.
`seed`	int	Deterministic sampling where supported.
`stop`	string|array	Hard stop sequences.
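
As a rough illustration of the ranges in the table, the sketch below assembles a request body and rejects out-of-range values before anything is sent. The helper name build_request is hypothetical and not part of any SDK; the server remains the authority on what it accepts.

```python
def build_request(model, messages, **params):
    """Assemble a chat completions request body, checking the documented ranges."""
    if not model:
        raise ValueError("model is required")
    if not messages:
        raise ValueError("messages is required")

    temperature = params.get("temperature")
    if temperature is not None and not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")

    top_p = params.get("top_p")
    if top_p is not None and not 0 <= top_p <= 1:
        raise ValueError("top_p must be between 0 and 1")

    body = {"model": model, "messages": messages}
    # Drop unset parameters so upstream defaults apply.
    body.update({k: v for k, v in params.items() if v is not None})
    return body


body = build_request(
    "claude-sonnet-4-6",
    [{"role": "user", "content": "Hello"}],
    temperature=0.4,
    max_tokens=800,
    seed=None,
)
print(body)
```

The resulting dictionary can be unpacked into client.chat.completions.create(**body) or serialized as the JSON body of a raw POST.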

Streaming

Set stream=True. Kunavo emits server-sent events in OpenAI's exact format: each chunk is a chat.completion.chunk with choices[0].delta.content. The usage payload arrives in the final chunk, immediately before the data: [DONE] terminator.

stream = client.chat.completions.create(
    model="gemini-2-5-pro",
    messages=[{"role": "user", "content": "Explain B-trees in one paragraph."}],
    stream=True,
)

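for chunk in stream:
    # A usage-only final chunk may carry no choices, so guard before indexing.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)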

Streaming works for every chat model. For Claude (Anthropic-protocol upstream), we translate the Anthropic event stream into OpenAI deltas on the fly — your SDK doesn't notice the difference.
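
If you consume the stream without an SDK, the wire format can be parsed by hand. The sketch below is a minimal illustration, assuming each event arrives as a single `data:` line; the sample events are hand-written for the example and were not captured from the API, and collect_stream_text is a hypothetical helper name.

```python
import json


def collect_stream_text(sse_lines):
    """Join delta.content pieces from OpenAI-style SSE lines; also return usage if seen."""
    parts = []
    usage = None
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel, not JSON
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]
        for choice in chunk.get("choices") or []:
            piece = (choice.get("delta") or {}).get("content")
            if piece:
                parts.append(piece)
    return "".join(parts), usage


# Hand-written sample events, for illustration only.
sample = [
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "B-trees "}}]}',
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "stay balanced."}}]}',
    'data: {"object": "chat.completion.chunk", "choices": [], "usage": {"prompt_tokens": 9, "completion_tokens": 4, "total_tokens": 13}}',
    "data: [DONE]",
]

text, usage = collect_stream_text(sample)
print(text)
print(usage)
```

The parser tolerates chunks with an empty choices list, which is how a usage-only chunk would look in this sketch.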

Tool / function calling

Tool calling works across providers. Define tools as JSON schema; the model returns tool_calls in its message; you execute and feed results back as role: "tool" messages.

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["c", "f"]},
            },
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",
)

# Inspect tool calls the model wants to make
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
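
The second half of the loop, executing the calls and feeding results back, can be sketched as follows. This is an illustration using plain dictionaries in the standard OpenAI message format; the weather function, the registry, and the call id are hypothetical stand-ins, and the assistant message is hand-written rather than a real model response.

```python
import json


def get_current_weather(city, unit="c"):
    # Hypothetical stand-in; a real implementation would query a weather service.
    return {"city": city, "unit": unit, "temperature": 21}


TOOL_REGISTRY = {"get_current_weather": get_current_weather}


def run_tool_calls(tool_calls, registry):
    """Execute each requested tool and return one role="tool" message per call."""
    messages = []
    for call in tool_calls:
        name = call["function"]["name"]
        # Arguments arrive as a JSON-encoded string, not a dict.
        args = json.loads(call["function"]["arguments"] or "{}")
        if name in registry:
            result = registry[name](**args)
        else:
            result = {"error": f"unknown tool: {name}"}
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],  # ties the result to the request
            "content": json.dumps(result),
        })
    return messages


# Hand-written example of an assistant message carrying one tool call.
assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_123",
        "type": "function",
        "function": {"name": "get_current_weather", "arguments": '{"city": "Tokyo"}'},
    }],
}

history = [{"role": "user", "content": "What's the weather in Tokyo?"}]
history.append(assistant_message)
history.extend(run_tool_calls(assistant_message["tool_calls"], TOOL_REGISTRY))
print(history[-1])
```

Sending history back in a second chat.completions.create call, with the same tools list, lets the model turn the tool result into a final answer.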

Vision / multimodal input

Models with the vision capability accept image content blocks. Use either an HTTPS URL or a data: base64 URI.

# Pass an image URL or a base64 data URI as part of a multimodal message
resp = client.chat.completions.create(
    model="gemini-2-5-pro",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {
                "url": "https://example.com/cat.jpg"
            }},
        ],
    }],
)
print(resp.choices[0].message.content)

Vision-capable models in the catalog: Claude Opus / Sonnet / Haiku 4.x, Gemini 2.5 Pro / 2.5 Flash.
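
For local files, the data: URI form can be built with the standard library. The sketch below is a minimal example; the helper names and the placeholder bytes are illustrative assumptions, and in practice you would read the bytes from an actual image file and pass its real MIME type.

```python
import base64


def to_data_uri(image_bytes, mime_type="image/jpeg"):
    """Encode raw image bytes as a base64 data: URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_block(url_or_data_uri):
    """Wrap an HTTPS URL or data: URI in an image_url content block."""
    return {"type": "image_url", "image_url": {"url": url_or_data_uri}}


# Placeholder bytes for illustration; normally: open("cat.jpg", "rb").read()
fake_image = b"\xff\xd8\xff\xe0 not a real JPEG"
block = image_block(to_data_uri(fake_image))

content = [
    {"type": "text", "text": "What's in this image?"},
    block,
]
print(block["image_url"]["url"][:40])
```

The content list drops into the user message exactly where the URL-based block appears in the example above.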

Reasoning tokens

Models with extended thinking (Claude Opus 4.7, Sonnet 4.6, Gemini 2.5 Pro) emit reasoning tokens in addition to visible output. They're billed at the output rate. Inspect them via usage.completion_tokens_details.reasoning_tokens.

# Claude's "extended thinking" mode + Gemini "thinking" mode both surface
# reasoning_tokens in usage. You're billed for them at the output rate.
resp = client.chat.completions.create(
    model="claude-opus-4-7",
    messages=[{"role": "user", "content": "Plan a 3-week MLOps migration."}],
)
print(resp.usage.completion_tokens_details.reasoning_tokens)

Prompt caching

A long system prompt, a reference document, a few-shot block — any stable prefix can be cached upstream and replayed on later calls at a fraction of the input price. Cache hits surface in the usage object as prompt_tokens_details.cached_tokens.

Gemini and GPT models cache automatically — no request change needed. cached_tokens is a subset of prompt_tokens and is billed at a reduced cache-read rate.

Claude caches only the prefix you mark with a cache_control breakpoint. Through this OpenAI-compatible endpoint, attach it to a content block:

# Claude caches prefixes you mark with cache_control. Attach it to a content
# block; later calls reusing that prefix read it at ~10% of the input price.
resp = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": LONG_DOCUMENT,            # the stable, reused prefix
                "cache_control": {"type": "ephemeral"},
            },
            {"type": "text", "text": "Summarize the document above."},
        ],
    }],
)
print(resp.usage.prompt_tokens_details.cached_tokens)

For full control over Claude prompt caching — caching the system prompt and tool definitions, multiple breakpoints, and the native cache_creation_input_tokens / cache_read_input_tokens usage fields — use the native Messages API.
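
To check whether caching is paying off, you can compute the share of the prompt that was served from cache. The sketch below relies only on the statement that cached_tokens is a subset of prompt_tokens; the helper name and the sample numbers are made up for illustration.

```python
def cache_hit_ratio(usage):
    """Fraction of prompt tokens served from cache, given a usage dict."""
    prompt_tokens = usage.get("prompt_tokens") or 0
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    if prompt_tokens == 0:
        return 0.0
    return cached / prompt_tokens


# Illustrative numbers, not real API output.
first_call = {"prompt_tokens": 12000, "prompt_tokens_details": {"cached_tokens": 0}}
later_call = {"prompt_tokens": 12000, "prompt_tokens_details": {"cached_tokens": 9000}}

print(cache_hit_ratio(first_call))
print(cache_hit_ratio(later_call))
```

With the OpenAI SDK, resp.usage.model_dump() yields a dictionary of this shape. A ratio that stays at zero across repeated calls suggests the prefix is changing between requests or, for Claude, that no block carries a cache_control marker.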

Usage object

Every response (and the final streaming chunk) includes a usage object:

Field	Meaning
`prompt_tokens`	Input tokens we billed — cached tokens included.
`prompt_tokens_details.cached_tokens`	Cached input — a subset of prompt_tokens, billed at a reduced cache-read rate.
`completion_tokens`	Visible output tokens.
`completion_tokens_details.reasoning_tokens`	Reasoning tokens (billed at output rate).
`total_tokens`	Sum of input + output + reasoning.
`credits_consumed`	Kunavo addition. Raw cost in kie credits (1 credit = $0.005).
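
The fields in the table can be combined into a quick per-request summary. The sketch below is an illustration built on two statements from the table: total_tokens is the sum of input, output, and reasoning, and one credit is $0.005. The helper name and the sample usage values are hypothetical.

```python
USD_PER_CREDIT = 0.005  # conversion rate stated in the table above


def summarize_usage(usage):
    """Recompute the token total and convert credits to dollars."""
    prompt = usage.get("prompt_tokens") or 0
    completion = usage.get("completion_tokens") or 0
    details = usage.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens") or 0
    credits = usage.get("credits_consumed") or 0
    return {
        "expected_total": prompt + completion + reasoning,
        "reported_total": usage.get("total_tokens"),
        "cost_usd": round(credits * USD_PER_CREDIT, 6),
    }


# Illustrative numbers, not real API output.
sample_usage = {
    "prompt_tokens": 1200,
    "completion_tokens": 300,
    "completion_tokens_details": {"reasoning_tokens": 500},
    "total_tokens": 2000,
    "credits_consumed": 42,
}

summary = summarize_usage(sample_usage)
print(summary)
```

Comparing expected_total against reported_total is a cheap sanity check when logging usage across providers.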