Most LLM aggregators stop at text. The moment your product needs to generate an image, edit a photo, animate a still into video, voice an ad, or compose music, you're back to managing three more vendors, three more bills, and three more SDKs. Kunavo gives you all of it through the same OpenAI-compatible API. This guide covers when to reach for which modality, the model menu for each, and how to chain them into production multimodal pipelines.

The modality menu (and what each is for)

Text generation — Claude Opus/Sonnet/Haiku, Gemini 2.5 Pro/Flash, GPT. Use for reasoning, chat, structured output. /docs/chat
Image generation (text-to-image) — Nano Banana Pro, GPT-Image-2, Nano Banana 2. Picks vary by use case (see shootout below)
Image editing (image-to-image) — Nano Banana Edit, GPT-Image-2 Edit. Replace text on packaging, swap backgrounds, restyle
Video generation — Veo 3 Fast/Quality/Lite. Text-to-video and image-to-video both supported
Audio — Music — Suno V5 / V5.5. Full songs from a prompt

Browse the full catalog at /models.

Picking an image model — the 30-second rule

Need clean text on the image (UI mockups, posters, signs)? → Nano Banana Pro. No contest
Literal, detailed prompts? → GPT-Image-2. Strong prompt understanding; flat price at every resolution
High volume where cost matters? → Nano Banana 2. Cheapest at 1K, fidelity holds up
Cinematic with depth? → Nano Banana Pro. Bokeh is genuinely cinematic
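
The rule above can be written down as a small lookup, which is handy when a batch job has to pick a model per request. This is only a sketch under stated assumptions: the flag names, the priority order, and the fallback are illustrative choices, and the `nano-banana-2` slug is guessed by analogy with the `nano-banana-pro` and `gpt-image-2` slugs used elsewhere in this guide. Check /models for the real slugs.

```python
def pick_image_model(
    needs_text: bool = False,
    cinematic: bool = False,
    literal_prompt: bool = False,
    high_volume: bool = False,
) -> str:
    """Map the 30-second rule to a model slug.

    The order in which the flags are checked is an assumption, not
    something the guide specifies.
    """
    if needs_text or cinematic:
        return "nano-banana-pro"   # clean text rendering, cinematic depth
    if literal_prompt:
        return "gpt-image-2"       # strong prompt understanding, flat price
    if high_volume:
        return "nano-banana-2"     # assumed slug; cheapest at 1K
    return "nano-banana-pro"       # arbitrary fallback chosen for this sketch
```

When several flags are set, the first matching branch wins, so reorder the checks if cost matters more to you than text fidelity.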

For the full deep-dive, see our image models shootout, a live image-model comparison.

Picking a video model

Veo 3 Quality — best overall, ~$1.20 per 5s 1080p clip
Veo 3 (default) — sweet spot, ~$0.50 per 5s clip
Veo 3 Lite — cheap iteration, ~$0.25 per 5s clip, use for ideation
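
To see how the tiers add up when you draft on Lite and render the final on Quality, a rough estimator like the one below can help. It is a sketch only: the prices are the approximate per-clip figures quoted above, and the tier keys (`quality`, `default`, `lite`) are illustrative labels, not API model slugs.

```python
# Approximate USD prices per 5-second clip, as quoted in this guide.
# The keys are illustrative labels, not model slugs.
PRICE_PER_CLIP = {"quality": 1.20, "default": 0.50, "lite": 0.25}


def estimate_cost(plan: dict[str, int]) -> float:
    """Return the estimated USD cost of a plan such as {"lite": 10, "quality": 1}."""
    total = sum(PRICE_PER_CLIP[tier] * clips for tier, clips in plan.items())
    return round(total, 2)
```

Ten Lite drafts plus one Quality final comes to about $3.70 under these prices, which is why iterating on the cheap tier first tends to pay off.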

For a step-by-step walkthrough, see our Veo 3 video quickstart.

Generating an image — the basics

image.py

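# Text-to-image: same OpenAI SDK, different model slug
import os
from openai import OpenAI

# Assumption: these env var names are placeholders; use the base URL and key from your Kunavo account
client = OpenAI(base_url=os.environ["KUNAVO_BASE_URL"], api_key=os.environ["KUNAVO_API_KEY"])

img = client.images.generate(
    model="nano-banana-pro",          # Google's flagship, best text rendering
    prompt="a marketing hero for a fintech app, isometric, soft blue palette",
    size="1024x1024",
)
print(img.data[0].url)  # 24h temporary URL; download immediately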

Image result URLs are temporary and expire after 24h; video URLs are permanent and served from the files.kunavo.com CDN. For images, download immediately and store on your own CDN.
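
One way to do the download step with only the standard library is sketched below. The helper name `save_result` and the local destination path are hypothetical; in production you would more likely stream the bytes to your own object storage or CDN instead of local disk.

```python
import shutil
import urllib.request
from pathlib import Path


def save_result(url: str, dest: Path, timeout: float = 30.0) -> Path:
    """Download a temporary result URL to dest and return the final path."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a ".part" file first so a failed download never leaves
    # a truncated asset at the final path.
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(tmp, "wb") as out:
        shutil.copyfileobj(resp, out)
    tmp.replace(dest)
    return dest


# Hypothetical usage, right after the generate call:
# save_result(img.data[0].url, Path("assets/hero.png"))
```

Calling it as soon as the generation returns keeps you well inside the 24h window.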

Image-to-video — taking your existing assets to motion

You already have product photography on your site. Veo 3 can animate any of it into a 5-second product clip:

image_to_video.py

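# Image-to-video: pass the URL of a hosted hero image and animate it into a 5-second product clip
# (hero_image_url below is a placeholder; point it at one of your own product photos)
hero_image_url = "https://example.com/hero.jpg"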
video = client.videos.generate(
    model="veo-3",
    prompt="camera slowly orbits, soft cinematic light, product hero shot",
    image_url=hero_image_url,
    aspect_ratio="9:16",   # vertical for Reels / TikTok / IG short
    duration=5,
    resolution="1080p",
)
print(video.data[0].url)

Marketing teams that previously spent $500-1500 per video shoot can compress that to a $0.50-$1.20 API call plus 30 seconds to 3 minutes of generation time. This works especially well for e-commerce hero animations, App Store / Play Store screenshots, and social media product showcases.

Chaining modalities — the real superpower

The interesting workflows mix modalities in sequence. A marketing clip production line:

pipeline.py

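# Pipeline: GPT-Image generates a marketing hero, Veo 3 animates it,
# Suno scores the BGM. One API, one bill.
# Note: the awaits below assume `client` is an async client (openai.AsyncOpenAI), not the sync OpenAI class.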
async def make_marketing_clip(product: dict) -> dict:
    # 1. Hero image
    img = await client.images.generate(
        model="gpt-image-2",
        prompt=f"hero shot of {product['name']}, premium aesthetic",
        size="1024x1792",
    )

    # 2. Animate to 5s vertical
    video = await client.videos.generate(
        model="veo-3",
        prompt=f"camera orbits {product['name']}, cinematic lighting",
        image_url=img.data[0].url,
        aspect_ratio="9:16",
        duration=5,
    )

    # 3. BGM
    bgm = await client.audio.music.create(
        model="suno-v5",
        prompt="upbeat synth, 30 seconds, brand-friendly",
    )

    return {"video": video, "bgm": bgm}

Run this for 100 SKUs in parallel with asyncio.gather() and you have a marketing catalog production pipeline. Total cost ~$1-2 per SKU end-to-end. Compare to a creative agency's $500-2000 per asset.
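
A bare asyncio.gather() over 100 SKUs fires every request at once and aborts on the first exception, so in practice you usually want a concurrency cap and per-item error capture. The sketch below shows one way to do that. The helper name `run_catalog` and the default limit of 10 are assumptions (this guide does not state Kunavo's rate limits), and `make_clip` stands in for a coroutine function such as make_marketing_clip above.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def run_catalog(
    products: list[dict],
    make_clip: Callable[[dict], Awaitable[Any]],
    max_concurrency: int = 10,  # assumed limit; tune to your account
) -> tuple[dict, dict]:
    """Run make_clip for every product, at most max_concurrency at a time.

    Returns (ok, failed), both keyed by product["name"].
    """
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(product: dict) -> Any:
        async with sem:
            return await make_clip(product)

    # return_exceptions=True keeps one failed SKU from cancelling the rest.
    results = await asyncio.gather(
        *(guarded(p) for p in products), return_exceptions=True
    )

    ok: dict = {}
    failed: dict = {}
    for product, result in zip(products, results):
        bucket = failed if isinstance(result, BaseException) else ok
        bucket[product["name"]] = result
    return ok, failed
```

You would call it as `ok, failed = asyncio.run(run_catalog(products, make_marketing_clip))` and retry only the entries in `failed`.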

Practical considerations

Result URL lifetimes: image URLs expire in 24h (re-host to your CDN); video URLs are permanent on Kunavo's files.kunavo.com CDN
Generation time: image ~5-30s, video ~30s-3min, music ~30s. Run generation server-side (for example, in a Next.js route marked force-dynamic, if that is your stack) and surface progress to users
Failed generations are free — Kunavo never bills 4xx/5xx, so iterate freely on prompts
Text rendering in video: Veo 3 struggles with multi-word text in video. Render text in post (CapCut, Final Cut, browser-side canvas) instead of asking the model to render it
Real faces / IP: all major models refuse to generate celebrities, political figures, and copyrighted characters. Stick to original characters or licensed IP
Aspect ratios: 9:16 (vertical short), 16:9 (landscape), 1:1 (feed), 4:5 (Instagram portrait). Pass via aspect_ratio on video; on images, use size="1024x1792" etc.
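
Since a video can take minutes, one way to surface progress is to wrap the generation call in a helper that reports elapsed time while it waits and gives up after a deadline. The sketch below is generic asyncio, not a Kunavo API: `run_with_progress`, the 2-second tick, and the 300-second timeout are all assumptions, and `job` is any awaitable, such as a call on an async client.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable


async def run_with_progress(
    job: Awaitable[Any],
    on_tick: Callable[[float], None],
    interval: float = 2.0,   # seconds between progress callbacks (assumed)
    timeout: float = 300.0,  # give up after this many seconds (assumed)
) -> Any:
    """Await job, calling on_tick(elapsed_seconds) roughly every interval."""
    task = asyncio.ensure_future(job)
    start = time.monotonic()
    while True:
        done, _ = await asyncio.wait({task}, timeout=interval)
        if done:
            return task.result()  # re-raises if the job itself failed
        elapsed = time.monotonic() - start
        if elapsed >= timeout:
            task.cancel()
            # Let the cancellation settle before reporting the timeout.
            await asyncio.gather(task, return_exceptions=True)
            raise TimeoutError(f"generation exceeded {timeout:.0f}s")
        on_tick(elapsed)
```

The `on_tick` callback is where you would push an update to the user, for example over a websocket or server-sent events.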

Common multimodal use cases (with starting prompts)

E-commerce product hero animations — image-to-video, orbital camera, 5 seconds, 9:16
Real estate virtual staging — image-edit to add furniture to empty rooms
Marketing carousel / Reels production — batch image generation in your brand style + Veo animation
Educational explainer clips — generate slides as images, animate them with Veo 3, score with Suno
Podcast intros & audio branding — Suno music beds and stings for your show, generated from a prompt
Game / app asset generation — character portraits, item icons, environment backgrounds via Nano Banana Pro / GPT-Image-2

When NOT to use multimodal

Strict brand assets — model output drifts, even with tight prompts. Use AI for variations on a human-designed master, not to replace the master
Faces of real people — refused by all major models; legal/IP risk if you bypass
Live video / real-time generation — current models take 30s-3min per clip. Not yet suitable for live use cases
High-stakes accuracy — medical imaging, legal evidence, scientific publication. Multimodal generation is creative, not factual

Multimodal under one bill is what Kunavo is really for. Start at /docs/images or /docs/video for endpoint reference, or browse /models filtered by category.
