Multimodal · May 25, 2026 · 10 min read

Multimodal AI guide — text, image, video and audio under one OpenAI-compatible API

Most aggregators only cover text. Kunavo adds image (Nano Banana, GPT-Image, Flux, Seedream), video (Veo 3, Sora, Seedance), and audio (ElevenLabs, Suno) on the same API. This guide covers when to use which, and how to chain them into production-grade multimodal workflows.

Most LLM aggregators stop at text. The moment your product needs to generate an image, edit a photo, animate a still into video, voice an ad, or compose music — you're back to managing three more vendors, three more bills, three more SDKs. Kunavo gives you all of it through the same OpenAI-compatible API. This guide covers when to reach for which modality, the model menu per modality, and how to chain them into production multimodal pipelines.

The modality menu (and what each is for)

  • Text generation — Claude Opus/Sonnet/Haiku, Gemini 3 Pro/Flash, GPT. Use for reasoning, chat, structured output. /docs/chat
  • Image generation (text-to-image) — Nano Banana Pro, GPT-Image-2, Flux 2 Pro, Seedream V4. Picks vary by use case (see shootout below)
  • Image editing (image-to-image) — Nano Banana Edit, GPT-Image-2 Edit, Flux 2 Edit. Replace text on packaging, swap backgrounds, restyle
  • Video generation — Veo 3 Fast/Quality/Lite, Sora 2, Seedance. Text-to-video and image-to-video both supported
  • Audio — TTS — ElevenLabs V3 / Turbo / Dialogue. Multilingual, voice cloning available on enterprise tier
  • Audio — STT — Whisper-equivalent endpoints
  • Audio — Music — Suno V5 / V5.5. Full songs from a prompt

Browse the full catalog at /models.

Picking an image model — the 30-second rule

  • Need clean text on the image (UI mockups, posters, signs)? → Nano Banana Pro. No contest
  • Photorealism for products / portraits / food? → Flux 2 Pro. Sharpest skin, fabric, reflections
  • Illustration / editorial / stylized? → Seedream V4. Aesthetic-first, leans into your style prompts
  • Cinematic with depth? → Nano Banana Pro. Bokeh is genuinely cinematic

Full deep-dive in our image models shootout — 60 prompts, real comparison images.
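
The 30-second rule above is small enough to encode as a lookup table, which keeps model choice out of scattered if/else branches. The sketch below is illustrative only: the need labels (`text_rendering`, `photorealism`, and so on) are hypothetical names made up for this example, and the values are the display names from the list, not confirmed API slugs.

```python
# Lookup table for the 30-second rule. Keys are hypothetical labels chosen
# for this sketch; values are the model names from the list above.
IMAGE_MODEL_BY_NEED = {
    "text_rendering": "Nano Banana Pro",
    "photorealism": "Flux 2 Pro",
    "illustration": "Seedream V4",
    "cinematic": "Nano Banana Pro",
}


def pick_image_model(need: str) -> str:
    """Return the recommended image model for a given need."""
    try:
        return IMAGE_MODEL_BY_NEED[need]
    except KeyError:
        known = ", ".join(sorted(IMAGE_MODEL_BY_NEED))
        raise ValueError(f"unknown need {need!r}; expected one of: {known}") from None


print(pick_image_model("photorealism"))  # prints: Flux 2 Pro
```

Map the returned name to the actual slug from /models before passing it to the API.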

Picking a video model

  • Veo 3 Quality — best overall, ~$1.20 per 5s 1080p clip
  • Veo 3 (default) — sweet spot, ~$0.50 per 5s clip
  • Veo 3 Lite — cheap iteration, ~$0.25 per 5s clip, use for ideation
  • Sora 2 / Sora 2 Pro — different aesthetic, stronger on text-only prompts (no image-to-video as of writing)
  • Seedance 2 — fast iteration, strong on stylized content

Quickstart for both Veo and Sora in our Veo + Sora API quickstart.
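
To see how the Veo tiers combine in practice, here is a rough budgeting sketch using the approximate per-clip prices listed above. The dictionary keys are hypothetical labels for this example (only `veo-3` appears as a slug in this guide), and the prices are the approximate figures from the list, so treat the result as an estimate.

```python
# Approximate prices per 5-second clip, taken from the list above.
# The keys are hypothetical labels, not confirmed model slugs.
PRICE_PER_5S_CLIP_USD = {
    "veo-3-quality": 1.20,
    "veo-3": 0.50,
    "veo-3-lite": 0.25,
}


def estimate_video_cost(tier: str, clips: int) -> float:
    """Estimate the total USD cost for `clips` five-second clips on a tier."""
    if clips < 0:
        raise ValueError("clips must be non-negative")
    return round(PRICE_PER_5S_CLIP_USD[tier] * clips, 2)


# Draft 20 ideas on Lite, then render the best 3 on Quality.
total = estimate_video_cost("veo-3-lite", 20) + estimate_video_cost("veo-3-quality", 3)
print(f"${total:.2f}")  # prints: $8.60
```

Iterating on the cheap tier and rendering only the winners on the expensive one is the main cost lever here.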

Generating an image — the basics

image.py
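# Text-to-image: same OpenAI SDK, different model slug.
# Assumes `client` is an OpenAI SDK client already configured with your Kunavo API key and base URL.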
img = client.images.generate(
    model="nano-banana-pro",          # Google's flagship — best text rendering
    prompt="a marketing hero for a fintech app, isometric, soft blue palette",
    size="1024x1024",
)
print(img.data[0].url)  # 24h temporary URL — download immediately

Image result URLs are temporary and expire after 24h; video URLs are permanent and served from the files.kunavo.com CDN. For images, download immediately and store the file on your own CDN.
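
Because image URLs expire, it helps to copy each result to local storage as soon as the call returns. The following is a minimal sketch using only the Python standard library. The helper name `save_result` and the paths are hypothetical, and the upload to your own CDN is left out because it depends on your storage provider.

```python
import shutil
import urllib.request
from pathlib import Path


def save_result(url: str, dest: Path, timeout: float = 30.0) -> Path:
    """Stream a result URL to a local file and return the final path."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    # Stream in chunks so large files are never held fully in memory.
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(tmp, "wb") as out:
        shutil.copyfileobj(resp, out)
    # Rename only after the download completes, so readers never see a partial file.
    tmp.replace(dest)
    return dest
```

A typical call would be `save_result(img.data[0].url, Path("assets/hero.png"))`, followed by an upload to whatever storage backs your CDN.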

Image-to-video — taking your existing assets to motion

You already have product photography on your site. Veo 3 can animate any of it into a 5-second product clip:

image_to_video.py
# Image-to-video: upload a hero image, animate into a 5-second product clip
video = client.videos.generate(
    model="veo-3",
    prompt="camera slowly orbits, soft cinematic light, product hero shot",
    image_url=hero_image_url,
    aspect_ratio="9:16",   # vertical for Reels / TikTok / IG short
    duration=5,
    resolution="1080p",
)
print(video.data[0].url)

Marketing teams that previously spent $500-1500 per video shoot can replace it with a $0.50-$1.20 API call and roughly 30 seconds to 3 minutes of generation time. This works especially well for e-commerce hero animations, App Store / Play Store screenshots, and social media product showcases.

Chaining modalities — the real superpower

The interesting workflows mix modalities in sequence. A marketing clip production line:

pipeline.py
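# Pipeline: GPT-Image generates a marketing hero, Veo 3 animates it,
# ElevenLabs voices the tagline, Suno scores the BGM. One API, one bill.
# Assumes `client` is an async client (for example the OpenAI SDK's AsyncOpenAI)
# already configured for Kunavo; the awaits below require it.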
async def make_marketing_clip(product: dict) -> dict:
    # 1. Hero image
    img = await client.images.generate(
        model="gpt-image-2",
        prompt=f"hero shot of {product['name']}, premium aesthetic",
        size="1024x1792",
    )

    # 2. Animate to 5s vertical
    video = await client.videos.generate(
        model="veo-3",
        prompt=f"camera orbits {product['name']}, cinematic lighting",
        image_url=img.data[0].url,
        aspect_ratio="9:16",
        duration=5,
    )

    # 3. Voiceover (English here; repeat this call per locale for localized variants)
    voice = await client.audio.speech.create(
        model="elevenlabs-v3",
        voice="adam",
        input=product["tagline_en"],
    )

    # 4. BGM
    bgm = await client.audio.music.create(
        model="suno-v5",
        prompt="upbeat synth, 30 seconds, brand-friendly",
    )

    return {"video": video, "voice": voice, "bgm": bgm}

Run this for 100 SKUs in parallel with asyncio.gather() and you have a marketing catalog production pipeline. Total cost ~$1-2 per SKU end-to-end. Compare to a creative agency's $500-2000 per asset.
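
An unbounded `asyncio.gather()` over 100 SKUs fires every request at once, which can run into rate limits, and a single exception would otherwise surface while the rest of the batch is still in flight. The sketch below shows one way to cap concurrency with a semaphore and collect failures per item. `run_batch`, `fake_worker`, and the default limit of 10 are assumptions made for illustration; in real use you would pass `make_marketing_clip` as the worker.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def run_batch(
    products: list[dict],
    worker: Callable[[dict], Awaitable[Any]],
    max_concurrency: int = 10,
) -> list[Any]:
    """Run `worker` over all products, at most `max_concurrency` at a time.

    Results come back in input order. A failed item is returned as its
    exception object instead of aborting the rest of the batch.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(product: dict) -> Any:
        async with semaphore:
            return await worker(product)

    return await asyncio.gather(
        *(guarded(p) for p in products), return_exceptions=True
    )


async def fake_worker(product: dict) -> str:
    # Stand-in for make_marketing_clip so the sketch runs offline.
    await asyncio.sleep(0.01)
    if not product.get("name"):
        raise ValueError("product has no name")
    return f"clip for {product['name']}"


if __name__ == "__main__":
    demo = [{"name": "Mug"}, {"name": ""}, {"name": "Lamp"}]
    results = asyncio.run(run_batch(demo, fake_worker, max_concurrency=2))
    for product, result in zip(demo, results):
        status = "failed" if isinstance(result, Exception) else "ok"
        print(product["name"] or "<unnamed>", status)
```

Since failed generations are not billed, retrying only the items that came back as exceptions is a reasonable follow-up step.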

Practical considerations

  • Result URL lifetimes: images expire in 24h (re-host to your CDN); videos are permanent on Kunavo's files.kunavo.com
  • Generation time: image ~5-30s, video ~30s-3min, music ~30s. Run generation server-side (in Next.js, mark the route force-dynamic so the response is not statically cached) and surface progress to users
  • Failed generations are free — Kunavo never bills 4xx/5xx, so iterate freely on prompts
  • Text rendering in video: Veo 3 struggles with multi-word text in video. Render text in post (CapCut, Final Cut, browser-side canvas) instead of asking the model to render it
  • Real faces / IP: all major models refuse celebrity, political, and copyrighted character generation. Stick to original characters or licensed IP
  • Aspect ratios: 9:16 (vertical short), 16:9 (landscape), 1:1 (feed), 4:5 (Instagram portrait). Pass via aspect_ratio on video; on images, use size="1024x1792" etc.
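
Given the generation times above, it is worth wrapping each call in a per-modality time budget so a stuck request does not hang a whole pipeline. This is a minimal sketch using `asyncio.wait_for`. The helper name `with_budget` and the budget values are assumptions (the upper end of the listed ranges plus headroom), not limits documented by Kunavo.

```python
import asyncio
from typing import Any, Awaitable

# Upper ends of the generation times listed above, with headroom.
# These budgets are assumptions for this sketch; tune them to what you observe.
TIMEOUT_SECONDS = {"image": 60, "video": 300, "music": 90}


async def with_budget(
    call: Awaitable[Any],
    modality: str,
    budgets: dict = TIMEOUT_SECONDS,
) -> Any:
    """Await a generation call; raise asyncio.TimeoutError past the budget."""
    return await asyncio.wait_for(call, timeout=budgets[modality])
```

In the pipeline this would look like `video = await with_budget(client.videos.generate(...), "video")`, with the timeout handled wherever you report progress to the user.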

Common multimodal use cases (with starting prompts)

  • E-commerce product hero animations — image-to-video, orbital camera, 5 seconds, 9:16
  • Real estate virtual staging — image-edit to add furniture to empty rooms
  • Marketing carousel / Reels production — batch image generation in your brand style + Veo animation
  • Educational content with TTS narration — generate slides as images, narrate with ElevenLabs, optionally compose into video with timing
  • Podcast / audio drama production — Suno music + TTS dialogue mixed in your post-production pipeline
  • Game / app asset generation — character portraits, item icons, environment backgrounds via Flux 2 Pro / Seedream

When NOT to use multimodal

  • Strict brand assets — model output drifts, even with tight prompts. Use AI for variations on a human-designed master, not to replace the master
  • Faces of real people — refused by all major models; legal/IP risk if you bypass
  • Live video / real-time generation — current models take 30s-3min per clip. Not yet suitable for live use cases
  • High-stakes accuracy — medical imaging, legal evidence, scientific publication. Multimodal generation is creative, not factual

Multimodal under one bill is what Kunavo is really for. Start at /docs/images or /docs/video for endpoint reference, or browse /models filtered by category.