Why LLMs beat rules-based moderation

Sarcasm. Dog whistles. Brand-context misuse ("I love this product but I hate the customer service"). Multilingual nuance. Image+caption combinations where each alone passes but together don't. Rule-based moderation handles maybe 60% of trust & safety cases. The hard 40% is exactly where LLMs shine — they understand context, intent, and the gap between literal words and meaning.

Modern moderation pipelines mix both: rules for the easy 60% (regex, URL blocklists, hash-matched CSAM), LLM for the 40% that needs judgment.
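
A minimal sketch of that split, assuming a hypothetical domain blocklist, one illustrative regex rule, and an `llm_moderate` callable standing in for the model call. None of these names come from a real library, and hash matching is left out:

```python
import re
from typing import Callable, Optional

# Hypothetical rule data, for illustration only.
BLOCKED_DOMAINS = {"malware.example", "phish.example"}
SPAM_PATTERN = re.compile(r"(?i)\b(free money|click here now)\b")
URL_PATTERN = re.compile(r"https?://([^/\s]+)")


def rules_verdict(text: str) -> Optional[dict]:
    """Return a verdict if a deterministic rule fires, else None."""
    for domain in URL_PATTERN.findall(text):
        if domain.lower() in BLOCKED_DOMAINS:
            return {"verdict": "block", "categories": ["spam"], "source": "rules"}
    if SPAM_PATTERN.search(text):
        return {"verdict": "review", "categories": ["spam"], "source": "rules"}
    return None


def moderate(text: str, llm_moderate: Callable[[str], dict]) -> dict:
    """Try the cheap rules first; call the LLM only when no rule decides."""
    decided = rules_verdict(text)
    if decided is not None:
        return decided
    result = dict(llm_moderate(text))
    result["source"] = "llm"
    return result
```

Passing the model call in as a parameter keeps the rules layer testable without network access, and the `source` field records which layer made the decision.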

Text moderation — Claude Haiku 4.5

Haiku is cheap and fast — ideal for high-volume moderation:

moderate_text.py

import json

from openai import OpenAI
client = OpenAI(api_key="sk-kunavo-...", base_url="https://api.kunavo.com/v1")

POLICY = """You are a content moderator. Return JSON only:
{
  "verdict": "allow" | "review" | "block",
  "categories": ["toxic", "spam", "harassment", "sexual", "violence",
                 "self_harm", "deception", "brand_unsafe"],
  "confidence": 0.0..1.0,
  "rationale": "one sentence"
}
Be conservative — when in doubt, flag for human review."""

def moderate_text(text: str) -> dict:
    resp = client.chat.completions.create(
        model="claude-haiku-4-5",  # cheap + fast for high-volume
        messages=[
            {"role": "system", "content": POLICY},
            {"role": "user", "content": text},
        ],
        response_format={"type": "json_object"},
        max_tokens=200,
    )
    return json.loads(resp.choices[0].message.content)
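
Even in JSON mode, the model's output is not guaranteed to match the schema, so it can help to validate it before acting on it. A minimal sketch, assuming a hypothetical `parse_verdict` helper that falls back to `review` on anything malformed (the fallback mirrors the policy's "when in doubt, flag for human review"):

```python
import json

ALLOWED_VERDICTS = {"allow", "review", "block"}
ALLOWED_CATEGORIES = {
    "toxic", "spam", "harassment", "sexual", "violence",
    "self_harm", "deception", "brand_unsafe",
}


def parse_verdict(raw: str) -> dict:
    """Validate raw model output; fall back to human review if malformed."""
    fallback = {"verdict": "review", "categories": [], "confidence": 0.0,
                "rationale": "unparseable model output"}
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return fallback
    if not isinstance(data, dict):
        return fallback
    verdict = data.get("verdict")
    if not isinstance(verdict, str) or verdict not in ALLOWED_VERDICTS:
        return fallback
    categories = data.get("categories")
    if not isinstance(categories, list):
        categories = []
    confidence = data.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        confidence = 0.0
    return {
        "verdict": verdict,
        # Drop any category the policy did not define.
        "categories": [c for c in categories
                       if isinstance(c, str) and c in ALLOWED_CATEGORIES],
        # Clamp confidence into the documented 0.0..1.0 range.
        "confidence": min(1.0, max(0.0, float(confidence))),
        "rationale": str(data.get("rationale", "")),
    }
```

In `moderate_text`, this would replace the bare `json.loads` call on the response content.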

Per-call cost: ~$0.0001-0.0003 depending on input length. At 1M moderation events/month: ~$100-300. Compare to paid moderation APIs (Hive at ~$0.002/call, Sightengine at ~$0.003/call): LLM-based moderation is cheaper per call, with much better contextual understanding. Perspective API is free but rate-limited and focused on toxicity scoring.

Image moderation — Gemini 2.5 Flash

For images, Gemini 2.5 Flash has the best speed/quality/cost balance:

moderate_image.py

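import json

from moderate_text import POLICY, client  # reuse the client and policy from moderate_text.py

def moderate_image(image_url: str) -> dict: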
    resp = client.chat.completions.create(
        model="gemini-2-5-flash",  # vision + cheap
        messages=[
            {"role": "system", "content": POLICY},
            {"role": "user", "content": [
                {"type": "text", "text": "Moderate this image:"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ]},
        ],
        response_format={"type": "json_object"},
        max_tokens=200,
    )
    return json.loads(resp.choices[0].message.content)
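
Uploads that are not yet publicly reachable have no URL to pass. One option is to inline the bytes as a data URL. The sketch below assumes the gateway accepts data URLs in the `image_url` field, as the OpenAI chat format does; check the provider's documentation before relying on it. `image_to_data_url` is a hypothetical helper name:

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The result can be passed to `moderate_image` in place of a remote URL. Base64 inflates the payload by about a third, so large images may need resizing first.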

Per-image cost: ~$0.001-0.003. For a social platform with 100K posts/day (about 3M posts/month), image moderation runs ~$3,000-9,000/month; text moderation at the per-call rates above adds roughly $300-900/month.
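
The arithmetic behind these estimates is just volume times per-call cost. A small sketch, assuming a flat 30-day month and the per-call ranges quoted in this article (actual prices depend on token counts and current provider pricing):

```python
DAYS_PER_MONTH = 30  # assumption: flat 30-day month


def monthly_cost_range(posts_per_day: int, low_per_call: float,
                       high_per_call: float) -> tuple:
    """Return (low, high) monthly cost in dollars for one moderation call per post."""
    calls = posts_per_day * DAYS_PER_MONTH
    return calls * low_per_call, calls * high_per_call


text_low, text_high = monthly_cost_range(100_000, 0.0001, 0.0003)
image_low, image_high = monthly_cost_range(100_000, 0.001, 0.003)

print(f"text:  ${text_low:,.0f}-{text_high:,.0f}/month")
print(f"image: ${image_low:,.0f}-{image_high:,.0f}/month")
print(f"total: ${text_low + image_low:,.0f}-{text_high + image_high:,.0f}/month")
```

Image calls dominate the total, so sending only posts that actually contain images to the vision model is the main lever for reducing it.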

The three-tier verdict pattern

allow: low confidence of any violation → publish immediately
review: medium confidence → queue for a human moderator. Most LLM flags land here, and each one is cheap to review
block: high confidence of severe violation (CSAM, credible threats, doxxing) → reject + log + escalate

Don't auto-block on LLM verdict alone unless it's a zero-tolerance category. False positives cost you users; let humans decide the gray zone.
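
One way to encode that rule is a small routing function. This is a sketch under stated assumptions: the `ZERO_TOLERANCE` category names and the `BLOCK_THRESHOLD` value are hypothetical and would need to match whatever categories your own policy prompt defines:

```python
# Hypothetical zero-tolerance categories; add them to your policy prompt
# if you want the model to emit them.
ZERO_TOLERANCE = frozenset({"credible_threat", "doxxing"})
BLOCK_THRESHOLD = 0.9  # assumed cutoff; tune against labeled data


def route(result: dict) -> str:
    """Map a model verdict to 'publish', 'human_review', or 'reject'."""
    verdict = result.get("verdict")
    categories = set(result.get("categories", []))
    confidence = result.get("confidence", 0.0)

    if verdict == "allow":
        return "publish"
    # Auto-reject only for zero-tolerance categories at high confidence.
    if (verdict == "block"
            and categories & ZERO_TOLERANCE
            and confidence >= BLOCK_THRESHOLD):
        return "reject"
    # Everything else, including unknown verdicts, goes to a human.
    return "human_review"
```

A `block` verdict for an ordinary category such as `toxic` still lands in the human queue, which keeps false positives from silently removing content.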

Comparison to alternatives

Tool	Cost / 1M calls	Strength
Perspective API (Google)	Free (rate-limited)	English toxicity scoring
Hive Moderation	~$2,000	Image, video, audio — strong
Sightengine	~$3,000	Image specialist
OpenAI Moderation	Free	Text only, limited categories
Kunavo + Haiku/Gemini	~$100-300 text / ~$1,000-3,000 image	Contextual + multilingual + custom policy

The LLM advantage is policy customization. Hive and Sightengine give you fixed categories. With Claude/Gemini, you write the policy in English (or any language) and it adapts. Add a new restricted category? Update the system prompt. No model retraining.
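
To illustrate, the system prompt can be generated from a category list so that adding a category is a one-line change. A minimal sketch: `build_policy`, the `unlicensed_pharma` category, and the extra rule text are hypothetical examples, not part of any API:

```python
from typing import Sequence

BASE_CATEGORIES = [
    "toxic", "spam", "harassment", "sexual", "violence",
    "self_harm", "deception", "brand_unsafe",
]


def build_policy(categories: Sequence[str],
                 extra_rules: Sequence[str] = ()) -> str:
    """Render the moderation system prompt for a given category list."""
    category_list = ", ".join(f'"{c}"' for c in categories)
    lines = [
        "You are a content moderator. Return JSON only:",
        "{",
        '  "verdict": "allow" | "review" | "block",',
        f'  "categories": [{category_list}],',
        '  "confidence": 0.0..1.0,',
        '  "rationale": "one sentence"',
        "}",
    ]
    lines.extend(f"Rule: {rule}" for rule in extra_rules)
    lines.append("Be conservative: when in doubt, flag for human review.")
    return "\n".join(lines)


# Hypothetical new restricted category plus a rule describing it.
policy = build_policy(
    BASE_CATEGORIES + ["unlicensed_pharma"],
    extra_rules=["Flag offers to sell prescription drugs without a prescription."],
)
```

Keeping the category list in code also lets the same list drive output validation, so the prompt and the parser cannot drift apart.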

Compliance angles

DSA (EU Digital Services Act): requires platforms to publish moderation transparency reports. The rationale field in the JSON above is what you log.
CSAM: never route through an LLM. Use hash-matching (NCMEC, IWF); LLMs are not the right tool for known illegal content.
PII in moderated content: hash or pseudonymize before sending to the LLM if dealing with private DMs/messages; see the compliance guide.
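
For the PII point, a minimal pseudonymization sketch using keyed hashes from the standard library. The two regexes are illustrative assumptions that catch only email addresses and @-handles; a production system would more likely use a dedicated PII detector, and `pseudonymize` is a hypothetical helper name:

```python
import hashlib
import hmac
import re

# Illustrative patterns only: email addresses and @-handles.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
HANDLE_RE = re.compile(r"(?<![\w@])@\w{2,30}")


def pseudonymize(text: str, secret: bytes) -> str:
    """Replace emails and handles with stable keyed-hash tokens."""
    def token(match: re.Match) -> str:
        value = match.group(0).lower().encode("utf-8")
        digest = hmac.new(secret, value, hashlib.sha256).hexdigest()
        return f"<pii:{digest[:12]}>"

    # Emails first, so the handle pattern never sees their "@domain" part.
    text = EMAIL_RE.sub(token, text)
    return HANDLE_RE.sub(token, text)
```

Because the token is an HMAC of the original value, the same address maps to the same token within one key, so the model can still tell that two messages mention the same person without seeing who it is.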

Get started: free signup, then read the /docs/chat reference for JSON-mode and structured output patterns.
