Why LLMs beat rules-based moderation
Sarcasm. Dog whistles. Brand-context misuse ("I love this product but I hate the customer service"). Multilingual nuance. Image+caption combinations where each alone passes but together don't. Rule-based moderation handles maybe 60% of trust & safety cases. The hard 40% is exactly where LLMs shine — they understand context, intent, and the gap between literal words and meaning.
Modern moderation pipelines mix both: rules for the easy 60% (regex, URL blocklists, hash-matched CSAM), LLM for the 40% that needs judgment.
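The rules-first split can be sketched as a small router: cheap checks run first, and only undecided content reaches the model. This is a minimal illustration, not a production design. The blocklist hosts, the spam pattern, and the names `rules_verdict` and `moderate` are all hypothetical, and the `llm` argument stands in for whatever model call you use.

```python
import re
from typing import Callable, Optional

# Hypothetical rule set for illustration; real deployments maintain far larger lists.
URL_BLOCKLIST = {"spam.example", "phish.example"}
SPAM_PATTERN = re.compile(r"(?i)\b(free money|click here now)\b")
URL_HOST = re.compile(r"https?://([^/\s]+)", re.IGNORECASE)


def rules_verdict(text: str) -> Optional[dict]:
    """Return a verdict if a cheap rule decides the case, else None."""
    for host in URL_HOST.findall(text):
        if host.lower() in URL_BLOCKLIST:
            return {"verdict": "block", "categories": ["spam"], "source": "rules"}
    if SPAM_PATTERN.search(text):
        return {"verdict": "review", "categories": ["spam"], "source": "rules"}
    return None


def moderate(text: str, llm: Callable[[str], dict]) -> dict:
    """Try rules first; fall back to the LLM only when no rule fires."""
    decided = rules_verdict(text)
    if decided is not None:
        return decided
    result = dict(llm(text))  # copy so the caller's dict is not mutated
    result["source"] = "llm"
    return result
```

Recording a `source` field on each verdict makes it easy to measure later how much traffic the rules actually absorb.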
Text moderation — Claude Haiku 4.5
Haiku is cheap and fast — ideal for high-volume moderation:
import json

from openai import OpenAI

# OpenAI-compatible client pointed at the Kunavo gateway
client = OpenAI(api_key="sk-kunavo-...", base_url="https://api.kunavo.com/v1")

# System prompt: defines the JSON contract the model must return
POLICY = """You are a content moderator. Return JSON only:
{
  "verdict": "allow" | "review" | "block",
  "categories": ["toxic", "spam", "harassment", "sexual", "violence",
                 "self_harm", "deception", "brand_unsafe"],
  "confidence": 0.0..1.0,
  "rationale": "one sentence"
}
Be conservative: when in doubt, flag for human review."""


def moderate_text(text: str) -> dict:
    """Send one piece of text to the model and return the parsed verdict."""
    resp = client.chat.completions.create(
        model="claude-haiku-4-5",  # cheap + fast for high-volume
        messages=[
            {"role": "system", "content": POLICY},
            {"role": "user", "content": text},
        ],
        response_format={"type": "json_object"},  # request JSON-only output
        max_tokens=200,
    )
    return json.loads(resp.choices[0].message.content)

Per-call cost: ~$0.0001-0.0003 depending on input length. At 1M moderation events/month: ~$100-300. Compared with paid commercial moderation APIs (Hive at ~$0.002/call, Sightengine at ~$0.003/call), LLM-based moderation is cheaper per call, with much better contextual understanding. Perspective is free but rate-limited.
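A bare `json.loads` assumes the model always honors the JSON contract in the policy prompt. A defensive variant, sketched below, routes anything malformed to human review instead of raising. The name `parse_verdict` and the fallback values are assumptions made for this example, not part of any library.

```python
import json

ALLOWED_VERDICTS = {"allow", "review", "block"}

# Fallback used whenever the model output cannot be trusted.
FALLBACK = {
    "verdict": "review",
    "categories": [],
    "confidence": 0.0,
    "rationale": "unparseable model output",
}


def parse_verdict(raw: str) -> dict:
    """Parse model output; fall back to human review on any malformed field."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return dict(FALLBACK)
    verdict = data.get("verdict") if isinstance(data, dict) else None
    if not isinstance(verdict, str) or verdict not in ALLOWED_VERDICTS:
        return dict(FALLBACK)
    confidence = data.get("confidence")
    # bool is a subclass of int, so reject it explicitly
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return dict(FALLBACK)
    if not 0.0 <= confidence <= 1.0:
        return dict(FALLBACK)
    categories = data.get("categories")
    if not isinstance(categories, list):
        categories = []
    return {
        "verdict": verdict,
        "categories": [c for c in categories if isinstance(c, str)],
        "confidence": float(confidence),
        "rationale": str(data.get("rationale", "")),
    }
```

Defaulting to `review` keeps a parsing failure from silently publishing or blocking content.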
Image moderation — Gemini 3 Flash
For images, Gemini 3 Flash has the best speed/quality/cost balance:
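# Reuses json, client, and POLICY from the text moderation example
def moderate_image(image_url: str) -> dict:
    """Send one image URL to a vision model and return the parsed verdict."""
    resp = client.chat.completions.create(
        model="gemini-3-flash",  # vision + cheap
        messages=[
            {"role": "system", "content": POLICY},
            {"role": "user", "content": [
                {"type": "text", "text": "Moderate this image:"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ]},
        ],
        response_format={"type": "json_object"},  # request JSON-only output
        max_tokens=200,
    )
    return json.loads(resp.choices[0].message.content)

Per-image cost: ~$0.001-0.003. For a social platform with 100K posts/day (about 3M posts/month), combined text+image moderation comes to roughly $3,300-9,900/month: ~$3,000-9,000 for images plus ~$300-900 for text, assuming one text call and one image call per post.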
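The monthly figures are per-call cost multiplied by volume. The sketch below reproduces that arithmetic using the per-call ranges quoted in this article. It assumes a 30-day month and one text call plus one image call per post, and `monthly_cost` is a helper name invented for the example.

```python
from typing import Tuple


def monthly_cost(posts_per_day: int, low_per_call: float, high_per_call: float,
                 days: int = 30) -> Tuple[float, float]:
    """Return the (low, high) monthly cost in dollars for one call per post."""
    calls = posts_per_day * days
    return calls * low_per_call, calls * high_per_call


# Per-call ranges quoted in the article
text_low, text_high = monthly_cost(100_000, 0.0001, 0.0003)
image_low, image_high = monthly_cost(100_000, 0.001, 0.003)
combined_low = text_low + image_low
combined_high = text_high + image_high

print(f"text:     ${text_low:,.0f} to ${text_high:,.0f}")
print(f"image:    ${image_low:,.0f} to ${image_high:,.0f}")
print(f"combined: ${combined_low:,.0f} to ${combined_high:,.0f}")
# text:     $300 to $900
# image:    $3,000 to $9,000
# combined: $3,300 to $9,900
```

Image calls dominate the total, so sampling or pre-filtering images is the first lever to pull if the bill needs to shrink.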
The three-tier verdict pattern
- allow: low confidence of any violation → publish immediately
- review: medium confidence → queue for a human moderator. Most LLM verdicts land here, but each one is cheap to review
- block: high confidence of severe violation (CSAM, credible threats, doxxing) → reject + log + escalate
Don't auto-block on an LLM verdict alone unless it's a zero-tolerance category. False positives cost you users; let humans decide the gray zone.
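The three tiers and the zero-tolerance rule can be sketched as a single routing function. The category labels in `ZERO_TOLERANCE` and the action names are hypothetical; substitute whatever your own policy defines.

```python
# Hypothetical zero-tolerance labels; align these with your own policy categories.
ZERO_TOLERANCE = {"credible_threat", "doxxing"}


def route(verdict: dict) -> str:
    """Map a moderation verdict to 'publish', 'human_review', or 'reject'."""
    decision = verdict.get("verdict")
    categories = set(verdict.get("categories", []))
    if decision == "allow":
        return "publish"
    # Auto-reject only when the model says block AND the category is zero-tolerance.
    if decision == "block" and categories & ZERO_TOLERANCE:
        return "reject"
    # Everything else, including unknown verdicts, goes to a human.
    return "human_review"
```

With this shape, a `block` verdict for an ordinary category such as `toxic` still lands in the human queue rather than being rejected automatically.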
Comparison to alternatives
| Tool | Cost / 1M calls | Strength |
|---|---|---|
| Perspective API (Google) | Free (rate-limited) | English toxicity scoring |
| Hive Moderation | ~$2,000 | Image, video, audio — strong |
| Sightengine | ~$3,000 | Image specialist |
| OpenAI Moderation | Free | Text only, limited categories |
| Kunavo + Haiku/Gemini | ~$100-300 text / ~$1,000-3,000 image | Contextual + multilingual + custom policy |
The LLM advantage is policy customization. Hive and Sightengine give you fixed categories. With Claude/Gemini, you write the policy in English (or any language) and it adapts. Add a new restricted category? Update the system prompt. No model retraining.
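One way to keep the policy editable is to generate the system prompt from a category list, so adding a restricted category is a one-line change. This is a sketch only: `build_policy`, the `crypto_promotion` category, and the extra rule are made up for illustration.

```python
import json
from typing import Sequence

BASE_CATEGORIES = ["toxic", "spam", "harassment", "sexual", "violence",
                   "self_harm", "deception", "brand_unsafe"]


def build_policy(categories: Sequence[str], extra_rules: Sequence[str] = ()) -> str:
    """Assemble a moderation system prompt from categories and optional rules."""
    lines = [
        "You are a content moderator. Return JSON only:",
        "{",
        '  "verdict": "allow" | "review" | "block",',
        f'  "categories": subset of {json.dumps(list(categories))},',
        '  "confidence": 0.0..1.0,',
        '  "rationale": "one sentence"',
        "}",
    ]
    lines += [f"Rule: {rule}" for rule in extra_rules]
    lines.append("Be conservative: when in doubt, flag for human review.")
    return "\n".join(lines)


# Hypothetical new category added without touching the model
policy = build_policy(
    BASE_CATEGORIES + ["crypto_promotion"],
    extra_rules=["Flag unsolicited investment offers as crypto_promotion."],
)
```

Keeping the category list in code or config also gives you something to version and review when the policy changes.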
Compliance angles
- DSA (EU Digital Services Act): requires platforms to publish moderation transparency reports. The rationale field in the JSON above is what you log
- CSAM: never route through LLM. Use hash-matching (NCMEC, IWF) — LLMs are not the right tool for known illegal content
- PII in moderated content: hash or pseudonymize before sending to LLM if dealing with private DMs/messages — see the compliance guide
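For the PII point, one possible approach is to replace emails and phone numbers with keyed hashes before the text leaves your system. The regexes below are deliberately simple and will miss many PII forms (names, addresses, usernames), and `pseudonymize` is a name chosen for this sketch. Treat it as a starting point rather than a complete scrubber.

```python
import hashlib
import hmac
import re

# Single combined pattern so replacement tokens are never re-scanned.
PII = re.compile(
    r"(?P<email>[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})"
    r"|(?P<phone>\+?\d[\d\s().-]{7,}\d)"
)


def pseudonymize(text: str, secret: bytes) -> str:
    """Replace emails and phone numbers with stable keyed-hash tokens."""
    def replace(match: "re.Match[str]") -> str:
        kind = match.lastgroup  # "email" or "phone"
        digest = hmac.new(secret, match.group().lower().encode("utf-8"),
                          hashlib.sha256).hexdigest()
        return f"<{kind}:{digest[:12]}>"
    return PII.sub(replace, text)
```

Because the token is an HMAC keyed with a secret you hold, the same address always maps to the same token (useful for spotting repeat offenders) while the model never sees the raw value.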
Get started: free signup, then read the /docs/chat reference for JSON-mode and structured output patterns.