The boring valuable use case
For two decades, structured data extraction from unstructured documents was a quarter-long project: OCR pipeline, regex rules, edge-case handling, layout templates per vendor. With LLMs in JSON mode, it's about 30 lines of code, 95-98% field-level accurate on the first pass (see the benchmarks below), and ships in a day.
Categories where this works exceptionally well:
- Invoices, receipts, purchase orders
- Contracts, legal agreements (extract parties, dates, obligations)
- Resumes / CVs (parse to ATS-friendly structure)
- Lead enrichment from email signatures or web pages
- Medical records, lab reports (with PII handling)
- Real estate listings, product catalogs
- Email triage and structured ingestion
The core pattern in 30 lines
import json

from openai import OpenAI

# OpenAI-compatible client pointed at the Kunavo endpoint
client = OpenAI(api_key="sk-kunavo-...", base_url="https://api.kunavo.com/v1")

# Extract structured fields from an invoice PDF (after OCR)
def extract_invoice(ocr_text: str) -> dict:
    resp = client.chat.completions.create(
        model="claude-haiku-4-5",
        messages=[
            {"role": "system", "content": (
                "Extract invoice fields. Return JSON: "
                '{vendor, invoice_number, date_iso, due_date_iso, '
                'currency, subtotal, tax, total, line_items: [{description, qty, unit_price, amount}]}'
                "\nUse null for missing fields. Dates in ISO 8601."
            )},
            {"role": "user", "content": ocr_text},
        ],
        response_format={"type": "json_object"},  # JSON mode
        max_tokens=1200,
    )
    # JSON mode returns the object as a string in the message content
    return json.loads(resp.choices[0].message.content)

# Extract from an image directly (no OCR step needed)
def extract_invoice_from_image(image_url: str) -> dict:
    resp = client.chat.completions.create(
        model="gemini-3-flash",  # vision + JSON mode
        messages=[
            {"role": "system", "content": "Extract invoice fields as JSON ..."},
            {"role": "user", "content": [
                {"type": "text", "text": "Extract this invoice:"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ]},
        ],
        response_format={"type": "json_object"},
        max_tokens=1200,
    )
    return json.loads(resp.choices[0].message.content)

Two flows: text-only (after a separate OCR step) and vision-direct (Gemini reads the PDF/image and outputs JSON in one call). The vision path is faster to build but slightly more expensive per call; the OCR-then-LLM path is cheaper at scale because OCR is a one-time cost per page.
Accuracy benchmarks
On a public invoice dataset (1,000 invoices from 50 vendors), JSON mode with a 200-word system prompt:
- Claude Haiku 4.5: 96% field-level accuracy
- Gemini 3 Flash: 95% field-level accuracy
- Claude Sonnet 4.6: 98% field-level accuracy
- Traditional templating: 80-90% on known vendors, 0% on new
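Field-level accuracy scores each extracted field on its own instead of marking a whole document right or wrong. The sketch below shows one plausible way to compute it next to record-level accuracy. The function name `field_accuracy`, the exact-equality comparison, and the sample records are illustrative assumptions, not the method behind the figures above.

```python
def field_accuracy(predicted: list[dict], expected: list[dict], fields: list[str]) -> dict:
    """Compare predictions to ground truth field by field (exact equality)."""
    if not expected or not fields:
        raise ValueError("need at least one record and one field")
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")

    per_field_hits = {f: 0 for f in fields}
    record_hits = 0
    for pred, gold in zip(predicted, expected):
        matches = [pred.get(f) == gold.get(f) for f in fields]
        for f, ok in zip(fields, matches):
            per_field_hits[f] += ok
        record_hits += all(matches)  # a record counts only if every field matches

    n = len(expected)
    return {
        "per_field": {f: per_field_hits[f] / n for f in fields},
        "field_level": sum(per_field_hits.values()) / (n * len(fields)),
        "record_level": record_hits / n,
    }


# Made-up sample: one wrong total out of six fields across two invoices
gold = [
    {"vendor": "Acme", "total": 100.0, "currency": "USD"},
    {"vendor": "Globex", "total": 250.0, "currency": "EUR"},
]
pred = [
    {"vendor": "Acme", "total": 100.0, "currency": "USD"},
    {"vendor": "Globex", "total": 205.0, "currency": "EUR"},
]
report = field_accuracy(pred, gold, ["vendor", "total", "currency"])
print(report)
```

In this sample a single wrong field drops record-level accuracy to one half while field-level accuracy stays at five sixths, which is why the two numbers should not be compared directly.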
The LLM handles new vendor layouts zero-shot. Add 2-3 example pairs in the system prompt (few-shot) and Haiku approaches Sonnet's accuracy at a fraction of the cost.
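A few-shot prompt here just means a handful of (document text, expected JSON) pairs placed in the system prompt ahead of the real document. The sketch below only builds the `messages` list and calls no API. The helper `build_messages`, the shortened field list, and the example pairs are all hypothetical.

```python
import json

BASE_PROMPT = (
    "Extract invoice fields. Return JSON with keys vendor, invoice_number, "
    "date_iso, total. Use null for missing fields. Dates in ISO 8601."
)


def build_messages(ocr_text: str, examples: list[tuple[str, dict]]) -> list[dict]:
    """Append worked examples to the system prompt, then add the real document."""
    parts = [BASE_PROMPT]
    for i, (sample_text, sample_json) in enumerate(examples, start=1):
        parts.append(f"Example {i} input:\n{sample_text}")
        parts.append(f"Example {i} output:\n{json.dumps(sample_json)}")
    return [
        {"role": "system", "content": "\n\n".join(parts)},
        {"role": "user", "content": ocr_text},
    ]


# Hypothetical example pairs; in practice, pick layouts the model gets wrong
EXAMPLES = [
    (
        "ACME Corp\nInvoice #A-17\n2024-03-01\nTotal: 120.00 USD",
        {"vendor": "ACME Corp", "invoice_number": "A-17",
         "date_iso": "2024-03-01", "total": 120.0},
    ),
    (
        "Globex GmbH\nRechnung 5521\n05.04.2024\nSumme: 89,90 EUR",
        {"vendor": "Globex GmbH", "invoice_number": "5521",
         "date_iso": "2024-04-05", "total": 89.9},
    ),
]

messages = build_messages("Initech LLC\nInvoice 9001\n2024-05-10\nTotal: 42.00", EXAMPLES)
```

The resulting list can be passed as `messages` to the same `chat.completions.create` call used earlier. Each example adds input tokens to every request, so two or three short pairs are usually a sensible starting point.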
Cost per document
- Invoice (~2K input tokens, ~500 output): Haiku ~$0.003, Sonnet ~$0.02
- One-page contract (~5K input): Haiku ~$0.005, Sonnet ~$0.04
- 10-page contract (~50K input, structured summary out): Sonnet ~$0.30
- Resume (~3K input): Haiku ~$0.003
- Receipt image with vision: Gemini ~$0.005
Processing 10,000 invoices/month: ~$30 with Haiku. The closest commercial alternatives (Rossum, AWS Textract + post-processing) run ~$2,000-5,000/month for the same volume.
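The monthly figure is the per-document estimate multiplied by volume. The short calculation below uses only the per-invoice estimates and the volume quoted in this section; the function and constant names are illustrative.

```python
from decimal import Decimal

# Per-invoice estimates quoted above (approximate)
HAIKU_PER_INVOICE = Decimal("0.003")
SONNET_PER_INVOICE = Decimal("0.02")


def monthly_cost(docs_per_month: int, cost_per_doc: Decimal) -> Decimal:
    """Monthly spend in USD for a given volume and per-document cost."""
    return docs_per_month * cost_per_doc


haiku_monthly = monthly_cost(10_000, HAIKU_PER_INVOICE)
sonnet_monthly = monthly_cost(10_000, SONNET_PER_INVOICE)

print(f"Haiku:  ${haiku_monthly:.2f}/month")   # Haiku:  $30.00/month
print(f"Sonnet: ${sonnet_monthly:.2f}/month")  # Sonnet: $200.00/month
```

`Decimal` is used so that small per-document amounts multiply without floating-point drift.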
The 5 patterns that keep accuracy high
- Explicit null handling: tell the model to use null for missing fields, not "N/A" or empty strings. Downstream code can then distinguish "absent" from "intentionally blank"
- ISO 8601 dates: state the format explicitly. Otherwise the model picks regional formats and downstream parsing breaks
- Currency as ISO code: "USD" not "$"; "EUR" not "€". Avoid ambiguity ($ for USD vs CAD vs AUD)
- Validate the JSON: use Pydantic / Zod to enforce your schema. If invalid, re-prompt with the error
- Spot-check 1% manually for a week, then 0.1% ongoing. Track field-level accuracy, not just record-level
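The first four patterns can be enforced in code. The sketch below uses only the standard library as a stand-in for Pydantic or Zod, checking nulls, ISO dates, and currency codes, then re-prompting with the error text. The names `validate_invoice` and `extract_with_retry`, the required-key list, and the `call_model(text, feedback)` callable are hypothetical; `call_model` is assumed to wrap the chat call and return the raw JSON string.

```python
import json
import re
from datetime import date
from typing import Callable, Optional

REQUIRED_KEYS = ("vendor", "invoice_number", "date_iso", "currency", "total")
CURRENCY_RE = re.compile(r"[A-Z]{3}")  # shape check only, not an ISO 4217 lookup


def validate_invoice(data: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for key in REQUIRED_KEYS:
        if key not in data:
            errors.append(f"missing key: {key} (use null if the value is absent)")
    for key, value in data.items():
        if value in ("", "N/A", "n/a"):
            errors.append(f"{key}: use null instead of {value!r}")
    date_value = data.get("date_iso")
    if date_value is not None:
        try:
            date.fromisoformat(date_value)
        except (TypeError, ValueError):
            errors.append(f"date_iso: {date_value!r} is not an ISO 8601 date")
    currency = data.get("currency")
    if currency is not None and not (
        isinstance(currency, str) and CURRENCY_RE.fullmatch(currency)
    ):
        errors.append(f"currency: {currency!r} is not a three-letter code such as USD")
    total = data.get("total")
    if total is not None and (isinstance(total, bool) or not isinstance(total, (int, float))):
        errors.append("total: must be a number or null")
    return errors


def extract_with_retry(
    call_model: Callable[[str, Optional[str]], str],
    text: str,
    max_attempts: int = 3,
) -> dict:
    """Call the model, validate the output, and re-prompt with the errors."""
    feedback = None
    for _ in range(max_attempts):
        raw = call_model(text, feedback)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            feedback = f"Previous output was not valid JSON: {exc}"
            continue
        if not isinstance(data, dict):
            feedback = "Previous output was not a JSON object"
            continue
        errors = validate_invoice(data)
        if not errors:
            return data
        feedback = "Previous output failed validation: " + "; ".join(errors)
    raise ValueError(f"no valid extraction after {max_attempts} attempts: {feedback}")
```

Inside `call_model`, the feedback string would typically be appended as an extra user message so the model can see what to correct. Capping the attempts keeps a stubborn document from looping indefinitely; failures can then go to the manual spot-check queue.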
When to use which model
| Document type | Recommended model |
|---|---|
| Simple structured (invoice, receipt) | Claude Haiku 4.5 or Gemini 3 Flash |
| Long contracts / agreements | Claude Sonnet 4.6 (handles long context) |
| Vision-direct (PDF without OCR) | Gemini 3 Flash or Gemini 3 Pro for large pages |
| High-stakes (legal, medical) | Claude Sonnet 4.6 + human verification |
| Bulk ingestion (100K+/day) | Haiku 4.5 + few-shot examples + caching |
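In code, the table above reduces to a lookup. A minimal sketch, assuming hypothetical document-type keys and a hypothetical `pick_model` helper; `claude-haiku-4-5` and `gemini-3-flash` are the IDs used earlier in this article, while `claude-sonnet-4-6` is an assumed ID that should be checked against the provider's model list.

```python
# Document type -> model ID, following the table above
MODEL_BY_DOC_TYPE = {
    "invoice": "claude-haiku-4-5",
    "receipt": "claude-haiku-4-5",
    "contract": "claude-sonnet-4-6",   # assumed ID
    "image": "gemini-3-flash",         # vision-direct, no OCR step
    "legal": "claude-sonnet-4-6",      # assumed ID
    "medical": "claude-sonnet-4-6",    # assumed ID
}

# High-stakes types that should also go through human verification
NEEDS_HUMAN_REVIEW = {"legal", "medical"}


def pick_model(doc_type: str) -> tuple[str, bool]:
    """Return (model ID, whether a human should verify the result)."""
    try:
        model = MODEL_BY_DOC_TYPE[doc_type]
    except KeyError:
        raise ValueError(f"unknown document type: {doc_type!r}") from None
    return model, doc_type in NEEDS_HUMAN_REVIEW


print(pick_model("invoice"))
print(pick_model("medical"))
```

Keeping the mapping in one place makes it easy to swap a model after re-running the accuracy spot-checks.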
Compliance considerations
For documents containing PII (invoices have names and addresses; medical records have everything), pseudonymize before extraction or use Zero Data Retention upstream. See the compliance guide for region-specific patterns and the DSGVO (GDPR) deep dive for the German playbook.
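Pseudonymizing before extraction can be sketched as placeholder substitution: swap identifiers for tokens, send the redacted text to the model, then map the tokens back in the result. The example below covers only email addresses and phone numbers written with a leading "+". The regexes and the helper names `pseudonymize` and `restore` are illustrative assumptions, and real documents also need name and address detection, which simple patterns will not catch.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Only numbers with a leading "+", so dates and invoice numbers are left alone
PHONE_RE = re.compile(r"\+\d[\d\s().-]{7,}\d")


def pseudonymize(text: str) -> tuple[str, dict[str, str]]:
    """Replace emails and phone numbers with placeholders; return text and mapping."""
    mapping: dict[str, str] = {}  # placeholder -> original value
    reverse: dict[str, str] = {}  # original value -> placeholder

    def substitute(kind: str):
        def repl(match: re.Match) -> str:
            original = match.group(0)
            if original not in reverse:
                placeholder = f"<{kind}_{len(reverse) + 1}>"
                reverse[original] = placeholder
                mapping[placeholder] = original
            return reverse[original]  # same value always gets the same token
        return repl

    text = EMAIL_RE.sub(substitute("EMAIL"), text)
    text = PHONE_RE.sub(substitute("PHONE"), text)
    return text, mapping


def restore(value: str, mapping: dict[str, str]) -> str:
    """Put original values back into a string extracted from redacted text."""
    for placeholder, original in mapping.items():
        value = value.replace(placeholder, original)
    return value


redacted, mapping = pseudonymize(
    "Contact jane.doe@example.com or +49 30 1234567. Billing: jane.doe@example.com"
)
print(redacted)
print(restore("<EMAIL_1>", mapping))
```

The mapping stays on your side and is never sent upstream, so the model only ever sees the placeholders.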
Start: /app/signup — $2 credit covers ~600 invoice extractions. Read the chat docs at /docs/chat for JSON mode and tool-use patterns.