The boring valuable use case

For two decades, structured data extraction from unstructured documents was a quarter-long project: OCR pipeline, regex rules, edge-case handling, layout templates per vendor. With LLMs in JSON mode, it's about 30 lines of code, usually accurate on the first pass, and can ship in a day.

Categories where this works exceptionally well:

Invoices, receipts, purchase orders
Contracts, legal agreements (extract parties, dates, obligations)
Resumes / CVs (parse to ATS-friendly structure)
Lead enrichment from email signatures or web pages
Medical records, lab reports (with PII handling)
Real estate listings, product catalogs
Email triage and structured ingestion

The core pattern in 30 lines

extract.py

import json

from openai import OpenAI
client = OpenAI(api_key="sk-kunavo-...", base_url="https://api.kunavo.com/v1")

# Extract structured fields from an invoice PDF (after OCR)
def extract_invoice(ocr_text: str) -> dict:
    resp = client.chat.completions.create(
        model="claude-haiku-4-5",
        messages=[
            {"role": "system", "content": (
                "Extract invoice fields. Return JSON: "
                '{vendor, invoice_number, date_iso, due_date_iso, '
                'currency, subtotal, tax, total, line_items: [{description, qty, unit_price, amount}]}'
                "\nUse null for missing fields. Dates in ISO 8601."
            )},
            {"role": "user", "content": ocr_text},
        ],
        response_format={"type": "json_object"},
        max_tokens=1200,
    )
    return json.loads(resp.choices[0].message.content)

# Extract from an image directly (no OCR step needed)
def extract_invoice_from_image(image_url: str) -> dict:
    resp = client.chat.completions.create(
        model="gemini-2-5-flash",   # vision + JSON mode
        messages=[
            {"role": "system", "content": "Extract invoice fields as JSON ..."},
            {"role": "user", "content": [
                {"type": "text", "text": "Extract this invoice:"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ]},
        ],
        response_format={"type": "json_object"},
        max_tokens=1200,
    )
    return json.loads(resp.choices[0].message.content)

Two flows: text-only (after a separate OCR step) and vision-direct (Gemini reads the PDF/image and outputs JSON in one call). The vision path is faster to build but slightly more expensive per call; the OCR-then-LLM path is cheaper at scale because OCR is a one-time cost per page, so the text can be re-extracted later without paying for vision again.

Accuracy benchmarks

On a public invoice dataset (1,000 invoices from 50 vendors), JSON mode with a 200-word system prompt:

Claude Haiku 4.5: 96% field-level accuracy
Gemini 2.5 Flash: 95% field-level accuracy
Claude Sonnet 4.6: 98% field-level accuracy
Traditional templating: 80-90% on known vendors, 0% on new vendors
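
Field-level accuracy can be scored in more than one way, and the article does not say how the benchmark did it. A minimal sketch of one plausible definition, assuming exact-match comparison against hand-labeled values (the function names and sample records are illustrative, not taken from the benchmark):

```python
from typing import Any

Record = dict[str, Any]


def field_accuracy(predicted: list[Record], expected: list[Record],
                   fields: list[str]) -> dict[str, float]:
    """Per field: share of documents where the extracted value equals the label."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("need two non-empty lists of equal length")
    return {
        field: sum(p.get(field) == e.get(field)
                   for p, e in zip(predicted, expected)) / len(expected)
        for field in fields
    }


def record_accuracy(predicted: list[Record], expected: list[Record],
                    fields: list[str]) -> float:
    """Share of documents where every listed field matches (a stricter metric)."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(all(p.get(f) == e.get(f) for f in fields)
               for p, e in zip(predicted, expected))
    return hits / len(expected)


# Illustrative data: the second invoice has a transposed digit in `total`.
FIELDS = ["vendor", "currency", "total"]
labels = [
    {"vendor": "Acme", "currency": "USD", "total": 100.0},
    {"vendor": "Globex", "currency": "EUR", "total": 205.0},
]
extracted = [
    {"vendor": "Acme", "currency": "USD", "total": 100.0},
    {"vendor": "Globex", "currency": "EUR", "total": 250.0},
]
per_field = field_accuracy(extracted, labels, FIELDS)
per_record = record_accuracy(extracted, labels, FIELDS)
```

Here five of six fields are correct but only one of two records is fully correct, which is why the two metrics can tell very different stories about the same run.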

The LLM handles new vendor layouts zero-shot. Add 2-3 example pairs in the system prompt (few-shot) and Haiku approaches Sonnet's accuracy at a fraction of the cost.
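
Few-shot prompting here just means appending a few input/output pairs to the system prompt. A minimal sketch, assuming a shortened base prompt and two made-up example invoices (`build_few_shot_prompt`, `BASE_PROMPT`, and the sample data are hypothetical):

```python
import json

# Shortened stand-in for the system prompt used in extract.py.
BASE_PROMPT = ("Extract invoice fields. Return JSON. "
               "Use null for missing fields. Dates in ISO 8601.")

# Made-up (OCR text, expected fields) pairs; real ones would come from
# invoices you have already labeled by hand.
EXAMPLES = [
    ("ACME GmbH\nRechnung Nr. 2024-0117\nDatum: 15.03.2024\nGesamt: 119,00 EUR",
     {"vendor": "ACME GmbH", "invoice_number": "2024-0117",
      "date_iso": "2024-03-15", "due_date_iso": None,
      "currency": "EUR", "total": 119.0}),
    ("Globex Inc.\nInvoice #88231\nDate: 03/02/2024\nDue: 04/01/2024\nTotal: $540.00",
     {"vendor": "Globex Inc.", "invoice_number": "88231",
      "date_iso": "2024-03-02", "due_date_iso": "2024-04-01",
      "currency": "USD", "total": 540.0}),
]


def build_few_shot_prompt(base_prompt: str,
                          examples: list[tuple[str, dict]]) -> str:
    """Append numbered input/output pairs to the base system prompt."""
    parts = [base_prompt, "", "Examples:"]
    for i, (ocr_text, fields) in enumerate(examples, start=1):
        parts.append(f"Input {i}:\n{ocr_text.strip()}")
        # json.dumps renders None as null, matching the null-handling rule.
        parts.append(f"Output {i}:\n{json.dumps(fields, ensure_ascii=False)}")
    return "\n".join(parts)


system_prompt = build_few_shot_prompt(BASE_PROMPT, EXAMPLES)
```

The resulting string would replace the `"content"` of the system message in `extract_invoice`. Each example adds input tokens to every call, so two or three short pairs is a reasonable starting point.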

Cost per document

Invoice (~2K input tokens, ~500 output): Haiku ~$0.003, Sonnet ~$0.02
One-page contract (~5K input): Haiku ~$0.005, Sonnet ~$0.04
10-page contract (~50K input, structured summary out): Sonnet ~$0.30
Resume (~3K input): Haiku ~$0.003
Receipt image with vision: Gemini ~$0.005

Processing 10,000 invoices/month: ~$30 with Haiku. The closest commercial alternatives (Rossum, AWS Textract + post-processing) run ~$2,000-5,000/month for the same volume.
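
The monthly figure is just volume times the per-document estimate. A small worked check using only the per-invoice numbers quoted above (the dictionary keys and function name are illustrative; `Decimal` is used so the arithmetic is exact):

```python
from decimal import Decimal

# Per-invoice estimates from the list above (~2K input, ~500 output tokens).
COST_PER_INVOICE_USD = {
    "haiku": Decimal("0.003"),
    "sonnet": Decimal("0.02"),
}


def monthly_cost(docs_per_month: int, cost_per_doc: Decimal) -> Decimal:
    """API spend per month, ignoring retries, OCR, and engineering time."""
    return docs_per_month * cost_per_doc


haiku_monthly = monthly_cost(10_000, COST_PER_INVOICE_USD["haiku"])
sonnet_monthly = monthly_cost(10_000, COST_PER_INVOICE_USD["sonnet"])
print(f"Haiku:  ${haiku_monthly:.2f}/month")
print(f"Sonnet: ${sonnet_monthly:.2f}/month")
```

Retries after failed validation and any few-shot examples in the prompt add tokens, so real spend will sit somewhat above these numbers.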

The 5 patterns that keep accuracy high

Explicit null handling: tell the model to use null for missing fields, not "N/A" or empty strings. Downstream code can distinguish "absent" from "intentionally blank"
ISO 8601 dates: state explicitly. The model otherwise picks regional formats and downstream parsing breaks
Currency as ISO code: "USD" not "$"; "EUR" not "€". Avoid ambiguity ($ for USD vs CAD vs AUD)
Validate the JSON: use Pydantic / Zod to enforce your schema. If invalid, re-prompt with the error
Spot-check 1% manually for a week, then 0.1% ongoing. Track field-level accuracy, not just record-level
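
The first four patterns can be sketched as one validate-and-retry loop. The version below uses only the standard library as a stand-in for a Pydantic or Zod schema, and takes the model call as an injected function so it can be tried without an API key. `validate_invoice`, `extract_with_retry`, and `call_model` are hypothetical names, and the checks cover only a few fields:

```python
import json
import re
from datetime import date
from typing import Callable, Optional

DATE_FIELDS = ("date_iso", "due_date_iso")
PLACEHOLDERS = {"", "N/A", "n/a", "-"}       # strings that should have been null
ISO_CURRENCY = re.compile(r"[A-Z]{3}")        # shape check only, not a code lookup


def validate_invoice(raw: str) -> tuple[Optional[dict], list[str]]:
    """Return (data, []) if the reply passes all checks, else (None, errors)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value must be a JSON object"]

    errors: list[str] = []
    for key, value in data.items():
        if isinstance(value, str) and value.strip() in PLACEHOLDERS:
            errors.append(f"{key}: use null for a missing value, not {value!r}")
            continue
        if key in DATE_FIELDS and value is not None:
            try:
                date.fromisoformat(value)
            except (TypeError, ValueError):
                errors.append(f"{key}: expected an ISO 8601 date, got {value!r}")
        if key == "currency" and value is not None:
            if not (isinstance(value, str) and ISO_CURRENCY.fullmatch(value)):
                errors.append(f"currency: expected a 3-letter code, got {value!r}")
    return (None, errors) if errors else (data, [])


def extract_with_retry(call_model: Callable[[str], str], ocr_text: str,
                       max_attempts: int = 3) -> dict:
    """Call the model, validate, and re-prompt with the errors on failure."""
    prompt, errors = ocr_text, []
    for _ in range(max_attempts):
        data, errors = validate_invoice(call_model(prompt))
        if data is not None:
            return data
        feedback = "\n".join(f"- {e}" for e in errors)
        prompt = (f"{ocr_text}\n\nYour previous answer was rejected:\n"
                  f"{feedback}\nReturn corrected JSON only.")
    raise ValueError(f"no valid extraction after {max_attempts} attempts: {errors}")
```

In production `call_model` would wrap the `client.chat.completions.create` call from `extract.py`. Capping the attempts matters: a document that fails three times is usually better routed to a human than retried indefinitely.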

When to use which model

Document type	Recommended model
Simple structured (invoice, receipt)	Claude Haiku 4.5 or Gemini 2.5 Flash
Long contracts / agreements	Claude Sonnet 4.6 (handles long context)
Vision-direct (PDF without OCR)	Gemini 2.5 Flash or Gemini 2.5 Pro for large pages
High-stakes (legal, medical)	Claude Sonnet 4.6 + human verification
Bulk ingestion (100K+/day)	Haiku 4.5 + few-shot examples + caching
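
The table above can be encoded as a small routing function. In this sketch the document-type keys are invented for illustration, and the `claude-sonnet-4-6` ID is an assumption that follows the naming pattern of the two IDs used in `extract.py`; check the actual model list before relying on it:

```python
# Model IDs: the Haiku and Gemini Flash IDs appear in extract.py;
# the Sonnet ID is assumed by analogy and may differ in practice.
HAIKU = "claude-haiku-4-5"
SONNET = "claude-sonnet-4-6"      # assumed ID
GEMINI_FLASH = "gemini-2-5-flash"

# Hypothetical document-type keys mapped to the table's recommendations.
MODEL_BY_DOC_TYPE = {
    "invoice": HAIKU,
    "receipt": HAIKU,
    "long_contract": SONNET,
    "scanned_pdf": GEMINI_FLASH,   # vision-direct, no OCR step
    "legal": SONNET,
    "medical": SONNET,
}

# High-stakes types get a human verification step after extraction.
NEEDS_HUMAN_REVIEW = {"legal", "medical"}


def pick_model(doc_type: str) -> tuple[str, bool]:
    """Return (model_id, needs_human_review) for a known document type."""
    try:
        model = MODEL_BY_DOC_TYPE[doc_type]
    except KeyError:
        raise ValueError(f"unknown document type: {doc_type!r}") from None
    return model, doc_type in NEEDS_HUMAN_REVIEW
```

Raising on an unknown type, rather than silently falling back to a default model, keeps new document types from being processed under a cost or accuracy profile nobody chose.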

Compliance considerations

For documents containing PII (invoices have names, addresses; medical records have everything), pseudonymize before extraction or use Zero Data Retention upstream. See the compliance guide for region-specific patterns and the DSGVO deep dive for the German playbook.
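
Pseudonymizing before extraction means swapping identifiers for placeholder tokens, sending the masked text to the model, and restoring the originals locally. A deliberately simplistic sketch, assuming regex detection of email addresses and international phone numbers only (names, addresses, and IDs would need a proper PII detector; `pseudonymize` and `restore` are hypothetical helpers):

```python
import re

# Simplistic patterns for illustration; they will miss many real-world formats.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+\d[\d\s-]{7,}\d"),   # requires a leading "+"
}


def pseudonymize(text: str) -> tuple[str, dict[str, str]]:
    """Replace matches with tokens; return masked text and token -> original map."""
    token_for: dict[str, str] = {}

    def make_sub(kind: str):
        def sub(match: re.Match) -> str:
            original = match.group(0)
            # Reuse the same token when the same value appears again.
            if original not in token_for:
                token_for[original] = f"[{kind}_{len(token_for) + 1}]"
            return token_for[original]
        return sub

    for kind, pattern in PATTERNS.items():
        text = pattern.sub(make_sub(kind), text)
    mapping = {token: original for original, token in token_for.items()}
    return text, mapping


def restore(text: str, mapping: dict[str, str]) -> str:
    """Put the original values back into extracted text."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


source = ("Bill to: jane.doe@example.com, tel +49 30 1234567. "
          "Reply to jane.doe@example.com.")
masked, mapping = pseudonymize(source)
```

The mapping never leaves your infrastructure; only `masked` is sent upstream, and `restore` is applied to the string fields of the returned JSON.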

Start: /app/signup — pay-as-you-go from a $5 top-up covers ~1,500 invoice extractions, and your balance never expires. Read the chat docs at /docs/chat for JSON mode and tool-use patterns.

