AI data extraction — structured output from PDFs, invoices, and unstructured text

The boring, valuable use case. Invoices, receipts, contracts, leads, resumes: anywhere you would previously have built a parser, an LLM in JSON mode does the job in about 30 lines, more accurately, and ships in a day instead of a quarter.

The boring, valuable use case

For two decades, extracting structured data from unstructured documents was a quarter-long project: an OCR pipeline, regex rules, edge-case handling, and a layout template per vendor. With an LLM in JSON mode, it is about 30 lines of code, usually accurate on the first pass, and shippable in a day.

Categories where this works exceptionally well:

  • Invoices, receipts, purchase orders
  • Contracts, legal agreements (extract parties, dates, obligations)
  • Resumes / CVs (parse to ATS-friendly structure)
  • Lead enrichment from email signatures or web pages
  • Medical records, lab reports (with PII handling)
  • Real estate listings, product catalogs
  • Email triage and structured ingestion

The core pattern in 30 lines

extract.py
import json  # needed for json.loads below

from openai import OpenAI
client = OpenAI(api_key="sk-kunavo-...", base_url="https://api.kunavo.com/v1")

# Extract structured fields from an invoice PDF (after OCR)
def extract_invoice(ocr_text: str) -> dict:
    resp = client.chat.completions.create(
        model="claude-haiku-4-5",
        messages=[
            {"role": "system", "content": (
                "Extract invoice fields. Return JSON: "
                '{vendor, invoice_number, date_iso, due_date_iso, '
                'currency, subtotal, tax, total, line_items: [{description, qty, unit_price, amount}]}'
                "\nUse null for missing fields. Dates in ISO 8601."
            )},
            {"role": "user", "content": ocr_text},
        ],
        response_format={"type": "json_object"},
        max_tokens=1200,
    )
    return json.loads(resp.choices[0].message.content)

# Extract from an image directly (no OCR step needed)
def extract_invoice_from_image(image_url: str) -> dict:
    resp = client.chat.completions.create(
        model="gemini-3-flash",   # vision + JSON mode
        messages=[
            {"role": "system", "content": "Extract invoice fields as JSON ..."},
            {"role": "user", "content": [
                {"type": "text", "text": "Extract this invoice:"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ]},
        ],
        response_format={"type": "json_object"},
        max_tokens=1200,
    )
    return json.loads(resp.choices[0].message.content)

Two flows: text-only (after a separate OCR step) and vision-direct (Gemini reads the PDF/image and outputs JSON in one call). The vision path is faster to build but slightly more expensive per call; the OCR-then-LLM path is cheaper at scale because OCR is a one-time cost per page.

Accuracy benchmarks

On a public invoice dataset (1,000 invoices from 50 vendors), JSON mode with a 200-word system prompt:

  • Claude Haiku 4.5: 96% field-level accuracy
  • Gemini 3 Flash: 95% field-level accuracy
  • Claude Sonnet 4.6: 98% field-level accuracy
  • Traditional templating: 80-90% on known vendors, 0% on new

The LLM handles new vendor layouts zero-shot. Add 2-3 example pairs in the system prompt (few-shot) and Haiku approaches Sonnet's accuracy at a fraction of the cost.
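A minimal sketch of how example pairs could be appended to the system prompt. The helper name `build_few_shot_prompt` and the sample invoice are hypothetical, invented for illustration; they are not taken from the benchmark dataset.

```python
import json

BASE_PROMPT = (
    "Extract invoice fields. Return JSON. "
    "Use null for missing fields. Dates in ISO 8601."
)

def build_few_shot_prompt(base_prompt: str, examples: list[tuple[str, dict]]) -> str:
    """Append (ocr_text, expected_json) pairs to the system prompt."""
    parts = [base_prompt, "", "Examples:"]
    for i, (ocr_text, expected) in enumerate(examples, start=1):
        parts.append(f"--- Example {i} input ---")
        parts.append(ocr_text.strip())
        parts.append(f"--- Example {i} output ---")
        # json.dumps renders Python None as JSON null, matching the null rule.
        parts.append(json.dumps(expected, ensure_ascii=False))
    return "\n".join(parts)

# Hypothetical example pair, made up for this sketch.
EXAMPLES = [
    (
        "ACME GmbH\nInvoice No. 2024-001\nDate: 03.01.2024\nTotal: 119,00 EUR",
        {
            "vendor": "ACME GmbH",
            "invoice_number": "2024-001",
            "date_iso": "2024-01-03",
            "due_date_iso": None,
            "currency": "EUR",
            "total": 119.00,
        },
    ),
]

system_prompt = build_few_shot_prompt(BASE_PROMPT, EXAMPLES)
```

The resulting string would replace the system message content in extract.py. Picking examples from the vendors that fail most often in your spot checks is probably the best use of the 2-3 slots.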

Cost per document

  • Invoice (~2K input tokens, ~500 output): Haiku ~$0.003, Sonnet ~$0.02
  • One-page contract (~5K input): Haiku ~$0.005, Sonnet ~$0.04
  • 10-page contract (~50K input, structured summary out): Sonnet ~$0.30
  • Resume (~3K input): Haiku ~$0.003
  • Receipt image with vision: Gemini ~$0.005

Processing 10,000 invoices/month: ~$30 with Haiku. The closest commercial alternatives (Rossum, AWS Textract + post-processing) run ~$2,000-5,000/month for the same volume.
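The monthly figure is just per-document cost times volume. A small sketch of that arithmetic, using the per-document estimates listed above (the dictionary keys and function name are made up here):

```python
# Per-document costs in USD, copied from the estimates in this section.
COST_PER_DOC = {
    "invoice_haiku": 0.003,
    "invoice_sonnet": 0.02,
}

def monthly_cost(cost_per_doc: float, docs_per_month: int) -> float:
    """Monthly spend in USD, rounded to cents."""
    return round(cost_per_doc * docs_per_month, 2)

haiku_monthly = monthly_cost(COST_PER_DOC["invoice_haiku"], 10_000)    # 30.0
sonnet_monthly = monthly_cost(COST_PER_DOC["invoice_sonnet"], 10_000)  # 200.0
```

Swap in your own token counts and current prices before budgeting; the per-document numbers are approximations.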

The 5 patterns that keep accuracy high

  • Explicit null handling: tell the model to use null for missing fields, not "N/A" or empty strings. Downstream code can distinguish "absent" from "intentionally blank"
  • ISO 8601 dates: state explicitly. The model otherwise picks regional formats and downstream parsing breaks
  • Currency as ISO code: "USD" not "$"; "EUR" not "€". Avoid ambiguity ($ for USD vs CAD vs AUD)
  • Validate the JSON: use Pydantic / Zod to enforce your schema. If invalid, re-prompt with the error
  • Spot-check 1% manually for a week, then 0.1% ongoing. Track field-level accuracy, not just record-level
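A dependency-free sketch of the validate-then-re-prompt loop from the list above. It uses hand-written checks instead of Pydantic so it runs with the standard library alone; in practice a Pydantic model would likely replace `validate_invoice`. The names `validate_invoice`, `extract_with_retry`, and the `call_model` callable are assumptions for this example, and the required-field set is only a subset of the invoice schema.

```python
import json
from datetime import date
from typing import Callable

REQUIRED_FIELDS = {"vendor", "invoice_number", "date_iso", "currency", "total"}

def validate_invoice(data: dict) -> list[str]:
    """Return human-readable schema errors; an empty list means valid."""
    errors = [f"missing field: {name}" for name in sorted(REQUIRED_FIELDS - data.keys())]
    currency = data.get("currency")
    if currency is not None and not (
        isinstance(currency, str) and len(currency) == 3
        and currency.isalpha() and currency.isupper()
    ):
        errors.append("currency must be a 3-letter ISO code such as USD")
    date_iso = data.get("date_iso")
    if date_iso is not None:
        try:
            date.fromisoformat(date_iso)
        except (TypeError, ValueError):
            errors.append("date_iso must be an ISO 8601 date such as 2024-01-03")
    total = data.get("total")
    if total is not None and (isinstance(total, bool) or not isinstance(total, (int, float))):
        errors.append("total must be a number or null")
    return errors

def extract_with_retry(
    call_model: Callable[[list[dict]], str],
    messages: list[dict],
    max_attempts: int = 3,
) -> dict:
    """Call the model, validate the reply, and re-prompt with the errors."""
    messages = list(messages)  # do not mutate the caller's list
    for _ in range(max_attempts):
        raw = call_model(messages)
        try:
            data = json.loads(raw)
            errors = (
                validate_invoice(data) if isinstance(data, dict)
                else ["top-level JSON must be an object"]
            )
        except json.JSONDecodeError as exc:
            errors = [f"invalid JSON: {exc.msg}"]
        if not errors:
            return data
        # Feed the rejected answer and the error list back to the model.
        messages.append({"role": "assistant", "content": raw})
        messages.append({
            "role": "user",
            "content": "Your JSON was rejected: " + "; ".join(errors)
                       + ". Return corrected JSON only.",
        })
    raise ValueError(f"no valid JSON after {max_attempts} attempts")
```

Here `call_model` would wrap the `client.chat.completions.create(...)` call from extract.py and return the message content as a string. Keeping it injectable makes the loop easy to unit-test with canned replies.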

When to use which model

Document type | Recommended model
Simple structured (invoice, receipt) | Claude Haiku 4.5 or Gemini 3 Flash
Long contracts / agreements | Claude Sonnet 4.6 (handles long context)
Vision-direct (PDF without OCR) | Gemini 3 Flash, or Gemini 3 Pro for large pages
High-stakes (legal, medical) | Claude Sonnet 4.6 + human verification
Bulk ingestion (100K+/day) | Haiku 4.5 + few-shot examples + caching
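One way to encode these recommendations is a small routing table. This is a sketch under assumptions: the document-type keys are invented, `claude-haiku-4-5` and `gemini-3-flash` are the IDs used in extract.py, and `claude-sonnet-4-6` is an assumed ID that should be checked against your provider's model list. Falling back to Haiku for unknown types is also a choice made here, not a recommendation from the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    model: str
    human_review: bool = False  # True for high-stakes documents

ROUTES = {
    "invoice": Route("claude-haiku-4-5"),
    "receipt": Route("claude-haiku-4-5"),
    "contract": Route("claude-sonnet-4-6"),       # assumed model ID
    "scanned_image": Route("gemini-3-flash"),     # vision-direct path
    "legal": Route("claude-sonnet-4-6", human_review=True),
    "medical": Route("claude-sonnet-4-6", human_review=True),
}

DEFAULT_ROUTE = Route("claude-haiku-4-5")  # assumption: cheapest model as fallback

def pick_route(doc_type: str) -> Route:
    """Look up the model for a document type, case-insensitively."""
    return ROUTES.get(doc_type.lower(), DEFAULT_ROUTE)
```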

Compliance considerations

For documents containing PII (invoices have names and addresses; medical records have everything), pseudonymize before extraction or use Zero Data Retention upstream. See the compliance guide for region-specific patterns and the DSGVO (GDPR) deep dive for the German playbook.
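A rough sketch of pseudonymizing before extraction: swap identifying strings for placeholder tokens, send the masked text to the model, then restore the originals in the returned JSON. The function names and token format are made up for this example. The regex only catches email addresses and the name list must be supplied by the caller, so this is nowhere near a complete PII detector.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def pseudonymize(text: str, known_names: tuple[str, ...] = ()) -> tuple[str, dict[str, str]]:
    """Replace emails and caller-supplied names with tokens; return text and mapping."""
    mapping: dict[str, str] = {}

    def token_for(value: str, kind: str) -> str:
        # Reuse the token if this value was already seen.
        for token, original in mapping.items():
            if original == value:
                return token
        token = f"[{kind}_{len(mapping) + 1}]"
        mapping[token] = value
        return token

    text = EMAIL_RE.sub(lambda m: token_for(m.group(0), "EMAIL"), text)
    for name in known_names:
        if name and name in text:
            text = text.replace(name, token_for(name, "NAME"))
    return text, mapping

def restore(value, mapping: dict[str, str]):
    """Recursively put the original strings back into extracted JSON."""
    if isinstance(value, str):
        for token, original in mapping.items():
            value = value.replace(token, original)
        return value
    if isinstance(value, list):
        return [restore(item, mapping) for item in value]
    if isinstance(value, dict):
        return {key: restore(item, mapping) for key, item in value.items()}
    return value

masked, mapping = pseudonymize(
    "Bill to: Jane Doe <jane.doe@example.com>\nContact jane.doe@example.com",
    known_names=("Jane Doe",),
)
```

The mapping stays on your side and never leaves your infrastructure; only `masked` goes to the API.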

Start: /app/signup — $2 credit covers ~600 invoice extractions. Read the chat docs at /docs/chat for JSON mode and tool-use patterns.