Data Extraction Agent

Turn unstructured text (from PDFs or emails) into structured JSON data.

OVERVIEW

Extracting structured data from unstructured documents is a classic NLP task. In this recipe, we'll build an agent that extracts invoice details into JSON validated against a schema.

1

DEFINE_THE_SCHEMA

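The trick is to give the agent a near "No-Op" tool whose main purpose is to define the output structure. The model has to call the tool with arguments that match the schema, so the tool call itself carries the extracted data.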

extract-invoice.ts
import { Agent, Tool } from '@AKIOS/sdk'
import { z } from 'zod'

// 1. Define the Schema
const InvoiceSchema = z.object({
  invoice_number: z.string(),
  date: z.string().describe("ISO 8601 format"),
  vendor: z.string(),
  line_items: z.array(z.object({
    description: z.string(),
    amount: z.number(),
    quantity: z.number()
  })),
  total: z.number()
})

// 2. Create a "Save" tool
const saveInvoice = new Tool({
  name: 'save_invoice',
  description: 'Call this tool to save the extracted invoice data.',
  schema: InvoiceSchema,
  execute: async (data) => {
    // Save to DB
    console.log("Saving:", data)
    return "Success"
  }
})

// 3. The Agent
const extractor = new Agent({
  name: 'Extractor',
  model: 'gpt-4o', // Smart models work best for complex extraction
  systemPrompt: `You are a data entry clerk.
  Extract info from the text and save it using the tool.
  If a field is missing, use an empty string for text and 0 for numbers.`, // the schema does not accept null
  tools: [saveInvoice]
})

// 4. Run
const rawText = `
  INVOICE #INV-2024-001
  Date: Jan 15, 2024
  From: Acme Corp
  
  Services:
  - Consulting: $500 (2 hrs)
  - Hosting: $50
  
  Total Due: $550
`

await extractor.run(rawText)
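
The system prompt tells the model to fill missing fields with placeholder values, so downstream code cannot tell a genuine zero from a missing value unless it checks. The following is a minimal sketch of such a check, not part of the SDK: the `Invoice` interface is a hand-written mirror of `InvoiceSchema`, and `isPlaceholder` and `findPlaceholderFields` are hypothetical helper names. It treats `null`, empty strings, and `0` as possible placeholders, which means a real total of 0 would also be flagged for review.

```typescript
// Hypothetical mirror of InvoiceSchema; nullable types cover a model that emits null.
interface LineItem {
  description: string | null;
  amount: number | null;
  quantity: number | null;
}

interface Invoice {
  invoice_number: string | null;
  date: string | null;
  vendor: string | null;
  line_items: LineItem[];
  total: number | null;
}

// Treat null, empty strings, and 0 as "possibly missing" placeholders.
function isPlaceholder(value: string | number | null): boolean {
  return value === null || value === "" || value === 0;
}

// Return the names of top-level fields that look like placeholders.
function findPlaceholderFields(invoice: Invoice): string[] {
  const scalarFields: Array<[string, string | number | null]> = [
    ["invoice_number", invoice.invoice_number],
    ["date", invoice.date],
    ["vendor", invoice.vendor],
    ["total", invoice.total],
  ];
  const flagged = scalarFields
    .filter(([, value]) => isPlaceholder(value))
    .map(([name]) => name);
  // An empty line_items array is also worth a manual look.
  if (invoice.line_items.length === 0) flagged.push("line_items");
  return flagged;
}

// Hand-written example payload, not real model output.
const partial: Invoice = {
  invoice_number: "INV-2024-001",
  date: "",
  vendor: "Acme Corp",
  line_items: [],
  total: 0,
};

// Flags date, total, and line_items for review.
console.log(findPlaceholderFields(partial));
```

A check like this could run inside the tool's `execute` callback before writing to the database, so that incomplete invoices are routed to a human instead of being saved silently.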

TESTING_EXTRACTION

Run the agent against a raw-text invoice and compare the extracted fields with the values you expect.

typescript
const rawInvoice = `
INVOICE #INV-2024-001
Date: March 10, 2024
To: Acme Corp

Items:
1. Cloud Hosting (Pro Plan) - $200.00
2. Data Storage (1TB) - $50.00

Total: $250.00
`;

const result = await extractor.run(rawInvoice); // `extractor` is the agent defined in extract-invoice.ts
console.log(JSON.stringify(result, null, 2));
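
One cheap accuracy check that needs no second model is arithmetic consistency: the line item amounts should add up to the stated total. The sketch below assumes that `amount` holds the line total rather than a unit price, which matches both sample invoices in this recipe. `InvoiceTotals`, `toCents`, and `totalsMatch` are hypothetical names, and the sample object is typed in by hand from the invoice above rather than taken from a real agent run.

```typescript
// Only the fields needed for the check; a subset of InvoiceSchema.
interface InvoiceTotals {
  line_items: { amount: number }[];
  total: number;
}

// Work in integer cents so floating-point noise cannot cause false mismatches.
function toCents(value: number): number {
  return Math.round(value * 100);
}

// True when the line item amounts add up to the stated total.
function totalsMatch(invoice: InvoiceTotals): boolean {
  const sum = invoice.line_items.reduce(
    (acc, item) => acc + toCents(item.amount),
    0
  );
  return sum === toCents(invoice.total);
}

// Values copied by hand from the test invoice: $200.00 + $50.00 = $250.00.
const extracted: InvoiceTotals = {
  line_items: [{ amount: 200.0 }, { amount: 50.0 }],
  total: 250.0,
};

console.log(totalsMatch(extracted)); // true
```

A mismatch does not say which field is wrong, only that the extraction deserves a second look.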

COST_OPTIMIZATION

For high-volume workloads, switch to a cheaper model such as `gpt-3.5-turbo` or `mistral-small` once you have verified that the prompt extracts reliably on your own documents.
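
One way to make "works reliably" measurable is to compare the cheaper model's output with a trusted reference extraction, field by field. The following is a rough sketch under that assumption: `fieldAgreement` is a hypothetical helper, both objects are hand-written stand-ins rather than real model output, and the comparison uses `JSON.stringify`, so nested objects must list their keys in the same order to count as equal.

```typescript
type Extraction = Record<string, unknown>;

// Fraction of the reference's top-level fields that the candidate reproduces exactly.
function fieldAgreement(reference: Extraction, candidate: Extraction): number {
  const keys = Object.keys(reference);
  if (keys.length === 0) return 1;
  const matches = keys.filter(
    (key) => JSON.stringify(reference[key]) === JSON.stringify(candidate[key])
  ).length;
  return matches / keys.length;
}

// Hand-written reference, e.g. checked by a person or produced by the larger model.
const reference: Extraction = {
  invoice_number: "INV-2024-001",
  date: "2024-03-10",
  line_items: [
    { description: "Cloud Hosting (Pro Plan)", amount: 200, quantity: 1 },
    { description: "Data Storage (1TB)", amount: 50, quantity: 1 },
  ],
  total: 250,
};

// Hypothetical output from a cheaper model that swapped day and month.
const candidate: Extraction = {
  ...reference,
  date: "2024-10-03",
};

console.log(fieldAgreement(reference, candidate)); // 0.75
```

Averaging this score over a batch of representative invoices gives a simple number to track before and after switching models.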