Document ingestion

Ingestion is the entry point for every document in Simon. A dedicated sub-agent reads each file, splits it into segments when it contains several documents, and extracts structured data ready for the rest of the accounting cycle. The process is fully automatic: you upload, Simon orchestrates.

flowchart LR
A[PDF/CSV file] --> B[Segmentation]
B --> C[Annotation]
C --> D[Validation + upload]
D --> E[Workflow check]

1. Segmentation

A file is not necessarily a single document. A ten-page PDF may combine a bank statement (pages 1 to 6), an invoice (page 7) and a tax notice (pages 8 to 10). The sub-agent therefore starts by identifying the segments, that is, the logical documents contained in the file. For each one it declares a type (FACTURE, RELEVE_BANCAIRE, FRAIS, BULLETIN_PAIE, AVIS_FISCAL, ENGAGEMENT or OD), a page range and a short summary.
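To make the idea of a segment concrete, here is a minimal sketch of how the declaration for the ten-page example could be represented. The type names come from the list above; the `Segment` class, its field names and the `covers_file` helper are illustrative assumptions, not Simon's actual schema, and the rule that segments must cover every page is assumed for the sake of the example.

```python
from dataclasses import dataclass
from enum import Enum


class SegmentType(Enum):
    # Type names taken from the list above.
    FACTURE = "FACTURE"
    RELEVE_BANCAIRE = "RELEVE_BANCAIRE"
    FRAIS = "FRAIS"
    BULLETIN_PAIE = "BULLETIN_PAIE"
    AVIS_FISCAL = "AVIS_FISCAL"
    ENGAGEMENT = "ENGAGEMENT"
    OD = "OD"


@dataclass(frozen=True)
class Segment:
    # Hypothetical shape: field names are illustrative only.
    type: SegmentType
    first_page: int  # 1-based, inclusive
    last_page: int   # 1-based, inclusive
    summary: str


def covers_file(segments: list[Segment], page_count: int) -> bool:
    """Return True if the segments tile pages 1..page_count without gap or overlap.

    Assumption: every page belongs to exactly one segment.
    """
    expected = 1
    for seg in sorted(segments, key=lambda s: s.first_page):
        if seg.first_page != expected or seg.last_page < seg.first_page:
            return False
        expected = seg.last_page + 1
    return expected == page_count + 1


# The ten-page PDF from the example above.
segments = [
    Segment(SegmentType.RELEVE_BANCAIRE, 1, 6, "Bank statement"),
    Segment(SegmentType.FACTURE, 7, 7, "Invoice"),
    Segment(SegmentType.AVIS_FISCAL, 8, 10, "Tax notice"),
]
```

Calling `covers_file(segments, 10)` on this declaration confirms that the three segments account for all ten pages.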

2. Annotation, by type

Each segment is then handed to a specialized annotation skill. The skill knows the expected schema for its document type and guides the extraction of the right data:

| Document type | Extracted data |
| --- | --- |
| Invoice | Number, dates, third party (SIREN / intra-EU VAT), net/VAT/gross amounts, lines, category |
| Bank statement (PDF) | IBAN, bank, transactions (date, label, amount, direction), balances per account |
| Bank statement (CSV) | Mapping of columns to standardized transactions |
| Expense report | Date, issuer, amounts, category, details (fuel, meals…) |
| Payslip | Period, employee, gross, contributions, taxable net, net to pay |
| Tax notice | Type of tax, body, amount, deadline, taxable base |
| Engagement | Type (loan, leasing…), schedule, rate, term |
| OD | Type of OD, date, balanced accounting lines |
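As an illustration of the CSV case, the sketch below maps the columns of a bank export to standardized transactions. Everything specific here is an assumption made for the example: the header names (`Date`, `Libelle`, `Montant`), the semicolon delimiter, French-style decimals, the DD/MM/YYYY date format and the rule that a negative amount is a debit. Real exports vary from bank to bank.

```python
import csv
import io
from datetime import datetime
from decimal import Decimal

# Hypothetical mapping from standardized field to the bank's CSV header.
COLUMN_MAPPING = {"date": "Date", "label": "Libelle", "amount": "Montant"}


def standardize(csv_text: str, mapping: dict[str, str], delimiter: str = ";") -> list[dict]:
    """Turn raw CSV rows into standardized transactions."""
    rows = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    transactions = []
    for row in rows:
        # Assumes French-style decimals ("1 234,56") and DD/MM/YYYY dates.
        raw_amount = row[mapping["amount"]].replace(" ", "").replace(",", ".")
        amount = Decimal(raw_amount)
        transactions.append({
            "date": datetime.strptime(row[mapping["date"]], "%d/%m/%Y").date().isoformat(),
            "label": row[mapping["label"]].strip(),
            "amount": abs(amount),
            # Assumption: the sign of the amount carries the direction.
            "direction": "credit" if amount >= 0 else "debit",
        })
    return transactions


sample = (
    "Date;Libelle;Montant\n"
    "05/01/2024;VIR CLIENT;1200,00\n"
    "06/01/2024;CB CARBURANT;-45,90\n"
)
transactions = standardize(sample, COLUMN_MAPPING)
```

Keeping the mapping as plain data means a new bank format only needs a new dictionary, not new parsing code.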

3. Validation and upload

An annotation is only accepted if it holds up. Simon validates it against a strict schema and a few business checks: consistency of amounts (net + VAT = gross for an invoice), accounting balance (debit = credit for an OD), consistency of balances (opening balance + movements = closing balance for a statement), and format conformity (normalized dates, amounts to two decimal places, nine-digit SIREN).
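The business checks listed above are simple arithmetic and format rules. A minimal sketch, assuming amounts are handled as `Decimal` values and that movements are signed (credits positive, debits negative); the function names are hypothetical and do not reflect Simon's internal API:

```python
import re
from decimal import Decimal


def invoice_amounts_ok(net: Decimal, vat: Decimal, gross: Decimal) -> bool:
    """Net + VAT must equal gross."""
    return net + vat == gross


def od_balanced(lines: list[tuple[Decimal, Decimal]]) -> bool:
    """Each line is a (debit, credit) pair; total debit must equal total credit."""
    return sum(d for d, _ in lines) == sum(c for _, c in lines)


def statement_balanced(opening: Decimal, movements: list[Decimal], closing: Decimal) -> bool:
    """Opening balance plus signed movements must equal the closing balance."""
    return opening + sum(movements) == closing


def has_two_decimals(amount: Decimal) -> bool:
    """True when the amount is written with exactly two decimal places."""
    return amount.as_tuple().exponent == -2


def is_siren(value: str) -> bool:
    """True when the value is exactly nine ASCII digits."""
    return re.fullmatch(r"[0-9]{9}", value) is not None
```

Using `Decimal` rather than `float` matters here: equality checks such as net + VAT = gross would otherwise fail on rounding noise.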

4. Batch check

Once all segments are annotated and uploaded, Simon checks that the batch is complete: has every declared segment actually been processed? If everything is in order, the files are archived and the documents enter the workflow. Otherwise, they remain pending correction.
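The completeness check amounts to comparing what was declared with what was actually processed. A minimal sketch, assuming segments are tracked by identifier; the identifiers and the status labels `"archived"` and `"pending_correction"` are hypothetical names chosen for the example:

```python
def batch_status(declared_ids: set[str], uploaded_ids: set[str]) -> tuple[str, set[str]]:
    """Return the batch outcome and the declared segments still missing."""
    missing = declared_ids - uploaded_ids
    if missing:
        # At least one declared segment was never processed.
        return "pending_correction", missing
    return "archived", set()


declared = {"seg-1", "seg-2", "seg-3"}
complete = batch_status(declared, {"seg-1", "seg-2", "seg-3"})
incomplete = batch_status(declared, {"seg-1", "seg-3"})
```

In the second call the result names `seg-2` as the segment that still needs attention.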


The special case of bank statements

Statements receive additional processing, because the direction of a transaction cannot always be read from the text. For PDFs, Simon relies on the position of the credit/debit columns to confirm the direction of each line. It then checks that each account (main account, card…) has an opening balance, movements and a consistent closing balance, before recording each transaction as a bank transaction in the database.
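To illustrate the column-position idea, here is a minimal sketch that decides the direction of a line from the horizontal position of its amount on the page. The coordinates, the `ColumnBand` class and the function name are assumptions made for the example; how Simon actually locates the columns is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnBand:
    # Hypothetical horizontal extent of a column, in page units.
    x_min: float
    x_max: float

    def contains(self, x: float) -> bool:
        return self.x_min <= x <= self.x_max


def direction_from_position(amount_x: float, debit: ColumnBand, credit: ColumnBand) -> Optional[str]:
    """Return "debit" or "credit" depending on the column the amount falls in.

    Returns None when the amount sits in neither column, so the caller
    can flag the line instead of guessing.
    """
    if debit.contains(amount_x):
        return "debit"
    if credit.contains(amount_x):
        return "credit"
    return None


debit_column = ColumnBand(400.0, 470.0)
credit_column = ColumnBand(480.0, 550.0)
```

With these bands, an amount printed at x = 430 is read as a debit and one at x = 500 as a credit, whatever the label says.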

Reclassification

If a segment has been assigned the wrong type (an invoice taken for an expense report, for example), the agent can reclassify it. The document then goes back through annotation, this time with the right procedure.

Also worth noting: duplicate blocking is soft. When a file closely resembles an already-processed document without being an exact copy, Simon asks you to confirm whether it is a deliberate duplicate.


What triggers what comes next

A successful upload brings the document into the lifecycle. It starts in the uploaded state, and the workflow then moves it forward through data extraction and validation of the checks. The agent only proceeds to qualification once the document is validated.
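The lifecycle described above can be pictured as a small linear state machine. A minimal sketch, in which the status names are hypothetical labels for the three steps mentioned (uploaded, data extracted, checks validated) rather than Simon's actual identifiers:

```python
from enum import Enum


class Status(Enum):
    # Hypothetical status names, in lifecycle order.
    UPLOADED = 1
    DATA_EXTRACTED = 2
    CHECKS_VALIDATED = 3


def advance(status: Status) -> Status:
    """Move a document to the next step; the last step is terminal here."""
    if status is Status.CHECKS_VALIDATED:
        return status
    return Status(status.value + 1)


def can_qualify(status: Status) -> bool:
    """Qualification is only allowed once the checks are validated."""
    return status is Status.CHECKS_VALIDATED


status = Status.UPLOADED
history = [status]
while not can_qualify(status):
    status = advance(status)
    history.append(status)
```

After the loop, `history` holds the three statuses in order, and only the last one allows qualification.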