## DocProcessor Summary

DocProcessor is an AI-powered document understanding engine leveraging **Large Language Models (LLM)** and **Vision-Language Models (VLM)** to read, interpret, and extract structured data from diverse text-based documents across multiple languages and formats.

### Key Value Propositions

* **Flexibility**: Choose precisely which data to extract via configuration, with no model training required
* **Speed**: Integrate new document types quickly without model retraining, reducing time from weeks to days
* **Robustness**: High-quality extraction powered by LLM/VLM technology
* **Multilingual**: Native support for 5 major European languages
* **Easy Integration**: RESTful API for seamless system integration

## Technical Specifications

### Supported Languages (V1)

DocProcessor supports the following languages for document text:

* French
* English
* German
* Italian
* Spanish

### Supported Document Types (V1)

| Document Type | Description | Key Fields Extracted |
| --- | --- | --- |
| Bank Account | Bank account document containing bank and account holder details | IBAN, BIC, Bank Name, Account Owner Name |
| Energy Invoice | Simple energy consumption invoice without payment schedule | Address, Customer Name, Issue Date |
| Energy Schedule | Monthly energy invoice payment schedule | Address, Customer Name, Issue Date |
| Family Allowance | Family allowance document | Name, Address, Beneficiary Number, Payment Date, Benefit Amount, Family Quotient |
| Insurance Attestation | Insurance attestation document containing personal information | Name, Address, Issue Date |
| Payslip | Payslip document containing details about the employer, employee, and payment | Employer Name, Employer Address, Employee Name, Employee Address, Payslip Date, Net Salary, Gross Monthly Income, Monthly Income, Annual Income, Entry Date, Company Office ID, Code NAF, NIR |
| Phone Invoice | Phone invoice document | Name, Address, Issue Date, Phone Number |
| Provider Attestation | Subscription or plan attestation for a residence | Name, Address, Issue Date |
| Retirement Pension | Retirement pension containing personal information with address | Name, Address, Issue Date |
| Tax Notice | Tax notice document containing fiscal and personal information | Tax Year, Fiscal Number 1, Fiscal Number 2, Tax Reference, Address, Date, IBAN, BIC, Taxable Income, Tax Reference Income, Global Gross Income |
| Vehicle Registration Certificate | The French vehicle registration certificate (carte grise) issued by the Agence Nationale des Titres Sécurisés (ANTS) or authorized professionals. Contains details about the vehicle, its owner, and technical specifications | Registration Number, VIN, Vehicle Type, Vehicle Make, First Registration Date, Issue Date, Formula Number, Legal Entity |

### Document Format (V1)

* **Formats**: PDF, JPEG, PNG
* **Pages**: Multi-page support
* **Document Type**: Text document
* **Maximum Size**: 20 MiB

### Performance & SLA

* **Target Response Time**: 10 seconds
* **Availability**: 100%
* **API Standard**: REST
* **Output Format**: Structured JSON

## How It Works

DocProcessor follows a simple workflow to process your documents and extract structured data:

### Step 1: Submit Your Document

Upload your document in one of the supported formats:

* PDF
* JPEG
* PNG

The system accepts documents up to 20 MB in size.

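
A client could check the format and size constraints locally before submitting a document. The following is a minimal sketch, not part of DocProcessor itself: the function name `check_document`, the extension mapping, and the byte limit are assumptions. Because the limit is stated as both 20 MiB and 20 MB in this document, the sketch uses the smaller value (20,000,000 bytes) as a conservative reading.

```python
from pathlib import Path

# Formats listed in this document; the mapping to file extensions is an assumption.
SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png"}
# Conservative reading of the documented limit (assumption): 20 MB = 20,000,000 bytes.
MAX_SIZE_BYTES = 20 * 1000 * 1000


def check_document(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks submittable."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes > MAX_SIZE_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems


print(check_document("payslip.pdf", 1_500_000))
print(check_document("scan.tiff", 25_000_000))
```

A check like this only covers the extension and size; the service may still reject a file whose content does not match its extension.
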

### Step 2: Automatic Processing

Once submitted, DocProcessor automatically:

* Reads and interprets the document content
* Identifies the document type
* Processes the document in its original language (French, Italian, German, English, or Spanish)
* Extracts the specified data fields according to the configuration

### Step 3: Receive Structured Data

The system returns extracted data as structured JSON, with typed data structures:

* **TEXT**: Simple textual values
* **ADDRESS**: Structured address with street, zip code, and city
* **NAMES**: First names and last names separated
* **DATE**: ISO-formatted dates (YYYY-MM-DD)

### Key Capabilities

**Multilingual Support**: DocProcessor automatically detects and processes documents in French, Italian, German, English, or Spanish without requiring language specification.

**New Document Types**: The system can adapt to new document formats without model retraining, ensuring no downtime when introducing new document types.

### Data Types Supported

DocProcessor extracts fields with the following data types:

* **TEXT**: Simple textual values (e.g., company name, fiscal number)
* **ADDRESS**: Structured address with street, zip code, and city
* **NAMES**: First names and last names separated
* **DATE**: ISO-formatted dates (YYYY-MM-DD)

## Use Cases

### Example Use Case: Payslip Data Extraction

A customer needs to extract specific data from employee payslips across multiple countries and formats.

**Traditional Approach**:

* Requires approximately 8 weeks to train models for each document variant
* Limited flexibility for field customization
* Significant development effort for new formats

**DocProcessor Approach**:

* Extract exactly the fields needed (e.g., last name, first name, gross salary)
* Integration in days instead of weeks
* Easy modification via prompt-based configuration
* Support for documents from different countries without retraining

### Example Use Case: KYC Onboarding

Financial institutions can automate customer data extraction from:

* Tax notices for income verification
* Bank account statements for IBAN validation
* KBIS documents for company verification
* Multiple document types in a single workflow

## API Integration

### Sample JSON Output

```json
{
  "textDocumentInfo": {
    "documentTypeDetail": "PAYSLIP",
    "fields": {
      "EMPLOYER_NAME": { "data_type": "TEXT", "value": "NETHEOS" },
      "NET_SALARY": { "data_type": "TEXT", "value": "250,76" },
      "GROSS_SALARY": { "data_type": "TEXT", "value": "151,67" },
      "EMPLOYER_ADDRESS": {
        "data_type": "ADDRESS",
        "address": "avenue bernard claude Parc Club du Millenaire",
        "zipCode": "34000",
        "city": "MONTPELLIER"
      },
      "EMPLOYEE_NAME": {
        "data_type": "NAMES",
        "firstNames": "JOHN",
        "lastName": "CENA"
      },
      "EMPLOYEE_ADDRESS": {
        "data_type": "ADDRESS",
        "address": "Les impasses de la Mer Appt 34 70 rue de Pivert",
        "zipCode": "34000",
        "city": "MONTPELLIER"
      },
      "PAYSLIP_DATE": { "data_type": "DATE", "value": "2015-01-01" }
    }
  }
}
```

## Constraints & Limitations

### Technical Constraints

* Documents must be text-based (not handwritten)
* Maximum file size: 20 MB
* Documents must be in A4 format or a similar standard size
* Requires reasonable image quality for accurate extraction

## Benefits & Competitive Advantages

### Speed to Market

* New document type integration in days vs traditional 8 weeks
* No model training required for new formats
* Rapid adaptation to customer-specific requirements

### Flexibility & Scalability

* Customizable field extraction via configuration
* Prompt-based approach allows easy modifications
* Extensible architecture for future document types

### Quality & Accuracy

* Powered by state-of-the-art LLM/VLM technology
* High extraction quality across multiple languages
* Robust handling of varied document structures

# Integration with Namirial OnBoarding

## Overview

DocProcessor integrates with Namirial OnBoarding (NOB) to enable automated document processing within customer onboarding workflows.

**Current Status**: Demo integration - simplified workflow for evaluation purposes.

## Current Integration (Demo)

### Available Features

The current integration offers a single, pre-configured workflow designed for demonstration and evaluation purposes:

* Single document upload per request
* Pre-configured document processing (no customization available)
* No input parameters required
* Processing through DocProcessor backend
* Results available in NOB backoffice

### Limitations

* Configuration options (`parameters` and `settings`) are not yet available
* Document cannot be passed directly via API (upload link only)
* Single workflow configuration
* Limited to demonstration scenarios

## Integration Methods

### 1. Request Creation from Backoffice

#### Setup

The integration uses a Request Type based on a specific model.

**Note**: The name references the legacy Text Engine system and will be updated to reflect the DocProcessor integration.

#### Process

**Step 1: Create Request**

1. Access the Namirial OnBoarding back office
2. Select the Request Type that covers
3. Click "Create"

No parameters need to be configured - the system uses a pre-defined configuration.

**Step 2: Upload Document**

1. The system generates a unique link
2. Share the link with the end user
3. User accesses the link and uploads a single document via the web interface

**Step 3: Processing**

* Document is sent to DocProcessor
* Automatic processing and data extraction
* Results available in the NOB back office

### 2. Request Creation via API

#### Endpoint

```
POST https://test-eu-ie1-api.namirialonboarding.com/api/v2/requests
```

#### Headers

```
Authorization: Bearer {YOUR_ACCESS_TOKEN}
Accept: application/json
Content-Type: application/json
```

#### Request Body

```json
{
  "requestTypeId": "8870fa7a-2e51-4af4-9724-2ac4230163db",
  "parameters": {},
  "settings": {}
}
```

**Note**: The `parameters` and `settings` fields are currently empty and not configurable. They are reserved for future enhancements.

#### cURL Example

```shell
curl 'https://test-eu-ie1-api.namirialonboarding.com/api/v2/requests?language=en' \
  -H 'Authorization: Bearer {YOUR_ACCESS_TOKEN}' \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  --data-raw '{"requestTypeId":"8870fa7a-2e51-4af4-9724-2ac4230163db","parameters":{},"settings":{}}'
```

#### Response

```json
{
  "requestId": "unique-request-id",
  "link": "https://...",
  "status": "created"
}
```

The response contains:

* `requestId`: Unique identifier for tracking
* `link`: URL to share with end user for document upload
* `status`: Current request status

#### Important Notes

* The `requestTypeId` is specific to each integration and environment
* Currently, the document cannot be passed in the API call body
* Users must use the generated link to upload documents

## Planned Enhancements

### 1. Internal Operator Review Step

Enable manual review and validation after automatic processing

**Workflow**:

1. Document is processed automatically by DocProcessor
2. Request enters "Pending Review" status
3. Internal NOB operator reviews:
   * Original document
   * Extracted data
   * Processing results
4. Operator can:
   * Approve the request
   * Reject the request
   * Correct extracted data if needed
5. Request proceeds to next workflow step

**Features**:

* Automatic or manual assignment of requests to operators
* Workload monitoring dashboard
* Review history and audit trail

### 2. Document Upload via API

Ability to pass the document directly in the request creation call, eliminating the upload link step.

**Benefits**:

* Fully automated workflow (no user interaction required)
* Direct integration with external systems
* Faster processing time

### 3. Configurable Parameters

Enable configuration of document processing parameters per request.

**Expected Parameters**:

* `documentType`: Specify expected document type for optimized processing
* `language`: Override language detection
* `extractionFields`: Customize which fields to extract
* `validationRules`: Apply custom validation logic

**Expected Settings**:

* `confidenceThreshold`: Minimum confidence score for extracted data
* `manualReviewRequired`: Force manual review step
* `webhookUrl`: URL for asynchronous notifications
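
To make the planned configuration more concrete, the sketch below shows how a request body might look once `parameters` and `settings` become configurable. It is purely illustrative: the key names come from the lists above, but the helper `build_request_body`, every value format (language code, field identifiers, threshold scale), and the example webhook URL are assumptions, and none of these options are accepted by the current demo integration.

```python
import json


def build_request_body(request_type_id, parameters=None, settings=None):
    """Build a request body; empty dicts match what the demo integration uses today."""
    return {
        "requestTypeId": request_type_id,
        "parameters": dict(parameters or {}),
        "settings": dict(settings or {}),
    }


# Hypothetical values: only the key names appear in this document.
body = build_request_body(
    "8870fa7a-2e51-4af4-9724-2ac4230163db",
    parameters={
        "documentType": "PAYSLIP",
        "language": "fr",
        "extractionFields": ["EMPLOYEE_NAME", "NET_SALARY"],
    },
    settings={
        "confidenceThreshold": 0.8,
        "manualReviewRequired": True,
        "webhookUrl": "https://example.com/hooks/docprocessor",
    },
)

print(json.dumps(body, indent=2))
```

Calling `build_request_body` with only a `requestTypeId` yields the empty `parameters` and `settings` objects that the current API example sends.
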