Table Extraction from Financial Documents
Convert complex financial tables into structured CSV/JSON using Table Transformer, TrOCR, and pandas
Problem Statement
Financial documents often contain complex tables with merged cells, nested structures, and multiple formats. Manual extraction is time-consuming and error-prone.
Task Goals:
- Detect tables accurately in multi-page documents
- Extract text from table cells reliably using OCR
- Handle merged cells, nested tables, and inconsistent layouts
- Convert tables into structured CSV or JSON for downstream analysis
Solution Overview
NEO built a specialized table extraction pipeline combining:
- Table Transformer: Detects table structures and boundaries
- Microsoft TrOCR: Recognizes text in each table cell
- Post-processing with pandas: Handles merged cells, nested tables, and produces clean CSV/JSON
The pipeline handles multi-page documents and complex financial tables, and logs irregular structures it cannot resolve so they can be reviewed manually.
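As a rough illustration of how the detection and OCR components might be called through the Hugging Face `transformers` library, here is a minimal sketch. The checkpoint names (`microsoft/table-transformer-detection`, `microsoft/trocr-base-printed`), the 0.9 score threshold, and the helper names `pad_and_clamp`, `detect_tables`, and `read_cell` are assumptions made for illustration; the project's own fine-tuned models and code may differ.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def pad_and_clamp(box: Box, width: int, height: int, pad: int = 4) -> Tuple[int, int, int, int]:
    """Grow a box by `pad` pixels on each side and clamp it to the image bounds."""
    x0, y0, x1, y1 = box
    return (
        max(0, int(x0) - pad),
        max(0, int(y0) - pad),
        min(width, int(x1) + pad),
        min(height, int(y1) + pad),
    )


def detect_tables(image, threshold: float = 0.9) -> List[Box]:
    """Return table bounding boxes for a PIL image (assumed public checkpoint)."""
    # Heavy dependencies are imported lazily so the helper above stays lightweight.
    import torch
    from transformers import AutoImageProcessor, TableTransformerForObjectDetection

    name = "microsoft/table-transformer-detection"  # assumption, not the project's model
    processor = AutoImageProcessor.from_pretrained(name)
    model = TableTransformerForObjectDetection.from_pretrained(name)

    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # PIL reports (width, height); the post-processor expects (height, width).
    target_sizes = torch.tensor([image.size[::-1]])
    result = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes
    )[0]
    return [tuple(box.tolist()) for box in result["boxes"]]


def read_cell(image, box: Box) -> str:
    """Run TrOCR on one cell crop of a PIL image and return the recognized text."""
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    name = "microsoft/trocr-base-printed"  # assumption, not the project's model
    processor = TrOCRProcessor.from_pretrained(name)
    model = VisionEncoderDecoderModel.from_pretrained(name)

    crop = image.crop(pad_and_clamp(box, *image.size)).convert("RGB")
    pixel_values = processor(images=crop, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
```

For brevity the sketch reloads the models on every call; a real pipeline would presumably load each model once and reuse it across pages and cells.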
Workflow / Pipeline
| Step | Description |
|---|---|
| 1. Data Ingestion | Load financial documents (PDFs/images) from multiple sources |
| 2. Table Detection | Apply Table Transformer to locate table boundaries and cell positions |
| 3. OCR Text Extraction | Use TrOCR on each detected cell to extract text accurately |
| 4. Post-processing | Resolve merged cells, nested tables, and standardize headers using pandas |
| 5. Output Generation | Export clean CSV or JSON ready for downstream analytics |
| 6. Error Handling | Log any inconsistencies, empty cells, or unrecognized structures for manual review |
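The six steps above can be sketched as a small orchestration loop. This is only an illustration under stated assumptions: `run_pipeline`, `PipelineResult`, `to_json`, and the three stage callables (`detect`, `ocr`, `postprocess`) are hypothetical names, and the stages are passed in as plain functions so each one can be swapped or tested on its own.

```python
import json
import logging
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, List

logger = logging.getLogger("table_pipeline")

Grid = List[List[str]]  # one table as rows of cell text


@dataclass
class PipelineResult:
    tables: List[Grid] = field(default_factory=list)
    issues: List[str] = field(default_factory=list)  # entries flagged for manual review


def run_pipeline(
    pages: Iterable[Any],
    detect: Callable[[Any], List[Any]],
    ocr: Callable[[Any, Any], Grid],
    postprocess: Callable[[Grid], Grid],
) -> PipelineResult:
    """Run detection, OCR, and post-processing per table; log failures instead of aborting."""
    result = PipelineResult()
    for page_no, page in enumerate(pages, start=1):                 # step 1: ingested pages
        for table_no, region in enumerate(detect(page), start=1):   # step 2: table detection
            try:
                raw = ocr(page, region)                             # step 3: cell OCR
                result.tables.append(postprocess(raw))              # step 4: post-processing
            except Exception as exc:                                # step 6: error handling
                message = f"page {page_no}, table {table_no}: {exc}"
                logger.warning(message)
                result.issues.append(message)
    return result


def to_json(result: PipelineResult) -> str:
    """Step 5: serialize the extracted tables and the review log as JSON."""
    return json.dumps({"tables": result.tables, "issues": result.issues}, indent=2)
```

Catching the exception per table rather than per document means one malformed table does not discard the rest of the page.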
Repository & Artifacts
GitHub Repository: Table Extraction from Financial Documents
Generated Artifacts:
- Annotated financial table datasets
- Trained Table Transformer models
- TrOCR cell-level text extraction outputs
- Pandas scripts for post-processing
- CSV and JSON structured outputs
- Logs and error reports for edge cases
Technical Details
- Preprocessing: Page segmentation, rotation correction, image enhancement
- Table Detection: Table Transformer for robust detection of irregular tables
- Cell OCR: TrOCR applied per cell, with multi-language support if needed
- Post-processing: pandas handles nested tables, merged cells, header normalization
- Error Handling: Flag empty cells, inconsistent row lengths, and merge errors
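To make the post-processing and error-handling items more concrete, here is a minimal pandas sketch. It assumes the OCR stage yields a grid of cell strings whose first row is the header, and that vertically merged cells arrive as blanks below the first cell of the merge. The function name `grid_to_frame` and the `merged_columns` parameter are hypothetical, and nested tables are not covered.

```python
from typing import List, Optional, Tuple

import pandas as pd

Grid = List[List[Optional[str]]]


def _clean(cell: Optional[str]) -> Optional[str]:
    """Strip whitespace and map blank or missing cells to None."""
    if not isinstance(cell, str):
        return None
    return cell.strip() or None


def grid_to_frame(grid: Grid, merged_columns: List[int]) -> Tuple[pd.DataFrame, List[str]]:
    """Build a DataFrame from an OCR'd grid and collect issues for manual review."""
    issues: List[str] = []
    header, *rows = grid
    width = len(header)

    fixed = []
    for row_no, row in enumerate(rows, start=1):
        if len(row) != width:
            # Flag inconsistent row lengths, then pad or truncate to the header width.
            issues.append(f"row {row_no}: expected {width} cells, got {len(row)}")
            row = (list(row) + [None] * width)[:width]
        fixed.append([_clean(cell) for cell in row])

    df = pd.DataFrame(fixed, columns=header)

    # Assumption: blanks in these columns come from vertically merged cells,
    # so the value from the row above is carried down.
    cols = [header[i] for i in merged_columns]
    if cols:
        df[cols] = df[cols].ffill()

    # Whatever is still missing is flagged as an empty cell.
    for col in df.columns:
        for idx in df.index[df[col].isna()]:
            issues.append(f"row {idx + 1}, column {col!r}: empty cell")
    return df, issues
```

Forward-filling only the columns listed in `merged_columns` keeps genuinely empty numeric cells visible in the issue list instead of silently copying a value into them.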
Results
- Table Detection Accuracy: 95%+ for multi-page documents
- Cell OCR Accuracy: 97% for numeric and textual data
- CSV/JSON Conversion: Successfully handled nested tables and merged cells
- Multi-format support: PDF, scanned images, XLSX imports
Best Practices & Lessons Learned
- Pre-annotate tables in sample documents for Table Transformer fine-tuning
- Normalize header names and cell data as soon as they are extracted, so every later step works with consistent values
- Log edge cases to drive continuous improvement of the pipeline
- Separate detection, OCR, and post-processing for modularity
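The header and cell normalization recommended above might look like the following sketch. The function names `normalize_header` and `parse_amount` are hypothetical, and the parser assumes US-style number formatting (comma as thousands separator, dot as decimal point, parentheses for negative amounts); other locales would need different rules.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def normalize_header(raw: str) -> str:
    """Lowercase a header and collapse every non-alphanumeric run to one underscore."""
    return re.sub(r"[^0-9a-z]+", "_", raw.strip().lower()).strip("_")


def parse_amount(raw: str) -> Optional[Decimal]:
    """Parse a financial amount such as '1,234.50' or '(1,234.50)'; return None if unparseable."""
    text = raw.strip()
    if text.lower() in {"", "-", "n/a"}:
        return None  # common placeholders for "no value"
    # Accounting convention: parentheses mark a negative amount.
    negative = text.startswith("(") and text.endswith(")")
    cleaned = re.sub(r"[,$€£\s()]", "", text)
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    return -value if negative else value
```

`Decimal` is used instead of `float` so that monetary values keep their exact decimal representation through export.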
Next Steps
- Add support for more financial document types (e.g., balance sheets, income statements)
- Implement automated reconciliation and anomaly detection for extracted tables
- Extend multi-language support for international financial documents