Table Extraction from Financial Documents

Convert complex financial tables into structured CSV/JSON using Table Transformer, TrOCR, and pandas

Problem Statement

Financial documents often contain complex tables with merged cells, nested structures, and multiple formats. Manual extraction is time-consuming and error-prone.

Task Goals:

Detect tables accurately in multi-page documents
Extract text from table cells reliably using OCR
Handle merged cells, nested tables, and inconsistent layouts
Convert tables into structured CSV or JSON for downstream analysis

Solution Overview

NEO built a specialized table extraction pipeline combining:

Table Transformer: Detects table structures and boundaries
Microsoft TrOCR: Recognizes text in each table cell
Post-processing with pandas: Handles merged cells, nested tables, and produces clean CSV/JSON

The pipeline supports multi-page documents and complex financial tables, and it logs irregular structures for manual review.
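
The source does not say which library or checkpoints were used. As a minimal sketch, the detection and OCR stages could look like the following, assuming the Hugging Face transformers implementations and the public microsoft/table-transformer-detection and microsoft/trocr-base-printed checkpoints. The helper names pad_box, detect_tables, and ocr_cell are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def pad_box(box: Box, width: int, height: int, margin: int = 4) -> Tuple[int, int, int, int]:
    """Grow a box by `margin` pixels and clamp it to the image bounds."""
    x0, y0, x1, y1 = box
    return (
        max(0, int(x0) - margin),
        max(0, int(y0) - margin),
        min(width, int(x1) + margin),
        min(height, int(y1) + margin),
    )


def detect_tables(image, threshold: float = 0.9) -> List[Box]:
    """Return table bounding boxes for one page image (a PIL.Image in RGB)."""
    # Heavy imports are deferred so pad_box can be used without them installed.
    import torch
    from transformers import AutoImageProcessor, TableTransformerForObjectDetection

    name = "microsoft/table-transformer-detection"  # assumed checkpoint
    processor = AutoImageProcessor.from_pretrained(name)
    model = TableTransformerForObjectDetection.from_pretrained(name)

    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # PIL reports (width, height); the post-processor expects (height, width).
    target_sizes = torch.tensor([image.size[::-1]])
    result = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes
    )[0]
    return [tuple(box.tolist()) for box in result["boxes"]]


def ocr_cell(cell_image) -> str:
    """Recognize the text in a single cropped cell image (a PIL.Image in RGB)."""
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    name = "microsoft/trocr-base-printed"  # assumed checkpoint
    processor = TrOCRProcessor.from_pretrained(name)
    model = VisionEncoderDecoderModel.from_pretrained(name)

    pixel_values = processor(images=cell_image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
```

A cell crop would then be image.crop(pad_box(box, *image.size)) passed to ocr_cell. In a real run the models would be loaded once and reused rather than reloaded on every call.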

Workflow / Pipeline

Step	Description
1. Data Ingestion	Load financial documents (PDFs/images) from multiple sources
2. Table Detection	Apply Table Transformer to locate table boundaries and cell positions
3. OCR Text Extraction	Use TrOCR on each detected cell to extract text accurately
4. Post-processing	Resolve merged cells, nested tables, and standardize headers using pandas
5. Output Generation	Export clean CSV or JSON ready for downstream analytics
6. Error Handling	Log any inconsistencies, empty cells, or unrecognized structures for manual review
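
The steps above can be sketched as a small driver that treats detection and OCR as pluggable callables and records issues instead of aborting. This is an illustration only: run_pipeline, PageResult, and the issue messages are hypothetical names, not taken from the repository.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

logger = logging.getLogger("table_extraction")

Grid = List[List[str]]  # one table as rows of cell strings


@dataclass
class PageResult:
    page: int
    tables: List[Grid] = field(default_factory=list)
    issues: List[str] = field(default_factory=list)


def run_pipeline(
    pages: Iterable[object],
    detect: Callable[[object], list],
    read_table: Callable[[object, object], Grid],
) -> List[PageResult]:
    """Run detection and OCR page by page, collecting issues for manual review."""
    results = []
    for number, page in enumerate(pages, start=1):
        result = PageResult(page=number)
        try:
            regions = detect(page)
        except Exception as exc:  # one bad page should not stop the whole run
            result.issues.append(f"detection failed: {exc}")
            regions = []
        for index, region in enumerate(regions):
            try:
                grid = read_table(page, region)
            except Exception as exc:
                result.issues.append(f"table {index}: OCR failed: {exc}")
                continue
            if any(cell.strip() == "" for row in grid for cell in row):
                result.issues.append(f"table {index}: empty cells")
            if len({len(row) for row in grid}) > 1:
                result.issues.append(f"table {index}: inconsistent row lengths")
            result.tables.append(grid)
        for issue in result.issues:
            logger.warning("page %d: %s", number, issue)
        results.append(result)
    return results
```

Passing the stages in as arguments keeps each one replaceable and lets the driver be exercised with stub functions before any model is loaded.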

Repository & Artifacts

GitHub Repository: Table Extraction from Financial Documents

Generated Artifacts:

Annotated financial table datasets
Trained Table Transformer models
TrOCR cell-level text extraction outputs
Pandas scripts for post-processing
CSV and JSON structured outputs
Logs and error reports for edge cases

Technical Details

Preprocessing: Page segmentation, rotation correction, image enhancement
Table Detection: Table Transformer for robust detection of irregular tables
Cell OCR: TrOCR applied per cell, with multi-language support if needed
Post-processing: pandas handles nested tables, merged cells, header normalization
Error Handling: Flag empty cells, inconsistent row lengths, and merge errors
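
The source does not show how merged cells are resolved. One possible sketch with pandas, assuming a vertically merged label cell arrives from OCR as an empty string in the rows it spans, is to forward-fill the label column, flag rows whose length differs from the widest row, and then export. The function names grid_to_frame and export are hypothetical.

```python
import pandas as pd


def grid_to_frame(grid, label_columns=1):
    """Build a DataFrame from a non-empty list of OCR rows; row 0 is the header.

    Returns the frame and the indices (into `grid`) of rows with a length
    different from the widest row, so they can be sent to manual review.
    """
    width = max(len(row) for row in grid)
    flagged = [i for i, row in enumerate(grid) if len(row) != width]
    # Pad short rows with empty strings so the frame is rectangular.
    padded = [list(row) + [""] * (width - len(row)) for row in grid]

    # Basic header normalization: trim, lowercase, spaces to underscores.
    header = [str(h).strip().lower().replace(" ", "_") for h in padded[0]]
    frame = pd.DataFrame(padded[1:], columns=header)

    # Assumed merged-cell convention: blank label cells inherit the label above.
    for name in header[:label_columns]:
        column = frame[name]
        frame[name] = column.mask(column == "").ffill()
    return frame, flagged


def export(frame):
    """Serialize to CSV text and to a JSON list of row records."""
    return frame.to_csv(index=False), frame.to_json(orient="records")
```

For example, a grid whose third row has a blank first cell gets the label of the row above it, while a row that is one cell short is padded and reported in the flagged list. Header names are assumed to be unique; orient="records" requires that.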

Results

Table Detection Accuracy: 95%+ for multi-page documents
Cell OCR Accuracy: 97% for numeric and textual data
CSV/JSON Conversion: Successfully handled nested tables and merged cells
Multi-format support: PDF, scanned images, XLSX imports
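
The source does not define how these accuracy figures were measured. As an illustration of one common approach, and not necessarily the one used here, the sketch below counts a ground-truth table as detected when some prediction overlaps it with an intersection over union of at least 0.5, and counts a cell as correct when its text matches exactly after trimming whitespace. All function names and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detection_recall(predicted, truth, min_iou=0.5):
    """Fraction of ground-truth tables matched by at least one prediction."""
    if not truth:
        return 1.0
    hits = sum(any(iou(p, t) >= min_iou for p in predicted) for t in truth)
    return hits / len(truth)


def cell_accuracy(predicted, truth):
    """Share of ground-truth cells whose predicted text matches after trimming.

    Cells missing from the prediction count as wrong.
    """
    total = correct = 0
    for r, true_row in enumerate(truth):
        pred_row = predicted[r] if r < len(predicted) else []
        for c, expected in enumerate(true_row):
            got = pred_row[c] if c < len(pred_row) else None
            total += 1
            correct += got is not None and got.strip() == expected.strip()
    return correct / total if total else 0.0
```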

Best Practices & Lessons Learned

Pre-annotate tables in sample documents for Table Transformer fine-tuning
Normalize header names and cell data immediately after extraction for consistency
Log edge cases for continuous improvement of the pipeline
Separate detection, OCR, and post-processing for modularity
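
As a sketch of what early normalization might involve, the helpers below clean header names and parse common financial number formats such as parentheses for negatives. They assume US-style separators (comma for thousands, period for decimals) and a dollar sign as the only currency symbol; normalize_header and parse_amount are hypothetical names.

```python
import re
from decimal import Decimal
from typing import Optional


def normalize_header(text: str) -> str:
    """Lowercase, trim, and collapse runs of non-alphanumerics to underscores."""
    return re.sub(r"[^0-9a-z]+", "_", text.strip().lower()).strip("_")


def parse_amount(text: str) -> Optional[Decimal]:
    """Parse a cell such as '(1,234.50)' or '$ 2,000'; return None if not numeric."""
    cleaned = text.strip()
    # Accounting convention: a value wrapped in parentheses is negative.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    cleaned = cleaned.strip("()").replace(",", "").replace("$", "").strip()
    if cleaned.startswith("-"):
        negative, cleaned = True, cleaned[1:].strip()
    # Accept plain decimals only, so text like 'N/A' or 'nan' is rejected.
    if not re.fullmatch(r"\d+(\.\d+)?|\.\d+", cleaned):
        return None
    value = Decimal(cleaned)
    return -value if negative else value
```

Decimal is used instead of float so that amounts keep their exact decimal representation. Cells that return None can be left as text and logged for review.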

Next Steps

Add support for more financial document types (e.g., balance sheets, income statements)
Implement automated reconciliation and anomaly detection for extracted tables
Extend multi-language support for international financial documents

References

GitHub Repository
Table Transformer paper
Microsoft TrOCR (Hugging Face)
pandas documentation