Case Study
FinTech

Transforming the Tax-Filing Industry with Large-Scale Document Processing Systems

Highlights

We partnered with a fast-growing digital tax preparation and financial information platform that lets consumers prepare tax returns approved by tax professionals and securely access their information. With an event-driven system and Optical Character Recognition (OCR), a consumer's tax review can be processed entirely by a computer system in real time. The highly scalable solution streamlines the process for consumers and, through Data Lake integration, serves as a source of valuable client insights for professionals in the financial industry.


Success Metrics


  • Event-driven architecture ingests over 100,000 documents per hour and orchestrates serverless OCR to extract characters from documents
  • A process that typically takes accountants up to 2 weeks to deliver for their clients can be delivered the same day, including time for human verification
  • End-to-end ingestion, processing, and storage of document results in minutes
  • Highly reliable (retries for failures and process redundancies), highly scalable (event-driven architecture and serverless compute), and highly available (API gateway and serverless ingestion)

Industry
FinTech
Headquarters
United States

Challenge

Tax documents come in a variety of formats and levels of quality. When managing a high volume of variable tax documents, non-standardized formats create a risk of process failure. The challenge was to develop a highly scalable, reliable, and durable system with redundancies for failed processing attempts, including retries and fallbacks. The system would need to process millions of documents a month during tax season, using highly concurrent asynchronous processes for uploads, pre-processing, OCR, categorization, and storage. It would also need to support multi-tenant data lake capabilities for documents and insights owned by particular institutions.

Solution

We began with a review of the available DocAI parsers for supported tax documents: we tested them for defects and limitations, assessed and scored them for quality, and built a solution that covered a wide variety of document types and formats. For example, when dealing with multiple-type document parsers, we customized pipelines to recognize a mixture of document formats and aggregated tax files found within a single document.

We designed and implemented a streaming data architecture in which a document is backed up at each stage of the processing lifecycle, so that processing can be retried from the stage that failed. As a redundancy, when a document type was not recognized, a versatile OCR parsing solution processed that document instead. Within the parsing pipelines, we applied data scrubbing, data anonymization, access controls, and multi-tenant, globally unique blob storage. Serverless microservices were used for all compute workloads.
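The per-stage backup and the fallback parser described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the system's actual code: `CheckpointStore`, `run_pipeline`, and `parse_with_fallback` are hypothetical names, and an in-memory dictionary stands in for durable blob storage.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a stage takes a document payload and returns an updated one.
Stage = Callable[[dict], dict]


class CheckpointStore:
    """In-memory stand-in for durable per-stage backups (an assumption)."""

    def __init__(self) -> None:
        self._data: Dict[Tuple[str, int], dict] = {}

    def save(self, doc_id: str, stage_index: int, payload: dict) -> None:
        self._data[(doc_id, stage_index)] = dict(payload)

    def latest(self, doc_id: str) -> Tuple[int, dict]:
        """Return (number of completed stages, payload after the last one)."""
        done = [index for (saved_id, index) in self._data if saved_id == doc_id]
        if not done:
            return 0, {}
        last = max(done)
        return last + 1, dict(self._data[(doc_id, last)])


def run_pipeline(doc_id: str, initial: dict, stages: List[Tuple[str, Stage]],
                 store: CheckpointStore, max_attempts: int = 3) -> dict:
    """Run stages in order, resuming after the last checkpointed stage."""
    start, payload = store.latest(doc_id)
    if start == 0:
        payload = dict(initial)
    for index in range(start, len(stages)):
        _name, stage = stages[index]
        for attempt in range(1, max_attempts + 1):
            try:
                payload = stage(dict(payload))
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # earlier checkpoints remain, so a later run can resume here
        store.save(doc_id, index, payload)
    return payload


def parse_with_fallback(payload: dict, typed_parsers: Dict[str, Stage],
                        generic_parser: Stage) -> dict:
    """Use a type-specific parser when one exists, else the generic OCR parser."""
    parser = typed_parsers.get(payload.get("doc_type"), generic_parser)
    return parser(payload)
```

Calling `run_pipeline` again for the same `doc_id` after a failure skips the stages that already have a checkpoint, which is the retry-at-the-failed-stage behavior in miniature.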

From a process standpoint, we ran two parallel work streams, one per project team, with an emphasis on frequent touch points. The Data Science team led the OCR process, extracting valuable data from various document types with existing and newly built parsers, and handling image and data quality, cleaning, and organization of data. In parallel, the engineering team built an event system architecture to upload, process, and store the extracted data. The new system was a series of asynchronous workloads running through custom data pipelines, with a reliable process and advanced monitoring to manage failures.
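As a rough illustration of a series of asynchronous workloads, the sketch below pushes documents through pre-processing, OCR, categorization, and storage concurrently while capping the number in flight. The stage functions are hypothetical placeholders that do no real work, and `asyncio` in a single process stands in for the serverless, event-driven services the case study describes.

```python
import asyncio
from typing import List

# Hypothetical placeholder stages; each only tags the document and yields control.
async def preprocess(doc: dict) -> dict:
    await asyncio.sleep(0)
    return {**doc, "preprocessed": True}

async def run_ocr(doc: dict) -> dict:
    await asyncio.sleep(0)
    return {**doc, "text": f"text of {doc['doc_id']}"}

async def categorize(doc: dict) -> dict:
    await asyncio.sleep(0)
    return {**doc, "category": "uncategorized"}

async def store(doc: dict) -> dict:
    await asyncio.sleep(0)
    return {**doc, "stored": True}

STAGES = (preprocess, run_ocr, categorize, store)


async def process_document(doc: dict, limiter: asyncio.Semaphore) -> dict:
    """Run one document through every stage while holding a concurrency slot."""
    async with limiter:
        for stage in STAGES:
            doc = await stage(doc)
    return doc


async def process_batch(docs: List[dict], max_concurrency: int = 100) -> list:
    """Process documents concurrently; a failed document yields its exception."""
    limiter = asyncio.Semaphore(max_concurrency)
    tasks = (process_document(doc, limiter) for doc in docs)
    return await asyncio.gather(*tasks, return_exceptions=True)
```

Passing `return_exceptions=True` keeps one failing document from cancelling the rest of the batch, so failures can be inspected and retried separately.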

Results

Benefits of design decisions include:

Event-driven architecture

  • No fixed limit on adding nodes for new document types and workflows
  • Real-time stream processing that reduces turnaround from the industry-standard weeks to minutes
  • Highly concurrent, asynchronous compute with minimal bottlenecks
  • Record keeping of events at different nodes to enable reliability and redundancies

Serverless compute

  • Elastic scaling to meet the required document throughput and system availability

Improved Security with:

  • Short-lived signed URLs
  • Least-privilege IAM and bucket access
  • Shared VPC
  • Isolation of production and testing environments
  • API Gateway and credential-led access
  • Vertical integration of document processing, which removed reliance on third-party APIs and kept more traffic within VPC internals
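To make the idea of a short-lived signed URL concrete, here is a generic sketch: the server signs a path together with an expiry time, and any later request is rejected once the expiry passes or if the path was altered. This is a plain HMAC scheme built on the Python standard library for illustration only; it is not the cloud provider's signing protocol, and `SECRET_KEY`, `sign_url`, and `verify_url` are hypothetical names.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical demo key; a real key would come from a secret manager.
SECRET_KEY = b"demo-secret-not-for-production"


def _signature(path: str, expires: int) -> str:
    message = f"{path}:{expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def sign_url(path: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Return the path with an expiry timestamp and signature appended."""
    current = time.time() if now is None else now
    expires = int(current) + ttl_seconds
    query = urlencode({"expires": expires, "signature": _signature(path, expires)})
    return f"{path}?{query}"


def verify_url(url: str, now: Optional[float] = None) -> bool:
    """Accept only unexpired URLs whose signature matches the path and expiry."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    try:
        expires = int(query["expires"][0])
        signature = query["signature"][0]
    except (KeyError, ValueError):
        return False
    current = time.time() if now is None else now
    if current > expires:
        return False
    return hmac.compare_digest(signature, _signature(parsed.path, expires))
```

A short `ttl_seconds` limits how long a leaked link stays usable, and `hmac.compare_digest` avoids leaking signature information through timing differences.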

Best Practices adopted:

  • Event reprocessing for disaster recovery
  • Terraform IaC (infrastructure as code)
  • Custom SQL-based logging of events
  • Database metrics
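A SQL-based event log that supports reprocessing might look like the sketch below. The table layout, column names, and helper functions are assumptions made for illustration, and SQLite stands in for whichever database the platform actually used.

```python
import sqlite3
from typing import List, Tuple


def open_event_log(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) events table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS events (
            id         INTEGER PRIMARY KEY AUTOINCREMENT,
            doc_id     TEXT NOT NULL,
            stage      TEXT NOT NULL,
            status     TEXT NOT NULL CHECK (status IN ('ok', 'failed')),
            created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    return conn


def log_event(conn: sqlite3.Connection, doc_id: str, stage: str, status: str) -> None:
    """Append one processing event; the with-block commits the insert."""
    with conn:
        conn.execute(
            "INSERT INTO events (doc_id, stage, status) VALUES (?, ?, ?)",
            (doc_id, stage, status),
        )


def documents_to_reprocess(conn: sqlite3.Connection) -> List[Tuple[str, str]]:
    """Return (doc_id, stage) for documents whose latest event is a failure."""
    return conn.execute(
        """
        SELECT e.doc_id, e.stage
        FROM events AS e
        JOIN (SELECT doc_id, MAX(id) AS last_id FROM events GROUP BY doc_id) AS latest
          ON e.id = latest.last_id
        WHERE e.status = 'failed'
        ORDER BY e.doc_id
        """
    ).fetchall()
```

Because the log is append-only, a document that failed and later succeeded drops out of the reprocessing query without any rows being updated or deleted.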

Serverless OCR and Document Type Extraction

  • Supported document types increased by 300% through Google parser types
  • Fallback parsers raised the share of documents processed by AI to 100%

