Case Study

Streamlining tax-filing with intelligent document OCR and Machine Learning


We partnered with a fast-growing digital tax preparation and financial information platform that lets consumers prepare tax returns approved by tax professionals and securely access their information. By implementing machine learning capabilities such as intelligent document Optical Character Recognition (OCR), we established a streamlined process for consumers and a source of valuable client insights for professionals in the financial industry.

Success Metrics

  • Event-driven architecture allows for ingestion of over 100,000 documents per hour and OCR processing at 10,000 documents per hour
  • Human-in-the-loop data verification and automation cut the time to ingest, verify, and parse data from 3 hours to just 15 minutes
  • End-to-end deployment of new OCR tax parsers in under 10 minutes
  • Real-time data quality assessment in place

United States


It is no secret that tax documents come in a variety of formats and levels of quality. When managing a high volume of variable tax documents, there is a risk of process failure due to non-standardized formats. The challenge was to develop a production pipeline that would process a high volume of files, extract data accurately with ML models, and monitor the entire process end-to-end. Careful consideration was needed to manage the asynchronous processes required to upload tax documents, perform data cleansing, run OCR, and categorize and store the results.

The challenge was not only a data challenge but a human workflow challenge. Working with the client team, and to avoid silos, the structure of the project team, spanning data science and engineering, would need to be revised around frequent milestones and touchpoints. Silos slow the ability to move models to production, deliver a solid MLOps process, and ultimately meet the project's business goals.


As a standard, we start all Machine Learning projects with a review of the data. We first reviewed the existing MLOps process, including an exploratory data analysis, a review of the engineering and data science teams, and the current cloud infrastructure required to host the ML pipelines.

An exploratory data analysis was conducted to understand data sources, model requirements, and associated features. The AI parsers for tax documents were tested for defects and limitations. In addition, we scored document quality and reviewed the document parsers available on the market. Since no out-of-the-box solution existed, we decided to use multiple document parsers, building a custom solution to handle the mixture of document formats and aggregate multiple tax files into a single document.
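To make the multi-parser approach concrete, here is a minimal sketch of how per-form-type parsers might be registered and how pages of an aggregated upload could be routed to them. All names, form types, and the dispatch logic are illustrative assumptions, not the client's actual implementation.

```python
# Hypothetical sketch: a registry of parsers keyed by detected document
# type, with unknown types flagged for human review rather than dropped.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class ParsedDocument:
    doc_type: str
    fields: dict

# Registry mapping a detected document type to its parser function.
PARSERS: Dict[str, Callable[[str], ParsedDocument]] = {}

def register_parser(doc_type: str):
    def decorator(fn):
        PARSERS[doc_type] = fn
        return fn
    return decorator

@register_parser("W-2")
def parse_w2(text: str) -> ParsedDocument:
    # A real parser would extract wages, withholding, etc. from OCR text.
    return ParsedDocument("W-2", {"raw": text})

@register_parser("1099")
def parse_1099(text: str) -> ParsedDocument:
    return ParsedDocument("1099", {"raw": text})

def parse_bundle(pages: List[Tuple[str, str]]) -> List[ParsedDocument]:
    """Route each (doc_type, ocr_text) page of an aggregated upload to its
    registered parser; unrecognized types are marked UNKNOWN for review."""
    results = []
    for doc_type, text in pages:
        parser = PARSERS.get(doc_type)
        if parser is None:
            results.append(ParsedDocument("UNKNOWN", {"raw": text}))
        else:
            results.append(parser(text))
    return results
```

The registry pattern makes it cheap to add a new form type, which is one plausible way to hit the "new parser deployed in under 10 minutes" metric cited above.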

We designed and implemented a streaming data architecture in which new documents would be streamed into BigQuery, with data lake insights shared with financial partners under security and access-control safeguards. During parsing, we applied data scrubbing, anonymized the data, and enforced access controls.
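One common way to implement the anonymization step is to replace direct identifiers with keyed hashes before rows ever reach partner-facing tables. The sketch below assumes a simple flat row schema and hypothetical field names; the actual scrubbing rules and schema were the client's own.

```python
# Illustrative anonymization step: PII fields are replaced with an
# HMAC-SHA256 digest so partner-facing rows carry no raw identifiers,
# while equal inputs still hash to equal values for downstream joins.
import hashlib
import hmac

PII_FIELDS = {"ssn", "name", "address"}  # assumed identifier columns

def anonymize_row(row: dict, secret_key: bytes) -> dict:
    """Return a copy of the row with PII fields replaced by keyed hashes."""
    scrubbed = {}
    for key, value in row.items():
        if key in PII_FIELDS:
            digest = hmac.new(secret_key, str(value).encode(), hashlib.sha256)
            scrubbed[key] = digest.hexdigest()
        else:
            scrubbed[key] = value
    return scrubbed
```

Using a keyed HMAC rather than a plain hash means the identifiers cannot be reversed by brute-forcing known SSN formats without the key.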

From a process standpoint, we initiated two work streams, one for each project team, with an emphasis on frequent touchpoints. The data science team led the OCR process to extract valuable data from various document types using existing and newly built parsers, handling image and data quality, cleaning, and organization of the data. In parallel, the engineering team built the MLOps architecture to upload, process, and store the extracted data. The new system was a series of asynchronous workloads running through custom data pipelines, providing a reliable process, the ability to learn from models, and advanced monitoring to manage failures.
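The asynchronous workload chain (upload, cleanse/OCR, categorize, store) can be sketched with Python's asyncio. Stage internals are mocked here, and per-document error capture stands in for the real monitoring; this is a shape sketch, not the client's pipeline.

```python
# Minimal asyncio sketch of chained asynchronous stages with per-document
# failure capture, so one bad document does not crash the whole batch.
import asyncio

async def ocr(doc: dict) -> dict:
    await asyncio.sleep(0)  # placeholder for the real OCR call
    return {**doc, "text": f"ocr({doc['name']})"}

async def categorize(doc: dict) -> dict:
    await asyncio.sleep(0)  # placeholder for the categorization model
    return {**doc, "category": "tax-form"}

async def store(doc: dict) -> dict:
    await asyncio.sleep(0)  # placeholder for the warehouse write
    return {**doc, "stored": True}

async def process(doc: dict) -> dict:
    """Run one document through all stages, recording failures
    instead of raising, so monitoring can surface them later."""
    try:
        for stage in (ocr, categorize, store):
            doc = await stage(doc)
        return doc
    except Exception as exc:
        return {**doc, "error": str(exc)}

async def run_batch(docs):
    # Documents are processed concurrently, mirroring the event-driven design.
    return await asyncio.gather(*(process(d) for d in docs))
```

In production the `asyncio.gather` fan-out would typically be replaced by a message queue or event bus, which is what lets an event-driven design scale to the ingestion rates cited in the success metrics.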


The result was a newly formed data warehouse and data lake, ready to receive and store processed documents, with streamlined updates and access control. Data loss prevention and network security safeguards were also implemented at the row level to ensure that only anonymized data was shared with partners and the ML application. Models could now be built atop a clean, warehoused data set.

The CTO and leadership wanted to see results in production, so our goal was to ship to production rather than keep models in staging and test environments. With the new working structure in place, leadership was confident that future advances to the models would not run the risk of siloed experiments: the teams now operated as a cross-functional unit, with each playing a role in managing the cloud infrastructure and deploying production ML workloads.

The new process has future-proofed the platform's ML capabilities with a unified data lake, helping ensure the application can take advantage of BigQuery ML for segmentation models, with efficiency metrics, reduced operational overhead, and silhouette scores for model evaluation. Customer-driven applications can now pull from the data lake to power platform services, which was the path to commercializing the consumer-facing product.
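As a point of reference for the silhouette-score evaluation mentioned above: BigQuery ML exposes k-means clustering natively, but the same evaluation can be sketched locally with scikit-learn on synthetic data. Everything below (features, cluster count, thresholds) is illustrative, not the client's model.

```python
# Hedged sketch: segment synthetic "consumer" features with k-means and
# score the segmentation with the silhouette coefficient (-1 to 1; higher
# means tighter, better-separated segments).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, well-separated data standing in for warehouse features.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [0, 10], [10, 0]],
    cluster_std=1.0,
    random_state=42,
)

labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
score = silhouette_score(X, labels)
```

A silhouette score near 1 indicates clean segments; scores near 0 suggest overlapping segments and a need to revisit the feature set or cluster count.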
