We partnered with a fast-growing digital tax preparation and financial information platform that lets consumers prepare tax returns reviewed by tax professionals and access their financial information securely. By applying machine learning techniques such as intelligent document Optical Character Recognition (OCR), we streamlined the filing process for consumers and turned the platform into a source of valuable client insights for professionals in the financial industry.
It is no secret that tax documents come in a wide variety of formats and quality levels. When managing a high volume of such documents, non-standardized formats create a real risk of process failure. The challenge was to develop a production pipeline that could process a high volume of files, extract data accurately with ML models, and monitor the entire process end to end. Careful consideration had to be given to the asynchronous processes required to upload tax documents, cleanse the data, run OCR, and categorize and store the results.
The challenge was not only a data challenge but a human workflow challenge. Working with the client team, we saw that the project structure, spanning data science and engineering, would need to be revised around frequent milestones and touchpoints to avoid silos. Siloed teams slow the path from model to production, undermine a solid MLOps process, and ultimately jeopardize the business goals of the project.
As a standard practice, we start every machine learning project with a review of the data. We began by reviewing the existing MLOps process, including an exploratory data analysis, a review of the engineering and data science teams, and the cloud infrastructure required to host the ML pipelines.
An exploratory data analysis was conducted to understand the data sources, model requirements, and associated features. The AI parsers for tax documents were tested for defects and limitations, document quality was scored, and the document parsers available on the market were reviewed. Since no out-of-the-box solution covered every case, we decided to combine multiple document parsers into a custom solution that handles a mixture of document formats and aggregates multiple tax files into a single document.
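The multi-parser approach can be sketched as a simple routing layer: each document is dispatched to the parser suited to its format, and a filer's documents are aggregated into one record. All names here (formats, parser functions) are illustrative, not the client's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    name: str
    fmt: str                      # e.g. "pdf_native", "pdf_scanned", "image"
    pages: list = field(default_factory=list)  # raw page payloads

def parse_native_pdf(doc: Document) -> dict:
    # A text-layer PDF can be parsed directly, without OCR.
    return {"parser": "native_pdf", "fields": {}}

def parse_with_ocr(doc: Document) -> dict:
    # Scanned PDFs and images go through the OCR path.
    return {"parser": "ocr", "fields": {}}

# Route each known format to the parser best suited to it.
PARSERS = {
    "pdf_native": parse_native_pdf,
    "pdf_scanned": parse_with_ocr,
    "image": parse_with_ocr,
}

def parse_document(doc: Document) -> dict:
    parser = PARSERS.get(doc.fmt)
    if parser is None:
        raise ValueError(f"unsupported format: {doc.fmt}")
    return parser(doc)

def aggregate(docs: list) -> dict:
    # Combine multiple tax files for one filer into a single record.
    return {"documents": [parse_document(d) for d in docs]}
```

A real version would return extracted field values; the point is the dispatch pattern, which keeps format-specific logic isolated and makes adding a new parser a one-line registration.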
We designed and implemented a streaming data architecture in which new documents are streamed into BigQuery, so that data lake insights could be shared with financial partners under security and access control safeguards. During parsing, we applied data scrubbing, anonymization, and access controls.
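The scrubbing and anonymization step can be illustrated with a minimal sketch: mask SSN-shaped values in free text and replace the filer identifier with a salted hash before rows leave the parsing stage. Field names and the salt are hypothetical; a production system would use a managed secret and a broader set of PII patterns.

```python
import hashlib
import re

# Matches SSN-formatted values like 123-45-6789 (illustrative only).
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def scrub_text(text: str) -> str:
    """Mask SSN-shaped values in free text."""
    return SSN_RE.sub("XXX-XX-XXXX", text)

def anonymize_row(row: dict, salt: str = "example-salt") -> dict:
    """Replace the filer identifier with a salted hash and scrub text fields.

    The hash is deterministic, so the same filer maps to the same
    anonymous key across rows without exposing the real identifier.
    """
    out = dict(row)
    digest = hashlib.sha256((salt + row["filer_id"]).encode()).hexdigest()
    out["filer_id"] = digest[:16]
    out["notes"] = scrub_text(row.get("notes", ""))
    return out
```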
From a process standpoint, we initiated two work streams, one per project team, with an emphasis on frequent touchpoints. The data science team led the OCR effort, extracting valuable data from various document types using existing and newly built parsers and handling image and data quality, cleaning, and data organization. In parallel, the engineering team built the MLOps architecture to upload, process, and store the extracted data. The new system ran as a series of asynchronous workloads through custom data pipelines, providing a reliable process, the ability to learn from the models, and advanced monitoring for managing failures.
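The asynchronous pattern described above can be sketched with Python's `asyncio`: each document flows through upload, cleanse, OCR, categorize, and store stages, failures are retried, and every failed attempt is recorded for monitoring. Stage names, the retry count, and the in-memory failure sink are all illustrative stand-ins for the client's pipeline.

```python
import asyncio

FAILURES: list = []  # stand-in for a real monitoring/alerting sink

async def with_retry(stage, payload, attempts: int = 3):
    """Run one stage, retrying on failure and logging each failed attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return await stage(payload)
        except Exception as exc:
            FAILURES.append((stage.__name__, attempt, str(exc)))
            if attempt == attempts:
                raise
            await asyncio.sleep(0)  # placeholder for real backoff

# Toy stages: each returns an enriched copy of the document record.
async def cleanse(doc):    return {**doc, "clean": True}
async def ocr(doc):        return {**doc, "text": "..."}
async def categorize(doc): return {**doc, "category": "W-2"}
async def store(doc):      return doc

async def process(doc):
    for stage in (cleanse, ocr, categorize, store):
        doc = await with_retry(stage, doc)
    return doc

async def run(docs):
    # Documents are independent, so they are processed concurrently.
    return await asyncio.gather(*(process(d) for d in docs))
```

Because each stage is awaited independently and failures are captured per stage and per attempt, operators can see exactly where in the pipeline a document stalled.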
The result was a newly formed data warehouse and data lake, ready to receive and store processed documents with streamlined updates and access control. Data loss prevention and network security safeguards were also implemented at the row level, ensuring that only anonymized data is shared with partners and the ML application. Models could now be built on top of a clean, warehoused data set.
The CTO and leadership wanted to see results in production, so our goal was to ship models rather than leave them in staging and test environments. With the new working structure in place, leadership was confident that future model improvements would not run the risk of siloed experiments: the teams now operated as a cross-functional unit, each playing a role in managing the cloud infrastructure and deploying production ML workloads.
The new process has future-proofed the platform's ML capabilities with a unified data lake, positioning the application to take advantage of BigQuery ML for segmentation models evaluated with efficiency metrics and silhouette scores, while reducing time spent on operations. Customer-facing applications can now pull from the data lake to power platform services, paving the way to commercialize the consumer-facing product.
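To make the silhouette score concrete: it measures how well each point fits its own segment versus the nearest other segment, ranging from -1 (misassigned) to 1 (well separated). The plain-Python sketch below computes it for 1-D points; in practice this evaluation would run in BigQuery ML or an ML library rather than hand-rolled code.

```python
def silhouette(points, labels):
    """Mean silhouette coefficient for 1-D points with integer labels.

    For each point: a = mean distance to its own cluster,
    b = mean distance to the nearest other cluster,
    score = (b - a) / max(a, b).
    """
    def dist(a, b):
        return abs(a - b)

    scores = []
    for i, (p, lbl) in enumerate(zip(points, labels)):
        same = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
                if l == lbl and j != i]
        a = sum(same) / len(same) if same else 0.0
        b = min(
            sum(dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != lbl
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

A tight, well-separated segmentation scores close to 1, while shuffled labels score near or below 0, which is what makes the metric useful for comparing candidate segmentations.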