To successfully develop and deploy a machine learning (ML) application that is resilient and robust, it’s important to understand how to design a bulletproof ML application development workflow.
In this post, I am going to take you through the machine learning lifecycle we use here at Bitstrapped with our client projects. This process ensures that the resulting production ML application delivers on its objectives and is resilient and adaptive over time.
The workflow we use has two distinct phases: An Experimental Phase and a Production Phase. Each phase of the workflow has a set of unique workflow stages. Below I have defined each stage, its role, and its objectives to help you better understand the entire methodology.
The Experimental Phase of the workflow is broken down into three key stages: problem definition and data collection, data labeling, and model selection and tuning.
The Production Phase of the workflow has four key stages: training on the full dataset, model versioning in a registry, deployment with A/B and canary testing, and monitoring for drift.
Let’s now step through each stage of the workflow to help you understand why each element is important and what each one is designed to do.
To help illustrate the workflow, we’ll use a sample use case where we’re deploying a machine learning application to detect patient falls in video footage acquired from video equipment in a hospital environment.
In any ML application workflow process, the first step is to define the problem we are trying to solve for our client and collect data that we can use with our machine learning models.
In our fall-detection application example, we would first want to collect video data in a hospital. The application we develop will use image processing to detect patient falls from the live video feeds. In a production environment, once a fall is detected the application would report the incident to the nursing or support staff that is caring for the patients and trigger an alert so staff can respond and help the fallen patient.
The process starts with data collection from cameras that are set up in the institution. If cameras don't already exist, they would be installed to specification. If they do exist, our engineers would extract sample video data from them to experiment with.
The sample video would then be ingested into a data warehouse in our development environment. We only work with a sample because experimenting on live video would be onerous. Using a subset of the dataset makes testing easier, and only once we transition to the Production Phase do we introduce the full dataset to the application.
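To make the ingestion step concrete, here is a minimal sketch of landing sample clips in a Cloud Storage bucket before any metadata is loaded into the warehouse. The bucket name and local folder are hypothetical placeholders, not our actual configuration.

```python
# A minimal sketch of ingesting sample clips into Cloud Storage for
# experimentation. The bucket name and local folder are hypothetical.
from pathlib import Path
from google.cloud import storage

client = storage.Client()                           # uses default GCP credentials
bucket = client.bucket("hospital-video-samples")    # hypothetical bucket

for clip in Path("sample_clips").glob("*.mp4"):
    blob = bucket.blob(f"raw/{clip.name}")          # destination object path
    blob.upload_from_filename(str(clip))
    print(f"Uploaded {clip.name}")
```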
The next step in the MLOps life cycle is data labeling.
In our example, our team of data scientists would use a data labeling software tool to mark up the video and isolate 3-to-5-second video segments that capture examples of patient falls. We might start with 1,000 hours of footage for experimentation.
That can be a very labor-intensive process, so our data scientists use a variety of tools and services to speed up the labeling process or, in some cases, automate labeling of the test data.
You need to know which video feeds had people falling in them and which ones didn't. We might also choose to label video of people sitting, standing, or walking to help identify the behaviors captured on video.
This helps a machine learning application define what is happening in a video and what a patient fall looks like relative to the other activities shown in the video. This makes it easier to accurately detect patient falls from live footage in a production environment.
We would also use labeling to identify specific elements of the video footage. If the video sources show patient rooms, we could also use labeling to identify unique physical spaces in the hospital. This information is used in the ML model to help it understand the context of the video and how it relates to the behavior you want to identify; in the case of our example, a patient fall.
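For illustration, labeled segments might be recorded in a simple manifest like the one below. The field names and values are hypothetical, not the schema of any particular labeling tool.

```python
# Illustrative only: one way labeled video segments might be recorded.
# Field names and values are hypothetical, not a specific tool's schema.
import json

labeled_segments = [
    {"video": "ward3_cam2.mp4", "start_s": 412.0, "end_s": 416.5,
     "label": "fall", "location": "room_312"},
    {"video": "ward3_cam2.mp4", "start_s": 90.0, "end_s": 94.0,
     "label": "walking", "location": "hallway_3a"},
]

with open("labels.jsonl", "w") as f:
    for segment in labeled_segments:
        f.write(json.dumps(segment) + "\n")   # one JSON record per line
```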
Once the sample data has been labeled, our team then selects a machine learning model and starts to test different algorithms to provide the application with the best method of identifying a fall in the sample video. This is where the data scientists apply their math and magic.
We might use any number of video image processing algorithms, and for each algorithm we would test different parameters.
We might test color video versus black and white footage. Or we might adjust picture quality parameters such as contrast, brightness, or sharpness. We may even test various video resolutions. All these adjustments and tests help us optimize the ML application’s accuracy for fall detection.
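To make those tests concrete, here is an illustrative sketch of the kinds of preprocessing variants we might compare side by side. OpenCV is an assumed tooling choice here, not a prescribed one.

```python
# Illustrative preprocessing variants we might compare during experimentation.
# OpenCV is an assumed tooling choice, and the file names are placeholders.
import cv2

frame = cv2.imread("sample_frame.png")                 # one frame from a clip

variants = {
    "color_720p": cv2.resize(frame, (1280, 720)),
    "gray_720p": cv2.cvtColor(cv2.resize(frame, (1280, 720)), cv2.COLOR_BGR2GRAY),
    # alpha scales contrast, beta shifts brightness
    "high_contrast": cv2.convertScaleAbs(frame, alpha=1.5, beta=10),
    "low_res": cv2.resize(frame, (640, 360)),
}

for name, img in variants.items():
    cv2.imwrite(f"variant_{name}.png", img)            # save for side-by-side review
```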
As the data scientists test various algorithms and parameters with those algorithms, we also want to track each attempt and each test iteration.
That is because, in the future, we might want to be able to go back to a specific model that we have tested and see what data we tested it on and what parameters we used. This is all documented using a tool like Vertex AI.
This is important in MLOps because you can only improve on something if you know exactly what you had to begin with. If that model starts to fail in the future or the data changes and you want to improve it, you should at least know how it was created and under what conditions.
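As a rough sketch, tracking an experiment with Vertex AI Experiments might look like the following. The project, experiment and run names, parameters, and metric values are all placeholders.

```python
# A minimal sketch of experiment tracking with Vertex AI Experiments.
# Project, experiment, run names, parameters, and metrics are placeholders.
from google.cloud import aiplatform

aiplatform.init(
    project="my-gcp-project",                 # hypothetical project ID
    location="us-central1",
    experiment="fall-detection-experiments",
)

aiplatform.start_run("contrast-sweep-01")     # one tracked attempt
aiplatform.log_params({
    "architecture": "3d-cnn",
    "clip_length_s": 5,
    "resolution": "1280x720",
    "grayscale": "false",
})
# ... train and evaluate the model here ...
aiplatform.log_metrics({"precision": 0.0, "recall": 0.0})   # replace with real results
aiplatform.end_run()
```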
At this point in the MLOps lifecycle, our data scientists will also transition off their local computers and start working on Google Cloud Platform (GCP) to leverage its speed and performance. This elastic computing capability allows processing, memory, and storage resources to be expanded quickly on demand, without worrying about capacity planning and engineering for peak usage. GCP offers performance that a local computer can't provide.
Once our data scientists have selected a good algorithm and worked to optimize it using labeled data and further tweaks, they will arrive at fairly robust functionality.
However, we're not yet done. They will then move to the next stage, where they iterate and tune the model even further. In this optimization step, we are looking to see whether there are better parameters that can be selected to improve performance. We call this hyperparameter tuning.
Hyperparameters we might test include the number of layers used in a neural network, the learning rate, or how weights and biases are initialized and regularized. There may be any number of additional hyperparameters we might test in our efforts to optimize the ML model's performance.
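A simple grid search is one way to run such a sweep. The sketch below is illustrative only; train_and_evaluate is a hypothetical stand-in for training and validating the fall-detection model with a given set of hyperparameters.

```python
# A simple grid-search sketch for hyperparameter tuning. train_and_evaluate()
# is a hypothetical stand-in for training the fall-detection model and
# returning a validation score.
import itertools

def train_and_evaluate(num_layers: int, learning_rate: float, dropout: float) -> float:
    """Hypothetical placeholder: train the model and return a validation score."""
    return 0.0  # replace with real training and evaluation

search_space = {
    "num_layers": [4, 6, 8],          # depth of the network
    "learning_rate": [1e-3, 1e-4],
    "dropout": [0.1, 0.3],
}

best_score, best_params = float("-inf"), None
keys = list(search_space)
for values in itertools.product(*(search_space[k] for k in keys)):
    params = dict(zip(keys, values))
    score = train_and_evaluate(**params)
    if score > best_score:
        best_score, best_params = score, params

print("Best hyperparameters:", best_params, "with score", best_score)
```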
By the end of the Experimental Phase of the MLOps lifecycle, the result is an algorithm that is set up and demonstrably functioning well on sample data. We would also have a record of all the experimentation conducted to date and the various outcomes that the work has produced.
At this point in the MLOps life cycle, we have completed the Experimental Phase and iterated through it until our team determines it is time to move into the Production Phase.
Around our shop, we call it "productionalizing" the model. (Some people call it "productionizing" the model.) You won't find this industry buzzword in most dictionaries, but it simply means putting the machine learning model into production.
In the next phase, the objective is to put the fully tested ML application (a packaged binary) onto the camera system in the hospital. We are not done at this point, however.
The first step in this Production Phase is to train the application with the full set of data. While we may have been experimenting earlier with 100 GB of sample data, we now need to train the application on the full dataset, so we'd be scaling up from 100 GB to 100 TB of video.
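One way to run that full-scale training is as a custom training job on Vertex AI. The sketch below is illustrative; the project, container image, data path, and machine shapes are assumptions rather than our actual configuration.

```python
# An illustrative sketch of scaling training up as a Vertex AI custom training
# job. Project, container image, data path, and machine shapes are assumptions.
from google.cloud import aiplatform

aiplatform.init(project="my-gcp-project", location="us-central1")

job = aiplatform.CustomContainerTrainingJob(
    display_name="fall-detection-full-train",
    container_uri="gcr.io/my-gcp-project/fall-detection-train:latest",  # hypothetical image
)

job.run(
    args=["--data=gs://hospital-video-full/", "--epochs=20"],  # hypothetical flags
    replica_count=4,
    machine_type="n1-standard-16",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=2,
)
```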
Once the model is trained on the full dataset we need a model registry to track versions of the model as it is deployed into production and tested.
For that, we can use DVC, a tool by Iterative.ai, or the Vertex AI Model Registry. The model registry is a repository that stores all of the different model versions throughout the application's production lifecycle.
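If we go the Vertex AI route, registering a new model version might look roughly like this; the artifact path and serving image below are hypothetical.

```python
# A minimal sketch of registering a trained model version with Vertex AI.
# The artifact path and serving image are hypothetical.
from google.cloud import aiplatform

aiplatform.init(project="my-gcp-project", location="us-central1")

model = aiplatform.Model.upload(
    display_name="fall-detector",
    artifact_uri="gs://hospital-ml-artifacts/fall-detector/v2/",
    serving_container_image_uri="gcr.io/my-gcp-project/fall-detector-serve:latest",
)
print(model.resource_name, model.version_id)
```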
Once the model is in full production and its versions are tracked in a repository, it can be further tweaked and optimized.
We also want to do A/B testing of the model in the production environment and compare it to an earlier version of the model. We might also do what we call “canary testing” where we test a new production binary on a small subset of the data in the production system and evaluate how well it is doing. Then we incrementally increase the volume of data it processes as it proves itself.
In our example, we might start with 1 percent of the video data and increase it to 50 percent, comparing the new version against a previous version of the application. When the development team is happy with the performance of a model, it is then given 100 percent of the data, and at this point the older version of the model is removed from production.
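With Vertex AI endpoints, that kind of gradual rollout can be expressed through traffic splitting. The sketch below is a rough illustration; the resource names are placeholders and not real deployments.

```python
# A minimal sketch of a canary-style rollout using Vertex AI traffic splitting.
# The endpoint and model resource names are hypothetical.
from google.cloud import aiplatform

aiplatform.init(project="my-gcp-project", location="us-central1")

endpoint = aiplatform.Endpoint(
    "projects/my-gcp-project/locations/us-central1/endpoints/1234567890"
)
new_model = aiplatform.Model(
    "projects/my-gcp-project/locations/us-central1/models/9876543210"
)

# Route 1 percent of traffic to the new version; the rest stays on the old one.
endpoint.deploy(model=new_model, traffic_percentage=1, machine_type="n1-standard-4")

# As the new version proves itself, the endpoint's traffic split can be shifted
# further toward it (for example 50/50, then 100/0) before retiring the old model.
```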
Once we deploy a model into production, it is important to monitor the system for CPU and memory usage, and other project-specific infrastructure performance like data bandwidth, camera power states, etc.
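At its simplest, host-level monitoring can be sketched like this. In practice we would lean on the platform's monitoring stack; psutil and the thresholds below are assumptions for illustration.

```python
# A bare-bones sketch of host-level monitoring. psutil is an assumed
# dependency and the thresholds are placeholders, not recommended values.
import time
import psutil

CPU_LIMIT, MEM_LIMIT = 85.0, 90.0                 # example alert thresholds (percent)

while True:
    cpu = psutil.cpu_percent(interval=1)          # CPU usage over the last second
    mem = psutil.virtual_memory().percent         # RAM usage
    if cpu > CPU_LIMIT or mem > MEM_LIMIT:
        print(f"ALERT: cpu={cpu:.0f}% mem={mem:.0f}%")   # hook into real alerting here
    time.sleep(30)
```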
However, we also conduct what is called "drift monitoring", sometimes referred to as monitoring for model drift or data drift. There are two forms of drift. The first is called train/inference drift, or train/inference skew; you can use either term.
What you're looking for here is a major change in the production environment in which the model is operating.
To understand this using the fall-detection example, let's say the application has been deployed for five years and functions extremely well. But then the facility decides to upgrade all the cameras to the next generation of equipment, and the system starts receiving ultra-high-definition video feeds. The high-definition video the system captures is replaced with ultra-high-definition footage, and the resolution jumps from 1280x720 pixels (720p HD) to 3840x2160 pixels (4K).
The original machine learning code has not changed, but the data quality has. With machine learning, if the data changes, the behavior of the program may need to adapt automatically.
The second form of drift appears when data trends shift over time. We want to be able to detect those shifts and automatically update and retrain our models to accommodate them.
Here is how data drift might show up in our example ML application. Let's say your model was typically trained on video of adults in the facility, but now the hospital reconfigures its services to care for more children as patients. It could be that the model doesn't detect children falling as effectively as it does adults.
Or the hospital renovates a wing and the position of the beds changes when the patient rooms get a design refresh. That, too, could require a model update.
The server that runs the model can monitor for drift. When drift is detected, it can either send out an alert to kick off human intervention, or trigger automated pipelines that are engineered to self-correct the model.
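As a simple illustration of drift detection, the sketch below compares the distribution of one monitored statistic (mean frame brightness per clip is an assumed choice) between training data and recent production data using a two-sample Kolmogorov-Smirnov test. The threshold, the synthetic numbers, and the alert hook are placeholders.

```python
# A simple sketch of flagging drift by comparing the distribution of one
# monitored statistic between training and recent production data.
# The statistic, threshold, and alert hook are placeholders.
import numpy as np
from scipy.stats import ks_2samp

def has_drifted(train_stats: np.ndarray, prod_stats: np.ndarray,
                p_threshold: float = 0.01) -> bool:
    """Return True if the two samples look like different distributions."""
    _, p_value = ks_2samp(train_stats, prod_stats)
    return p_value < p_threshold

# Synthetic stand-ins for mean frame brightness per clip.
train_brightness = np.random.normal(loc=110, scale=15, size=5000)
prod_brightness = np.random.normal(loc=140, scale=20, size=1000)

if has_drifted(train_brightness, prod_brightness):
    print("Drift detected: alert staff or trigger the retraining pipeline.")
```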
A machine learning development team will work to anticipate this and automate a response as much as possible. That is the key to machine learning: adapting the model as inputs change.
This machine learning workflow is a highly tuned process that ensures the MLOps lifecycle is correctly designed, tested, and deployed so that the application achieves the results our clients are seeking. Even though we used an image processing application as an example, this same machine learning lifecycle can be applied to any ML project and any application. The use case may change, but the way we approach it at Bitstrapped doesn't. We use a tried-and-true formula in all our ML development projects. If you have questions or would like a free discovery call with our team, contact us.