How Exploratory Data Analysis (EDA) Accelerates ML

By Adam Thorsteinson, Bitstrapped
Updated October 20, 2021

Constructing machine learning models is easier than it has ever been. Cloud services like the Google Cloud Platform (GCP) and Amazon Web Services (AWS) have put tools for building and deploying machine learning (ML) models directly into the hands of anyone who’s willing to put in the time and effort to learn them. With modelling itself becoming more efficient, it’s what comes before and after the model-fitting that will bring businesses a competitive advantage in their space. Exploratory Data Analysis (EDA) is a critical part of this process.

Exploratory Data Analysis (EDA) helps ensure your ML implementations are reliable and robust, and it can surface insights before you engineer the pipeline and push to production.

Here are three ways that EDA can improve your ML implementations:

Exploratory Data Analysis Surfaces Data Quality Issues (Before it’s too Late)

Any ML practitioner will tell you that the best way to improve your modelling output is not by picking a more sophisticated model, but by boosting your data quality & quantity. 

Quantity is often the quickest criterion to evaluate through exploration. As a rule of thumb, the more complex the machine learning model you’re looking to train, the more data you will need to train it well. A simple regression model may only need a few hundred data points for useful insights, whereas a recommendation model that detects the subtle content preferences of an entire user base would likely require hundreds of thousands. Taking a quick look at your data will tell you whether or not you’re set up for success.

The behaviour of ML systems is largely an artifact of the data that they’re built on.

Taking an early glance into your dataset will help you understand whether you have substantial amounts of missing data, duplicate data, data whose type is incorrect or inconsistent, or simply erroneous values -- data quality issues that can degrade model quality.

Believe it or not, depending on the platform you’re using, your models may still train smoothly with any or all of those issues present, but their predictive performance will suffer. These types of issues can generally be raised and addressed by data professionals in collaboration with the business.
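As a rough illustration of that early glance, here is a minimal sketch using pandas. The `orders` table, its column names, and the `quality_report` helper are all hypothetical, invented for this example; the checks mirror the issues listed above (missing data, duplicates, inconsistent types, erroneous values).

```python
import numpy as np
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise dtype, missing values, and distinct values per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
        "n_unique": df.nunique(),
    })

# Hypothetical toy data with one of each problem planted in it.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "amount": [19.99, 5.00, 5.00, np.nan, -250.00],  # a gap and a negative price
    "quantity": ["1", "2", "2", "three", "4"],       # numbers stored as text
})

print(quality_report(orders))

# Fully repeated rows (the first occurrence is not counted).
n_duplicate_rows = int(orders.duplicated().sum())

# Values that cannot be parsed as numbers become NaN under errors="coerce".
n_unparseable_quantity = int(
    pd.to_numeric(orders["quantity"], errors="coerce").isna().sum()
)

# A simple range check; what counts as "erroneous" depends on the business.
n_negative_amount = int((orders["amount"] < 0).sum())

print("duplicate rows:", n_duplicate_rows)
print("unparseable quantities:", n_unparseable_quantity)
print("negative amounts:", n_negative_amount)
```

Which thresholds and range checks make sense is a business question as much as a technical one, which is why these findings are best reviewed together with the people who own the data.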

Part of the job of ML practitioners is to assess a dataset’s readiness for machine learning training given the business problem at hand, and to prevent bias from creeping into the pipeline along the way. We are more susceptible to biases than we believe, and so are our ML systems. There can be biases in the way your data has been collected and analyzed (e.g. sampling bias, measurement bias, labelling bias), in the way your models are evaluated after training, or even in the way that your business problem is formulated in the language of data. (For a more in-depth look at biases in ML, see this paper). Some of these biases can be assessed in the data exploration phase, leading to a more robust machine learning product when things move into production.


EDA Reveals Data Insights & Augmentation Opportunities

After considering bias and doing basic cleaning (that is, exploring the problem space and the raw materials), we can move on to exploring the patterns inherent in the data. This is often the most overlooked step in ML preparation, and it’s a shame because it delivers a ton of value later down the road.

This is where data visualization is key. Visualize the distribution of your numerical features to determine whether they’re normal, skewed, exponential, etc. Take a look at the correlations or mutual information shared between your features to see whether they’re highly related or mostly independent. Not only do these findings matter for modelling, but the visualizations themselves can start to shed light on the patterns and trends behind the business problem you’re investigating.
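Alongside the plots, a couple of numeric summaries can make the same points. The sketch below is illustrative only: the `age`, `income`, and `spend` features are synthetic and hypothetical, and it uses pandas’ built-in skewness and rank correlation. Mutual information is not shown here; scikit-learn’s `mutual_info_regression` is one option if you have that library available.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000

# Hypothetical features: one roughly symmetric, one right-skewed,
# and one that depends on the skewed feature.
age = rng.normal(40, 10, n)
income = rng.lognormal(mean=10, sigma=1, size=n)
spend = 0.1 * income + rng.normal(0, 500, n)
df = pd.DataFrame({"age": age, "income": income, "spend": spend})

# Skewness near 0 suggests a symmetric distribution;
# a large positive value suggests a long right tail.
skewness = df.skew()
print(skewness.round(2))

# Spearman (rank) correlation also picks up monotonic, non-linear relationships.
corr = df.corr(method="spearman")
print(corr.round(2))
```

A histogram of each column would tell the same story visually; the numbers are simply easier to scan when you have dozens of features.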

Often businesses turn to machine learning as a way to get insights about their data.

EDA gives you the opportunity to ask yourself: Is my question already answered after having simply visualized the data? The insights you desire may already exist in your data without ML and can be surfaced using an EDA.

If the desired insights already exist, then you may not even need to pursue a machine learning solution; you may just need a well-designed dashboard fed by a pipeline of the most up-to-date data. You’ve just saved your organization a ton of time and effort that would have gone down an ML rabbit hole when the optimal solution was a much simpler one.

Of course, many times we’ll still need to move forward with that initial machine learning vision, but now you’ll be moving with much more clarity. Visualizing the distributions and relationships across your features can reveal that you need to do some feature transformation to improve your modelling results. That might mean taking the logarithm of features that have highly skewed distributions, taking the ratio between two features to develop a more relevant metric, or performing principal component analysis on a set of features that are highly correlated. These actions are all dependent on your dataset, and there’s no better way to determine when a particular action is needed than to actually look at your data.
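To make those three transformations concrete, here is a minimal NumPy sketch. All of the data is synthetic, and the feature names (`income`, `revenue`, `visits`) and the `sample_skewness` helper are hypothetical. The PCA step is done with a plain singular value decomposition rather than a dedicated library class.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

def sample_skewness(x: np.ndarray) -> float:
    """Third standardized moment of a sample."""
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

# 1) Log transform of a highly skewed feature (log1p also handles zeros).
income = rng.lognormal(mean=10, sigma=1, size=n)
log_income = np.log1p(income)
print("skew before:", round(sample_skewness(income), 2))
print("skew after: ", round(sample_skewness(log_income), 2))

# 2) Ratio of two features as a more relevant metric.
revenue = rng.uniform(1_000, 5_000, n)
visits = rng.integers(100, 1_000, n)   # never zero, so the division is safe
revenue_per_visit = revenue / visits

# 3) PCA on three highly correlated features, via SVD of standardized data.
base = rng.normal(size=n)
X = np.column_stack([base + 0.1 * rng.normal(size=n) for _ in range(3)])
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
_, s, vt = np.linalg.svd(Xs, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)    # share of variance per component
first_component = Xs @ vt[0]           # one feature replacing three
print("variance explained:", explained.round(3))
```

Whether any of these is appropriate depends on what the exploration showed; the log transform, for instance, only helps when the skew is real and the values are non-negative.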

Exploring your data can also reveal that it doesn’t carry enough signal to bring predictive power. This is when you would look to data augmentation techniques. Adding features to your dataset, whether from elsewhere in the business or from external sources, can bring new information that boosts the power of your ML products. 
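In practice, adding features often comes down to a join. The sketch below assumes a hypothetical `customers` table and a hypothetical external `regions` table keyed by postal code; the point is less the join itself than checking how many rows actually picked up the new information.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "postal_code": ["A1A", "B2B", "C3C"],
})

# Hypothetical external source: regional statistics keyed by postal code.
regions = pd.DataFrame({
    "postal_code": ["A1A", "B2B"],
    "median_income": [52_000, 71_000],
})

augmented = customers.merge(
    regions,
    on="postal_code",
    how="left",              # keep every customer, matched or not
    validate="many_to_one",  # raise if the lookup table has duplicate keys
    indicator=True,          # adds a "_merge" column describing each row
)

# Share of customers that gained the new feature.
match_rate = (augmented["_merge"] == "both").mean()
print(augmented)
print(f"match rate: {match_rate:.0%}")
```

A low match rate is itself an EDA finding: the new feature will be mostly missing, and it may not add the signal you were hoping for.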

All of these feature transformations and data augmentation actions can be encoded into a solution like a feature store so that once the actions are explored and finalized, they’ll be performed automatically as new data flows in.

EDA Informs Modelling Decisions

Lastly, a solid EDA can actually make the modelling process more focused, and quicker as a result. Remember those visualizations and trends you explored earlier? Those will help determine just how complex an ML model you’ll need for your product.

Machine learning is a special type of endeavour where bringing in the most powerful tool possible can actually make your product worse. Imagine if having Michael Jordan play on your schoolyard basketball team would almost guarantee you’d lose. That’s the world we find ourselves in with machine learning. A good ML practitioner will tell you that the best model for a product is (a) one that is appropriate for the problem and data at hand, and (b) the simplest one for the job. In the spirit of Occam’s Razor, the simplest viable model is usually the best one.

If the trends in your data are linear, you don’t want a complex non-linear model, even if that model is state-of-the-art. If you have a limited volume of data, you’ll want a model with fewer parameters to fit.
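A small experiment illustrates the point. The sketch below is a toy setup, not a benchmark: it assumes a hypothetical linear ground truth with noise, only 12 training points, and compares a straight-line fit with a degree-9 polynomial using NumPy’s `Polynomial.fit`.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)

def make_targets(x: np.ndarray) -> np.ndarray:
    """Hypothetical ground truth: a linear trend plus noise."""
    return 2.0 * x + 1.0 + rng.normal(0.0, 0.3, size=x.shape)

# A deliberately small training set and a larger held-out set.
x_train = rng.uniform(0.0, 1.0, 12)
y_train = make_targets(x_train)
x_test = np.linspace(0.0, 1.0, 200)
y_test = make_targets(x_test)

def mse(model, x, y) -> float:
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

simple = Polynomial.fit(x_train, y_train, deg=1)    # 2 parameters
flexible = Polynomial.fit(x_train, y_train, deg=9)  # 10 parameters

for name, model in [("degree 1", simple), ("degree 9", flexible)]:
    print(
        f"{name}: train MSE {mse(model, x_train, y_train):.3f}, "
        f"held-out MSE {mse(model, x_test, y_test):.3f}"
    )
```

The flexible model will always look at least as good on the training points, because it has more freedom to chase the noise. The held-out error is the number to watch, and on a linear trend with this little data the simpler model is the one you would expect to come out ahead.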

Exploring your data will give you insight into the complexity of the relationships in your data, guiding your decision-making process when choosing a set of models to prototype.

Conclusion

In an environment where we’re encouraged to build pipelines and productionalize, it can be tempting to find the nearest ML use case, hook up the hose and let things flow. But taking some time before deployment to actually look at your data will reveal information that will change the way you implement your ML models.

Of equal importance is the attention paid to concerns after the model has been trained. How and where will it be deployed? How frequently will it be updated with new data? Are you monitoring your models for drift? These are some of the considerations that fall under the umbrella of MLOps, and they ensure that our ML products bring real business value over time. 

If you’re looking for help with EDA or with anything under the MLOps umbrella, get in touch.



