Reproducibility is the accountability businesses must provide if we are to understand and trust the adoption of Machine Learning into our day-to-day lives.
As Machine Learning becomes more productionized, many businesses and researchers may feel compelled to rush in on the action in hopes of faster results, without fully comprehending the intricacies of the methods they implement or what is sacrificed by skipping the right procedures.
With information at our fingertips and Machine Learning failures making major headlines, as in the case of Microsoft's Twitter chatbot Tay, public skepticism has grown, which further emphasizes the need for reproducibility in any methodology.
Note: The lessons in this article are taken from my notes of the Deployment of Machine Learning Models course on Udemy.
What is Reproducibility?
In a machine learning context, reproducibility refers to the ability to duplicate a model exactly, such that when the reproduced model is passed the same input data, it returns the same output as the original.
Failure to consider reproducibility up front can lead to serious repercussions later in development. A prime example is financial: if significant resources are invested in model development in the research environment but the model cannot be reproduced in production, then the model and its predictions are essentially useless. That's wasted time, effort, and money, all down the drain.
Additionally, reproducibility plays a pivotal role in allowing us to vet our machine learning models against prior solutions. Without the ability to reproduce prior results, we would have no way of accurately distinguishing whether a new model has exceeded the performance of a previous model.
The Challenges of Reproducibility
When creating and building a machine learning model, practitioners work in multiple environments throughout the process. An environment describes the state of the computer system in which software (or another product) is developed or put into operation. There are typically three environments involved in developing a machine learning model:
- Research Environment – In the research environment, Data Scientists would perform data analysis to understand the data, build models, and evaluate them to better understand what decisions the model is making and whether it meets the standard for the desired outcome of the project.
- Development Environment – In the development environment, Machine Learning Engineers seek to reproduce the machine learning pipeline developed in the research environment. This environment also incorporates software engineering best practices to ensure the pipeline can operate as a proper software system.
- Production Environment – In the production environment, the model is able to serve other software systems and/or clients.
Each environment beyond the research environment attempts to replicate the pipeline developed in research. However, the procedures required to reach the desired state differ for each environment. This introduces potential hindrances to reproducibility at almost every step of the pipeline, from data acquisition to getting predictions from the model.
Reproducibility in Data Gathering
Always remember that Data comes before Science; without data, machine learning models are meaningless. On the topic of reproducibility, data acquisition is one of the most important and difficult problems to address.
"A model will never come out exactly the same unless the exact same data and processes are used to train it." – Soledad Galli, PhD, Instructor of Deployment of Machine Learning Models (Udemy)
When working in the research environment, a Data Scientist may have access to one version of the training data; by the time the Machine Learning Engineers come to reproduce the pipeline in production, however, that data may have changed. This comes down to the way some databases function: they may constantly overwrite older records with newer, updated versions, which changes the training data over time. Randomness is also introduced when Data Scientists query data with SQL, since rows may be returned in a non-deterministic order unless the query sorts them explicitly.
Many workarounds exist to address such challenges, though none is absolutely reliable in all situations. One solution is to ensure Data Scientists save a snapshot of the exact training data used to build the model. This method breaks down when the data is very large, or when storing copies of it conflicts with data regulations.
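As a minimal sketch of the snapshot approach (the file names and paths here are illustrative assumptions, not from the course):

```python
import hashlib
import os

import pandas as pd

# Hypothetical training data used in the research environment
df = pd.read_csv("training_data.csv")

# Persist an immutable snapshot alongside a content hash so the
# exact dataset can be verified and reloaded later
os.makedirs("snapshots", exist_ok=True)
snapshot_path = "snapshots/training_data_v1.csv"
df.to_csv(snapshot_path, index=False)

with open(snapshot_path, "rb") as f:
    checksum = hashlib.sha256(f.read()).hexdigest()
print(f"Snapshot checksum (SHA-256): {checksum}")
```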
Another method, and probably a more ideal one, is to add accurate timestamps to the data sources, which can later be used to identify exactly which data was used to train a model. Note, however, that this does not solve the problem if records are replaced in place whenever updates are made. Furthermore, if the database isn't designed to track timestamps, incorporating this functionality may require considerable effort.
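If reliable timestamps are available, the training set can be reconstructed by querying the data "as of" the date the model was built. A hedged sketch with pandas and SQLAlchemy; the connection string, table, and column names are all assumptions:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string for an illustrative database
engine = create_engine("postgresql://user:password@host/db")

# Reconstruct the training set "as of" the date the model was built.
# Note the ORDER BY: without it, row order is not guaranteed to be
# deterministic across query runs.
query = """
    SELECT *
    FROM customer_features
    WHERE updated_at <= '2023-01-15'
    ORDER BY customer_id
"""
df = pd.read_sql(query, engine)
```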
Reproducibility in Feature Engineering
If we get the reproducibility aspect of the data acquisition phase wrong, then there are likely to be problems when trying to reproduce the feature engineering phase of the pipeline.
"The parameters derived from the data will change if the data is not identical in both environments" – Soledad Galli, PhD, Instructor of Deployment of Machine Learning Models (Udemy)
Take a scenario in which the data has missing values and the Data Scientist opts to fill them by imputing the mean of the feature. If there are variations between the training data in the research and production environments, then the computed mean for a given feature may differ between the two. Of course, if the training data from the research environment can be reproduced exactly, this problem disappears.
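One way to guard against this is to fit the imputer once in the research environment and persist it, so production applies the identical learned mean rather than recomputing it from different data. A minimal sketch assuming scikit-learn and joblib; the toy feature values and file name are illustrative:

```python
import joblib
import numpy as np
from sklearn.impute import SimpleImputer

# Research environment: learn the mean from the training data
X_train = np.array([[25.0], [32.0], [np.nan], [47.0]])  # toy "age" feature
imputer = SimpleImputer(strategy="mean").fit(X_train)
joblib.dump(imputer, "age_imputer.joblib")

# Production environment: reload and apply the *same* learned mean,
# instead of recomputing it from whatever data is available there
imputer = joblib.load("age_imputer.joblib")
print(imputer.transform(np.array([[np.nan]])))  # fills with the research-time mean
```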
Other factors that may hinder reproducibility during the feature engineering phase include:
- Complex extraction methods (problems relate back to lack of reproducibility in the data from the research environment)
- Not keeping Hyperparameters constant (they must remain constant between environments)
- Feature generation methods that depend on random sampling (setting a seed ensures the same samples are generated; see the sketch below)
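On that last point, here is a minimal sketch of seeded sampling with NumPy; the data and sample size are arbitrary placeholders:

```python
import numpy as np

data = np.arange(1_000)

# A seeded generator yields the same sample every run,
# in research and in production alike
rng = np.random.default_rng(seed=42)
sample = rng.choice(data, size=100, replace=False)

# Re-creating the generator with the same seed reproduces the sample exactly
rng_again = np.random.default_rng(seed=42)
assert np.array_equal(sample, rng_again.choice(data, size=100, replace=False))
```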
Leveraging version control to track all of these differences can solve problems with reproducing the feature engineering phase, provided they don't stem from an inability to reproduce the data acquisition phase.
Reproducibility in Modeling
Some Machine Learning models introduce randomness during training, which creates challenges for reproducibility. Prime examples of this phenomenon are tree ensembles (Random Forest, Gradient Boosting, etc.) and Neural Networks.
For example, the Random Forest model uses bagging and feature randomness when constructing each individual tree in order to create an uncorrelated forest of trees, and Neural Networks introduce randomness during weight initialization. This randomness causes inconsistencies between models even when they are built on the same training data.
In a similar fashion to feature engineering, a simple solution to this threat to reproducibility is to set the seed wherever randomness is involved and to keep track of the hyperparameters used.
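In scikit-learn, for instance, this typically means fixing the `random_state` hyperparameter. A brief sketch on synthetic data (the dataset and model settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only
X, y = make_classification(n_samples=500, random_state=0)

# Two forests trained with the same seed build identical trees,
# and therefore make identical predictions
model_a = RandomForestClassifier(random_state=42).fit(X, y)
model_b = RandomForestClassifier(random_state=42).fit(X, y)
assert (model_a.predict(X) == model_b.predict(X)).all()
```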
Reproducibility in Deployment
Challenges to reproducibility in the deployment phase of the Machine Learning workflow revolve around failures to integrate the model with other systems.
A primary example of reproducibility failure in deployment occurs when the population used to train the model isn't representative of the population in the live environment. This may be down to either the live or research environment applying filters that went undetected at the time of model building, or to the Data Scientists who built the model not fully understanding how it would be consumed in production.
Another scenario occurs when the programming language changes between environments. Machine Learning is often done in Python or R, whereas applications are typically written in C++ or Java. This may incentivize teams to reproduce the research code in another language, which significantly increases the likelihood of (1) human error and (2) deployment error. Working in a single programming language throughout is the best way to overcome this challenge.
Lastly, a change in software versions may hinder the ability to reproduce the machine learning pipeline in deployment, since each change may cause discrepancies within the pipeline. This problem is far less common; however, it is probably the most challenging to fix, so ensuring the software versions are identical across all environments in the pipeline is vital.
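One common way to stay in a single language is to serialize the trained pipeline produced in research and load that exact artifact in production, rather than re-implementing it. A sketch assuming a scikit-learn pipeline and joblib; the pipeline contents and file name are illustrative:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)

# Research environment: train and persist the full pipeline
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(random_state=42)),
]).fit(X, y)
joblib.dump(pipeline, "model_pipeline.joblib")

# Production environment: same language, same artifact
served = joblib.load("model_pipeline.joblib")
predictions = served.predict(X)
```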
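A simple safeguard here is to record the exact library versions alongside the model artifact, so the production environment can be pinned to match. A minimal sketch; the metadata file name is an assumption, and running `pip freeze > requirements.txt` at training time achieves a similar end:

```python
import json

import numpy as np
import pandas as pd
import sklearn

# Record the versions the model was trained with, so the production
# environment can be pinned to match them exactly
metadata = {
    "scikit-learn": sklearn.__version__,
    "numpy": np.__version__,
    "pandas": pd.__version__,
}
with open("model_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```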
Wrap Up
Data Scientists and Machine Learning Engineers face several challenges when trying to ensure the machine learning workflow is reproducible. While some of these problems require significant changes to overcome, others can be quite simple to fix (e.g., setting a seed). The main takeaway is that every potential hindrance to reproducibility must be addressed to attain true reproducibility.
Thank You for Reading!
If you enjoyed this article, connect with me by subscribing to my FREE weekly newsletter. Never miss a post I make about Artificial Intelligence, Data Science, and Freelancing.
Related Articles
Machine Learning Model Deployment
Data Scientists Should Know Software Engineering Best Practices