Introduction
If there is one immediate action that the Machine Learning (ML) community needs to take, it is to deal effectively with the data crisis. The data crisis that is upon us is driven by the cost and complexity of Data Engineering and the hodgepodge mix of nascent tools and approaches that are being thrown at it.
For far too long there have been discussions about "democratizing AI", with initiatives and efforts spanning the full gamut of the imagination. Yet the key ML ingredient, data, has seemingly been absent from the discussion. At the risk of pointing out the obvious, a properly functioning ML pipeline requires the democratization of data before there can be any democratization of ML.
There are very few organizations whose ML efforts are not severely hamstrung by the resources needed to develop and maintain datasets. From academia to industry, from materials science to law, the need to store, manage, share, find and use data for ML research and application is an obstacle that few can surmount. This is significantly impeding the advancement of ML across academia and industry with substantive societal impacts, including hindering scientific research.
To be clear, the democratization of data is not just about access to data, but about the empowerment of those who need to use it. That we have placed a name on the skills and tooling needed – data engineering – does not help in solving the problem one iota. While eliminating data engineering altogether is unrealistic, there must be an effort to significantly reduce the burden it imposes. Doing so calls for a dominant open source design for a data management platform for ML that is 1) available to the global community (a commons) and 2) usable by people who are not expert programmers.
A handful of the world’s foremost technology firms have the resources and expertise to undertake the development of data management platforms for their own internal needs. This, however, only reinforces the divide between the haves and the have-nots. A dominant open source design originating from the cloud is unlikely, as the major cloud providers are each building their own proprietary platforms. There is, however, nothing stopping the cloud providers from implementing their own version of a common platform that is fully integrated into their stacks.
Requirements
When referring to a platform, I am referring to an all-in-one solution for data management. At a high level, the data management platform needs to target the following:
- Empower Data Science and ML users around the world by taking the engineering out of data engineering.
- Support management of the full data life cycle in ML.
- Support the ML lifecycle with integration across the ML workflow.
- Support the predominant ML frameworks.
- Support a variety of ML data including sensor and instrumentation data.
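To make the life cycle requirement a little more concrete, here is a minimal sketch of one small piece of it: tracking dataset versions without asking the user to do any engineering. Everything in it (the `DatasetRegistry` class, its `register` and `latest` methods, the `sensor_readings` example) is hypothetical and invented for illustration; it is not the design of any existing platform, and a real one would persist and share this state rather than hold it in memory.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class DatasetRegistry:
    """Hypothetical in-memory registry of dataset versions (illustration only)."""

    versions: dict = field(default_factory=dict)  # dataset name -> list of version records

    def register(self, name, records, source):
        # A content hash identifies the version, so re-registering
        # unchanged data does not create a new entry in the history.
        payload = json.dumps(records, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()[:12]
        history = self.versions.setdefault(name, [])
        if not history or history[-1]["digest"] != digest:
            history.append({"digest": digest, "source": source, "rows": len(records)})
        return digest

    def latest(self, name):
        # Most recently registered version of the named dataset.
        return self.versions[name][-1]


registry = DatasetRegistry()
batch_1 = [{"t": 0, "temp_c": 21.5}]
batch_2 = batch_1 + [{"t": 1, "temp_c": 21.7}]

v1 = registry.register("sensor_readings", batch_1, source="lab-a")
v2 = registry.register("sensor_readings", batch_2, source="lab-a")
v2_again = registry.register("sensor_readings", batch_2, source="lab-a")
```

The point of the sketch is the shape of the interaction: the user hands over data and a name, and the bookkeeping (identity, provenance, history) happens for them.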
What needs to be recognized
While this all sounds great, the problem is that no such platform exists, so the question is how do we get there? In recent times I have been debating this with several individuals and organizations and have concluded that the following needs to be recognized:
Not an engineering problem
It must be realized that the absence of a dominant design for a data management platform is a strategic issue, not an engineering problem.
Expertise and resources
Big tech has the expertise and resources to develop and maintain such a platform and has viable internal development efforts underway to serve its own needs. Startups, by contrast, do not have the resources, experience, and user base to initiate it. To kick-start the effort, it should be "capitalized" with the contribution of one or more of the internal platforms currently under development within big tech.
Initial user base and adoption
If the platform is to have any sort of future, it will need a roadmap for widespread adoption – including meeting the internal requirements of big tech. The best way to obtain this adoption will be through collective collaboration on the platform itself. A big tech genesis will give the platform the necessary adoption due to large internal and external user communities. A large user base will, amongst other things, help to future-proof the platform, justify the cost of development and maintenance, and drive continued adoption of and investment in the platform.
It will also signal to the broader community that it is worth making the investment in learning and adopting the platform from 1) a people standpoint – think universities incorporating training programs pertaining to the platform within their curricula, and 2) a technical standpoint – think startups and technology providers supporting integration with the platform.
Collaborative economics
Data engineering challenges are a cost even for big tech, and collaborating on a dominant design for a common platform would reduce development costs for big tech and significantly more so for everyone else. The platform would not be a direct revenue driver, but rather a useful contributor to a revenue-generating ecosystem by increasing the application and adoption of ML. Collaboration at this level makes economic sense – coopetition.
Architectural debate in waiting
The choice between a Data-Oriented Architecture and a Microservice Architecture is a key point of discussion that may not be resolved, at least in the near term. Pursuing one or both is feasible as long as several firms are pursuing each.
Next steps
To move beyond talk, a growing group within the ML community is reviewing options to initiate the following:
- Establish an engagement mechanism and facilitate the coordination and collaboration of industry on the development and release of a data management platform.
- Recognize the needs and call out the requirements of a dominant design for a platform.
- Review current development efforts and database technologies, and share experience in building such platforms.
- Devise new solutions and a development roadmap that accounts for future challenges.
- Coordinate the development and release of a platform for the commons.
- Coordinate the maintenance and future development of the platform.
Summary
The data science and Machine Learning community, in collaboration with leading technology firms, needs to bring together existing development efforts to coordinate the eventual release of a data management platform for the commons.
A data management platform for ML that ensures the consistent and high-quality flow of data throughout the ML lifecycle will make building high-quality datasets and ML systems more repeatable and systematic.
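As a rough illustration of what "consistent and high-quality flow of data" can mean in practice, the sketch below checks a batch of records against declared rules before it moves downstream. The names `FieldRule` and `validate`, the `temp_c` field, and the chosen bounds are all assumptions made up for this example; they do not describe any particular platform or library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldRule:
    """Hypothetical declarative rule for one field of a record."""

    name: str
    dtype: type
    min_value: Optional[float] = None
    max_value: Optional[float] = None


def validate(records, rules):
    """Return a list of human-readable problems; an empty list means the batch passed."""
    problems = []
    for i, record in enumerate(records):
        for rule in rules:
            value = record.get(rule.name)
            # A missing field yields None, which fails the type check below.
            if not isinstance(value, rule.dtype):
                problems.append(f"row {i}: {rule.name} missing or not {rule.dtype.__name__}")
                continue
            if rule.min_value is not None and value < rule.min_value:
                problems.append(f"row {i}: {rule.name}={value} below {rule.min_value}")
            if rule.max_value is not None and value > rule.max_value:
                problems.append(f"row {i}: {rule.name}={value} above {rule.max_value}")
    return problems


# Assumed example: temperature readings expected to lie between -50 and 60 degrees C.
rules = [FieldRule("temp_c", float, min_value=-50.0, max_value=60.0)]
batch = [{"temp_c": 21.5}, {"temp_c": 412.0}, {"humidity": 0.4}]
problems = validate(batch, rules)
```

Declaring the rules as data rather than code is one way such checks could be made accessible to people who are not expert programmers, and running the same rules at every stage is what would make the flow repeatable.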
Now is the time for a united effort to make data engineering uncool again. It starts with recognizing that substantively more can be achieved by a group of organizations collaborating on a platform: requirements are broadly similar, and where there are differences within the group, those differences are shared by others around the world.
Related reading:
A review on data cleansing methods for big data https://www.sciencedirect.com/science/article/pii/S1877050919318885
An evolution of data-oriented programming https://tborchertblog.wordpress.com/2020/02/13/28/
Analyzing and Mitigating Data Stalls in DNN Training https://vldb.org/pvldb/vol14/p771-mohan.pdf
Benchmarks and Process Management in Data Science: Will We Ever Get Over the Mess? https://dl.acm.org/doi/10.1145/3097983.3120998
Challenges in Deploying Machine Learning: a Survey of Case Studies https://arxiv.org/abs/2011.09926v2
Cloudy with High Chance of DBMS: A 10-year Prediction for Enterprise-Grade ML https://arxiv.org/abs/1909.00084
Data Engineering for Everyone https://www.sigarch.org/data-engineering-for-everyone/
Data lifecycle challenges in production machine learning https://sigmodrecord.org/publications/sigmodRecord/1806/pdfs/04_Surveys_Polyzotis.pdf
Data Management Challenges in Production Machine Learning https://dl.acm.org/doi/10.1145/3035918.3054782
Data Platform for Machine Learning https://dl.acm.org/citation.cfm?doid=3299869.3314050
Data Scientists in Software Teams: State of the Art and Challenges https://ieeexplore.ieee.org/document/8046093
Data Validation for Machine Learning https://mlsys.org/Conferences/2019/doc/2019/167.pdf
Detecting data errors: Where are we and what needs to be done? https://dl.acm.org/doi/10.14778/2994509.2994518
Extending Relational Query Processing with ML Inference https://arxiv.org/abs/1911.00231
Firebird: Predicting fire risk and prioritizing fire inspections in Atlanta https://www.cc.gatech.edu/~dchau/papers/16-kdd-firebird.pdf
For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html
Hazy Research: Data-Centric AI https://github.com/hazyresearch/data-centric-ai
ItemSuggest: A Data Management Platform for Machine Learned Ranking Services https://research.google/pubs/pub47850/
Modern data oriented programming http://inverseprobability.com/talks/notes/modern-data-oriented-programming.html
Data readiness levels https://arxiv.org/abs/1705.02245
People, Computers, and The Hot Mess of Real Data https://dl.acm.org/doi/10.1145/2939672.2945356
Rethinking Data Storage and Preprocessing for ML https://www.sigarch.org/rethinking-data-storage-and-preprocessing-for-ml/
Rules of machine learning: Best practices for ML engineering https://developers.google.com/machine-learning/guides/rules-of-ml
Technology readiness levels for machine learning systems https://arxiv.org/abs/2101.03989
tf.data: A Machine Learning Data Processing Framework https://arxiv.org/pdf/2101.12127.pdf
Uber’s Big Data Platform: 100+ Petabytes with Minute Latency https://eng.uber.com/uber-big-data-platform/
Zipline: Airbnb’s Machine Learning Data Management Platform https://databricks.com/session/zipline-airbnbs-machine-learning-data-management-platform