Let’s start by affirming the power of AutoML tools. Any user, regardless of technical capability, can now set up model creation in a few minutes, a task that previously took expert data scientists hundreds of lines of Python. AutoML accelerates the process of stepping through feature engineering, trying many different algorithms, tuning parameters, and ultimately identifying an accurate model.
Consequently, it has become a crucial pillar of democratizing data science, as it abstracts away the coding and algorithmic function calls that stand between a user and a production-ready model. At Einblick, we’ve observed firsthand how our AutoML tools have empowered non-technical analysts and operations managers to start replacing "gut feel" with accurate models.
But for us, AutoML represents a tooling enhancement in service of accelerating model building and democratizing data science. It is not a magic wand that can be waved to instantly produce data science. A more realistic analogy might be that AutoML tools are electric can openers: they are hands-free and accomplish the job faster and more cleanly than manual cranking.
So an important reminder to organizational leadership: do not overinvest in technical AutoML solutions alone; invest at least as much in people and process.
1. Domain Knowledge Improves Inputs
Basic feature engineering and data cleansing capabilities do come baked into most AI / ML tools. Leading AutoML platforms (briefly & shamelessly plugging my own product, Einblick, here) include a similar set of candidate transformations: one-hot encoding (categorical variables to 1/0), imputation, scaling, ratios, and NLP text feature extraction. Fundamentally, though, these are "see-what-sticks" approaches.
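To make that "candidate transformation" menu concrete, here is a minimal sketch of the kind of automatic preprocessing these platforms typically enumerate, using scikit-learn and hypothetical column names (the exact set and implementation vary by tool):

```python
# Minimal sketch of the generic candidate transformations an AutoML tool
# might try automatically. Column names here are hypothetical.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "income": [52000, None, 71000, 39000],          # numeric, with a missing value
    "region": ["north", "south", "south", "west"],  # categorical
})

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # imputation
    ("scale", StandardScaler()),                    # scaling
])
categorical = OneHotEncoder(handle_unknown="ignore")  # one-hot encoding (1/0)

candidate_features = ColumnTransformer([
    ("num", numeric, ["income"]),
    ("cat", categorical, ["region"]),
])

print(candidate_features.fit_transform(df))
```

The tool tries each of these blindly and keeps whatever improves validation metrics, which is exactly why the human advantages below still matter.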
Human domain knowledge, by contrast, has a few comparative advantages that augment automatic feature engineering, including the following:
- Detection of real-world shifts in patterns: A human might recognize a shift in a dataset that corresponds to a nameable event: the organization launched a new initiative, there was a strategic pivot, a natural disaster struck, a financial crisis hit, etc. The model can only infer from the low-level statistical patterns in the data, whereas human intuition draws on a vast repository of additional knowledge to interpret it.
- Outlier identification based on expectation: An AutoML algorithm might identify values that fall outside of 3 standard deviations and eliminate them. However, as above, judging whether values are legitimate is a human task. Take a retail bank: a 900 credit score seems feasible, but it is not within the possible range of 300–850 for standard scoring. By contrast, a million-dollar checking account is rare and far above the average, but we immediately know it is possible. Domain knowledge is what allows an analyst to classify whether outlying values are legitimate.
- Intelligent and interpretable data transformation: A classic example is the relationship between weight and heart attacks. While [weight] positively correlates with [heart attacks], a better predictor might be [weight] / [height], since someone very tall and heavy is probably still healthy. Further subject matter expertise tells you that squaring the denominator yields Body Mass Index, a commonly used metric. A minimal sketch combining this with the credit score range check above follows this list.
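Here is that sketch, with hypothetical column names, of how an analyst might encode domain knowledge by hand: an explicit validity rule for the credit score range and the interpretable BMI transformation:

```python
# Sketch of hand-encoding domain knowledge; column names are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "credit_score": [712, 900, 640, 815],               # 900 is outside 300-850
    "checking_balance": [2300, 1_000_000, 410, 8800],   # rare but plausible value
    "weight_kg": [82, 110, 67, 95],
    "height_m": [1.75, 1.98, 1.62, 1.70],
})

# Outlier handling driven by expectation rather than standard deviations:
# a credit score outside 300-850 is a data error, so drop the row,
# while a million-dollar checking balance is unusual but legitimate, so keep it.
df = df[df["credit_score"].between(300, 850)].copy()

# Interpretable transformation: weight / height^2 is Body Mass Index,
# usually a better predictor of health outcomes than raw weight.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

print(df[["credit_score", "checking_balance", "bmi"]])
```

None of this logic is hard to write; the hard part is knowing which rules and ratios are meaningful, and that knowledge lives with the domain expert, not the tool.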
2. Explainability Tools Must Create Discussion and Then Iteration
Models are only helpful when they are implemented. Buy-in is won through being able to clearly communicate what a model is doing, address questions, and resolve any points of disagreement about input drivers and output implications.
AutoML tools do have a range of model explainability features available, from ranking variable importance and letting users perturb data, to partial dependence plots and visualizations of individual conditional expectation. But these are tools for a data scientist. They do not help disseminate information or explain the model to the wide range of stakeholders relevant to an analysis. Teams and tools must go beyond baked-in packages to facilitate better stakeholder understanding and drive iteration:
- Descriptive visualization of prediction results: While helpful, the pre-boxed outputs that represent the best of model explainability tools require either faith in the process without understanding, or too much prior data science knowledge. Instead, practical descriptive analytics should be run over the model's predictions. Do the values make sense, do the segmentations exist as I expect, are there any inexplicable patterns when I visualize predictions against key drivers? A variable identified as important can be confirmed via a quick normalized histogram of the target response variable broken out by the driver (see the sketch after this list).
- Rapidly evaluate the impact of changes: Models will need to run again and again and again. Based on the results, users should be able to nimbly jump back to data flow tasks to augment the dataset, move to descriptive visuals to check a hypothesis, or simply rerun the model without a bad variable. AutoML tools are a good way to find good models, but that does not mean they can knock out a problem in a single shot.
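As one example of the descriptive check mentioned above, here is a minimal sketch using hypothetical columns (a binary churned target and a tenure_months driver) and synthetic data so it runs end to end; a common variant of the normalized-histogram check is to plot the driver's distribution separately for each class of the target:

```python
# Sketch of a quick sanity check on a variable the model flagged as important.
# Column names ("churned", "tenure_months") are hypothetical; the data is
# synthetic only so the example runs.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"tenure_months": rng.integers(1, 72, size=500)})
# Synthetic target: shorter tenure -> higher churn probability.
df["churned"] = rng.random(500) < np.clip(0.6 - df["tenure_months"] / 100, 0.05, 0.9)

# Normalized histograms of the driver, broken out by target class:
# if tenure really matters, the two distributions should visibly separate.
for label, group in df.groupby("churned"):
    plt.hist(group["tenure_months"], bins=12, density=True, alpha=0.5,
             label=f"churned={label}")
plt.xlabel("tenure_months")
plt.ylabel("density")
plt.legend()
plt.show()
```

A plot like this is something any stakeholder can read, which is precisely what the built-in explainability widgets rarely deliver on their own.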
In summary, do not focus only on the automatic model while ignoring the need for human interactivity with the data after the model is created. Many AutoML workflows implicitly assert "trust us": if the statistics about the model look good, and it was generated by a smart tool, then surely it makes sense to implement! Data science democratization does not mean users should give up on creating well-understood, explainable models.
Originally published on Einblick: https://einblick.ai/automl-not-enough-citizen-data-science/
Try a new, more dynamic way to integrate AutoML in your data science workflow https://einblick.ai/try-einblick/