What’s auto ML?

Siobhan Cronin
Towards Data Science
4 min read · May 11, 2018


The purpose of computing is insight, not numbers — Richard Hamming (photo by CoWin)

I recently told a friend I was looking into auto machine learning (ML) projects, and he asked me if that meant machine learning for cars, so … I thought I’d write up a brief post.

Auto ML services provide machine learning at the click of a button or, at the very least, promise to keep algorithm implementation, data pipelines, and code in general hidden from view. If you mention this prospect to a crowd of engineers, some will say “hooray!” and some will say “oh no!”, and I suppose, until recently, I found myself in the middle of those exclamations (“orayoh!”).

I first played with an auto ML platform last year during a demo at IBM. Coming from iTerm, Python, and vim, I found it strange to be so removed from the code. But I could see how processes that had already been vetted in R&D could be streamlined with auto ML. Not every team is working on bleeding edge algorithm design, and so there is definitely room to optimize the process of making minor adjustments to already-verified models.

I could see how auto ML might situate itself in the life cycle of a growing ML/AI product, but could such a platform truly become “a data scientist’s best friend”? I wanted some other data points, so I reached out to my friend Pierre Forcioli-Conti, who was busy working on his startup Workbench, and we sat together in his office on Market Street and played with sample data in his environment while chatting about workflow. His vision was more of a collaborative common workspace for teams (like Google Docs for data projects), and I remember being particularly excited about the data parsing and cleaning features. But, for whatever reason, I returned home to my tools with no clear plan to integrate auto ML into anything I was building (although I did sneak off and play with auto-sklearn and TPOT a bit, just to see what they were up to).
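For anyone curious what that kind of open-source auto ML play looks like, here is a minimal TPOT sketch on a toy scikit-learn dataset; the parameter values are illustrative placeholders, not tuned settings.

```python
# A minimal TPOT sketch: evolve a scikit-learn pipeline on a toy dataset.
# Parameter values here are illustrative placeholders, not recommendations.
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    *load_digits(return_X_y=True), test_size=0.25, random_state=42
)

tpot = TPOTClassifier(generations=5, population_size=20,
                      verbosity=2, random_state=42)
tpot.fit(X_train, y_train)            # genetic search over candidate pipelines
print(tpot.score(X_test, y_test))     # held-out accuracy of the best pipeline found
tpot.export('best_pipeline.py')       # dump the winning pipeline as plain sklearn code
```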

Fast forward to this week, where I saw a post about H2O from someone I trust in data science, Erin LeDell. I followed her link to the product, and looking into the docs, I found a good amount of support for different languages and cloud integration, which was encouraging to see.

The install was painless, and the interface was approachable, situated at the intersection of data analysis packages like Pandas and modeling tools like scikit-learn, TensorFlow, and Keras, with a Jupyter-like cascading stream of cells. Both working with the localhost interface (putting together processing steps in a document called a “Flow”) and importing the software directly into the Python shell felt more intuitive and familiar than what I’d encountered with the ML tools on AWS or IBM’s Watson Studio.
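To give a sense of what importing the software directly into the Python shell looks like, here is a rough sketch of the H2O workflow as I encountered it; the CSV path and the “target” response column are hypothetical placeholders.

```python
# A rough sketch of driving H2O from Python rather than the Flow UI.
# The CSV path and column names below are hypothetical placeholders.
import h2o
from h2o.automl import H2OAutoML

h2o.init()                                   # start (or attach to) a local H2O cluster
frame = h2o.import_file("my_data.csv")       # data lands in the cluster as an H2OFrame
train, test = frame.split_frame(ratios=[0.8], seed=1)

aml = H2OAutoML(max_models=10, seed=1)       # let AutoML search over models
aml.train(y="target", training_frame=train)  # "target" is the response column
print(aml.leaderboard)                       # models ranked by cross-validated metric
```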

I did find myself running a clock in my head, tracking “how long would this usually have taken me?” I imagine one could learn the idiosyncrasies of H2O fairly quickly, and then from that vantage point gauge where the tool might best fit into one’s toolkit. I’m personally interested in hyperparameter optimization for neural networks, particularly for larger deep learning projects, and found myself imagining how that exploration might go. What kinds of bugs would I run into? What kind of support would I find? If H2O really could be one-stop shopping for data munging, EDA, modeling, and processing on GPUs, then I could totally see moving H2O more front and center on my desktop.
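As a concrete example of the kind of hyperparameter exploration I have in mind, here is a hedged sketch of a random grid search over H2O’s deep learning estimator; the hyperparameter ranges and the “target” column are invented for illustration, and the `train` frame is the one from the earlier sketch.

```python
# Sketch of hyperparameter search for a neural network in H2O.
# Hyperparameter ranges and the response column are illustrative only.
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from h2o.grid.grid_search import H2OGridSearch

hyper_params = {
    "hidden": [[32, 32], [64, 64], [128, 64, 32]],  # layer sizes to try
    "activation": ["Rectifier", "Tanh"],
    "l2": [0.0, 1e-4, 1e-3],
}
search_criteria = {"strategy": "RandomDiscrete", "max_models": 20, "seed": 1}

grid = H2OGridSearch(model=H2ODeepLearningEstimator(epochs=10),
                     hyper_params=hyper_params,
                     search_criteria=search_criteria)
grid.train(y="target", training_frame=train)        # 'train' from the earlier sketch
print(grid.get_grid(sort_by="logloss", decreasing=False))
```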

I worry that teams will use auto ML to hunt and peck for models that can deliver “results.” Machine learning is an art as much as a statistical modeling discipline, and while I believe deeply in the creative exploration aspect of what we do, what kinds of guideposts and sanity checks should we stitch into auto ML processes to ensure we keep asking strong questions as our pipelines become more abstracted? Perhaps Clippy has a model-minded cousin who could pop up periodically and ask things like, “Does this data representation match your intuitions?”, “Are we at risk of over-inflating the error here?”, or “How are you interpreting these performance results?” Having access to the code alone doesn’t answer such questions, and perhaps abstraction will free up more capacity to ask them, but my concern is that over time such questions might start to generate responses like, “I don’t know, that was the button I was told to press.” This is not a caution for H2O alone, but for all auto ML enterprises, and for all of us charged with ensuring we use mathematical instruments effectively.

I see H2O coming out ahead on cloud integration, which signals a prioritization of customers of a certain scale (or with plans of scaling), and they definitely have their ducks in a row by not assuming any particular stack (offering their service to Java, Python, and R communities alike). All in all, I feel more of a “hooray” than an “oh no” about auto ML, and look forward to seeing how teams will use these tools in their product cycle.


I write about engineering, machine learning, and data stewardship. Advisor @landedhomes