Doing ML effectively at startup scale

MLOps without Much Ops

A mini-series with Ciro Greco and Andrea Polonioli

Jacopo Tagliabue
Towards Data Science
5 min read · Sep 22, 2021


Life at “reasonable scale”

If you do not work for Big Tech (the Googles, Facebooks, and Amazons of this world), chances are that you work for a “reasonable scale” company.

Reasonable scale companies aren’t like Google. They can’t hire all the people they dream of, and they don’t serve billions of users per day from cloud infrastructure they own. Reasonable scale companies process millions of data points, not billions; they can hire dozens of data scientists, not hundreds; and they have to keep their computing costs in check.

At the same time, reasonable scale companies have plenty of interesting business problems that could be addressed with Machine Learning. In fact, it would make total sense to address them with Machine Learning, and many of these companies may already be trying. It is just hard to implement the right processes when you have constraints on talent, budget, and data volume.

[Figure: the conceptual area covered by reasonable scale use cases. Advanced AI startups with small teams (bottom right) and late, bigger adopters starting to develop an ML roadmap (top left) are the ideal targets for MLOps without much Ops. Image by Authors.]

Truth is, outside of Big Tech and advanced startups, ML systems are still far from producing the promised ROI: it takes on average 9 months for AI projects to go from pilot to production, and Gartner is betting on the year 2024 (!) for enterprises to shift from pilots to operationalization. As the number of AI projects continues to rise, the need for a mature MLOps approach becomes more and more evident: since the opportunity cost of making the wrong choices in such a foundational area may cripple even the best business, it is crucial for executives and managers to understand the implications of a sound MLOps strategy. And now is the perfect time to look into it.

Over the past few years, we have learnt that positive ROI from ML can be achieved even at a reasonable scale. We know it from the somewhat privileged position of helping with the digital transformation of hundreds of mid-to-large enterprises. Most importantly,

we know it because we are a reasonable scale company ourselves.

We have decided to share what we have learnt along the way in a multi-part series, focused on how to build and scale ML systems to deliver faster results in the face of the above constraints: small ML teams, limited budget, terabytes of data. Our aim is to provide you with a playbook of proven best practices to successfully navigate the rapidly evolving MLOps landscape when designing a production system end-to-end.

There are already many tutorials on tool X or framework Y (including our own!), but (for good pedagogical reasons) they focus on tools in isolation, often in toy-world scenarios. We decided to take a longer, but hopefully more rewarding, route here: our discussions are by design a bit more nuanced, and combine evidence from scholarly papers, open source code, and first-hand startup experience.

Our (ambitious) goal is to provide you with a template for building an AI company, not a tiny feature.

Building AI effectively with open-source and SaaS

The idea at the heart of this series can be stated very concisely:

to be ML productive at reasonable scale, you should invest your time in your core problems (whatever those might be) and buy everything else.

While stating the main principle is easy, living life at the reasonable scale involves all sorts of subtle ramifications, from competing for talent to keeping the P&L in check. The corollary of our principle is that we should do everything in our power to abstract infrastructure away from ML developers. Since we are dealing with reasonable scale, there is not much value in devoting resources to deploying and maintaining functionalities that today can be found as PaaS/SaaS solutions (e.g. Snowflake, Metaflow, SageMaker).
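As a concrete (if simplified) taste of what this abstraction looks like in practice, here is a minimal sketch using Metaflow; the flow name, placeholder data, and resource numbers are our own illustrative choices, and the @batch decorator assumes a Metaflow deployment backed by AWS Batch:

    # A minimal, illustrative Metaflow flow: the training logic is a placeholder,
    # and @batch assumes an AWS Batch-backed Metaflow deployment.
    from metaflow import FlowSpec, step, batch

    class TrainFlow(FlowSpec):

        @step
        def start(self):
            # Runs locally: lightweight data loading (placeholder data here).
            self.data = list(range(100))
            self.next(self.train)

        @batch(cpu=4, memory=16000)  # same code, but this step runs on AWS Batch
        @step
        def train(self):
            # Placeholder "training": swap in your real model fitting.
            self.score = sum(self.data) / len(self.data)
            self.next(self.end)

        @step
        def end(self):
            print(f"score: {self.score}")

    if __name__ == "__main__":
        TrainFlow()

Launching it is a one-liner (python train_flow.py run, assuming the file is named train_flow.py): the same Python moves from a laptop to the cloud without the developer ever touching a server.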

But given that MLOps is a brand new field, executives and managers are not always fully aware of these ramifications: we designed this series with a number of different personas in mind, allowing us to explore ML productivity from different angles. As a small teaser, these are some of the focus areas:

  • ML tools for everyone: MLOps grows from the seeds of a 20-year-long effort to embrace open source technology. Today, open source adoption is accelerating across enterprises, making it increasingly easy for small teams to be extremely productive at scale. And yet, the idea of using open source tools is often frowned upon by team leaders and execs: we will explain how companies and business leaders can de-risk their open source strategy and get the most mileage out of open source software.
  • Less-is-more: by trading off more computing for significantly less human effort, we argue that a small, happy ML team is significantly better than a bigger, less focused group. In other words, a possibly larger AWS bill is often offset by a higher retention rate and greater ML productivity. The implications are far-reaching: for instance, traditional metrics such as R&D headcount may need to be re-assessed, and a modern MLOps approach may render traditional R&D benchmarks virtually obsolete.
  • Empowered developers grow better: recruiting and retention in a competitive market are constant challenges for companies, especially in ML. As it turns out, one of the main reasons for ML practitioner turnover is that they spend a sizable portion of their time on low-impact tasks, such as data preparation and infrastructure maintenance.

Curious to learn more?

We will explore all aspects of ML productivity at reasonable scale over the next few months. Follow us on Medium or LinkedIn for the latest updates!

Jacopo Tagliabue, Ciro Greco and Andrea Polonioli.

FAQs

  • Do you have a TL;DR version? Not really, which is why we are starting a (small) series: we really tried to fit everything in one post, but our first readers agreed it was just too dense to make sense. If you want something to listen to while you are running, some of these themes were anticipated in our MLOps at Reasonable Scale talk at Stanford MLSys.
  • Can cutting-edge ML really live outside of Big Tech? Yes, it really can. Creating a self-service system for team members and external collaborators to spin up GPUs, execute queries, and share findings through endpoints (not slides!) is a great way to do product development and research at a fast pace (see the sketch after this list for a taste of the endpoint idea).
  • Talk is cheap, show me the code! If you are impatient and want to jump directly to the engineering side of it, we shared an open source repository implementing our principles, from parsing raw data to serving predictions (note: the project works under realistic data load, thanks to a massive eCommerce dataset we recently released).
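To make the “endpoints, not slides” idea concrete, here is a minimal sketch of putting a finding (say, a model’s prediction) behind an HTTP endpoint; FastAPI, the stub scoring function, and the route name are our own illustrative choices, not necessarily what our repository uses:

    # Illustrative only: a stub "model" served over HTTP so colleagues can poke at it.
    # Run with: uvicorn serve:app --reload  (assumes this file is named serve.py)
    from fastapi import FastAPI

    app = FastAPI()

    def predict(query: str) -> float:
        # Placeholder scoring logic: replace with a real model call.
        return len(query) / 100.0

    @app.get("/predict")
    def predict_endpoint(query: str):
        # Colleagues hit /predict?query=... instead of reading a slide deck.
        return {"query": query, "score": predict(query)}

A shared, always-on endpoint like this lets anyone on the team interrogate a result directly, which is far closer to the self-service spirit we describe above than a static deck.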

Acknowledgments

We wish to thank Ville, Savin and Oleg for precious feedback on previous iterations of the project; Piero Molino and the Stanford MLSys group for inviting us to a great session; Mike Purewal, who graciously rejected our first draft and pushed us to be better; and finally Luca Bigon, who lives and breathes the reasonable scale.

Of course, this series wouldn’t be possible without the commitment of our open source contributors.
