I am probably a Machine Learning Engineer. This field is quite new and the job titles still feel a bit like a joke to me. The boundaries between a Data Scientist, an AI Specialist and an ML Engineer are quite blurred, and the one you choose to put on your LinkedIn mostly follows your personal preference, or the preference of whoever hired you. Whatever you say you are when you first introduce yourself to people doesn't change your day-to-day job much in the vast majority of companies: you collect data, train models, test their performance and eventually, if you are lucky (yes, more than good), you deploy them, probably on the cloud. In the last couple of months, though, I have been focusing more and more on the technical aspects of my company's Machine Learning products than on their task-specific performance. My team and I have put a lot of effort into getting them into good shape, often without any real impact on their final users or on our business (directly, at least).
If I look back to when this whole journey started, it is really impressive [where](https://towardsdatascience.com/detecting-scams-using-ai-for-real-6c96e712cf09) we currently are, considering both the impact we have as a team on our operations and the technical maturity of our products and people. It didn't start like this, though, and to get to this point we went through a very linear progression that allowed us to gain traction, support and knowledge. This post won't cover all the preliminary steps needed to have something ready to be deployed (I have already done that here, here and somehow also here), but will rather try to shed some light on what happens next and on what worked for us in getting to the stage where we happily are now.
Deploying the model
Getting off the ground as soon as possible is key. The goal of this first stage should be to allow the stakeholders to evaluate the model's impact, and the deployment doesn't need to be fancy to achieve that. The only thing that cannot be overlooked is a solid process for collecting and comparing business metrics once the model is live. Even the best model in the history of Machine Learning won't have any success in industry if there is no way to prove it is useful. Very often these metrics are not even technical, but are more likely to be related to the efficiency of a team or a department. In our case, for example, we allowed our CS (Customer Solution) department to spend 50% less time on a task and increased their effectiveness by 80%. And the classifier's F1 was quite poor at that time, to be honest. Ideally, we wish to have a perfectly deployed model in a completely controllable and measurable environment, but this is never the case, especially in the early stages of a data team. In this bargaining scenario, it is better to trade off on the quality of the deployment than on the ability to measure it.
Ideally, we wish to have a perfectly deployed model in a completely controllable and measurable environment, but this is never the case, especially in the early stages of a data team.
The amount of good engineering practice you can ignore depends heavily on the engineering culture of your company. I mean, a Machine Learning model is a Python function. In theory, you need 30 seconds to expose it as a Flask API on a local machine by enabling port forwarding on your router. Luckily, our Ops team is very strong and did a great job in allowing us to employ some cutting-edge cloud technologies quite easily.
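Just to give an idea of how small that quick-and-dirty version is, here is a minimal sketch, assuming a scikit-learn-style model saved with joblib (the artefact name and request schema are made up for illustration):

```python
# Minimal sketch of the "Python function behind a Flask API" idea.
# model.joblib and the request schema are hypothetical.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # any scikit-learn-style estimator

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()  # expects {"features": [[...], ...]}
    predictions = model.predict(payload["features"])
    return jsonify({"predictions": predictions.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```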
Our first deployment consisted of a single pod on Kubernetes. The pod contained three different containers: the API, an async task queue (Celery) and a Redis database to allow communication between the two.
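In rough terms, the API container only enqueues a job and the Celery worker does the heavy lifting, with Redis acting as the broker; since containers in the same pod share the network namespace, the broker is reachable on localhost. A hedged sketch of the worker side (the task name and model file are assumptions, not our actual code):

```python
# Sketch of the Celery worker container: it loads the model shipped inside the
# image and runs predictions off the API's request/response cycle.
# The API container would enqueue work with predict_task.delay(features).
import joblib
from celery import Celery

celery_app = Celery(
    "predictions",
    broker="redis://localhost:6379/0",   # Redis runs in a sibling container of the same pod
    backend="redis://localhost:6379/1",
)
model = joblib.load("model.joblib")      # hypothetical artefact baked into the image

@celery_app.task
def predict_task(features):
    return model.predict(features).tolist()
```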

The model used in the prediction tasks was stored in one of the Docker images. We didn't have any solid process in place to retrain it and, when we did, we had to build and push the whole image again. We were not collecting any technical metrics (number of executed jobs, prediction job latency, etc.) and we didn't have any staging environment, releasing directly in production. I am not proud of it. But honestly, we never felt the need for more. If we had tried to fix all these possible issues before the first deployment, we would probably never have had the impact we had, and we would have lost motivation and traction along the way.
Becoming business critical
If all went well and your hypothesis of having deployed something actually useful was true, it is very likely that your model (now a product, which sounds fancier) will become business critical. Granted, business critical is quite a buzzword. The easiest way to understand whether it is already the case is to think about what happens if your product, for some reason, messes up. If the answer is bad, you are somehow business critical. We crossed that line as soon as we integrated our technology with our core platform. If you break something on the platform, users can feel it.
The easiest way to understand if your product is business critical is by trying to think about what happens if it messes up for some reason.
The first thing we did was to start releasing on a staging environment. Testing Machine Learning products is tricky and it is probably impossible to fully cover the use cases before safely releasing in production, but being sure that all the technical components and the basic flows work fine is crucial to avoid serious issues: if something breaks in staging, no one will notice it. Sounds obvious, of course, and it is, but it would have been useless to engineer something like this when we weren't even sure our deployment would survive the month. It is interesting that the engineering best practices developed in the last 30 years not only fit perfectly in this new domain, but that we actually felt the need for them along the process. We didn't employ them just because someone asked us to do so.
It is interesting that the engineering best practices developed in the last 30 years not only fit perfectly in this new domain, but that we actually felt the need for them along the process.
Along the same lines is monitoring. We already had a very robust strategy in place to assess business-related performance, as mentioned, but we were completely unaware of the state of our deployment engineering-wise. How long does it take to complete a prediction job? Do we need to add more resources to the pod? How many jobs are executed per day? Is the distribution of the predicted labels roughly constant? For a classical API these metrics are collected from the beginning, and it is really bad practice to release anything without them. We felt the urgency along the way, and we got to the same final result.

In our company we were already using Prometheus and Grafana for monitoring, and it made sense of course to use those tools for our projects as well. The ephemeral nature of the jobs required a bit of engineering effort, though: we had to deploy and take ownership of a Prometheus Pushgateway to store these metrics and expose them to the shared Prometheus server. With the help of the QA team we then defined alerts and created some nice graphs. We now employ the best strategies in order not to mess up, but if it happens we are at least the first to know.
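To give an idea of how the Pushgateway fits in, here is a minimal sketch of pushing job metrics from a short-lived worker with the official prometheus_client library (the gateway address, job name and metric names are assumptions):

```python
# Sketch: pushing metrics from an ephemeral prediction job to a Pushgateway,
# so the shared Prometheus server can still scrape them after the job exits.
import time
from prometheus_client import CollectorRegistry, Counter, Gauge, push_to_gateway

registry = CollectorRegistry()
jobs_total = Counter("prediction_jobs_total", "Executed prediction jobs", registry=registry)
job_latency = Gauge("prediction_job_latency_seconds", "Latency of the last prediction job", registry=registry)

start = time.time()
# ... run the prediction job here ...
jobs_total.inc()
job_latency.set(time.time() - start)

# "pushgateway:9091" is a placeholder for wherever the Pushgateway is reachable
push_to_gateway("pushgateway:9091", job="prediction_worker", registry=registry)
```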

Decoupling the model
The biggest asset of a data product is the model powering it. After a couple of different ML deployments we realised that, apart from a small custom part mainly related to the sample creation process, the vast majority of the code needed to set up the server and the prediction process is pretty much the same. On another level, deploying the models with Docker was also causing a lot of useless work: we had to build big images defined in complex Dockerfiles just to change a small set of resources (the model and its related files) that does not strictly depend on the way it is served. The model and the prediction logic can be decoupled and their release process should be different and not synchronised.
The model and the prediction logic can be decoupled and their release process should be different and not synchronised.
Therefore, we decided to split the API and task queue from the model and its related files. We improved our code to retrieve these resources asynchronously when the prediction process starts, fetching them from a cloud object storage (GCS, in our case). Of course, this improvement also required a new image for the processing logic, but we haven't released any new version since then. Serving a new model is now simply a matter of storing it in the bucket and raising a PR in our configuration repository, employing the same GitOps approach we used during the whole evolution of this project.
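As a rough sketch of the idea, using the google-cloud-storage client and a joblib artefact (bucket and object names are made up for illustration):

```python
# Sketch: the prediction service no longer bakes the model into its image.
# At startup (or after a config-driven restart) it pulls the artefact referenced
# by its configuration from GCS and loads it from local disk.
import joblib
from google.cloud import storage

def load_model(bucket_name: str, blob_name: str, local_path: str = "/tmp/model.joblib"):
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.download_to_filename(local_path)
    return joblib.load(local_path)

# bucket and object names would come from the GitOps-managed configuration
model = load_model("my-models-bucket", "classifier/2021-05-01/model.joblib")
```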
This step has empowered us to retrain the models as often as we want with very little effort. The task of releasing a new model version, though, is where deployments in this domain really differ from classical engineering deployments. It is quite straightforward (even if complex and somewhat boring) to describe flows and write tests: if a release candidate passes them, it is good to go. For data products it is very different. It is of course possible to check whether the job scheduling logic behaves as expected, but telling whether a listing is a scam is a completely different matter.
What we needed at this point was a way to be sure (enough) that a retrained model would perform on live data at least as well as the currently deployed one, ideally better. Of course, serving the models in parallel for a couple of days and monitoring their performance worked fine, but the time and effort needed to perform the task frequently started to become a bottleneck for our team. In the majority of our settings a simple offline performance estimation is not enough to validate the model, since the data sent for prediction can differ from what can be retrieved from our data sources some days later. We designed a task-specific testing framework that exposes the retrained model to the same data the deployed model received. By choosing the training dates carefully we can simulate the model's release and get a very reliable estimate of the actual performance of the retrained version in 10 seconds instead of 2 days. A very granular logging policy has been crucial to achieve this: all the stages and results of the prediction process are stored and can be accessed when needed.
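The core of such a release simulation can be quite small. A hedged sketch of the comparison step, assuming the logged live payloads, their later-verified labels and the deployed model's logged predictions can be loaded from our stores (the function name and the promotion rule are illustrative, not our actual framework):

```python
# Sketch: replay the payloads the deployed model actually received against the
# retrained candidate, and compare both models on the labels verified afterwards.
from sklearn.metrics import f1_score

def simulate_release(candidate, logged_payloads, verified_labels, deployed_predictions):
    """Return the candidate's F1 on replayed live data and the deployed model's F1
    on the same samples, so the two can be compared before promoting the candidate."""
    candidate_predictions = candidate.predict(logged_payloads)
    return (
        f1_score(verified_labels, candidate_predictions),
        f1_score(verified_labels, deployed_predictions),
    )

# Usage (with the candidate model and the replayed data loaded elsewhere):
# candidate_f1, deployed_f1 = simulate_release(candidate, X_logged, y_verified, y_deployed)
# should_release = candidate_f1 >= deployed_f1
```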
We scheduled a bi-weekly cronjob that creates a pod and retrains the model, saving it only if the release simulation goes well and notifying the team of the results. The new model is then released (staging first, production later) by editing the configuration file on Kubernetes. The pod containing the prediction logic is restarted and fetches the retrained version from the bucket, completely agnostic about its content.

Our current setup allows us to fully focus on improving the actual Machine Learning engine, with a very clear separation of responsibilities. This whole process took roughly 8 months, and we managed to keep serving our users without any downtime or delay. How much of all this has been noticed by them? Zero, probably. Machine Learning engineering, if it exists, is probably this: doing stuff on data products that end users don't notice but that makes the world a better place to live in.
Machine Learning engineering, if it exists, is probably this: doing stuff on data products that end users don't notice but that makes the world a better place to live in.
A lot of this stuff wouldn’t have been possible without the help of Duc Anh Bui.