
Cooking (S)REcipe for Data Science

How to integrate Site Reliability Engineering (SRE) into your Data Science teams, products and services

Aishwarya Prabhat
Towards Data Science
11 min read · Nov 2, 2020


Disclaimer: All opinions expressed are my own. Brace yourself for some very tasteful cooking puns!

In this article I juice out key ideas from Site Reliability Engineering: How Google Runs Production Systems & The Site Reliability Workbook and blend them with Data Science. I make an argument for why implementing Site Reliability Engineering (SRE) practices will not only enhance the reliability of your Data Science products, but will also increase the overall effectiveness of your Data Science team. I then detail the what and how of integrating SRE into Data Science teams, products and services.

While I encourage you to read through the whole article, feel free to sample from all over this platter of ideas! Here’s what’s on the menu:

Menu

0. Setting the table: A running example

  1. Starters/Appetizers: The starting questions
    1.1 What is SRE?
    1.2 Why do I want SRE in my Data Science product lifecycle?
    1.3 Can I implement SRE without hiring dedicated Site Reliability Engineers?
  2. Key ingredients: The critical components of SRE
    2.1 Service Level Objectives (SLOs)
    2.2 Service Level Indicators (SLIs)
    2.3 Error Budget
    2.4 Error Budget Policy
    2.5 Monitoring
    2.6 Alerting
    2.7 Supporting Processes
  3. Put on your thinking “toque blanche”: SRE guiding principles to keep in mind
    3.1 Some SRE is better than no SRE
    3.2 Things WILL go wrong, all we can do is prepare!
    3.3 SRE metrics are not instruments to play the blame game
  4. Running a kitchen without chefs: Implementing SRE without dedicated Site Reliability Engineers
    4.1 Start simple, start small
    4.2 Adopt a shared responsibility model
    4.3 Eliminate toil through automation from the get-go
  5. Takeaways: Some final thoughts

0. Setting the table

A running example

To illustrate key ideas, I will be using the following running example:

[Diagram of the sandwich classification service, created by the author using draw.io]

You are the proud product owner of a website that allows users to upload images of food items and determine, using your state-of-the-art Deep Learning model, whether or not the food item is a sandwich. You would like to integrate SRE into the product lifecycle of your sandwich classification service. For this article, we will ignore front-end-related SRE concerns and focus on the back-end serving the Machine Learning model.

1. Starters/Appetizers

The starting questions


1.1 What is SRE?

SRE is a rather broad concept, so if you try googling “SRE”, you might be disappointed to find that there isn’t a clear, agreed-upon functional definition. So let me give you my definition, one we can get to work with right away:

Site Reliability Engineering refers to a framework of principles, practices and processes, with the primary goal of enhancing the reliability of (digital) products and services.

1.2 Why do I want SRE in my Data Science product lifecycle?

  • To enhance the reliability and robustness of Data Science products
    By having solid SRE principles and practices in place, you can take a methodical, proactive approach to increasing the robustness of your product, as opposed to reacting chaotically when something goes wrong.
  • To prioritise between maintaining something old and developing something new
    SRE is a very numeric, data-driven approach to software operations. Using SLOs, error budgets, error budget policies, monitoring and alerting, you have very clear criteria to determine whether your team should dedicate more time to maintaining and improving the sandwich classifier or develop a new feature for classifying other types of food items.
  • To improve the quality of the product development process
    I argue that, as you inculcate an SRE culture in your team, its development practices and culture will improve. As developers become more “SRE aware”, they will tend to optimize their development practices around building robust and reliable Data Science products, down to the micro level of writing code, conducting code reviews and coming up with QA test cases.

1.3 Can I implement SRE without hiring dedicated Site Reliability Engineers?

Short answer: Yes!
Long answer: Look at section 4 (Running a kitchen without chefs) but read through section 2 first.

2. Key ingredients

The critical components of SRE


The following are the key concepts/components that you need to get familiar with:

2.1 Service Level Objectives (SLOs)

SLO is arguably the most important concept in SRE. SLOs specify a target level for the reliability of your service and are typically defined with an implied period of time. SLOs are customer-centric thresholds that quantify the level of reliability at which your customers will be happy with your Data Science product. For example, in the case of the sandwich classification service, we could have the following 2 SLOs over a window of 1 month:

  • Availability: 99.9% availability
  • Latency: 99% of requests served within 500ms
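
To make these targets concrete, here is a minimal sketch of how the two SLOs could be written down as configuration in Python. The dataclass and field names are my own illustrative choices, not part of any particular SRE tool.

```python
# Illustrative only: a code-level statement of the two SLOs above.
from dataclasses import dataclass

@dataclass(frozen=True)
class SLO:
    name: str          # human-readable name of the objective
    target: float      # fraction of "good" events required, e.g. 0.999
    window_days: int   # rolling window over which the target applies

SANDWICH_CLASSIFIER_SLOS = [
    SLO(name="availability", target=0.999, window_days=30),
    SLO(name="latency_under_500ms", target=0.99, window_days=30),
]
```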

2.2 Service Level Indicators (SLIs)

SLIs are specific metrics that help to quantify SLOs. An SLI is typically a ratio of two numbers:

(number_of_good_things_happening)/(number_of_total_things_happening)

For the sandwich classifier our SLIs would look like:

  • Availability: Ratio of non-5XX API requests to total requests
  • Latency: Ratio of requests faster than 500ms to total requests

Such metrics make good SLIs because they are intuitive — 0% means nothing is working and 100% means everything is working.
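
As a rough illustration of how such ratios could be computed, here is a minimal Python sketch that derives both SLIs from a handful of hypothetical request-log entries. The log format and field names are assumptions made for the example; in a real deployment these numbers would come from your monitoring system rather than from raw logs in application code.

```python
# Hypothetical request-log entries for the sandwich classifier back-end.
requests = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 640},
    {"status": 503, "latency_ms": 900},
    {"status": 200, "latency_ms": 210},
]

def availability_sli(reqs):
    """Ratio of non-5XX responses to total responses."""
    good = sum(1 for r in reqs if r["status"] < 500)
    return good / len(reqs)

def latency_sli(reqs, threshold_ms=500):
    """Ratio of requests served faster than the latency threshold."""
    fast = sum(1 for r in reqs if r["latency_ms"] < threshold_ms)
    return fast / len(reqs)

print(f"Availability SLI: {availability_sli(requests):.1%}")  # 75.0%
print(f"Latency SLI:      {latency_sli(requests):.1%}")       # 50.0%
```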

2.3 Error Budget

Error budget = 100% - SLO. For the sandwich classifier, given that the monthly availability SLO is 99.9%, the error budget is 0.1% (43 minutes and 12 seconds). Error budgets are the metrics around which our alerting and error budget policy will be set up. Error budgets are also useful in other ways. For example, we can quantify how bad an outage was based on how much of the monthly error budget the incident exhausted.
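
The arithmetic behind that 43-minute figure is simple enough to sketch directly. The snippet below assumes a 30-day month and a purely time-based budget, which is a simplification of how real SRE tooling tracks error budgets.

```python
# Back-of-the-envelope error budget math for the availability SLO.
slo_target = 0.999                      # 99.9% monthly availability SLO
window_minutes = 30 * 24 * 60           # 43,200 minutes in a 30-day month

error_budget_fraction = 1 - slo_target  # 0.1% of the window may be "bad"
error_budget_minutes = window_minutes * error_budget_fraction

print(f"Error budget: {error_budget_minutes:.1f} minutes per month")
# -> 43.2 minutes, i.e. 43 minutes and 12 seconds

# How much of the budget would a 20-minute outage burn?
outage_minutes = 20
print(f"Budget consumed by outage: {outage_minutes / error_budget_minutes:.0%}")
# -> ~46%
```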

2.4 Error Budget Policy

According to the Site Reliability Workbook, an error budget policy defines specific steps that will be taken by the various stakeholders once the error budget is exhausted or is close to exhaustion. For our sandwich classifier example, some key actions in the error budget policy might include the following:

  • The Data Scientists/Machine Learning Engineers developing the model will prioritize fixing production bugs until the service is back within budget, instead of developing the new model for classifying pizzas
  • A code freeze will be enforced to halt any further changes to the production deployment to minimize further risk. Changes will only be allowed once there is sufficient error budget.

2.5 Monitoring

Monitoring typically includes dashboards displaying information about critical metrics in a human-readable time-series form. Monitoring dashboards should help you answer 3 main questions in the event of an incident: what happened, when did it happen and why did it happen? The following is a list of metrics/components that you should be monitoring (a minimal instrumentation sketch follows this list):

  • SLIs and error budgets: For our sandwich classifier, we should have dashboards to display the number of requests grouped by status codes (2XX, 4XX, 5XX), p99 latency and remainder of error budget for the month.
  • Model performance metrics: For our sandwich classifier, we should monitor the ratio of images classified as sandwich vs not-sandwich. This will help in checking whether the model is overfitting/underfitting and whether there is a need to retrain the model on a better dataset or replace it altogether. (For a use case like fraud detection, this would be absolutely critical!)
  • Dependencies: If the sandwich classifier relies on an external database, the dashboard should monitor the health and performance of this database.
  • Intended changes: If the sandwich classification model has been retrained to classify hotdogs as sandwiches, then we should capture and confirm the successful deployment of this change through a graph on the dashboard.
  • Traffic & resource usage: The dashboards should capture traffic, throughput and resource usage such as CPU and memory. This is to ensure that if the usage is increasing, as would be the case for a typical new Data Science product, resources are scaled promptly to handle larger traffic. Traffic dashboards can also help to identify traffic patterns, in case you would like to employ some form of autoscaling.

2.6 Alerting

Incidents trigger alerts, alerts trigger action. The goal of alerting is to notify the right person about the right thing when a significant event happens. For example, if the latency suddenly spikes after deploying a new sandwich classification model and remains high for half an hour, the necessary stakeholders should be alerted and prompted to take mitigative action. Alerts rely on monitoring and thresholds to determine when to alert, whom to alert and what to alert about.
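
To make the idea concrete, here is a toy burn-rate check in Python. In practice this logic usually lives in your monitoring stack’s alerting rules rather than in application code, and the notify() hook and thresholds below are purely illustrative.

```python
# A toy burn-rate alert check; thresholds and the notify() hook are illustrative.
def burn_rate(bad_events, total_events, slo_target=0.999):
    """How fast the error budget is being consumed: 1.0 means exactly on budget;
    values above 1.0 mean the budget will run out before the window ends."""
    if total_events == 0:
        return 0.0
    error_rate = bad_events / total_events
    return error_rate / (1 - slo_target)

def notify(message):
    # Placeholder: page the on-call person via your alerting tool of choice.
    print(f"ALERT: {message}")

# e.g. 300 failed requests out of 20,000 in the last hour
rate = burn_rate(bad_events=300, total_events=20_000)
if rate > 10:  # fast-burn threshold, chosen arbitrarily for illustration
    notify(f"Availability error budget burning {rate:.1f}x faster than sustainable")
```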

2.7 Supporting Processes

There are 3 key processes that are typically required to maximise the utility yield of other SRE components:

  • Incident escalation protocol: It is ideal to have a well-laid-out plan for how to approach a production incident. This protocol should include details like who (developer, tech lead or product owner?) should take action when (10%, 25%, 50% of error budget exhausted?) and what specific actions should be taken. In my opinion, this information should also be incorporated into the error budget policy and should be tied closely to the alerting as well.
  • Postmortems: Blameless postmortems are critical to learning from incidents, so as to improve the reliability and robustness for the future. How postmortems are conducted will play a crucial role in setting the tone in the early stages of integrating SRE into your Data Science product lifecycle.
  • SLO & SLI revision: Depending on the rapidity of your project lifecycles and maturity of SRE framework, regular revisions of your SLO targets and SLIs will be required to ensure that SRE is implemented in a sustainable way, as the requirements of your users and stakeholders change over time.

Once you depart from the initial stages of establishing some rough SRE practices, a 4th implied but critical process/component is a well-thought-out on-call setup. In the words of the SRE Workbook,

…being on-call means being available during a set period of time, and being ready to respond to production incidents during that time with appropriate urgency.

You would want some employees to be on-call outside of office hours to ensure the reliability of your service. These on-call members of your team should have a detailed, step-by-step guide on what to do in the event of an incident, in the form of an on-call playbook, and should have access to relevant technical documentation, the error budget policy and the escalation protocol.

3. Put on your thinking ‘toque blanche’

SRE guiding principles to keep in mind


3.1 Some SRE is better than no SRE

I argue that having partially implemented SRE is better than having none. For example, for the sandwich classification service, you may not have charted out a precise error budget policy, but with some basic SLOs, monitoring and alerting in place, you will at least be able to proactively fix issues when the website goes down instead of finding out through a barrage of user complaints. It is best to define a vision of what an ideal SRE implementation might look like and then iteratively and gradually progress towards it.

3.2 Things WILL go wrong, all we can do is prepare!

There is a nonzero probability that one or more components will fail at some point in time resulting in less than 100% availability. If you do manage to create an experience that is 100% reliable for your customers, and want to maintain that level of reliability, you can never update or improve your service. Hence, 100% reliability is the wrong target. Another caveat is that as you go from 99% to 99.9% to 99.99% reliability, each extra nine comes at an increased cost, but the marginal utility to your customers steadily approaches zero.

3.3 SRE metrics are not instruments to play the blame game

It can be tempting to exploit the numeric data-driven approach of SRE by pegging your employees’ performance KPIs to these metrics. However, this can be really detrimental in the long run. It can perpetuate a culture of counterproductive finger-pointing. What you want is a culture of shared responsibility in which incidents and mistakes are mitigated and eliminated through the systematic implementation of SRE principles, practices and processes.

4. Running a kitchen without chefs

Implementing SRE without dedicated Site Reliability Engineers


4.1 Start simple, start small

  • Start with maybe 1 to 3 basic SLOs like availability and latency that matter the most to your customers. Over time, a couple more SLOs can be added to improve the robustness of your overall product.
  • Set reasonable targets like 90% instead of something very stringent like 99.9999%. Over time, these targets can be gradually refined and tightened.
  • Set up some simple dashboards for just your SLOs, with basic alerting to capture incidents and key trends. Most cloud providers have convenient monitoring and alerting solutions. If not, it is not a very tedious task to set up a simple Grafana-Prometheus-AlertManager stack. Over time, more monitoring and alerting can be added.
  • Instead of formally adopting SRE practices right away, start with a dry run for a period of time, perhaps a quarter, where your team tries to emulate ideal escalations, postmortems and SLO revisions.

4.2 Adopt a shared responsibility model

While the SLOs typically need to be owned by the product owner (e.g. the CTO or a Product Manager), SRE cannot be integrated into your product lifecycle until and unless everyone is on board. “Everyone” includes all the stakeholders: Data Scientists, Machine Learning Engineers, DevOps, QA, Product Managers, Business Requesters etc. Get everyone involved in understanding and planning the integration of SRE into your Data Science product lifecycle. Roll out your SRE in phases. Divide the production system into components (e.g. front-end, Machine Learning model, back-end, infrastructure) and get the different people responsible for those components to work together through the incident escalation protocol. Run escalation drills and postmortems to reinforce SRE culture within the team.

4.3 Eliminate toil through automation from the get-go

An important concept in SRE is the elimination of toil. Toil refers to repetitive, mundane, maintenance-type tasks that require little to no creative problem solving. In my opinion, toil elimination should be a priority from the get-go. Find opportunities in the SRE implementation process that might be automated. Script them away! While it might initially take a little longer to automate a task, it is a worthwhile investment, because such toil can become a massive liability as the SRE integration scales across your team, products and services.

5. Takeaways

Some final thoughts


There is a lot to unpack when it comes to understanding and implementing SRE. In the world of Data Science, the uncertainty brought about by black-box models such as Deep Learning models adds to the complexity around how to define your SLOs, alerts and escalation policy. My recommendation is to fret not and just get started! Like your Machine Learning model, you and your team too will get better with more experience.

Meet the chef!

Aishwarya Prabhat

Hi! I am Aish. The first two letters of my name are “AI”, and AI and Machine Learning are what I am passionate about. I am currently a Senior Data Scientist in Singapore and a Master’s student at GeorgiaTech. You can reach out to me via LinkedIn: https://www.linkedin.com/in/aishwaryaprabhat
