Data Science in a Serverless World

Building Data Products with Managed Services

Ben Weber
Towards Data Science
Nov 9, 2020

In large companies, there are typically separate teams for training machine learning models and putting these models into production. A data science team may be responsible for feature engineering, model selection, and hyperparameter tuning, while a machine learning engineering team is responsible for building and maintaining the infrastructure required to serve the trained models. As cloud platforms provide more and more managed services, this separation of concerns is not as necessary and it’s now possible for data scientists to build production systems. Data scientists at this intersection of data science and machine learning engineering are often referred to as applied scientists, and it’s a role that can provide significant value for a company.

Google Cloud Platform (GCP) and other cloud providers now provide serverless and fully-managed solutions for databases, messaging, monitoring, containers, stream processing, and many other tasks. This means that a team can quickly build end-to-end machine learning pipelines while reducing the time it takes to provision and monitor infrastructure. It enables data scientists to build data products that are production-grade, where monitoring is set up and the system can recover from failures. Instead of focusing only on model training, applied scientists can also build the production systems that serve ML models.

Why is it useful for a single team to be capable of building end-to-end machine learning pipelines with managed services?

  • Rapid Development: Having one team responsible for model building and model deployment typically results in faster iteration on projects, enabling more prototyping and testing before scaling up systems. Using managed services such as Memorystore for Redis on GCP lets a team prototype data products with an application database without having to worry about spinning up and monitoring infrastructure.
  • Reduced Translations: When separate teams perform model building and model deployment, different sets of tools are typically used and it may be necessary to translate models trained in Python into a different programming language such as Go. When one team is responsible for both tasks, it’s common to use similar tools for both, such as using Python for both model training and model serving. This reduction in translation is especially useful when building real-time systems.
  • Develop Expertise: Building end-to-end systems means that data scientists get hands-on experience with tools outside their usual toolkit, such as NoSQL databases and Kubernetes.

While there are several advantages to having a single team build data products using managed services, there are some downsides:

  • Cost: Managed services are typically more expensive than self-hosted solutions once you reach a certain scale. For example, serverless functions are great for prototyping, but may be cost-prohibitive for high-volume pipelines.
  • Missing Features: A cloud provider may not have a fully-managed offering for the service you need, or the offering may not meet the performance requirements of the application. For example, some organizations use Kafka in place of PubSub for messaging because of low-latency requirements.
  • DevOps: Having a data science team build and deploy machine learning pipelines means that the team is now on call for the data product, which may not be an expectation of the role. It’s useful to partner with an engineering or cloud operations team for critical applications.

What does it look like for a data science team to build an end-to-end data product using managed services? We’ll walk through an overview of real-time model serving on GCP for a mobile application.

Data Collection Pipeline

The first component to set up for this data product is the pipeline for collecting events from the mobile application. We need to stream data from the mobile device to BigQuery in order to have historical data for training a model. For model application, we’ll use Redis to perform real-time feature engineering. We’ll need to author two components for this pipeline: a web service that translates HTTP POST events into JSON messages and passes them to PubSub, and a Dataflow job that uses PubSub as a data source and BigQuery as a data sink. The web service could be written with Flask in Python or Undertow in Java, while the Dataflow job can be authored in Python or Java; minimal sketches of both components follow the list below. We can use the following services to build the collection pipeline with managed services on GCP:

  • Load Balancing: This service provides layer 7 load balancing and gives mobile devices an endpoint to call for sending tracking events. To set up the load balancer, we first deploy the tracking service on Kubernetes and expose it using a node port.
  • Google Kubernetes Engine (GKE): We can use Docker to containerize the web service and host the service using managed Kubernetes. The service receives tracking events as HTTP POSTs and translates the payloads into JSON strings that are passed to PubSub.
  • PubSub: We can use PubSub as a message broker for passing data between different services in the pipeline. For the collection pipeline, messages are passed from the web service to a Dataflow job.
  • Cloud Dataflow: The Dataflow job defines a set of operations to perform on a data pipeline. For this pipeline the job would perform the following operations: consume messages from PubSub, translate the JSON events into BigQuery records, and stream the records to BigQuery.
  • BigQuery: We can use BigQuery as a managed database for storing tracking events from the mobile application.
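
To make the web service concrete, here’s a minimal sketch of the tracking endpoint, assuming Flask and the google-cloud-pubsub client library; the project name, topic name, and route are hypothetical placeholders, not values from the original pipeline.

```python
import json

from flask import Flask, jsonify, request
from google.cloud import pubsub_v1

app = Flask(__name__)
publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic names.
topic_path = publisher.topic_path("my-project", "tracking-events")

@app.route("/track", methods=["POST"])
def track():
    # Translate the HTTP POST payload into a JSON message for PubSub.
    event = request.get_json(force=True)
    publisher.publish(topic_path, json.dumps(event).encode("utf-8"))
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```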

Once these components are set up, we have a pipeline for collecting tracking events that can be used to build machine learning models. I provided code examples for a similar pipeline in the post listed below.
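
As a rough sketch of what the Dataflow job could look like with the Apache Beam Python SDK, assuming a streaming pipeline; the topic, table, and schema below are hypothetical placeholders:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical resource names.
TOPIC = "projects/my-project/topics/tracking-events"
TABLE = "my-project:tracking.events"

def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         # Consume messages from PubSub.
         | "Read" >> beam.io.ReadFromPubSub(topic=TOPIC)
         # Translate the JSON events into BigQuery records.
         | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
         # Stream the records to BigQuery.
         | "Write" >> beam.io.WriteToBigQuery(
             TABLE,
             schema="user_id:STRING,event:STRING,ts:TIMESTAMP",
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

if __name__ == "__main__":
    run()
```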

Model Training Pipeline

With GCP, we can use Google Colab as a managed notebook environment for training models with libraries such as scikit-learn. We can use the Python notebook environment to train and evaluate different predictive models and then save the best-performing model to Redis, where it can be consumed by the model application service. It’s useful to save the model in a portable format such as ONNX, in case the application service is not written in Python. The post below provides an introduction to the Google Colab platform.
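
As an illustration, here’s a minimal sketch of the training step, assuming scikit-learn with the skl2onnx converter and a Memorystore Redis instance. The host, key name, and synthetic data are stand-ins for the real feature set queried from BigQuery.

```python
import numpy as np
import redis
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for features and labels queried from BigQuery.
X = np.random.rand(1000, 10).astype(np.float32)
y = (X[:, 0] > 0.5).astype(int)

model = LogisticRegression()
model.fit(X, y)

# Export to ONNX so the serving service isn't tied to scikit-learn.
onnx_model = convert_sklearn(
    model, initial_types=[("input", FloatTensorType([None, 10]))])

# Hypothetical Memorystore host and key name.
r = redis.Redis(host="10.0.0.1", port=6379)
r.set("propensity_model", onnx_model.SerializeToString())
```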

Model Serving Pipeline

Once we have a trained model that we want to serve, we’ll need to build an application that updates the feature vector for each user in real time based on incoming tracking events and also serves model predictions in real time. To store the feature vectors that encode tracking events into user profiles, we can use Redis as a low-latency database for retrieving and updating these values, as covered in the blog post listed below. For model serving, we build a web service that fetches the user’s profile from Redis, applies the model trained in Google Colab and stored in Redis, and returns the prediction to the mobile application, where it can be used to personalize the experience. As with the tracking pipeline, these web services can be containerized and hosted on managed Kubernetes.
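
Here’s a minimal sketch of the serving service, assuming the ONNX model and per-user feature vectors live in Redis under hypothetical keys, with onnxruntime for inference; the host, key names, and route are illustrative placeholders.

```python
import json

import numpy as np
import onnxruntime as ort
import redis
from flask import Flask, jsonify

app = Flask(__name__)
r = redis.Redis(host="10.0.0.1", port=6379)  # hypothetical Memorystore host

# Load the serialized ONNX model that the training notebook wrote to Redis.
session = ort.InferenceSession(r.get("propensity_model"))

@app.route("/predict/<user_id>")
def predict(user_id):
    # Fetch the user's feature vector, stored as a JSON list under a
    # hypothetical profile key, and run it through the model.
    profile = json.loads(r.get(f"profile:{user_id}"))
    features = np.array([profile], dtype=np.float32)
    label = session.run(None, {"input": features})[0]
    return jsonify({"user_id": user_id, "prediction": label.tolist()})
```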

Monitoring

To monitor the application, we can use the logging and monitoring managed services provided in GCP through Stackdriver. We can log events from the services hosted on GKE and from the Dataflow job to Stackdriver Logging, and send custom metrics from these services to Stackdriver Monitoring, where it’s possible to set up alerts based on thresholds. Wiring up these data flows provides monitoring for the pipeline, where alerts are triggered and forwarded to Slack, SMS, or PagerDuty, and the application logs can be viewed in Stackdriver. Code examples of using Stackdriver for monitoring and logging are available in the post below.
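
As a rough sketch, assuming the google-cloud-logging and google-cloud-monitoring client libraries (the project ID and metric name are hypothetical placeholders): route standard Python logs to Stackdriver and write a custom metric point that alerting policies can set thresholds against.

```python
import logging
import time

import google.cloud.logging
from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # hypothetical

# Route standard Python logging to Stackdriver Logging.
google.cloud.logging.Client().setup_logging()
logging.info("model serving service started")

# Write a point for a hypothetical custom metric that alerting
# policies can define thresholds against.
client = monitoring_v3.MetricServiceClient()
series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/model_service/predictions"
series.resource.type = "global"
now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now), "nanos": int((now % 1) * 1e9)}})
series.points = [monitoring_v3.Point(
    {"interval": interval, "value": {"int64_value": 1}})]
client.create_time_series(
    name=f"projects/{PROJECT_ID}", time_series=[series])
```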

Conclusion

Using managed services enables data science teams to get hands-on with putting models into production by reducing the amount of DevOps work required to set up and monitor infrastructure. While enabling data science teams to build end-to-end pipelines can result in faster development of data products, relying too heavily on managed services can drive up costs at scale. In this post we discussed some of the managed services available in GCP that data scientists can leverage to build real-time ML systems.

Ben Weber is a distinguished data scientist at Zynga. We are hiring!
