Will Data Go Cloud Native?

The tools and platforms that data professionals use are increasingly running on cloud native technology.

Charles Landau
Towards Data Science


Photo by Jonny Gios on Unsplash

Data tools remain an extremely active space, which is very exciting for me as a lazy user. Don’t take my word for it; read Matt Turck’s excellent post, Resilience and Vibrancy: The 2020 Data & AI Landscape.

If you dig into the massive infographic in Matt’s post, you’ll find that many of the listed technologies, tools, and companies are cloud native. What’s striking about this trend is that it’s occurring all up and down the modern data infrastructure stack — from ingestion to storage to processing and prediction.

One thing I wonder, though, is whether cloud native data tools are going to become dominant.

What is cloud native?

For a variety of reasons, people have come to use the term “cloud native” for two very different things.

  1. “Cloud native” means using “containers, service meshes, microservices, immutable infrastructure, and declarative APIs”, as described by the CNCF.
  2. “Cloud native” means using the cloud service provider “native” tools. In this usage we’re talking about provider-specific tools with tight integration, for example with AWS Control Tower, Config, or ParallelCluster.

It’s unfortunate that there are two almost opposite usages of the term, but I can’t do anything about that. (I actually think that you could end up implementing both meanings of the term in some instances, maybe somewhere in Serverless Land, so they aren’t quite opposites.) In any case, for this article, I’m going with the first usage.

Cloud Native Data Tools

Let’s get real: data tools run on cloud native.

Spark now targets deployment on Kubernetes. This by itself is tremendous. In addition, AWS recently rolled out a feature for its managed Spark service (EMR) that lets you run it on its managed Kubernetes service (EKS). (Would I sign up to run production Spark workloads on Kubernetes? I’m not sure.)
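To make “Spark on Kubernetes” a little more concrete, here is a minimal sketch of what pointing a job at a cluster can look like. It only assembles a spark-submit command line; it doesn’t run anything. The flags and the two spark.* settings come from Spark’s Kubernetes support, but the helper name build_spark_submit, the API server address, the image name, and the application path are all placeholders I made up for illustration.

```python
import shlex


def build_spark_submit(api_server, image, app_path, executors=2, name="example-job"):
    """Return the argv list for a cluster-mode spark-submit against Kubernetes.

    Hypothetical helper for illustration: it builds the command, it does not run it.
    """
    if executors < 1:
        raise ValueError("executors must be at least 1")
    return [
        "spark-submit",
        # The k8s:// prefix tells Spark to use the Kubernetes scheduler backend.
        "--master", f"k8s://{api_server}",
        # In cluster mode the driver itself runs in a pod.
        "--deploy-mode", "cluster",
        "--name", name,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_path,
    ]


if __name__ == "__main__":
    argv = build_spark_submit(
        api_server="https://k8s.example.internal:6443",  # placeholder address
        image="registry.example.internal/spark:latest",  # placeholder image
        app_path="local:///opt/spark/app/job.py",  # placeholder path inside the image
    )
    print(shlex.join(argv))
```

The driver and executors each land in their own pods, which is a big part of why I’m hesitant about production: you now own the image, the networking, and the cluster’s resource limits along with the job.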

JupyterHub, RStudio, and Kubeflow are all examples of data infrastructure-as-software using Kubernetes. These are honorable, value-add tools that offer a consistent user experience to data scientists, using the tools that they are already familiar with (Kubeflow adds some new things, but also embeds Jupyter). This space is far from settled: AWS SageMaker, Azure ML Studio, and Google AI Platform are all strong offerings, which sometimes lightly overlap with cloud native. The “open core” up-and-comer, Databricks, is driving hard to IPO. Now that Spark runs on Kubernetes, will Databricks develop a Helm chart?

Airflow, the popular data pipeline tool that recently hit 2.0, runs on Kubernetes. Kafka, the big honking streaming data platform, is simplifying its architecture, and the Strimzi project is simplifying Kafka deployment on Kubernetes. (Strimzi is currently a CNCF Sandbox project, meaning that it may be a while longer before you can kiss your dedicated production Kafka clusters goodbye.) People are trying to put production databases on Kubernetes (and I have questions). Esri, the winner of the geospatial data market, will add support for Kubernetes to its flagship product, ArcGIS. Feature stores, a kind of multi-workflow database for MLOps, run on Kubernetes.

These technologies won’t all melt into Kubernetes overnight, and cloud native hasn’t won by any stretch, but there is clearly some momentum.

Don’t Panic

Despite its complexity, I tend to think that Kubernetes sprawling into data infrastructure is just a function of Kubernetes sprawling into everything. It is moving into the edge. It is moving into IaaS. You can run blockchain on it, apparently. Banks are using it.

I generally have no idea how this will play out, but I fully expect it to remain exciting and dynamic. These problems sit at the intersection of extremely vibrant communities (data, ML, cloud native, serverless, open source), vendors of all shapes and sizes, and a ton of investment. That seems to me like a formula for interesting times.

What you definitely shouldn’t do, unless you hate money, is run out and buy every data scientist on your team a copy of the Cloud Native Infrastructure book and a Kubernetes bootcamp. If you’re a data professional reading this and thinking about “learning Kubernetes”, that’s a perfectly fine thing to do. It can be good to diversify. Just be aware of what you’re getting yourself into.

Originally published at https://theslipbox.substack.com.
