Paved roads are common in technology infrastructure, but there’s never been a better time to build highways. Over a century ago, as car prices dropped from $850 in 1913 to less than $300 by 1924 thanks to mass production on assembly lines, we saw the birth of the US Numbered Highway System to better serve long-distance travel. These highways triggered investment in infrastructure that further mobilized people and goods at an unprecedented scale, fundamentally changing our way of life. Just as the core innovation a century ago wasn’t the car but the assembly line, the ‘Big Three’ cloud providers today innovate by deploying the modern equivalent of the assembly line and running operations at massive scale. And just as cars were the rage a century ago, today we live in the golden age of consumer applications that deliver services and goods, from food and furniture to virtual travel, at the press of our fingertips. But where are the digital highways? The ‘goods’ of the digital world today are data. Platform teams across enterprises are building their own paved roads and data infrastructure, uncovering new rocks in their local domains. This is a stab at imagining the modern data highways that every company could use to move faster in a data-driven world over the next decade.
Early Days
Traditionally, data lakes (DLs) have been the place where structured and unstructured data from across the organization gets dumped. In the past, these data dumps naturally led to rising costs with proprietary, on-premises solutions. Today, customers can create their own data lakes in the cloud. Building a data lake generally involves defining the storage (landing/ingest) location, configuring ingestion into this location, preparing and enriching the data, configuring access controls, and finally querying the data for analysis, as sketched below.
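To make those steps concrete, here is a minimal sketch on AWS using boto3; the bucket names, IAM role, and database are hypothetical placeholders, and the access-control and enrichment steps are omitted for brevity.

```python
import boto3

# 1. Define the storage (landing/ingest) location -- an S3 bucket here.
s3 = boto3.client("s3")
s3.create_bucket(Bucket="acme-data-lake-landing")  # hypothetical bucket name

# 2. Configure ingestion/cataloging: a Glue crawler infers schemas from landed files.
glue = boto3.client("glue")
glue.create_crawler(
    Name="landing-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="lake_raw",
    Targets={"S3Targets": [{"Path": "s3://acme-data-lake-landing/events/"}]},
)
glue.start_crawler(Name="landing-crawler")

# 3. Query the cataloged data for analysis with Athena.
athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="SELECT event_type, COUNT(*) FROM events GROUP BY event_type",
    QueryExecutionContext={"Database": "lake_raw"},
    ResultConfiguration={"OutputLocation": "s3://acme-data-lake-results/"},
)
```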
But this arrangement of ingest, store, transform, and analyze steps isn’t new. It has been around for decades, even as ‘modern’ tools appear to replace the previous generation. What has changed is the reduction of ‘muck’: with the right level of abstraction, practitioners don’t have to rediscover the same rough edges, gaps, or surprises over and over again.
Miles to Go
In modern organizations using data for analytics and machine learning, we find that teams still need to solve several hard problems – from the most immediate issues of query performance and data quality to the more strategic issues around data privacy. Teams are also growing sensitive to their infrastructure costs and the time to deploy (or time to insights), both serving as proxy measures of productivity and long-term competitiveness. These problems suggest the need for modern data highways that are yet to be built.
A. High Performance Querying – With the separation of storage (now available reliably and cost-effectively in cloud object storage) from compute, a query engine (with caching for additional performance) that can compile queries into efficient logical and physical plans is essential to abstract the underlying storage layouts and formats. Such engines naturally rely on indexing as a core capability and should ideally accommodate new types of indexes for new categories of data over time. Not surprisingly, modern data warehouses built around such high-performance query engines can now act as a single source of truth for a wide range of query patterns, from ad-hoc analyst queries, data science tasks, and ML engineering to scheduled business intelligence reports. [See: Snowflake, Delta Lake from Databricks, Amazon Redshift] As with Alibaba Cloud’s Hologres, which launched in Feb 2021, we expect further improvements in query latency over the next year. What’s next for data stores? My bet is the emergence of an order of magnitude more complex queries over the next five years (more in a section below).
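As a small illustration of this storage/compute separation, here is a sketch using DuckDB to push SQL down onto Parquet files laid out like an object store; the file paths and column names are made up.

```python
import duckdb

# The engine compiles this SQL into a plan that reads only the columns and
# row groups it needs from the Parquet files -- the storage layer is just
# files, and all the smarts live in the query engine.
con = duckdb.connect()
result = con.execute(
    """
    SELECT user_id, COUNT(*) AS sessions, AVG(duration_s) AS avg_duration
    FROM read_parquet('warehouse/events/*.parquet')   -- hypothetical layout
    WHERE event_date >= DATE '2021-01-01'
    GROUP BY user_id
    ORDER BY sessions DESC
    LIMIT 10
    """
).fetchdf()
print(result)
```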
B. Data Quality & Management – A key organizational capability is aligning different teams on what ‘good data’ means. Tactically, this requires defining a data schema and the rules that govern good-quality data. Strategically, formalizing tribal organizational knowledge and setting expectations are critical to building deeper trust in the business decisions and end-user experiences this data serves. [See: WhyLabs, Great Expectations, Anomalo, Monte Carlo, BigEye Data, Hubble, Databand, UnravelData] Moving beyond data quality, we expect new work on continuous evaluation and multi-dimensional evaluation metrics to shift DataOps/MLOps from targeting observability to informed controllability, holding upstream systems and teams accountable for downstream errors and corrections. For example, fixing entity embeddings upstream can address errors in downstream products that consume those embeddings. Recent papers such as Bootleg and RobustnessGym show that data management techniques (not modeling techniques) such as augmentation and weak labeling can increase accuracy over baseline models, especially over long-tail distributions. [See: Snorkel]
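To make the tactical piece concrete, here is a minimal, library-agnostic sketch of how such quality rules might be expressed and checked against a batch of data; the columns, rules, and thresholds are made up, and tools like those listed above offer far richer versions of this idea.

```python
import pandas as pd

# Each rule is a name plus a predicate over the batch that returns True
# when the expectation holds.
RULES = {
    "user_id is never null": lambda df: df["user_id"].notna().all(),
    "amount is non-negative": lambda df: (df["amount"] >= 0).all(),
    "country codes are known": lambda df: df["country"].isin(["US", "DE", "IN"]).all(),
    "no duplicate order ids": lambda df: not df["order_id"].duplicated().any(),
}

def check_quality(df):
    """Return the names of the rules this batch violates."""
    return [name for name, rule in RULES.items() if not rule(df)]

batch = pd.DataFrame({
    "user_id": [1, 2, None],
    "amount": [10.0, -5.0, 3.5],
    "country": ["US", "DE", "FR"],
    "order_id": [100, 101, 101],
})
print(check_quality(batch))  # every rule above fails on this toy batch
```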
C. Data Privacy – Another key organizational capability is building mechanisms for intentional data use and control. At the first level, this requires being intentional about whether business decisions rely on data that was never intended for that scenario. For example, when you see a creepy advertisement, did the business intend to show you a creepy ad, or was it unintentional, reflecting poor judgement? Several companies have built open-source frameworks (e.g. Netflix’s Metacat, LinkedIn’s DataHub, Airbnb’s Dataportal, Uber’s Databook, Lyft’s Amundsen) for data cataloging, lineage tracking, and usage policy enforcement. These frameworks enable reproducibility in addition to ensuring that models are trained on data intended for a given business purpose. The next level is ensuring that no human (incl. employees) can access user data. The third level is deploying well-defined data retention and deletion mechanisms to implement reliable control of an individual’s or group’s data based on their consent. The problem with traditional security mechanisms, say the techniques used for credit card protection, is that replacing a specific field with an encrypted token (aka ‘pseudonymization’) fails when one cannot point to which specific field is actually sensitive. While a dataset may appear innocuous on its own, it can become sensitive when used in conjunction with another dataset. Encrypting a given dataset (‘de-identification’) may appear safe, but what if multiple (unintended) parties hold the decryption key? It’s also still unclear whether computing on encrypted data can deliver the desired level of accuracy. Differential privacy offers provable guarantees that very little can be learned about any specific individual, by introducing calibrated noise into the data. This lets businesses assess the tradeoff between accuracy and privacy, and of course provides utility from a well-calibrated dataset that would otherwise not be available. While larger companies such as Google, Apple, Facebook, and LinkedIn are building for privacy internally, and companies such as Privacera, Immuta, and Privitar are showing promise, this remains the most underserved space in the data landscape.
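The core differential-privacy idea fits in a few lines: add noise calibrated to a query’s sensitivity and a privacy budget epsilon, so any one individual’s presence barely moves the released number. A toy sketch of the Laplace mechanism follows; the count and epsilon values are illustrative.

```python
import numpy as np

def dp_count(true_count, epsilon, sensitivity=1.0):
    """Release a count with Laplace noise scaled to sensitivity / epsilon.
    Adding or removing one person changes a count by at most 1, so
    sensitivity is 1; smaller epsilon means more noise and stronger privacy."""
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

exact = 1_234                          # e.g. users who clicked a given ad
print(dp_count(exact, epsilon=0.5))    # noisier release, stronger privacy
print(dp_count(exact, epsilon=5.0))    # closer to the truth, weaker privacy
```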
D. Data Transformations – Data transformations enable the filtering, shaping, and enriching of data into a form that lights up insights and generates features for machine learning (ML) models. SQL remains the most popular transformation interface today, while various pipeline-as-code tools stitch together multiple dependent steps. [See: dbt, Airflow] This isn’t necessarily a hard problem in itself, but it’s where a lot of the ‘muck’ and undifferentiated heavy lifting soaks up time and productivity. Fully managed, cost-effective services in the cloud such as Prefect and Flyte are promising, but this space remains isolated from the cost constraints, business SLAs, quality and accountability, and intended-use requirements mentioned above. Could there be a way to fundamentally reformulate this problem space? For example, if application production environments can be defined declaratively with desired constraints, why not data infrastructure environments as well?
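As a sketch of the pipeline-as-code pattern (not a recommendation of any particular stack), here is a minimal Airflow DAG that stitches together two dependent transformation steps; the DAG name, task bodies, and schedule are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_orders():
    # Placeholder: pull raw orders from the landing zone.
    print("extracting raw orders")

def build_daily_revenue():
    # Placeholder: filter, shape, and enrich into a reporting table.
    print("building daily_revenue table")

with DAG(
    dag_id="daily_revenue_pipeline",   # hypothetical pipeline
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    transform = PythonOperator(task_id="build_daily_revenue", python_callable=build_daily_revenue)
    extract >> transform   # explicit dependency: transform runs after extract
```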
E. Feature Engineering – Adding new features or gathering new types of data frequently results in long lead times if the feature isn’t easily ‘backfill-able’. As a data scientist who wants to gather a new signal from users, you have to first design the feature, add it to the source application to collect it, and then wait weeks to gather enough data to draw reliable conclusions. This stretches model development cycles to several months. These batch or streaming data pipelines may further feed into an offline feature store, often on object storage such as S3, that holds registered ML model ‘features’. Ensuring that ML models see the same distribution of data during training as they do during inference remains a hard problem that Zipline from Airbnb is beginning to address and that Tecton might build on.
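One way to see the training/serving consistency problem: if the feature logic is written twice, once for the batch backfill and once in the serving path, the two copies drift. Below is a minimal sketch of the ‘define once, use in both places’ idea that feature stores aim for; the feature, column names, and helper functions are illustrative.

```python
import pandas as pd

def sessions_last_7d(events, as_of):
    """Single definition of the feature, shared by training and inference."""
    window = events[(events["ts"] > as_of - pd.Timedelta(days=7)) & (events["ts"] <= as_of)]
    return window.groupby("user_id").size().rename("sessions_last_7d")

def backfill(events, cutoffs):
    """Offline: compute the feature at many historical cutoffs to build training data."""
    frames = [sessions_last_7d(events, t).reset_index().assign(as_of=t) for t in cutoffs]
    return pd.concat(frames, ignore_index=True)

def online_lookup(events, user_id, now):
    """Online: at request time, the same function yields the same value for one user."""
    return int(sessions_last_7d(events, now).get(user_id, 0))
```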
F. Model Building – As companies develop more sophisticated ML systems, say by composing models together or training larger models on unstructured data such as audio and video, the cost of compute required for training quickly balloons. However, as model architectures begin to converge, we expect these training costs to taper off for most applications, which may build on optimized versions of standard models or use APIs, say from Amazon Rekognition or perhaps Microsoft’s GPT-3 API. More importantly, bringing model computation closer to the source data and leveraging the underlying relational structure of that data to execute complex queries could yield order-of-magnitude reductions in cost. This may lay the groundwork for the next generation of query engines, with further optimizations in how the data is organized. [See: Lliquid, Relational.ai]
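For the ‘use APIs’ path, the sketch below calls Amazon Rekognition on an image already sitting in object storage, so no training compute is spent at all; the bucket and key names are hypothetical.

```python
import boto3

rekognition = boto3.client("rekognition")

# Label detection on an image in S3: the heavy model lives behind the API,
# and we pay per call rather than for training infrastructure.
response = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "acme-product-photos", "Name": "catalog/sofa-123.jpg"}},
    MaxLabels=5,
    MinConfidence=80.0,
)
for label in response["Labels"]:
    print(label["Name"], round(label["Confidence"], 1))
```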
G. Model Deployment – Sadly, inference and training environments continue to use different tools, systems, and pipelines. These isolated tools and processes create steep, painful barriers to safely deploying models to production, leaving data scientists spending time on infrastructure issues instead of data and model design. Tools that offer a paved path for data scientists early in the prototype stage naturally enable them to reach production sooner. [See: Algorithmia, Seldon, Bighead from Airbnb, OctoML]
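As a sketch of the kind of paved path those tools provide, here is a minimal inference service that loads the same serialized artifact produced during training and exposes it over HTTP. The artifact path, feature names, and a scikit-learn-style classifier are all assumptions, and real deployments add validation, versioning, and monitoring.

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the exact artifact that training produced, so prototype and production
# share one model definition.
with open("artifacts/churn_model.pkl", "rb") as f:   # hypothetical path
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    features = [[payload["sessions_last_7d"], payload["avg_duration_s"]]]
    score = float(model.predict_proba(features)[0][1])
    return jsonify({"churn_probability": score})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```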
Time to build… what?
We’ve been making reasonable progress on query performance, but data quality and management, data privacy, feature engineering, and model building still have a long way to go. Amazon kicked off the revolution with cloud services more than a decade ago, and we’ve since seen transformative product categories built in the cloud. It’s time to build the muscle and guardrails to leverage this infrastructure to strengthen our organizations and data practices. Are you seeing the same gaps, or different ones? What are we missing?