Image by Thomas Spicer

Sorry, Data Lakes Are Not “Legacy”

Why “data lakes are dead” talk contributes little to modern data architecture conversations

Thomas Spicer
Towards Data Science
13 min read · Jan 29, 2021


Last year, in a post about data lakes, we covered the various "FUD" surrounding lake architecture, strategy, and analytics. Fast forward a year, and it seems data lakes are now definitively considered a "legacy" data architecture.

During the Modern Data Stack Conference 2020, Fivetran CEO George Fraser stated that data lakes have no place in a modern data architecture:

“In my opinion, data lakes are not part of the modern data stack. Data lakes are legacy,” Fraser said. “There are organizational [and] quasi-political reasons why people adopt data lakes. But there are no longer technical reasons for adopting data lakes.”

Indeed, this is controversial and likely an attempt to be provocative. However, it is not the first time shade has been cast on data lakes as an antiquated model for data architecture.

In fairness, grand proclamations like "data lakes are legacy" are thrown around too frequently in the industry. For example, did you know Dremio's data lake is on the verge of making the data warehouse obsolete, at least according to the article "Did Dremio Just Make Data Warehouses Obsolete?"

If you are trying to chart a path toward a modern data architecture for your team and company, these types of statements contribute to unnecessary obfuscation and make it a challenge to filter through the noise.

In this post, we are going to take the other side: data lakes are not legacy. We will save the conversation about whether data warehouses are obsolete for another day.

Why Vendors Claim Data Lakes Are “Legacy”

What is the rationale for claiming a data lake should not be part of a modern data stack? There are a few recurring themes:

  1. Compute and Storage: Data lakes are architecturally and technically deficient because advances in data warehouse products deliver separate compute and storage.
  2. Technology Debt: Teams inherited data lakes or created them for political, not technical, reasons. They were created by "someone at the top of the organization" who embraced marketecture, drew a chart with a data lake in it, and declared victory.
  3. Cost: Historically, data lakes were adopted due to lower costs, but that cost benefit no longer exists.

We are going to rebut each one of these faulty themes.

1. Compute and Storage

At the Modern Data Stack Conference 2020, it was argued that a modern data architecture should leverage "massively parallel processing and column store technology via separate compute and storage solutions." Nothing controversial there. The separation of compute and storage conveys several operational, technical, and financial benefits.

Unfortunately, separate compute and storage was framed as something data lakes fail to deliver, which is not accurate.

We need to dig into the compute and storage concept a bit more to illustrate this point. There are important distinctions to make, especially because a technical concept like separate compute and storage often manifests as a pricing strategy as much as an architectural one.

Compute

On-demand compute, such as a query engine, allows compute capacity to scale up as the need arises and scale down when the need subsides. Compute services like on-demand query engines provide the primary SQL query capabilities for data lakes.

Unlike traditional data warehouse systems, compute resources scale independently. For example, person X's query will not materially impact person Y's query.

On-demand compute significantly reduces idle system resources. Most vendors' pricing models align compute costs closely with usage. This means you pay only for active compute operations, which conveys favorable cost economics when used efficiently. Why efficiently? Unfortunately, compute costs tend to be highly variable, which makes forecasting budgets challenging.

Your costs will depend on the types, quantity, frequency, and complexity of the operations being performed by users. Person X may be much less efficient than person Y at writing SQL queries, which means costs may be dramatically higher for similar outcomes.

Storage

Storage is where data resides: organized (or not), durable, persistent, accessible, and held in one or more specific formats.

Storage costs are typically a fixed rate per unit of data, highly predictable, and scale as needed. For example, AWS and Google Cloud Storage standard pricing is about $0.023 per GB per month.

For 1 TB, that works out to about $23 a month; for 10 TB, about $230. Other factors impact storage costs, but the idea is that teams can easily forecast these costs over time.
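To make that predictability concrete, here is a minimal sketch of how a team might forecast object storage spend as data volume grows. The $0.023 per GB per month figure mirrors the standard rate mentioned above; the growth profile is a made-up assumption for illustration, and real bills also include request, retrieval, and tiering charges.

```python
# Minimal sketch: forecasting object storage costs as data volume grows.
# Assumes a flat $0.023/GB/month standard rate; treat results as rough estimates.

STORAGE_RATE_PER_GB = 0.023  # USD per GB per month (standard tier)

def monthly_storage_cost(total_gb: float) -> float:
    """Return the estimated monthly storage bill for a given data volume."""
    return total_gb * STORAGE_RATE_PER_GB

def forecast(starting_gb: float, monthly_growth_gb: float, months: int) -> list[float]:
    """Project monthly storage costs assuming linear data growth."""
    return [
        monthly_storage_cost(starting_gb + monthly_growth_gb * m)
        for m in range(months)
    ]

if __name__ == "__main__":
    print(f"1 TB  -> ${monthly_storage_cost(1_000):.2f}/month")   # ~ $23
    print(f"10 TB -> ${monthly_storage_cost(10_000):.2f}/month")  # ~ $230
    # Hypothetical growth: start at 1 TB, add 500 GB per month.
    for month, cost in enumerate(forecast(1_000, 500, 6), start=1):
        print(f"Month {month}: ${cost:,.2f}")
```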

The big takeaway is that separate compute and storage is not a deficiency of a lake but one of its principal architectural and strategic strengths.

Tight vs. Loose Coupling In A Data Lake Technology Stack

It was said at the conference that the separation of compute and storage is a key benefit data warehouses have over data lakes.

“Data warehouses that separate compute from storage have all of the advantages of data lakes and more”

From a technical perspective, compute and storage are intended to be loosely coupled. Yes, this is a benefit for warehouses. However, the benefit is not exclusive to warehouses.

Any modern data architecture, by design, depends on a loosely coupled separation of compute and storage to deliver an efficient, scalable, and flexible solution. The fact that data warehouse vendors are introducing separate compute and storage is not an innovation compared to data lakes; it is achieving parity with data lakes.

The evolution of separate compute and storage in warehouses brings them in line with the architecture employed by productive data lakes via on-demand SQL query services.

In a post called When to Adopt a Data Lake — and When Not to, one dig at data lakes was that they could not scale compute easily or on-demand:

Some solutions architects have proposed data lakes to “separate compute from storage” in a traditional data warehouse. But they’re missing the point: You want the ability to scale compute easily and on-demand. A data lake isn’t going to give you this; what you need is a data warehouse that can provision and suspend capacity whenever you need it.

Data lakes don’t scale compute resources because there are no compute resources to scale. As a critique of data lakes, this one is a miss. So how do you query a data lake? Data lakes rely on well-abstracted compute (i.e., query services), like Amazon Redshift Spectrum, Amazon Athena, Facebook Presto, Ahana, and others.

These query services, by design, scale compute resources on-demand to query the contents of a data lake. A lake does not have to bundle its own compute services because it operates in a loosely coupled, federated model where compute (query) services are attached to the lake.
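As a rough illustration of this model, the sketch below submits a SQL query to Amazon Athena against data sitting in S3 using boto3. The bucket, database, and table names are hypothetical placeholders; the point is that the compute is attached on demand while the data stays in the lake.

```python
# Minimal sketch: attaching on-demand compute (Athena) to a data lake in S3.
# Bucket, database, and table names are hypothetical placeholders.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT channel, COUNT(*) AS events FROM web_events GROUP BY channel",
    QueryExecutionContext={"Database": "marketing_lake"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes; the data never leaves the lake.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```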

So, what is the difference between tightly and loosely coupled compute and storage?

Loose & Tight Coupling — Technology vs. Product vs. Pricing

BigQuery and Snowflake are two products referenced during the conference as leaders in the data warehouse segment at separating compute and storage.

When you push data to BigQuery, the data is resident within BigQuery's storage systems. When data resides in the Snowflake storage system, that data is bound to Snowflake. As a result, data at rest within BigQuery and Snowflake is (generally) tied to their respective storage systems. The compute and storage paradigm in these cases is tightly coupled in a product context.

While the compute and storage internals are technically separate in each product, the practical end state is that they are bound to each other in a product context. This is not unlike Oracle or MySQL, where compute and storage are intrinsically bound within the vendor's product. Because of this coupling, Snowflake cannot directly query BigQuery's native storage, just as BigQuery cannot query Snowflake's.

Image by Thomas Spicer

While these products might be tightly coupled at the product level, BigQuery and Snowflake manifest the benefits of separate compute and storage as an innovation in pricing.

As a result, a tightly coupled compute and storage product model is reflected in a loosely coupled pricing model.

A modern data lake architecture expects compute resources to be supplied by external SQL query services. The diagram below highlights how well-abstracted data lakes can independently serve different product compute services. Presto, Athena, Snowflake, and BigQuery all can operate as the compute tier for a data lake:

Image by Thomas Spicer

In more complex organizations, it is easy to envision different brands, divisions, or partner organizations leveraging their preferred query services on top of a core data lake infrastructure.

While we mentioned that Google BigQuery and Snowflake generally work from native storage systems, both also offer compute services that operate independently of those internal storage systems.

For example, Snowflake allows you to query your data lake via external tables. Google also supports the same concept of querying external data sources.
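For instance, here is a rough sketch of pointing BigQuery at Parquet files that stay in a Cloud Storage lake, using the google-cloud-bigquery Python client. The project, dataset, table, and bucket names are hypothetical; the point is simply that the warehouse's compute can be aimed at storage it does not own.

```python
# Minimal sketch: defining a BigQuery external table over Parquet files in
# Cloud Storage, so the data stays in the lake while BigQuery supplies compute.
# Project, dataset, table, and bucket names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://example-data-lake/events/*.parquet"]

table = bigquery.Table("example-project.lake.external_events")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)

# Query the lake-resident data as if it were a native table.
query = """
    SELECT channel, COUNT(*) AS events
    FROM `example-project.lake.external_events`
    GROUP BY channel
"""
for row in client.query(query).result():
    print(row.channel, row.events)
```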

Google is further extending this well-abstracted query model by investing heavily in BigQuery Omni. AWS has similarly extended Redshift with Spectrum on-demand compute query services.

These hybrid models reflect the need for flexibility. Extending query services is not a function of supporting “legacy” data lakes, but vendors recognize they operate within complex ecosystems facing constant pressure from rapid innovation and transformation in data science, analytics, operations, and reporting.

2. Technology Debt: Architecture

It is not uncommon for teams to inherit lots of technical debt despite a spoken (or unspoken) goal to carry as little debt as possible.

As a general rule of thumb, the more you opt for vendor-specific technologies and solutions, the greater the risk of incurring additional technical debt. This might be an acceptable compromise.

Teams often trade long-term costs for short-term velocity. If management is arbitrarily forcing a data lake "marketecture," the risk of incurring debt increases. However, the same can be said of management saying, "Hey, you must use Snowflake because my buddy Jed said it was awesome" or "I read a blog post saying BigQuery is the best thing ever; we are going with that." We all know cases where something like this happens.

Despite statements to the contrary, a data lake does not intrinsically carry any more technical debt than a warehouse. A data lake, done well, can help reduce the risk of incurring technical debt by properly abstracting aspects of your data architecture with best-in-class solutions. For example, a data lake can offer velocity and flexibility in employing different compute services, minimizing risks of vendor lock-in, and mitigating switching costs.

For example, at the Modern Data Stack Conference 2020, it was stated that people should not be concerned about vendor lock-in in the data warehouse space. Why? Because depending on a vendor like Fivetran will reduce the risks of warehouse lock-in. Oddly, this answer to vendor lock-in is itself promoting vendor lock-in; it simply shifts the lock-in to a different layer. What if you need to move from Fivetran to something else?

Obviously, vendors have a vested interest in getting you operationally and technically locked into a product. That is OK as a business strategy but certainly is not a strong counter-argument for any company attempting to create loosely-coupled, well-abstracted modern data architecture.

There might be cases where vendor lock-in and the resulting technical debt are a viable compromise. However, the benefit of that compromise may be short-lived and quickly evaporate, given the rate of change in the industry.

For example, Athena and Presto offer federated queries, providing on-demand compute services for running SQL queries across data stored in relational, non-relational, object, and custom data sources. Federated queries reflect a distributed, loosely coupled approach that minimizes technical debt while accelerating data consumption.

Why mobilize data from MySQL to Snowflake if you can query it directly in place via Presto, Athena, or Redshift? Certainly, the need for a service like Fivetran to move data from MySQL to Snowflake is eliminated or significantly reduced.

Here is how PrestoDB describes what federated queries allow users to do:

Presto allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.

Steve Mih, CEO at Ahana, echoed this sentiment, “Presto is part of the open analytics stack — open source, open format, open interfaces, open cloud.”
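To make the federated idea concrete, here is a minimal sketch using the presto-python-client to join data that stays in MySQL with Parquet files in the lake. The host, catalogs, schemas, and table names are assumptions for illustration, and the coordinator is assumed to already have the mysql and hive connectors configured.

```python
# Minimal sketch: a federated Presto query joining a MySQL catalog with a
# Hive/S3 catalog, so no ELT pipeline is needed to co-locate the data first.
# Host, catalogs, schemas, and table names are hypothetical.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.internal",
    port=8080,
    user="analyst",
    catalog="hive",   # default catalog; the query names others explicitly
    schema="lake",
)
cursor = conn.cursor()

cursor.execute("""
    SELECT c.region, SUM(e.revenue) AS revenue
    FROM hive.lake.order_events AS e      -- Parquet files in the data lake
    JOIN mysql.crm.customers AS c         -- live operational database
      ON e.customer_id = c.id
    GROUP BY c.region
    ORDER BY revenue DESC
""")
for region, revenue in cursor.fetchall():
    print(region, revenue)
```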

The idea of minimizing ELT and ETL with federated query services reflects a well-abstracted compute services tier. AWS, Google, Microsoft, and many others are rapidly moving toward adopting a distributed query engine model within their products. Is this the right approach for everyone? Probably not.

Given an accelerating rate of change in the data warehouse, query engine, and data analytics market, minimizing risk, lock-in, and technical debt should be a core part of a serious data architecture strategy.

3. Costs: Technical Architecture Or Pricing Model?

The topic of costs can be tricky. Typically, an analysis would occur around the total cost of ownership. For example, what are the costs associated with products? Operations? People? Training?

For the sake of simplicity, we will narrow the costs to vendor products and services around compute and storage.

Each vendor has a different pricing strategy for compute resources. One of the primary selling points for Google, AWS, Snowflake, and others is the promise of on-demand pricing, no upfront costs, and low operating expenses.

While this is typically the case, the flip side of compute pricing is that it can be difficult to forecast. Do you know today how your current and future state use cases will impact costs 6, 12, or 24 months from now? As a result, compute pricing variability can result in unexpected costs. All too often, these costs become the target of significant operational and financial conversations with your CFO.

Compute costs growing over time. Image by Thomas Spicer

BigQuery & Snowflake

During the conference, both Snowflake and Google BigQuery were highlighted as leaders of this new variable pricing paradigm. Both services offer a pricing model that separates compute and storage. For example, typical Google BigQuery usage costs about $0.020 per GB per month for storage and $5.00 per TB of data scanned by queries.

Snowflake costs are a little more complicated on the compute side than BigQuery: as the Snowflake warehouse size increases (i.e., as you scale up compute resources), costs increase. We go through Snowflake and Amazon Athena use cases below.

Costs: Snowflake Example

The following is a compute and storage pricing scenario for Snowflake. In our hypothetical scenario, we will start with the smallest warehouse size Snowflake offers: X-Small. A warehouse costs more as you go up in size. We will have about 1 TB of data in Snowflake.

Let’s assume you are running ETL processes for 10 minutes every hour, seven days a week. You also have analysts connecting to Snowflake from Tableau for 3 hours a day, Monday through Friday. The compute costs will be about $620 a month, while the storage costs are under $100. If your use case changes from 10 minutes to 20 minutes of ETL time, your compute costs rise to about $862, while your storage costs remain essentially flat.

Now, let’s say your processing workload changes, and you notice performance bottlenecks with X-Small. Given the workload, you decide to scale from X-Small to Medium. The compute costs go from $620 to $1,496 per month for the 10-minute ETL process and from $862 to $2,470 for the 20-minute ETL process.
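A back-of-the-envelope model makes it easier to see how warehouse size and active hours drive this kind of bill. In the sketch below, the credits-per-hour values follow Snowflake's published doubling pattern by warehouse size, but the per-credit price, the usage profile, and billing details such as auto-suspend are assumptions, so it will not reproduce the exact figures above.

```python
# Back-of-the-envelope Snowflake compute cost model. Credits per hour follow
# the published doubling pattern by warehouse size; the per-credit price and
# usage assumptions are illustrative and vary by edition, region, and
# auto-suspend settings, so real bills will differ.

CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}
PRICE_PER_CREDIT = 3.00  # USD, hypothetical; depends on edition and region

def monthly_compute_cost(size: str, etl_minutes_per_hour: float,
                         analyst_hours_per_day: float,
                         etl_days: int = 30, analyst_days: int = 22) -> float:
    """Estimate monthly compute cost for an hourly ETL job plus BI usage."""
    etl_hours = (etl_minutes_per_hour / 60) * 24 * etl_days
    analyst_hours = analyst_hours_per_day * analyst_days
    credits = (etl_hours + analyst_hours) * CREDITS_PER_HOUR[size]
    return credits * PRICE_PER_CREDIT

for size in ("XS", "M"):
    for etl_minutes in (10, 20):
        cost = monthly_compute_cost(size, etl_minutes, analyst_hours_per_day=3)
        print(f"{size}, {etl_minutes} min ETL/hour: ~${cost:,.0f}/month")
```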

In this example, the compute costs have scaled while storage costs remained flat. So far, so good. This is undoubtedly an innovation in pricing. Paying for resources as demand scales without adding or changing storage capacity offers flexibility and reduced operational complexity.

So what is the catch? With tightly coupled compute and storage, your architecture is bound to vendor pricing models. If Snowflake increases the costs for additional services, the chance of being “stuck” with rising costs is a real first-class financial consideration.

At some point, let’s say a few years from now, Snowflake decides it must focus on extending offerings to meet shareholder expectations for delivering profitability. Not an unreasonable expectation. However, how would this new drive for profitability impact pricing?

Let’s say your monthly Snowflake run rate was $2,000 before the profitability push, and Snowflake changes the cost basis for your use case by 20%. That is an extra $400 a month, or roughly $4,800 a year in additional compute revenue for Snowflake.

The point here is that innovations in pricing do not always benefit the customer over the long term. Easy onboarding and upfront simplicity invariably have an impact on the total cost of ownership.

Costs: Amazon Athena Example

Amazon Athena provides on-demand query services for data lakes. The pricing for Athena? It is $5.00 per TB of data scanned.

If you run 100 queries a day on average, scanning 25 GB of data per query, the cost will be about $370 a month (3,042 queries per month × 0.0244140625 TB × $5.00 = $371.34).
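The scan-based arithmetic is easy to reproduce. Here is a small sketch of the calculation using the figures above: 100 queries a day and 25 GB scanned per query at $5.00 per TB.

```python
# Reproducing the scan-based Athena cost estimate from the text.
PRICE_PER_TB_SCANNED = 5.00   # USD per TB of data scanned
GB_PER_TB = 1024

queries_per_month = 100 * 30.42          # ~3,042 queries per month
tb_scanned_per_query = 25 / GB_PER_TB    # 25 GB ~= 0.0244 TB

monthly_cost = queries_per_month * tb_scanned_per_query * PRICE_PER_TB_SCANNED
print(f"~${monthly_cost:,.2f} per month")  # ~ $371.34
```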

There are cases where we have seen Athena, used in conjunction with tools like Tableau Hyper, deliver a compelling cost/performance value proposition. Tableau Hyper supports data from a lake using an “in-memory data engine technology, designed for fast data ingest and analytical query processing on large or complex data sets.”

Employing Hyper significantly reduces the number of queries Tableau needs to send to Athena. As a result, the Athena charges associated with those Tableau queries are reduced.

The same benefits detailed in the Athena/Tableau pairing can be true for Snowflake or BigQuery too.

The point here is that modern data architecture should start by exploring patterns, experimenting, defining priorities, and making objective decisions based on well-reasoned criteria.

Embracing Modern Data Lake Design Patterns

McKinsey has said the following about data lakes:

“ …lakes ensure flexibility not just within technology stacks but also within business capabilities.”

An accelerating rate of change in the “data” market is not abating. Fast-paced innovation in technology, operations, and pricing demands well-reasoned data architecture. As such, any reasonable assessment of a modern data architecture must factor in the impacts (positive or negative) of each choice, including using, or not using, a data lake.

Data lakes are not the right solution for all use cases. The absence of tightly coupled compute resources can be a real, practical consideration for a certain type of team. A solution like BigQuery or Snowflake offers operational and technical simplicity. Thankfully, these products also support loosely-coupled query services. This compromise may be useful if a team finds itself needing to leverage a data lake in the future.

The purpose of a lake in modern data architecture is to deliver a solution that values flexibility, open standards, minimizing vendor lock-in, and service federation. Regardless of your preferred architecture or platform, spending time defining an architecture that weighs the pros and cons of each choice is time well spent.
