The New Generation Data Lake

The petabyte architecture you cannot afford to miss!

Paul Sinaï
Towards Data Science


An iceberg in the middle of the ocean with clouds behind
Image by Hubert Neufeld: https://unsplash.com/photos/7S21XSxKxVk

The volumes of data used for Machine Learning projects are relentlessly growing. Data scientists and data engineers have turned to Data Lakes to store vast volumes of data and find meaningful insights. Data Lake architectures have evolved over the years to massively scale to hundreds of terabytes with acceptable read/write speeds. But most Data Lakes, whether open-source or proprietary, have hit the petabyte-scale performance/cost wall.

Scaling to petabytes with fast query speeds requires a new architecture. Fortunately, the new open-source petabyte architecture is here. The critical ingredient comes in the form of new table formats offered by open-source solutions like Apache Hudi, Delta Lake, and Apache Iceberg. These components enable Data Lakes to scale to the petabytes while maintaining fast query speeds.

To better recognize how these new table formats come to the rescue, we need to understand which components of current Data Lake architectures scale well and which ones do not. Unfortunately, when a single piece fails to scale, it becomes a bottleneck and prevents the entire Data Lake from scaling to the petabytes efficiently.

We will focus on the open-source Data Lake ecosystem to better understand which components scale well and which ones can prevent a Data Lake from scaling to the petabytes. We will then see how Iceberg can help us massively scale. The lessons learned here can be applied to proprietary Data Lakes.

The Data Lake Architecture

As shown in the diagram below, there are two main layers to a typical Data Lake architecture. The storage layer is where the data lives, and the compute layer is where the compute and analytical operations are executed.

Image by Author

Object Stores and File Formats — scalable

Data Lake storage is handled by object stores. We can massively scale object stores by simply adding more servers. Objects are grouped into containers (buckets) that span different servers, making object stores extremely scalable, resilient, and (almost) fail-safe.

Today, the most popular object stores are Amazon S3 (Simple Storage Service), offered by Amazon Web Services™, and the Hadoop Distributed File System (HDFS)™. A flat, massively scalable, horizontally distributed architecture is used to locate these containers and the objects in them. Of course, you can find very similar services on GCP and Azure.
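For readers who want to see what working with an object store looks like in practice, here is a minimal sketch using the boto3 client for S3. The bucket name and object key are hypothetical, and it assumes AWS credentials are already configured.

```python
# Minimal sketch: writing and reading an object in Amazon S3 with boto3.
# The bucket "my-data-lake" and the object key are hypothetical.
import boto3

s3 = boto3.client("s3")

# Write a small object into the bucket.
s3.put_object(
    Bucket="my-data-lake",
    Key="raw/events/2021-01-01.json",
    Body=b'{"event": "click", "user_id": 42}',
)

# Read it back.
response = s3.get_object(Bucket="my-data-lake", Key="raw/events/2021-01-01.json")
print(response["Body"].read())
```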

Apache Parquet and Apache ORC are commonly used file formats. They are columnar-storage file formats, which scale much better than row-based formats for analytics workloads: queries read only the columns they need, which significantly speeds up read operations.
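As a quick illustration of why columnar formats help, here is a minimal sketch using pandas (with pyarrow installed) that writes a Parquet file and then reads back only two of its columns. The file and column names are purely illustrative.

```python
# Minimal sketch: columnar storage with Parquet via pandas/pyarrow.
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["FR", "US", "DE"],
    "revenue": [10.0, 25.5, 7.3],
})

# Write the table in the columnar Parquet format.
df.to_parquet("events.parquet", index=False)

# Read back only the columns the analysis needs; the other columns are never
# deserialized, which is what makes columnar formats fast for analytics.
subset = pd.read_parquet("events.parquet", columns=["country", "revenue"])
print(subset.head())
```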

The storage layer with the object stores and file format systems described above looks like this.

Image by Author

Data Processing — scalable

The compute layer manages all execution commands, which include Create, Read, Update, Delete (CRUD) operations as well as advanced queries and analytical computations. It also hosts the meta store, which contains and manages information such as table metadata, file locations, and other information that needs to be updated in a transactional way.

Apache Spark is one of the more popular open-source data processing frameworks, as it can handle large-scale data sets with ease. Experiments have shown Spark's processing speed to be up to 100x faster than Hadoop MapReduce. Spark achieves its scalability and speed by caching data, running computations in memory, and executing multi-threaded parallel operations. The Spark APIs allow many components of the open-source Data Lake to work with Spark.
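Here is a minimal PySpark sketch of the kind of analytical work the compute layer performs: reading Parquet files from an object store and aggregating them. The bucket path and column names are hypothetical.

```python
# Minimal PySpark sketch: reading Parquet from an object store and aggregating.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-lake-demo").getOrCreate()

# Spark reads the columnar files in parallel across its executors.
events = spark.read.parquet("s3a://my-data-lake/raw/events/")

revenue_by_country = (
    events.groupBy("country")
    .agg(F.sum("revenue").alias("total_revenue"))
)
revenue_by_country.show()
```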

PrestoSQL, now rebranded as Trino™, is a distributed SQL query engine designed for fast analytics on large data sets. It was initially developed by Facebook in 2013. It can access and query several different data sources within a single query and execute joins across tables that live in separate storage systems like Hadoop and S3. It uses a coordinator to manage a set of workers running on a cluster of machines.

Presto does not have its own meta store. The Presto coordinator needs to call a meta store to know in which container the files are stored. It generates a query plan before distributing it to the worker nodes for execution. Although efficient, the Presto coordinator represents a single point of failure and a bottleneck due to this architecture.

Although both Spark and Presto are used as an SQL interface to the Data Lake, they are not used for the same purpose. Presto was designed to run large, interactive queries over big datasets. It is used by data scientists and data analysts to explore large amounts of data. Spark, on the other hand, is primarily used by data engineers for data preparation, processing, and transformation. Since their purposes are not the same, they often coexist in Data Lake environments.
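To make the Presto/Trino workflow concrete, here is a hedged sketch that submits a federated query to a Trino coordinator using the trino Python client. The coordinator host, catalogs, schemas, and table names are all assumptions for illustration.

```python
# Hedged sketch: a federated query sent to a Trino coordinator, which plans it
# and pushes the work down to its workers. Host, catalogs, and tables are
# hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino-coordinator.example.com",
    port=8080,
    user="analyst",
)
cur = conn.cursor()

# A single query joining a Hive-backed table with an Iceberg table in S3.
cur.execute("""
    SELECT s.country, SUM(o.revenue) AS total_revenue
    FROM hive.web.sessions AS s
    JOIN iceberg.sales.orders AS o ON s.user_id = o.user_id
    GROUP BY s.country
""")
for row in cur.fetchall():
    print(row)
```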

Meta Stores — not so scalable

Meta store management systems run into issues when scaling massively. Let’s look at how.

The Apache Hive Meta Store, commonly used in open-source Data Lakes, was developed in 2010 by Facebook. Hive uses a Relational Database Management System (RDBMS) to keep track of each table's metadata, such as location, indexes, and schema. RDBMSs don't easily scale to the petabytes and can therefore represent a speed bottleneck for huge Data Lakes. Hive also relies on MapReduce, translating each SQL query before sending it to the correct container, which can considerably slow down queries when accessing vast data stores. Another drawback of Hive is that it does not have a version control management system.
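For illustration, here is a minimal sketch of pointing a Spark session at an existing Hive Metastore, so that table lookups go through Hive's RDBMS-backed meta store. The thrift URI is hypothetical.

```python
# Minimal sketch: attaching Spark to an existing Hive Metastore so table
# metadata (schemas, partitions, locations) is resolved through Hive's RDBMS.
# The thrift URI is hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hive-metastore-demo")
    .config("hive.metastore.uris", "thrift://hive-metastore.example.com:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Table lookups below go through the Hive Metastore, which is where the
# scalability bottleneck described above can appear at petabyte scale.
spark.sql("SHOW TABLES IN default").show()
```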

Nessie is more recent and scales better than Hive. It uses a scalable backend data store like Amazon DynamoDB to store the metadata. Nessie offers a well-designed, Git-inspired data version control system. Its APIs make it easy to integrate with Spark, Hive, or Iceberg. However, the Nessie data store resides in the compute layer, which doesn't scale as well as the object store, and therefore can become a bottleneck.
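Below is a hedged sketch of wiring a Nessie-backed Iceberg catalog into Spark. The Nessie endpoint, branch, and warehouse path are hypothetical, and the exact configuration keys can vary across Iceberg and Nessie versions.

```python
# Hedged sketch: registering a Nessie-backed Iceberg catalog in Spark.
# Endpoint, branch, warehouse path, and table names are hypothetical, and the
# configuration keys may differ between Iceberg/Nessie versions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("nessie-demo")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie.example.com:19120/api/v1")
    .config("spark.sql.catalog.nessie.ref", "main")  # Git-like branch
    .config("spark.sql.catalog.nessie.warehouse", "s3a://my-data-lake/warehouse/")
    .getOrCreate()
)

# Tables created through this catalog are versioned on the "main" branch.
spark.sql("CREATE NAMESPACE IF NOT EXISTS nessie.db")
spark.sql("CREATE TABLE IF NOT EXISTS nessie.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")
```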

The Data Lake architecture with the data processing, metadata management, and table formats described above looks like this.

Image by Author

Now, let’s look at how Iceberg can overcome these meta-store bottlenecks.

The Iceberg Linchpin

Apache Iceberg is a table format specification created at Netflix to improve the performance of colossal Data Lake queries. It is a critical component of the petabyte Data Lake. Ryan Blue, the creator of Iceberg at Netflix, explained how they were able to reduce the query planning times of their Atlas system from 9.6 minutes using Hive and Parquet down to just 42 seconds with Iceberg.

Iceberg manages the table definitions and information (file metadata) in files located next to the data in the object store buckets. The key to scalability is that the Iceberg meta stores can be distributed across several resources in the object store. This provides Data Lakes with massive meta store scalability by avoiding the bottleneck created by the other meta-stores described above, which reside at the compute level. A detailed description of the Iceberg table architecture can be found here.

Iceberg comes with a set of APIs and libraries that data processing engines can use. Files are still stored using file formats like Avro, Parquet, or ORC.
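As a hedged sketch of how this fits together with Spark, the configuration below declares an Iceberg catalog whose metadata and data files both live under a warehouse path in the object store. It assumes the Iceberg Spark runtime package is on the classpath; the catalog name, warehouse path, and table names are hypothetical.

```python
# Hedged sketch: a Spark session with an Iceberg catalog whose metadata lives
# next to the data in the object store (a "hadoop"-type catalog here).
# Assumes the Iceberg Spark runtime jar is available on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-demo")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-data-lake/warehouse/")
    .getOrCreate()
)

# Both the Parquet data files and the Iceberg metadata/manifest files end up
# under the warehouse path in the object store.
spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.sales")
spark.sql("CREATE TABLE IF NOT EXISTS lake.sales.orders (user_id BIGINT, revenue DOUBLE) USING iceberg")
spark.sql("INSERT INTO lake.sales.orders VALUES (42, 19.90)")
spark.sql("SELECT * FROM lake.sales.orders").show()
```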

Iceberg does not rely on traditional date-based sharding to manage buckets but instead applies its own partitioning and optimization rules for greater flexibility and speed, which greatly accelerates historical queries. Iceberg can be integrated with Nessie for version control management and to roll back to prior table, partition, and schema layouts.
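For illustration, Iceberg's partition transforms let you declare how data should be laid out at table creation time, without queries ever referencing partition columns directly. The sketch below reuses the hypothetical "lake" catalog and Spark session from the previous example.

```python
# Hedged sketch: declaring Iceberg partition transforms (hidden partitioning)
# instead of managing date-based directories by hand. Reuses the hypothetical
# "lake" catalog from the previous sketch.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.sales.events (
        user_id BIGINT,
        ts      TIMESTAMP,
        payload STRING
    )
    USING iceberg
    PARTITIONED BY (days(ts), bucket(16, user_id))
""")

# Queries that filter on ts benefit from partition pruning without the query
# author ever referencing a partition column explicitly.
spark.sql(
    "SELECT count(*) FROM lake.sales.events "
    "WHERE ts >= TIMESTAMP '2021-06-01 00:00:00'"
).show()
```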

Iceberg offers an SDK in both Java and Python. This SDK can be accessed by Spark, Presto, Flink, and Hive. Even if you choose to use Iceberg as your table format, you still need tools like Spark and/or Presto to query Iceberg. At the time of this article, the Iceberg Python SDK is limited to read operations, but you can always use Spark or even Presto if you don't want to integrate your application directly with Iceberg, which can be a challenge.
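Here is a hedged sketch of a read-only access path from Python using the pyiceberg library. The catalog configuration and table name are hypothetical, and the Python API surface has evolved across versions.

```python
# Hedged sketch: reading an Iceberg table from Python with the pyiceberg
# library (read-oriented, as noted above). Catalog and table names are
# hypothetical, and the API has changed across versions.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("lake")            # resolved from pyiceberg configuration
table = catalog.load_table("sales.orders")

# Scan a subset of rows/columns and hand the result to pandas for analysis.
df = table.scan(
    row_filter="revenue > 10",
    selected_fields=("user_id", "revenue"),
).to_pandas()
print(df.head())
```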

The new petabyte open-source Data Lake with the Iceberg table format components looks like this.

Image by Author

In the near future, Machine Learning platforms will include Data Lakes based on technology like Iceberg in their architectures, providing the massive scalability data scientists need to realize their most ambitious projects.
