
What is Apache Iceberg? Is it a new data lake file format? A table format? Why is it so good? Does it bring time travel to data lakes? I will try to answer all these questions in this story.
Transactionally consistent data lake tables with point-in-time snapshot isolation are all we need.
Selecting a table format is a crucial choice for anybody pursuing a data lake or data mesh strategy. Important data lake platform features to consider include schema evolution support, read and write performance, scalability (can the data be split for parallel processing?), compression efficiency, and time travel, to name a few. Choosing the right file and table formats based on business requirements will define how fast and cost-effective your data platform can be.
Iceberg is a table format that is both engine and file format agnostic. Iceberg is not an evolution of an older technology such as Apache Hive, and that is a good thing, because building on top of an older technology can be limiting. A good example is how a schema changes over time: Iceberg handles this as simply as renaming a column. It was designed, and is proven, to perform in data lake platforms at scale on the world's most demanding workloads.
More and more data tools are introducing support for Iceberg tables.
For instance, we can create an Apache Iceberg table in AWS Athena (this requires Athena engine version 3).
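A minimal sketch; the database, table, and S3 location below are placeholder names I am assuming:
CREATE TABLE mydatabase.user_transactions (
  user_id int,
  total_cost_usd float,
  payment_date date,
  ts timestamp
)
-- day(ts) is a hidden partition transform; table_type = 'ICEBERG' makes this an Iceberg table
PARTITIONED BY (day(ts))
LOCATION 's3://user-transactions/iceberg/'
TBLPROPERTIES ('table_type' = 'ICEBERG');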
Google's BigQuery now also supports Iceberg tables.
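For example, a read-only BigLake Iceberg table can be defined over existing Iceberg metadata in Cloud Storage. This is a sketch based on the BigQuery documentation linked at the end; the connection, dataset, and bucket names are placeholders:
CREATE EXTERNAL TABLE mydataset.user_transactions_iceberg
WITH CONNECTION `myproject.us.my_biglake_connection`
OPTIONS (
  format = 'ICEBERG',
  -- uris points at the Iceberg metadata JSON file for the table
  uris = ['gs://user-transactions/iceberg/metadata/v1.metadata.json']
);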
Iceberg tables and data platform types
Iceberg was originally created at Netflix and was later donated to the Apache Software Foundation. Now Iceberg is developed independently as a completely open-source, non-profit project focused on dealing with challenging data platform architectures. It supports multiple big data file formats, including Apache Avro, Apache Parquet, and Apache ORC.
Keeping data in a data lake is one of the simplest solutions when we design a data platform.
It requires much less maintenance compared to modern data warehouse solutions.
Data lakes are typically used to store all types of data, structured and unstructured, at any scale. They were traditionally associated with the Apache Hadoop Distributed File System (HDFS), but organizations increasingly use object storage solutions such as Amazon S3, Google Cloud Storage, or Microsoft Azure Data Lake Storage.
In a few words, data lakes simplify data management by centralizing the data.
In contrast, when every access is routed through a single system (a data warehouse), concurrency management and updates become simpler, but flexibility is limited and costs rise. I previously wrote about it here:
This is all great, but while building a data lake platform we might face some other issues as well, e.g. no time travel, no schema evolution support, and the complexity of data transformation and data definition.
When we build a data lake platform, external tables will most likely bring the whole set of cons related to that architecture… but Iceberg helps to solve this.
Here are some well-known limitations of traditional external tables:
- We can't modify them with DML statements, and data consistency is not guaranteed: if the underlying data is changed during processing, we might not get consistent results.
- They usually allow only a limited number of concurrent queries in modern data warehouses, e.g. 4 in BigQuery.
- External tables do not work with clustering, and we cannot export data from them.
- They will not let us use wildcards to reference table names.
- No time travel features.
In my experience, data consistency is one of the most important features that large-scale analytics requires. Iceberg solves it, and now multiple engines (Spark, Hive, Presto, Dremio, etc.) can operate on the same table simultaneously.
It also offers many other brilliant features, such as rollback to a previous table version (to quickly resolve issues) and advanced data filtering capabilities at scale when processing huge amounts of data.
Data consistency and improved processing efficiency
A good example: when an ETL process modifies a dataset by adding and deleting files in storage, another application reading the dataset may see a partial or inconsistent representation of it and produce inaccurate results.
Iceberg mitigates these risks by using manifest (metadata) files to take snapshots as the data is being processed. It captures schemas and maintains deltas, including file information and partitioning data, to guarantee consistency and complete isolation.
Iceberg also automatically arranges snapshot metadata in a hierarchical manner, which ensures quick and efficient table modifications without redefining all dataset files, resulting in optimal performance while working at data lake scale.
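To get a feel for this metadata, Iceberg exposes it through metadata tables. As a rough illustration in Athena, assuming the hypothetical user_transactions table sketched above, we can inspect the snapshot history and the data files it currently references:
-- one row per commit, with snapshot and parent snapshot IDs
SELECT * FROM "mydatabase"."user_transactions$history";
-- data files currently referenced by the table, with per-file statistics
SELECT file_path, record_count, file_size_in_bytes
FROM "mydatabase"."user_transactions$files";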
Iceberg offers SQL commands that let you merge new data (MERGE), update existing rows (UPDATE), and delete specific rows (DELETE).
Iceberg can eagerly rebuild data files to improve read performance or employ delete deltas to speed up updates.
A good merge example can be found here:
Another one can be applied to the AWS Athena table created above.
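A minimal sketch, assuming a hypothetical staging table (user_transactions_staging) that holds freshly loaded rows:
MERGE INTO mydatabase.user_transactions t
USING mydatabase.user_transactions_staging s
  ON t.user_id = s.user_id AND t.ts = s.ts
-- update rows that already exist, insert the ones that don't
WHEN MATCHED THEN
  UPDATE SET total_cost_usd = s.total_cost_usd
WHEN NOT MATCHED THEN
  INSERT (user_id, total_cost_usd, payment_date, ts)
  VALUES (s.user_id, s.total_cost_usd, s.payment_date, s.ts);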

Time travel feature
Modern data warehouses allow you to time travel in your data, i.e. we can go back to a specific timestamp to get the state of the data in our table at that moment. For instance, in Google Cloud BigQuery we can run the SQL below to do so:
SELECT *
FROM `mydataset.mytable`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
It might be a "nice-to-have" feature for data lake platforms as well.
If you design a data architecture around raw files, such as Apache ORC or Apache Parquet, you benefit from ease of implementation, but you also run into the issues described above: time travel is not supported, and schema evolution has always been a problem when fields change over time. For instance, the Avro file format has schema evolution support; I previously wrote about big data file formats' pros and cons here:
Big Data File Formats, Explained
Iceberg keeps the whole history within the Iceberg table format, with no storage system dependencies.
Since the historical state is immutable, Iceberg table users can query past states at any snapshot or historical point in time for consistent results, comparisons, or a rollback to remedy errors.
SELECT count(*) FROM mydatabase.user_transactions FOR TIMESTAMP AS OF TIMESTAMP '2023-01-01 00:00:00.000000 Z'
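And if a bad write does slip in, the table can be rolled back to an earlier snapshot. A sketch using the Iceberg stored procedures in Spark SQL; the catalog name and snapshot ID below are placeholders:
-- roll the table back to a known-good snapshot
CALL my_catalog.system.rollback_to_snapshot('mydatabase.user_transactions', 8744736658442914487);
-- or roll back to the state at a given point in time
CALL my_catalog.system.rollback_to_timestamp('mydatabase.user_transactions', TIMESTAMP '2023-01-01 00:00:00');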
Schema evolution support
It's no secret that data lake files, and their schemas, change over time. Adding a column to an Iceberg table will not surface "dead" data.
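For example, adding a column to the hypothetical user_transactions table from above is a metadata-only change; the new column simply reads as NULL for existing rows:
-- add a new column without rewriting any data files
ALTER TABLE mydatabase.user_transactions ADD COLUMNS (total_cost_gbp float);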
Column names and column order can be changed too. The best part is that schema updates never require rebuilding your table:
ALTER TABLE mydatabase.user_transactions ALTER COLUMN total_cost_gbp AFTER user_id;
Iceberg allows for in-place table changes and ensures correctness, i.e. newly added columns never read existing values from another column. When data volumes change, we can modify the partition layout or change the table schema using nothing but SQL, even for nested structures such as:
traffic_source STRUCT<name STRING, medium STRING, source STRING>,
Changing the partition layout of an existing table is what Iceberg calls partition evolution.
When you modify a partition specification, the existing data written with an earlier specification remains unaffected.
Each partition version's metadata is saved separately. Iceberg does not require expensive operations such as rewriting table data or migrating to a new table.
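A rough sketch of partition evolution using Spark SQL with the Iceberg extensions (the catalog and table names are placeholders): switching from daily to hourly partitions. Data already written stays in its daily layout, while new writes use the hourly spec.
-- start partitioning new data by hour
ALTER TABLE my_catalog.mydatabase.user_transactions ADD PARTITION FIELD hours(ts);
-- stop using the daily partition field for new writes
ALTER TABLE my_catalog.mydatabase.user_transactions DROP PARTITION FIELD days(ts);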
Improved partitioning
In Iceberg terms, this is called "hidden partitioning". The traditional data lake way of using partitions is called the "Hive partitioning layout".
Let’s consider an external table with Hive partitioning layout in BigQuery:
CREATE OR REPLACE EXTERNAL TABLE production.user_transactions (
payment_date date,
timestamp timestamp,
user_id int,
total_cost_usd float,
registration_date string
)
WITH PARTITION COLUMNS (
dt STRING, -- column order must match the external path
category STRING)
OPTIONS (
uris = ['gs://user-transactions/*'],
format = 'AVRO',
hive_partition_uri_prefix = 'gs://user-transactions'
);
Partitions must be explicit in Hive and are shown as columns, so the **user_transactions** table would contain a **dt** date column. This means that every SQL query that touches our table has to include a **dt** filter in addition to a **timestamp** filter.
Hive requires partition values. It does not understand the link between the transaction **timestamp** and **dt** in our table example above.
In contrast, Iceberg generates partition values by taking a column value and, if needed, transforming it. Iceberg is in charge of translating the transaction **timestamp** to **dt** and maintaining that relationship. Iceberg can hide partitioning because it does not require user-maintained partition columns. Partition values are always created correctly and are used to speed up queries wherever possible. **dt** is invisible to both producers and consumers, and queries no longer rely on the actual physical layout of our table.
It helps to avoid partitioning errors while transforming the data. For example, using the wrong date format (**20230101** instead of **2023-01-01**) leads to incorrect partitioning rather than failures; this is one of the most common problems with the Hive partitioning layout. Another problem that Iceberg solves: in the Hive partitioning layout, all working queries are tied to the table's partitioning scheme, so changing the partitioning setup breaks existing queries.
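To make that concrete: with a hidden partition such as day(ts) on the hypothetical Athena table sketched earlier, consumers filter only on the source column, and Iceberg derives the partition values and prunes files behind the scenes:
-- no dt filter needed; the day(ts) partition is used automatically for pruning
SELECT sum(total_cost_usd)
FROM mydatabase.user_transactions
WHERE ts >= TIMESTAMP '2023-01-01 00:00:00'
  AND ts < TIMESTAMP '2023-01-08 00:00:00';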
Conclusion
Relational databases and data warehouses both support atomic transactions and time travel, but they do it in their own proprietary ways. Iceberg, being an Apache project, is completely open source and not reliant on any specific tools or data lake engines.
Iceberg supports industry-standard file formats like Parquet, ORC, and Avro and is compatible with key data lake engines such as Dremio, Spark, Hive, and Presto. Its wide community collaboration generates new ideas and provides long-term support; it features various active communities, such as public Slack channels, in which everyone is free and welcome to participate. It was designed, and is proven, to perform in data lake platforms at scale on the world's largest workloads and environments. Organizations can now enjoy the full potential and advantages of moving to a data lake architecture by utilizing Iceberg: the cost-effectiveness of cloud storage-based data platforms without sacrificing the features and capabilities of traditional databases and data warehouse solutions.
It makes our data lake platform really flexible and reliable.
To conclude, these are the benefits of the Iceberg table format:
- Several separate programs can handle the same dataset concurrently and consistently.
- Improved data processing and data reliability (more efficient updates for very large data lake-scale tables).
- Improved schema handling as it evolves over time.
- ETL pipelines are greatly simplified (by acting on data in place in the data lake rather than moving data between numerous separate systems).
Recommended read:
- https://cloud.google.com/bigquery/docs/time-travel
- https://www.apache.org/
- https://hive.apache.org/
- https://iceberg.apache.org/community/
- https://cloud.google.com/bigquery/docs/iceberg-tables
- https://medium.com/towards-data-science/data-platform-architecture-types-f255ac6e0b7
- https://medium.com/towards-data-science/big-data-file-formats-explained-275876dc1fc9
- https://iceberg.apache.org/docs/latest/evolution/
- https://iceberg.apache.org/docs/latest/partitioning/#icebergs-hidden-partitioning
- https://cloud.google.com/bigquery/docs/external-data-cloud-storage#sql