From Data Warehouse to Data Lake to Data Lakehouse

What each is for, what you need, and what the advantages and limitations are

Alex Punnen
Better Software

--

Before we get to the Data Lake, we need to go through the other data store technologies, to see the full picture and to understand what has changed and what has not. Along the way, we will build a Data Lakehouse using popular, state-of-the-art stacks and check which use cases fit the Data Lakehouse paradigm and which do not.


Databases (same as before and here to stay)

Databases are much the same as they were before. We have both NoSQL databases (Apache Cassandra, ...) and SQL databases (Postgres, ...), as well as the less commonly used but quite useful graph databases (for example Titan, queried through Apache TinkerPop/Gremlin). Then there is database-like software that is quite popular: ElasticSearch/OpenSearch, for example, is not a database; it is a text-indexed (inverted index) data store, more of a search and analytics engine.

The use case for the database is also the same as it was before: to write and read data many times, including inserts, updates, etc. The main factors here are also the same as they were years back, starting with the CAP theorem: Consistency, Availability, Partition tolerance. It says that a distributed database can guarantee only one of Consistency or Availability when a network partition happens. Different databases favour different aspects, and for some this can be configured (for example, tunable consistency in Cassandra) so that it appears as if all three can be supported.
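As a quick illustration of tunable consistency, here is a minimal sketch using the DataStax Python driver (cassandra-driver); the contact point, keyspace and table are made up for the example and assume a reachable Cassandra cluster.

# pip install cassandra-driver
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])            # contact point is an assumption
session = cluster.connect("demo_keyspace")  # keyspace/table below are assumptions

# QUORUM: a majority of replicas must acknowledge the write,
# trading some availability for stronger consistency.
write = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(write, (1, "alice"))

# ONE: any single replica may answer the read - faster and more
# available, but it may not yet have seen the latest write.
read = SimpleStatement(
    "SELECT name FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE)
print(session.execute(read, (1,)).one())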

Before going deeper into databases, suffice it to say that they are here to stay, and for a reason: mainly for updating inserted records efficiently (that is, fast) and in a consistent or available manner. Databases are also called OLTP Systems — Online Transaction Processing systems. But not all databases support transactions; those that do need to be ACID compliant.

Here is an interesting question.

Can't databases scale? If they can, why do we need anything else?

Sure they can scale, especially NoSQL databases like Cassandra, which run in production at huge scale; Apple is known to run one of the largest deployments, at around 75,000 nodes. But the problem is cost, and that has to do with the fact that compute and storage are tied together in a database.

In a database, the compute is tightly tied to how the data is represented in storage, and the storage is tightly tied to disk semantics.

For example, Cassandra always appends to the end of a file, the SSTable file, to reduce disk seeks. This design was forged at a time when hard disks were magnetic and random seeks were slow; in the age of SSDs it still has the advantage of reducing write amplification. Because everything is appended to the end of the file, an update needs a timestamp for every column, plus the logic to return only the value with the latest timestamp. Deletes work similarly: an entry with a newer timestamp is appended to the SSTable file with a marker called a tombstone set to true, indicating the row is deleted. Postgres takes a related but different approach with its Write-Ahead Log.
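To make the idea concrete, here is a toy, in-memory Python sketch (not Cassandra's actual code) of append-only writes with timestamps and tombstones; reads simply pick the newest entry per key.

import time

log = []  # the "SSTable": an append-only list of (timestamp, key, value, tombstone)

def upsert(key, value):
    log.append((time.time(), key, value, False))   # an update is just a new append

def delete(key):
    log.append((time.time(), key, None, True))     # delete = append a tombstone

def read(key):
    # scan for the entry with the latest timestamp for this key
    latest = max((e for e in log if e[1] == key), key=lambda e: e[0], default=None)
    if latest is None or latest[3]:                # missing or tombstoned
        return None
    return latest[2]

upsert("user:1", "Jerry")
upsert("user:1", "Jerry Mouse")   # "update" appends a newer version
print(read("user:1"))             # -> Jerry Mouse
delete("user:1")
print(read("user:1"))             # -> None (the tombstone wins)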

Basically, databases are complex beasts whose purpose is to store, update (or delete) and retrieve hot data. They optimise storage, retrieval and update efficiency, and the trade-off is additional complexity and compute cost.

So What exactly can a Database Handle?

Databases can scale; they also compress data, so even small clusters can handle millions of rows and specific use cases with ease.

Big Data is when the disk space needed runs to hundreds or thousands of terabytes or more, and is growing rapidly. It may be that your enterprise is like Apple and has the financial power and resources to process petabytes of data in DB-like use cases, where it makes sense to build and maintain gigantic DB clusters (NoSQL clusters scale roughly linearly; most others tend to need exponentially more resources as certain limits are reached). If so, you stay with the database; otherwise, you move to the Simple Storage Service, aka Object Storage, aka S3, and the Data LakeHouse. Here you get practically unlimited storage at a fraction of the cost, plus some CPU instances for querying, and most cloud providers offer that CPU on demand (serverless).

This article from MinIO, one of the popular object storage companies, is a good read: https://blog.min.io/block-object-storage-myths/. It argues that object storage is as fast as any other kind. For people who are not familiar with storage, the following question may come up.

So why not run a database on S3 and get the best of both worlds?

You cannot build one, or maybe you can (https://people.csail.mit.edu/kraska/pub/sigmod08-s3.pdf), but it may not be efficient to build a database system on top of object storage.

Object storage exposes a simple, REST-based API. Databases, at the lowest level, are written using file I/O on top of block storage; Cassandra's SSTable writer, for example, is built on Java's buffered writer. One may be tempted to simply abstract those file writes into S3 writes, but the write and read semantics and algorithms in a database are deeply tied to its ACID or CAP guarantees, and those guarantees are not available on top of S3.
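A rough sketch of why the mapping is awkward: a database writer can append a few bytes to a local file and flush, whereas with an object store an "append" means rewriting the whole object. The bucket, key and MinIO endpoint below are made up, and boto3 with reachable S3-compatible storage is assumed.

import boto3

# Block-storage style: append a record and flush - cheap and incremental.
with open("/tmp/sstable.db", "ab") as f:
    f.write(b"row-42|Jerry|5000\n")
    f.flush()

# Object-storage style: objects are immutable, so an "append" means
# reading the whole object, modifying it in memory, and putting it back.
s3 = boto3.client("s3", endpoint_url="http://minio:9000")   # made-up endpoint
body = s3.get_object(Bucket="test", Key="sstable.db")["Body"].read()
body += b"row-42|Jerry|5000\n"
s3.put_object(Bucket="test", Key="sstable.db", Body=body)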

There are better ways to do this, with the Data Lake and Data LakeHouse concepts. But before we go there, there is another, older beast that we need to confront and then, thankfully, forget.

Data Warehouse — A Dinosaur getting extinct

Traditional data warehouse companies are fast jumping to the cloud. Basically, these were proprietary, high-powered Intel chipsets and storage in server racks. The use case was to load data from disparate systems in an enterprise (for an airline, say: flight booking databases, cargo databases, airline personnel, operations, etc.) using custom-made ETL scripts into a single warehouse schema or set of schemas, on which custom reports are generated and sent to management. These are pretty costly systems, and the ETL is also done by specialised teams or outsourced companies. Teradata is a well-known example.

Hadoop and HDFS — An alternative that is going out of style

At one time Hadoop and its distributed file system, HDFS, burst onto the landscape as an open-source alternative to the data warehouse. Hadoop's MapReduce paradigm also became famous for how it could work in a massively distributed and parallel manner on the data in HDFS.

So we have a highly scalable data store based on commodity hardware/servers, HDFS, and we have Hadoop's highly scalable MapReduce processing engine that can work on the data in HDFS. Sure, it is not as easy to query as the ETL-prepared data in a data warehouse, but with some complexity and engineering you could get similar reports at much lower cost. Examples are Cloudera, Hortonworks (now merged), IBM, etc.

One main reason this has gone out of favour is the complexity of operating a Hadoop cluster, Apache ZooKeeper and the like, along with the overhead of maintaining HDFS.

The HDFS and Hadoop ecosystem originated from, and was inspired by, Google's internal GFS and MapReduce systems and the papers that followed; so this post from Google Cloud Storage explaining the cost benefits of migrating away from HDFS is a worthwhile read if you are still not convinced about moving away from the Hadoop ecosystem: https://cloud.google.com/blog/products/storage-data-transfer/hdfs-vs-cloud-storage-pros-cons-and-migration-tips (and that was back in 2018).

Note that there are many database and data-access systems centred around the Hadoop ecosystem; some examples are Apache Kudu, Apache Drill, Apache Hive, Impala (SQL), Apache Druid, Apache HBase, etc. However, these are generally not Kubernetes native, and I would suggest checking before use; not being Kubernetes native is not by itself a disqualification, but HDFS is clearly on its way out. Note that we still use the SQL parts of Apache Hive, as Presto/Trino uses it as the metastore.

Simple Storage Service (S3) from Amazon — A new era

The key term here is Simple. It has far fewer features than a database or HDFS. There is no append functionality; S3 stores data as immutable objects, with metadata kept alongside each object. Here is the press release from 2006, when it was first launched.

https://press.aboutamazon.com/news-releases/news-release-details/amazon-web-services-launches-amazon-s3-simple-storage-service

There are two main selling points here — first, it is extremely reliable and durable (S3 is designed for eleven 9s of durability), and second, it is very inexpensive. There is no compute cost; you pay only for the storage you use, and you don't have to keep cloud instances up for this storage.

S3 by itself does not give you a SQL or MapReduce interface over its data. It just gives a simple REST API for putting and getting data: you Put an object under a key, and you use that key to Get the data back. Any and all other logic is up to the application. For example, there is no guaranteed time limit for consistency: S3 was for years only eventually consistent (AWS has since added strong read-after-write consistency, though other object stores may not offer it), and handling that was up to the application; see, for example, this use case from Netflix. This simplicity is the reason it is so cost-effective.
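Here is a minimal sketch of that Put/Get-by-key model with boto3, including the kind of application-level visibility check described above (AWS S3 itself is now strongly consistent, but other S3-compatible stores may not be); the bucket, key and endpoint are assumptions.

import time
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", endpoint_url="http://minio:9000")  # made-up endpoint

# Put an object under a key ...
s3.put_object(Bucket="test", Key="raw/trip-0001.json", Body=b'{"fare": 12.5}')

# ... and the application retries until the key is visible, then Gets it.
for _ in range(10):
    try:
        s3.head_object(Bucket="test", Key="raw/trip-0001.json")
        break
    except ClientError:        # not yet visible on an eventually consistent store
        time.sleep(0.5)

obj = s3.get_object(Bucket="test", Key="raw/trip-0001.json")
print(obj["Body"].read())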


Some of S3's inner workings came to light in a recently published paper: https://assets.amazon.science/77/5e/4a7c238f4ce890efdc325df83263/using-lightweight-formal-methods-to-validate-a-key-value-storage-node-in-amazon-s3-2.pdf

Just a side note: object storage is itself implemented as file I/O on top of block storage; at the disk level in Linux there are only block-based drivers anyway.

So is a Data Lake just S3 storage?

The answer is yes. A Data Lake is just S3 storage into which you load, and keep, whatever you want. All cloud providers offer S3 or object-based storage, and on bare metal you can run Ceph- or MinIO-based S3-compatible storage.

Data LakeHouse

So then what is a Data LakeHouse? Well, it is like a Data Warehouse but on S3.

This means that it is like a Data Lake in that you can store data here, and it is like a Data Warehouse in that the data stored here can be queried through a SQL interface for analytics and reporting.

How is the SQL querying part achieved? There are two parts to this, apart from S3 as the data store.

First, SQL tables are created as metadata for the data held in S3. These tables and schemas are persisted in a database, which also holds the S3 locations of the objects.

Second, there is a SQL query engine that can parse and plan the query, split it up, and distribute the work among worker nodes.

There is also a third part: the data format (Parquet, ORC) in which the data is stored more efficiently; but we will leave that for later.

We will build a data lakehouse with the most popular kids on the block: S3 and Trino (previously known as PrestoSQL). The metastore Trino needs is the standalone Apache Hive metastore (which in turn uses a database, Postgres, to store the metadata). The data itself goes into S3.

You can follow the POC on a bare-metal Kind Kubernetes cluster via https://github.com/alexcpn/presto_in_kubernetes or https://medium.com/techlogs/prestosql-trino-with-s3-in-kubernetes-baremetal-eb11cab1b51f. The repo contains all the Kubernetes resources you need to get the system up and running.

What we are building is shown below.

Source: Author

This is similar to what the comparable cloud provider services offer. Note that I have not added the LakeFS part to the POC yet; that is for versioned data in S3, which could be handy if you are using this to store a lot of training data for ML/AI use cases.

Source: Author

Data LakeHouse Use-cases

To understand Data LakeHouse use cases better, let's try a few things.

  1. Trying to use it as a DataBase to Create, Insert and Update data

Create and Insert work; Update does not work, nor is it intended to!

CREATE TABLE test4.employee (id int, name varchar, salary int)
WITH (format = 'ORC');

INSERT INTO test4.employee VALUES
  (1, 'Jerry', 5000),
  (2, 'Tom',   8000),
  (3, 'Kate',  6000);

SELECT * FROM test4.employee LIMIT 10;

Note — we are executing this via the trino-cli; you can also do this via Redash or similar.

kubectl exec -it trino-cli /bin/bash 
[trino@trino-cli /]$ /bin/trino --server trino:8080 --catalog hive --schema default
trino:default> select * from test4.employee;

For Update, you need to create a table with `transactional` set to true (an ACID transactional table). Even then, with Presto/Trino you cannot currently Insert or Update.

Note that Presto cannot create or write to Hive transactional tables yet. You can create and write to Hive transactional tables via Hive or via Spark with Hive ACID Data Source plugin and use Presto to read these tables. https://trino.io/blog/2020/06/01/hive-acid.html

update test4.employee set description ='Employee Jerry description' where id=1;
Query 20220622_105039_00034_7tuyy failed: Hive update is only supported for ACID transactional tables

trino:default> CREATE TABLE test4.employee5 (id int, name varchar, salary int)
-> WITH (
-> format='ORC',
-> transactional=true
-> );
CREATE TABLE (works)
trino:default> ALTER TABLE test4.employee5 ADD column description varchar;
ADD COLUMN (works)
trino:default> INSERT INTO test4.employee5 VALUES
-> (1, 'Jerry', 5000, 'test'),
-> (2, 'Tom', 8000, 'test'),
-> (3, 'Kate', 6000, 'test');
INSERT: 3 rows
Query 20220623_043713_00027_7tuyy, FAILED, 3 nodes
Splits: 8 total, 8 done (100.00%)
10.69 [0 rows, 0B] [0 rows/s, 0B/s]
Query 20220623_043713_00027_7tuyy failed: Invalid method name: 'alter_table_req'

// Test delete from a non transactional table
trino:default> delete from test4.employee2 where id=1;
Query 20220629_195118_00005_q8zcg failed: Deletes must match whole partitions for non-transactional tables

Maybe they will eventually make this work, but if you read their blog on how they are making it ACID compliant (virtual row IDs, directories in S3, compaction, etc.) it seems pretty complex.

The moral of the story is: don't treat a data lakehouse as a database. For Delete, some support may be needed, and may be added in the future, for GDPR compliance.

Here is what Databricks, of the Delta Lake offering, says about storing GDPR-compliant data in a Data Lake (a small sketch of the pseudonymization idea follows the quote):

Data lakes are inherently append-only and do not support the ability to perform row level “delete” or “update” operations natively, which means that you must rewrite partitions of data. Typical data lake offerings do not provide ACID transactional capabilities or efficient methods to find relevant data. As a result, organizations that manage user data at this scale often end up writing computationally difficult, expensive, and time-consuming data pipelines.

How to address such a challenge:

“Pseudonymization,” or reversible tokenization of personal information elements (identifiers) to keys (pseudonyms) that cannot be externally identified.

Storage of information in a manner linked to pseudonyms rather than identifiers.
https://docs.databricks.com/security/privacy/gdpr-delta.html
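As a minimal sketch of the pseudonymization idea (illustrative only, not Databricks' implementation): keep a small, deletable mapping of identifiers to pseudonyms in a regular mutable store, and write only the pseudonyms into the append-only lake.

import hmac, hashlib

SECRET = b"rotate-and-protect-this-key"      # kept outside the lake
pseudonym_to_user = {}                       # stand-in for a small mutable DB table

def pseudonymize(user_id: str) -> str:
    token = hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()
    pseudonym_to_user[token] = user_id       # reversible only via this table
    return token

# Only the pseudonym lands in the immutable data lake record.
lake_record = {"user": pseudonymize("alice@example.com"), "trip_distance": 3.2}

# "Right to be forgotten": delete the mapping; the lake data is now anonymous.
pseudonym_to_user.pop(lake_record["user"], None)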

Now on to some positive use cases.

Data Analysis Use Case — Load Parquet data to S3 and analyze via SQL (and BI dashboards via GUI)

Before we start with SQL, let's clear the air regarding ML and AI workloads. For those use cases, the fastest way to get at the data is locally (e.g. pandas read_parquet); the next best is to get it directly from S3 into a Pandas DataFrame using an HTTP/S3 library.
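For example, here is a minimal sketch of reading Parquet straight from S3 into Pandas; it assumes pandas, pyarrow and s3fs are installed, and the bucket path and MinIO endpoint/credentials below mirror the POC layout but are otherwise made up.

import pandas as pd

df = pd.read_parquet(
    "s3://test/warehouse/nyc_in_parquet.db/tlc_yellow_trip_2022/yellow_tripdata_2022-01.parquet",
    storage_options={                      # passed through to s3fs
        "key": "minio",                    # made-up credentials
        "secret": "minio123",
        "client_kwargs": {"endpoint_url": "http://minio:9000"},
    },
)
print(df[["trip_distance", "fare_amount"]].describe())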

You can of course query via SQL-type libraries (SQLAlchemy for Python) and load the resulting rows into a Pandas DataFrame, but that is just a roundabout, slow and inefficient way of doing things. For example, this use case from Kaggle for the dataset https://www.kaggle.com/c/nyc-taxi-trip-duration is much easier with Pandas and scikit-learn, and maybe a simple custom neural net, than with SQL functions.

So for ML/AI workloads a Data Lake is sufficient, with some data versioning on top of S3 such as LakeFS; a Data LakeHouse is not needed.

Let's use the publicly available NYC Taxi dataset for the data analysis use case (https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). Earlier these were published as CSV files and were pretty big; now they are published in the Apache Parquet format, which is well compressed.

We download yellow_tripdata_2022-01.parquet from there. We first create a schema, then create the following path in S3 and upload the file there.

CREATE SCHEMA nyc_in_parquet;

Upload yellow_tripdata_2022-01.parquet to S3 at a particular location, the location over which we will later create a table to access it (a boto3 sketch of the upload follows the path below):

s3a://test/warehouse/nyc_in_parquet.db/tlc_yellow_trip_2022/yellow_tripdata_2022-01.parquet
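If you want to script that upload, a hedged sketch with boto3 could look like the following; the MinIO endpoint and credentials are assumptions, and the MinIO console or the mc CLI work just as well.

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",   # made-up in-cluster MinIO endpoint
    aws_access_key_id="minio",          # made-up credentials
    aws_secret_access_key="minio123",
)
s3.upload_file(
    "yellow_tripdata_2022-01.parquet",
    "test",                             # bucket
    "warehouse/nyc_in_parquet.db/tlc_yellow_trip_2022/yellow_tripdata_2022-01.parquet",
)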

Let’s inspect this file

$ pip install parquet-tools
$ parquet-tools inspect yellow_tripdata_2022-01.parquet

############ file meta data ############
created_by: parquet-cpp-arrow version 7.0.0
num_columns: 19
num_rows: 2463931
num_row_groups: 1
format_version: 1.0
serialized_size: 10386
############ Columns ############
VendorID
tpep_pickup_datetime
tpep_dropoff_datetime
passenger_count
trip_distance
RatecodeID
store_and_fwd_flag
PULocationID
DOLocationID
payment_type
fare_amount
extra
mta_tax
tip_amount
tolls_amount
improvement_surcharge
total_amount
congestion_surcharge
airport_fee
############ Column(VendorID) ############
name: VendorID
path: VendorID
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: 47%)
etc

We use the data types above to pick the right SQL data types in the CREATE TABLE below, and we specify the S3 location where we stored the file.

CREATE TABLE nyc_in_parquet.tlc_yellow_trip_2022 (
vendorid INTEGER,
tpep_pickup_datetime TIMESTAMP,
tpep_dropoff_datetime TIMESTAMP,
passenger_count DOUBLE,
trip_distance DOUBLE,
ratecodeid DOUBLE,
store_and_fwd_flag CHAR(1),
pulocationid INTEGER,
dolocationid INTEGER,
payment_type INTEGER,
fare_amount DOUBLE,
extra DOUBLE,
mta_tax DOUBLE,
tip_amount DOUBLE,
tolls_amount DOUBLE,
improvement_surcharge DOUBLE,
total_amount DOUBLE,
congestion_surcharge DOUBLE,
airport_fee DOUBLE)
WITH (FORMAT = 'PARQUET',
EXTERNAL_LOCATION = 's3a://test/warehouse/nyc_in_parquet.db/tlc_yellow_trip_2022')
;

That's it. Let's query the data and see.

SELECT count(*) FROM  nyc_in_parquet.tlc_yellow_trip_2022 ;
_col0
---------
2463931
(1 row)
Query 20220625_100612_00061_7tuyy, FINISHED, 2 nodes
Splits: 4 total, 4 done (100.00%)
13.60 [2.46M rows, 16KB] [181K rows/s, 1.18KB/s]

Cool: 2.5 million rows in this small 37 MB Parquet file, and that too straight from S3.

Let's run a more complex query.

Select dayofweek, AVG(fare_amount) FROM
( select fare_amount, date_format(tpep_pickup_datetime, '%a') as dayofweek
from nyc_in_parquet.tlc_yellow_trip_2022)
group by dayofweek ;
dayofweek | _col1
-----------+--------------------
Wed | 12.290740850208424
Mon | 13.422371442169597
Tue | 12.531063256070471
Thu | 12.408311232285685
Sat | 12.51759075286025
Fri | 13.581231240928696
Sun | 13.811130821498017
(7 rows)
Query 20220625_230058_00076_7tuyy, FINISHED, 2 nodes
Splits: 7 total, 7 done (100.00%)
17.92 [2.46M rows, 11.4MB] [138K rows/s, 652KB/s]

The output comes back in a few seconds; not bad for about 2.5 million rows. This is because Parquet is a columnar format, and because instead of varchar we have proper data types for the numeric fields.
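You can see the columnar advantage even outside Trino: a Parquet reader can pull only the columns a query touches instead of whole rows. Here is a small sketch with pyarrow, reading the same file locally.

import pyarrow.parquet as pq

# Only the two columns the query needs are read and decompressed;
# the other 17 columns in the file are never touched.
table = pq.read_table(
    "yellow_tripdata_2022-01.parquet",
    columns=["tpep_pickup_datetime", "fare_amount"],
)
print(table.num_rows, table.schema.names)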

More queries here https://github.com/alexcpn/presto_in_kubernetes/blob/main/testqueries.md .

Below is another query, to bucket trips by distance and check how many occurrences fall in each bucket and the average fares; classic data analysis. There are some bad values in the data that I am excluding (fares below 1, above about 400k, etc.).

select t.range, count(*) as "Number of Occurrences", ROUND(AVG(fare_amount),2) as "Avg",
ROUND(MAX(fare_amount),2) as "Max" ,ROUND(MIN(fare_amount),2) as "Min"
from (
select
case
when trip_distance between 0 and 9 then ' 0-9 '
when trip_distance between 10 and 19 then '10-19'
when trip_distance between 20 and 29 then '20-29'
when trip_distance between 30 and 39 then '30-39'
else '> 39'
end as range ,fare_amount
from nyc_in_parquet.tlc_yellow_trip_2022) t
where fare_amount > 1 and fare_amount < 401092
group by t.range;
range | Number of Occurrences | Avg | Max | Min
-------+---------------------+--------+-------+------
0-9 | 2260865 | 10.28 | 720.0 | 1.11
30-39 | 1107 | 104.28 | 280.0 | 5.0
10-19 | 126136 | 43.8 | 413.5 | 2.0
> 39 | 42556 | 39.11 | 668.0 | 1.99
20-29 | 19133 | 58.62 | 250.0 | 2.5

It is also easy to create BI-type dashboards on top of this using tools like Redash.

We can do all the same things in Python or Go; sample Python code is here: https://github.com/alexcpn/presto_in_kubernetes/blob/main/python/test.py
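For reference, here is a minimal sketch of querying the same table from Python with the trino client package (pip install trino); the host, port and catalog mirror the trino-cli invocation earlier, and the user name is arbitrary since this POC has no authentication.

import trino

conn = trino.dbapi.connect(
    host="trino",            # the in-cluster service name from the POC
    port=8080,
    user="analyst",          # any user name; no auth in this setup
    catalog="hive",
    schema="nyc_in_parquet",
)
cur = conn.cursor()
cur.execute("SELECT count(*) FROM tlc_yellow_trip_2022")
print(cur.fetchall())        # expect the same 2463931 row count as above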

So for data analytics and BI, the Data LakeHouse is a good fit. Basically, any data that does not change, like counter data or log data, can be dumped here, and one can either use the Data Lake part alone for ML/AI or use the SQL engine part for SQL-based analysis.
