From Data Lakes to Data Reservoirs

Create Clean, Beautiful, Protected Data Resources with Apache Spark and Delta Lake

Scott Haines
Towards Data Science
12 min read · Jul 31, 2020


Water crashing together. 3 states of water, from peaceful to disorderly. Intersecting to create beauty in chaos.
There is beauty in the chaos. Photo Credit: Unsplash @powwpic

Data comes in all shapes and sizes

It is amusing that when we talk about data, the best analogies are typically rooted in water. This makes sense: to fathom the idea of data — which comes in all shapes and sizes — people tend to embrace the abstract. One could liken a single data record to a droplet of water; many droplets of the same kind (record type) cluster and coalesce into isolated data pools (tables/directories); and it is these pools that reside within the larger construct, the body of the Data Lake at large.

A Data Lake is an ecosystem that enables the coexistence of many types of records across myriad use cases, but when viewed from a high enough vantage point it appears to be a simple vessel for storing an almost unbounded amount of data.

Embrace the abstract

Much of what we do as engineers follows along the lines of taking a large concept, or maybe a more loosely assembled "idea", and disassembling it into its smaller constituent parts in order to understand how to build a system or framework that "acts like" that thing.

This same tactic can be applied to your approach to Data Modeling, and more generally to how you define the structured data which represents the Events, Metrics, and Entities (and even more specifically the Behavior and Trends) you are looking to bubble up from within your data stores. This data is your point-in-time lens into what occurred in the past and what is happening now, and it can be used as a frame of reference to predict what "could" happen in the future.

However, perfect data stored in an imperfect system can lead to large-scale, even catastrophic, failure down the line. I like to think of this as polluting the Data Lake. No one wants that, and many new frameworks have been released over the past years to attempt to fix the heartache associated with the many pitfalls and unknown unknowns that come hand-in-hand with operating a data ecosystem at large (or even small) scale.

So here we are. We know the following to be true:

  1. Data can be complex and wild.
  2. Data stored “can” be used to solve future problems and provide insight into what occurred in the past.
  3. Data changes over time. This can lead to corruption of historic data. Don’t poison the well!
  4. Water based analogies for data are almost as fun as bad puns!

Given the preceding points (really everything other than #4), I'd like to dive into the heart of this post, which is centered around the concept of converting what you know as a Data Lake into what I like to think of as a Data Reservoir. I'll also introduce some of the frameworks in use today that can help add structure and love to your Data Lakes, e.g. Apache Parquet, Apache Spark, and Delta Lake.

What is the point of the Data Lake?

Abstract lighting and atmosphere behind a grid like structure with exploding glass.
The Data Lake. It can feel structured but built upon a broken idea. Photo Credit: fabioha Unsplash

The Data Lake has one main goal: to act as an inexpensive, horizontally scalable (or, effectively, infinitely scalable) centralized silo of raw, semi-structured, or structured data, loosely organized according to file system layout "best practices". For example, some well-known best practices center around a) limiting the total number of files in individual directories so listing operations stay quick, and b) following a directory scheme that groups data into understandable entities.

For example, hdfs://insights/hive/tableName/year=2020/month=7/day=30. These two best practices team up to let the file system quickly skip non-relevant data without the use of an index and easily list the files within a directory.
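For illustration, here is a minimal sketch of how Spark's partitionBy can produce exactly that kind of layout; the paths and column names are placeholders, not taken from any real system:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// A hypothetical raw dataset that already carries year, month, and day columns.
val insights = spark.read.json("hdfs://insights/raw/tableName")

// partitionBy writes out the year=2020/month=7/day=30 directory scheme described above,
// so readers can prune whole directories without consulting an index.
insights.write
  .partitionBy("year", "month", "day")
  .parquet("hdfs://insights/hive/tableName")
```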

Inexpensive Storage

Cloud Computing was transformative in that it enabled companies of all sizes to offload the management of their servers and networks to a few cloud providers who were really good at running servers. These same cloud vendors (Amazon, Microsoft, etc.) then started offering other services built on top of their cloud offerings, like Amazon's S3 and Microsoft's Blob Storage. It took a while, but eventually there was a general consensus that the cloud was secure enough to store precious data — without the threat of losing it to bad backups, bad hardware, bad racks/switches, etc. — and a novel idea started to take hold. Companies realized they could store all of their data at a fraction of the cost they were currently spending on in-house solutions.

“What if we could store all* of our data for the same cost as we currently spend on only a fraction of all our data”

Simpler Developer Operations

This idea led to a shift from more expensive Hadoop clusters, and the dev-ops teams that came along with scaling Hadoop for large-sized enterprises, to a notion of commodity storage and almost infinite scalability.

The Data Lake took the form of a centralized data storage tier that could be used as a unified staging ground for all data within a company or organization, feeding data, via Extract Transform Load (ETL) pipelines/processes, into stricter and more standardized Data Warehouses, traditionally running some variant of SQL (Oracle, MySQL, Postgres, SQL Server), which yielded orders-of-magnitude better response times due to the analytical nature of these OLAP (online analytical processing) engines.

Essentially you could just defrost your old data when you were ready, reread it and reshape it to solve many problems, but this was not typically the case.

Flow Diagram of the Extract, Transform, Load process. Extract raw rows of data and transform into record, insert into a db
The Data Lake solved the Data Silos problem. ETL was king: Photo Credit: Microsoft

Pipe dreams

The real pipe dream was that the Data Lake could also be used to solve future problems that were yet to be defined. What exactly does this mean? Well, in the absence of a solid idea, many companies would store data in the hope of eventually using it. However, around 73% of the data stored by companies is never accessed for analytics, and I would be willing to bet that, for many companies, more than 50% of the data stored is never accessed even once.

73% of the Data Stored by companies is never accessed for Analytics…

Many companies enacted their own policies around how data was fed into these Data Lakes, aiming to preserve the future processing capabilities of their data by investing some upfront design time in thinking about how the data would be accessed 1 / 3 / 5 years down the road. While this may have felt like magical thinking, there were just as many companies who decided to wing it, writing semi-structured or unstructured data (i.e. JSON or CSV) into many individual files or into large newline (\n) separated files that could be read during bulk operations. But without schemas, read/write policies, and governance comes eventual disorder…

For many companies it was the Wild West

Like many things, what starts off with the best of intentions tends, in time, to transform and take on a new shape without rules and standards. The Data Lake can easily become an inexpensive, horizontally scalable, centralized data mess that operates without any rules or order, much like the "Wild West". While we can all get behind cowboys, and at times probably dream of a life without rules and regulations, in our heart of hearts we most likely arrive at a similar conclusion:

“Nothing without rules maintains any sense of order or balance”.

So how do we fix the potential mess?

Structured Data Lakes: Policies, Frameworks and Governance

Order in an endless field of particles of light. Could also just be Christmas somewhere!
The emergence of Structured Data Lakes is leading towards a new Data Enlightenment! Photo Credit: sortino via Unsplash

The Emergence of Standards

Good ideas take hold and spread like wildfire. Recently, the data community has standardized on at least one core data format that is good enough to get behind: the file storage format Parquet. We are going to learn a little more about why this is such an excellent choice for our data at rest (data at rest just means data that isn't currently in active memory).

Thanks to Google's Dremel paper (and to Twitter and Cloudera, who built Parquet on top of it) for this gift.

Parquet is a file format that consists of a header, row and columnar data, pages, and a footer.
What makes Parquet Parquet? Credit: Apache Parquet docs

Parquet, while looking like some kind of voodoo, is actually just a really well-executed columnar data storage primitive.

When you break down the file format, you will see that you have rows, columns, and types.

This allows large collections of records encapsulated by a common format (schema) to be efficiently compressed or compacted into a much smaller footprint.

For example, if I have 40 million records stored as JSON at between 1 and 2 KB per record, I could reasonably have around 40–80 GB of files.

Now, if you were to take that JSON and convert it to Parquet, you could expect around a 50–80% savings in terms of file overhead. In addition to the cost savings comes the ability to easily partition your data and compress things further using snappy, gzip, lz4, or any splittable binary compression format.
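As a rough sketch of that conversion (the paths, the 128-file target, and the compression choice are all illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Read one day's worth of newline-delimited JSON records.
val events = spark.read.json("hdfs://insights/raw/events/2020/07/30")

events
  .repartition(128)                  // aim for roughly 128 files for the day
  .write
  .option("compression", "snappy")   // snappy-compress the column data in each file
  .parquet("hdfs://insights/parquet/events/year=2020/month=7/day=30")
```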

I can say from experience that around 40–80 million records of medium complexity (e.g. more than 20 fields) can be broken down across 128 partitions a day and stored in between 12 and 24 GB! This includes the Parquet headers and footers, as snappy-compressed partitions, and enables distributed reading of the dataset, which fits nicely back into many of our use cases centered around a Spark-centric data platform.

Benchmark showing the compression size of parquet, vs avro vs csv
Cloudera Benchmarking (parquet/avro/csv). Photo Credit: Cloudera: Parquet wins

However, reduction in size on disk isn't the only big win. Once your data is stored as Parquet, you have the ability to do efficient row- or column-level skipping within the data — so retrieval of 1–2 records out of, say, 40 million may only take around 5 seconds. Parquet is able to read just the footers to check whether some content lives inside a block of Parquet prior to reading the actual content, and it uses columnar statistics to speed up this process of row/column-level extraction and filtering. You can see this benchmark from Cloudera if you want to see how it compares to, say, Avro.
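As a quick, hypothetical sketch of what that skipping buys you: a filtered read like the one below lets Spark push the predicate down into the Parquet reader, where the footer's column statistics decide which row groups are even worth reading.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().getOrCreate()

// The filter on userId is pushed down to the Parquet reader; row groups whose
// min/max statistics cannot contain "u-12345" are skipped without being read.
val singleUser = spark.read
  .parquet("hdfs://insights/parquet/events")
  .where(col("userId") === "u-12345")

singleUser.show()
```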

Lastly, Parquet has been adopted as the de facto standard by many of the most widely used Data Engineering and Data Analytics frameworks and platforms in operation today, like Apache Spark (Spark SQL natively interoperates with Parquet out of the box), Presto, Delta Lake, Apache Hudi, and many, many more.

Apache Spark is the data Catalyst

Apache Spark can be thanked for holding the baton and leading the march in terms of organizing and unifying an entire community of data practitioners. Databricks, the company behind the Spark community, has gone to a lot of trouble to keep the community engaged and to continue producing amazing new technologies that live and run within the Spark ecosystem. One of the newer additions to the core Spark SQL engine, a framework named Delta Lake, adds atomic operations, upserts (insert or update), merge, delete, and additional conditional SQL logic directly into the Spark DSL. This subtle enhancement, written to work with existing Data Lake architectures, brings best practices including file system layout and operational observability, as well as nice audit trail support for operations done to the records within a Delta Table.
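To give a flavor of that DSL, here is a minimal sketch using Delta Lake's DeltaTable API for delete and merge; the table path and the updates dataset are hypothetical:

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().getOrCreate()

val events = DeltaTable.forPath(spark, "hdfs://insights/delta/events")

// Delete: drop records that fall outside our retention window.
events.delete(col("year") < 2019)

// Hypothetical staging data holding new and changed events.
val updatesDf = spark.read.parquet("hdfs://insights/staging/event_updates")

// Upsert (merge): update events that already exist, insert the ones that don't.
events.as("target")
  .merge(updatesDf.as("updates"), "target.eventId = updates.eventId")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
```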

The DeltaLake Architecture: From Delta.io homepage
The Delta Lake Ecosystem: Photo Credit: Databricks/Delta

Remember earlier when I brought up the problem of poisoning the Data Lake, and how things needed some kind of order and governance in order to evolve into something greater? I called it the Data Reservoir, and I believe that Delta Lake has standardized exactly how a Data Lake should work, with the correct safeguards in place to ensure that the Data Lake remains clean and beautiful over time. Let's dive into how this is possible.

Delta Schemas and File Layout

From the first write into a Delta table, Spark uses the underlying StructType associated with the DataFrame it is writing into your Data Lake to ensure that the schema is preserved and that new writes adhere to that strict schema, unless they explicitly opt into schema evolution with the "mergeSchema=true" write option. Below is an example of writing a DataFrame to Delta.

Spark code showing how to write to DeltaLake with mergeSchema on
Writing a DataFrame in batch mode to a Delta.
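Since that snippet lives in a screenshot, here is a rough sketch of what such a write could look like; the reports DataFrame and the table path are stand-ins:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// A hypothetical reports DataFrame carrying year, month, and day columns.
val reports = spark.read.json("hdfs://insights/raw/reports")

// Batch write into a Delta table, partitioned by year/month/day, with mergeSchema
// allowing compatible schema evolution on append.
reports.write
  .format("delta")
  .mode("append")
  .option("mergeSchema", "true")
  .partitionBy("year", "month", "day")
  .save("hdfs://insights/delta/reports")
```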

This operation will be recorded in an atomic commit log.

Screenshot of the File System layout of a DeltaTable. Showcasing the commit logs
Example of the _delta_log from our operation above.
Screenshot of the File System layout of a series of reports saved as a DeltaLake table — see that it is just Parquet data
The rest of the DeltaLake Table. Partitioned by year,month,day.

This schema and file-location metadata is stored in what is called the _delta_log, which lives in the root of your Delta Table. This little directory sits alongside the data at the root of the file system hierarchy that encapsulates your "Table".

Example of the Delta Commit Log. Credit: Self

Now any additional operations against the Delta Table will record information about the files mutated, the current version of the table, and the schema associated with the table. This gives you eyes and ears into the operations that have been applied to the data stored in this Data Lake (Delta Lake).
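As a small, illustrative sketch (the path is a placeholder), Delta's history API surfaces that commit metadata directly:

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Each commit in the _delta_log records the operation, the resulting table version,
// a timestamp, and the parameters used — in other words, your audit trail.
DeltaTable.forPath(spark, "hdfs://insights/delta/reports")
  .history()
  .select("version", "timestamp", "operation", "operationParameters")
  .show(truncate = false)
```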

Reading from DeltaLake is a Breeze

Reading and Transforming the data for Analytics is simple thanks to Delta. Credit: Self
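Since that snippet is a screenshot, here is a rough sketch of what the read could look like; the table path and grouping columns are stand-ins:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().getOrCreate()

// Load the table root (no explicit year/month/day in the path); the where clause on the
// partition column lets Spark skip every directory except month=6.
val juneReports = spark.read
  .format("delta")
  .load("hdfs://insights/delta/reports")
  .where(col("month").equalTo(6))

juneReports.groupBy("year", "month", "day").count().show()
```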

As you can see from the above code snippet, reading from Delta is simple. As a neat fun fact, in the query above we don't reference a specific partition path in our load command; instead we use what is known as partition pushdown to skip directories and only scan the contents of the month of June (the col("month").equalTo(6) in our where clause). This takes advantage of the file system layout so you don't have to load all table entities into memory. Saves on time and $$$.

Batch Write and Full Continuous Streaming Write are supported

Delta handles Batch Read / Write and Streaming Read / Write. Sweet. Credit: Self
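That snippet is also a screenshot, so here is a rough sketch of a streaming read off the same table feeding a downstream streaming write; the checkpoint and output paths are stand-ins:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Treat the Delta table as a streaming source: every new commit becomes a micro-batch.
val reportStream = spark.readStream
  .format("delta")
  .load("hdfs://insights/delta/reports")

// Feed a downstream Delta table continuously, chaining jobs without an external scheduler.
reportStream.writeStream
  .format("delta")
  .option("checkpointLocation", "hdfs://insights/delta/_checkpoints/daily_rollup")
  .start("hdfs://insights/delta/daily_rollup")
```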

Above is an example of creating a readStream off of the Delta table we just created with the earlier batch process. The novel thing that Delta brought to the table was mixed use case support for batch read/write and streaming readStream/writeStream. Think about Airflow, which solves some difficult use cases but is basically just triggering child jobs when a parent job completes; Delta Lake, on the other hand, allows you to generate sophisticated streaming DAGs (Directed Acyclic Graphs) of Spark jobs, which should make you drool just a bit if you love Spark. Spark to the rescue to solve all your Spark-related needs.

Shows the Zeppelin Graph of a point in time query off of a Streaming DeltaTable. Use this for live graphs
Zeppelin View: Shows how to use a Batch query to read from a Streaming Table (see query above to view the windowed writeStream)

You may have a few upstream jobs that must complete to enable downstream jobs to do their work, and on and on, like a traditional data conveyor belt. Well, now you can orchestrate these kinds of experiences utilizing a full end-to-end streaming system — consider it your data lineage pipeline on steroids — or your Data Engineering job just getting simpler.

Given that the Data Lake in today's data ecosystems needs to be capable of near real-time insert/upsert/merge, accurate historical analysis, and efficient ETL into machine learning jobs and faster online feature stores, while also being a pillar of solid, trustworthy data, solving all of those use cases can be done with Delta Lake alone. I think of it as the Swiss Army knife for Data Engineering and Analytics, and as a solid foundation for ETL into faster, in-memory data stores like Redis or Elasticsearch, depending on your data retrieval SLAs.

Conclusion

The traditional Data Lake is usually unstructured or semi-structured, and the quality of the data is only as good as the teams publishing into that Data Lake. That process can easily become polluted and break down over time, causing the dreaded poisoning of the Data Lake.

You can solve your future unknown unknowns by relying on the systems you are probably already using (S3, Apache Spark, and perhaps Parquet, if it has come up or you are already using it). Delta Lake just makes it easier for you to move fast in a standard and structured way, and it brings joy back to the table when working with data across your whole organization.

Cheers.


Distinguished Software Engineer @ Nike. I write about all things data, my views are my own.