
How to Ingest and Consume Data from Azure Data Lake

Analysis on ingestion/consumption patterns including delta lake PoC

1. Introduction

A lot of companies consider setting up an Enterprise Data Lake. The idea is to store data in a centralized repository. The main stakeholders of the data lake are the following two entities:

  • Data producer: Entity that ingests data to the data lake. This is typically the entity that does not profit from the data lake directly and prefers an easy way of ingesting data without much overhead (kind of fire and forget)
  • Data consumer: Entity that uses the data to create business value. This is typically the entity that profits most from the data lake and prefers that data can easily be consumed without much rework (data is partitioned correctly, not many small files, etc)

In this blog post, four different patterns of data ingestion are discussed in chapter 2. Subsequently, two consumption patterns are elaborated on in chapter 3. In chapter 4, a proof of concept using ADF and delta lake is discussed, based on this git repo [adf-deltalake-ingestion-consumption](https://github.com/rebremer/adf-deltalake-ingestion-consumption), see also the picture below. Finally, a conclusion is drawn in chapter 5.

2. Data producer: Ingestion patterns

In this chapter, four types of ingestion patterns are distinguished using the following two dimensions:

  • Raw data versus end of day aggregations: In case raw data is used, all source data is ingested to target. In case end of day is used, source data is aggregated into a meaningful form before it is ingested to target
  • Snapshots versus deltas: In case snapshots are used, all source data is ingested to target every day. In case deltas are used, only mutated source data is ingested to target.

In the remainder of this chapter, the four patterns, which are combinations of the above two dimensions, are discussed. To clarify further, an example data set is used in which cards 1 and 2 are produced on the first day and then two card transactions are done on card 1 on the second day, see below.
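To make the later sketches concrete, below is a minimal PySpark sketch of what such a data set could look like; the schema and values are assumptions, the actual example is shown in the picture below.

```python
# Minimal sketch of the example data set (schema and values are assumptions,
# the actual example is shown in the picture below).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Day 1: cards 1 and 2 are produced
cards = spark.createDataFrame(
    [(1, "2022-01-01"), (2, "2022-01-01")],
    ["card_id", "created_date"],
)

# Day 2: two card transactions are done on card 1
transactions = spark.createDataFrame(
    [(1, 1, 10.0, "2022-01-02"), (2, 1, 20.0, "2022-01-02")],
    ["transaction_id", "card_id", "amount", "transaction_date"],
)
```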

In the next subparagraphs, the patterns are discussed with their pros and cons.

2.1 Pattern P1: raw data – snapshot

In the snapshot pattern, the full raw data set is sent every day. For the example data set above, this would mean the following on day 1 and day 2.

Pro/con analysis is as follows:

  • (pro) Easiest pattern to adopt for the data producer; only a data dump needs to be done, there is no need to keep track of changes
  • (con) Infeasible for large data sets; it is costly to send the dataset over every day, and copying a large dataset also takes a performance hit (hours of copying). The same holds for backups.
  • (con) Data consumer needs to keep track of changes. Since data consumers typically have less knowledge of the data than the data producer, this can be error-prone. This is especially challenging when data is copied to the consumer’s own environment (in case no copying is done and data virtualization is used, delta lake can help)
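In the PoC of chapter 4, ingestion is done with ADF dataflows; purely as an illustration, a minimal PySpark sketch of the snapshot pattern could look as follows (it reuses the hypothetical `transactions` DataFrame from the sketch above, and the target path is an assumption):

```python
# Hypothetical sketch of pattern P1: the complete raw extract is landed every day
# and overwrites the previous day's snapshot (path is an assumption).
(transactions
 .write
 .format("delta")
 .mode("overwrite")                     # snapshot: replace yesterday's data entirely
 .save("/delta/raw/transactions_snapshot"))
```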

2.2 Pattern P2: raw data – delta

In the delta pattern, new data is inserted and existing data is updated. For immutable datasets, new data will always be inserted since no updates can occur. In case a dataset is mutable, updates can occur. In that case, it is key that a unique ID is known in both source and target so that data can be matched. For the example data set, pattern 2 can be applied as follows (inserts only):

Pro/con analysis is as follows:

  • (pro) More efficient pattern: no cost of sending over large datasets, performance is likely better, and there is less chance of errors/timeouts when copying large data sets.
  • (con) Data consumer must have all previous data increments available. In case one increment is missing, the data set is corrupt.
  • (pro) Data consumer can easily identify changes and do upserts in their own environment. Deltas can also easily be handled by ADF dataflows and delta lake, see the sketch after this list.
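A minimal sketch of how such a delta increment could be applied with the Delta Lake MERGE API is shown below; the table path, the hypothetical `daily_increment` DataFrame, and the key column are assumptions.

```python
# Hypothetical sketch of pattern P2: apply a daily increment to an existing delta
# table. A unique ID (here transaction_id) must exist in both source and target.
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/delta/raw/transactions")   # assumed path

(target.alias("t")
 .merge(daily_increment.alias("s"),            # assumed DataFrame with today's mutations
        "t.transaction_id = s.transaction_id")
 .whenMatchedUpdateAll()                       # update rows that were mutated
 .whenNotMatchedInsertAll()                    # insert rows that are new
 .execute())
```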

2.3 Pattern P3: end of day – snapshot

In the end of day – snapshot pattern, the goal is not to synchronize the raw data source and data sink. Instead, an aggregation is created and sent at the end of the day. This is typically done because of the following:

  • Consumers are only interested in the end results (and are NOT interested in N mutations that lead up to this end result)

For the example data set, it could be that consumers are only interested in the total_amount per card, see below.

Pro/con analysis is as follows:

  • (con) Data loss occurs since only aggregated data is available to consumers.
  • (pro) The pattern can be more efficient, for instance in case data consumers are only interested in the end results and not in the N mutations that lead up to the end result.
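Purely as an illustration, such an end of day aggregation could be computed and landed as a daily snapshot with the PySpark sketch below (it again uses the hypothetical `transactions` DataFrame, and the target path is an assumption):

```python
# Hypothetical sketch of pattern P3: aggregate the raw transactions into
# total_amount per card and overwrite the aggregate table every day.
from pyspark.sql import functions as F

end_of_day = (
    transactions                              # raw transactions, see the sketch in the chapter intro
    .groupBy("card_id")
    .agg(F.sum("amount").alias("total_amount"))
)

end_of_day.write.format("delta").mode("overwrite").save("/delta/curated/card_totals")
```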

2.4 Pattern P4: end of day – delta

Pattern P4: end of day – delta is almost the same as Pattern P3: end of day – snapshot, but only mutated data is sent. For the example data set, this means the following:

Basically, the same pro/con analysis applies to pattern 3 and pattern 4. The only difference is that deltas can be more efficient in case the aggregations are still big.
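As a sketch under the same assumptions, the difference with pattern P3 is that only the aggregates of mutated cards are sent (a hypothetical `end_of_day_delta` DataFrame), which are then merged into the existing aggregate table:

```python
# Hypothetical sketch of pattern P4: only the aggregates of mutated cards are
# sent and merged into the running aggregate table (path is an assumption).
from delta.tables import DeltaTable

card_totals = DeltaTable.forPath(spark, "/delta/curated/card_totals")

(card_totals.alias("t")
 .merge(end_of_day_delta.alias("d"),    # assumed DataFrame: totals of mutated cards only
        "t.card_id = d.card_id")
 .whenMatchedUpdateAll()                # replace totals of mutated cards
 .whenNotMatchedInsertAll()             # insert totals of new cards
 .execute())
```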

2.5 Conclusion

All four patterns have their pros and cons. However, for standardization on the enterprise data lake, it can be a good idea to use Pattern P2: raw data – delta as the default. The reasons are as follows:

  • No data loss occurs when all raw data is sent from source to target. It is hard for producers to predict what aggregations a consumer will need; instead, consumers can decide themselves what aggregations they need. Multiple zones in a data lake can also help here, in which a landing zone is used to ingest all raw data and a bottled zone is used to create multiple aggregations that are ready for consumption
  • Delta is a more cost efficient pattern since less data needs to be copied. For large data sets, deltas can even be the only viable solution.

3. Data consumer: Consumption patterns

In this chapter, two different consumption patterns are discussed as follows:

  • Copy data: Consumer copies data from the data lake to their own environment
  • Virtualize data: Consumer uses data directly on data lake

In the next two paragraphs the pro/cons for each consumer pattern are discussed.

3.1 Pattern C1: Virtualize data

In the data virtualization pattern, a consumer queries directly on the data lake and data is not copied to their own environment.

  • (pro) Single source of truth, no data duplicates are created
  • (pro) Easy model to step in for consumers. This is especially true if consumers don’t have much technological knowledge and just want to create some (Power BI) reports
  • (pro) Delta lake can be used to simplify querying on the storage account using SQL, see the sketch after this list
  • (con) Not feasible for teams that have strong performance requirements (SLAs)
  • (con) Can be challenging to join enterprise data lake data with data that is not part of the data lake (e.g. data sitting in the consumer’s own SQL environment)
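A minimal sketch of this pattern, assuming a Spark-capable consumer (for example a Databricks or Synapse notebook, where `spark` is the notebook's SparkSession) and an assumed table path:

```python
# Hypothetical sketch of pattern C1: query the delta lake in place, no copy.
df = spark.read.format("delta").load("/delta/raw/transactions")   # assumed path
df.createOrReplaceTempView("transactions")

spark.sql("""
    SELECT card_id, SUM(amount) AS total_amount
    FROM transactions
    GROUP BY card_id
""").show()
```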

3.2 Pattern C2: Copy data

In the copy data pattern, a consumer offloads the data from the enterprise data lake to its own environment.

  • (pro) Consumer has full control of the data. This is especially important if the consumer has strict performance requirements (serving a 24×7 website) or strict SLAs (the team serving the data lake may not be available for questions at 03:00 AM)
  • (con) Copying can take a long time, especially if large data sets have to be copied to the consumer’s own environment
  • (con) Consumer needs to have technical knowledge to set up its own environment (ADF, databases, networking, etc).
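In the PoC, this copying is done with ADF pipelines; purely as an illustration, a PySpark sketch of the copy pattern could look like this (the source path and the target storage account are assumptions):

```python
# Hypothetical sketch of pattern C2: offload the current state of a delta table
# to the consumer's own storage account.
source = spark.read.format("delta").load("/delta/raw/transactions")    # assumed path

(source.write
 .format("delta")            # or parquet, depending on the consumer's own environment
 .mode("overwrite")
 .save("abfss://data@consumerstorage.dfs.core.windows.net/transactions"))  # assumed target
```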

3.3 Conclusion

Both patterns have their pros and their cons. However, for standardization on the enterprise data lake, it can be a good idea to use the Pattern C1: virtualize data as default. This can be explained as follows:

  • It prevents data from being unnecessarily duplicated and is the most cost efficient option.
  • Delta lake can also be leveraged to simplify data consumption, which will be elaborated on in the next chapter.
  • In case delta lake is used and a consumer still wants to copy data to its own environment, ADF can be used to copy data from the delta lake

4. ADF, Delta Lake and Spark: Proof of Concept

In this chapter, a proof of concept is built using the following architecture:

  • Data producer: Data from SQLDB is ingested using ADF dataflows to delta lake. Data is ingested using the four patterns discussed above (raw data versus aggregated data, full snapshots versus delta increments)
  • Data consumer: Once the data is ingested to delta lake, consumers can query the data using Spark in Databricks and Synapse notebooks. In case consumers want to copy data to their own environment, two ADF pipelines are created that can copy the snapshot or the latest delta to their own environment

See also image below:

The project can be found in the git repo [adf-deltalake-ingestion-consumption](https://github.com/rebremer/adf-deltalake-ingestion-consumption). Execute the following steps to run the PoC:

  1. Substitute variables and run [scripts/deploy_resources.sh](https://github.com/rebremer/adf-deltalake-ingestion-consumption/blob/main/scripts/deploy_resources.sh) to deploy ADF and deltalake.
  2. Run different producer pipelines to ingest data to the delta lake
  3. Consumer type 1: Create a Databricks workspace or Synapse workspace and run the [notebooks](https://github.com/rebremer/adf-deltalake-ingestion-consumption/tree/main/notebooks) to query data on the delta lake (a minimal query sketch is shown below)
  4. Consumer type 2: Run consumer pipelines to consume data to own storage account
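To give an indication of what the consumer type 1 notebooks do, a minimal query sketch is shown below; the table path is an assumption and the actual notebooks are in the repo linked above.

```python
# Hypothetical sketch of consumer type 1: query the delta lake from a Databricks
# or Synapse Spark notebook (spark is the notebook's SparkSession, path is assumed).
from delta.tables import DeltaTable

path = "/delta/raw/transactions"

# Current state after the latest ingested increment
spark.read.format("delta").load(path).show()

# Delta lake keeps the history of ingested versions, so earlier snapshots can be read back
DeltaTable.forPath(spark, path).history().show()
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```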

5. Conclusion

A lot of companies consider setting up an Enterprise Data Lake. The main stakeholders of the data lake are data producers and data consumers. Data producers are typically looking for a way to easily ingest data, whereas data consumers are the entities that create business value from the data. In this blog post, the following is argued:

  • Producers shall ingest raw data to the data lake and do this in deltas (increments). The rationale is that raw data prevents data loss and that increments are cost efficient and scalable
  • Consumers shall query data directly from the data lake and build data products there. The rationale is that this prevents data duplication and is cost efficient. A possible exception is when consumers need higher performance/stricter SLAs (e.g. serving a 24×7 website); in that case data can be copied to their own environment
  • Delta lake can help consumers to query data easily, whereas Data Factory supports delta as a sink, which can help producers to automatically add data in delta lake format. This is put into practice in this git repo: [adf-deltalake-ingestion-consumption](https://github.com/rebremer/adf-deltalake-ingestion-consumption)
