
In my current work, I build machine learning models at different granularities, and a few of them are highly granular. More specifically, such models require the evaluation of many hypotheses. Recently, I had to build 22 different models, where each model was trained on 60 million data points with 2,000 features. Each also required multiple iterations of hyper-parameter tuning. As a team, we had various hypotheses, and we needed to evaluate each one: train a model and measure its accuracy to validate the hypothesis. If the hypothesis held, we had to apply it to the remaining 21 models.
How did I approach it?
Since I received the data in Parquet format, I ended up using Parquet to train the models. It took days to train a single model and weeks to validate a single hypothesis. I tested many approaches to reduce the computation time, but none were effective. Finally, I switched from Parquet to Delta as the data format.
In this article, we will analyze the specific Delta features that help reduce the computation time needed to train a model.
Delta Lake
Delta Lake is an open-source storage layer that enables ACID properties in a distributed environment. I keep the introduction short here; we will explore the Delta format in more detail in the upcoming sections.
A common concern from most folks
The purpose of Delta Lake is to support ACID properties – how does that help reduce the computation time to train models?
This was the very first question I was asked. Yes, Databricks developed this storage layer for a specific customer who needed multiple clusters to read and write files simultaneously.
Delta Lake is not limited to that; it also removes a few other obstacles faced by Data Scientists and Data Engineers.
- Data Skipping: With the Delta format, you need not scan the entire dataset. As new data is inserted into a Databricks Delta table, file-level min/max statistics are collected for all columns, which help filter files effectively.
- Z-order: In addition to plain data skipping, Z-ordering enables data skipping across multiple dimensions (columns).
- Effective caching: The Delta cache accelerates data reads by creating copies of remote files in the nodes' local storage using a fast intermediate data format.
- Scalable metadata handling: In the big data world, even metadata has the characteristics of big data. Delta treats metadata as big data (i.e., it enables Spark to process metadata in a distributed manner).
- Time travel: Delta enables data versioning.
- Unified platform for batch and streaming: A table in Delta Lake can ingest and handle both batch and streaming data.
- Updates and deletes: Delta Lake supports Scala / Java (and Python) APIs to merge, update, and delete datasets; see the short sketch after this list.
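As a quick illustration of the last point, the same update and delete operations are also exposed through a Python API (delta.tables.DeltaTable). The snippet below is a minimal sketch, assuming the delta-spark package, an active spark session, and a hypothetical table path:

```python
from delta.tables import DeltaTable  # provided by the delta-spark package

# Hypothetical path to an existing Delta table; `spark` is the active SparkSession.
sales_table = DeltaTable.forPath(spark, "/data/sales_delta")

# Delete rows that match a condition.
sales_table.delete("sales_quantity < 0")

# Update rows that match a condition.
sales_table.update(
    condition="location IS NULL",
    set={"location": "'unknown'"},
)
```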
The Delta file format has many features. However, in this article, we will deep dive into the following features, which enable faster data processing:
- Effective caching
- Data Skipping
- Z-order
Data read through the Delta cache is much faster than through the Spark cache
We all use caching as part of query optimization, so what is different about Delta caching? Azure Databricks provides two types of caching.
1) Apache Spark caching
It uses Spark's in-memory storage. Because the available memory is limited, it impacts other operations that run within Spark.
2) Delta Caching
It uses the workers' local disks. Since it does not consume Spark's memory, other operations running within Spark are not impacted. Although the Delta cache stores data on local disk, it remains highly effective due to its efficient decompression algorithms. It can be further optimized by using a "Delta Cache Accelerated" worker type: the Delta cache is enabled by default on these workers, and their SSDs are configured to use it effectively. In the Databricks cluster configuration, you must select the L-type (storage-optimized) workers to get a "Delta Cache Accelerated" cluster.
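Besides choosing the worker type, you can also check or enable the Delta cache from code. The snippet below is a minimal sketch, assuming a Databricks cluster where spark is the active session; spark.databricks.io.cache.enabled is the configuration key that controls the cache:

```python
# Enable the Delta (disk) cache explicitly; on "Delta Cache Accelerated"
# L-type workers this flag is already on by default.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Optionally pre-load the files backing a table into the cache.
# ("sales" is the example table used in the next section.)
spark.sql("CACHE SELECT * FROM sales")
```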

Data Skipping
In Parquet
Typically, a table is stored as multiple Parquet files. Assume a scenario where you have a table called "sales."
- It has four columns: "item," "location," "date_id," and "sales_quantity."
- It has 60 million data points distributed across 2,000 Parquet files.

Assume a query to filter sales from January 1st to 6th.
To filter the results, this query has to scan all 2,000 files.
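A minimal sketch of such a query against the Parquet version of the table; the storage path, year, and date format are assumptions made for illustration:

```python
# Read the "sales" table stored as plain Parquet (hypothetical path).
sales = spark.read.parquet("/data/sales_parquet")

# Filter sales between January 1st and 6th (assumed year and date format).
jan_sales = sales.filter(
    (sales.date_id >= "2021-01-01") & (sales.date_id <= "2021-01-06")
)

# With plain Parquet, this scan touches all ~2,000 files of the table.
jan_sales.count()
```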
In Delta
Assume we store the same table using the Delta format. In that case, each file carries minimum and maximum values for each column, an inherent feature of the Delta format.

The same query, run against the Delta table, uses these min/max values to skip files. As a result, fewer than 2,000 Parquet files have to be scanned (most probably fewer than 500), which drastically reduces the computation time.
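A minimal sketch of storing the same data as a Delta table and re-running the filter; the path and table name mirror the example above:

```python
# Write the same data in Delta format; Delta collects file-level min/max
# statistics automatically as the files are written.
(
    spark.read.parquet("/data/sales_parquet")
    .write.format("delta")
    .mode("overwrite")
    .saveAsTable("sales")
)

# The same filter now benefits from data skipping via those statistics.
spark.sql("""
    SELECT *
    FROM sales
    WHERE date_id BETWEEN '2021-01-01' AND '2021-01-06'
""").count()
```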
Z-Ordering
Z-ordering is an approach to cluster the underlying Parquet files across multiple dimensions (i.e., based on multiple columns). It enables us to skip even more files during scanning.
Where does Z-ordering work best?
Returning to my problem statement, I was required to build 22 different models, but each model has the same granularity (e.g., each model predicts sales for a given item on a given date in a given location). In simple terms, Spark reads my training data many times, always at the same granularity.
You can trigger Z-order optimization with a command along the following lines (a sketch using the example table and columns from above):
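```python
# Cluster the data files of the "sales" Delta table by the columns most often
# used in filters, so that more files can be skipped at query time.
spark.sql("OPTIMIZE sales ZORDER BY (item, location, date_id)")
```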
The above command collocates related information (here, similarity is decided based on the columns "item," "location," and "date_id") in the same set of files.
When the same query is re-run after the ZORDER optimization, it skips even more files and executes much faster than before.
Final Thoughts
Though Databricks developed Delta Lake to enable ACID properties, it includes additional features such as effective caching, data skipping, and Z-order optimization. These features help process data in a distributed manner in less time. This article focused on how Delta Lake accelerates data processing, so I skipped other interesting features offered by Delta Lake, such as scalable metadata handling, time travel, the unified platform for batch and streaming, and updates and deletes. Upcoming articles will discuss these features in detail.