
What are Delta Lake Clones?
In Databricks Delta Lake, clones are simply copies of your Delta tables at a given snapshot in time. They have the same schema, structure, and partitioning as the source table. Once you create a clone, changes made to it do not affect the source table, and vice versa. This feature is available from Databricks Runtime 7.2 onwards.
Databricks Delta Lake supports two types of clones:
- Shallow Clones: A shallow clone copies only the metadata of the source table, not the data files themselves. Because of that, creating a shallow clone is fast and cheap.
- Deep Clones: As the name suggests, a deep clone is a true copy of the source table: it duplicates both the metadata and the data files.
So, what is the point of clones?
Cloning can open up a lot of doors in Delta Lake. Primarily, clones play a significant role in archiving data and in running short-lived experiments on your data set, the kind of experiments you want to keep isolated from your source tables.
For instance, you can test a workflow on a production table by creating a shallow clone and running your tests on the clone without corrupting the source data. Since a shallow clone is a near-instant copy, you can quickly test optimization experiments to improve the performance of your queries, and those changes remain on the clone.
Another great use case for shallow clones is backfilling: quite often you will want to backfill or recalculate a large part of the historical data in a table. This is easily done with a shallow clone: run your backfill or recalculation on the clone, then simply replace the original table with the clone once you are happy with the result.
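As a rough sketch of that workflow (the clone name, location, and the backfill logic here are hypothetical placeholders; adapt them to your table):

%sql
-- Clone the production table (metadata only, so this is near-instant)
CREATE OR REPLACE TABLE Person.AddressBackfill
SHALLOW CLONE Person.Address
LOCATION 'dbfs:/FileStore/Clones/AddressBackfill/';

-- Run the backfill on the clone (placeholder logic)
UPDATE Person.AddressBackfill
SET endDate = '2020-12-31'
WHERE endDate IS NULL;

-- Once validated, promote the clone over the original
CREATE OR REPLACE TABLE Person.Address
DEEP CLONE Person.AddressBackfill

The final step uses a deep clone so the promoted table owns its data files rather than pointing back at the clone's directory.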
How to create and manage Delta Lake Clones?
Enough talk! Let’s get cloning and understand what goes on behind the scenes.
Databricks offers a free Community Edition, which is a really cool and powerful offering. As confessed by the authors themselves, the entire book "Spark: The Definitive Guide" was written using the Community Edition.
Like most features, you can try out Delta clones for free on the Community Edition. If you have not had a chance to set it up already, sign up today; it hardly takes a few minutes to sign up and spin up a cluster.
Let’s set up some sample data for running our experiments.
%sql
CREATE DATABASE IF NOT EXISTS Person;

DROP TABLE IF EXISTS Person.Address;

CREATE TABLE Person.Address (
  addressId int,
  address string,
  customerId int,
  startDate string,
  endDate string
)
USING DELTA
LOCATION 'dbfs:/FileStore/DeltaClones/Address/'
Insert some data into the table.
%sql
INSERT INTO Person.Address
SELECT 1 AS addressId, '1 downing' AS address, CAST(rand() * 10 AS integer) AS customerId, '2020-11-05' AS startDate, null AS endDate
UNION
SELECT 2, '2 downing', CAST(rand() * 10 AS integer), '2020-11-05', null
Taking a quick look at the file structure, we should now have a Delta table with some Parquet files storing the data and a _delta_log directory holding the transaction log.
%fs
ls dbfs:/FileStore/DeltaClones/Address/

Shallow Clones
Let’s start by creating a shallow clone of our sample table.
%sql
CREATE OR REPLACE TABLE Person.AddressShallow
SHALLOW CLONE Person.Address
LOCATION 'dbfs:/FileStore/Clones/AddressShallow/'
If we remember the definition correctly, a shallow clone duplicates just the metadata and not the actual data itself. Let's verify that claim.
%fs
ls dbfs:/FileStore/Clones/AddressShallow/

Sweet! There are no files holding any data, just the _delta_log. Let's check whether we can query it by running a select.

Shut the front door! We have a copy of our source table that we can easily query, even though, in reality, it is just a copy of the metadata. Using the input_file_name() function in our select query, it is quite evident that the clone is reading files from the source table.
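A query along these lines shows which underlying file each row is served from (input_file_name() is a built-in Spark SQL function):

%sql
SELECT *, input_file_name() AS sourceFile
FROM Person.AddressShallow

Notice the file paths point back under dbfs:/FileStore/DeltaClones/Address/, the source table's location, not the clone's.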
A shallow clone essentially creates a new _delta_log that points to the original table's files. If you analyze the transaction log of the shallow clone, you will see that the operation performed was a CLONE and that there are literal pointers to the original table's files.
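One quick way to see the recorded operation (you can also open the JSON files under the clone's _delta_log directly):

%sql
DESCRIBE HISTORY Person.AddressShallow

The operation column for the first entry shows the clone operation rather than a normal write.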
Now, let’s test the claim about isolation. Let’s make an update on our shallow clone.
%sql
UPDATE Person.AddressShallow
SET endDate = '2021-01-01'
Looking into the Delta directory, we see that once you make a change to the clone, data files start to appear.

You can verify the same by running the select query over it again. Now the data is being read from files that belong to the clone rather than the source table.

So, that proves the theory of isolation. A shallow clone is the quickest way to duplicate your source table and run tons of experiments on it without having to worry about destroying your source data. Now, go figure out your use case!
Deep Clones
Let’s create a deep clone of our source table.
%sql
CREATE OR REPLACE TABLE Person.AddressDeep
DEEP CLONE Person.Address
LOCATION 'dbfs:/FileStore/Clones/AddressDeep/'
It may not be evident with a table as small as ours, but a large table would take a while to deep clone, since all of its data files have to be copied.
Let’s do the same routine and check out the delta directory.

You can instantly notice the difference between a shallow and a deep clone: a deep clone makes a true copy of your source table. Well, technically not a mirrored copy, because a deep clone only clones the latest version of the source Delta table; you do not get the history of transactions in your deep clone. You can verify that by describing the history of the deep clone and comparing it with the history of your source table.
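Comparing the two histories makes this obvious; run each of these (in separate cells, so both results are displayed):

%sql
DESCRIBE HISTORY Person.Address

%sql
DESCRIBE HISTORY Person.AddressDeep

The source table shows its full chain of writes, while the deep clone's history starts fresh at the clone operation.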

For the rest, deep clones behave exactly like shallow clones: any changes made to the deep clone are isolated from the source.
Conclusion
Cloning in delta lake is not as difficult or controversial as cloning in genetics.
There is so much potential in shallow clones: they make working with large datasets slick, clean, and cost-effective. Yes, deep clones add a nice touch, but I do not foresee a huge need for them, especially when you can use a "Create Table As Select" operation instead.
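For reference, a Create Table As Select that achieves roughly the same result as a deep clone (the table name and location here are hypothetical):

%sql
CREATE TABLE Person.AddressCopy
USING DELTA
LOCATION 'dbfs:/FileStore/Clones/AddressCopy/'
AS SELECT * FROM Person.Address

Note that, unlike a deep clone, CTAS does not carry over the source table's partitioning or table properties unless you specify them yourself.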
What’s your take on Clones? Would love to know your thoughts and insights, especially on the kind of use cases that would justify using a deep clone.
All the source code can be found here in this notebook. Do check it out!