See my follow-up blog for a full data lake recovery solution: https://towardsdatascience.com/how-to-recover-your-azure-data-lake-5b5e53f3736f
1. Azure Storage backup – Introduction
Azure Storage always stores multiple copies of your data. When Geo-redundant Storage (GRS) is used, the data is also replicated to the paired region. This way, GRS protects against data loss in case of a disaster. However, GRS cannot prevent data loss when application errors corrupt data; the corrupted data is then simply replicated to the other zones/regions. In that case, a backup is needed to restore your data. Two backup strategies are as follows:
- Snapshot creation: When a blob is added or modified, a snapshot is created of the current situation. Because of the nature of blobs, this is an efficient O(1) operation. Snapshots can be restored quickly; however, restoring is not always possible (e.g. for deleted blobs), see the next strategy.
- Incremental backup: When a blob is added or modified, an incremental backup is created in another storage account. Copying blobs is an expensive O(N) operation, but it can be done asynchronously using Azure Data Factory. Restoring can be done by copying the blobs back. Both operations are sketched below.
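To make the difference concrete, the sketch below shows both operations using the azure-storage-blob v12 SDK. Connection strings, container and blob names are placeholders, and the cross-account copy assumes the source blob URL is readable from the backup account (e.g. via a SAS token); it is an illustration, not the exact code used later in this blog.

from azure.storage.blob import BlobServiceClient

source = BlobServiceClient.from_connection_string("<source-connection-string>")
backup = BlobServiceClient.from_connection_string("<backup-connection-string>")
blob = source.get_blob_client(container="<container>", blob="<blob-name>")

# Snapshot: metadata-only operation on the source blob, O(1)
snapshot_props = blob.create_snapshot()

# Incremental backup: server-side copy to the backup account, O(N) in blob size
backup_blob = backup.get_blob_client(container="<backup-container>", blob="<blob-name>")
backup_blob.start_copy_from_url(blob.url)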
In this blog, it is discussed how snapshots and incremental backups can be created for a storage account, see also the overview depicted below.
To support the creation of automatic snapshots and incremental backups of your storage account, three types of scripts are used and discussed in the remainder of this blog:
- Event based script triggered by Producer to create snapshot and incremental backup requests once data is ingested/modified
- Queue trigger script that creates incremental backups using ADFv2
- Time based script triggered by Admin to reconcile missing snapshots/incremental backups
Notice that blob snapshots are only supported in regular storage accounts and are not yet supported in ADLSgen2 (but they are expected to become available in ADLSgen2, too). The scripts are therefore based on regular storage accounts; a detailed explanation of the scripts can be found below. Also notice that the scripts deal with the creation of snapshots/backups, not with restoring them.
2. Event-based triggered snapshot/incremental backup requests
In a data lake, data is typically ingested by a Producer using Azure Data Factory. To create event-based triggered snapshots/incremental backups, the following shall be deployed:
- Deploy the following script as an Azure Function in Python. See this link on how to create an Azure Function in Python. See my other blog on how to secure Azure Functions.
- Add the Azure Function as a Linked Service in your Azure Data Factory instance and add the Azure Function as the last step in your ingestion pipeline.
- Create two storage accounts as source storage and backup storage. Also create a storage queue to handle backup request messages.
Now every time new data is ingested using ADFv2, an Azure Function is called that creates a snapshot and sends an incremental backup request for the new/changed blobs, see also below.
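HttpSnapshotIncBackupContainerProducer is an HTTP-triggered function. A minimal sketch of such an entry point is shown below; the container parameter name and the process_container helper are illustrative assumptions, not the exact code from the repository.

import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    # ADFv2 calls this endpoint as the last activity of the ingestion pipeline
    # and passes the container it just wrote to (parameter name is an assumption)
    container_name = req.params.get("container")
    if not container_name:
        try:
            container_name = req.get_json().get("container")
        except ValueError:
            container_name = None
    if not container_name:
        return func.HttpResponse("Missing 'container' parameter", status_code=400)

    # Scan the container for new/modified blobs, see the core of the script below
    processed = process_container(container_name)  # hypothetical helper
    return func.HttpResponse("Processed {} blobs".format(processed), status_code=200)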
The internal working of script HttpSnapshotIncBackupContainerProducer can be explained as follows:
- 1. The snapshot/backup script checks for modified/new blobs in a container of the storage account by comparing the ETag of each blob with its previous snapshots (if any). In case it detects a modified/new blob, the following two steps are done:
- 1a. Create a snapshot of the modified/new blob
- 1b. Send an incremental backup request message to a storage queue. The backup request message contains only the metadata of the modified blob.
The core of the script is as follows:
# Get all blobs in container
prev_blob_name = ""
prev_blob_etag = ""
blob_list = container.list_blobs(include=['snapshots'])
for blob in blob_list:
    if blob.snapshot is None:
        # Blob that is not a snapshot
        if prev_blob_name != blob.name:
            # New blob without snapshot, create snapshot/backup
            create_snapshot_backup(blob.name, blob.etag)
        elif prev_blob_etag != blob.etag:
            # Existing blob that has changed, create snapshot/backup
            create_snapshot_backup(blob.name, blob.etag)
    prev_blob_name = blob.name
    prev_blob_etag = blob.etag
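The create_snapshot_backup helper called above is not shown in this snippet. A minimal sketch of what it does, using the azure-storage-blob and azure-storage-queue v12 SDKs (connection string, container and queue names are placeholders), could look as follows:

import json
from azure.storage.blob import BlobServiceClient
from azure.storage.queue import QueueClient

blob_service = BlobServiceClient.from_connection_string("<source-connection-string>")
queue_client = QueueClient.from_connection_string("<source-connection-string>", "<backup-queue>")

def create_snapshot_backup(blob_name, blob_etag):
    # 1a. Create a snapshot of the current version of the blob (cheap, O(1))
    blob_client = blob_service.get_blob_client(container="<container>", blob=blob_name)
    blob_client.create_snapshot()

    # 1b. Put a backup request message with only the blob metadata on the queue;
    #     the actual copy is done asynchronously by the queue-triggered function (chapter 3)
    message = json.dumps({"container": "<container>", "blob_name": blob_name, "etag": blob_etag})
    queue_client.send_message(message)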
Notice that only the Producer's ADFv2 Managed Identity and the Managed Identity of this Azure Function have write access to this container. Blob triggers do not work in this scenario, since no events are fired when blobs are modified.
The asynchronous incremental backup creation is discussed in the next chapter.
3. Incremental backup creation
Once a new incremental backup request is added to the storage queue, this message shall be processed such that an incremental backup is created. For this, the following shall be deployed:
- Deploy the following script as a Queue triggered Azure Function in Python. The same App Service Plan as in step 2 can be used.
- Make sure that all required app settings are filled in correctly, see here
- Deploy an ADFv2 pipeline that creates incremental backups by copying blobs from the source storage account to the backup storage account
- Make sure that a Managed Identity is assigned to your Azure Function and that this identity has the RBAC role "Contributor" on the ADFv2 instance so that it can run the pipeline
Now every time a new incremental backup request message is received on the storage queue, an Azure Function is triggered that calls an ADFv2 pipeline that creates an incremental backup, see also below.
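The entry point of QueueCreateBlobBackupADFv2 is a queue-triggered function. A minimal sketch, assuming a queue trigger binding in function.json and the message format from chapter 2, is:

import json
import logging
import azure.functions as func

def main(msg: func.QueueMessage) -> None:
    # Backup request message as produced in chapter 2
    backup_request = json.loads(msg.get_body().decode("utf-8"))
    blob_name = backup_request["blob_name"]
    blob_etag = backup_request["etag"]
    logging.info("Backup request for %s (etag %s)", blob_name, blob_etag)
    # Check that the request is still current and trigger the ADFv2 copy pipeline,
    # see the core of the script below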
The working of the script QueueCreateBlobBackupADFv2 can be explained as follows:
- 2a. Storage queue receives a new backup request message for a blob. Notice that messages can be processed in parallel
- 2b. The Azure Function picks up the message from the storage queue and checks that the backup request is not outdated. It then triggers an ADFv2 pipeline using the REST API and the Azure Function Managed Identity
- 2c. ADFv2 starts copy activity by copying the blob from the source storage account to the backup storage account
In step 2b, it can also be decided to run in blob_lease mode, which exclusively locks the file and guarantees that the correct version of the file is added to the backup storage account. Whether or not to use a blob lease depends on a number of factors (e.g. maximum lease time allowed, file size, number of jobs, immutability).
The core of the script is as follows:
USING_BLOB_LEASE = True

# Check if source blob has not changed in the meantime
if source_blob_changed(blob_name, blob_etag):
    return

# Start copying using ADFv2
try:
    if not USING_BLOB_LEASE:
        # Copy without locking the file
        copy_adf_blob_source_backup(blob_name, blob_etag)
    else:
        # Copy with blob lease locks the file
        copy_with_lease(blob_name, blob_etag)
except:
    logging.info("copy failed")
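The helpers copy_adf_blob_source_backup and copy_with_lease are not shown above. A rough sketch of how they could look, using azure-identity for the Managed Identity token and the ADFv2 createRun REST API, is given below. Subscription, resource group, factory, pipeline and container names are placeholders, and in practice the lease would have to be held until the pipeline copy has finished.

import requests
from azure.identity import ManagedIdentityCredential
from azure.storage.blob import BlobClient

ADF_RUN_URL = (
    "https://management.azure.com/subscriptions/<subscription-id>"
    "/resourceGroups/<resource-group>/providers/Microsoft.DataFactory"
    "/factories/<factory-name>/pipelines/<pipeline-name>/createRun?api-version=2018-06-01"
)

def copy_adf_blob_source_backup(blob_name, blob_etag):
    # Get an ARM token using the Managed Identity of the Azure Function
    token = ManagedIdentityCredential().get_token("https://management.azure.com/.default").token
    # Pass the blob metadata as pipeline parameters to the copy activity
    response = requests.post(
        ADF_RUN_URL,
        headers={"Authorization": "Bearer " + token},
        json={"blob_name": blob_name, "etag": blob_etag},
    )
    response.raise_for_status()

def copy_with_lease(blob_name, blob_etag):
    # An exclusive lease on the source blob blocks writes while the copy runs
    blob = BlobClient.from_connection_string("<source-connection-string>", "<container>", blob_name)
    lease = blob.acquire_lease(lease_duration=60)  # finite leases are limited to 60 seconds
    try:
        copy_adf_blob_source_backup(blob_name, blob_etag)
    finally:
        lease.release()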
In the previous two chapters, it was discussed how snapshots and incremental backups can be created when a Producer adds new data to the data lake using ADFv2. However, there can also be a need for time-based triggered snapshots/incremental backups. This is discussed in the next chapter.
4. Time-based triggered snapshot/incremental backup requests
In chapter 2, it is discussed how a Producer can create event-based snapshot/incremental backup requests. However, there is also a need for a reconciliation script that can create missing snapshots and/or missing backups. This can happen when a Producer forgets to add the Azure Function to its ingestion pipeline and/or the script failed to run. To create a time-based function, the following shall be deployed:
- Deploy the following script as an Azure Function in Python. The same App Service Plan as in step 2 can be used.
- Make sure that all required app settings are filled in correctly, see here
Now an admin can configure the reconciliation script to run periodically such that missing snapshots are created and incremental backup requests are sent, see also below.
The internal working of script HttpSnapshotIncBackupStorageReconciliation can be explained as follows:
- 1. The time-based snapshot/backup script checks for modified/new blobs in a storage account by comparing the ETag of each blob with its previous snapshots (if any). In case it detects a modified/new blob, the following two steps are done:
- 1a. In case it detects a modified/new blob that does not have a snapshot yet, a new snapshot is created
- 1b. In case it detects a modified/new blob that does not have an incremental backup yet, a new backup request message is sent
The core of the script is as follows:
# Get all containers in storage account
container_list = client_source.list_containers()
for container in container_list:
    # Get all blobs in container
    prev_blob_name = ""
    prev_blob_etag = ""
    blob_list = container.list_blobs(include=['snapshots'])
    for blob in blob_list:
        if blob.snapshot is None:
            # Blob that is not a snapshot
            # 1. Check if snapshot needs to be created
            if prev_blob_name != blob.name:
                # New blob without snapshot, create snapshot
                create_snapshot(blob.name, blob.etag)
            elif prev_blob_etag != blob.etag:
                # Existing blob that has changed, create snapshot
                create_snapshot(blob.name, blob.etag)
            # 2. Check if incremental backup needs to be created
            # Check if blob exists in backup
            blob_exists = check_blob_exists(blob.name, blob.etag)
            if not blob_exists:
                # Not in backup, put backup request message on queue
                queue_json = ("{" +
                    "\"container\":\"{}\",\"blob_name\":\"{}\",\"etag\":\"{}\""
                    .format(container.name, blob.name, blob.etag) + "}")
                queue_service.put_message(queue, queue_json)
        prev_blob_name = blob.name
        prev_blob_etag = blob.etag
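The check_blob_exists helper depends on how backups are organised in the backup storage account. A minimal sketch, assuming the backup copy is stored under a name that contains the source ETag (the actual naming convention is defined in the scripts in the GitHub repository):

from azure.storage.blob import BlobServiceClient

backup_service = BlobServiceClient.from_connection_string("<backup-connection-string>")

def check_blob_exists(blob_name, blob_etag):
    # A changed source blob gets a new ETag and therefore shows up as a missing backup
    backup_blob = backup_service.get_blob_client(
        container="<backup-container>",
        blob="{}_{}".format(blob_name, blob_etag.strip('"')),
    )
    return backup_blob.exists()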
5. Conclusion
Azure Storage always stores multiple copies of your data, and Geo-redundant Storage (GRS) additionally stores copies in a paired region. However, GRS does not protect against data corruption caused by application errors; corrupted data is then simply replicated to the other zones and regions. Two measures that can protect against data corruption are as follows:
- Snapshot creation: Once a blob is added or modified, a snapshot is created of the current situation.
- Incremental backup: Once a blob is added or modified, an incremental backup of the blob is created in another storage account.
In this blog, it is discussed how synchronous snapshots and asynchronous incremental backups can be created using the scripts in this GitHub repository. See also the extended high-level overview depicted below.