
Disaster recovery scenarios for Azure storage, SQL, Cosmos DB, Synapse

Learn how to plan for disaster recovery for data services in Azure

Server in data center, image by Kvistholt Photography on Unsplash

0. Introduction

Disaster recovery aims at the continuation of IT systems after disruptive events. Data services are a vital part of every IT system and must be protected against these events. Most Azure PaaS data services have service tiers that support zone redundancy. This implies that when a disaster is limited to a single zone, data services are not impacted, or the impact is minimized. However, some disruptive events require more planning than just selecting the right tier. These events are grouped into three major scenarios as follows:

  • Regional disaster: Outage in multiple data centers in a region caused by a natural disaster, e.g. loss of power, earthquake.
  • Global disaster: Upgrade errors or unanticipated issues that occur during planned infrastructure maintenance.
  • Customer error: Data corruption or deletion caused by an application bug or human error.

These scenarios are used to plan disaster recovery for the following services: Azure Data Lake Storage, SQLDB, Synapse SQL pools and Cosmos DB. As a reference, an overview is provided below.

Overview of disaster recovery scenarios for Azure PaaS data services - image by author

In this overview, it can be seen that regional and global disaster recovery are discussed jointly. The rationale is that paired regions should be used to plan for disaster recovery in case of a regional disaster. A benefit of paired regions is that sequential updates prevent both regions from being hit by the same upgrade error or unplanned maintenance error. Also, region recovery order ensures that one of the two regions is prioritized for recovery in case of an outage.

Finally, disaster recovery shall be planned for the entire IT system and not only for a data service in isolation. The rationale for limiting the scope of this article to data services is as follows:

  • Data services are a vital part of every IT system. One could start with disaster recovery planning for the data services and from there extrapolate what is required for the dependent applications.
  • Enterprises are often unsure which data service to use for which use case. This article may help to decide which data service best meets the disaster recovery requirements.

In the remainder of this article, disaster recovery is discussed for Azure Blob Storage, Azure Data Lake Storage, SQLDB, Synapse SQL pools and Cosmos DB.

1a. Disaster recovery – Azure Blob Storage

Azure Storage is optimized for storing massive amounts of unstructured data. With geo-redundant storage, data is replicated three times in the primary region and then replicated to a secondary region. If a regional/global disaster occurs, data already stored in a storage account is never lost. Regarding availability (reading/writing data) and customer errors, see the next paragraphs.

1a.1. Regional/global disaster

The following measures can be taken:

  • In case an application only needs to read data (and writing data can be postponed or staged to another storage account), the application can use the secondary endpoint in the paired region of the storage account. See this git project for how an application can detect that a region is down and start using the secondary endpoint.
  • For writing, a failover can be initiated to the secondary account. Notice that after failover, the secondary account becomes LRS. Also, the failover is not initiated automatically and takes some time. Alternatively, a custom solution can be created in which an application writes data to a storage account/queue in a different region.
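
The read fallback described above can be sketched in a few lines. This is a minimal sketch, not production code: `secondary_blob_endpoint` relies on the documented RA-GRS convention of appending `-secondary` to the account host, and the two callables passed to `read_blob_with_fallback` are hypothetical stand-ins for real SDK download calls, so the logic can be shown without a live storage account.

```python
def secondary_blob_endpoint(primary_url: str) -> str:
    """Derive the RA-GRS read-only secondary endpoint from the primary one.

    Azure exposes the secondary as <account>-secondary.blob.core.windows.net.
    """
    head, sep, tail = primary_url.partition(".blob.core.windows.net")
    if not sep:
        raise ValueError("not a standard Azure blob endpoint")
    return f"{head}-secondary{sep}{tail}"


def read_blob_with_fallback(primary_read, secondary_read):
    """Try the primary endpoint first; fall back to the secondary on failure.

    primary_read/secondary_read are callables (hypothetical helpers wrapping
    SDK downloads) that raise on failure, keeping the logic testable offline.
    """
    try:
        return primary_read()
    except Exception:
        return secondary_read()
```

In a real application the fallback branch would also emit a metric or alert, since serving from the secondary endpoint means the primary region is degraded.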

1a.2. Customer error

The following measures can be taken:

  • Point-in-time restore (PITR) is supported for blob storage and can be done on both file level and container level, see here. However, PITR uses the same storage account; this protects a customer from errors, but not from a malicious entity encrypting data using malware. In that case, a backup is needed, see the next bullet. See my follow-up blog for a full data lake recovery solution: https://towardsdatascience.com/how-to-recover-your-azure-data-lake-5b5e53f3736f
  • To protect a storage account from a malicious entity, a custom incremental backup to another storage account can be created. Creating a backup can be time-based using Data Factory or event-based using blob triggers. See here for an example git project.
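
The selection step of such a time-based incremental backup can be sketched as a pure function. This is a simplified model, assuming the job tracks the timestamp of its last successful run; in a real job the name-to-last-modified mapping would come from listing the source container with the Azure SDK, and the returned names would then be copied to the backup account.

```python
from datetime import datetime


def blobs_to_backup(blob_last_modified: dict, last_backup_time: datetime) -> list:
    """Return the names of blobs changed since the last backup run.

    blob_last_modified maps blob name -> last-modified timestamp (in practice
    obtained by listing the source container); only blobs modified after the
    previous run need to be copied, which is what makes the backup incremental.
    """
    return sorted(
        name
        for name, modified in blob_last_modified.items()
        if modified > last_backup_time
    )
```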

1b. Disaster recovery – Azure Data Lake Storage

Azure Data Lake Storage is a storage account using a hierarchical file structure rather than object-based storage. Using a true folder structure, management tasks (e.g. renaming a folder) can be done easily. It has the same durability as regular Azure Storage; however, PITR and failover are not yet supported. See the next paragraphs regarding disaster recovery.

1b.1. Regional/global disaster

The following measures can be taken:

  • Similar to regular Azure Blob Storage, a secondary endpoint can be used for reading.
  • For writing, failover is not supported for Azure Data Lake Storage. A custom solution can be created in which an application writes data to a storage account/queue in a different region. Once the primary region is up again, this data shall be merged with the storage account in the primary region. Alternatively, data ingestion can be paused and rerun once the Azure Data Lake Storage is up and running again.
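
The write-staging part of such a custom solution can be sketched as a small router. This is a minimal sketch under the assumption that both write paths raise on failure; the two callables are hypothetical stand-ins for SDK upload calls to the primary data lake and to the staging account/queue in the other region.

```python
def stage_write(primary_write, fallback_write) -> str:
    """Attempt a write against the primary region; stage it elsewhere on failure.

    primary_write/fallback_write are callables (hypothetical wrappers around
    SDK uploads) that raise on failure. Returns which target accepted the
    write, so staged items can be tracked and merged back after recovery.
    """
    try:
        primary_write()
        return "primary"
    except Exception:
        fallback_write()
        return "staged"
```

Everything returned as "staged" must be replayed against the primary account once the region recovers, which is the merge step mentioned above.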

1b.2. Customer error

The following measures can be taken:

  • PITR is not yet supported in Azure Data Lake Storage; however, file snapshots can be used to create a custom PITR solution.
  • To protect a storage account from a malicious entity, a custom backup solution can be created as discussed for regular storage in the previous paragraph and this example git project.
  • Since Azure Data Lake Storage is often used to ingest and store data from Azure Data Factory pipelines in raw/curated/bottled zones, it can also be decided to rerun the pipelines. The condition is that the source systems keep the data for some period and the pipelines can run idempotently.
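
The restore step of a snapshot-based custom PITR solution boils down to picking the right snapshot. A minimal sketch, assuming the job periodically creates file snapshots and can list them as (taken-at, id) pairs; in practice the ids would come from listing blob snapshots with the Azure SDK.

```python
from datetime import datetime


def snapshot_to_restore(snapshots: list, restore_point: datetime):
    """Pick the most recent snapshot taken at or before the restore point.

    snapshots is a list of (taken_at, snapshot_id) tuples. Returns the id to
    restore from, or None when no snapshot predates the requested point.
    """
    candidates = [(taken_at, sid) for taken_at, sid in snapshots if taken_at <= restore_point]
    if not candidates:
        return None
    # max() on the tuples picks the latest taken_at.
    return max(candidates)[1]
```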

2. Disaster recovery – Azure SQL

Azure SQL Database is the intelligent, scalable, relational database service built for the cloud and is optimized for Online Transaction Processing (OLTP). See the next paragraphs regarding disaster recovery.

2.1. Regional/global disaster

The following measures can be taken:

  • In case an application only needs to read from a database and the primary region is down, the application can access a secondary database for read-only operations, see here.
  • For writing, automatic failover is supported in Azure SQL. This way, it is possible to use the secondary database in a different region for writing when disaster occurs in the primary region. Once the primary region is up and running again, the data is replicated back from the secondary region to the primary region. Data conflicts can occur and are resolved depending on the policies configured.

2.2. Customer error

The following measures can be taken:

  • PITR can be achieved by leveraging the automated database backups functionality. Automated database backups create full backups every week, differential backups every 12–24 hours, and transaction log backups every 5 to 10 minutes. When you restore a database, the service determines which full, differential, and transaction log backups need to be restored.
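
How the service picks the backup chain for a restore can be illustrated with a small function. This is a simplified model of the weekly-full / 12–24h-differential / 5–10min-log schedule described above (not the actual Azure implementation), with the backup catalogs represented as plain timestamp lists.

```python
from datetime import datetime


def backups_for_restore(fulls: list, diffs: list, logs: list, t: datetime):
    """Return the backup chain needed to restore a database to time t.

    The chain is: the latest full backup at or before t, the latest
    differential between that full and t (if any), and every transaction log
    backup after the chosen base up to t.
    """
    full = max(ts for ts in fulls if ts <= t)
    diff_candidates = [ts for ts in diffs if full < ts <= t]
    diff = max(diff_candidates) if diff_candidates else None
    base = diff if diff is not None else full
    log_chain = [ts for ts in logs if base < ts <= t]
    return full, diff, log_chain
```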

3. Disaster recovery – Synapse SQL pools

Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Within Synapse Analytics, SQL pools are used to store relational data. They can also be used with external tables to query data on Azure Data Lake Storage. See the next paragraphs for disaster recovery planning.

3.1. Regional/global disaster

The following measures can be taken:

  • For reading, a custom solution can be created to restore backups every day in a different region, pause the SQL pool there, and only run that SQL pool once the primary region is down.
  • In case only external tables on a data lake storage account are used, it is also possible to create a SQL pool in a different region and point it to the secondary endpoint of the Azure Data Lake Storage account in that region.
  • Automatic failover is not supported in Synapse SQL pools. For writing, a custom solution can be created in which an application writes data to a storage account/queue in a different region. It can also be decided to pause data ingestion pipelines until the primary region is up again, see the next paragraph.

3.2. Customer error

The following measures can be taken:

  • Snapshots go back 7 days and can be restored once data gets corrupted. To prevent data loss, data pipelines shall be run again, see the next bullet.
  • Synapse SQL pools are often used in a data lake context, in which Azure Data Factory or Synapse pipelines are used to ingest data. Data is either ingested directly into SQL pools or into the Azure Data Lake Storage account belonging to the Azure Synapse workspace and used as external tables in Synapse SQL pools. It can also be decided to rerun pipelines once the primary region is up again. The condition is that the source systems keep the data for some period and the pipelines can run idempotently.
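
What "pipelines can run idempotently" means can be made concrete with a tiny sketch: loads that upsert by key leave the target unchanged when replayed. The dict below is a hypothetical stand-in for the SQL pool table; a real pipeline would use a MERGE statement or staging-table swap instead.

```python
def upsert_rows(table: dict, rows: list, key: str = "id") -> dict:
    """Merge incoming rows into `table`, keyed by `key`.

    Because each row overwrites the entry with the same key, running the same
    load twice produces the same table: the rerun is idempotent, which is the
    precondition for the replay-the-pipelines recovery strategy.
    """
    for row in rows:
        table[row[key]] = row
    return table
```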

4. Disaster recovery – Cosmos DB

Azure Cosmos DB is a fully managed, multi-model database service. It enables you to build highly responsive applications worldwide. As part of Cosmos DB, the Gremlin, MongoDB, Cassandra and SQL APIs are supported. See the next paragraphs for disaster recovery planning.

4.1. Regional/global disaster

The following measures can be taken:

  • For reading, an application can use the secondary endpoint in the paired region.
  • Automatic failover is supported in Azure Cosmos DB. In case the primary region is down, the database in the secondary region becomes the primary and can be used for writing data. Data conflicts can occur and are resolved depending on the consistency level of Cosmos DB.
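
A simplified model of how such conflicts are resolved under a last-writer-wins policy: the version with the highest timestamp survives. Cosmos DB's default policy compares the `_ts` system property (epoch seconds); the region-name tie-break below is an illustrative assumption to make the sketch deterministic, not the service's documented behavior.

```python
def resolve_last_writer_wins(versions: list) -> dict:
    """Pick the winning version of an item after a multi-region write conflict.

    Each version is a dict carrying the Cosmos `_ts` timestamp; the highest
    timestamp wins. Ties are broken on a `_region` field (an assumption for
    this sketch) so that every region converges on the same winner.
    """
    return max(versions, key=lambda v: (v["_ts"], v.get("_region", "")))
```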

4.2. Customer error

The following measures can be taken:

  • Azure Cosmos DB creates automatic backups. In case a database is corrupted or deleted and needs to be restored, this can be requested, see the process here.
  • The Azure Cosmos DB change feed provides a persistent log of records within an Azure Cosmos DB container. Items in the change feed can be replayed in a second container up to the point before the data corruption occurred and then persisted back to the original container.
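
The replay step above can be sketched as follows. This is a minimal model assuming the change feed is available as a list of item versions carrying `id` and the `_ts` timestamp; a real solution would read the feed with the change feed processor and write the result to the second container.

```python
def replay_change_feed(feed: list, corruption_time: int) -> dict:
    """Rebuild the latest good version of each item from a change-feed log.

    Keeps, per item id, the last record whose `_ts` (epoch seconds) lies
    strictly before the corruption time; everything written at or after that
    moment is discarded as corrupted.
    """
    restored = {}
    for record in sorted(feed, key=lambda r: r["_ts"]):
        if record["_ts"] < corruption_time:
            restored[record["id"]] = record
    return restored
```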

5. Summary

In this article, we discussed disaster recovery planning for Azure data PaaS services. Although the production-ready tiers of PaaS services and paired regions already solve a lot, a customer still needs to take measures to plan for disaster. In this article, measures are discussed for Azure storage, SQL, Cosmos DB and Synapse, see the overview below.

Overview of disaster recovery scenarios for Azure PaaS data services - image by author
