Azure Data Lake Storage and Azure Databricks are unarguably the backbones of the Azure cloud-based data analytics systems. Azure Data Lake Storage provides scalable and cost-effective storage, whereas Azure Databricks provides the means to build analytics on that storage.
The analytics procedure begins with mounting the storage to Databricks distributed file system (DBFS). There are several ways to mount Azure Data Lake Store Gen2 to Databricks. Perhaps one of the most secure ways is to delegate the Identity and access management tasks to the Azure AD.
This article looks at how to mount Azure Data Lake Storage to Databricks authenticated by Service Principal and OAuth 2.0 with Azure Key Vault-backed Secret Scopes.
Caution:Microsoft Azure is a paid service, and following this article can cause financial liability to you or your organization.
At the time of writing, Azure Key Vault-backed Secret Scopes is in ‘Public Preview.’ It is recommended not to use any ‘Preview’ feature in production or critical systems.
To access resources secured by an Azure AD tenant (e.g., storage accounts), a security principal must represent the entity that requires access. A security principal defines the access policy and permissions for a user or an application in the Azure AD tenant. When an application is permitted to access resources in a tenant (e.g., upon registration), a service principal object is created automatically.
Let’s begin by registering an Azure AD application to create a service principal and store our application authentication key in the Azure Key Vault instance.
Register an Azure AD application
Find and select Azure Active Directory on the Azure Portal home page. Select App registrations and click + New registration.
Azure Active Directory: Register a new application (Image by author)
On the Register an application page, enter the name ADLSAccess, signifying the purpose of the application, and click Register.
Azure Active Directory: Name and register an application (Image by author)
In the ADLSAccess screen, copy the Application (client) ID and the Directory (tenant) ID into notepad. Application ID refers to the app we just registered (i.e., ADLSAccess), and the Azure AD tenant our app ADLSAccess is registered to is the Directory ID.
Next, we need to generate an authentication key (aka application password or client secret or application secret) to authenticate the ADLSAccess app. Click on Certificates and secrets, and then click + New client secret. On the Add a client secret blade, type a description, and expiry of one year, click Add when done.
Azure Active Directory: Create a client secret (Image by author)
When you click on Add, the client secret (authentication key) will appear, as shown in the image below. You only have one opportunity to copy this key-value into notepad. You will not be able to retrieve it later if you perform another operation or leave this blade.
Azure Active Directory: Client secret (Image by author)
Grant service principal access to ADLS account
Next, we need to assign an access role to our service principal (recall that a service principal is created automatically upon registering an app) to access data in our storage account. Go to the Azure portal home and open the resource group in which your storage account exists. Click Access Control (IAM), on Access Control (IAM) page, select + Add and click Add role assignment. On the Add role assignment blade, assign the Storage Blob Data Contributor role to our service principal (i.e., ADLSAccess), as shown below.
Resource group: Assign role to service principal (Image by author)
Add application secret to the Azure Key Vault
Go to the Azure portal home and open your key vault. Click Secrets to add a new secret; select + Generate/Import. On Create a secret blade; give a Name, enter the client secret (i.e., ADLS Access Key we copied in the previous step) as Value and a Content type for easier readability and identification of the secret later. Repeat the creation process for the Application (client) ID and the Directory (tenant) ID we copied earlier. Your vault should have three secrets now.
Azure Key Vault: Add new secrets (Image by author)
Select Properties, copy the Vault URI and Resource ID to notepad; we will need them in the next step.
Azure Key Vault: Properties (Image by author)
Create an Azure Key Vault-backed Secret Scope in Azure Databricks
If you’ve followed our another article on creating a Secret Scope for Azure SQL Server credentials, you don’t have to perform this step as long as your key vault and Databricks instance in question remains the same.
Go to https://<DATABRICKS-INSTANCE>#secrets/createScopeand replace with your actual Databricks instance URL. Create a Secret Scope, as shown below.
This URL is case sensitive.
Azure Databricks: Create a Secret Scope (Image by author)
Mount ADLS to Databricks using Secret Scope
Finally, it’s time to Mount our storage account to our Databricks cluster. Head back to your Databricks cluster and open the notebook we created earlier (or any notebook, if you are not following our entire series).
We will define some variables to generate our connection strings and fetch the secrets using Databricks utilities. You can copy-paste the below code to your notebook or type it on your own. We’re using Python for this notebook. Run your code using controls given at the top-right corner of the cell. Don’t forget to replace the variable assignments with your storage details and secret Names.
Further reading on Databricks utilities (dbutils) and accessing secrets:
We can override the default language of a notebook by specifying the language magic command at the beginning of a cell. The supported magic commands are %python, %r, %scala, and %sql. Notebooks also support few additional magic commands like %fs, %sh, and %md. We can use %fs ls to list the content of our mounted store.
Azure Databricks: Magic command (Image by author)
Don’t forget to unmount your storage when you no longer need it.
# Unmount only if directory is mounted
if any(mount.mountPoint == mountPoint for mount in dbutils.fs.mounts()):
dbutils.fs.unmount(mountPoint)
Azure Databricks: Unmounting ADLS Gen2 in Python (Image by author)
Congratulations! You’ve successfully mounted your storage account to Databricks without revealing and storing your application secrets and access keys.
Conclusion
We looked at how to register a new Azure AD application to create a service principal, assigned access roles to a service principal, and stored our secrets to Azure Key Vault. We created an Azure Key Vault-backed Secret Scope in Azure Dataricks and securely mounted and listed the files stored in our ADLS Gen2 account in Databricks.
Next Steps
If you’re curious to know how we got those CSV files in our storage, please head to our other article to set up an innovative Azure Data Factory pipeline to copy files over HTTP from a GitHub repository.