
Power BI Connects to Azure Databricks

Deriving the Spark URL and creating the user token that Power BI uses to authenticate to Azure Databricks.

Photo by Carlos Muza on Unsplash

Microsoft Power BI has become more and more popular as a data analytics tool. It is also very common for a company to use a whole suite of Microsoft products, including Azure.

Azure Databricks is one of the most popular services on the Azure platform. It leverages Apache Spark to process data in a distributed environment, which can dramatically improve performance. Azure Databricks also supports Delta Lake, an open-source storage layer for distributed environments. Because Delta Lake supports ACID transactions, it can be used much like most traditional database management systems.

https://delta.io

This article demonstrates how to connect Power BI to Azure Databricks tables (Delta Lake) using the built-in Spark connector. The tricky part is the "Spark URL", which will be covered in detail later on.

Sample Data Table Preparation

Photo by Mika Baumeister on Unsplash

Since the purpose of this tutorial is only to introduce the steps of connecting Power BI to Azure Databricks, a sample data table will be created for testing purposes. To begin with, let's create a table with a few columns: a date column that can be used as a filter, and another column holding an integer value for each date.

Let's first create a notebook in Azure Databricks; I will call it "PowerBI_Test".

Create a database for testing purposes.

%sql
CREATE DATABASE TEST_DB;

Then, import the necessary libraries and create a Python function that generates a Pandas DataFrame with the columns mentioned above.

import numpy as np
import pandas as pd

# Create a date dimension table in Pandas
def create_date_table(start='2020-01-01', end='2020-03-01'):
  df = pd.DataFrame({"date": pd.date_range(start, end)})
  # Weekday name, e.g. "Monday" (dt.weekday_name was removed in pandas 1.0)
  df['day'] = df['date'].dt.day_name()
  # Keep the date part only
  df['date'] = df['date'].dt.date
  # A random integer value for each date
  df['value'] = np.random.randint(100, size=df.shape[0])
  return df

Now, we can create a Spark DataFrame from the Pandas DataFrame and save it to Delta Lake.

date_pdf = create_date_table()
sample_table = spark.createDataFrame(date_pdf)
sample_table.write.format('delta').mode("overwrite").saveAsTable('TEST_DB.sample_table')

Here is a preview of the table:

Azure Databricks URL & User Token

Photo by Susan Holt Simpson on Unsplash

To connect to Azure Databricks from PowerBI, we need two important "keys", which are the URL and the User Token.

How to get the Spark URL?

  1. From the left navigation, go to the "Clusters" tab.
  2. In the cluster list, click the cluster you want to use. This navigates to the cluster edit page.
  3. On the cluster edit page, click "Advanced Options" to expand the section.
  4. Select the "JDBC/ODBC" tab. There is a lot of information here, but we only need the "JDBC URL".
  5. Copy the whole URL into a text editor. This URL CANNOT be used directly; we need to "derive" the correct URL from it.

Here is the URL I copied from the "JDBC URL" text field:

jdbc:spark://australiasoutheast.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/***************/****-******-*******;AuthMech=3;UID=token;PWD=<personal-access-token>

The components we need from this URL are shown in the square brackets:

jdbc:spark
[://australiasoutheast.azuredatabricks.net:443/]
default;transportMode=http;ssl=1;httpPath=
[sql/protocolv1/o/***************/****-******-*******]
;AuthMech=3;UID=token;PWD=<personal-access-token>

We also need to add a protocol at the front, which is HTTPS. So, the final URL becomes:

https://australiasoutheast.azuredatabricks.net:443/sql/protocolv1/o/***************/****-******-*******
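The derivation above is mechanical, so it can also be sketched in a few lines of Python. This is just an illustrative helper, not part of the official tooling, and the workspace and cluster IDs in the example JDBC URL below are made up:

```python
import re

def derive_server_url(jdbc_url: str) -> str:
    """Extract the host:port and httpPath from a Databricks JDBC URL
    and combine them into the HTTPS Server URL Power BI expects."""
    host = re.search(r"jdbc:spark://([^/;]+)/", jdbc_url).group(1)
    http_path = re.search(r"httpPath=([^;]+)", jdbc_url).group(1)
    return f"https://{host}/{http_path}"

# Hypothetical JDBC URL with made-up workspace and cluster IDs
jdbc = ("jdbc:spark://australiasoutheast.azuredatabricks.net:443/default;"
        "transportMode=http;ssl=1;"
        "httpPath=sql/protocolv1/o/1234567890123456/0123-456789-abcde123;"
        "AuthMech=3;UID=token;PWD=<personal-access-token>")

print(derive_server_url(jdbc))
# → https://australiasoutheast.azuredatabricks.net:443/sql/protocolv1/o/1234567890123456/0123-456789-abcde123
```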

How to get the User Token?

We need to create a user token for authentication.

Firstly, find the user icon in the top right corner. Click the icon, and then select "User Settings" in the dropdown menu.

On the "User Settings" page, click the "Generate New Token" button. In the pop-up window, fill in the comment field, which will remind you in the future what this token is used for.

The "Lifetime" determines after how many days the token is revoked automatically. Please note that if the lifetime is left unspecified, the token lives indefinitely.

After clicking the "Generate" button, a token will be generated.

IMPORTANT: You will not be able to retrieve the token once you have clicked the "Done" button, so please make sure you copy the token and save it in a secure place now.
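As an aside, tokens can also be created programmatically through the Databricks Token REST API (`POST /api/2.0/token/create`), authenticated with an existing personal access token. Below is a minimal sketch; the workspace host is hypothetical, and `create_token` is a helper name I made up:

```python
import json
import urllib.request

# Hypothetical workspace host -- replace with your own
HOST = "https://australiasoutheast.azuredatabricks.net"

def build_token_request(comment: str, lifetime_days: int) -> dict:
    """Payload for the Databricks Token API: a comment and a lifetime in seconds."""
    return {"comment": comment, "lifetime_seconds": lifetime_days * 24 * 3600}

def create_token(existing_pat: str, comment: str, lifetime_days: int = 90) -> str:
    """Create a new personal access token via POST /api/2.0/token/create."""
    payload = json.dumps(build_token_request(comment, lifetime_days)).encode()
    req = urllib.request.Request(
        f"{HOST}/api/2.0/token/create",
        data=payload,
        headers={"Authorization": f"Bearer {existing_pat}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The response contains the new token under "token_value"
        return json.load(resp)["token_value"]

# A 90-day token request body looks like this:
print(build_token_request("PowerBI connection", 90))
```

Note the chicken-and-egg constraint: the API itself requires authentication, so the very first token still has to come from the UI steps above.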

Connecting Power BI to Azure Databricks

Photo by Isaac Smith on Unsplash

Now, we can get data from Databricks in Power BI.

In Power BI, click "Get Data" > search for "Spark" > select "Spark" in the list > click the "Connect" button.

In the pop-up window, fill the URL https://australiasoutheast.azuredatabricks.net:443/sql/protocolv1/o/***************/****-******-******* into the "Server" text field.

Then, select "HTTP" for the protocol. For the "Data Connectivity mode", choose "DirectQuery". This is because we are likely to be dealing with large datasets when we need Azure Databricks in practice, so importing all the data into Power BI would not be a good idea in that scenario.

After clicking the "OK" button, we will be able to see all the tables in the current Azure Databricks cluster. If there are many tables, we can search for the one we need in the navigator. Then, click the "Load" button to add the table as a data source. Of course, you can import multiple tables at a time in practice.

Now, we can create visualisations from Delta Lake.

Summary

Photo by Aaron Burden on Unsplash

This article has demonstrated how to connect Microsoft Power BI to Azure Databricks. The most critical steps are deriving the Spark URL and generating the user token in Azure Databricks.

Hopefully, this article can help Data Scientists/Engineers to create visualisations from Azure Databricks directly.


