Using Python in Power BI
Since April 2019 Python integration has been generally available in Power BI. Just as with the R support you can now use Python for importing data, data transformation and data visualization. In this article we will step by step prepare a Python environment, enable Python in Power BI, import data and apply clustering to the data and create custom visuals in Power BI using Python. All material needed to replicate this example is available in GitHub. The article assumes that you have a working Python environment and Power BI installed.
Creating the data import script in Python
The dataset that we’ll be working with is the Boston Housing dataset, which is available in scikit-learn. The dataset is most often used for examples in regression but we’ll take a different approach and use it for clustering. We will use Principal Component Analysis to reduce the dimensions in order to visualize the data in a 2-dimensional space. After this we will apply k-means clustering to try to identify any homogeneous groups within the data.
I will not go through the steps taken to create the import code as the main focus of this article is to showcase using Python in Power BI. The data preparation phase is available as an Jupyter Notebook for those who are interested to see how it was created. I recommend creating standalone scripts of the code that you are going to use in order to test and debug any problems prior to using it in Power BI. The error messages provided by Power BI are not always that informative.
Enabling Python in Power BI
Set-up Python virtual environment with required libraries
There are several ways to manage your Python (virtual) environments. In addition to virtualenv
and pipenv
you can also use the conda
distribution.
In this example I’m using pipenv to manage the Python environment that Power BI will use. In order to get started open a Command Prompt
or Power Shell
console and navigate to the folder that you want to use as a working directory for your Python code. Go to the location where you have cloned/downloaded the example code from my GitHub repository if you are using that. We will run the following command in order to create the Python environment and install the required libraries.
pipenv install numpy pandas matplotlib seaborn scikit-learn
In case you are using my code from GitHub it is enough to run pipenv install
as it will automatically install the required libraries from the existing Pipfile
.
Lastly, we want to check the actual Python path for this virtual environment since we will need to refer to that from Power BI.
pipenv --venv
The location of my virtual environment was C:\Users\tomra\.virtualenvs\python_and_powerbi-0r4DNLn9
so I copy that path to the clipboard.
NOTE: You can use conda
to manage a virtual environment for Power BI. In case you've used similar conda installation/configuration as me, the conda environments are installed in a system folder in Windows %userprofile%\AppData
. As default you cannot navigate to this folder from the Power BI to choose your Python home directory. A workaround for this is to show hidden files in Windows Explorer. Just be aware of the implications that you don't accidentally remove any system files in regular Windows usage. Another option is to place the conda environments in a non-system folder.
Configure Power BI to use Python
Start Power BI and go to the Options
where you should see the Python scripting
section on the left. Click on that to open the Python script options
. As default Power BI lists the Python environments is has been able to detect in the system. We will need to change these settings since we created a separate virtual environment for Power BI. From Detected Python home directories
choose the Other
option. Set the Python home directory to the Scripts
folder in the path where your virtual environment exists. In my case it is C:\Users\tomra\.virtualenvs\python_and_powerbi-0r4DNLn9\Scripts
.
Now we are ready to utilize Python code in Power BI so get ready!
Power BI
Importing data using a Python script
We go to the Home
tab in the ribbon and click Get data
and choose the More
option to start our data import. Go to the Other
pane where you should find the Python script
option. Select that and click Connect
. A Python script dialog opens where you can add your own code. Copy and paste the below code to the Script
text area and click OK
.
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans# utilize the sklearn.datasets package to load the Boston Housing dataset
boston = load_boston()# scale the data to same value range first since PCA
# is sensitive to the scaling of data
sc = StandardScaler()
X = sc.fit_transform(boston.data)# create PCA with n_components=2 to allow visualization in 2 dimensions
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)# divide data into 5 clusters (refer to .ipynb for motivation)
kmeans = KMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10)
y_kmeans = kmeans.fit_predict(X_pca)# create pandas dataframe of the housing data for Power BI
columns = np.append(boston.feature_names, ['MEDV', 'PC1', 'PC2', 'CLUSTER'])
data = np.concatenate((boston.data,
boston.target.reshape(-1, 1),
X_pca,
y_kmeans.reshape(-1, 1)),
axis=1)
df_housing = pd.DataFrame(data=data, columns=columns)
# we need to convert all columns as string because of different
# decimal separator in Python (.) and Finnish locale (,) that Power BI uses.
# comment out below line if Power BI uses dot as a decimal separator.
df_housing = df_housing.astype('str')# create pandas dataframe of the pca data for Power BI
columns = np.append(boston.feature_names, ['VARRATIO'])
data = np.concatenate((pca.components_,
pca.explained_variance_ratio_.reshape(-1, 1)),
axis=1)
df_pca = pd.DataFrame(data=data, columns=columns, index=['PC1', 'PC2'])
df_pca = df_pca.astype('str')
In the next window we are able to choose which datasets to import. Select both the df_housing
as well as the df_pca
datasets and click Load
. Next we will go and make the final adjustments to our import data. Click Edit Queries
in the ribbon.
Because of the possible difference in your locale settings in Python and Power BI I chose to save all columns as strings in the Python script. The correct data types needs to be set and depending on your locale/settings we might need to take different routes. The target is to convert every column to numbers.
Power BI with decimal point
Although I have not tested this, you should be able to remove the Changed Type
step, choose all columns and let Power BI Detect Data Type
from the Transform
tab in the ribbon in order to get the correct data types. Repeat for both datasets.
Another option is to modify the import script and remove/comment the rows where we set the data type as strings df_housing = df_housing.astype('str')
and df_pca = df_pca.astype('str')
. Power BI should identify the data types during import correctly.
Power BI with decimal comma
We need to replace the decimal separator from dot (.) to comma (,) in order to set the correct data type. Choose all columns and click Replace Values
from the Transform
tab in the ribbon. Replace dot with comma and click OK
.
After this let Power BI detect the correct data type by clicking Detect Data Type
. Repeat these steps for both datasets.
The final step is to add an Index Column
to the df_housing
dataset. This can be done from the Add Column
tab in the ribbon. When that has been done go to the Home
tab and click Close & Apply
.
Create custom visualizations using Python
We go back to the Reports
view to start working on our visualizations. The plan is to visualize the clusters on a chart using the principal components as axes. A scatter plot is a good alternative for this visualization. Click on the Scatter chart
to add it to the page. Drag the df_housing
columns to the visualizations pane in the following way:
- Drag
PC1
to theX Axis
. - Drag
PC2
to theY Axis
. - Drag
CLUSTER
to theLegend
to visualize how the clusters are grouped. - Drag
Index
toDetails
as this will remove the aggregation in the chart.
The end result should look something similar to the picture below. You might have a different cluster order (coloring) due to the random selection of initial centroids in the k-means clustering.
Next we will visualize how each feature affects each of the principal components. This information is available in the df_pca
dataset. We will show this information through a heatmap. The Seaborn
Python library provides an easy way to create a heatmap so we'll add a Python visual
to the page. Power BI might warn you about script visuals, click Enable
to continue. Each feature needs to be separately dragged to the Data fields
area. Drag each column from the dataset except VARRATIO
. Copy the following code snippet to the code area and click Run script
.
import matplotlib.pyplot as plt
import seaborn as snsdataset.index = ['PC1', 'PC2']
plt.figure(figsize=(8, 2))
plt.xticks(rotation=45)
data = dataset.loc['PC1', :].to_frame().sort_values(by='PC1').transpose()
sns.heatmap(data,
cmap='plasma',
square=True,
annot=True,
cbar=False,
yticklabels='')
plt.show()
You should now see a heatmap similar to the figure below. Depending on your screen resolution you might have to hide the script pane to see the visual.
Repeat the same step to create a heatmap for the second principal component but use below code snippet instead to use the data from the second principal component and make a vertical visualization that can be placed on the left side of the scatter plot.
import matplotlib.pyplot as plt
import seaborn as snsdataset.index = ['PC1', 'PC2']
plt.figure(figsize=(2, 8))
data = dataset.loc['PC2', :].to_frame().sort_values(by='PC2', ascending=False)
sns.heatmap(data,
cmap='plasma',
square=True,
annot=True,
cbar=False,
xticklabels='')
plt.show()
I recommend testing your Python visuals in some other tool e.g. using Jupyter Notebook especially in cases where the visualization is complicated or you want to fine-tune all the small details of the visualization. It’s also worth considering to use the same fonts in your Python visualization as in the Power BI report to have a unified look & feel. This is quite easy to achieve with Matplotlib and its siblings.
NOTE: Some python packages only generate HTML, to overcome this limitation, the plot is saved as an image and then displayed using matplotlib. To write to png you need to install chrome driver and e.g. place the exe in C:\chromedriver
. Once you have done this you should this folder in the PATH
environment variable. This step is not needed in our case.
We have now visualized the identified clusters of data with regards to the two principal components that together explain the majority of the variance in the data. The heat maps display which features affects each principal component in a positive or negative manner with regards to the principal component value. Let’s make some final touches to make the report look a little nicer:
- Change one of the gray cluster colors to purple.
- Add a heading and reference to data.
- Add a visualization that shows the average median price per cluster.
- Add a text box explaining the features.
- Add a visualization stating the variance explained for each principal component.
This is how my version ended up looking.
To top it off we’ll just create one more report where we take a closer look at the relation between a chosen feature, median housing value (target) and the identified clusters. By looking at the visualization we can conclude that the first principal component alone explains almost half of the variance in the data. In addition, we can see from the charts that lower principal components values translate to a higher median house value on average. The Distance
feature has the biggest impact for lowering the PC1 value by a small margin, so let's have a closer look at that to see how it behaves in the different clusters. We create a new page and do the following changes:
- Add a descriptive heading.
- Add a
Scatter chart
withX Axis
value isDIS
column,Y Axis
value isMEDV
andLegend
andDetails
fields are set up like the previous scatter plot. - Right click the
DIS
column and chooseNew group
and create bins of size 1. - Add a
Stacked column chart
showing the distribution of distance values by using the newly created binned column ofDIS
. Sort the chart in ascending order by the Axis instead of the count.
This is how my version of the report turned out.
Summary
In this article we have set-up a Python virtual environment and installed the required libraries for data transformation and visualization. You can install additional libraries to the environment if there is some Python library you want to utilize. Next we set-up Power BI to utilize Python scripts. This was a one-time setup, which means that you don’t have to repeat these same steps with your next report. Finally, we utilized Python scripts for importing data and creating custom visuals. There is also possibilities to use Python scripts for data cleaning & transformations for datasets that have already been imported to Power BI. It remains to be seen in what use cases it makes sense to utilize Python scripts within Power BI versus preparing the data outside Power BI and importing a dataset that is ready for visualization. Also the performance of the Python visuals could be better. But keeping in mind that this is still a preview feature it looks very promising! What use cases do you see for using Python in Power BI? Join the discussion and share your thoughts.