
Interactive simulation with Tableau and Jupytab

Quickly create and deploy simulation tools and dashboards by connecting Tableau directly to your Python Jupyter notebooks, and spend your time on the data, not on building visualizations

Multi-Dimensional overview of the Iris Dataset

Overview

This article shows how to let Tableau, a leading interactive data visualization tool, access dynamic data instead of static data. This opens up new applications, such as dashboards that can display real-time data, or the ability of users to display dynamic predictions obtained by machine learning models.

We start with a gentle introduction to dynamic data access in Tableau, both through Tableau’s native language and through Python (via Analytics Extensions). Since this is a hands-on article, we describe how to install the necessary components (Tableau, and Jupyter notebooks) and then illustrate how to configure them and quickly obtain a Tableau dashboard that displays dynamic data based on machine learning.

Tableau

Tableau is a Business Intelligence tool widely used in industry (along with PowerBI and QlikView). It allows non-technical people to create and deploy web applications with a drag-and-drop interface.

The only drawback is that by default, the data need to be readily available in a supported database, which cannot itself be updated through the interface. You can however create blazingly fast interactive dashboards that connect to numerous data sources, and quickly get an overview of your data at scale.

Dynamic data in Tableau

Tableau’s scripting language

Tableau offers an Excel-like language that allows users to create custom calculations and manipulate different types of data (numeric, string, boolean). However, the limitations of this scripting language can be very frustrating: there is no pivot-table-like data manipulation, and even some simple transformations (like splitting a field into multiple columns) often require numerous steps.

Tableau's scripting language is mainly there to transform values at the cell level, or to operate on the full data set for filtering or computing an aggregation. That said, your options are rather limited, and you will mostly have to prepare your data outside of Tableau and create a precomputed dataset with all the data that you want to display.
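As a purely illustrative sketch (the sales table, column names and pivot below are hypothetical), this is the kind of preparation you would typically do in Pandas before handing a flat, precomputed file to Tableau:

import pandas as pd

# Hypothetical raw data: one row per (date, product) sale.
sales = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "product": ["A", "B", "A"],
    "amount": [100.0, 250.0, 80.0],
})

# A pivot-table style aggregation that is painful to express with
# Tableau calculations alone.
daily_by_product = sales.pivot_table(
    index="date", columns="product", values="amount", aggfunc="sum"
).reset_index()

# Export a flat, Tableau-friendly dataset.
daily_by_product.to_csv("daily_sales_by_product.csv", index=False)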

But what if you want to create a simulation or apply some machine learning techniques and visualize the resulting changes after updating some parameters? The way Tableau works with data sources simply does not allow this kind of operation, or it would force you to code your algorithm without any library, which is thoroughly impractical.

Analytics Extension to the rescue

To achieve real-time analytics with Tableau, or at least on-the-fly computations using custom algorithms, a very interesting feature of Tableau is hidden in Help → Settings and Performance → Manage Analytics Extension Connection…. It allows you to achieve what we would expect from a web application.

Analytics Extension - Image by Author

Here is where Analytics Extension Connections fit in the Tableau data ecosystem:

Data Sources in Tableau - Image by Author

Python in the Tableau ecosystem

The two places in Tableau where we can expect Python code are in data source creation, and on-the-fly computations (based on the displayed data).

Data source creation

The data source is often an existing database. It is possible to write a custom data injector that writes to this database. This approach should be fast, but it obviously requires a full-blown database, and also a scheduler that regularly invokes the data injector program.
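As a rough, hedged illustration (the metrics table, SQLite file and schedule are hypothetical and not part of any Tableau API), such an injector can be as small as a script that appends fresh rows to a database Tableau already connects to, run periodically by cron or any other scheduler:

import sqlite3
from datetime import datetime

import pandas as pd

def inject():
    # Fetch or compute the latest data (stubbed here with a single row).
    fresh = pd.DataFrame({"timestamp": [datetime.utcnow().isoformat()],
                          "value": [42.0]})
    # Append it to the database that the Tableau data source points to.
    with sqlite3.connect("metrics.db") as conn:
        fresh.to_sql("metrics", conn, if_exists="append", index=False)

if __name__ == "__main__":
    inject()  # e.g. invoked every 5 minutes by cron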

Another approach consists in connecting directly to a custom data source: Tableau allows us to create a custom Web Data Connector (WDC) using the WDC API. A fair number of connectors are available, some of which may fit your needs. If this is not the case you will have to create a custom connector, but be warned: the learning curve is steep.

Another alternative is that illustrated in this article: by using Jupytab, you can directly export your Pandas Dataframes to Tableau from a Jupyter Notebook. (This approach uses the WDC API under the hood, so that you don’t have to wrangle with it.) By using Jupytab, you also get scheduled data refreshes for free, as you can use the Tableau internal scheduler to poll data on demand. A minor limitation of this approach is the maximum size of data sources: data sets with a billion rows typically do not fit easily in a Pandas Dataframe!

On-the-fly computations

Besides Jupytab, not many other Tableau tools perform on-the-fly data source creation and computation. We must however mention TabPy, which is Tableau’s own tool for interactive analytics in Python. It creates a server that allows you to use Python code in Tableau, by executing code on demand.
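For reference, a typical TabPy workflow looks roughly like the sketch below; the import path and deploy() signature are assumptions that vary across TabPy releases, so check the TabPy documentation for your version. You start the TabPy server, deploy a Python function to it, and then call it from a Tableau calculated field (for example with SCRIPT_REAL and tabpy.query):

# Assumed TabPy client API - verify against your TabPy version.
from tabpy.tabpy_tools.client import Client

client = Client("http://localhost:9004/")

def add(x, y):
    # TabPy passes columns as lists, so we operate element-wise.
    return [a + b for a, b in zip(x, y)]

# Publish the function so Tableau can call it by name.
client.deploy("add", add, "Adds two columns element-wise", override=True)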

Jupytab serves exactly the same purpose, but with a few key differences:

  • All your Python code comes from a single notebook: you create the data source and expose it for Tableau there.
  • You don’t write Python code in Tableau, you always call a Python function from your notebook.
  • The set of available libraries is linked to the notebook. This means that you can create multiple execution environments with different sets of libraries and use all of them in Tableau, possibly from multiple notebooks.

If your main development tool is a Jupyter notebook with Pandas and some data science libraries, and you want to quickly create interactive dashboards ready for production, Jupytab is a tool that is worth a try.


Jupytab installation

We assume that you are familiar with Python, Conda (or virtualenv) and Jupyter, and that you use Linux/macOS for your development. Jupytab works on Windows but is less well tested there; using WSL (Windows Subsystem for Linux) also works for this article. We will be using Conda for the installation, but it should work with pip as well.

Jupytab has two components, jupytab and jupytab-server, that need to be installed in their own environment:

  • jupytab: a very simple API that exposes dataframes and custom functions from a notebook. It needs to be installed in the notebook environment, and only requires Pandas as a dependency.
  • jupytab-server: provides the Web Data Connector, spawns kernels, manages the configuration and so on… Do not install this component in the same environment as your notebook, so as to prevent conflicts.

On the notebook side

We first create a virtual environment using Python 3.7 to keep everything clean and reproducible (you can also skip this part if you are ready to follow along in a Conda environment that you already have):

(base) % conda create -n jupytab-notebook-env python=3.7
(base) % conda activate jupytab-notebook-env

Then we install the latest jupytab version:

(jupytab-notebook-env) % conda install jupytab=0.9.11

Then we install the kernel in Jupyter:

(jupytab-notebook-env) % conda install ipykernel
(jupytab-notebook-env) % python -m ipykernel install --user --name jupytab-simulation-demo

You can now create, in Jupyter, a notebook that will use the newly installed kernel.

Jupytab Simulation kernel - Image by Author

It is best to verify that Jupytab has been properly installed by checking its version, for example:

import jupytab
print(jupytab.__version__)

[Output]: 0.9.11

On the server side

We also create a new virtual environment based on Python 3.7. You can open a new terminal or just deactivate your previous environment using the command conda deactivate:

(base) % conda create -n jupytab-server-env python=3.7
(base) % conda activate jupytab-server-env

Then we install the latest jupytab-server version and… that’s all!

(jupytab-server-env) % conda install jupytab-server=0.9.11

Connect Tableau to the notebook

Write the notebook

The notebook and dashboard that inspired this article are freely available: don’t hesitate to download them and use them to follow along.

For our simulation we will only use Pandas and scikit-learn but you may of course install your own libraries. We start by installing scikit-learn in the jupytab-notebook-env (from the base environment, so either in a new terminal or after conda deactivate):

(base) % conda activate jupytab-notebook-env
# No need to install pandas, it is already installed with jupytab:
(jupytab-notebook-env) % conda install scikit-learn

We create a new notebook named jupytab-simulation-notebook:

Notebook creation - Image by Author

We will use the well-known Iris data set, which is included in scikit-learn, and load everything into Pandas dataframes:

import pandas as pd
import jupytab
from sklearn.datasets import load_iris
iris = load_iris()
iris_data_df = pd.DataFrame(columns=iris.feature_names, data=iris.data)
iris_target_df = pd.DataFrame(columns=['target'], data=iris.target)
iris_target_class_df = pd.DataFrame(columns=['target_name'], data=iris.target_names)
iris_data_df.sample(2)

[Output]:

Sepal sample from Iris Dataset - Image by Author

Then we register the dataframes in a Jupytab Tables dictionary, to indicate that we want to expose them to Tableau:

tables = jupytab.Tables()
tables['iris'] = jupytab.DataFrameTable("Iris DataSet", iris_data_df, include_index=True)
tables['iris_target'] = jupytab.DataFrameTable("Iris Classification Target", iris_target_df, include_index=True)
tables['iris_target_class'] = jupytab.DataFrameTable("Iris Classes", iris_target_class_df, include_index=True)

A DataFrameTable defines how to expose data to Tableau:

Jupytab DataFrameTable documentation - Image by Author

Here, we simply exposed static DataFrames along with their index.
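The static approach is all we need here, but if you want a table that is recomputed every time Tableau refreshes the data source, the DataFrameTable documentation above also shows a refresh callback. A minimal sketch, assuming the refresh_method parameter of your Jupytab version works as documented (the exact contract, in-place update versus returned DataFrame, may differ, so the function below does both):

import numpy as np

live_df = pd.DataFrame({'value': np.random.rand(10)})

def refresh_live_df():
    # Recompute the values each time Tableau asks for fresh data.
    live_df['value'] = np.random.rand(len(live_df))
    return live_df

tables['live'] = jupytab.DataFrameTable("Live random values",
                                        dataframe=live_df,
                                        refresh_method=refresh_live_df,
                                        include_index=True)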

In order to allow jupytab-server to retrieve data, we need to add two cells.

The first cell generates a schema that declares all our Dataframes. It needs to be exactly like the cell below (you can copy and paste directly in your notebook):

# GET /schema
tables.render_schema()

Executing this cell prints the schema exported to Tableau.

_[Output]: [{"id": "iris", "alias": "Iris DataSet", "columns": [{"id": "index", "dataType": "int"}, {"id": "sepal_lengthcm", "dataType": "float"}, {"id": "sepal_widthcm", "dataType": "float"}, {"id": "petal_lengthcm", "dataType": "float"}, {"id": "petal_widthcm", "dataType": "float"}]}, {"id": "iris_target", "alias": "Iris Classification Target", "columns": [{"id": "index", "dataType": "int"}, {"id": "target", "dataType": "int"}]}, {"id": "iris_target_class", "alias": "Iris Classes", "columns": [{"id": "index", "dataType": "int"}, {"id": "targetname", "dataType": "string"}]}]

The second cell is where data is exported:

# GET /data
tables.render_data(REQUEST)

Executing this cell generates a harmless error in the notebook: the REQUEST variable is only available when the notebook is executed by the Jupytab server:

[Output]:
NameError                                 Traceback (most recent call last)
      1 # GET /data
----> 2 tables.render_data(REQUEST)
NameError: name 'REQUEST' is not defined

If you don’t want to get the error during execution, you can wrap render_data() with a simple try block:

# GET /data
try:
    tables.render_data(REQUEST)
except NameError:
    print("Not available outside jupytab context")

That's it! You may also find several Jupytab notebook examples in the Jupytab GitHub repository under the jupytab-server/samples directory.

Configure and start the Jupytab server

In the Jupytab server environment, we need to create a configuration file that lets us set a few parameters, like the server port, a secret token and, of course, the list of notebooks that the server must expose.

The config file may be created using a simple cat command (under Unix), or by writing a config.ini file using any text editor:

(base) % conda activate jupytab-server-env
(jupytab-server-env) % cat << EOF > config.ini
[main]
listen_port = 8123
notebooks = JupytabSimulator

[JupytabSimulator]
path = /path/to/your/notebook/jupytab-simulation-notebook.ipynb
EOF

More configuration options are also available.
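For example, the server log below warns when no token is defined; a slightly richer config.ini could look like the following (the security_token and description keys are given as an assumption, so check the Jupytab configuration documentation for the exact option names):

[main]
listen_port = 8123
security_token = replace-with-a-long-random-string
notebooks = JupytabSimulator

[JupytabSimulator]
path = /path/to/your/notebook/jupytab-simulation-notebook.ipynb
description = Iris simulation notebook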

When you are done, you only need to start the Jupytab server:

(jupytab-server-env) % jupytab --config=/path/to/your/config.ini
Starting Jupytab-Server 0.9.11
SSL not enabled
Start notebook /path/to/your/notebook/jupytab-simulation-notebook.ipynb on 127.0.0.1:35453
You have no defined token. Please note your process is not secured !
        Please open : http://computer_name:8123
INFO:[KernelGatewayApp] Kernel started: a6abe896-2cb8-403a-8661-3236e16d8def
INFO:[KernelGatewayApp] Registering resource: /schema, methods: (['GET'])
INFO:[KernelGatewayApp] Registering resource: /data, methods: (['GET'])
INFO:[KernelGatewayApp] Registering resource: /_api/spec/swagger.json, methods: (GET)
INFO:[KernelGatewayApp] Jupyter Kernel Gateway at http://127.0.0.1:35453

This will run the notebook and create a server allowing you to access it.

Please note that you need to wait for the registration of your notebook resources to appear in the log, otherwise you will not be able to access it from Tableau. This is the part below:

INFO:[KernelGatewayApp] Registering resource: /schema, methods: (['GET'])
INFO:[KernelGatewayApp] Registering resource: /data, methods: (['GET'])

If your notebook takes time to compute, this may take a few minutes to appear.
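Before switching to Tableau, you can optionally confirm from Python that the /schema resource answers. A minimal sketch, assuming the kernel gateway port printed in your own log (35453 above, and different on every run):

import urllib.request

# The port comes from the "Jupyter Kernel Gateway at ..." line of your log.
with urllib.request.urlopen("http://127.0.0.1:35453/schema") as response:
    print(response.read()[:200])  # first bytes of the exported schema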

Import the data in Tableau

If you don’t have Tableau installed already, you can download it and try it for free. This article was tested with Tableau 2020.2, so you might have to adapt the instructions slightly as needed.

Select the Web Data Connector from the Start screen:

Tableau Connectors - Image by Author

You can then input the web data connector URL that was printed in the terminal:

Web Data Connector Browser - Image by Author
Start notebook /path/to/your/notebook/jupytab-simulation-notebook.ipynb on 127.0.0.1:35453
You have no defined token. Please note your process is not secured !
        Please open : http://computer_name:8123

You should be able to select your first notebook and view all available actions:

Welcome screen for Jupytab UI - Image by Author

Proceed to Explore in Tableau and create your data set using the provided tables. Pay attention to the order and the joins between the tables:

DataSource relations in Tableau - Image by Author

We now have a data set to work with, imported from a Jupyter notebook.


Create the Iris simulator

We will create a Tableau simulator based on the Iris data set, a classic machine learning example. The task at hand consists in classifying three types of iris using only four features (sepal and petal length/width).

The goal of this simulator is only to give a glimpse of dashboard creation in Tableau: dashboards can be built without writing a single line of code and, more importantly, in only a few minutes.

Update the notebook to add a prediction function

We will use a Multi-Layer Perceptron (MLP) classifier and train it with the full content of the data set. Our goal is not to achieve any kind of high-accuracy predictions, but instead to showcase how you can dynamically make predictions from within Tableau, thanks to Jupytab.

As we did with jupytab.Tables, we need to add a jupytab.Functions dictionary for registering our predictor method:

from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(max_iter=600).fit(iris.data, iris.target)
def predictor(sepal_length_cm, sepal_width_cm, petal_length_cm, petal_width_cm):
    class_predict = clf.predict([[sepal_length_cm, sepal_width_cm, petal_length_cm, petal_width_cm]])
    return iris.target_names[class_predict][0]  # We return the only prediction
functions = jupytab.Functions()
functions['predict'] = jupytab.Function('A predictor for the Iris DataSet', predictor)
predictor(0.5, 5, 4, 2)

[Output]: ‘virginica’

We also need to define a new endpoint that will be used by the Jupytab server:

# POST /evaluate
try:
    functions.render_evaluate(REQUEST)
except NameError:
    print("Not available outside jupytab context")

[Output]: Not available outside jupytab context

We must then save the notebook and restart its kernel through the Jupytab UI so that our changes are taken into account by Tableau.

Available Jupytab actions for notebook - Image by Author

Predict dynamically in Tableau

Connect to Jupytab through an Analytics Extension

Before using the predictor function in Tableau, we need to register our Jupytab server as an analytics extension in Tableau.

This is easily done from the Help → Settings and Performance → Manage Analytics Extension Connection… item in the Tableau Desktop menu.

Analytics extension setup - Image by Author

Check that everything works fine by using the Test Connection button:

Successful connection for Analytics extension - Image by Author

Dashboard creation

To display a data point for each specimen, we need to transform our Iris DataSet Index to a dimension:

Dimension properties in Tableau Desktop - Image by Author

Then drop this value as Detail:

Marks edition in Tableau Desktop - Image by Author

We get the following result:

Default plot with Index dimension - Image by Author

We will then display a grid to visualize all dimensions against each other. Select all measures from the Iris DataSet:

Measures selection - Image by Author

And drop them in Columns:

Iris Dataset measures in columns - Image by Author

Do the same with Rows:

Iris Dataset measures in rows - Image by Author

Your visualization is almost ready!

Multi-Dimensional plot in Tableau Desktop - Image by Author

Just drop Target Name from Iris Classes onto Color in the Marks card.

Use Target Name for colored clusters - Image by Author

The output allows you to visualize the result as a grid graph comparing all dimensions:

Image by Author

Now we would like to check that our model correctly predicts the Iris type. We want to call our predictor for each Iris sample and display misclassified data in the graph above. We need to create a Calculated Field to achieve that:

Image by Author

The following formula returns the predicted Iris type:

SCRIPT_STR("JupytabSimulator.predict", AVG([Sepal Length Cm]), AVG([Sepal Width Cm]), AVG([Petal Length Cm]), AVG([Petal Width Cm]))
Code editor for calculated field in Tableau Desktop - Image by Author

JupytabSimulator is the name of our registered notebook in the config.ini file, and predict is the name of our registered function: joined together as notebook.function, they identify the target. This means that you can target different notebooks, each one using its own virtual environment and set of libraries.

The remaining arguments are the four parameters that you defined in your predictor function, and of course they need to be passed in the same order.

We also need another calculated field that compares the real Iris type with the prediction and returns False if there is a mismatch. The ATTR function is required, as Iris Predictor is already an aggregate:

ATTR([Target Name]) = [Iris Predictor]
Boolean indicator for valid predictions - Image by Author

We then replace the current cluster colors with Prediction Matching and choose Circle as the mark type.

Prediction Matching used as Color Mark - Image by Author

Tableau will communicate with Jupytab for a few minutes, as it needs to predict the Iris class for all the displayed marks (i.e., data points):

Analytics Extension request to Jupytab - Image by Author

This is an important point of the Tableau Analytics extension: you need to minimize the number of marks displayed in your Dashboard to guarantee quick results and a more interactive experience.
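One way to keep the number of analytics-extension calls down is to precompute predictions inside the notebook and expose them as a regular Jupytab table, so that Tableau joins a static column instead of issuing one SCRIPT_STR call per mark. A minimal sketch reusing the objects already defined in our notebook (the iris_predictions table name is arbitrary):

# Predict once for the whole data set inside the notebook...
iris_pred_df = pd.DataFrame(
    {'predicted_name': iris.target_names[clf.predict(iris.data)]})

# ...and expose it like any other DataFrame, joinable on the index.
tables['iris_predictions'] = jupytab.DataFrameTable(
    "Iris Predictions", iris_pred_df, include_index=True)

The trade-off is that such precomputed predictions only change on a data source refresh, so they do not react instantly to parameters the way SCRIPT_STR calls do.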

Incorrect predictions can then be displayed thanks to the Prediction Matching calculation above:

Prediction errors in visualization - Image by Author

By comparing this graph with the previous one, we see that mispredicted iris types tend to occur at the boundary between iris groups.

A visualization that only shows the number of mismatches would drastically reduce the number of marks, while still providing important information. We only need to remove all measures from Columns & Rows and replace them with our Prediction Matching calculation. This new visualization loads in a few seconds:

"Flower graph" for the Iris Dataset - Image by Author
"Flower graph" for the Iris Dataset – Image by Author

Last but not least, we would like to understand how the number of iterations of our MLP training impacts our results. This will require us to use another cool feature of Tableau: Parameters.

We start by updating our predictor function in the notebook, adding a new optional parameter that retrains our MLP. We just need to cache the trained classifier so that we do not retrain when the requested number of iterations has not changed:

from sklearn.neural_network import MLPClassifier
last_max_iter = 0
clf = None
def train(request_max_iter):
    global clf, last_max_iter
    if not clf or last_max_iter != request_max_iter:
        clf = MLPClassifier(max_iter=request_max_iter, random_state=1).fit(iris.data, iris.target)
        last_max_iter = request_max_iter
    return clf
def predictor(sepal_length_cm, sepal_width_cm, petal_length_cm, petal_width_cm, request_max_iter=1000):
    clf = train(request_max_iter)  # We now train as needed

    class_predict = clf.predict([[sepal_length_cm, sepal_width_cm, petal_length_cm, petal_width_cm]])
    return iris.target_names[class_predict][0]
functions = jupytab.Functions()
functions['predict'] = jupytab.Function('A predictor for the Iris DataSet', predictor)
predictor(0.5, 5, 4, 2)

[Output]: ‘virginica’

Do not forget to restart the kernel in Tableau through the Jupytab UI so as to apply the changes.

On the Tableau side, we will create a parameter with a range from 10 to 1000 in order to visualize the impact of the number of training iterations on our predictions:

Parameter Creation in Tableau Desktop - Image by Author
Parameter Settings - Image by Author

We then add this parameter to our view:

View parameter in Tableau Dashboard - Image by Author

We can now dynamically change the maximum number of training iterations of our MLP:

MLP Max Iteration settings - Image by Author

We also need to update the Iris Predictor calculation and add this new parameter:

SCRIPT_STR("JupytabSimulator.predict", AVG([Sepal Length Cm]), AVG([Sepal Width Cm]), AVG([Petal Length Cm]), AVG([Petal Width Cm]), [MLP Max Iteration])
Iris Predictor calculation updated with MLP Max Iteration - Image by Author

Just after clicking Apply, the result is immediately updated with our default value of 10 iterations.

MLP Training 10 Iterations

MLP Training 10 iterations - Image by Author

MLP Training 100 Iterations

MLP Training 100 iterations - Image by Author

Conclusion

Tableau is a very powerful tool for data visualization. Using Jupyter notebooks to compute data and get an almost immediate feedback is very convenient.

We have barely scratched the surface of Tableau's features, and there is almost no limit to the kinds of visualizations and interactions that you may create in your dashboards.

The Jupytab project is open source and accessible on GitHub under the MIT license.

You can also peruse online many impressive dashboards done with Tableau.

Acknowledgments

This is the first project I have ever published on GitHub and it would probably never have left my computer without the strong support of CFM’s Open-Source Program and the help of people working there.

I would especially like to thank Jean-Sébastien Dieu, Eric Lebigot and Florent Zara for their support on various aspects of Jupytab and for helping me to publish this article.

