A Primer on pantab

How to combine pandas and Tableau in a data workflow

Chris Nguyen
Towards Data Science

--

The pandas library in Python is an essential tool for data analysis, but did you know you can combine it with the Tableau Hyper API to simplify your pipeline from raw data records to visualizations? That is exactly the purpose of the pantab library, developed by Will Ayd at innobi, which is a Python wrapper around Tableau’s Hyper API. Here, I will walk through a guided example of how to use pantab in a data workflow.

Prerequisite Materials

  • Install pantab by running pip install pantab in the terminal of your chosen environment
  • Install the Tableau Hyper API by running pip install tableauhyperapi in the same environment
  • Clone my example repo from GitHub, which has all the data for the exercises we will go through (the data is just some starter Pokemon stats I scraped online)

Why Use pantab?

Before we get started, why use pantab and the Hyper API in the first place? Hyper is Tableau’s in-memory data engine, and its extracts are files with the “.hyper” extension. The Hyper API lets you create and update these “.hyper” files programmatically, in a language like Python, Java, or C++, so you can automate your data processing with Tableau extracts. However, working with the API directly can get a little hairy at times. For example, here is a Python code snippet to transform data from a CSV into a Hyper file:
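(The snippet appeared as an embedded image in the original article; below is a minimal reconstruction of the same idea. It assumes a small pokemon.csv with name, attack, and defense columns; the exact file and column names are illustrative.)

```python
import csv

from tableauhyperapi import (
    HyperProcess, Telemetry, Connection, CreateMode,
    TableDefinition, TableName, SqlType, Inserter,
)

# Every column must be declared up front with an explicit SQL type
pokemon_table = TableDefinition(
    table_name=TableName("public", "pokemon"),
    columns=[
        TableDefinition.Column("name", SqlType.text()),
        TableDefinition.Column("attack", SqlType.int()),
        TableDefinition.Column("defense", SqlType.int()),
    ],
)

# Read the rows out of the CSV, coercing the types by hand
with open("pokemon.csv") as f:
    rows = [
        (r["name"], int(r["attack"]), int(r["defense"]))
        for r in csv.DictReader(f)
    ]

# Start a local Hyper process, open a connection that creates the file,
# create the table, and finally insert the rows
with HyperProcess(telemetry=Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hyper:
    with Connection(
        endpoint=hyper.endpoint,
        database="pokemon.hyper",
        create_mode=CreateMode.CREATE_AND_REPLACE,
    ) as connection:
        connection.catalog.create_table(pokemon_table)
        with Inserter(connection, pokemon_table) as inserter:
            inserter.add_rows(rows)
            inserter.execute()
```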

Inserting data from a CSV to a Hyper file. What a pain!

That’s a lot of effort to insert just 3 rows of data into a Hyper file! (In fact, it took me more time and effort to write and debug this code than the entire rest of the code in this article combined.)

The good news is that pantab makes this process much easier by pairing the Hyper API with pandas and abstracting away the boilerplate. If you are already a pandas practitioner, even better: you get a familiar framework for working with your data and generating your Hyper files.

An example use case is something I run into from time to time in my current position: some extracts take a very long time to refresh, potentially hours overnight, and if something goes wrong with one, I want to investigate it. Specifically, I want to examine that exact version of the extract, connecting to it as a data source and doing EDA on it to check some assumptions, before the data is overwritten the next day. Tableau itself is primarily a data visualization tool, so checking particular groupings or aggregations programmatically is much easier with tools like Python and pandas.

pantab would also be a useful tool in an ETL data pipeline for writing out Hyper files when the data you need has to be transformed from one or more sources, sort of like an alternative to Tableau Prep (or a replacement, should you not have access to it).

At the time of this article, there are only 4 functions in the pantab API:

  • frame_to_hyper: Write a DataFrame to a .hyper extract
  • frame_from_hyper: Read a DataFrame from a .hyper extract
  • frames_to_hyper: Write multiple DataFrames to a .hyper extract
  • frames_from_hyper: Read multiple DataFrames from a .hyper extract

Let’s go through each of these as examples using the data from my repo.

Writing a DataFrame to Hyper

The frame_to_hyper function converts pandas DataFrames to tables within a Hyper file. After importing the required libraries, you just read the data into a pandas DataFrame and then use frame_to_hyper to write the DataFrame to a specified Hyper file, as shown below:
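(Again, the original code was an embedded image; here is a sketch under the same pokemon.csv assumption. It also writes the data a second time under an explicit “Extract” schema, which the next section makes use of.)

```python
import pandas as pd
import pantab
from tableauhyperapi import TableName

# The core workflow is just two lines: read the CSV, write the Hyper file
df = pd.read_csv("pokemon.csv")
pantab.frame_to_hyper(df, "pokemon_public.hyper", table="pokemon")

# The same data written again, this time into an explicit "Extract" schema
df_extract = pd.read_csv("pokemon.csv")
pantab.frame_to_hyper(
    df_extract, "pokemon_extract.hyper", table=TableName("Extract", "pokemon")
)
```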

frame_to_hyper example
Resulting table in Hyper extract

That is two lines of code to do exactly what the direct Hyper API snippet did. Quite a reduction in code! And now you can use pandas to transform the data or run any checks on it before writing it to a Hyper file.

Note that I actually created two DataFrames above. I did this to demonstrate that you can save tables in Hyper files under different schemas (for more background on schemas, consult the Hyper API reference). The default schema is the “public” schema, but you can be explicit and name one, just as I did with the “Extract” schema above. To do this, you must import the TableName object from the tableauhyperapi library (so you can’t 100% avoid the Hyper API after all, but you can certainly reduce how much you need to use it directly!).

Reading from Hyper to a DataFrame

Why did I bother to save my data twice in different schemas? To demonstrate that not all tables in Hyper extract files live in the public schema, so sometimes you will need to check where they are. (In particular, if the extract was created by a Tableau version before 10.5, its table may sit in the Extract schema by default instead of the public schema.) To see this, take the pokemon_extract.hyper file and try to read it into a pandas DataFrame using the frame_from_hyper function with the default table location. It errors out with the message “HyperException: Specified table does not exist: pokemon.” because the “pokemon” table is not in the default location. You need to specify the schema using the TableName object, as in the second line of the snippet below:
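(A sketch of both attempts, reusing the files written above:)

```python
import pantab
from tableauhyperapi import TableName

# This raises "HyperException: Specified table does not exist: pokemon."
# because the table lives in the "Extract" schema, not the default "public"
# df = pantab.frame_from_hyper("pokemon_extract.hyper", table="pokemon")

# Naming the schema explicitly with TableName works
df = pantab.frame_from_hyper(
    "pokemon_extract.hyper", table=TableName("Extract", "pokemon")
)
```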

Specifying the schema in the frame_from_hyper function

You can easily check that the public schema is the default when writing a DataFrame to Hyper, because these two lines produce exactly the same resulting file:
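(As a sketch, reusing the df from the earlier snippet:)

```python
import pantab
from tableauhyperapi import TableName

# Writing with a bare table name...
pantab.frame_to_hyper(df, "pokemon_public.hyper", table="pokemon")

# ...produces the same file as naming the "public" schema explicitly
pantab.frame_to_hyper(df, "pokemon_public.hyper", table=TableName("public", "pokemon"))
```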

The public schema is the assumed default if no specific schema is given
In Tableau’s data pane, the first file shows its table as “public.pokemon” and the second as “Extract.pokemon”.

Reading Multiple Tables into DataFrames and Checking for Existing Schemas

For an existing Hyper file, there might already be tables in one or more schemas. How do we check for them? We can use the frames_from_hyper function. It returns a dictionary where the schema/table names are the keys and the corresponding pandas DataFrames are the values, so looking at the keys reveals any pre-existing schemas and tables.
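(A sketch, using the two files created earlier; the exact repr of the keys may vary by pantab version:)

```python
import pantab

# frames_from_hyper returns {TableName: DataFrame, ...}; the keys tell us
# which schemas and tables each file contains
print(pantab.frames_from_hyper("pokemon_public.hyper").keys())
# -> dict_keys([TableName('public', 'pokemon')])

print(pantab.frames_from_hyper("pokemon_extract.hyper").keys())
# -> dict_keys([TableName('Extract', 'pokemon')])
```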

Return what schemas and tables we already have

Here, we can see that public.pokemon exists in the pokemon_public.hyper file while Extract.pokemon exists in the pokemon_extract.hyper file.

Writing Multiple DataFrames to Hyper

Finally, we can write multiple pandas DataFrames to a Hyper file using the frames_to_hyper function. It takes a dictionary of DataFrames as input and writes each of them out to its own table in a Hyper extract, like so:
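(A sketch, assuming the repo’s data has been split into three DataFrames; the variable and table names here are hypothetical:)

```python
import pantab

# Hypothetical split of the Pokemon stats into three DataFrames
frames = {
    "gen1_pokemon": df_gen1,
    "gen2_pokemon": df_gen2,
    "gen3_pokemon": df_gen3,
}

# Each dictionary key becomes its own table in the resulting Hyper file
pantab.frames_to_hyper(frames, "pokemon_all.hyper")
```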

Writing multiple DataFrames to a Hyper file

Note, however, that this creates three separate tables in the Hyper file, with no way to union or combine them into a single table in Tableau:

All separate, with no way to union or combine them

But what if we do want to combine all of the tables into one? Here, all of our tables have the same shape and fields, so we do want to combine them in the end. We can do this in two ways: either combine them into a single pandas DataFrame before writing out to a Hyper file, or use append mode (table_mode="a") in frame_to_hyper:
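(A sketch of both approaches, reusing the hypothetical df_gen1/df_gen2/df_gen3 frames from above:)

```python
import pandas as pd
import pantab

# Option 1: concatenate in pandas first, then write a single table
combined = pd.concat([df_gen1, df_gen2, df_gen3], ignore_index=True)
pantab.frame_to_hyper(combined, "pokemon_combined.hyper", table="pokemon")

# Option 2: append each frame to the same table with table_mode="a"
# (pantab creates the table on the first write if it does not yet exist)
for frame in (df_gen1, df_gen2, df_gen3):
    pantab.frame_to_hyper(frame, "pokemon_appended.hyper", table="pokemon", table_mode="a")
```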

Two ways to combine DataFrames into one table in Hyper
Result of appending DataFrames
Result of appending Hyper tables

The pandas library can make many things easier, including Tableau workflows. I hope this primer helped you learn more about pantab and sparked some use cases for your data projects!
