Sur La Tableau

Alex Mitrani
Towards Data Science
10 min read · Nov 11, 2019


A picture of a Sur La Table storefront edited with elements from the Tableau visualization software

Visualizations are one of the main methods Data Scientists have to clearly communicate their findings to a wide audience. Prior to my journey into Data Science my main visualization tools were the Microsoft Office Suite and a collection of simple image editors (believe it or not mainly MS Paint and Mac Preview). These tools were more than enough for my needs at the time and were the only tools that my high school and college focused on. Once I entered the professional world I quickly learned that these tools alone are not enough to thrive in any analytical career and that if I didn’t build up my skill set I would fall short of my goals. The modern work environment requires advanced data tools beyond the capacity of Excel which is why Business Intelligence (BI) software, like Tableau, is becoming more popular with cross-functional teams.

I was motivated to learn Data Science techniques because I began to work with large external data sets for which Excel and Google Sheets took far too long to compute aggregation columns. I learned how to use Python’s Matplotlib, Seaborn, Folium, and Plotly libraries to create powerful visualizations from multiple data sources. These libraries require a thorough understanding of each chart type’s specific parameters to create professional-level visualizations, which can be time consuming and creates a barrier to entry for those without coding experience. There are BI tools that merge the power and customizability of these libraries with the relative simplicity of Excel and Paint. I decided to acquaint myself with Tableau after networking with several practicing members of the field who use the software in addition to various visualization libraries.

Tableau is a BI tool that can easily combine data from multiple sources to create professional interactive visualizations. Two of Tableau’s advantages are that users can create in-depth visualizations without having to adjust numerous parameters of code and that users can merge tables from multiple sources with the same logic of SQL queries. Tableau is now sold on a subscription basis and has experienced growth over the last few years. According to the company’s 10-K SEC filing from 2018, “As of December 31, 2018, we had over 86,000 customer accounts compared to over 70,000 as of December 31, 2017 and over 54,000 as of December 31, 2016.” However, it is up to you to decide if it is worth the $70 monthly user subscription fee.
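As a rough check on those figures, the growth implied by the quoted account counts can be worked out directly. This is only a sketch: the counts are the "over N" lower bounds from the filing, so the percentages are approximate, and the annual cost simply multiplies the $70 monthly fee by twelve.

```python
# Customer-account counts quoted from the 10-K (each is an "over N" lower bound).
accounts = {2016: 54_000, 2017: 70_000, 2018: 86_000}


def yoy_growth(previous: int, current: int) -> float:
    """Return year-over-year growth as a percentage."""
    return (current - previous) / previous * 100


growth_2017 = yoy_growth(accounts[2016], accounts[2017])
growth_2018 = yoy_growth(accounts[2017], accounts[2018])

# Annual cost of one subscription at the $70 monthly price mentioned above.
annual_cost = 70 * 12

print(f"2016 -> 2017: {growth_2017:.1f}%")
print(f"2017 -> 2018: {growth_2018:.1f}%")
print(f"One seat for a year: ${annual_cost}")
```

The absolute increase is about 16,000 accounts in both years, so the percentage growth rate is lower in the second year even though the customer base kept expanding.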

I completed Tableau’s Creator section’s online tutorial videos and practice assignments to familiarize myself with the software. I then recreated some visualizations from my previous data science projects to compare some Python visualization libraries to Tableau and detailed my experiences here.

Part 1: Getting Started

Tableau is offered on a monthly subscription basis with numerous price points based on your needs. You can download a free trial for 14 days and if you are a student or teacher you can get one free year of Tableau. I used the two week trial to follow along with the tutorial videos.

A New Tableau Workbook

As in Excel, Tableau files are called workbooks and a workbook can contain multiple sheets. A Tableau sheet is where you can create numerous types of graphs to visualize parts of your data in different ways. These graphs are inherently interactive; you can select what parts of data you’re interested in looking at and use Tableau’s GUI to create custom groupings of your data. The data pane on the left of the screen separates your data between categorical and continuous columns. You can then drag and drop these columns to the visualization’s x and y axes.

Tableau workbooks can also contain dashboards and story points. A dashboard is a view of one or more interactive visualizations. You can set how these visualizations initially appear so that every user sees the same default view. For example, let’s say you have a dashboard with a “sales by country” graph. This graph can be adjusted in several ways; for instance, the user can change it to “sales by product type.” You can set this visualization to default to the by-country view for every person that you share the workbook with to ensure that you are sending the consistent message that you intended. Tableau story points are a collection of ordered and commented dashboards that together tell a story that supports your analytical conclusion.

Tableau’s server connection options with the MongoDB BI menu open

Tableau allows you to import data from local files and connect directly to external servers; you select the type of database that is hosted by the server (there are numerous SQL and NoSQL database options listed) and then make queries to select the data that you wish to analyze.

There are numerous options to save your Tableau workbooks. Packaged workbooks (file extension .twbx) contain the underlying data to your analysis. Non-packaged workbooks (file extension .twb) do not have the underlying data included in the file, which means that the recipient is required to have access to the same data sources to view the file. Both of these file types retain the interactivity of the visualizations. You can also share the data and visualizations as other file types like Excel or PDF; however, these will not retain the interactivity of a Tableau workbook. You may also share these files with Tableau’s cloud service. Users with Tableau Desktop can view and edit the workbook if given the proper permission, and users with Tableau Viewer ($12 monthly, minimum 100 users) can only view the workbook.

Part 2: Data Sources

The Join menu open on the Data Source tab in Tableau. Like a SQL query you can join multiple tables on a specific column.

It’s easy to connect multiple data sources within Tableau. Once you select the tables you are interested in, Tableau’s Join menu pops up. It automatically selects common columns to join on, and you have the option to change both the joined column and the type of join used. Tableau offers a robust GUI, which is good for users who are put off by the text-based queries of SQL and NoSQL databases.
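For readers who have not written a join by hand, the logic behind the Join menu can be sketched with Python's built-in sqlite3 module. The table names, column names, and rows here (listings, neighborhoods, zip_code, and so on) are hypothetical and chosen only for illustration; the point is how the join type changes which rows survive.

```python
import sqlite3

# In-memory database with two small, made-up tables that share a column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE listings (id INTEGER, neighborhood TEXT, rent INTEGER);
    CREATE TABLE neighborhoods (neighborhood TEXT, zip_code TEXT);
    INSERT INTO listings VALUES
        (1, 'Astoria', 2200), (2, 'Harlem', 2500), (3, 'Unknown', 1800);
    INSERT INTO neighborhoods VALUES
        ('Astoria', '11102'), ('Harlem', '10027');
""")

# Inner join: keeps only listings whose neighborhood has a match.
inner_rows = conn.execute("""
    SELECT l.id, l.neighborhood, n.zip_code
    FROM listings AS l
    INNER JOIN neighborhoods AS n ON l.neighborhood = n.neighborhood
    ORDER BY l.id
""").fetchall()

# Left join: keeps every listing and fills missing matches with NULL (None).
left_rows = conn.execute("""
    SELECT l.id, l.neighborhood, n.zip_code
    FROM listings AS l
    LEFT JOIN neighborhoods AS n ON l.neighborhood = n.neighborhood
    ORDER BY l.id
""").fetchall()
conn.close()

print(inner_rows)
print(left_rows)
```

Changing the join type in Tableau's menu corresponds to swapping INNER JOIN for LEFT JOIN here: the unmatched listing disappears in the first query and is kept with an empty ZIP Code in the second.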

The Data Source tab contains the raw data and has all of the sorting and aggregation options of a traditional Excel sheet. If you are using an external database you also have the option to keep the data up-to-date with the “Live” option located in the top right portion of the window. If the “Extract” option is selected then the data will update at set intervals of time. I am working with static .csv files so this option doesn’t matter at this time.

Part 3: Analysis and Visualizations

I worked with three datasets: the sample sales dataset from Tableau, data from my logistic rent prediction project, and data from my Naive Bayes Classifier for Natural Language Processing.

The tutorials detail the numerous ways Tableau can view typical internal sales data generated by a fictional furniture company. Tableau’s GUI allows you to click and drag the columns most relevant to your analysis. The Show Me button on the top right portion of a sheet window suggests visualizations that reveal relationships between the selected columns.

A Tableau dashboard that utilizes sample customer data from a fictional furniture company. Note the regression line and tooltip information box open on the bottom left pane.

The sample dataset contains detailed data about the sample company’s customers, vendors, and internal cash flows. I was able to easily create visualizations that broke down data by market location, customer segment, product type, and profit margin. After this tutorial I can see that purchasing Tableau makes sense if your team is consistently generating large amounts of internal data. Let’s see how well Tableau handles visualizations for other Data Science topics.

NYC Apartment Listings Choropleth

The GitHub repository for my rent prediction project is hosted here. I have two Pandas data frames: one with asking rents and information about the listings, the other a table of NYC’s neighborhoods and corresponding ZIP Codes. Both data frames share the “neighborhood” column. In Python’s Pandas library you can use the .merge() method, or .join() once the shared column is set as the index, to combine the two data frames. In Tableau you can use the interactive Join menu.
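The Pandas side of that comparison can be sketched as follows. This is an illustration rather than the project's actual code: only the “neighborhood” column comes from the description above, while the other column names and all of the values are made up. It uses DataFrame.merge, which matches rows on a named column.

```python
import pandas as pd

# Hypothetical stand-ins for the two data frames described above.
listings = pd.DataFrame({
    "neighborhood": ["Astoria", "Harlem", "Astoria"],
    "rent": [2200, 2500, 2400],
})
neighborhoods = pd.DataFrame({
    "neighborhood": ["Astoria", "Harlem"],
    "zip_code": ["11102", "10027"],
})

# A left merge keeps every listing and attaches the ZIP Code
# of its neighborhood via the shared column.
merged = listings.merge(neighborhoods, on="neighborhood", how="left")
print(merged)
```

The `how` argument plays the same role as the join type selector in Tableau's Join menu.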

My favorite visualization from my rent predictor is the choropleth map of total number of listings by ZIP Code, which I created with Python’s Folium library. It took me about two hours to complete this visualization because of the steps involved. I first had to find a guide to create a choropleth map; the guide I followed was for Los Angeles-based data. This meant that I had to find a .geojson file that was specific to NYC ZIP Codes. Once I put this all together I ran into several error messages. It turned out that the Folium library had been recently updated with several important changes to the code. I read the documentation and looked for answers that would help me correct the errors. Once I finally repaired the code I had an interactive choropleth map to support my analysis. A screenshot of the map and a gist of the code are below.

A screenshot of the choropleth map created with Python’s Folium library. Folium generates an HTML interactive map, which can be downloaded from the project repository and rendered with a web browser.
The code used to create the interactive choropleth

Creating a choropleth map in Tableau is much simpler than coding one. You first create a new sheet separate from your raw data. Then you drag and drop the ZIP Code column to the Columns bar and Rows bar. Then you click the right-hand drop-down menu of the ZIP Code bubble in the Rows bar. You convert the data to a count of entries, where each entry represents a listing. Finally you click Show Me on the top right portion of the window; the first option is for an interactive choropleth map. I didn’t need to search for a .geojson file specific to NYC. This took me two minutes, which is much faster than coding the map by hand.

A screenshot of an interactive Tableau choropleth map. Tableau maps move by dragging and scrolling instead of the point-and-click controls of a Folium map.

The Tableau map is heavily adjustable. I quickly made an average rent by ZIP Code map as well and I was able to change the colors with drop down menus instead of adjusting parameters within the code.
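For comparison, the two aggregations behind these maps (a count of listings and an average rent per ZIP Code) can each be expressed as a group-by in Pandas. The sketch below uses made-up rows and assumed column names (zip_code, rent); it covers only the aggregation step, not the map rendering.

```python
import pandas as pd

# Made-up listings; each row represents one apartment listing.
listings = pd.DataFrame({
    "zip_code": ["11102", "10027", "11102", "11201", "10027", "11102"],
    "rent": [2200, 2500, 2400, 3100, 2300, 2600],
})

# One row per ZIP Code with the number of listings and the mean asking rent.
by_zip = (
    listings.groupby("zip_code")
    .agg(listing_count=("rent", "count"), average_rent=("rent", "mean"))
    .reset_index()
)
print(by_zip)
```

Either column of `by_zip` could then be passed to a mapping library as the value that colors each ZIP Code area.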

PCA Visualizations

I recently completed a Naive Bayes Classifier that could tell the difference between scripts of Curb Your Enthusiasm and Seinfeld. Using the principles of Natural Language Processing and Principal Component Analysis (PCA) I identified three dimensions that would best explain the variance between the two types of scripts. I generated the visualization below utilizing Python’s Matplotlib and Scikit-Learn libraries.

The relevant part of PCA that applies to the following visualizations is that each word in each script is a dimension of the script. If a script contains seven mentions of the name Kramer then the value of that script’s Kramer Dimension is seven. If the script does not contain the word Cat then the value of that script’s Cat Dimension is zero. A script can have as many dimensions as there are words; PCA seeks to reduce the number of dimensions while explaining as much of the variance between scripts as possible. I selected the three principal components that explain the most variance to create the visualization below.
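The idea of words as dimensions can be made concrete with a small sketch. The four toy "scripts" below are invented for illustration, and the reduction is done with a plain NumPy singular value decomposition as a stand-in for the Scikit-Learn PCA used in the project, so treat it as an approximation of the approach rather than the original code.

```python
from collections import Counter

import numpy as np

# Toy "scripts" invented for illustration; the real project used full episodes.
scripts = {
    "seinfeld_a": "kramer kramer kramer kramer kramer kramer kramer jerry soup",
    "seinfeld_b": "jerry elaine soup kramer",
    "curb_a": "larry cat larry jeff",
    "curb_b": "larry jeff susie",
}

# Every distinct word becomes one dimension.
vocabulary = sorted({word for text in scripts.values() for word in text.split()})


def count_vector(text):
    """Return the word counts of one script, ordered like `vocabulary`."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]  # missing words count as 0


X = np.array([count_vector(text) for text in scripts.values()], dtype=float)

# The first script mentions Kramer seven times and never mentions a cat.
kramer_dimension = X[0, vocabulary.index("kramer")]
cat_dimension = X[0, vocabulary.index("cat")]

# PCA via SVD of the mean-centered matrix: rows of Vt are the principal axes.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
n_components = 3
components = X_centered @ Vt[:n_components].T
explained_variance_ratio = S**2 / (S**2).sum()

print(components.shape)
print(explained_variance_ratio[:n_components])
```

Each row of `components` is one script described by three numbers instead of one number per word, which is what makes a three-axis plot possible.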

A 3-D representation of the scripts utilizing features identified with Principal Component Analysis. Note the red (Curb) scripts are tightly clustered while the blue (Seinfeld) scripts are more spread out.
The code to create the 3-D representation of the “Curb your Enthusiasm” and “Seinfeld” scripts

I exported the data used to create the above chart to Tableau. I searched various forums to learn how to make a three dimensional chart in Tableau and I quickly learned that there is a debate over whether a third axis is bad practice in the Business Intelligence community (See the Tableau Community Forum conversations here and here). Three dimensional charts are common in the Data Science community because we can explain the mathematical principles used to create the additional dimension.

I created a side-by-side visualization of the scripts’ principal components and added it to a Dashboard.

A Tableau Dashboard visualization of the Principal Components for “Curb Your Enthusiasm” and “Seinfeld” scripts. The visualization contains two side-by-side charts of the first dimension to the second and third dimensions, respectively.

Although Tableau does not support the type of XYZ 3 dimensional graph I wanted to create, the Dashboard clearly shows that there is a tight cluster of Curb Your Enthusiasm scripts compared to the widely spread out Seinfeld scripts. Fortunately Tableau has numerous import options which would enable me to upload the Python generated image to the Tableau dashboard if it was absolutely necessary for my analysis.

Conclusion

Tableau has numerous pros and cons:

Pros

  • Tableau is a BI tool which is designed to unlock insights from customer/client sales, vendor, and internal financial data
  • Tableau has a simple and intuitive GUI
  • Tableau users can quickly create BI visualizations without coding experience
  • There are numerous importing options which allow users to easily consolidate internal and external data sources
  • Joining data sources follows the logic of SQL queries
  • The Dashboard and Story Point views allow you to guide your audience through your analysis with interactive visualizations
  • There are numerous training videos and exercises hosted on their website
  • You can save and share a full Tableau workbook or create different types of files to share static images with non-Tableau users
  • There are numerous publishing options available through Tableau’s cloud services

Cons

  • Tableau is a BI tool which is not designed to create all of the visualizations that are prevalent in the data science field
  • Tableau is expensive; only get subscriptions for team members who will consistently use it
  • There are not as many models and options available in Tableau to Data Scientists as there are in Python libraries like Scikit-Learn

TL;DR

Tableau is a Business Intelligence tool. It is fairly easy to get used to its intuitive GUI. If your team consistently generates a tremendous amount of internal data (i.e. data from clients/customers, vendors, and products/services) then Tableau can be a powerful tool to help you uncover valuable insights. Tableau shines when it comes to drawing insights from consumer behavior; however, it fell short when it came to generating 3-D charts for advanced topics like PCA.

Tableau is not an all-in-one software solution for business or data science; it does not have the power and customizability of Python packages. Tableau easily generates visualizations and associated statistical information, so make sure you understand what this information means before you present your findings. Tableau is available on a monthly subscription basis; only purchase subscriptions for yourself and your colleagues if you plan on using the software and if your team generates the data that it was designed to handle.

Only you can determine if Tableau is worth the price for your team.


Data scientist with a passion for using technology to make informed decisions. Experience in Real Estate and Finance.