Tableau+Python: TabPy and geographical clustering

An easy way to explore data using K-means clustering.

Andrey Babynin
Towards Data Science


Dashboard for clustering AirBNB

This is a short tutorial on how to run K-Means clustering on a map using the Tableau extension TabPy.

TabPy can be downloaded here or, if you use conda:

conda install -c anaconda tabpy-server

There are a number of articles on how to install and run the TabPy server, but I personally encountered two problems:

  1. PermissionError: [WinError 5] Access is denied. It is resolved here
  2. Unable to run the TabPy server using the Windows batch file, startup.bat. The solution is to install tornado-5.1.

As I mentioned, my goal was to test TabPy functionality, and I chose a clustering problem as one of the most common use cases. The K-Means algorithm is a great tool for exploring your data without making any assumptions. There are plenty of tutorials on K-Means: some are here and here. The source of the data is Airbnb Listings in New York City, published at the Tableau Public Resources.

The table consists of several dimensions:

  • Zipcode
  • Property Type
  • Neighbourhood
  • Room Type, etc.

I used Zipcode for geographical mapping. The main dimensions of interest are Property Type and Host Since (a time dimension).

First, I created two control parameters to manage clustering:

Tableau Control Parameters

Clustering Method, as the name suggests, selects the clustering algorithm.

The script is embedded into a Calculated Field using the SCRIPT_REAL function:

One of the peculiarities of working with TabPy is that it sends data to the server as lists, so each argument has to be converted to an np.array and reshaped into a single column. Another trick is to remove all rows with NaN values, which is done in line 16 of the script.
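The conversion described above can be sketched roughly as follows. This is a minimal illustration, not the original script: the helper name `to_feature_matrix` and the sample values standing in for `_arg1` and `_arg2` are hypothetical, and it assumes Tableau nulls arrive as `None`.

```python
import numpy as np

def to_feature_matrix(*columns):
    """Stack Tableau-style argument lists into an (n_rows, n_features) array.

    Hypothetical helper: each argument is assumed to be a plain Python list,
    with None marking a null value.
    """
    # Convert each list to a float array (None becomes NaN) and reshape it
    # into a single column.
    cols = [np.array(col, dtype=float).reshape(-1, 1) for col in columns]
    X = np.hstack(cols)
    # Keep only the rows where every feature is present.
    valid = ~np.isnan(X).any(axis=1)
    return X[valid], valid

# Made-up values standing in for _arg1 and _arg2
_arg1 = [100.0, 250.0, None, 80.0]
_arg2 = [12, 3, 7, None]
X_clean, valid = to_feature_matrix(_arg1, _arg2)
```

Note that dropping rows shortens the data, so the `valid` mask is returned as well. It can be used to map the cluster labels back onto the original rows if the result has to match the length of the input.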

_arg5 and _arg6 correspond to the control parameters. Although they are single values, they are packed into lists as well, so to unpack them in Python you need to take the first element: _arg5[0] and _arg6[0].
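To make the unpacking concrete, here is a small self-contained sketch. It rests on several assumptions that are not in the original script: the second parameter is taken to be the number of clusters, the method name "KMeans" is a made-up parameter value, the inputs contain no nulls, and K-Means is written out in plain NumPy (Lloyd's algorithm) only to keep the example free of extra dependencies.

```python
import numpy as np

def kmeans_labels(X, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns one integer label per row of X."""
    rng = np.random.default_rng(seed)
    # Start from n_clusters distinct rows picked at random.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center: (n_rows, n_clusters)
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def run_clustering(_arg1, _arg2, _arg5, _arg6):
    # Parameters arrive as lists, so take the first element of each.
    method = _arg5[0]            # assumed: name of the clustering method
    n_clusters = int(_arg6[0])   # assumed: number of clusters
    X = np.column_stack([np.array(_arg1, dtype=float),
                         np.array(_arg2, dtype=float)])
    if method != "KMeans":       # hypothetical parameter value
        raise ValueError(f"Unsupported method: {method}")
    # Return a plain list of floats, one value per input row.
    return kmeans_labels(X, n_clusters).astype(float).tolist()

# Made-up inputs: two well-separated groups of three points each
labels = run_clustering(
    [1.0, 1.1, 0.9, 10.0, 10.2, 9.8],
    [1.0, 1.0, 1.0, 5.0, 5.0, 5.0],
    ["KMeans"] * 6,
    [2] * 6,
)
```

In a real dashboard one would more likely call a library implementation of K-Means; the point here is only how the list-wrapped parameters are unpacked and how a list of the same length as the input is returned.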

Also, the algorithm sometimes fails because the table calculation is computed along the wrong dimension. In my case, I computed along Zipcode. The default table calculation setting rarely works, so check this option first before looking for an error in your code.

Default Table Calculations

To see if our clustering makes sense, I added several scatterplots to visualize the joint distribution of underlying variables:

  1. Avg. Price vs Number of Reviews
  2. Avg. Price vs Median Review Scores Rating
  3. Number of Reviews vs Sum (Beds)

Scatterplots of joint distribution

The plots show that the clusters are not very well defined, so our naive model is about as good as random picking, but at least it works without errors 👌.

To add interactivity between the charts and the map, you can create dashboard actions. I added a hover action under Dashboard > Actions.

This was my first experience using TabPy. So far it is of limited usefulness, but I hope Tableau will develop it further.

P.S.

Also, I sometimes write on Telegram (mostly in Russian) about asset management.
