
How To Prepare Your Data For Visualizations

Without using Tableau Prep or Alteryx

Stephanie Lo
Towards Data Science
8 min read · Jun 26, 2023


Photo by Robert Katzki on Unsplash

Want to get started on your next Data Visualization project? Start off by getting friendly with Data Cleaning. Data Cleaning is a vital step in any data pipeline, transforming raw, ‘dirty’ data inputs into those that are more reliable, relevant and concise. Data preparation tools such as Tableau Prep or Alteryx were created for this purpose, but why spend money on these services when you can accomplish the task with open-source programming languages like Python? This article will guide you through the process of getting data ready for visualization using Python scripts, offering a more cost-effective alternative to data preparation tools.

Note: Throughout this article we will be focusing on getting data Tableau-ready for data visualizations, but the main concepts apply equally to other business intelligence tools.

I get it. Data cleaning just seems like another step in the already lengthy process of bringing your visualizations or dashboards to life. But it’s crucial, and it can be enjoyable: it’s how you get comfortable with your data set, by taking an in-depth look at the data you have (and don’t have) and at the decisions you’ll need to make to reach your end analysis goals.

Whilst Tableau is a versatile data visualization tool, sometimes the route to get to your answer isn’t clear. This is where processing your dataset before loading it into Tableau may be your biggest secret helper. Let’s explore some key reasons why data cleaning is beneficial before integrating it with Tableau:

  • Eliminates irrelevant information: Raw data often contains unnecessary or repeated information that can clutter your analysis. By cleaning the data, you can remove the noise and concentrate your visualizations on the most relevant data features.
  • Simplifies data transformation: If you have a clear vision of the visualization you intend to produce, performing these pre-transformations before loading the data into Tableau can streamline the process.
  • Easier transferability within teams: When data sources are regularly updated, new additions can introduce inconsistencies and potentially break Tableau. With Python scripts and code description (more formally known as markdown documentation), you can effectively share and empower others to understand your code and troubleshoot any programming issues that may arise.
  • Time-saving for data refreshes: Data that needs to be refreshed regularly can benefit from the Hyper API, an application that produces Hyper file formats specific to Tableau and allows for automated data extract uploads, making the data refresh process more efficient.

Now that we’ve covered some advantages of preparing your data, let’s put this into practice by creating a simple data pipeline. We’ll explore how data cleaning and processing can be integrated into a workflow and help make your visualizations easier to manage.

Creating a data pipeline using Python Scripts

Image by Author

The journey our data will take is a pretty simple one: data cleaning, data processing for visuals, and transformation into Tableau-ready Hyper files for seamless integration.

A final note before delving into our working example: for the Hyper file conversion you will need to download the pantab library. This library simplifies the conversion of pandas DataFrames into Tableau .hyper extracts. You can easily complete this by running the following code in the terminal of your chosen environment (for those less familiar with environments, this is a great primer article on what they are and how to install certain libraries):

#run the following line of code to install the pantab library in your environment
pip install pantab

Tutorial: Data Preparation with Python exploring Electric Vehicle Licenses in Canada

The data visualizations we will be aiming to produce focus on the popularity of different electric automakers and models, based on government data available from Statistics Canada.

It’s important to note that this builds upon a dataset previously explored in my prior article: Electric Vehicle Analysis with R. If you’re interested in understanding the initial exploration of the data set and the rationale behind the decisions made, please refer to it for greater detail. This tutorial focuses on building out the Python scripts; at each step following the initial inputs, we will save the output of each Python script into its respective folder, as outlined below:

Image by Author

This folder structure ensures that the pipeline is well organized and that we keep a record of each output in the project. Let’s jump into building out our first Python script!

Data Cleaning

The initial script in our pipeline follows the fundamental steps of data cleaning, which for this dataset includes: keeping/renaming relevant columns, removing nulls and/or duplicates, and making data values consistent.

We can start by specifying our input file locations and the destination for the output files. This step is important since it allows us to organize different versions of the files in the same location. In this case, the file outputs are updated on a monthly basis, so each output is separated by month, as indicated by the suffix at the end of the file name, 2023_04.
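A minimal sketch of how these paths might be defined follows; the raw_data folder and the exact file names are illustrative assumptions, while the clean_data folder and the 2023_04 suffix follow the structure described above:

import os

# resolve paths relative to this script so the pipeline runs the same way
# no matter where it is launched from; ../ points to the parent folder
script_dir = os.path.dirname(os.path.abspath(__file__))

# raw monthly input; the data month is encoded at the end of the file name
input_file = os.path.join(script_dir, "../raw_data/ev_registrations_2023_04.csv")

# destination for the cleaned output in the shared clean_data folder
output_file = os.path.join(script_dir, "../clean_data/ev_registrations_clean_2023_04.csv")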

The following code reads the original .csv inputs and defines which columns we want to keep. In this case, we are interested in preserving information related to the types of models bought, and disregard columns pertaining to car dealerships or anything else irrelevant.
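A sketch of this step; the raw column names in columns_to_keep are placeholders, since the exact headers depend on the Statistics Canada extract:

import pandas as pd

# read the raw monthly extract
df = pd.read_csv(input_file)

# keep only the fields describing the vehicles themselves and drop
# dealership-related or otherwise irrelevant columns
# (these raw column names are placeholders for illustration)
columns_to_keep = ["Calendar Year", "Vehicle Make", "Vehicle Make and Model"]
df = df[columns_to_keep]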

Now we can shorten the column names, remove leading or trailing white space, and add underscores for easier comprehension.
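For example, along these lines:

# trim leading/trailing white space and swap spaces for underscores,
# e.g. "Vehicle Make and Model" -> "Vehicle_Make_and_Model"
df.columns = df.columns.str.strip().str.replace(" ", "_")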

Next, after having checked that there are only a few null entries in the dataset, we will remove the null data with the .dropna function. At this point you would normally also remove duplicates, but for this particular dataset we will not: there is a substantial amount of genuinely repeated information, and in the absence of row identifiers, removing duplicates would result in data loss.
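A one-line sketch of the null removal:

# only a handful of rows contain nulls, so dropping them is safe here;
# duplicates are kept on purpose, since repeated rows are genuine
# registrations and there is no row identifier to tell them apart
df = df.dropna()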

The final step is to save our data as a .csv file to the appropriate location, which is the clean_data folder of our shared directory.
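In code, this is a single write, reusing the output_file path defined at the start of the script:

# output_file was built above from __file__ and the relative ../clean_data path
df.to_csv(output_file, index=False)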

Notice how we referenced the file using __file__ and specified the file directory with a relative path, where ../ points to the parent folder. This concludes our data cleaning script. Now let’s proceed to the data processing phase!

Access to the complete working code and assembled scripts can be found in my GitHub repository here.

Data Processing for Visualizations

Let’s revisit the visualizations we are trying to achieve, which aim to highlight the changes in popularity of registered electric vehicles. To effectively showcase this, we want our final Tableau-ready dataset to include the following features, which we will code out:

  • Absolute counts of vehicles by year
  • Proportional counts of vehicles by year
  • Largest increases and decreases of vehicles registered
  • Ranking of vehicles registered
  • Previous ranking of vehicles registered for comparison

Based on the visuals you aim to produce, the creation of your ideal columns may be an iterative process. In my case, I included the last column after having built out my visualizations, since I knew I wanted to provide the viewer with a visual comparison of ranking differences, so the Python script was adjusted accordingly.

For the following code we will focus on the model aggregated data set, since the other dataset for brands is very similar. Let’s first define our inputfile and outputfile.
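A sketch, mirroring the cleaning script; the exact file names are illustrative, while the clean_data and clean_model folders follow the pipeline layout above:

import os

script_dir = os.path.dirname(os.path.abspath(__file__))

# input: the cleaned file produced by the data cleaning script
inputfile = os.path.join(script_dir, "../clean_data/ev_registrations_clean_2023_04.csv")

# output: the visualization-ready, model-level data set
outputfile = os.path.join(script_dir, "../clean_model/ev_vehicle_models_2023_04.csv")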

Notice how we referred to the inputfile from the clean_data folder, which was the output of our data cleaning script.

The code below reads in the data and creates a data frame of the aggregated counts by Vehicle_Make_and_Model and Calendar_Year.
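A sketch of the aggregation, assuming each row of the cleaned file represents one registration; license_count is a placeholder name for the resulting column:

import pandas as pd

df = pd.read_csv(inputfile)

# count registrations per model and year
model_counts = (
    df.groupby(["Vehicle_Make_and_Model", "Calendar_Year"])
    .size()
    .reset_index(name="license_count")
)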

The pivot function works much like a pivot table in Excel: it takes each of the values in Calendar_Year and turns them into columns.
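Continuing from the model_counts frame sketched above, that pivot might look something like this:

# wide format: one row per model, one column per calendar year,
# values are the absolute license counts
model_wide = model_counts.pivot(
    index="Vehicle_Make_and_Model",
    columns="Calendar_Year",
    values="license_count",
).reset_index()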

Then the script uses a for loop to create the per_1K columns. This calculates each model’s proportion of registrations so that every model can be compared on the same scale, and creates a column for each year.
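One plausible reading of per_1K is each model’s share per 1,000 registrations in a given year; a sketch along those lines, assuming the pivoted year columns are integers from 2019 through 2022:

# express each model's count per 1,000 registrations for every year,
# so models can be compared on the same scale
years = [2019, 2020, 2021, 2022]
for year in years:
    model_wide[f"per_1K_{year}"] = model_wide[year] / model_wide[year].sum() * 1000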

From the proportions by year, we can calculate the largest increases and decreases for each model from the start of the dataset in 2019 to the last full year of data in 2022.
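One way to express this is the change in per-1K share between those two years (per_1K_change is a placeholder name):

# change in per-1K share between the first year (2019) and the last
# full year (2022); positive values are increases, negative are decreases
model_wide["per_1K_change"] = model_wide["per_1K_2022"] - model_wide["per_1K_2019"]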

Here, the melt function is used to re-pivot the separated per_1K columns by year back into rows, so that we have a single per_1K column with its associated values.
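Continuing the sketch above (per_1K_change is the illustrative column created in the previous step):

# re-pivot the per-year per_1K columns back into rows, leaving a single
# per_1K column alongside a Calendar_Year column
per_1k_long = model_wide.melt(
    id_vars=["Vehicle_Make_and_Model", "per_1K_change"],
    value_vars=[f"per_1K_{year}" for year in years],
    var_name="Calendar_Year",
    value_name="per_1K",
)

# strip the "per_1K_" prefix so the year can be used as a join key
per_1k_long["Calendar_Year"] = (
    per_1k_long["Calendar_Year"].str.replace("per_1K_", "").astype(int)
)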

The code below joins our absolute counts with the other calculations we just created.
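One way to do this, continuing the sketch, is a merge on model and year:

# join the absolute counts back onto the per-1K figures and the change column
model_final = model_counts.merge(
    per_1k_long,
    on=["Vehicle_Make_and_Model", "Calendar_Year"],
    how="left",
)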

We can now create the rank column using license counts and sort these values by Vehicle_Make_and_Model and Calendar_Year.
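A sketch of that step (license_count is the placeholder column from the aggregation above):

# rank models within each year by license count (1 = most registered),
# then sort so each model's yearly history sits together
model_final["rank"] = (
    model_final.groupby("Calendar_Year")["license_count"]
    .rank(method="dense", ascending=False)
)
model_final = model_final.sort_values(["Vehicle_Make_and_Model", "Calendar_Year"])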

The last column to create is previous_rank, using the shift function.
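Continuing the sketch, grouping by model and shifting by one row gives each model its prior year’s rank:

# each model's rank from the previous year; the first year per model has
# no prior rank and is left as NaN
model_final["previous_rank"] = (
    model_final.groupby("Vehicle_Make_and_Model")["rank"].shift(1)
)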

Finally, we are able to save the output to the clean_model folder path in our pipeline, supplying us with one visualization-ready data set.
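In the sketch, this is a single write to the outputfile path defined earlier:

# save the visualization-ready model data set into the clean_model folder
model_final.to_csv(outputfile, index=False)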

As a friendly reminder, the full Python script code, including that for the clean_brand processed data set, can be found in my GitHub repository here.

Transforming your final data files into .hyper file formats

The final step in our pipeline is relatively simple, since all we have left to do is convert the processed .csv files we created into .hyper file formats. This should be painless as long as you’ve downloaded the pantab library as referenced earlier.

It’s worth mentioning that in Tableau, connected data can either have a live connection or be extracted. A live connection ensures a continuous flow of data, with updates from the source reflected almost immediately in Tableau. Extracted data involves Tableau creating a local file with a .hyper extension which contains a copy of the data (a detailed description of data sources can be found here). Its main advantage is fast loading: Tableau can access and present the information more efficiently, which is particularly beneficial with large datasets.

The hyper file conversion script starts by loading the pandas and pantab packages, followed by reading in the cleaned_model data set that you need for Tableau.

The last line of code uses the frame_to_hyper function, which produces the .hyper file and saves it to the hyper folder.
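Put together, the conversion script might look roughly like this; the table name passed to frame_to_hyper and the relative paths are illustrative choices:

import pandas as pd
import pantab

# read the processed, Tableau-ready model data set
df = pd.read_csv("../clean_model/ev_vehicle_models_2023_04.csv")

# frame_to_hyper writes the DataFrame out as a Tableau .hyper extract in the
# hyper folder; "vehicle_models" is the table name the extract will carry
pantab.frame_to_hyper(df, "../hyper/ev_vehicle_models.hyper", table="vehicle_models")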

As a final step, we can easily load .hyper files into Tableau by opening a new workbook and, in the select-a-file section, choosing More to browse to the file we want to load. When we load in our ev_vehicle_models.hyper file, it shows as a Tableau Extract, as in the screenshot below, and your data is ready to build your visuals upon!

Closing Thoughts

By incorporating thoughtful planning into your visualizations, you can simplify the maintenance of your dashboards through the creation of a straightforward data pipeline. Don’t worry if you lack the resources; open-source programming languages like Python offer powerful capabilities. As a final, friendly reminder, for access to the Python scripts please check out my GitHub repository here.

All images unless otherwise noted are by the author.

References

  1. Salesforce, Tableau Hyper API, 2023
  2. R. Vickery, A Data Scientists Guide to Python Virtual Environments, Jan 2021
  3. K. Flerlage, Tableau Data Sources Part 1: Data Source Types, Jul 2022
