
The unpleasant work of Data Exploration

Create a stage for your explorative results using streamlit and maximize the effect while minimizing the effort.


Image by author

Everyone who has worked on a data science project knows the struggle of not getting any recognition for the tedious work that happens before the data is thrown into a fancy-sounding algorithm or black-box model that generates an output.

As much as we all wish real-life Data Science was about applying algorithms and building neural networks to perfectly aggregated data sources, we all know that is far from true.

The most frustrating part is that during the entire phase of aggregating, cleaning, exploring, and enriching the raw data, we don’t get any acknowledgment for our work. And here’s why: no one knows the complexity of what’s happening at your desk once you dive deep into the hidden gems of your data. But who are we to blame them, when all we share after setting up the initial project plan is the final output?

So instead of locking yourself in your room and keeping first insights to yourself, challenge yourself to make an effort to share those insights with your surroundings. Instead of preparing presentation slides that will be outdated as soon as you’ve finished them, make those insights accessible on demand. And now comes the best part: you won’t need to learn a new language, get an expensive license, or hire an analyst. All you need is right at your fingertips: Python and the easy-to-install, open-source library streamlit, which allows you to build applications without being a full-stack developer.

Diving into the streamlit code

So let’s get started.

For all of you who have never worked with streamlit: Don’t be scared! It is easy to get started, and the syntax doesn’t vary too much from what you are already familiar with.

All the code can be found in the https://github.com/charlotteveitner/streamlit_data_exploration repository.

We are using a sample dataset that contains information about advertising spots from different brands that were aired on TV and then led to website visits (described here as reactions). The provided dataset contains randomised information, and its purpose is simply to illustrate the setup of the explorative analysis.

Extract of raw dataset with necessary columns to run the streamlit application.

The question that we want to answer with that dataset is: What drives the success of a TV spot? Is it the channel that the spot was aired on, the hour of the day the spot was played, whether it was prime time or not, or maybe the length of the spot?

The ultimate goal of the project is to be able to predict the optimal setup for a specific spot to maximise its success. But to build a model and setup that can be used for such a predictive analysis, we need to know the data from A to Z. And this is where streamlit comes into play.

Using only a Python script to build a dashboard is every Data Scientist’s dream. No need to remember Excel formulas or figure out how to work the latest BI tools. Not only do you make your own life easier by staying flexible around the explorative analysis of your data, but you also make your analysis accessible to everyone involved in the project. This way, you keep everyone in the same boat and additionally create something visually pleasing enough to get recognition for your explorative work. Sounds like a win-win.

Now let’s see what streamlit is really about.

The published code is hosted on a streamlit server. You can access the dashboard under the following URL: https://share.streamlit.io/charlotteveitner/streamlit_data_exploration/main/streamlit_main.py

Before diving into the code I’ll give you a brief introduction to the capabilities of the streamlit application.

The goal is to understand the data we are working with and find correlations and causalities within our dataset. We want to identify the most important drivers of the target value. Therefore, we use a set of explorative graphs and tables. Streamlit can visualize both plotly graphs and pandas dataframes, which means there’s not a whole lot of formatting left to do. Using pivot tables, histograms, correlation matrices, and scatter plots allows us to generate insights on the fly while sticking to some of the most common visualization methods that are easy to understand from an outside perspective. Check out the example below to see how easy it is to create a pivot table and show it within a streamlit application.

The output of the code is a pivot table in the application.

Streamlit output of pivot table function

Within the streamlit application, the user has the ability to filter the original dataset by a time period, or even compare two different time periods. With a dropdown menu, the user can select the desired visualization method and is then prompted with additional settings for the chosen method.

Play around with the sample set and the different methods. Even with this unrepresentative dataset, you will find yourself exploring all the different capabilities that the tool holds. There are no real insights to unravel with this data, but I still want to empower you: I show you the raw data, the unglamorous part of Data Science. You get the chance to build your own hypotheses and prove them wrong or right with the help of the application. TV spots shown during prime time perform more successfully than those shown outside of prime time – check the pivot tables for prime time and see for yourself.

The script streamlit_helper.py (https://github.com/charlotteveitner/streamlit_data_exploration/blob/main/streamlit_helper.py) includes the functions create_histogram, create_correlation_to_target_value, create_correlation_heatmap, and create_scatterplot, which contain the aggregation and visualization methods for the different analyses. Depending on the choices the user makes in the dropdown menu, the corresponding functions are called.

The code creates a dropdown menu on the sidebar and produces the chosen analysis in the main window.

Output when Pivot Table is selected from the dropdown menu
Output when Histogram is selected from the dropdown menu

Make it yours

If you want to tweak the functions or use your own data for the analysis, you will have to make some changes to the original application. Clone the repository to your local machine and execute the streamlit app by directing your terminal to the folder and entering the execution command "streamlit run streamlit_main.py".

Start by changing the title of the app within the main script to check if you see the change. Press the "Always rerun" button in the streamlit app to instantly see the changes you make in the code.

Set of buttons appears in the top right corner of the application as soon as you make any changes in the code.

Now the app is all yours. I provide you with a set of functions that can easily be changed to your own dataset. The app is just meant to outline some of the highlights of working with streamlit. The opportunities for visualising and transforming data are endless and depend on your specific use case.

Exploration as one step towards the final product

In the end, the app should be able to contribute to the original question. In our case: What are the drivers of a successful TV spot? The graphs and visualizations you choose should help build a better understanding of how to answer that question. The app enables you to get a general overview of the structure of the dataset and the impact of different features without having to push data back and forth between Python and Excel. Both visualizations and data sources can easily be enriched without having to reconstruct the entire exploration setup.

Overall, building a streamlit app helps you get to know your data and find causalities and correlations between features, bringing you one step further in your data science project.

I am not trying to say that publishing your data explorations beats a machine learning model. I am trying to give you a perspective on how to involve more people than just yourself and your closest working buddy in the process of getting to know the data. Not only will you be able to empower others, making them feel like they are part of the journey, but you will also create a level of transparency that builds the foundation for them to trust your work. And most importantly, you will make the unpleasant parts of being a Data Scientist – exploration, cleaning, and explaining your final model – much more enjoyable.

Co-authors: Felix Weißmüller, Sören Lüders

Contact: [email protected]

