
Creating Interactive Jupyter Notebooks and Deployment on Heroku Using Voila

Analyzing Covid Tweets With Python, Pandas, & Plotly

Image by Author (Created using Google Drawing), A Collapsed view of the result

Recently I was assigned the task of analyzing a Covid tweets dataset containing around 44k entries, and this blog describes all the steps I took to create some beautiful visualizations, ending with deployment to the cloud platform Heroku. The whole Jupyter notebook was deployed, which eliminated the need for any front-end coding.

The Dataset

The data contains a datetime column, which was the perfect choice for the index, as it lets me group the data and its subsets by date. Here is what df.info() returned, where df is the pandas DataFrame object holding the CSV file:

All the features of Dataset
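
For reference, loading the data with the timestamp as the index can look like this ("covid_tweets.csv" is a placeholder filename):

import pandas as pd

df = pd.read_csv("covid_tweets.csv", index_col="created_at", parse_dates=True)
df.info()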

There are 19 columns (the index was set to "created_at", which was initially part of the dataset), and most of them have non-null values, except the hashtags and user_description columns. That makes sense, as not everybody tweets with hashtags and not every user has a profile description.

After a bit of munging, I found that the language column only contains ‘en’, so I dropped it. Similarly, ‘id’, ‘tweet_url’, and ‘user_screen_name.1’ were dropped; ‘user_screen_name’ and ‘user_screen_name.1’ were identical columns.
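
In code, this cleanup boils down to a single drop call:

# Drop columns that carry no signal: 'language' is all 'en',
# and 'user_screen_name.1' duplicates 'user_screen_name'
df = df.drop(columns=["language", "id", "tweet_url", "user_screen_name.1"])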

Data PreProcessing

After this quick overview, it’s time to convert the data into a better form:

Timezone Conversion: I live in India, where the timezone is GMT+5:30 (IST). The dataset timestamps are in UTC, which makes date-time analysis misleading. For instance, without the conversion, the peak tweeting hour came out to be 6 am, when it was actually 11 am. Here is what I did:

UTC to IST conversion
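
In pandas, the conversion can be as short as this (a sketch, assuming the parsed timestamps in the index are naive UTC):

# Localize the naive index to UTC, then convert to Indian Standard Time
df.index = df.index.tz_localize("UTC").tz_convert("Asia/Kolkata")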

Cleaning the Tweets: Tweets usually contain hashtags, mentions, links, and photos. When data is fetched from the Twitter API, the photos are replaced with links. All this extra material needs to be removed for a good word cloud and efficient sentiment detection. I created the following functions to remove this noise, strip stop words, and tokenize:

Tweets Cleaning Functions
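
A sketch of what such cleaning functions can look like, assuming NLTK for stop words and tokenization (the exact originals are in the notebook):

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires nltk.download("stopwords") and nltk.download("punkt") once
STOP_WORDS = set(stopwords.words("english"))

def clean_tweet(text):
    # Remove links (photos arrive as links), mentions, and hashtags
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"[@#]\w+", "", text)
    return text.lower().strip()

def tokenize_tweet(text):
    # Tokenize and drop stop words and non-alphabetic tokens
    return [w for w in word_tokenize(text) if w.isalpha() and w not in STOP_WORDS]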

Getting Coordinates of Places: There is a column called ‘place’ that tells the location from which the tweet was made. This gave me an excellent opportunity to plot the tweets on a map; the only problem was that I needed the exact latitude and longitude of each place to pass to a Mapbox library. After some searching, I found a website that provides the coordinates when you request its endpoint. Here is the simple function for that:

Function to get coordinates of any address
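
The original website isn’t named here, so as an illustration this sketch queries OpenStreetMap’s Nominatim geocoding API instead:

import requests

def get_coordinates(address):
    # Return (lat, lon) for an address, or None if nothing was found
    resp = requests.get(
        "https://nominatim.openstreetmap.org/search",
        params={"q": address, "format": "json", "limit": 1},
        headers={"User-Agent": "covid-tweets-demo"},  # Nominatim requires one
    )
    results = resp.json()
    if results:
        return float(results[0]["lat"]), float(results[0]["lon"])
    return None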

Making Interactive Visualizations

In the final notebook, I made a lot of visualizations, and describing each of them here would take too long, so I will explain a few that differ from one another. All the charts and figures were made using Plotly.

Charts: It is quite easy to generate a static bar chart that represents the counts of qualitative data, and Plotly provides a wide range of customizations for its charts. To make things more interactive, I used ipywidgets to take user input and control the flow of the chart. For instance, to display the top places from which tweets were made as a bar chart, the following code was used:
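
What follows is a sketch of that function rather than the exact original; it assumes the DataFrame df has a ‘place’ column:

import plotly.express as px
from ipywidgets import interact

def displayPlaces(top=10):
    # Count tweets per place and keep only the top N
    counts = df["place"].value_counts().head(top)
    fig = px.bar(x=counts.index, y=counts.values,
                 labels={"x": "Place", "y": "Number of tweets"})
    fig.show()

# interact turns the integer parameter into a slider automatically
interact(displayPlaces, top=(5, 30))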

The displayPlaces function takes the parameter ‘top’, which controls how many of the top places are taken into consideration when plotting the bar chart. The interact function is the ipywidget that automatically creates the user interface controls. See the documentation for more functions. The GIF below gives a better idea:

GIF showing Bar chart interactivity

Map Plotting: I personally like this visualization, as it challenged me to explore Plotly further. Initially, I did this plotting with Folium and the results were satisfactory, but with Plotly one can add more layers to the base map and make it far more striking. Getting the coordinates was one big task, already solved during preprocessing. Here is what it takes to get a geographical map plot:
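
A sketch, assuming preprocessing stored the coordinates in ‘lat’ and ‘lon’ columns:

import plotly.express as px

fig = px.scatter_mapbox(
    df, lat="lat", lon="lon", hover_name="place",
    zoom=3, height=600,
)
# "carto-darkmatter" is a free base-map style that needs no Mapbox token
fig.update_layout(mapbox_style="carto-darkmatter")
fig.show()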

px.scatter_mapbox takes the coordinates of the locations, plus optional color scale, zoom, and height parameters. Setting mapbox_style to "carto-darkmatter" in update_layout makes the map dark. See the resultant map:

GIF representing Map Plot

You can refer to the Plotly documentation for more types of plots.

Word Clouds: These are images filled with words that together describe the overall trend in a piece of text. They look cool at first glance, but most of them are plain rectangular images. To make the result more appealing, we can place the words on an image! That’s right: you can use an image as a mask onto which the words are imposed, and the implementation is easy. Load the image as a NumPy array and pass it to the word cloud function as the mask parameter. The word cloud functions come from the wordcloud library.
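
A minimal sketch with the wordcloud library (the mask filename and the cleaned_tweets variable are placeholders):

import numpy as np
from PIL import Image
from wordcloud import WordCloud

mask = np.array(Image.open("mask.png"))  # silhouette to impose words on
wc = WordCloud(background_color="black", mask=mask, contour_width=1)
wc.generate(" ".join(cleaned_tweets))  # cleaned_tweets: iterable of cleaned strings
wc.to_file("wordcloud.png")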

Example Word Cloud

I raised the stakes and made it user-driven: the user can choose the mask (available options: Modiji, Trump, and the India map) and the corpus (tweet data or user descriptions) from which the words are picked.

Deploying the Jupyter Notebook

If you have read my article "Analyzing WhatsApp Group Chats & Building the Web App", you will know that I had to create the whole frontend for that project, which could have been avoided had I known about Voila.

Voila is a Python package that turns our Jupyter notebooks into standalone web applications. This means the notebook is presented to the user as a website, with the notebook’s cells executed beforehand. It also gives you the option of choosing a theme, and I chose dark. Here is the procedure to test Voila locally:

pip install voila
pip install voila-material (optional)

After installation, simply run this command in any terminal (assuming you have navigated to the folder where the notebook is saved; otherwise, include the full path along with the notebook name):

voila nameofnotebook.ipynb 

And boom! You have a website running on your localhost that is the notebook itself! You can add HTML code as markdown in your notebook to make it more creative. To change the theme of the website, simply pass --theme=dark when running the voila command:

voila --theme=dark nameofnotebook.ipynb 

To deploy this on the cloud so that others can view your project, there are many options:

Heroku: Simply create a new app in Heroku and connect the GitHub repo where your notebook resides. In the Procfile, you can use this command to avoid any errors:

web: voila --port=$PORT --no-browser --template=material --enable_nbextensions=True nameofnotebook.ipynb

Binder: An easier way is to use the service called Binder, which creates a Docker image of your repo; every time you want to run your notebook, you can simply launch it on Binder.

Screenshot from mybinder.org

One important thing to mention here: in the "Path to a notebook file" field, select URL from the dropdown and paste the Voila endpoint, like so:

voila/render/path-to-notebook.ipynb

"Do note that you need to have a requirements.txt file in your Git repository if your notebook is using any external library, which can be installed via pip. Absence of this will give you errors such as 404: Notebook not found"

Final Results

Notebook deployment is made easy thanks to Voila support. I have deployed my final analysis notebook on both platforms, and one thing I want to mention here is that, since my notebook is resource-heavy, it usually overflows the Heroku memory limit (512MB); when usage reaches 200% of that limit, the notebook is not rendered properly. I didn’t face many issues with Binder, though it takes a bit longer to load the image. On either platform, here is what my final notebook looks like:

GIF by Author

The whole preprocessing and dashboard notebook files, along with the deployed notebook links, are available in my GitHub repository:

kaustubhgupta/Covid-Tweets-Analysis-Dashboard

Conclusion

Tools like Voila take a lot of the workload off your shoulders. This kind of deployment is useful for sharing results with the outside world in a more interactive and accessible way.

That’s all for this article; I hope you learned something new. Bye!

My Linkedin:

Kaustubh Gupta – Machine Learning Writer – upGrad | LinkedIn

Other Popular Articles:

ColabCode: Deploying Machine Learning Models From Google Colab

Run Python Code on Websites: Exploring Brython

Rebuilding My 7 Python Projects

Build Dashboards in Less Than 10 Lines of Code!

