
Data Storytelling with Animated Word Clouds

Animated word clouds turn classic word clouds into a dynamic visualization. Learn more about telling data stories in Python.

Introduction

An animated word cloud displays absolute frequencies of n-grams (contiguous sequences of items from a text sample) over time as a sequence of images in a video file. It gives greater weight to words that appear more frequently in a source text: the bigger and bolder an n-gram is displayed, the more often it appears. The visualization builds on the intuitive logic of classic word clouds and adds a time perspective.

Many text datasets are collected these days as text observations spanning multiple periods, which makes visualizing changes in the data over time a particular challenge. Instead of preparing summary tables or graphs for many separate periods, let’s prepare an MP4 video that tells the story, attracts the audience, and adds a "wow" effect to the presentation.

This article describes how to generate animated word clouds from text data in Python. Here are some unique features of the AnimatedWordCloud library:

  • Provides n-gram frequency visualization for all Latin-alphabet languages
  • Cleans the text dataset of punctuation, numbers, and stopwords included in the NLTK stopword lists
  • Generates yearly or monthly n-gram frequencies

How to use it

To use the library, follow these steps:

1. Installation

Create a new Python 3.8 virtual environment for the project to avoid any dependency conflicts. AnimatedWordCloud requires Python 3.8 because of its visualization dependencies. To install it with pip, use:

pip install AnimatedWordCloud

The library was tested with PyCharm Community Edition. It’s recommended to use this IDE and run the code in a .py file rather than a Jupyter notebook.

2. Generate frames

We will focus on European Central Bank (ECB) communication and explore the concepts the Bank’s Board members discussed over 1997–2023. The dataset comes from the ECB website and is released under a flexible license.

The data contains 2,846 rows and includes NaN values, which AnimatedWordCloud can process without extra preprocessing.

Let’s import the data.

import pandas as pd

data = pd.read_csv('dataset.csv')
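
Before generating the frames, it’s worth a quick check of the two columns the library will use below, 'date' and 'contents' (these column names come from the ECB dataset used here; adjust them for your own data):

print(data[['date', 'contents']].head())  # Preview the date and text columns
print(len(data))                          # 2846 rows in this dataset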

Then import the animated_word_cloud function.

EDIT Dec 2023: AnimatedWordCloud is continuously updated with new parameters. Check PyPI for the current release.

It reads dates and datetimes in US (MM/DD/YYYY) or European (DD/MM/YYYY) format, automatically cleans the input text of punctuation and numbers, and can also remove the standard stopword lists for languages included in the NLTK stopword corpus.

from AnimatedWordCloud import animated_word_cloud

animated_word_cloud(text=data['contents'],       # Read text column
                    time=data['date'],           # Read date column
                    date_format='us',            # Read dates in US date format
                    ngram=1,                     # Show individual word frequencies
                    freq='Y',                    # Calculate yearly frequencies
                    stopwords=['english', 'french',   # Clean from English, French,
                               'german', 'spanish'])  # German and Spanish stop words

The code generates 90 PNG frames per period and creates a postprocessing/frames folder in the working directory to store the images.

3. Create a video from images

The last step is to make a video file from individual frames. This step will be automated in future releases, but for now:

Download the ffmpeg folder and the frames2video.bat file from here and place them into the postprocessing folder. Then run frames2video.bat to generate wordSwarmOut.mp4, which is the desired output.
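
If you cannot run the .bat file (for example, on macOS or Linux), a few lines of Python can stitch the frames instead. This is only a minimal sketch, assuming OpenCV (opencv-python) is installed, the frames are zero-padded PNG files in postprocessing/frames, and a frame rate of 30 fps is acceptable; the settings used by frames2video.bat may differ.

import glob
import cv2  # pip install opencv-python

frames = sorted(glob.glob('postprocessing/frames/*.png'))  # zero-padded names sort correctly
height, width, _ = cv2.imread(frames[0]).shape

writer = cv2.VideoWriter('wordSwarmOut.mp4',
                         cv2.VideoWriter_fourcc(*'mp4v'),  # MP4 codec
                         30,                               # assumed frame rate
                         (width, height))

for path in frames:
    writer.write(cv2.imread(path))  # append each frame to the video

writer.release()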

Let’s tell a story of the Eurozone through the lens of central bankers:

  • in 1999 – 2002, the key topic was the euro introduction ("accession", "euro")
  • in 2003 – 2006, the bankers mainly discussed standard monetary policy implementation issues ("monetary", "financial", "market", "policy")
  • with the approaching financial crisis, the key topics in 2008 – 2012 were "liquidity", "crisis", and "banks"
  • important periods came in 2021 with the economic impacts of COVID-19 ("pandemic") and the war in Ukraine, when inflation became the major topic.

These developments are obvious to anyone interested in the history of the Euro, but presenting them is a challenging task. We can, for example, (1) run an n-gram frequency analysis of keywords and produce many frequency tables, or (2) display a heatmap (or a matrix graph) with periods on the x-axis, words on the y-axis, and word frequencies as the matrix values (a minimal sketch follows below). Another option (3) is to produce a separate word cloud for each period. None of these is perfect for larger datasets, and the animated word cloud gives you another option for delivering the message.
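
For illustration, here is a minimal sketch of option (2): a year-by-word frequency heatmap built with scikit-learn and seaborn. It assumes the same 'date' and 'contents' columns as above and restricts the vocabulary to the 15 most frequent words; it is not part of AnimatedWordCloud.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer

data = pd.read_csv('dataset.csv')
years = pd.to_datetime(data['date']).dt.year  # extract the year of each speech

# Count the 15 most frequent words, excluding English stopwords
vectorizer = CountVectorizer(stop_words='english', max_features=15)
counts = vectorizer.fit_transform(data['contents'].fillna(''))

# Aggregate word counts by year
freq = pd.DataFrame(counts.toarray(),
                    columns=vectorizer.get_feature_names_out(),
                    index=years).groupby(level=0).sum()

# Periods on the x-axis, words on the y-axis, frequencies as cell values
sns.heatmap(freq.T, cmap='Blues')
plt.xlabel('year')
plt.ylabel('word')
plt.show()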

Practical applications

The library’s primary use is in presentations and teaching. Text mining now blends with other disciplines such as economics, political science, and business, and teachers, analysts, and students can use a different, more appealing way to present the facts.

  • Are you a historian interested in the history of science? Then try downloading text datasets such as article headlines or journal abstracts from platforms like Constellate, preparing a video file, and telling your students about the history of AI in published research since its very beginnings. For inspiration, see the research trends analysis project with economics data spanning 1900–2018 that we described in this article.
  • Do you want to show your marketing team what customers think about your product? Then use product reviews from external platforms (e.g., Amazon) and explain which words customers mention. Is it "fast" and "delivery", along with "good" and "great"? Or are the frequent words "poor", "bad", and "quality"? Tell a story to deliver the message and see how it changes over time.

Our earlier TDS article with Jarko Fidrmuc and David Štrba outlined how word clouds are useful for summarization and exploratory text data analysis. In dynamic form, the structure of text datasets collected as time series can be presented in an even more accessible, easily understandable way.

Other use cases, such as modeling COVID-19-related discussions or analyzing US presidential debates with posts from X (Twitter), might also be worth exploring in a bachelor’s or seminar thesis.

On a technical note

In the WordSwarm project, Michael Kane developed the core framework for animating word frequencies that the library builds on. AnimatedWordCloud, which I created and released under an Apache-2.0 license, makes the code work efficiently on various text datasets in Latin-alphabet languages.

It uses one of my earlier projects, the Arabica library, to handle the processing (text cleaning and word frequency aggregation). It relies on rather archaic visualization requirements that suit this project very well: PyBox2D is used for the physics and collision detection of words in the swarm, while Pyglet, a cross-platform windowing and multimedia library, and PyGame are used to create the animations.

It shows absolute word frequencies but scales the data so that word clouds display well on datasets of different sizes: datasets with very large frequencies are scaled by a constant so that the word clouds can still be displayed in a video. It handles missing values and also fixes mojibake errors with ftfy.
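
As a rough illustration of that scaling idea (the constant and the exact formula inside AnimatedWordCloud may differ; scale_frequencies below is a hypothetical helper, not part of the library):

# Hypothetical sketch: cap displayed frequencies so very frequent words still fit
def scale_frequencies(freqs, max_display=100):
    peak = max(freqs.values())
    factor = 1.0 if peak <= max_display else max_display / peak
    return {word: count * factor for word, count in freqs.items()}

print(scale_frequencies({'inflation': 1200, 'euro': 300}))
# {'inflation': 100.0, 'euro': 25.0}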

On the other hand, AnimatedWordCloud might have difficulties running in a Jupyter notebook and in IDEs other than PyCharm. I will address these shortcomings in the next releases. Future releases will also provide bigram frequencies, giving a more detailed look at the narratives and topics people discussed in the texts, and I will extend the library’s documentation.

PS: Let me know how it worked on your projects! 🙂

If you enjoy my work, you can invite me for coffee and support my writing. You can also subscribe to my email list to get notified about my new articles. Thanks!

