The Aesthetics of Wordclouds

An elegant way to represent word frequency in texts — and how to make yours without having to write a single line of code

Luca Giovannini
Towards Data Science

--

Wordclouds are a nice way to summarise texts and show how prominently individual words feature within them; they can be used for practical (e.g. marketing-oriented), educational and artistic purposes. In this piece I’ll walk you through a handy utility to generate beautiful wordclouds without any programming knowledge. Let’s go!

As a matter of fact, drawing wordclouds is a pretty straightforward task, and a quick web search will return tons of web apps which will do it for you; these services, however, often leave the user little room for customisation. Conversely, the WordCloud for Python library by Andreas Mueller et al. offers several options to shape a truly unique wordcloud, with parameters such as the number of words, their colour or the cloud’s shape, but it obviously requires some basic programming skills.
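For the curious, here is a minimal sketch of what a basic call to the library might look like, assuming the `wordcloud` package is installed (`pip install wordcloud`). The helper name `build_cloud` and the default values are illustrative assumptions, not part of the library; `max_words`, `background_color` and `colormap` are parameters of its `WordCloud` class.

```python
# Illustrative defaults (assumed values); each key is a WordCloud parameter.
DEFAULT_SETTINGS = {
    "max_words": 150,             # upper limit on the number of words drawn
    "background_color": "white",  # a colour name for the canvas
    "colormap": "viridis",        # a matplotlib colormap used for the words
}


def build_cloud(text, **overrides):
    """Return a WordCloud generated from `text` (hypothetical helper)."""
    # Imported lazily so the settings can be inspected without the package.
    from wordcloud import WordCloud  # third-party: pip install wordcloud

    settings = {**DEFAULT_SETTINGS, **overrides}
    return WordCloud(**settings).generate(text)
```

With the package installed, `build_cloud(text, max_words=50).to_file("cloud.png")` would save the result as an image (the file name is just an example).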

The strength of this library becomes apparent when employed for one of my favourite uses for wordclouds, i.e. the visualisation of literary works. Look at this ‘semantic portrait’ of Robert Louis Stevenson’s Treasure Island (1883):

Credits: graphic elaboration on original image by Open-Clipart-Vectors/Pixabay.

Or, if you’re more into pop culture, enjoy this triquetra-shaped wordcloud made with the episode titles of Netflix’s cult series Dark (2017–2020):

Credits: graphic elaboration on original image by Madboy74/Wikimedia Commons.

At this point, you may want to experiment with wordclouds by yourself, but if you’re reading this you probably don’t know how to run code ...

… and that’s why, with you in mind, I set up a little utility which makes Mueller’s script ‘interactive’, allowing you to easily build custom wordclouds. The only requirement to use this tool is a Google account, since the script runs in a Google Colab notebook from my GitHub repository. I reckon the optimal solution would have been to build a small web app with Django or Flask and host it on a cloud platform like Heroku, but I’m afraid this is still beyond my skills.

Anyway, the process is quite straightforward: you’ll be asked to upload a text and an image (which will serve as the cloud’s shape) from a URL or from your local PC and then define some parameters. In a matter of seconds, you’ll get an automatically generated wordcloud, and if you don’t like the results you can just run the program again.

Before you get started, let me give you some suggestions from my experience with the WordCloud library:

  • Text selection: be sure to choose plain text formats (.txt/.html) to avoid messy results; if you need inspiration, take a look at websites like Project Gutenberg. You can also copy some text from a webpage, paste it into a text editor on your PC and do some manual cleaning before uploading it: sometimes just pasting a link gives unsatisfying results (e.g. the extracted words may also, or only, include HTML tags).
  • Image selection: for optimal results choose large, black/white images with high contrast, like vectors; if they have a transparent background, you can remove it here. If you plan to make your results public, be sure to use images with a licence allowing for commercial/non-commercial reuse: you can use the proper search function in Google Images or head to websites like Pixabay, Pexels or Unsplash.
  • Stopwords: Wikipedia defines them as “words which are filtered out before or after the processing of natural language data” because of their high frequency and low semantic potential, such as articles, pronouns, prepositions or auxiliary verbs. While the utility filters out stopwords for most languages, it makes sense to run the script a first time and then re-run it adding some user-specific stopwords: in the case of dramatic texts, for example, one may remove the character names in order to improve the cloud visualisation. Compare these two Hamlet wordclouds without and with character names as stopwords:
Credits: graphic elaboration on original image by Clker-Free-Vector-Images/Pixabay.
  • Parameters: although Mueller’s library allows you to customise dozens of features, my utility focuses on the three main ones: the cloud’s background colour, the words’ colormap (what is a colormap again?) and the image contour (if the image shape is hard to recognise, you may want to outline it). If you want more personalisation, check the source code of the WordCloudBuilder function or refer to the original library.
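For readers who want to peek under the hood, the suggestions above map onto a handful of library options. The sketch below is an illustration under stated assumptions, not the code of the utility itself: the helper names `count_words` and `render_cloud`, the tiny stopword list and the sample sentence are hypothetical, while `stopwords`, `mask`, `background_color`, `colormap`, `contour_width` and `contour_color` are parameters of the library's `WordCloud` class. The first function uses only the standard library to show what adding character names as stopwords does to the word counts.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; the library ships a fuller English one.
BASIC_STOPWORDS = {"the", "a", "to", "of", "and", "is", "my", "or", "not", "that"}


def count_words(text, extra_stopwords=()):
    """Count lowercase words, skipping basic and user-supplied stopwords."""
    ignored = BASIC_STOPWORDS | {word.lower() for word in extra_stopwords}
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in ignored)


def render_cloud(text, mask_path, out_path, extra_stopwords=()):
    """Draw a shaped wordcloud (hypothetical helper; needs third-party packages)."""
    import numpy as np
    from PIL import Image
    from wordcloud import STOPWORDS, WordCloud

    mask = np.array(Image.open(mask_path))  # pure-white areas are left empty
    cloud = WordCloud(
        mask=mask,
        stopwords=STOPWORDS | set(extra_stopwords),
        background_color="white",
        colormap="plasma",
        contour_width=2,        # outline the shape when it is hard to recognise
        contour_color="black",
    )
    cloud.generate(text).to_file(out_path)


sample = "Hamlet: To be, or not to be. Horatio: My lord Hamlet, the king is dead."
print(count_words(sample).most_common(3))
print(count_words(sample, extra_stopwords=["Hamlet", "Horatio"]).most_common(3))
```

In the second count the character names no longer appear, so the remaining words get more room in the cloud; passing the same names to `render_cloud` would have the equivalent effect on the drawn image.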

Ready to go? Here’s the link to the EasyWordCloud utility.

I’ll leave you with some more examples of the visualisation power of the WordCloud library. The following pictures are generated from seven literary works which, according to some statistical research from my master’s dissertation, are often considered the seven most ‘canonical’ books of literature (if you are interested in this topic, let me know and I may write something about it).

Credits: graphic elaborations on original images by Clker-Free-Vector-Images, Open-Clipart-Vectors, Gordon Johnson, Annalise Batista/Pixabay and CactusCowboy/OpenClipart.

In case you haven’t recognised the books from the silhouettes, they are (in order) Cervantes’ Don Quixote, Joyce’s Ulysses, Tolstoy’s War and Peace, Proust’s In Search of Lost Time, Nabokov’s Lolita, Dostoevsky’s The Brothers Karamazov and Dante’s Divine Comedy. No additional stopwords have been added, except a couple for Cervantes and Dante, and since I can’t read Cyrillic I can’t vouch for the effectiveness of the Russian stopwords filter.

That’s all, folks. As always, comments and suggestions are very welcome. See you next post!

…and if you are interested in other graphic elaborations of data extracted from cultural products, check out my previous piece on Data Science meeting “Murder, She Wrote” (including a wordcloud too!)

PhD student in Comparative Literature & Computational Criticism, University of Potsdam (Germany)