
How to Create Beautiful Word Clouds in Python


A guide to creating stunning visualizations for your next NLP project

Word Cloud by Author | Trevi Fountain Image by James Lee via Unsplash

Natural Language Processing, or NLP, is a very popular subfield in Data Science at the moment because it allows computers to process and analyze human language. Siri and Alexa, spam filters, chatbots, auto-complete, and translate apps are all examples of everyday technology that use NLP.

As a Data Scientist, working with text data is a bit trickier than other types of data. Why? Because words are not numbers! This makes the Exploratory Data Analysis and the data cleaning and preprocessing steps a bit different in the Data Science workflow. Text data generally requires much more cleaning (removing stop words and punctuation, lowercasing, stemming or lemmatizing, etc). It also requires tokenizing or vectorizing the text (deriving meaningful numbers from words). As for exploring and analyzing the data, there are not as many ways to visualize text. However, text does open up one new kind of visualization technique that you have probably seen before – word clouds.

During my latest Data Science project, I got to delve into the world of NLP. Along the way, I learned all about creating word clouds in Python, and I wanted to write this piece to share my knowledge for anyone looking to create some beautiful visualizations for text data.


Creating a Basic Word Cloud

To create a basic word cloud (or any word cloud in Python), you will need the following libraries:
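Here is a minimal set of imports covering everything used in this article (assuming the wordcloud package is installed, e.g. via pip install wordcloud):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image                      # for loading mask images later
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
```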

Method 1: generate_from_text

There are two main ways to build the word cloud. The first and simplest method is to create the word cloud from a corpus of text, such as an article, book, etc. This corpus should be in the form of a string.

In the example below, I have taken a list of attractions from TripAdvisor in the city of Rome. I will group them all into one body of text (a corpus), and then create a basic word cloud.

Example attractions in Rome from TripAdvisor | Image by Author

Before I grouped the attraction texts, I did a little cleaning by lowercasing, removing basic stop words (e.g. "a", "the", "is"), and lemmatizing. I did the same for a list of 12 cities (specifically, 12 of TripAdvisor's Top World Destinations for 2021). Rome's grouped corpus is highlighted in the DataFrame below.

Grouped, cleaned and lemmatized attractions per city | Image by Author

As I mentioned before, we will create the word cloud from the text corpus for Rome. To do that, we will isolate that one corpus by taking a slice of this grouped DataFrame:
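As a sketch, the slice might look like this (df_grouped, the 'rome' index label, and the attractions column are placeholder names for whatever your grouped data is called):

```python
# Pull out the single cleaned corpus (a string) for Rome from the grouped DataFrame
rome_corpus = df_grouped.loc['rome', 'attractions']
```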

Rome text corpus | Image by Author

From here, we can create the basic word cloud. We will start by instantiating the WordCloud object from the wordcloud library and use the generate_from_text method to feed it our text corpus. Finally, we will use plt.imshow() to display the WordCloud object, with plt.axis('off') so that only the word cloud is shown, without axes or tick values.
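A minimal version of that code, assuming rome_corpus is the string we isolated above:

```python
# Build the word cloud straight from the raw text corpus
wordcloud = WordCloud().generate_from_text(rome_corpus)

# Display it without axes or tick values
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```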

The resulting word cloud is below. We can see that, by default, the word cloud includes bi-grams (pairs of frequently co-occurring words) alongside single words. If needed, we can turn this off by setting collocations=False when we instantiate the WordCloud object.

Basic Rome Word Cloud (from text) | Image by Author

Method 2: generate_from_frequencies

The second method is to create a word cloud from a document term matrix. This is a commonly used matrix in NLP: it has a separate column for each word in the corpus vocabulary and a row for each document, with each cell holding that word's frequency in that document. For example, below I have depicted the first columns of a document term matrix (dtm) for the 12 cities I showed before. Notice that the dimensions of the dtm are 12 x 8676, corresponding to the 12 cities and the 8,676 words in the entire corpus vocabulary.

Document Term Matrix using Count Vectorization AKA dtm | Image by Author

If you have a document term matrix, you can easily feed this data into the word cloud object using the .generate_from_frequencies() method. First, we will need to isolate the data we want to use for Rome. We will need to transpose the matrix so it is in the correct format for the word cloud. We also want the most frequent words, so we will sort the values in descending order. You can see an example of the data we want to use in the photo below.

Image by Author
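A sketch of that isolation step, assuming dtm is a pandas DataFrame with the 12 city names as its index:

```python
# Transpose so words become the index, pick Rome's column,
# and sort so the most frequent words come first
rome_freqs = dtm.transpose()['rome'].sort_values(ascending=False)
```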

To actually create the word cloud, we will use pretty much the same code as above, but use the generate_from_frequencies method instead.
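For example, using the rome_freqs Series from above:

```python
# Feed the word -> count mapping directly to the word cloud
wordcloud = WordCloud().generate_from_frequencies(rome_freqs.to_dict())

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```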

The resulting word cloud is below. Notice that because we isolated single words (not bi-grams) in the data, we did not need to turn off the collocations parameter on the WordCloud object; the vectorized words were already single words rather than bi-grams.

Basic Rome Word Cloud (from frequencies) | Image by Author

Finally, now that we understand how these word clouds are made, we can manipulate some of the parameters to create a nicer version of our basic word cloud. Let's go back to our first example with the rome_corpus variable (generating a word cloud from text). Notice that words like 'private tour' and 'skiptheline' come up as some of the most frequent terms. We can pass the word cloud a custom stop word list to get rid of these. I will also customize the dimensions of the word cloud, make the entire figure bigger with the figsize parameter, change the colormap, and add a title. See all these changes in the function below.
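A sketch of such a function is below; the extra stop words and the 'viridis' colormap are illustrative choices, not the only sensible ones:

```python
def make_word_cloud(corpus, title):
    # Extend the default stop word list with corpus-specific noise words
    stopwords = set(STOPWORDS)
    stopwords.update(['private', 'tour', 'skiptheline'])

    wordcloud = WordCloud(width=800, height=400,
                          stopwords=stopwords,
                          colormap='viridis').generate_from_text(corpus)

    # Enlarge the whole figure rather than the WordCloud object itself
    plt.figure(figsize=(10, 8))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(title, fontsize=20)
    plt.show()

make_word_cloud(rome_corpus, 'Rome Attractions')
```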

Remember that you can always change the function above to use the generate_from_frequencies method instead! The resulting word cloud will look like this:

Final Basic Word Cloud | Image by Author

Wow! Looks much better. It’s still very basic though. Let’s turn it up a notch by changing the word cloud shapes with masks.


Change the Word Cloud Shape

One way to make your word cloud visually stunning is to add a mask. A mask is an image you can use to change the shape of your word cloud. We can apply it very easily with the mask parameter when we instantiate the WordCloud object.

Once you have chosen the perfect image for your mask and saved it (more on that in a minute), we can use Image.open from the PIL library and numpy to get it into the correct format for the WordCloud object.
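For example (the filename here is just a placeholder for whatever mask image you saved):

```python
# Load the mask image and convert it to a numpy array for the WordCloud object
mask = np.array(Image.open('italy_mask.png'))
```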

Now that our mask is in the correct format, we can write a similar function to generate the nicer word cloud. The function below is very similar to our final basic word cloud function with a few exceptions:

  • Using the mask parameter overrides the width and height parameters, because the word cloud takes on the size of the mask image itself
  • I am using a scale of 3, which makes the computation faster for larger word cloud images
  • I changed the colormap to 'RdYlGn' because it felt more Italian (think Italian flag, and homemade pasta!)
  • The background_color is now 'white' for a nicer look against the colormap
  • The collocations (bi-grams) will show up this time since we are using the generate_from_text method with the collocations parameter set to True
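
A sketch of that function, with the same assumed stop word list as before:

```python
def make_masked_word_cloud(corpus, mask, title):
    stopwords = set(STOPWORDS)
    stopwords.update(['private', 'tour', 'skiptheline'])

    # The mask sets the shape and size, so width/height are no longer needed
    wordcloud = WordCloud(mask=mask,
                          scale=3,
                          stopwords=stopwords,
                          colormap='RdYlGn',
                          background_color='white',
                          collocations=True).generate_from_text(corpus)

    plt.figure(figsize=(10, 8))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(title, fontsize=20)
    plt.show()

make_masked_word_cloud(rome_corpus, mask, 'Rome Attractions')
```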

Our final word cloud is below… isn't it bellissima?!

Final Rome Word Cloud | Image by Author

Choosing the Perfect Mask

The trickiest part of using a mask is that the image you choose must meet the following requirements:

  • a white background (it must be #ffffff, not off-white or transparent)
  • a definitive shape that is NOT white

Here is an example of a mask that did not work so well. The issue here is that its shape is not definitive enough for you to tell what it is.

Colosseum Image by Panda Vector via Shutterstock | Word Cloud by Author

Therefore you want to choose something that has a very recognizable shape like the example below.

Italy Map Image by David Petrik via Shutterstock | Word Cloud by Author

There may even be times where you find the perfect photo for your mask, but it has a few marks in the background that are throwing off the word cloud shape. In order to fix this, you can always open the photo in Photoshop and edit the marks out. If you don’t have Photoshop, a method you can use to hack this (from my Marketing days) is Canva. You can upload the mask image into Canva, place white squares on top of your marks, and then download the edited image. Voila! Your mask will be ready to go with that #ffffff background.


Additional Finishing Touches

The last step to making your word clouds beautiful is to use your own keen eye for those little aesthetic details. Changing the colors, using borders (AKA contour), and manipulating the font and plot size are just a few ways to customize your word clouds. Here are some ways to customize these parameters:

Change the Colormap

Changing the colors of the words is as easy as changing the colormap parameter in the WordCloud object instantiation. You can see all the colormap options in the matplotlib documentation.

There is also a way to create a colormap of your own based on the colors in your mask image. To do this, use the ImageColorGenerator class from the wordcloud library and pass the generated colors to the color_func parameter when you instantiate the WordCloud object. The color_func parameter will override the colormap parameter.

Here I used the colosseum again since the Italy map was all one color. Using this custom colormap is definitely a nice touch, even though we still can’t really tell its shape!
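A sketch of that recoloring, with a placeholder filename for the Colosseum mask image:

```python
# Build a color function from the colors of the mask image itself
colosseum_mask = np.array(Image.open('colosseum_mask.png'))
image_colors = ImageColorGenerator(colosseum_mask)

wordcloud = WordCloud(mask=colosseum_mask,
                      scale=3,
                      background_color='white',
                      color_func=image_colors).generate_from_text(rome_corpus)

plt.figure(figsize=(10, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```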

Image by Author

Use Contour

Not like the Kardashians… but kinda. Contour gives your word cloud mask an outline. You can customize it when you instantiate the WordCloud object with the contour_color and contour_width parameters. Contour colors can be indicated with strings (use a simple color word, or a color code).

Below is an example of our same function to generate the word cloud, but with a thin black contour (width of 1).
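Only the WordCloud instantiation changes; a sketch:

```python
# Same setup as before, with two extra parameters for the outline
wordcloud = WordCloud(mask=mask,
                      scale=3,
                      colormap='RdYlGn',
                      background_color='white',
                      contour_color='black',
                      contour_width=1,
                      collocations=True).generate_from_text(rome_corpus)
```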

As you can see, the resulting word cloud is a more distinguishable shape due to the contour.

Image by Author

Looks pretty cool! The contour here is a bit squiggly because of the detailed shape of Italy. The contour line would be smoother with a more basic mask shape.

We can also try a colored contour to make this a bit nicer. Since there are dark red words in the word cloud, I found a nice wine red color code, #5d0f24, to replace the black contour, and increased the width to 3.
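In the sketch above, that is just two parameter changes:

```python
# Swap the black outline for wine red and thicken it
wordcloud = WordCloud(mask=mask,
                      scale=3,
                      colormap='RdYlGn',
                      background_color='white',
                      contour_color='#5d0f24',
                      contour_width=3,
                      collocations=True).generate_from_text(rome_corpus)
```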

The resulting word cloud is below.

Image by Author

Even the colosseum mask from before looks better here with a little contour! Contour in this case helps us to see the actual shape of the mask.

Image by Author

Whether to add a contour, and what color and width to use, is up to you and the level of aesthetic detail you want in your word clouds.

A Note on Word Cloud Size

One thing you should always do is to change the size of the word cloud with plt.figure(figsize=(10,8)). You can change it to whatever size you need, but I have found that changing the size of the overall plot works better than trying to resize the WordCloud object itself. I have included this in all the word cloud functions, and you can notice the difference in the word cloud sizes when we jumped from basic word clouds to the word clouds created with functions.

More Fun Customizations!

The possibilities are endless when it comes to customizing your word clouds. Here are some examples, combined in the sketch after this list:

  • Manipulate the font of your word cloud text with the font_path parameter
  • Change the min and max font sizes (min_font_size, max_font_size parameters)
  • Change the background_color of the word cloud
  • Reduce the max_words parameter to show only the top 20 or 50 words, which sometimes makes the word cloud easier to read
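
A sketch combining several of these options (the font path and the specific values are placeholders, not recommendations):

```python
wordcloud = WordCloud(mask=mask,
                      scale=3,
                      background_color='white',
                      font_path='fonts/MyFavoriteFont.ttf',  # placeholder path to a .ttf file
                      min_font_size=8,
                      max_font_size=100,
                      max_words=50).generate_from_text(rome_corpus)

plt.figure(figsize=(10, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```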

All of these small details can impact your word cloud visualizations. Since each mask and each text corpus will look different, it’s ultimately up to you to play around with them to truly make your word clouds look beautiful!

Conclusion

Thanks for following along with me on this little journey to Roma! I hope this story taught you a thing or two about word clouds and creating beautiful NLP visualizations. If you have any more tips, I’d love to hear about them in the comments below!


Note: This was only a small piece of my Data Science capstone project, where I used NLP to classify text for different cities. You can see the full project repo on my GitHub account.

