An Alternative Approach to Visualizing Feature Relationships in Large Datasets

How to make those crowded scatterplots more informative

Published in

Towards Data Science

5 min read · Sep 28, 2023

The easiest way to understand relationships between data features is by visualizing them. In the case of numeric features, this usually means producing a scatterplot.

This is fine if the number of points is small, but for large datasets, the problem of overlapping observations appears. This can be partially mitigated for medium-sized datasets by making the points semi-transparent, but for very large datasets, even this doesn’t help.

What to do then? I will show you an alternative approach using the publicly available Spotify dataset from Kaggle.

The dataset contains audio features of 114,000 Spotify tracks, such as danceability, tempo, duration, and speechiness. As an example for this post, I will examine the relationship between danceability and all other features.

Let’s first import the dataset and tidy it up a bit.

#load the required packages
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggthemes)
library(readr)
library(stringr)

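#load and tidy the data
spotify <- readr::read_csv('spotify_songs.csv') %>% 
  select(-1) %>% #drop the first column
  mutate(duration_min = duration_ms/60000) %>% #convert duration from milliseconds to minutes
  mutate(across(c(2:4, 20), toupper)) %>% #uppercase the text columns
  mutate(track_genre = as.factor(track_genre)) %>% #convert to factor last, since toupper() returns character
  relocate(duration_min, .before = duration_ms) %>%
  select(-duration_ms)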

❔ The issue

As I mentioned previously, the simplest way to visualize two-variable relationships is by drawing scatterplots with each point representing a single song. The first four columns contain track id information, so I left them out. I also renamed the features so that the first letter is uppercase and then reshaped the data to prepare it for plotting.

spotify %>%
  select(5:19) %>%
  mutate(across(everything(), as.numeric)) %>%
  rename_with(str_to_title) %>% #capitalize first letters of feature names
  rename("Duration (min)" = Duration_min,
         "Loudness (dB)" = Loudness,
         "Time Signature" = Time_signature) %>%
  pivot_longer(-Danceability, names_to = "parameter", values_to = "value") %>%
  ggplot(aes(value, Danceability)) +
  geom_point(col = "#00BFC4", alpha = 0.5) + #reduce point opacity with the alpha argument
  facet_wrap(~ parameter, scales = "free_x") +
  labs(x = "", y = "Danceability") +
  theme_few() +
  theme(text = element_text(size = 20))

Even though a decreased point opacity was used (alpha = 0.5 as opposed to the default value of 1), the overlap is still too high. Although we can detect some general trends, the charts aren’t that informative since there are too many overlapping points.

We can try pushing this further by reducing the opacity to alpha = 0.05.
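As a rough illustration of that change, here is a minimal, self-contained sketch. It uses simulated data rather than the Spotify dataset, so the `toy` data frame and its columns are assumptions made purely for the example. The only part that matters is the `alpha = 0.05` argument.

```r
library(ggplot2)

#simulate a large, heavily overlapping point cloud (hypothetical data, not the Spotify set)
set.seed(42)
n <- 50000
toy <- data.frame(feature = rnorm(n))
toy$danceability <- 0.5 + 0.1 * toy$feature + rnorm(n, sd = 0.15)

#same scatterplot idea as above, but with much lower point opacity
p <- ggplot(toy, aes(feature, danceability)) +
  geom_point(col = "#00BFC4", alpha = 0.05) +
  labs(x = "", y = "Danceability")

print(p)
```

With an opacity this low, a single point is nearly invisible, so only regions where many points overlap show up clearly.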

This improved things, and some might advocate that the chart is informative enough now. However, I disagree as I still have to focus too much to extract the trend and value distribution information.

💡 The alternative

We can see from the above scatterplots that the dataset contains both discrete features (Explicit, Mode, Key, Time Signature), which can be treated as ordinal, and continuous numeric features. In the case of the discrete features, the simplifying solution is obvious: we can just use a summary plot such as a boxplot.

But what about the numeric ones? Well, the idea is to group the points into multiple equally wide bins, effectively turning them into ordinal features, and enabling the use of the approach described above.
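To make the binning idea concrete, here is a small sketch using base R's `cut()` on simulated values. The `acousticness` and `danceability` vectors below are made-up stand-ins, not the real Spotify columns. When `breaks` is a single number, `cut()` divides the range of the data into that many pieces of equal length and moves the outermost limits slightly outward so that the extreme values are included.

```r
#simulated stand-ins for two Spotify features (hypothetical values)
set.seed(1)
acousticness <- runif(5000)
danceability <- 0.6 - 0.2 * acousticness + rnorm(5000, sd = 0.1)

#split the numeric feature into 15 bins of (nearly) equal width
acousticness_bin <- cut(acousticness, breaks = 15)

#each factor level is labelled with the interval it covers
head(levels(acousticness_bin), 3)

#number of points that fall into each bin
table(acousticness_bin)

#median danceability per bin, the statistic a boxplot draws as its centre line
bin_medians <- tapply(danceability, acousticness_bin, median)
round(bin_medians, 2)
```

The result is a factor whose levels follow the order of the intervals, which is what lets a boxplot treat each bin as one category on the x-axis.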

Although boxplots are the most commonly used summary plots, I will also add violin plots in the background. This way, aside from summary stats of the binned points provided by the boxplot (median, interquartile range), we can also easily see the distribution of values within a specific binned range.

This powerful combo reduces chart clutter and enables much easier identification of trends.

Let’s try this out for the Acousticness-Danceability plot and then visualize all the features at once afterwards.

#turn numeric features into ordinal by binning with the cut function
spotify_plot <- spotify %>%
  mutate(time_signature = as.numeric(as.character(time_signature))) %>%
  mutate(across(-c(1:4, 7, 8, 10, 12, 19, 20), ~ cut(., breaks = 15))) %>%
  mutate(across(c(7, 10, 12, 19), as.factor)) %>% #turn the existing ordinal variables into factors
  select(-c(1:4, 20)) #remove unnecessary track id columns

spotify_plot %>%
    select(danceability, acousticness) %>%
    ggplot(aes(acousticness, danceability)) +
    geom_violin(fill = "#00BFC4", alpha = 0.5) +
    geom_boxplot() +
    labs(x = "", y = "Danceability") +
    theme_few() +
    theme(text = element_text(size = 20), 
    #tilt the axis labels to prevent overlap
    axis.text.x = element_text(angle = -45, hjust = 0, vjust = 1))

Much better! The decreasing trend in song danceability with increasing acousticness is now much easier to see. Thanks to the violin plots in the background, we can also see that danceability is roughly normally distributed. The zero-danceability outliers are identified by the boxplot function and plotted as separate points.
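For context on how those separate points arise: by default, `geom_boxplot()` draws any value lying more than 1.5 times the interquartile range beyond the box edges as an individual point. Below is a minimal sketch of that rule, using a handful of made-up danceability values rather than the real data.

```r
#made-up danceability values for a single bin (hypothetical)
x <- c(0, 0.45, 0.5, 0.55, 0.6, 0.62, 0.65, 0.7, 0.75)

#box edges: first and third quartiles
q <- quantile(x, c(0.25, 0.75))
iqr <- unname(q[2] - q[1])

#fences at 1.5 * IQR beyond the box edges
lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[2]) + 1.5 * iqr

#values outside the fences are the ones drawn as separate points
outliers <- x[x < lower_fence | x > upper_fence]
outliers
```

In this toy example only the zero value falls outside the fences, mirroring the zero-danceability points in the chart.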

Let’s extend our solution to all other features and create a complete feature relationship panel.

spotify_plot %>%
  rename_with(str_to_title) %>%
  rename("Duration (min)" = Duration_min,
         "Loudness (dB)" = Loudness,
         "Time Signature" = Time_signature) %>%
  pivot_longer(-Danceability, names_to = "parameter", values_to = "value") %>%
  arrange(parameter, value) %>%
  ggplot(aes(value, Danceability)) +
  geom_violin(fill = "#00BFC4", alpha = 0.5) +
  geom_boxplot() +
  facet_wrap(~ parameter, scales = "free_x") +
  labs(x = "", y = "Danceability") +
  theme_few() +
  theme(axis.text.x = element_blank(),
        text = element_text(size = 20))

Although we lost some detail compared to the original scatterplot panel, it is much easier to detect the underlying trends. Notice that, unlike with the single chart above, I removed the bin ranges on the x-axis in order to reduce clutter when visualizing all the features at once.

Judging by these visualizations, the most impactful features appear to be Duration, Energy, Loudness, Tempo, Time Signature and Valence. Acousticness, Explicitness, Liveness, Popularity and Speechiness show some effect, but not as much as the former group. The Key, Instrumentalness and Mode features seem to have no strong effect on the danceability rating.

🗒️ Conclusion

That’s all for this post. I’ve shown how to bypass the clutter of overlapping points when analyzing large datasets. I hope you will find the proposed solution useful whenever you encounter large datasets and wish to easily identify the trends in the feature relationships. If you have any comments, questions, suggestions, or requests for other custom plots, please let me know in the comments.

And, of course, if you liked the post, follow me for more similar content 😉.

Written by Zvonimir Boban