How Temperature Affects Music Preferences

Analyzing how a country’s temperature impacts listener preferences with Spotify Data.

Ben Plotnik
Towards Data Science

Photo by sgcdesignco on Unsplash

This report is the result of a final project for the Computational Analysis of Big Data Course at DIS Copenhagen. Team members are Michael Datz, Ji Man Kim, and myself.

Check out the GitHub here: https://github.com/brp616/Spotify-Temperature-Project/ if you want to explore the data yourself.

Introduction

Metal from Scandinavia, dance music from Latin America, hip hop from the United States. These are just some of the associations we make between music genres and their regions of origin. As avid music lovers, we wanted to explore whether these popular conceptions bear out in the data. Delving deeper, we wanted to know whether local temperature could play a significant role in people’s listening preferences. After all, it only makes sense that Danes holed up in their homes during the cold winter would enjoy some angsty rock or calming folk, while someone dancing the night away on a hot, sandy beach is going to want to hear pop or EDM.

To explore this, we used Spotify data on top music genres in different countries to build an interactive map and visuals illustrating trends in global listenership by temperature. Before we get to the good stuff, we’ll show you how we got there, so you have a better understanding of our results, the potential pitfalls of our findings, and how a basic data science project is performed!

Prerequisites and Definitions

New to Data Science? Here are some names and terms we’ll be throwing around that you should know before going ahead.

Python: Very popular programming language for data science that we used to get and analyze our data.

NumPy, Pandas: These and any other names you may not recognize are different libraries we bring into Python that allow us to perform powerful tasks with ease and not have to build all of our systems from scratch.

Data Scraping: The process of importing information from a website into a local file saved on your computer. We had to scrape some data in this project because it wasn’t all available in ready-made data sets.

API: Application Programming Interface, basically the language a data scientist uses to communicate with a website or app when they are requesting their data. In our case, we used Spotify’s specific API to get genre information for various songs.

And if anything else confuses you, just look here on Medium or google your answer! There are plenty of great resources for data science. I, the one writing this, was able to teach myself the basics pretty quickly, and you can too.

Data Collection and Cleaning

Data cleaning, 2020 (colorized)

There are 3 main data sources we used in this project, which I’ll explain further one by one. They are:

Temperature data

Spotify dataset from Kaggle

Scraped genre data from Spotify

Temperature Data:

To get temperatures for the countries in our analysis, we used this climate change dataset from Kaggle. It is essentially a repackaged version of data from the Berkeley Earth Surface Temperature Study, which has compiled over 1 million data points from 16 archives. For our analysis, we took data points from a series of days across the year 2013 (the last complete year in the set) and averaged them to get a country’s temperature index. One potential weakness of this approach is that by matching top songs from each country with a nationwide temperature average, we miss regional differences in musical preferences that could coincide with temperature within large countries like the US.
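The averaging step can be sketched in a few lines of plain Python. This is a minimal sketch, not our exact code: the column names `dt`, `AverageTemperature`, and `Country` are assumptions about the layout of the Kaggle file, and the sample rows are made up purely to show the call.

```python
import csv
import io
from collections import defaultdict


def average_temperature_by_country(csv_file, year="2013"):
    """Return {country: mean temperature} for rows whose date starts with `year`."""
    readings = defaultdict(list)
    for row in csv.DictReader(csv_file):
        value = row["AverageTemperature"]
        # Skip other years and rows with a missing reading.
        if not row["dt"].startswith(year) or value == "":
            continue
        readings[row["Country"]].append(float(value))
    return {country: sum(vals) / len(vals) for country, vals in readings.items()}


# Made-up sample in the assumed layout, only to demonstrate the function.
sample = io.StringIO(
    "dt,AverageTemperature,Country\n"
    "2013-01-01,0.0,Denmark\n"
    "2013-07-01,18.0,Denmark\n"
    "2012-07-01,30.0,Denmark\n"
    "2013-01-01,,Brazil\n"
    "2013-07-01,24.0,Brazil\n"
)
country_temps = average_temperature_by_country(sample)
```

Passing an open file handle instead of the `io.StringIO` sample would run the same logic on the real data.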

Spotify Dataset:

While none of the datasets with Spotify listener data was perfect, this was the best available. Spotify does not release any direct information about geographic listenership, even within their API, so we used the next best thing: song popularity rankings for 53 countries. This dataset holds track names, artists, Spotify reference numbers, chart date, and country code for each chart entry. It is extremely large and proved unwieldy, particularly when retrieving and matching genres with tracks (see the Scraped Data section). Time simply would not have allowed us to work with the whole set, so we instead took a snapshot of the year by creating a new data frame with the top charts from the first day of every other month. This data frame was the foundation of our analysis: it is from here that we matched country codes with temperature data and retrieved track info like genre and tempo.
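Here is a rough sketch of the snapshot filter in plain Python. It rests on a few assumptions: that dates are stored as `YYYY-MM-DD`, that the date column is called `Date`, and that "every other month" means January, March, May, and so on. The sample rows are invented for illustration.

```python
import csv
import io

# Assumption: "every other month" means the odd-numbered months.
SNAPSHOT_MONTHS = {"01", "03", "05", "07", "09", "11"}


def is_snapshot_date(date_string):
    """True for ISO dates (YYYY-MM-DD) that fall on the 1st of a snapshot month."""
    _year, month, day = date_string.split("-")
    return day == "01" and month in SNAPSHOT_MONTHS


def snapshot_rows(csv_file, date_column="Date"):
    """Keep only the chart rows dated on a snapshot day."""
    return [row for row in csv.DictReader(csv_file) if is_snapshot_date(row[date_column])]


# Made-up chart rows in the assumed layout.
sample = io.StringIO(
    "Track Name,Date,Region\n"
    "Song A,2017-01-01,dk\n"
    "Song B,2017-01-02,dk\n"
    "Song C,2017-02-01,dk\n"
    "Song D,2017-03-01,br\n"
)
snapshot = snapshot_rows(sample)
```

Filtering while reading keeps memory use low, which matters for a chart file of this size.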

Scraped Data:

This is the part that took some ingenuity. While our Kaggle data gave us info on track names and artists, it did not tell us anything more about the music itself. For this, we had to go to the source: the Spotify API. With the help of the Spotipy library, we were able to easily grab what we needed, with a few (big) caveats. While Spotify does hold genre data, it assigns genres to artists, not songs. So we had to use Spotipy to pull up the markup for a track, get the artist from the markup, and then retrieve all the genres associated with that artist, which we then paired with the song.

Spotify’s API also doesn’t like you using too much of their bandwidth, so they cut you off from their server after an hour and make you request access again. And, after every day, you need a whole new request. And yes, this took over a day to run, even with the limited dataset. After much trial and error, we worked around this by splitting the scraping process into a few steps and keeping a count of the number of tracks analyzed, which triggered a new request for API access at regular intervals. This allowed us to grab genre info and music metrics (tempo, danceability, and acousticness) for each song. Look below to see the process for how we made our initial API request to get track markups, which we repeated with slightly different requests to get all the data we needed:
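A minimal sketch of that request loop follows, assuming Spotipy's `Spotify.track` and `Spotify.artist` calls and client-credentials authentication. The function names, the `refresh_every` threshold, and the choice to use only the first listed artist are illustrative assumptions, not our exact code.

```python
def make_client(client_id, client_secret):
    """Build a fresh Spotipy client (requires `pip install spotipy`)."""
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    auth = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
    return spotipy.Spotify(auth_manager=auth)


def genres_for_track(client, track_id):
    """Look up a track, then its first listed artist, and return that artist's genres."""
    track = client.track(track_id)
    # Assumption: the first listed artist stands in for the whole track.
    artist_id = track["artists"][0]["id"]
    return client.artist(artist_id)["genres"]


def collect_genres(track_ids, client_factory, refresh_every=500):
    """Map each unique track id to a genre list, rebuilding the client periodically."""
    genres_by_track = {}
    client = client_factory()
    # dict.fromkeys drops duplicate ids while keeping their original order.
    for count, track_id in enumerate(dict.fromkeys(track_ids), start=1):
        genres_by_track[track_id] = genres_for_track(client, track_id)
        if count % refresh_every == 0:
            client = client_factory()  # request API access again
    return genres_by_track
```

A call might look like `collect_genres(ids, lambda: make_client(MY_ID, MY_SECRET))`, where `MY_ID` and `MY_SECRET` are placeholders for your own credentials. Passing the client in as a factory also makes the loop easy to try out with a stand-in object before spending real API requests.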

Data Storage and Manipulation

Matching music info with countries and temperature:

While we originally experimented with a bag-of-words matrix holding genres and audio features for each song, the memory it required made it nearly impossible to work with: loading the data file would either take over a day or crash my coding notebook.

We instead used two much simpler formats built into Python: lists and dictionaries. To do this for genres, we built a dictionary with each unique track in our dataset as a key and a list of genres as the value. We then looped through each track in our original data frame and added a column with the genres for each track. From here, we went through every country code in the data frame and counted the number of times each genre appeared to get a new dictionary with just what we were after: each country and its most popular genres, as well as the top genres in the world for comparison. In this dictionary, and in another where counts were converted to proportions of all songs in a country’s chart, we simply replaced each country’s name with its average temperature to form the basis of our heatmap and interactive map.

For audio features, we used a similar procedure, but instead averaged tempo, danceability, and acousticness for each country before replacing country names with temperatures.
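The averaging can be sketched like this. It is a minimal sketch that assumes chart rows are dictionaries with hypothetical `country` and `track_id` keys, and that features were already collected into a dictionary keyed by track. One deliberate difference from the description above: it returns (temperature, values) pairs rather than a dictionary keyed by temperature, so two countries with the same average temperature cannot overwrite each other.

```python
from collections import defaultdict

FEATURES = ("tempo", "danceability", "acousticness")


def average_features_by_country(chart_rows, features_by_track):
    """Average each audio feature over all chart entries of a country."""
    sums = defaultdict(lambda: dict.fromkeys(FEATURES, 0.0))
    counts = defaultdict(int)
    for row in chart_rows:
        track_features = features_by_track.get(row["track_id"])
        if track_features is None:
            continue  # no features were retrieved for this track
        for name in FEATURES:
            sums[row["country"]][name] += track_features[name]
        counts[row["country"]] += 1
    return {c: {name: sums[c][name] / counts[c] for name in FEATURES} for c in counts}


def sorted_by_temperature(per_country, country_temps):
    """Swap country keys for temperatures, as (temperature, values) pairs, coldest first."""
    pairs = [(country_temps[c], values) for c, values in per_country.items() if c in country_temps]
    return sorted(pairs, key=lambda pair: pair[0])
```

Countries without a temperature reading are silently dropped by `sorted_by_temperature`, which mirrors how countries missing from either dataset fall out of the analysis.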

This gist shows how we built the dictionary with genre counts for each country.
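A minimal sketch of that counting step, assuming chart rows are dictionaries with hypothetical `country` and `track_id` keys and that `genres_by_track` is the track-to-genres dictionary described above:

```python
from collections import Counter, defaultdict


def genre_counts_by_country(chart_rows, genres_by_track):
    """Count how often each genre appears in each country's chart entries."""
    counts = defaultdict(Counter)
    for row in chart_rows:
        counts[row["country"]].update(genres_by_track.get(row["track_id"], []))
    return counts


def world_genre_counts(counts):
    """Add up the per-country counters to get worldwide genre counts."""
    return sum(counts.values(), Counter())


def genre_shares(counts, chart_rows):
    """Convert counts to the share of a country's chart entries tagged with each genre."""
    entries = Counter(row["country"] for row in chart_rows)
    return {
        country: {genre: n / entries[country] for genre, n in genre_counts.items()}
        for country, genre_counts in counts.items()
    }
```

Because one artist can carry several genre tags, the shares for a country can add up to more than 1.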

Building the interactive map:

The interactive map shows the average temperature for each country and compares that with the top genres for that nation. To create this map, we combined the temperature data gathered from Kaggle with a country shapefile that could be read by Geopandas, the library we used to generate this map.

After cleaning the data to only include the most recent temperature data, we had to merge the temperature data with the shapefile so that it could be output by Geopandas. Merging these two files gave us a map of average temperatures around the world, which is only half of what we wanted to look at.

We then took the genre data gathered from Spotify, a dataset multiple gigabytes in size, and parsed it to get the top genres for each country. It should also be mentioned that Spotify data was not available for many countries on the map, so we had to leave those out of the analysis. We repeated the merging process with our Spotify data and fed that into Geopandas, then rendered the output with the Bokeh Python library: an interactive map showing top genres around the world alongside temperature data. After looking at the data, however, we realized that because of Spotify’s genre categorization system, the top genre in nearly every country was “pop,” so we cut that out of the dataset and generated our results based on the remaining genres. The result is the interactive map that you see below, with data from various climates and areas of the world.
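Picking a country's top genre once "pop" is removed might look like the sketch below. It assumes the exclusion is an exact match on the tag "pop", so variants like "dance pop" stay in; the function name and the `excluded` parameter are illustrative.

```python
def top_genre(genre_counts, excluded=("pop",)):
    """Return the most frequent genre that is not in `excluded`, or None if none remain."""
    candidates = {genre: n for genre, n in genre_counts.items() if genre not in excluded}
    if not candidates:
        return None
    # On a tie, max() returns the first of the tied genres in the dictionary's order.
    return max(candidates, key=candidates.get)
```

The resulting genre per country can then be joined onto the map data by country code, in the same way as the temperature values.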

Results

A. Interactive Map

Screenshot of our interactive map, showing the top genre for the United States.

Take a look at the code to rebuild the map and play with the interactive element. As you can see, there wasn’t much correlation that we could distinguish given the genre data from Spotify. It seems that pop music is simply popular all around the world, with regional preferences that correlate little with temperature.

B. Heatmap of Genre Popularity

This graphic displays what share of the most popular songs in different climates is associated with each of the 20 most popular genres across the globe, along with the most popular genres organized by country for comparison. Warmer colors indicate a higher proportion of songs, and hence higher popularity, while cooler colors indicate fewer songs. While interesting, this graphic shows few strong relationships between temperature and genre preference; a strong linear relationship would show up as a smooth vertical gradient. ‘Dancier’ genres like EDM, latin, and reggaeton may be slightly more popular at the very high end of the temperature range, but only dance-pop shows a linear trend that could indicate a clear difference. This corresponds with a slightly lower proportion of pop (as opposed to its higher-tempo variants) in most warmer countries. This suggests that other factors like culture, language, and proximity to artists may determine genre preferences rather than temperature alone.

C. Tempo & Danceability Analysis

Each point on these graphs represents a country and its average level of danceability, acousticness, speechiness, or tempo. The definitions below are taken from the Spotify API.

Danceability: Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

Speechiness: Detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

Tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

Comparing the temperature data with the average audio features by country, we conclude that there is really no correlation between average country temperature and audio features, as they all fall around the same values. A cool follow-up experiment would be to see whether there is a slight deviation across seasons.
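One way to put a number on "no correlation" is the Pearson correlation coefficient between each country's average temperature and its average value for a feature. We judged the relationship from the graphs, so treat this as a sketch of a possible follow-up check rather than part of our analysis. The function is written from the standard formula.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return covariance / math.sqrt(var_x * var_y)
```

Here `xs` would hold one temperature per country and `ys` the matching feature averages, in the same country order. A result near 0 would back up the visual impression, while values near 1 or -1 would indicate a strong linear relationship.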

Final Thoughts

Although we didn’t see any correlation between climate and listening habits in our data, this project was still a great exercise in looking for patterns in data as it relates to something we all enjoy: music.

What we have right now shows that pop music is popular around the world no matter the climate, which isn’t anything new. We believe the Spotify data is flawed in that it only gives genre tags based on the artist. If genres were assigned on a song basis, there might be more differentiation in our data, as it would provide a deeper classification. There is also the possibility that Spotify puts more priority and marketing behind popular music, which could skew the data and make it so that popular songs stay popular.

We could also take this project a step further and compare listening habits across the seasons; that way, we might see a more balanced distribution of genres. Overall, this project was a novel idea and, given a more balanced dataset, could provide some insightful results.

Data science, economics, and political science student at Tufts University. Music and film lover.