Is music a reflection of society?

A data-analytical exploration to find interesting trends in the music industry

Published in

Towards Data Science

19 min read · Dec 9, 2020

Remember when you were a teenager and your parents told you to stop listening to that heavy metal band (in my case, it was Rhapsody), because if you kept listening, you would end up like the “drug-abusing, good-for-nothing” musicians who appeared on the album cover? (Ah, those lovely stereotypes…)

Well, guess what? As it turns out, it also works the other way around: the music you listen to actually reflects who you already are. Taking it a step further, the media that we consume could be a reflection of who we are as a society, hence studying media consumption could unlock a wealth of knowledge about our identity as people, where we come from and (predictive modeling, anyone?) where we are going in the future.

So, with this basic idea in our heads, let’s play around with some data to find out how music consumption has evolved in the past decade, and how those trends may impact the music industry.

Part 1: The data

The data I’m using here was downloaded from Kaggle, where it had been scraped directly from Spotify by someone other than me. It has already been cleaned (also by someone else), so we shouldn’t run into any major issues with things like null values or numerical mistakes.

The entities in the dataset consist of the most popular songs (according to Billboard) from 2010 to 2019. The descriptive variables include data like the year the song was chosen by Billboard among the most popular, the artist, and several musical characteristics.

About the musical characteristics: they are descriptions of what each song sounds like. Some are related to basic musical concepts like tempo and duration, which makes them easy to understand. Others are a bit more abstract and honestly kinda confusing. As a data analyst, one of the first things you want to do is make sure you understand the variables, because your entire analysis depends on the interpretation you give to your raw data.

So, for the sake of clarity, let’s lay out the meaning of the descriptive variables we will focus on here:

df.head()

BPM: short for Beats Per Minute. It’s the tempo or “speed” of the song.
Energy: the name explains itself.
Danceability: the higher the value, the easier it is to dance to the song.
dB: how loud the song is. In digital productions, these values are negative.
Valence: a measure of how “positive” or happy the song is.
Duration: duration in seconds.
Acousticness: measures whether the song uses acoustic instruments.
Year: the year when each song was chosen by Billboard as one of the most popular. Keep in mind that a song can be chosen in multiple years (though this is rare).
Popularity: the name explains itself, although it is unclear how the measure is obtained (I’ll try to stay away from variables that aren’t completely clear).

Part 2: Context matters

Keep in mind that this is a data-analytical exploration to find inspiration and interesting ideas. Basically, this is me playing around with a bunch of data, turning it upside down, filtering it, plotting it, trying out different angles to see what I find and figuring out if those findings are worth further analysis. Remember, as analysts we are only looking at a specific set of data and any patterns we find must be confined to that specific dataset. If, however, any finding seems REALLY interesting and useful for the business, we may send it over to our friends the statisticians for some good ol’ statistical inference.

In other words, this is me doing what data analysts do all day long (which is why I believe data analysts have the coolest job in data science).

On the other hand, whenever you “play around with data” it’s important to have a clearly established context, to make sure you know why you’re playing and what exactly you will consider useful findings. When an analyst does his thing, he almost never does it just for fun (as fun as it may be), but because he’s trying to help somebody do something. So what I’m saying is, make sure your data explorations make sense and help your organization get to where it wants to go.

OK, enough blah-blah-blah, let’s get down to business.

Part 3: A quick look around to see what we got

The first thing I want to know is how many songs in the dataset were chosen by Billboard each year. Let’s take a look:

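# Imports used by the plotting and analysis code in this article
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Count how many songs the dataset holds for each year
sns.set_style("whitegrid")
g = sns.countplot(x='Year', data=df)
g.set_xlabel("Year", size=15)
g.set_ylabel("Count of songs", size=15)
g.set_title('Number of songs per year', size=15, y=1.05)
g.set_ylim(0, 100)
plt.show()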

Interesting to note that there’s a different number of songs for every year. It would’ve been nice to have the same number of songs throughout, but hey, part of an analyst’s job is to play with the data he’s got and make the best of it. Also, for some reason, the data has very few songs in 2019, which I’ll keep in mind, since it might influence some of the results.

I also want to know how many songs and descriptive variables there are in total:

df.shape

Like I said at the beginning, the data has been previously cleaned, so no issues should pop up there, but let’s just make sure that there aren’t any null values:

df.isnull().sum()

Before I do any analysis, I want to look at the distribution of the musical attributes. This lets me check (in a very general way) whether the values look reasonable (aka free of errors). I’ll start with BPM:

sns.set_style("white")
g = sns.boxenplot(data=df, x='Year', y='BPM')
g.set_title("Distribution of BPM by year", size=15, y=1.05)
g.set_xlabel('Year', size=15)
g.set_ylabel('BPM', size=15)
g.set_ylim(-10, 220)
plt.show()

AHA! Right there I can see there is a value that’s probably wrong. Can you see it? There appears to be a song listed under 2016 with a tempo of 0 or close to 0. That of course cannot be, because no song is that slow. It’s important to spot these situations, because erroneous values can throw off our analysis. So let’s find out which one it is:

df[df["BPM"] == 0]

I don’t know about you, but I’ve never heard an Adele song with a tempo of 0. Furthermore, almost all of the other values are also 0, so this is clearly a mistake. Seems like the data cleaning performed on this dataset skipped this one song. I’ll go ahead and delete it because it will not greatly impact the data.

# 442 is the index label of the row returned by the filter above
df.drop(442, inplace=True)
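Dropping by a hard-coded index label works here, but it depends on the rows keeping their original labels. A minimal sketch of a condition-based alternative follows. The toy frame and its values are made up for illustration and are not rows from the dataset:

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Title": ["Song A", "Song B", "Song C"],
    "BPM": [120, 0, 95],
})

# Keep only rows with a plausible tempo instead of dropping a fixed index label
cleaned = toy[toy["BPM"] > 0].reset_index(drop=True)

print(cleaned)
```

The filter removes every row with a tempo of 0, so it keeps working if the data is reloaded or re-indexed.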

Let’s check the distribution again to make sure the problem’s fixed:

Solved.

Now, the distribution of duration shows a very short song in 2015 but that isn’t necessarily an error:

sns.set_style("white")
g = sns.boxenplot(data=df, x='Year', y='Duration')
g.set_title("Distribution of Duration by year", size=15, y=1.05)
g.set_xlabel('Year', size=15)
g.set_ylabel('Duration (seconds)', size=15)
plt.show()

In fact it’s Justin Bieber’s “Mark My Words”, which is indeed only 134 seconds long (you can check it out on Spotify; it’s a pretty good song).

df[df['Duration']<140]

The rest of the variables seem reasonable, so no problem there:

sns.set_style("white")

variables = ['Energy', 'Danceability', 'Valence', 'Acousticness', 'Speechness', 'Popularity']

# One stacked panel per variable, sharing both axes
fig, axes = plt.subplots(nrows=6, ncols=1, sharex=True, sharey=True, figsize=(10, 10))

for v, ax in zip(variables, axes):
    sns.boxenplot(data=df, x='Year', y=v, ax=ax)

fig.suptitle("Distribution of other variables, by year", size=20, y=0.93)
axes[-1].set_xlabel('Year', size=15)
plt.show()

To close up this part, I want to point out that these musical characteristics are identified automatically by Spotify. Although the detection process is very good, it isn’t perfect, and this dataset provides a good example. Check out the tempo of the song FourFiveSeconds:

df[df['Title'] == 'FourFiveSeconds']

Spotify identifies the tempo at 206, but if you’ve ever heard the song, your musical ear tells you there’s something wrong there. Actually, the song’s tempo is 103 (half of what Spotify identified). The likely reason is that the detection counted eighth notes instead of quarter notes, which doubles the count. This is a subtle mistake, but the point I’m trying to drive home is that you should always check your raw data before analyzing it, and use your domain expertise to make sure it is appropriate and won’t give you bogus results.
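One way to surface candidates for this kind of double counting is to flag unusually fast tempos and look at what the halved value would be. The sketch below uses made-up rows and an arbitrary cutoff of 180 BPM (my assumption, not a rule from Spotify). A flagged song still needs a human ear to confirm the real tempo.

```python
import pandas as pd

# Toy data (made up); only the 206 BPM value echoes the example in the text
toy = pd.DataFrame({
    "Title": ["Fast one", "Mid one", "Slow one"],
    "BPM": [206, 120, 80],
})

THRESHOLD = 180  # arbitrary cutoff chosen for illustration

# Flag tempos above the cutoff as possible double-time detections
toy["Suspect"] = toy["BPM"] > THRESHOLD

# Candidate tempo if eighth notes were counted instead of quarter notes
toy["Halved BPM"] = toy["BPM"] / 2

suspects = toy[toy["Suspect"]]
print(suspects[["Title", "BPM", "Halved BPM"]])
```

This only narrows down which songs to listen to; it does not correct anything automatically.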

Part 4: Q&A

OK, this is where the fun part starts, and why I love the job of the data analyst. Basically, what we do is play with the data to start asking questions and finding answers. Some of these questions and answers might be useful, some might not; only the process itself will tell us which are which.

Also keep in mind that some of the questions may already have been asked by your organization’s decision makers, in which case you can go ahead and try to answer them. Finally, remember that part of the analyst’s job is to help the decision maker to correctly frame the questions and state them clearly in analytical terms. This is where your domain knowledge starts to kick in and you become an analysis Rockstar.

Question 1: Who were the most popular artists of the decade?

To answer this question, I’ll begin by looking at the artists with the most appearances (or hits) in the dataset, aka the most popular artists in the entire decade. The focus will be on those who appear more than ten times, to keep the visualization manageable.

# Count appearances per artist and keep those with more than 10
hits = df.Artist.value_counts().sort_values(ascending=False)
hits = pd.DataFrame(hits)
hits.columns = ['Hits']
top_artists = hits[hits['Hits'] > 10]

fig, ax = plt.subplots()
sns.barplot(data=top_artists, x='Hits', y=top_artists.index, ax=ax)
ax.set_xlabel('Number of Songs', size=15)
ax.set_xlim(0, 18)

fig.suptitle('Most Popular Artists of the Decade', size=20)
plt.yticks(size=15)
plt.xticks(size=15)
plt.show()

No surprise here. Katy Perry, Justin Bieber and Rihanna dominate the list of the most popular artists of the decade, according to this dataset. Now, this is a very general indicator because the music industry changes A LOT throughout a decade, so let’s zoom in to see the graphs by year. Again, to keep things manageable I’ll focus on those artists with more than 2 songs in the specific year.

years = list(df['Year'].unique())

for year in years:

    # Artists with more than 2 songs in this year
    data = df[df['Year'] == year]
    hits = data.Artist.value_counts().sort_values(ascending=False)
    hits = pd.DataFrame(hits)
    hits.columns = ['Hits']
    top_artists = hits[hits['Hits'] > 2]

    sns.set_style("whitegrid")

    fig, ax = plt.subplots()
    sns.barplot(data=top_artists, x='Hits', y=top_artists.index, ax=ax)
    ax.set_xlabel('Number of songs', size=15)
    ax.set_xlim(0, 10)
    ax.set_ylabel('Artist', size=15)
    ax.set_title(year, y=1.05, size=20)

    plt.show()

Now we have a clearer view of how the balance of power moved in the music industry during the decade. The first years show artists like Black Eyed Peas, Christina Aguilera and Lady Gaga, while the middle years mark the ascension of One Direction, Katy Perry and Bruno Mars. The decade closes with a strong push by Ed Sheeran, as well as The Chainsmokers and Shawn Mendes.

Interesting to note that most top artists hover around 4 songs per year, with only a few exceptions like Ed Sheeran in 2019 and (most notably) Justin Bieber with a mind-boggling 9 hits in 2015 (this is slightly attenuated by the fact that 2015 is the year with the most songs in the dataset).

Question 2: How is the “share of popularity” distributed each year?

A common opinion about the music industry is that it tends to be monopolized by a relatively small number of very famous artists. Indeed, it is really hard to become massively known in this industry, since it involves numerous aspects like good music production, good interpretation, image creation, investment in promotion, having an extensive network within the various music circuits, managing tours, and many more. This idea links straight back to what Michael Porter calls “barriers to entry”: basically, it means that it is extremely hard for new competitors (artists) to enter the music industry and gain some traction.

Thinking about this made me curious to see how popularity was distributed throughout the last ten years. A first glimpse at the answer is to look at the number of songs per artist per year, since it could be an indication of the distribution of popularity among the artists present in Billboard’s list.

# Calculate the average number of songs per artist for each year
years = list(df['Year'].unique())
artist_counts = []
song_counts = []

for year in years:

    data = df[df['Year'] == year]
    unique_artists = data["Artist"].unique()
    artist_count = len(unique_artists)
    artist_counts.append(artist_count)
    unique_songs = data["Title"].unique()
    song_count = len(unique_songs)
    song_counts.append(song_count)

songs_per_artist = np.array(song_counts) / np.array(artist_counts)
dic = {"Year": years, "Songs per artist": list(songs_per_artist)}
songs_per_artist_df = pd.DataFrame(dic)

# Plot a timeline of the info
sns.set_style("whitegrid")
custom_palette = ["#21B36A"]
sns.set_palette(custom_palette)

fig, ax = plt.subplots(figsize=(10, 10))

sns.lineplot(x='Year', y='Songs per artist', data=songs_per_artist_df, ax=ax)
ax.set_xlabel('Year', size=15)
ax.set_ylabel('Average songs per artist', size=15)
ax.set_ylim((1.2, 2))

plt.xticks(range(2010, 2020), size=15)
plt.yticks(size=15)

fig.suptitle("Songs Per Artist", size=20, y=0.93)
plt.show()

Here we can see that the first couple of years of the decade had a high number of songs per artist (high concentration of popularity), followed by 3 years of lower concentration, after which the number increases moderately to close the decade.
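The songs-per-artist ratio can also be computed without an explicit loop. Here is a minimal sketch using a pandas groupby on a tiny made-up frame; the column names match the ones used in this article, but the rows are invented:

```python
import pandas as pd

# Toy chart entries (made up): one row per song per year
toy = pd.DataFrame({
    "Year":   [2010, 2010, 2010, 2011, 2011],
    "Artist": ["A", "A", "B", "A", "C"],
    "Title":  ["s1", "s2", "s3", "s4", "s5"],
})

# Count distinct songs and distinct artists for each year
per_year = toy.groupby("Year").agg(
    songs=("Title", "nunique"),
    artists=("Artist", "nunique"),
)

# Average number of songs per artist, per year
per_year["Songs per artist"] = per_year["songs"] / per_year["artists"]

print(per_year)
```

In the toy data, 2010 has 3 songs by 2 artists (1.5 songs per artist) and 2011 has 2 songs by 2 artists (1.0).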

Another, possibly more interesting way of looking at this question is to analyze the actual popularity distribution across the artists in Billboard’s list. I’ll consider the sum of all songs as the total popularity available that year (aka 100% of the popularity), and ask how much of that pie is eaten by each artist, each year. In other words, what percentage of the totality of songs belongs to each artist? This is what I call share of popularity.

Now, to make this visualization simpler, I decided to calculate the sum of share of popularity among the top 5 artists of each year. This should give us a pretty accurate picture of how much popularity was accumulated by the really huge and famous artists, and how much of the popularity pie was left for everyone else. Let’s take a look:

# Calculate each artist's share of popularity for each year
years = list(df['Year'].unique())
share_of_pop = {}
for year in years:

    data = df[df['Year'] == year]
    # normalize=True returns each artist's fraction of that year's songs, sorted descending
    yearly_share = data["Artist"].value_counts(normalize=True)
    share_of_pop[year] = yearly_share

# Calculate the sum of popularity of the top 5 artists for each year

top_5 = []
for year in years:
    sum_top = share_of_pop[year].head(5).sum()
    top_5.append(sum_top)

top_5_dict = {'Year': years, "Top 5": top_5}
top_5_df = pd.DataFrame(top_5_dict)

# Plot the info
sns.set_style("whitegrid")
custom_palette = ["#21B36A"]
sns.set_palette(custom_palette)
fig, ax = plt.subplots(figsize=(10, 10))

sns.lineplot(x='Year', y='Top 5', data=top_5_df, ax=ax)
ax.set_xlabel('Year', size=15)
# Values are fractions between 0 and 1, not percentages
ax.set_ylabel('Share of popularity (fraction of songs)', size=15)
ax.set_ylim((0, 0.5))

plt.xticks(range(2010, 2020), size=15)
plt.yticks(size=15)
fig.suptitle("Share of popularity - Top 5 artists per year", size=20, y=0.93)
plt.show()

Similar to the previous graph, here I notice that the decade started with several years of great accumulation of popularity by the big artists (they “monopolized” close to 40% of the total songs). Then, in 2013 it goes way down, which is indicative of a more equally distributed popularity share among the artists in the list. Furthermore, this coincides with an important drop in the average number of songs per artist (previous image). Finally, in 2019 the indicator goes up again, possibly because of the “monopolizing” effect of one Ed Sheeran. Keep in mind that the dataset has very few songs for 2019, and this puts an asterisk on that year.

So, to close this question I might summarize by saying that the share of popularity was monopolized by the big artists during the first couple of years (close to 40% of popularity for the top 5 artists each year), after which popularity was better distributed among the artists in the list.
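To make the share-of-popularity arithmetic concrete, here is a small worked sketch. The helper name `top_n_share` and the toy rows are my own inventions for illustration, and it uses the top 2 artists instead of the top 5 so the numbers are easy to check by hand:

```python
import pandas as pd

# Toy chart entries (made up): 5 songs in 2010, 4 songs in 2011
toy = pd.DataFrame({
    "Year":   [2010] * 5 + [2011] * 4,
    "Artist": ["A", "A", "B", "C", "D", "A", "B", "C", "D"],
})


def top_n_share(frame, n):
    """Fraction of each year's songs held by that year's n most frequent artists."""
    return frame.groupby("Year")["Artist"].apply(
        lambda artists: artists.value_counts(normalize=True).head(n).sum()
    )


shares = top_n_share(toy, n=2)
print(shares)
```

In 2010, artist A holds 2 of 5 songs (0.4) and the next artist holds 1 of 5 (0.2), so the top 2 share is 0.6. In 2011 every artist holds 1 of 4 songs, so the top 2 share is 0.5.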

Question 3: How long do artists sustain their popularity? What is an artist’s popularity lifespan?

Considering that the music industry is after all a business, participants (mainly production labels) are always looking for ways to more effectively monetize their musical products. In this sense, one of the most frequent questions is how long an artist can stay relevant. Otherwise put, once we’ve invested in an artist’s career and he/she has reached a certain level of popularity, how long can it be sustained and how much financial advantage can we obtain?

To answer these questions, I looked at the number of years in which an artist has at least one song in Billboard’s list. So, for example, if an artist has at least one song in each of the ten years in the dataset, it means that he/she remained highly popular throughout the decade (this practically never happens).

Since there are many artists in the dataset, I calculated the median number of years with at least one song on the list, for all artists:

# Median number of appearances (in years) in Billboard's list for all artists
yearly_app = df.groupby('Artist')["Year"].unique()
pop_lifespan = []
index = yearly_app.index

for i in index:
    number = len(yearly_app[i])
    pop_lifespan.append(number)

pop_lifespan_array = np.array(pop_lifespan)
lifespan_all = np.median(pop_lifespan_array)
lifespan_all

So, as a record company, you might expect that once one of your artists cracks into Billboard’s popularity lists, he may be there again for one more year in the same decade.

But as a music producer, I know that artists vary greatly in their level of popularity and ability to stay current. So I divided them into three categories: top, medium and small artists. All artists with more than 10 hits throughout the decade are called “top artists”, those with 3 to 10 are “medium artists” and those with fewer than 3 are “small artists”.

# Categorize into top, medium and small artists
hits = df.Artist.value_counts().sort_values(ascending=False)
hits = pd.DataFrame(hits)
hits.columns = ['Hits']

cats = {}
cats['top'] = hits[hits['Hits'] > 10].index
cats['medium'] = hits[(hits['Hits'] < 11) & (hits['Hits'] > 2)].index
cats['small'] = hits[hits['Hits'] < 3].index

And now I calculate the median lifespan (in years) for each category:

# Calculate median lifespan for each category of artists
cats_iter = ["top", "medium", "small"]
medians = {}

for i in cats_iter:
    # Keep only the artists that belong to this category
    filters = yearly_app.index.isin(cats[i])
    app = yearly_app[filters]

    # Number of distinct years on the list for each artist in the category
    lifespan = []
    for x in app.index:
        number = len(yearly_app[x])
        lifespan.append(number)

    array = np.array(lifespan)
    indic = np.median(array)
    medians[i] = indic

print(medians)

Ok so now we can understand better how artistic lifespan works in this dataset. Top artists can be expected to remain popular for 6 years, while medium artists only do so for 3, and small artists only for 1. Let’s visualize this:

categories = ["all", "top", "medium", "small"]
lifespan = [lifespan_all, medians["top"], medians["medium"], medians["small"]]

g = sns.barplot(x=categories, y=lifespan)
g.set_title("Median artist lifespan (in years)", size=20, y=1.05)

plt.xticks(size=15)
plt.yticks(size=15)
plt.show()
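The lifespan and category steps above can also be written as a few vectorized operations. This is a sketch on made-up chart entries, using the same thresholds as the text (more than 10 hits is “top”, 3 to 10 is “medium”, fewer than 3 is “small”). The lowercase column names `hits`, `years_on_list` and `category` are my own:

```python
import numpy as np
import pandas as pd

# Made-up chart entries: one row per (song, year) appearance
toy = pd.DataFrame({
    "Artist": ["A"] * 11 + ["B"] * 4 + ["C"] * 2 + ["D"],
    "Year": (
        [2010, 2011, 2012, 2013, 2014, 2015] + [2015] * 5  # A: 11 hits over 6 years
        + [2012, 2012, 2013, 2014]                          # B: 4 hits over 3 years
        + [2016, 2016]                                      # C: 2 hits in 1 year
        + [2018]                                            # D: 1 hit in 1 year
    ),
})

# Hits and number of distinct years on the list, per artist
per_artist = toy.groupby("Artist").agg(
    hits=("Year", "size"),
    years_on_list=("Year", "nunique"),
)

# Bins are right-inclusive: (0, 2] small, (2, 10] medium, (10, inf) top
per_artist["category"] = pd.cut(
    per_artist["hits"],
    bins=[0, 2, 10, np.inf],
    labels=["small", "medium", "top"],
)

median_all = per_artist["years_on_list"].median()
median_by_category = per_artist.groupby("category", observed=True)["years_on_list"].median()

print(median_all)
print(median_by_category)
```

Using `pd.cut` keeps the category boundaries in one place, which makes it easier to try different thresholds and see how sensitive the medians are to them.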

To close this out, it’s important to keep in mind that the popularity of an artist is mainly a function of how much money, time and energy is invested in his/her career. So, for example, the top artists are those for whom millions of dollars were spent throughout the decade on publicity, image, etc. This means that production and promotion companies need to figure out ways to optimize their Return On Investment, which requires that they weigh everything up and decide how much they think an artist will bring back for them. Usually the “Inverted U” principle holds: the more you invest in an artist, the more return you’ll get, but only up to a certain point, after which the relationship inverts and companies start losing money.

Question 4: How did the main musical attributes evolve through the last decade?

Music taste changes very quickly. It seems like a completely new genre emerges every year, with a whole set of musical characteristics and its very own representative artists. This raises a bunch of questions in my mind. How has music changed in the last decade? Is music faster? Is it more energetic, more danceable, or happier? How has the evolution of society and the music industry changed our musical tastes?

Well, let’s see if the data has something to say about that. I am going to plot some simple KDEs for 4 key musical attributes: BPM, Energy, Valence and Danceability. The goal here is to visually identify how these attributes have “moved” through the years, so I’ll plot one KDE for every year, starting in 2010 and ending in 2019. To make the shift easier to see, I’ll also plot a vertical line that marks the median of each attribute, for each year.

variables = ['BPM', 'Energy', 'Valence', 'Danceability']

sns.set_style("whitegrid")
custom_palette = ["#6F8CB2"]
sns.set_palette(custom_palette)

for i in variables:

    # One row per year, sharing both axes so the curves are comparable
    fig, axes = plt.subplots(nrows=10, ncols=1, sharex=True, sharey=True, figsize=(10, 10))

    years = range(2010, 2020)

    for year, ax in zip(years, axes):

        yearly_values = df[df['Year'] == year][i]

        # kdeplot (seaborn 0.11 or later) replaces the deprecated
        # distplot(hist=False, kde_kws={'shade': True})
        sns.kdeplot(x=yearly_values, fill=True, ax=ax)

        ax.set_xlabel('')
        ax.set_xlim(0, 200)
        ax.set_ylim(0, 0.043)

        # Dashed vertical line at the median of this attribute for this year
        ax.axvline(x=yearly_values.median(), color='#FF6100',
                   label='Median', linestyle='--', linewidth=2)

    fig.suptitle(i, size=20, y=0.93)
    plt.show()

This is very interesting. Tempo has gone consistently down, from a median of 125 BPM in 2010 to around 107 in 2019. Any musician can tell you that a drop of that size makes all the difference in the world in terms of how a song is perceived, how it sounds and the message it communicates. Energy and Valence behave similarly, which makes us wonder if music is less “energetic” and “happy” now than it was a decade ago. This change possibly has to do with the declining popularity of “house-ish” music and the rise of rhythms closer to trap, as well as the return of R&B.

In spite of the decrease in those metrics, we see that danceability remains pretty constant throughout. Perhaps this means that we are now used to dancing to slower rhythms.

Finally, to clarify the findings in this point, I’ll plot the evolution of each musical attribute:

# Make a dictionary of the median of each attribute, for each year
variables = ['BPM', 'Energy', 'Valence', 'Danceability']
years = list(df['Year'].unique())
medians = {}

for v in variables:
    values_list = []
    for y in years:
        median = df[df["Year"] == y][v].median()
        values_list.append(median)
    medians[v] = values_list

medians["Year"] = years

# Plot the median attributes for each year
fig, ax = plt.subplots(sharex=True, sharey=True, figsize=(10, 10))

sns.lineplot(x=medians["Year"], y=medians["BPM"], color="r", label="BPM")
sns.lineplot(x=medians["Year"], y=medians["Energy"], color="b", label="Energy")
sns.lineplot(x=medians["Year"], y=medians["Valence"], color="y", label="Valence")
sns.lineplot(x=medians["Year"], y=medians["Danceability"], color="g", label="Danceability")

fig.suptitle("Median musical attributes", size=20, y=0.92)
ax.legend()  # show the legend built from the label= arguments
plt.xticks(medians["Year"], size=15, rotation=45)
plt.yticks(size=15)
plt.show()
medians
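The per-year medians can also be obtained in a single groupby call, which makes it easy to compute the change between the first and last year. A minimal sketch on made-up values (these are not the dataset’s numbers):

```python
import pandas as pd

# Toy data (made up): three songs in 2010 and three in 2019
toy = pd.DataFrame({
    "Year":   [2010, 2010, 2010, 2019, 2019, 2019],
    "BPM":    [118, 124, 130, 98, 104, 112],
    "Energy": [80, 75, 70, 60, 65, 70],
})

# Median of each attribute, per year (one row per year)
medians_by_year = toy.groupby("Year")[["BPM", "Energy"]].median()

# Change in each median between the first and last year
change = medians_by_year.loc[2019] - medians_by_year.loc[2010]

print(medians_by_year)
print(change)
```

The result is a DataFrame indexed by year, so it can be passed straight to a plotting call or inspected as a table.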

Part 5: How can music industry players benefit from this data exploration? (aka business implications)

Like I said in the beginning, a data analysis is only useful when it helps someone do something. So, how can these findings be useful for music industry participants?

It seems like the really huge artists will have around 4 hits in any given year. This means that record companies need to quantify the income those hits would represent vs the investment required to get there. Considering our top artist will have 4 hits, does it make sense for us to invest in him/her this year?
Popularity seems to have “democratized” somewhat, which might mean this is a good time to launch some new projects that challenge incumbent and established artists. We’ll have to see how this continues to develop though, especially since the “Ed Sheeran effect” still lingers.
An artist’s lifespan depends greatly on his/her traction. A big artist will have a much longer lifespan than a smaller one, but will also require more investment. Is this worth it? This takes us back to the first point above.
It seems clear that the most popular tempo lies around 110 BPM, so production labels might want to resist the temptation of faster beats.

Part 6: Limitations

Like any data exploration, this one has limitations. For starters, we have no geographical data, which means that we can’t know if these tendencies behave differently in different parts of the world, or even for different segments of listeners.

On the other hand, there is some selection bias present. After all, every song in the dataset has been included in Billboard’s list of the most popular songs, which means it is already a worldwide hit. That’s why terms like “small artists” need to be qualified and contextualized, as do the conclusions we obtained.

Speaking of conclusions obtained, I close up this data exploration by saying that all the observations made are restricted to this dataset. They are not to be taken as representative of the entire music industry. Always keep this in mind before you jump to conclusions.

Epilogue

So, does music transform you into something, or is it only a reflection of who you already are? I think it’s a little bit of both. After all, looking back I have to say I did end up (for some time at least) looking like those Rhapsody dudes. Wait, does that mean I look like Mac Miller now? Probably not…