The Data Science of K-Pop: Understanding BTS through data and A.I.

Haebichan Jung
Towards Data Science
7 min read · Oct 27, 2018

--

Making sense of the BTS craze through Data Science.

Intro

BTS (Bangtan Boys) is a K-Pop group that is no doubt an international sensation. The members appeared on the cover of TIME Magazine just over a week ago, and other prominent media outlets like Ellen and the BBC have been cozying up to share the limelight.

With the group’s sudden rise to prominence naturally came the desire to make sense of it all, especially in the West. What makes this group so deserving of all this attention? Given that there are literally hundreds of K-Pop artists out there (not to mention western artists), what makes them so unique compared to their competitors?

The same media outlets have tried to answer these confounding questions. Answers offered include:

  1. The members “completely understand the values of a team” — TIME
  2. “High Quality of Music” — Kpopmap
  3. Confucian values of hard work — Psychologist Watches
  4. “Emotional resonance, sincerity, and an ARMY of fans.” — Vox

The Problem

The problem with these explanations is that most of them aren’t really compelling. They could just as easily be applied to, say, BigBang, another famous Korean boy band, which has also been recognized for “musical variety [and] non-conformity” while “still maintaining unique identity” (Mithunonthe, 2012). Furthermore, these claims aren’t supported by any scientific investigation; they remain words with no real critical validation of any kind.

This is the problem that this article aims to solve, by using the latest data science techniques in machine learning and A.I. to scientifically answer the question of the group’s true identity.

This is the first of a two-part series on K-Pop and data science. In the second article, I will examine the lyrics of K-pop, exploring how data science techniques can shed a quantitative light on BTS lyrics.

An important note: This article is intended for a more general audience. As such, I will not spend time explaining the tools I used in technical detail, so that we can dive straight into the interpretable results. The entire work can be found on my Github.

The Methodology

Data Analysis: Getting and Analyzing the Data

For part one, I explore one obvious aspect of BTS: their music. I tapped into Spotify’s API, which gave me 11 acoustic qualities for every single BTS track. These qualities numerically measure each song’s acousticness, danceability, instrumentalness, etc. The calculations are not my own, but those of Spotify’s internal algorithm.

Numeric breakdown of every single BTS song.
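As a rough illustration of this step, the sketch below keeps only the numeric audio features from one record shaped like an audio-features response. The article does not list which 11 features were used, so the `FEATURES` tuple is an assumption, and the sample record is made up for illustration rather than taken from real BTS data.

```python
# Assumed set of 11 audio features; the article does not list them explicitly.
FEATURES = (
    "danceability", "energy", "key", "loudness", "mode", "speechiness",
    "acousticness", "instrumentalness", "liveness", "valence", "tempo",
)


def to_feature_row(record, artist):
    """Keep only the numeric audio features from one API-style record."""
    row = {"artist": artist}
    for name in FEATURES:
        row[name] = float(record[name])
    return row


# Made-up record shaped like an audio-features response (not real data).
sample = {
    "id": "hypothetical-track-id",
    "danceability": 0.6, "energy": 0.8, "key": 5, "loudness": -4.5,
    "mode": 1, "speechiness": 0.3, "acousticness": 0.05,
    "instrumentalness": 0.0, "liveness": 0.2, "valence": 0.5,
    "tempo": 120.0, "duration_ms": 210000,
}

row = to_feature_row(sample, "BTS")
print(row)
```

Collecting one such row per track yields a table with one line per song and one column per feature, which is the shape the rest of the analysis works with.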

I also collected data for other K-Pop groups like BigBang and Twice, based on this top 20 artist chart. In total, I had 2,673 K-Pop tracks. With a few lines of code, I extracted some interesting mathematical properties of BTS’ music.

These numbers represent the average of each musical feature across the K-pop songs in our dataset. We see here that on average, BTS’ songs have the highest level of speechiness, almost tripling that of other artists. Speechiness is the detection of spoken words in a track, which is roughly the converse of instrumentalness: a measurement of whether a track contains no vocals. No wonder that BTS’ songs are relatively low in the latter metric. In fact, take a look at the chart below:
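Those few lines could look something like the sketch below, which averages one feature per artist. It is not the original notebook code; the function name and the tiny dataset are hypothetical and exist only to show the grouping step.

```python
from collections import defaultdict
from statistics import mean


def average_by_artist(rows, feature):
    """Group rows by artist and average a single feature."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["artist"]].append(row[feature])
    return {artist: mean(values) for artist, values in grouped.items()}


# Made-up values for illustration only (not the article's data).
rows = [
    {"artist": "BTS", "speechiness": 0.30},
    {"artist": "BTS", "speechiness": 0.10},
    {"artist": "Other", "speechiness": 0.05},
    {"artist": "Other", "speechiness": 0.07},
]

averages = average_by_artist(rows, "speechiness")
print(averages)
```

Running the same function once per feature gives a per-artist table of averages.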

-

We see here that since 2013 (when the group debuted), measurements in liveness, danceability, speechiness, and acousticness all fell gradually until 2016, when the numbers started to go back up. Conversely, energy in their tracks increased until that same year, after which the numbers began declining.

-

Focusing on speechiness alone, Never Mind by BTS had one of the highest levels, with a score of 0.90! If you check out the music video below, you will see that the song is a mixture of pop, hip-hop and rap, resulting in a high rate of vocals in the track.

Visual comparisons among artists can also prove useful for our problem of addressing group identity and uniqueness.

The distribution plot above illustrates an interesting piece of information. While BigBang and iKon’s music generally centers around a specific tempo (~125 and ~90 respectively), BTS’ music is far more spread out across tempos. This indicates that BTS’ music is diverse in rhythm and speed, comprising songs that are both extremely fast and extremely slow in pace.

I wanted to hear for myself how different the pace was. The BTS track with the slowest tempo was Butterfly, which is a Ballad-EDM mix. The track with one of the fastest tempos was MAMA, which leans toward straight-up hip-hop.
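One way to put a number on how spread out an artist's tempos are is to compare the standard deviation and range of the tempo values. The sketch below assumes each artist's tempos are available as a plain list; the two groups and their values are invented for illustration and are not the article's measurements.

```python
from statistics import mean, stdev


def tempo_summary(tempos):
    """Summarize where tempos center and how widely they are spread."""
    return {
        "mean": mean(tempos),
        "stdev": stdev(tempos),
        "range": max(tempos) - min(tempos),
    }


# Hypothetical tempo lists in beats per minute (not real data).
tempos_by_group = {
    "wide_group": [70.0, 95.0, 120.0, 150.0, 180.0],
    "narrow_group": [122.0, 124.0, 125.0, 126.0, 128.0],
}

spread = {name: tempo_summary(t) for name, t in tempos_by_group.items()}
for name, summary in spread.items():
    print(name, summary)
```

A larger standard deviation corresponds to a flatter, wider curve in a distribution plot, while a small one corresponds to a sharp peak around a single tempo.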

Not the original for ‘MAMA’. Couldn’t find it on Soundcloud.

-

iKon is especially relevant for a BTS comparison, since they are also a pop/hip-hop hybrid group (BigBang too). But their songs are much slower on average, like Welcome Back, which hovers around the 90 tempo line. My data showed that their fastest tracks were either remixes or live concert versions of originals.

Data Modeling: Computing the data (through A.I.)

These numbers are fairly deterministic, with little unpredictability in the data (unlike, say, user behavior data), so it’s relatively easy for machine learning algorithms to learn which musical attributes are important for distinguishing one K-Pop group from another. As such, I built a simple classifier in which machine learning models learn these 11 musical features and try to predict whether a song is by BTS or not.
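Framing the problem this way means every song gets a binary label. A minimal sketch of that labeling step follows, assuming each song is a dictionary with an `artist` field plus numeric features. The helper name, the two-feature subset, and the sample rows are all hypothetical.

```python
def build_dataset(rows, feature_names, target_artist="BTS"):
    """Turn song rows into a feature matrix X and binary labels y."""
    X = [[row[name] for name in feature_names] for row in rows]
    # Label is 1 for the target artist and 0 for every other artist.
    y = [1 if row["artist"] == target_artist else 0 for row in rows]
    return X, y


# Made-up rows for illustration only (not the article's data).
rows = [
    {"artist": "BTS", "speechiness": 0.30, "tempo": 150.0},
    {"artist": "BigBang", "speechiness": 0.08, "tempo": 125.0},
    {"artist": "iKon", "speechiness": 0.10, "tempo": 90.0},
]

X, y = build_dataset(rows, ["speechiness", "tempo"])
print(X, y)
```

Any binary classifier can then be trained on `X` and `y`.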

-

For a more technical audience: I used the ensemble models LightGBM, Gradient Boosting and Random Forest, with a one-vs-all approach where the target label was 1 if the features corresponded to a BTS song and 0 otherwise. Without any model optimization or feature engineering, all three models reached an AUC near 0.9. Oversampling was used to combat class imbalance.
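The article does not say which oversampling method was used, so the sketch below assumes the simplest one: randomly duplicating minority-class rows until both classes are the same size. It also shows how AUC can be computed directly from its definition, as the fraction of (positive, negative) pairs that the model ranks correctly. Both helpers are illustrative and are not the original code.

```python
import random


def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until the classes are balanced."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return list(rows), list(labels)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: one positive row among six.
rows = ["a", "b", "c", "d", "e", "f"]
labels = [1, 0, 0, 0, 0, 0]
balanced_rows, balanced_labels = oversample_minority(rows, labels)

auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(len(balanced_rows), sum(balanced_labels), auc)
```

Oversampling is normally applied to the training split only, so that duplicated rows do not leak into the data used to measure AUC.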

-

The outputs from these machines are extremely relevant in telling us which features are important for differentiating BTS from other K-pop artists. As such, the chart below is the answer to our original problem!

How to read the SHAP Value chart: The SHAP value is a mathematical calculation of how much each feature individually contributes to our machines’ prediction of whether a track is by BTS or not, based on the 11 features.

The higher these features rank on the left y-axis, the more impact they have on the model prediction. The magnitude of positive and negative impact on model output is shown by how spread out the dots are from the center. Positive influence means influence toward BTS prediction, and negative influence toward the other side.

Colors blue and red in the chart indicate the value of a feature. For instance, the red dots represent the higher levels of speechiness and the blue represents the lower values of this feature.

Combining these pieces of information, the plot above shows that speechiness is the most important feature for model prediction. The higher the level of speechiness, the more that song influenced our machines in predicting it to be by BTS.
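To make the idea behind these values concrete, the sketch below computes exact Shapley values by brute force for a tiny hypothetical scoring rule. It is not the article's trained model, and the SHAP library uses much faster algorithms than this; the point is only to show what "individual contribution" means. Each feature's contribution is the average change in the prediction when that feature is switched from a baseline value to the song's actual value, taken over every possible ordering of the features.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            current[name] = instance[name]  # reveal this feature's real value
            value = predict(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_model(x):
    # Hypothetical scoring rule for illustration, not a trained classifier.
    return 2.0 * x["speechiness"] - 1.0 * x["acousticness"]


# Made-up song and baseline (e.g. dataset averages); not real data.
song = {"speechiness": 0.9, "acousticness": 0.1}
baseline = {"speechiness": 0.1, "acousticness": 0.3}

values = shapley_values(toy_model, song, baseline)
print(values)
```

The contributions always add up to the difference between the prediction for the song and the prediction for the baseline, which is why the dots in the chart can be read as pushes toward or away from a BTS prediction.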

What about American artists? BTS vs. Bruno Mars

I decided to take the analysis further by comparing BTS with Euro-American pop musicians. After gathering musical data for artists such as Kelly Clarkson, One Direction, and Bruno Mars, I had the same models try to tell their music apart from that of BTS. I received the following result:

Speechiness again ranked first in feature importance, though the magnitude of influence seems more balanced than in the previous results. Interestingly, energy is not as significant a predictive factor as liveness. The chart below, however, shows BTS’ songs ranking much higher in energy level. Equally noticeable is how BTS songs rank second lowest in acousticness compared to the western artists.

Conclusion

Through techniques in data science, we discovered musical qualities unique to BTS, and how they influenced results provided by our machine learning algorithms.

We found a couple of things:

  1. BTS songs are very diverse and distributed in tempo compared to other male pop artists, especially those that do merge pop with hip-hop.
  2. BTS songs have high rates of vocals and low rates of instruments in their tracks on average.
  3. BTS songs are far more energetic on average than popular western pop artists. Their music is also quite low in acoustic measurements.

If you liked this first part of K-pop meets data science, please stay tuned for Part 2, where I examine K-pop lyrics to see if we can draw equally interesting insights related to BTS.

In the meantime, feel free to check out my other article on pop music and A.I. There, I show how I made a machine that could create pop music itself!

--

Silicon Valley Data Scientist | Former Project Lead @TowardsDataScience (Medium)