Using Deep Learning to Predict Hip-Hop Popularity on Spotify

Can I build a neural network that has good taste in music?

Nicholas Indorf
Towards Data Science

KC Makes Music, my cousin and client. Image courtesy of KC Makes Music.

High-level Summary

(the broad strokes)

In this project, I wanted to build a tool that could help my cousin, a hip-hop artist who records as “KC Makes Music”. The tool would assess whether songs he hasn’t released yet have the potential to be popular on Spotify, the gold-standard music streaming service. I gathered audio preview samples and associated popularity scores for recently released hip-hop tracks from Spotify’s database, processed the previews into Mel spectrograms, and converted the popularity scores into a binary target (popular/not popular). These were fed into a series of neural networks of increasing complexity. The most complex model, a convolutional network feeding into a recurrent (GRU) component, performed best, with 59.8% accuracy and 0.646 ROC-AUC on the holdout set. That is an improvement over the baseline of 9.1 percentage points in accuracy and 0.146 in ROC-AUC. It’s not a mind-blowing result, but it isn’t nothing! Considering the model was limited to audio alone, it’s a promising result. I would not recommend this model for use on individual songs, but it could still be useful across a larger number of samples: it could direct an artist’s attention towards a subset of tracks to focus on, which saves effort and is thus still valuable. Further improvements will only make this a more powerful tool at my client’s disposal.

GitHub link

Context

Audio classification is a classic use case for neural network models. In a previous project, for example, a team I was on built a classifier to distinguish between music genres from this dataset. This has been done in multiple blog posts (blog 1, blog 2) and seems to be a fairly well-known project. In addition, there are audio classification competitions going on right now on Kaggle that would be prime candidates for a similar kind of workflow. This speech recognition competition from Google Brain and this bird call identification competition from Cornell Lab of Ornithology are two big examples.

With these in mind, I wondered how far I could take audio classification. Genres, human speech, and bird calls involve classes whose differences are plain to the human ear. Jazz sounds different than metal, syllables sound different from one another, and different birds have different calls. These are all obvious to a human observer. I knew that machines have a higher capacity for differentiation than humans do, and I found it peculiar that in the realm of audio, we give machines problems that the human ear can solve effortlessly. The most difficult audio example I could find was differentiating COVID-19 coughs from normal coughs, and even there a trained respiratory doctor could probably do decently well. In comparison, we have models that can predict whether water wells in Tanzania are functional or not based on a slew of datapoints about each well. In the field of neural networks, Nature published a deep neural network that looks at an ECG and makes predictions about a heart diagnosis. A human would have significant difficulty sifting through either dataset to make such predictions (except maybe a cardiologist for the second one).

Therefore, I wanted to choose a project that involved audio to make a prediction on something that a human would have difficulty doing. I wanted to push the envelope and do something ambitious. I love music, so I wanted to see if a model trained only on song samples could predict a track’s popularity. This is a pretty common idea (blog 1, blog 2, blog 3), but these all use Spotify’s provided audio features (such as danceability, instrumentalness, and others). They don’t use the audio samples themselves, and I thought that using the raw song samples with a neural net might work out better. It’s also more useful, since you don’t need to rely on Spotify’s metrics and can do this analysis before a song’s release.

Popularity is a tough target — the above blog posts have mixed success with more traditional (i.e. not neural network) techniques. Furthermore, neural networks are typically used with audio data that is relatively easy for the human ear to classify. But a person can’t listen to a song and say “oh yeah, that sounds popular”. They might say that it sounds good but popularity is a bit more nebulous to quantify.

Business Problem

KC Makes Music is the stage name of my cousin, who is a hip-hop artist on Spotify. I thought it would make for an interesting learning experience if I used my data science skills to try and help him gain listeners on the platform. He has had some success, with ~24.5k monthly listeners as of Dec. 2021, but understanding how to gain monthly listeners and build a following tends towards the arcane. Spotify has an extremely robust database, so I knew there was an opportunity here to help my cousin increase his listener base, and therefore expand his reach as an artist.

After looking at the available angles for this project, I decided that one way to go about it was to survey recent hip-hop tracks and try to build a model that could predict whether a previously unreleased song had the potential to be popular in the current scene. I would gather a random assortment of songs (of all popularity scores), take their preview audio files, transform them into spectrograms, and then feed them through a neural network. This would, hopefully, produce a model that has picked up on common features among the popular songs, such that it could tell if a new song would be popular. My cousin has a number of unreleased, finished tracks which we could put through the model. If it’s accurate enough, it could determine which of the tracks would do well. If none of them score well, that would signal to my cousin that his tracks could use more work, and we could re-test them in an iterative manner. Furthermore, I could use tools like LIME to determine what in particular the model finds important and pass that information along to my cousin so he knows what to focus on.

For this project, I decided to simplify the target into a popular/not popular binary, because it is a bit easier for the model to differentiate between two classes than to determine exactly how popular something will be. For my business purposes, there is not much of a difference between popular/not popular and how popular. If I get amazing results, I can always convert this to a regression-type model later, but for now, simplifying the label should help me see the model’s effectiveness more easily. Moreover, it is most important that the model can separate the two classes. A model that mixes up popular and unpopular songs is useless: a false positive means my cousin releases a song that won’t do well, and a false negative means my cousin over-produces a song that was already good as it was. Either way, we ruin good work. With this in mind, accuracy is a useful metric, but ROC-AUC is the preferred metric, as it measures how well the two classes can be separated.

Data Acquisition

All data comes from the Spotify web API. Spotify was chosen not only because it is the platform we are trying to optimize for, but also because it has one of the most comprehensive music databases around. Using Spotipy, a Python interface for the Spotify web API, I gathered information on random songs in the genre “hip-hop” released 2019–2021. Random tracks were generated by adapting a method outlined here.

For more detail on how I acquired the data, please look at the GitHub.
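At a high level, the collection loop looks something like the sketch below. This is only an illustration of the shape of the code: the credentials setup, column names, and query parameters are my assumptions, and the random-track trick from the linked method is simplified here to random search offsets.

import random
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# assumes SPOTIPY_CLIENT_ID / SPOTIPY_CLIENT_SECRET are set in the environment
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())

def sample_hiphop_tracks(n_batches=10):
    """Pull batches of hip-hop tracks released 2019-2021 at random offsets."""
    rows = []
    for _ in range(n_batches):
        results = sp.search(q='genre:hip-hop year:2019-2021', type='track',
                            limit=50, offset=random.randint(0, 950))
        for t in results['tracks']['items']:
            rows.append({'track': t['name'],
                         'artist': t['artists'][0]['name'],
                         'duration_ms': t['duration_ms'],
                         'popularity': t['popularity'],
                         'preview_url': t['preview_url']})
    return pd.DataFrame(rows)

track_df = sample_hiphop_tracks()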

Ultimately, 16168 tracks were selected for inclusion. Most critically, the data includes an HTTP link to a ~30 second mp3 preview audio file and a popularity score. This score ranges from 0 to 100, with 100 being the most popular. Popularity was trimodal, with a huge peak at 0, a small peak at ~28, and a large peak at ~45.

Taking a Closer Look

Dataset: an examination of completeness and representativeness

This is most easily done by looking at a few representative artists whom I know have released songs in the past 3 years: in this case, Drake, Kanye West, and KC Makes Music (my cousin). For more detail on the code, please check out the GitHub.
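The check itself can be as simple as counting collected tracks per artist. A hypothetical sketch (the dataframe and column names are assumptions, using the track_df_expart frame referenced later):

# count how many collected tracks belong to each spot-check artist
for artist in ['Drake', 'Kanye West', 'KC Makes Music']:
    n_tracks = (track_df_expart['artist'] == artist).sum()
    print(f'{artist}: {n_tracks} tracks in the dataset')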

Drake had ~140 tracks represented. Seems about right.

Kanye West had 9. Definitely not accurate, since he had released Donda not too long ago.

KC Makes Music is completely missing!

This dataset doesn’t seem as complete as it could be. There are a lot of songs with a tremendous diversity of artists and popularity, but it’s missing songs I would have expected to be present. For example, Kanye West released Donda not too long ago, and I don’t see anything from that album here. Furthermore, my cousin’s music is not represented at all.

As I said, it’s still a lot of songs so it’s probably good enough for now. I’ll proceed with what I have and if I need more I will investigate more later.

Popularity: an examination of the target

track_df['popularity'].hist(bins=100)
Popularity score (x-axis) vs counts (y-axis). Image by author

40 or higher seems to be a decent cutoff, but I’ll select the final cutoff after the train-test-holdout split to keep data leakage to a minimum.

As you’ll see later on, 39 ended up being the cutoff number, so this was close.

Other interesting tidbit — songs without mp3s skew slightly more popular. Surprising!

track_df[track_df['preview_url'].isna()]['popularity'].hist(bins=100)
Popularity score (x-axis) vs counts (y-axis). Image by author
# before removal of nulls/duplicate links
track_df['popularity'].hist(bins=100)
Popularity score (x-axis) vs counts (y-axis). Image by author
# after removal of nulls/duplicate links
mp3s['popularity'].hist(bins=100)
Popularity score (x-axis) vs counts (y-axis). Image by author

As you can see, the popularity distribution for songs with preview mp3s is about the same as for all unique songs collected. Even though songs without mp3s skew slightly more popular, removing them does not introduce any meaningful bias.

Artists: an examination of average popularities

Just to satisfy my curiosity, and to place popularity values into context, I looked at the distribution of mean song popularities of artists in the dataset. I used the same track_df_expart from above to take a look.
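The aggregation is straightforward; a rough sketch of what I did (column names assumed):

# mean popularity per artist, most popular first
artist_popularity = (track_df_expart
                     .groupby('artist')['popularity']
                     .mean()
                     .sort_values(ascending=False))
artist_popularity.hist(bins=50)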

A lot of artists who I’d consider very popular are below 70 on average, and a lot of artists I’ve never heard of have very high average popularities…

Average popularity scores for some decently famous artists. Image by author

With this in mind, the above 40 estimate seems to be a very solid line to draw for “popular”. Again, I’ll look at the training data only, but I expect it to have a similar distribution due to the large sample size.

Songs: an examination of repeat values

I saw that some songs have different mp3 preview http links but are in fact the same song, just on different albums. I’ll investigate further.

# find duplicates based on track name and the duration
# lots of repeats -- 652 in the dataset
mp3s[mp3s.duplicated(subset=['track', 'duration_ms'], keep=False)]['track'].value_counts()

6 'N the Mornin' 6
3 Headed Goat (feat. Lil Baby & Polo G) 6
Zulu Screams (feat. Maleek Berry & Bibi Bourelly) 4
How TF (feat. 6LACK) 4
durag activity (with Travis Scott) 4
..
Zu Besuch 2
50 in Da Safe (feat. Pink Sweat$) 2
The Announcement (Sex Drugs & Rock and Roll) 2
Shotta Flow 2
I'LL TAKE YOU ON (feat. Charlie Wilson) 2
Name: track, Length: 652, dtype: int64

Many duplicates, as you can see. Let’s take a look at the top example.

mp3s[mp3s['track'] == "3 Headed Goat (feat. Lil Baby & Polo G)"]
The most duplicated song, in tabular form. Table by author

Wow, this one song has 2 single entries, 2 album entries, and 2 deluxe album entries, all with different preview mp3 links (not pictured). All of them have different popularity scores, as well!

Although there are duplicate songs from different albums (single, album, etc.), they often have different popularity scores, which is still valuable info. I kept the repeats so long as their popularity scores are different. This removed only 26 entries:

mp3s[mp3s.duplicated(subset=['track', 'duration_ms', 'popularity'],
                     keep=False)]['track'].value_counts()
6 'N the Mornin' 6
9ja Hip Hop 3
Face Off 2
8 Figures 2
80's - Instrumental - Remastered 2
3 Headed Goat (feat. Lil Baby & Polo G) 2
Studio 54 2
SAME THING 2
One Punch Wulf 2
50/50 Love 2
96 Freestyle 2
6itch remix - feat. Nitro 2
6565 2
Zero Survivors 2
Jazz Hands 2
Just Mellow - Norman Cook 7'' Remix 2
Aries (feat. Peter Hook and Georgia) 2
60% 2
Sex Cells 2
Seven Day Hustle 2
Ring (feat. Young Thug) 2
8 Missed Calls 2
Name: track, dtype: int64
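The removal step itself isn’t shown above, but it presumably boils down to a single drop_duplicates call along these lines (a sketch; the exact keep behavior is an assumption):

# keep one copy where track, duration, AND popularity all match exactly;
# repeats of the same song with different popularity scores survive
mp3s = mp3s.drop_duplicates(subset=['track', 'duration_ms', 'popularity'], keep='first')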

Target & Librosa Processing

Now that I have my extracted dataset, I need to process my target into a binary-encoded variable, and my mp3 preview links into Mel spectrograms with librosa before I can pass both into a neural network.

As I mentioned earlier, popularity is being simplified into a binary target to make the problem easier for the model, and because knowing exactly how popular a song is isn’t much more useful for my purposes than knowing whether it is popular at all.

Mel spectrograms are a well-known way to represent audio data visually. A Mel spectrogram decomposes audio into its component frequencies (the Mel scale more closely resembles how humans perceive pitch) and displays how the frequency content is distributed over time. In this way, patterns involving beat, timbre, etc. can be detected by a model.

from sklearn.model_selection import train_test_split

# making train test holdout splits
# train = 75%, test = 15%, holdout = 10%

X = mp3s.drop(columns=['popularity'])
y = mp3s['popularity']

X_pretr, X_holdout, y_pretr, y_holdout = train_test_split(X, y, test_size=0.10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_pretr, y_pretr, test_size=15/90, random_state=42)
print(X_train.shape, X_test.shape, X_holdout.shape)
(12125, 9) (2426, 9) (1617, 9)

Popularity

39 seems to be a good cutoff.

fig, ax = plt.subplots(figsize=(10,6))
y_train.hist(bins=95, ax=ax)
ax.grid(False)
ax.set(title='Popularity', xlabel='Score', ylabel='Song Count')
The final popularity cutoff (text and cutoff mark added post-render). Image by author
# defining popular as >= 39 and encoding (1 = popular)
y_train = y_train.map(lambda x: 1 if x >= 39 else 0)
y_train.value_counts(normalize=True)
0 0.512
1 0.488
Name: popularity, dtype: float64
y_test = y_test.map(lambda x: 1 if x >= 39 else 0)
y_test.value_counts(normalize=True)
0 0.516076
1 0.483924
Name: popularity, dtype: float64
y_holdout = y_holdout.map(lambda x: 1 if x >= 39 else 0)
y_holdout.value_counts(normalize=True)
0 0.506494
1 0.493506
Name: popularity, dtype: float64

Mel Spectrogram Processing

Look at the GitHub for more details on the code; there is far too much of it for a blog post.

Generally, the workflow went like this (a rough sketch follows the list):

  1. Get .mp3s from the http link
  2. Convert .mp3 to .wav with pydub.AudioSegment
  3. Get mel spectrograms for the train, test, and holdout splits
  4. Scale the spectrograms (simple min-max, fit to the training data)
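Here is a minimal sketch of steps 1 through 4. The sample rate, Mel-band count, and use of in-memory buffers are my assumptions (they roughly match the 128 x 1292 spectrogram shape shown later); the repo’s actual code differs in detail, and names like train_urls are placeholders.

import io
import requests
import numpy as np
import librosa
from pydub import AudioSegment  # requires ffmpeg for mp3 decoding

def url_to_melspec(preview_url, sr=22050, n_mels=128):
    # 1. download the ~30 second .mp3 preview
    mp3_bytes = io.BytesIO(requests.get(preview_url).content)
    # 2. convert .mp3 -> .wav with pydub.AudioSegment
    wav_bytes = io.BytesIO()
    AudioSegment.from_mp3(mp3_bytes).export(wav_bytes, format='wav')
    wav_bytes.seek(0)
    # 3. load the audio and compute a dB-scaled Mel spectrogram
    y, sr = librosa.load(wav_bytes, sr=sr, duration=30)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)

# 4. simple min-max scaling, with the min/max fit on the training split only
# X_train = np.stack([url_to_melspec(url) for url in train_urls])[..., np.newaxis]
# lo, hi = X_train.min(), X_train.max()
# X_train = (X_train - lo) / (hi - lo)   # apply the same lo/hi to test/holdout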

Let’s take a look at an example waveform. This is “INDUSTRY BABY (feat. Jack Harlow)” by Lil Nas X — a decidedly popular song, with an original popularity score of 90. As you can see, it is 30 seconds long with a variety of amplitudes.

# visualize waveform
y, sr = librosa.load('data/X_train/wav/10564.wav',duration=30)
fig, ax = plt.subplots(figsize=(10,6))
librosa.display.waveshow(y, sr=sr, ax=ax);
ax.set(title='Waveform', xlabel='Time (sec)', ylabel='Amplitude')
plt.savefig('images/waveform.png');
Waveform. Image by author

Below is the corresponding Mel spectrogram. What’s interesting is you can sort of make out the shapes of chords, melodies, and rhythms (among other things) as the song progresses. This is very important for the model.

fig, ax = plt.subplots(figsize=(10,6))
img = librosa.display.specshow(X_train[4118][:,:,0], x_axis='time',
                               y_axis='mel', fmax=11025, ax=ax)
fig.colorbar(img, ax=ax, format='%+.1f dB')
ax.set(title='Mel Spectrogram', xlabel='Time (sec)', ylabel='Frequency (mels)')
plt.savefig('images/melspec.png');
Corresponding Mel spectrogram (unscaled). Image by author

Modeling

Spectrograms are a type of image, so this workflow is similar to an image classification problem using neural networks.

I begin with a baseline understanding, then move to a traditional multilayer perceptron. I develop the neural networks with a couple of different convolutional configurations, and then finish with a Convolutional Neural Net that feeds into a Gated Recurrent Unit, a type of Recurrent Neural Network.

As I increase the complexity of the model, the goal is to increase its ability to pick up on underlying patterns in the spectrograms that relate to the target. I’ll explain my thought process in more detail as I introduce each model iteration.

Baseline Understanding

As I train my neural networks I will need a sense of how an untrained/useless neural network is going to perform. I will get this information by looking at a dummy classifier that simply predicts the majority class for every song.

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

dummy = DummyClassifier(random_state=42)
dummy.fit(X_train, y_train)
dummypreds = dummy.predict(X_test)

print(f"Dummy Accuracy: {accuracy_score(y_test, dummypreds)}")
print(f"Dummy ROC-AUC: {roc_auc_score(y_test, dummypreds)}")
Dummy Accuracy: 0.516075845012366
Dummy ROC-AUC: 0.5

As expected, this did quite poorly. These are good numbers to have in mind as to how a model that only predicts “not popular” would perform.

Multilayer Perceptron

Now that we have an understanding of the worst-case scenario, we can move into the simplest variety of neural networks. I thought that a single-layer perceptron would be particularly useless, so I opted for a slightly-more-complex but still simple 2-layer dense model.

128 and 64 nodes were chosen relatively arbitrarily based on prior experiences with these types of models.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.random import set_seed

input_shape = X_train.shape[1:]
batch_size = X_train.shape[0]/100

# set random seed for reproducibility
np.random.seed(42)
set_seed(42)

# build sequentially
mlp = keras.Sequential(name='mlp')

# flatten input 3D tensor to 1D
mlp.add(layers.Flatten(input_shape=input_shape))

# two hidden layers
mlp.add(layers.Dense(128, activation='relu'))
mlp.add(layers.Dense(64, activation='relu'))

# output layer
mlp.add(layers.Dense(1, activation='sigmoid'))

# compile perceptron
mlp.compile(loss='binary_crossentropy',
            optimizer="adam",
            metrics=['accuracy', 'AUC'])

# take a look at model architecture
mlp.summary()
Model: "mlp"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
flatten (Flatten) (None, 165376) 0

dense (Dense) (None, 128) 21168256

dense_1 (Dense) (None, 64) 8256

dense_2 (Dense) (None, 1) 65

=================================================================
Total params: 21,176,577
Trainable params: 21,176,577
Non-trainable params: 0
_________________________________________________________________
mlp_history = mlp.fit(X_train, y_train, epochs=20, batch_size=30,
                      validation_data=(X_test, y_test))
Epoch 1/20
346/346 [==============================] - 10s 28ms/step - loss: 0.8132 - accuracy: 0.4975 - auc: 0.5008 - val_loss: 0.8145 - val_accuracy: 0.5161 - val_auc: 0.4437
Epoch 2/20
346/346 [==============================] - 9s 27ms/step - loss: 0.7078 - accuracy: 0.5031 - auc: 0.5002 - val_loss: 0.6934 - val_accuracy: 0.4971 - val_auc: 0.5652
Epoch 3/20
346/346 [==============================] - 10s 28ms/step - loss: 0.7144 - accuracy: 0.5046 - auc: 0.5035 - val_loss: 0.6919 - val_accuracy: 0.5161 - val_auc: 0.5573
Epoch 4/20
346/346 [==============================] - 10s 29ms/step - loss: 0.6962 - accuracy: 0.5039 - auc: 0.5015 - val_loss: 0.6923 - val_accuracy: 0.5161 - val_auc: 0.5369
Epoch 5/20
346/346 [==============================] - 11s 30ms/step - loss: 0.6930 - accuracy: 0.5097 - auc: 0.5040 - val_loss: 0.6934 - val_accuracy: 0.5161 - val_auc: 0.5004
Epoch 6/20
346/346 [==============================] - 9s 27ms/step - loss: 0.6933 - accuracy: 0.5077 - auc: 0.5010 - val_loss: 0.6926 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 7/20
346/346 [==============================] - 10s 28ms/step - loss: 0.6929 - accuracy: 0.5112 - auc: 0.4996 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 8/20
346/346 [==============================] - 10s 28ms/step - loss: 0.6930 - accuracy: 0.5112 - auc: 0.4888 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 9/20
346/346 [==============================] - 10s 28ms/step - loss: 0.6930 - accuracy: 0.5112 - auc: 0.4966 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 10/20
346/346 [==============================] - 10s 28ms/step - loss: 0.6930 - accuracy: 0.5112 - auc: 0.4961 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 11/20
346/346 [==============================] - 10s 28ms/step - loss: 0.6930 - accuracy: 0.5112 - auc: 0.4947 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 12/20
346/346 [==============================] - 10s 28ms/step - loss: 0.6929 - accuracy: 0.5112 - auc: 0.4977 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 13/20
346/346 [==============================] - 10s 28ms/step - loss: 0.6930 - accuracy: 0.5112 - auc: 0.4937 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 14/20
346/346 [==============================] - 10s 28ms/step - loss: 0.6930 - accuracy: 0.5112 - auc: 0.4994 - val_loss: 0.6928 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 15/20
346/346 [==============================] - 10s 28ms/step - loss: 0.6929 - accuracy: 0.5112 - auc: 0.5009 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 16/20
346/346 [==============================] - 10s 28ms/step - loss: 0.6930 - accuracy: 0.5112 - auc: 0.4951 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 17/20
346/346 [==============================] - 10s 30ms/step - loss: 0.6930 - accuracy: 0.5112 - auc: 0.4899 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 18/20
346/346 [==============================] - 10s 28ms/step - loss: 0.6930 - accuracy: 0.5112 - auc: 0.4937 - val_loss: 0.6926 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 19/20
346/346 [==============================] - 10s 28ms/step - loss: 0.6930 - accuracy: 0.5112 - auc: 0.5026 - val_loss: 0.6929 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 20/20
346/346 [==============================] - 10s 28ms/step - loss: 0.6930 - accuracy: 0.5112 - auc: 0.4942 - val_loss: 0.6928 - val_accuracy: 0.5161 - val_auc: 0.5000
visualize_training_results(mlp_history)
MLP results. Image by author

The multilayer perceptron did not lead to a particularly good result, as expected. After 5 epochs, the metrics plateaued, even in the training data. That tells us that this model is struggling quite hard to pick up on anything useful. The test accuracy and ROC-AUC also match the dummy classifier, so it’s definitively useless.

Convolutional Neural Network

Let’s add convolutional layers to help the network parse visual information better. Generally speaking, convolutional architectures are used in image processing to group together larger and larger pieces of an image.

This is typically done with a convolutional component, which is then fed to a densely connected perceptron.

The below architecture is adapted from a previous project where we built a genre classifier using a similar process as that used here. The GitHub repo for that project can be found here.

# set random seed for reproducibility
np.random.seed(42)
set_seed(42)

# build sequentially
cnn1 = keras.Sequential(name='cnn1')

# convolutional and max pooling layers with successively more filters
cnn1.add(layers.Conv2D(16, (3, 3), activation='relu', padding='same', input_shape=input_shape))
cnn1.add(layers.MaxPooling2D((2, 4)))

cnn1.add(layers.Conv2D(32, (3, 3), activation='relu', padding='same'))
cnn1.add(layers.MaxPooling2D((2, 4)))

cnn1.add(layers.Conv2D(64, (3, 3), activation='relu', padding='same'))
cnn1.add(layers.MaxPooling2D((2, 2)))

cnn1.add(layers.Conv2D(128, (3, 3), activation='relu', padding='same'))
cnn1.add(layers.MaxPool2D((2, 2)))

# fully-connected layers for output
cnn1.add(layers.Flatten())
cnn1.add(layers.Dense(128, activation='relu'))
cnn1.add(layers.Dropout(0.3))
cnn1.add(layers.Dense(64, activation='relu'))
cnn1.add(layers.Dropout(0.3))

# output layer
cnn1.add(layers.Dense(1, activation='sigmoid'))

# compile cnn
cnn1.compile(loss='binary_crossentropy',
             optimizer="adam",
             metrics=['accuracy', 'AUC'])

# take a look at model architecture
cnn1.summary()
Model: "cnn1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 128, 1292, 16) 160

max_pooling2d (MaxPooling2D (None, 64, 323, 16) 0
)

conv2d_1 (Conv2D) (None, 64, 323, 32) 4640

max_pooling2d_1 (MaxPooling (None, 32, 80, 32) 0
2D)

conv2d_2 (Conv2D) (None, 32, 80, 64) 18496

max_pooling2d_2 (MaxPooling (None, 16, 40, 64) 0
2D)

conv2d_3 (Conv2D) (None, 16, 40, 128) 73856

max_pooling2d_3 (MaxPooling (None, 8, 20, 128) 0
2D)

flatten_1 (Flatten) (None, 20480) 0

dense_3 (Dense) (None, 128) 2621568

dropout (Dropout) (None, 128) 0

dense_4 (Dense) (None, 64) 8256

dropout_1 (Dropout) (None, 64) 0

dense_5 (Dense) (None, 1) 65

=================================================================
Total params: 2,727,041
Trainable params: 2,727,041
Non-trainable params: 0
_________________________________________________________________
cnn1_history = cnn1.fit(X_train, y_train, epochs=20, batch_size=100,
                        validation_data=(X_test, y_test))
Epoch 1/20
104/104 [==============================] - 152s 1s/step - loss: 0.6937 - accuracy: 0.5081 - auc: 0.4969 - val_loss: 0.6926 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 2/20
104/104 [==============================] - 152s 1s/step - loss: 0.6930 - accuracy: 0.5074 - auc: 0.5047 - val_loss: 0.6929 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 3/20
104/104 [==============================] - 151s 1s/step - loss: 0.6932 - accuracy: 0.5114 - auc: 0.4907 - val_loss: 0.6928 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 4/20
104/104 [==============================] - 151s 1s/step - loss: 0.6929 - accuracy: 0.5112 - auc: 0.5015 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5043
Epoch 5/20
104/104 [==============================] - 151s 1s/step - loss: 0.6930 - accuracy: 0.5112 - auc: 0.4978 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 6/20
104/104 [==============================] - 151s 1s/step - loss: 0.6929 - accuracy: 0.5112 - auc: 0.4984 - val_loss: 0.6926 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 7/20
104/104 [==============================] - 150s 1s/step - loss: 0.6928 - accuracy: 0.5111 - auc: 0.5071 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5032
Epoch 8/20
104/104 [==============================] - 148s 1s/step - loss: 0.6929 - accuracy: 0.5120 - auc: 0.5002 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 9/20
104/104 [==============================] - 149s 1s/step - loss: 0.6930 - accuracy: 0.5114 - auc: 0.4937 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 10/20
104/104 [==============================] - 152s 1s/step - loss: 0.6930 - accuracy: 0.5116 - auc: 0.4982 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 11/20
104/104 [==============================] - 152s 1s/step - loss: 0.6930 - accuracy: 0.5108 - auc: 0.4994 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 12/20
104/104 [==============================] - 155s 1s/step - loss: 0.6928 - accuracy: 0.5112 - auc: 0.5051 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 13/20
104/104 [==============================] - 150s 1s/step - loss: 0.6930 - accuracy: 0.5111 - auc: 0.4957 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 14/20
104/104 [==============================] - 150s 1s/step - loss: 0.6930 - accuracy: 0.5112 - auc: 0.4986 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 15/20
104/104 [==============================] - 153s 1s/step - loss: 0.6930 - accuracy: 0.5118 - auc: 0.4976 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 16/20
104/104 [==============================] - 150s 1s/step - loss: 0.6930 - accuracy: 0.5112 - auc: 0.4946 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 17/20
104/104 [==============================] - 149s 1s/step - loss: 0.6929 - accuracy: 0.5112 - auc: 0.4928 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 18/20
104/104 [==============================] - 155s 1s/step - loss: 0.6930 - accuracy: 0.5112 - auc: 0.4981 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 19/20
104/104 [==============================] - 150s 1s/step - loss: 0.6930 - accuracy: 0.5112 - auc: 0.4974 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 20/20
104/104 [==============================] - 149s 1s/step - loss: 0.6929 - accuracy: 0.5112 - auc: 0.4956 - val_loss: 0.6927 - val_accuracy: 0.5161 - val_auc: 0.5000
visualize_training_results(cnn1_history)
First CNN results. Image by author

The first CNN did not do very well either. It seems more reactive than the multilayer perceptron, but not in a way that produces results. The test metrics eventually settled at dummy-classifier values, so this is a useless model as well.

CNN Round 2

The same architecture as above, just without the second dense layer and the dropout layers. This matches the first iteration model in the Music Genre Classification project I mentioned earlier.

While working on that project, I noticed that if you connect the CNN to a single dense layer, it performs better. It is maybe unintuitive, but I wanted to try it out.

# set random seed for reproducibility
np.random.seed(42)
set_seed(42)

# build sequentially
cnn2 = keras.Sequential(name='cnn2')

# convolutional and max pooling layers with successively more filters
cnn2.add(layers.Conv2D(16, (3, 3), activation='relu', padding='same', input_shape=input_shape))
cnn2.add(layers.MaxPooling2D((2, 4)))

cnn2.add(layers.Conv2D(32, (3, 3), activation='relu', padding='same'))
cnn2.add(layers.MaxPooling2D((2, 4)))

cnn2.add(layers.Conv2D(64, (3, 3), activation='relu', padding='same'))
cnn2.add(layers.MaxPooling2D((2, 2)))

cnn2.add(layers.Conv2D(128, (3, 3), activation='relu', padding='same'))
cnn2.add(layers.MaxPool2D((2, 2)))

# fully-connected layer for output
cnn2.add(layers.Flatten())
cnn2.add(layers.Dense(128, activation='relu'))


# output layer
cnn2.add(layers.Dense(1, activation='sigmoid'))

# compile cnn
cnn2.compile(loss='binary_crossentropy',
             optimizer="adam",
             metrics=['accuracy', 'AUC'])

# take a look at model architecture
cnn2.summary()
Model: "cnn2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_4 (Conv2D) (None, 128, 1292, 16) 160

max_pooling2d_4 (MaxPooling (None, 64, 323, 16) 0
2D)

conv2d_5 (Conv2D) (None, 64, 323, 32) 4640

max_pooling2d_5 (MaxPooling (None, 32, 80, 32) 0
2D)

conv2d_6 (Conv2D) (None, 32, 80, 64) 18496

max_pooling2d_6 (MaxPooling (None, 16, 40, 64) 0
2D)

conv2d_7 (Conv2D) (None, 16, 40, 128) 73856

max_pooling2d_7 (MaxPooling (None, 8, 20, 128) 0
2D)

flatten_1 (Flatten) (None, 20480) 0

dense_2 (Dense) (None, 128) 2621568

dense_3 (Dense) (None, 1) 129

=================================================================
Total params: 2,718,849
Trainable params: 2,718,849
Non-trainable params: 0
_________________________________________________________________
cnn2_history = cnn2.fit(X_train, y_train, epochs=20, batch_size=100,
                        validation_data=(X_test, y_test))
Epoch 1/20
104/104 [==============================] - 144s 1s/step - loss: 0.6943 - accuracy: 0.5069 - auc: 0.5027 - val_loss: 0.6929 - val_accuracy: 0.5161 - val_auc: 0.4861
Epoch 2/20
104/104 [==============================] - 140s 1s/step - loss: 0.6932 - accuracy: 0.5074 - auc: 0.4978 - val_loss: 0.6930 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 3/20
104/104 [==============================] - 139s 1s/step - loss: 0.6932 - accuracy: 0.5088 - auc: 0.5013 - val_loss: 0.6930 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 4/20
104/104 [==============================] - 139s 1s/step - loss: 0.6930 - accuracy: 0.5095 - auc: 0.4984 - val_loss: 0.6928 - val_accuracy: 0.5161 - val_auc: 0.5102
Epoch 5/20
104/104 [==============================] - 140s 1s/step - loss: 0.6929 - accuracy: 0.5121 - auc: 0.5119 - val_loss: 0.6926 - val_accuracy: 0.5161 - val_auc: 0.5000
Epoch 6/20
104/104 [==============================] - 139s 1s/step - loss: 0.6928 - accuracy: 0.5112 - auc: 0.5153 - val_loss: 0.6925 - val_accuracy: 0.5161 - val_auc: 0.5096
Epoch 7/20
104/104 [==============================] - 138s 1s/step - loss: 0.6927 - accuracy: 0.5144 - auc: 0.5148 - val_loss: 0.6915 - val_accuracy: 0.5474 - val_auc: 0.5670
Epoch 8/20
104/104 [==============================] - 137s 1s/step - loss: 0.6934 - accuracy: 0.5106 - auc: 0.5047 - val_loss: 0.6924 - val_accuracy: 0.5161 - val_auc: 0.5395
Epoch 9/20
104/104 [==============================] - 137s 1s/step - loss: 0.6925 - accuracy: 0.5115 - auc: 0.5183 - val_loss: 0.6931 - val_accuracy: 0.5161 - val_auc: 0.4675
Epoch 10/20
104/104 [==============================] - 137s 1s/step - loss: 0.6923 - accuracy: 0.5163 - auc: 0.5186 - val_loss: 0.6936 - val_accuracy: 0.4918 - val_auc: 0.5717
Epoch 11/20
104/104 [==============================] - 138s 1s/step - loss: 0.6926 - accuracy: 0.5178 - auc: 0.5260 - val_loss: 0.6917 - val_accuracy: 0.5260 - val_auc: 0.5594
Epoch 12/20
104/104 [==============================] - 140s 1s/step - loss: 0.6896 - accuracy: 0.5247 - auc: 0.5415 - val_loss: 0.6869 - val_accuracy: 0.5400 - val_auc: 0.5728
Epoch 13/20
104/104 [==============================] - 136s 1s/step - loss: 0.6876 - accuracy: 0.5356 - auc: 0.5477 - val_loss: 0.6862 - val_accuracy: 0.5396 - val_auc: 0.5769
Epoch 14/20
104/104 [==============================] - 136s 1s/step - loss: 0.6863 - accuracy: 0.5379 - auc: 0.5572 - val_loss: 0.6863 - val_accuracy: 0.5408 - val_auc: 0.5772
Epoch 15/20
104/104 [==============================] - 137s 1s/step - loss: 0.6854 - accuracy: 0.5460 - auc: 0.5625 - val_loss: 0.6850 - val_accuracy: 0.5433 - val_auc: 0.5788
Epoch 16/20
104/104 [==============================] - 143s 1s/step - loss: 0.6835 - accuracy: 0.5496 - auc: 0.5703 - val_loss: 0.6855 - val_accuracy: 0.5400 - val_auc: 0.5811
Epoch 17/20
104/104 [==============================] - 137s 1s/step - loss: 0.6878 - accuracy: 0.5363 - auc: 0.5500 - val_loss: 0.6854 - val_accuracy: 0.5412 - val_auc: 0.5762
Epoch 18/20
104/104 [==============================] - 136s 1s/step - loss: 0.6839 - accuracy: 0.5442 - auc: 0.5686 - val_loss: 0.6882 - val_accuracy: 0.5466 - val_auc: 0.5743
Epoch 19/20
104/104 [==============================] - 136s 1s/step - loss: 0.6833 - accuracy: 0.5532 - auc: 0.5707 - val_loss: 0.6856 - val_accuracy: 0.5425 - val_auc: 0.5710
Epoch 20/20
104/104 [==============================] - 136s 1s/step - loss: 0.6832 - accuracy: 0.5545 - auc: 0.5717 - val_loss: 0.6837 - val_accuracy: 0.5453 - val_auc: 0.5821
visualize_training_results(cnn2_history)
Second CNN results. Image by author

Wow, it actually learned a little something! Neat! The test metrics did not end at dummy-classifier values, so I can be confident the model picked up on something real. Towards the later epochs we can see it starting to overfit on the training data, but the test scores plateaued at a decent spot by the time training finished.

Recurrent Neural Network

Finally, we’ll try this flavor of recurrent neural network. I am taking the below architecture from this paper and adapting it to binary classification. In that paper, the authors sought to classify music artists using this technique.

It uses a convolutional component to start, like the previous two iterations, but then the output is fed into a Gated Recurrent Unit, which “summarize[s] temporal structure following 2D convolution”. In other words, it allows the neural network to pick up on time patterns using what it learns from the convolutional portion. In the architecture below, the convolutional stack collapses the frequency axis down to 1, leaving 80 time steps with 128 feature maps each; the Reshape layer then turns that output into an 80-step sequence of 128-dimensional vectors for the GRU layers to process.

Pretty neat, huh?

# set random seed for reproducibility
np.random.seed(42)
set_seed(42)

# build sequentially
rnn = keras.Sequential(name='rnn')

# convolutional and max pooling layers with successively more filters
rnn.add(layers.Conv2D(64, (3, 3), activation='elu', padding='same', input_shape=input_shape))
rnn.add(layers.MaxPooling2D((2, 2)))
rnn.add(layers.Dropout(0.1))

rnn.add(layers.Conv2D(128, (3, 3), activation='elu', padding='same'))
rnn.add(layers.MaxPooling2D((4, 2)))
rnn.add(layers.Dropout(0.1))

rnn.add(layers.Conv2D(128, (3, 3), activation='elu', padding='same'))
rnn.add(layers.MaxPooling2D((4, 2)))
rnn.add(layers.Dropout(0.1))

rnn.add(layers.Conv2D(128, (3, 3), activation='elu', padding='same'))
rnn.add(layers.MaxPool2D((4, 2)))
rnn.add(layers.Dropout(0.1))

rnn.add(layers.Reshape((80,128)))
rnn.add(layers.GRU(units=32, dropout=0.3, return_sequences=True))
rnn.add(layers.GRU(units=32, dropout=0.3))

# output layer
rnn.add(layers.Dense(1, activation='sigmoid'))

# compile rnn
rnn.compile(loss='binary_crossentropy',
            optimizer="adam",
            metrics=['accuracy', 'AUC'])

# take a look at model architecture
rnn.summary()
Model: "rnn"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 128, 1292, 64) 640

max_pooling2d (MaxPooling2D (None, 64, 646, 64) 0
)

dropout (Dropout) (None, 64, 646, 64) 0

conv2d_1 (Conv2D) (None, 64, 646, 128) 73856

max_pooling2d_1 (MaxPooling (None, 16, 323, 128) 0
2D)

dropout_1 (Dropout) (None, 16, 323, 128) 0

conv2d_2 (Conv2D) (None, 16, 323, 128) 147584

max_pooling2d_2 (MaxPooling (None, 4, 161, 128) 0
2D)

dropout_2 (Dropout) (None, 4, 161, 128) 0

conv2d_3 (Conv2D) (None, 4, 161, 128) 147584

max_pooling2d_3 (MaxPooling (None, 1, 80, 128) 0
2D)

dropout_3 (Dropout) (None, 1, 80, 128) 0

reshape (Reshape) (None, 80, 128) 0

gru (GRU) (None, 80, 32) 15552

gru_1 (GRU) (None, 32) 6336

dense (Dense) (None, 1) 33

=================================================================
Total params: 391,585
Trainable params: 391,585
Non-trainable params: 0
_________________________________________________________________
rnn_history = rnn.fit(X_train, y_train, epochs=30, batch_size=100,
                      validation_data=(X_test, y_test))
Epoch 1/30
104/104 [==============================] - 948s 9s/step - loss: 0.6970 - accuracy: 0.5051 - auc: 0.4990 - val_loss: 0.6913 - val_accuracy: 0.5223 - val_auc: 0.5421
Epoch 2/30
104/104 [==============================] - 933s 9s/step - loss: 0.6932 - accuracy: 0.5156 - auc: 0.5170 - val_loss: 0.6940 - val_accuracy: 0.5161 - val_auc: 0.5173
Epoch 3/30
104/104 [==============================] - 953s 9s/step - loss: 0.6932 - accuracy: 0.5153 - auc: 0.5186 - val_loss: 0.6908 - val_accuracy: 0.5161 - val_auc: 0.5515
Epoch 4/30
104/104 [==============================] - 947s 9s/step - loss: 0.6906 - accuracy: 0.5260 - auc: 0.5372 - val_loss: 0.6883 - val_accuracy: 0.5289 - val_auc: 0.5532
Epoch 5/30
104/104 [==============================] - 960s 9s/step - loss: 0.6881 - accuracy: 0.5328 - auc: 0.5526 - val_loss: 0.6876 - val_accuracy: 0.5317 - val_auc: 0.5649
Epoch 6/30
104/104 [==============================] - 955s 9s/step - loss: 0.6874 - accuracy: 0.5369 - auc: 0.5558 - val_loss: 0.6875 - val_accuracy: 0.5441 - val_auc: 0.5646
Epoch 7/30
104/104 [==============================] - 951s 9s/step - loss: 0.6868 - accuracy: 0.5429 - auc: 0.5609 - val_loss: 0.6863 - val_accuracy: 0.5276 - val_auc: 0.5869
Epoch 8/30
104/104 [==============================] - 961s 9s/step - loss: 0.6896 - accuracy: 0.5329 - auc: 0.5498 - val_loss: 0.6909 - val_accuracy: 0.5293 - val_auc: 0.5688
Epoch 9/30
104/104 [==============================] - 966s 9s/step - loss: 0.6870 - accuracy: 0.5451 - auc: 0.5639 - val_loss: 0.6830 - val_accuracy: 0.5540 - val_auc: 0.5773
Epoch 10/30
104/104 [==============================] - 978s 9s/step - loss: 0.6857 - accuracy: 0.5511 - auc: 0.5658 - val_loss: 0.6826 - val_accuracy: 0.5556 - val_auc: 0.5868
Epoch 11/30
104/104 [==============================] - 963s 9s/step - loss: 0.6865 - accuracy: 0.5436 - auc: 0.5640 - val_loss: 0.6810 - val_accuracy: 0.5676 - val_auc: 0.5884
Epoch 12/30
104/104 [==============================] - 972s 9s/step - loss: 0.6835 - accuracy: 0.5565 - auc: 0.5793 - val_loss: 0.6790 - val_accuracy: 0.5750 - val_auc: 0.6003
Epoch 13/30
104/104 [==============================] - 960s 9s/step - loss: 0.6837 - accuracy: 0.5563 - auc: 0.5772 - val_loss: 0.6851 - val_accuracy: 0.5478 - val_auc: 0.5953
Epoch 14/30
104/104 [==============================] - 969s 9s/step - loss: 0.6816 - accuracy: 0.5640 - auc: 0.5863 - val_loss: 0.6784 - val_accuracy: 0.5709 - val_auc: 0.6051
Epoch 15/30
104/104 [==============================] - 961s 9s/step - loss: 0.6807 - accuracy: 0.5679 - auc: 0.5901 - val_loss: 0.6771 - val_accuracy: 0.5697 - val_auc: 0.6011
Epoch 16/30
104/104 [==============================] - 967s 9s/step - loss: 0.6793 - accuracy: 0.5719 - auc: 0.5957 - val_loss: 0.6834 - val_accuracy: 0.5668 - val_auc: 0.5973
Epoch 17/30
104/104 [==============================] - 961s 9s/step - loss: 0.6801 - accuracy: 0.5682 - auc: 0.5927 - val_loss: 0.6785 - val_accuracy: 0.5783 - val_auc: 0.5988
Epoch 18/30
104/104 [==============================] - 967s 9s/step - loss: 0.6793 - accuracy: 0.5716 - auc: 0.5932 - val_loss: 0.6931 - val_accuracy: 0.5602 - val_auc: 0.5948
Epoch 19/30
104/104 [==============================] - 972s 9s/step - loss: 0.6811 - accuracy: 0.5654 - auc: 0.5888 - val_loss: 0.6843 - val_accuracy: 0.5441 - val_auc: 0.5977
Epoch 20/30
104/104 [==============================] - 958s 9s/step - loss: 0.6759 - accuracy: 0.5754 - auc: 0.6050 - val_loss: 0.6764 - val_accuracy: 0.5779 - val_auc: 0.6049
Epoch 21/30
104/104 [==============================] - 972s 9s/step - loss: 0.6741 - accuracy: 0.5849 - auc: 0.6108 - val_loss: 0.6796 - val_accuracy: 0.5688 - val_auc: 0.5943
Epoch 22/30
104/104 [==============================] - 972s 9s/step - loss: 0.6706 - accuracy: 0.5895 - auc: 0.6195 - val_loss: 0.6898 - val_accuracy: 0.5581 - val_auc: 0.5958
Epoch 23/30
104/104 [==============================] - 968s 9s/step - loss: 0.6727 - accuracy: 0.5821 - auc: 0.6149 - val_loss: 0.6797 - val_accuracy: 0.5767 - val_auc: 0.6058
Epoch 24/30
104/104 [==============================] - 966s 9s/step - loss: 0.6705 - accuracy: 0.5877 - auc: 0.6203 - val_loss: 0.6753 - val_accuracy: 0.5717 - val_auc: 0.6027
Epoch 25/30
104/104 [==============================] - 974s 9s/step - loss: 0.6667 - accuracy: 0.5915 - auc: 0.6289 - val_loss: 0.6816 - val_accuracy: 0.5660 - val_auc: 0.6017
Epoch 26/30
104/104 [==============================] - 973s 9s/step - loss: 0.6664 - accuracy: 0.5997 - auc: 0.6308 - val_loss: 0.6820 - val_accuracy: 0.5730 - val_auc: 0.6105
Epoch 27/30
104/104 [==============================] - 971s 9s/step - loss: 0.6701 - accuracy: 0.5872 - auc: 0.6207 - val_loss: 0.6792 - val_accuracy: 0.5775 - val_auc: 0.5988
Epoch 28/30
104/104 [==============================] - 976s 9s/step - loss: 0.6669 - accuracy: 0.5958 - auc: 0.6298 - val_loss: 0.6959 - val_accuracy: 0.5350 - val_auc: 0.6071
Epoch 29/30
104/104 [==============================] - 1424s 14s/step - loss: 0.6650 - accuracy: 0.5948 - auc: 0.6336 - val_loss: 0.6758 - val_accuracy: 0.5787 - val_auc: 0.6059
Epoch 30/30
104/104 [==============================] - 971s 9s/step - loss: 0.6666 - accuracy: 0.6011 - auc: 0.6315 - val_loss: 0.6716 - val_accuracy: 0.5837 - val_auc: 0.6152
visualize_training_results(rnn_history)
CNN-GRU results. Image by author

Not bad at all! Obviously nothing amazing, but still not nothing. Clearly, this type of neural network architecture is picking up on something present in the music. It’s around 7% more accurate on the test data, and ROC-AUC improved by roughly 0.1, which is a clear improvement. This definitely shows promise, and perhaps some level of utility.

Holdout Testing

models_dict = {'Dummy': dummy, 'MP': mlp_dict, 'CNN 1': cnn1_dict, 'CNN 2': cnn2_dict,
               'RNN': rnn_dict}
acc_dict, roc_auc_dict = evaluate_holdout(models_dict, X_holdout, y_holdout)

# numerical scores for all the models
print("Accuracy Scores:")
for model, score in acc_dict.items():
    print(f'{model}: {score}')
print("<>"*10)
print("ROC-AUC Scores:")
for model, score in roc_auc_dict.items():
    print(f'{model}: {score}')
Accuracy Scores:
Dummy: 0.5065015479876162
MP: 0.5065015479876162
CNN 1: 0.5065015479876162
CNN 2: 0.5523219814241486
RNN: 0.5981424148606811
<><><><><><><><><><>
ROC-AUC Scores:
Dummy: 0.5
MP: 0.5
CNN 1: 0.5
CNN 2: 0.5742730226123023
RNN: 0.645901654431502
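A note on evaluate_holdout: it’s a small helper from the repo that isn’t reproduced in this post. Below is a rough sketch of what such a helper could do; this is an assumption on my part (the repo’s version wraps each model in a dict alongside its history and may differ in detail).

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate_holdout_sketch(models, X_holdout, y_holdout, threshold=0.5):
    """Score each fitted model on the holdout split."""
    acc_scores, roc_auc_scores = {}, {}
    for name, model in models.items():
        # keras models return probabilities; the sklearn dummy returns hard labels,
        # which thresholding leaves unchanged
        scores = np.ravel(model.predict(X_holdout))
        preds = (scores >= threshold).astype(int)
        acc_scores[name] = accuracy_score(y_holdout, preds)
        roc_auc_scores[name] = roc_auc_score(y_holdout, scores)  # uses the raw scores
    return acc_scores, roc_auc_scores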
# graph!

plt.rcParams.update({'font.size': 20})
fig, (ax1, ax2) = plt.subplots(2, figsize=(9,9))
plt.tight_layout(pad=2.5)
fig.suptitle('Holdout Data Results')

ax1.bar(list(acc_dict.keys()), list(acc_dict.values()))
ax1.set_ylabel('Accuracy')
ax1.set_ylim(0.5, 1)
ax1.tick_params(axis='x', rotation=45)

ax2.bar(list(roc_auc_dict.keys()), list(roc_auc_dict.values()))
ax2.set_ylabel('ROC-AUC')
ax2.set_ylim(0.5, 1)
ax2.tick_params(axis='x', rotation=45)
plt.show()
Final Results on Holdout Set (+ changes added post-render). Image by author

Evaluation

As you can see, even as we engaged with some fairly complex and cutting-edge neural network architectures, we did not get a drastic increase in our chosen metrics, accuracy and ROC-AUC. This signals that the models learned only a limited amount about how to distinguish the two groups.

However, we did get 59.8% accuracy and 0.646 ROC-AUC, which is not nothing — that is an increase of 9.1% accuracy and 0.146 ROC-AUC. And this was on the holdout set, so it did pick up on a few underlying abstract patterns, which is interesting and useful.

If we let the model continue training, it would eventually overfit on the training data. At any rate, if I analyze this with LIME, I could gain some real insight into exactly which patterns it picked up on.

Conclusion

I knew from the outset that this modeling project was an incredibly ambitious undertaking. Now that it’s complete, I can say with the wisdom of hindsight that it did not do amazingly well, but it did perform admirably. To be frank, I might go so far as to say it was doomed to some level of failure given the limitations I set out with, namely using solely audio samples. However, my results showed there is a nugget of utility here. More importantly, the project demonstrates a proof of concept that is quite powerful.

Creating a model that does something difficult for the human ear was a challenge. Not only is the data difficult to group together by any sort of underlying pattern, but the target of popularity is nebulous and inherently difficult to quantify. There is a great deal wrapped up in popularity, and I knew going in that there are factors outside of the music itself that exert great influence on its outcome. Many bad songs are popular by association with a popular artist, for example.

In the end, I obtained an accuracy of 59.8% and 0.646 ROC-AUC, which is extremely promising (increase of 9.1% accuracy and 0.146 ROC-AUC). Despite the necessarily limiting way I went about doing this project, I was able to build a model that picked up on some common features of popular songs and made some successful predictions.

This changes the way in which my model might be useful. I set out to build a model that could be used as a tool on the song level by an artist to tell if a specific song had promise. What I ended up with was a model that could be passed a number of possible songs and indicate the few songs it believes will be popular. Coincidentally, my client runs a recording studio as well and is in contact with several dozen artists who each have a collection of songs in production. In the model’s current form, we could pass in a collection of song samples for it to predict on, and the ones it chooses as “popular” could then be selected for focus. This selection won’t perfectly determine which songs will be popular, but it could save time as a first pass for the artists and my client’s recording studio.

Next Steps

Eventually with enough work, I believe the model can be tooled enough that it will be useful to evaluate individual songs. Clearly, there are other factors influencing popularity than what I collected for this project, and step one is to try to incorporate those into the model.

Within the audio itself, we can process the samples differently and pull out other relevant information. Maybe more crucially, we can look at factors that exist outside of the music itself. When I called my client to talk about how the model performed, the first thing he said was “I’m not surprised,” because he is intimately acquainted with how seemingly arbitrary factors can lead to a song’s popularity. Exposure writ large is a HUGE factor in whether/how much a song blows up. Prime example — my client’s top track on Spotify is the soundtrack to a meme in some TikTok circles. Other non-audio features can include timing of release and artist information.

Additionally, I had quite a bit of success with the more advanced neural network architectures, specifically the Gated Recurrent Unit (a type of Recurrent Neural Network). Looking into other architectures and further tweaking existing ones will likely improve performance.

I mentioned much earlier that a significant limitation of the dataset is that the tracklist did not seem to be all that complete. To more fully capture the landscape of contemporary hip-hop music, I need to reassess my collection methods and gather more songs.

Other than expanding on what I have, I would also like to look at what worked. I mentioned LIME (or something similar) in the introduction but did not end up using it this time around. It would be a very useful tool for looking under the hood at what most strongly influences the model’s predictions. Especially once I start incorporating more/different inputs, input types, and architectures, this will help me gain insight into what is actually going on. Furthermore, it would tell artists which parts of their songs the model considers important.
