Audio Genre Classification with Python OOP

Marc Kelechava
Towards Data Science
16 min read · Apr 22, 2020


This is a bit of a long post, but it’s largely broken into 2 main parts (after the initial overview).

The first main part begins with the ‘Audio Feature Extraction’ header. This is more of a background and justification for the audio feature extraction choices for the classifier, and why they’re necessary.

The second main part gets into modeling and code, and begins with the ‘OOP Model Design’ header.

A repo is here if you want to see the full code!

Audio Genre Classification Overview

Motivation

We’re going to develop an audio genre classifier using music information retrieval methods from the librosa library. The input is raw audio files (mp3). We’ll extract features from the digital representation of the audio and pass them to a classifier. The output of the classifier will be the genre (e.g. hip-hop, electronic, jazz, etc.).

Once a model is trained, I could feed unseen examples to the pipeline and use it to predict a genre label. One application is tagging the songs in a music library whose genre is ‘unknown’.

Generating Training Data

The first step is to create our training set. The base data will simply be full .mp3 files of various songs. I’m going to use music from my own library for this exercise. Most of my 6,000-song collection is made up of the following 5 genres: Ambient, Techno, Rap, Jazz, and Drum & Bass. These will be our five classification classes. I wanted to balance the following when constructing this training set:

  • Trying to label (roughly) the same amount of songs in each genre class. This won’t be entirely possible because the distribution in the broader ‘population’, i.e., my full library, is not uniform.
  • Getting a broad range of examples for each genre, i.e., not just choosing 10 songs from one album, since such a narrow training set is less likely to generalize.
  • Ensuring that a random human labeler would agree, with near certainty, that the genre label is correct. This just amounts to picking songs that are very clearly one genre rather than a ‘crossover’ between genres.

Manual Labeling

We have 6,000 songs; we’ll choose say ~10% according to the 3 bulleted concerns above, then label the genre classes manually. We’ll train a model on this manually labeled training data, then use the trained model to predict on the unlabeled data set to generate genre labels.

Audio Feature Extraction

How can we feed each audio file to the model in a meaningful way? You can use a tool like librosa to load audio files as a floating-point time series, but a (downsampled) 7-minute audio file still yields a time series roughly 9,000,000 floating-point numbers long!

Even if we condensed that in some way, it’s still not going to be computationally feasible to model directly.
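
For a sense of scale, here’s a minimal sketch of what that load looks like (the file path is a placeholder):

import librosa

# hypothetical path; librosa resamples to 22050 Hz by default
y, sr = librosa.load('some_song.mp3', sr=22050)

print(sr)       # 22050 samples per second
print(y.shape)  # for a ~7-minute file: 7 * 60 * 22050 ≈ 9.3 million floats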

Tempo Feature

At first glance tempo seems like a great candidate. For example, techno is generally recorded at a faster tempo than rap. But consider something like ambient music: these songs are often entirely beat-less, and tempo varies wildly even within the ambient genre, so it won’t be a great discriminating predictor there. Similarly, jazz tempos might fall in the same range as rap, techno, or drum & bass. Despite these drawbacks I’ll include tempo as one feature, but we’ll need much more nuance to generate a good predictive mapping from audio files to genre.

An aside — Frequency Content

It would probably be more useful to capture the frequency content in some way. From the highlighted link:

The audio spectrum range spans from 20 Hz to 20,000 Hz and can be effectively broken down into seven different frequency bands, with each band having a different impact on the total sound.

The seven frequency bands are:

  • Sub-bass
  • Bass
  • Low midrange
  • Midrange
  • Upper midrange
  • Presence
  • Brilliance

The key point here is that the frequency content varies as time progresses through the song. In particular, at every time step you have a new mix of audible frequencies. So it’s not going to be enough to take, say, an “average” frequency content.
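
To make that concrete, here’s a minimal sketch (path hypothetical): a short-time Fourier transform turns the 1-D waveform into a 2-D matrix with a full spectrum at every time frame, which is exactly the structure a single average would throw away.

import numpy as np
import librosa

y, sr = librosa.load('some_song.mp3', sr=22050, duration=10, offset=25)

# rows are frequency bins, columns are time frames
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))
print(S.shape)  # (1025, ~431) for a 10-second clip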

Spectral Contrast Feature

To solve this dilemma, I turned to modern music feature extraction techniques detailed on musicinformationretrieval.com.

For instance, I know that different instruments dominate certain frequency bands, and the use of certain instruments might allow the model to infer the genre. The spectral contrast feature outlined on musicinformationretrieval.com seems like it has what we need:

“Spectral contrast considers the spectral peak, the spectral valley, and their difference in each frequency subband.”

In particular, spectral contrast gives us that information regarding the spectral peak, valley and difference over time, per frequency subband! Thus a lot of information is encoded in this feature.

  • One note is that you still end up with a large matrix of information for this feature, so you’ll need to aggregate the calculations over time.
  • E.g., say we extract 6 frequency subbands. We get a spectral contrast calculation for each band over time (time on the x-axis, subband on the y-axis).
  • Then you can take a mean (or stddev, etc.) of spectral contrast across time for each of the subbands.
  • The vector of means (or of stddevs, or both) then becomes a feature vector for every song: each subband mean (or stddev) gets a column, and each song is a row (a rough sketch follows this list).
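
Here’s a rough sketch of that aggregation, assuming y and sr come from librosa.load as above (n_bands=6 is just for illustration; the AudioFeature class below uses a smaller default):

import numpy as np
import librosa

y, sr = librosa.load('some_song.mp3', sr=22050, duration=10, offset=25)

# shape (n_bands + 1, n_frames): one row per subband, one column per time frame
spec_con = librosa.feature.spectral_contrast(y=y, sr=sr, n_bands=6)

# collapse the time axis: one mean and one std per subband
song_features = np.hstack([spec_con.mean(axis=1), spec_con.std(axis=1)])
print(song_features.shape)  # (14,) -- one row like this per song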

That would probably make for a decent classifier, but let’s add some more info to the per-song feature vector.

MFCC Feature

What I really wanted is some numerical representation of timbre. Timbre is essentially the human-perceived quality of a note. E.g., I can play ‘C’ on a violin and ‘C’ on a trombone and get a vastly different experience.

But capturing this experience across genres is exactly what we want.

Turning again to musicinformationretrieval.com it appears the mel frequency cepstral coefficients fit our needs:

The mel frequency cepstral coefficients (MFCCs) of a signal are a small set of features (usually about 10–20) which concisely describe the overall shape of a spectral envelope. In MIR, it is often used to describe timbre.

Similar care here needs to be taken to extract the appropriate means/stddev for each song, but the process is similar to spectral contrast described above, so let’s move on.
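
Still, for completeness, here’s a minimal sketch of the MFCC aggregation (same assumptions as the spectral contrast sketch):

import numpy as np
import librosa

y, sr = librosa.load('some_song.mp3', sr=22050, duration=10, offset=25)

# shape (n_mfcc, n_frames): 12 coefficients per time frame
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)

# per-coefficient mean and std across time -> one (24,) vector per song
mfcc_feature = np.hstack([mfcc.mean(axis=1), mfcc.std(axis=1)])
print(mfcc_feature.shape)  # (24,)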

OOP Model Design

For code organization, I’ve created a Model class and an AudioFeature class. The AudioFeature class handles the feature extraction described above. Both are used via a short script in main.py:


import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# project imports: AudioFeature lives in audio.py; the Model class is
# assumed here to live in its own model module
from audio import AudioFeature
from model import Model

# parse_audio_playlist is a top-level helper function (shown below)
all_metadata = parse_audio_playlist(playlist='data/Subset.txt')

audio_features = []
for metadata in all_metadata:
    path, genre = metadata
    audio = AudioFeature(path, genre)
    audio.extract_features('mfcc', 'spectral_contrast', 'tempo',
                           save_local=True)
    audio_features.append(audio)

feature_matrix = np.vstack([audio.features for audio in audio_features])
genre_labels = [audio.genre for audio in audio_features]

model_cfg = dict(
    tt_test_dict=dict(shuffle=True, test_size=0.3),
    tt_val_dict=dict(shuffle=True, test_size=0.25),
    scaler=StandardScaler(copy=True),
    base_model=RandomForestClassifier(
        random_state=42,
        n_jobs=4,
        class_weight="balanced",
        n_estimators=250,
        bootstrap=True,
    ),
    param_grid=dict(
        model__criterion=["entropy", "gini"],
        model__max_features=["log2", "sqrt"],
        model__min_samples_leaf=np.arange(2, 4),
    ),
    grid_dict=dict(n_jobs=4,
                   refit=True,
                   iid=False,
                   scoring="balanced_accuracy"),
    kf_dict=dict(n_splits=3, random_state=42, shuffle=True),
)

model = Model(feature_matrix, genre_labels, model_cfg)
model.train_kfold()
model.predict(holdout_type="val")
model.predict(holdout_type="test")

There are 4 components to the pipeline:

  1. Choosing which audio paths on disk to point to, along with their manually labeled genre [all_metadata]
  2. Extracting audio features for every audio path in the labeled training set, and saving the extracted feature vectors for each audio file in an AudioFeature object. [audio_features]
  3. Concatenating the audio feature vectors for every AudioFeature object into a feature matrix for modeling, and extracting their corresponding labels
  4. Sending the feature matrix, labels, and model config dict (shown above) to a Model object, running cross-validation, and predicting on the holdout validation and test sets with the trained model.

Grabbing audio file metadata

There’s one helper function in the main.py file to load in information about where the audio files live on disk.

I made this to work with an Apple Music playlist. There’s an example playlist file committed to the repo in the data directory.

The function output here is an iterator that includes the path to the audio file and the genre.

import numpy as np
import pandas as pd


def parse_audio_playlist(playlist):
    """
    Assumes an Apple Music playlist saved as plain text
    Returns: zip object with (paths, genres)
    """
    df = pd.read_csv(playlist, sep='\t')
    df = df[['Location', 'Genre']]

    paths = df['Location'].values.astype(str)
    paths = np.char.replace(paths, 'Macintosh HD', '')

    genres = df['Genre'].values

    return zip(paths, genres)
Playlist extractor
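
As a quick sanity check, you can peek at a few entries of the returned zip object (remember it’s an iterator, so re-create it before the real run):

all_metadata = parse_audio_playlist(playlist='data/Subset.txt')

# print the first few (path, genre) pairs; this consumes the iterator
for path, genre in list(all_metadata)[:3]:
    print(genre, path)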

The AudioFeature object

The AudioFeature class I wrote holds methods to extract features from a raw audio file (mp3).

To use this class, the ‘public’ method is meant to be called as follows:

audio = AudioFeature(path, genre)
audio.extract_features('mfcc',
                       'spectral_contrast',
                       'tempo',
                       save_local=True)

You can feed it a list of features as strings that you want extracted.

Here is the code for the class. Note that when someone calls .extract_features(…) as shown above, this kicks off a series of other method calls depending on the string arguments.

import librosa
import numpy as np
import pickle


class AudioFeature:

    def __init__(self, path, genre, duration=10, offset=25, sr=22050):
        """
        Keep duration num seconds of each clip, starting at
        offset num seconds into the song (avoid intros)
        """
        self.path = path
        self.genre = genre
        self.y, self.sr = librosa.load(self.path,
                                       sr=sr,
                                       duration=duration,
                                       offset=offset)
        self.features = None

    def _concat_features(self, feature):
        """
        Whenever an _extract_X method is called by main.py,
        this helper function concatenates to Audio instance
        features attribute
        """
        self.features = np.hstack(
            [self.features, feature]
            if self.features is not None else feature)

    def _extract_mfcc(self, n_mfcc=12):
        """
        Extract MFCC mean and std_dev vectors for a clip.
        Appends (2*n_mfcc,) shaped vector to
        instance feature vector
        """
        mfcc = librosa.feature.mfcc(self.y,
                                    sr=self.sr,
                                    n_mfcc=n_mfcc)

        mfcc_mean = mfcc.mean(axis=1).T
        mfcc_std = mfcc.std(axis=1).T
        mfcc_feature = np.hstack([mfcc_mean, mfcc_std])
        self._concat_features(mfcc_feature)

    def _extract_spectral_contrast(self, n_bands=3):
        """
        Extract Spectral Contrast mean and std_dev vectors
        Appends (2*(n_bands+1),) shaped vector to
        instance feature vector
        """
        spec_con = librosa.feature.spectral_contrast(y=self.y,
                                                     sr=self.sr,
                                                     n_bands=n_bands)

        spec_con_mean = spec_con.mean(axis=1).T
        spec_con_std = spec_con.std(axis=1).T
        spec_con_feature = np.hstack([spec_con_mean, spec_con_std])
        self._concat_features(spec_con_feature)

    def _extract_tempo(self):
        """
        Extract the BPM.
        Appends (1,) shaped vector to instance feature vector
        """
        tempo = librosa.beat.tempo(y=self.y, sr=self.sr)
        self._concat_features(tempo)

    def extract_features(self, *feature_list, save_local=False):
        """
        Specify a list of features to extract,
        built for you for a given Audio sample.

        Currently supported: 'mfcc', 'spectral_contrast', 'tempo'
        """
        for feature in feature_list:
            if feature == 'mfcc':
                self._extract_mfcc()
            elif feature == 'spectral_contrast':
                self._extract_spectral_contrast()
            elif feature == 'tempo':
                self._extract_tempo()
            else:
                raise KeyError('Feature type not understood')

        if save_local:
            self._save_local(mem_clean=True)

    def _save_local(self, mem_clean=True):
        self.local_path = self.path.split('/')[-1]
        self.local_path = (
            self.local_path.replace('.mp3', '').replace(' ', '')
        )
        with open(f'data/{self.local_path}.pkl', 'wb') as out_f:
            pickle.dump(self, out_f)

        if mem_clean:
            self.y = None
audio.py module

Template OOP Design Pattern

I am loosely following the template OOP design pattern. As shown above, the client main.py first creates each AudioFeature object and then calls the .extract_features method. This method kicks off a sequence of internal helper methods, in order, that extract the features.

To make this a bit more concrete — if I run the main.py script I could tell it to only include tempo as a feature, or only include mfcc as a feature.

To achieve this I’d only have to change a single parameter in the main.py script posted earlier, without touching the AudioFeature module code in any way.
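
For example, a tempo-only run is just a one-argument change to the extract_features call in main.py (a sketch):

# only tempo goes into the feature vector; the AudioFeature class is untouched
audio = AudioFeature(path, genre)
audio.extract_features('tempo', save_local=True)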

Param choices to send to the class

I chose to leave two additional parameters as an option for the main.py caller to send to the constructor. These are:

  1. The duration of the audio clip to extract
  2. A starting offset (e.g. start 10 seconds into the song instead of the beginning)

I leave these as options to send to the AudioFeature constructor because I’d want to test running the entire feature transformation and model pipeline with different settings of these later on.

Finally note that all the feature extraction happens after the constructor loads an audio file with relevant params using librosa.load. So if I wanted to change the entire pipeline with a different duration or offset, I only need to edit the constructor param calls to AudioFeature in main.py, and the rest of the pipeline will then work without any editing.
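
So an experiment with longer clips starting deeper into each song only touches the constructor call in main.py, e.g. (the values here are arbitrary):

# 30-second clips starting 60 seconds in; the rest of the pipeline is unchanged
audio = AudioFeature(path, genre, duration=30, offset=60)
audio.extract_features('mfcc', 'spectral_contrast', 'tempo', save_local=True)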

Model — Passing Features and Labels

Let’s zoom in just on this part from main.py:

audio_features = []
for metadata in all_metadata:
    path, genre = metadata
    audio = AudioFeature(path, genre)
    audio.extract_features('mfcc', 'spectral_contrast', 'tempo',
                           save_local=True)
    audio_features.append(audio)

feature_matrix = (
    np.vstack([audio.features for audio in audio_features])
)
genre_labels = [audio.genre for audio in audio_features]

The all_metadata name points to a Python zip object of paths and genres, which are passed to the AudioFeature class in the audio module.

Thus we create an AudioFeature object for every path listed in the training data playlist (labeled with a genre). Then features are extracted for each one and (optionally) saved to disk.

In addition, the extracted feature vector for every AudioFeature object is now in our namespace, so we can:

  1. Take the audio.features instance attribute (the feature vector for each audio file) and send it to np.vstack to make a feature matrix. This is stored in the feature_matrix name.
  2. Take the genre from the audio.genre instance attribute and store it in the genre_labels name. These are the manual labels.

Now this is exactly what gets passed to the model, which also exists as its own class.

The Model class

Now main.py just needs to run the following lines to kick off the cross-validated training and score predictions on the holdout dev (validation) set and test set (the scores get printed):

model = Model(feature_matrix, genre_labels, model_cfg)
model.train_kfold()
model.predict(holdout_type="val")
model.predict(holdout_type="test")

Naturally, there are a ton of levers to tune in the model. I set a config dict in main.py (model_cfg) and run different tests by altering the parameters and re-running the Model process. Note that I can do this without having to re-run the feature generation process, since I saved all of that to disk!

I.e., after creating feature_matrix and genre_labels in a session, you can just re-run the model calls independently (e.g., in IPython) to test out different model ideas. Make sure to only judge based on the dev set and not the holdout test set.
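
For example, here’s a minimal sketch of rebuilding the feature matrix from the pickles that _save_local wrote to the data/ directory, without touching librosa again (the AudioFeature class must be importable for unpickling):

import glob
import pickle

import numpy as np

audio_features = []
for pkl_path in glob.glob('data/*.pkl'):
    with open(pkl_path, 'rb') as in_f:
        audio_features.append(pickle.load(in_f))  # AudioFeature objects

feature_matrix = np.vstack([audio.features for audio in audio_features])
genre_labels = [audio.genre for audio in audio_features]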

Class instance attributes

Let’s look first at the __init__ method in the Model class:

from sklearn.pipeline import Pipeline

from sklearn.model_selection import (
    train_test_split,
    GridSearchCV,
    StratifiedKFold,
)
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder
import numpy as np


class Model:

    def __init__(self, feature_matrix, labels, cfg):

        self.X = feature_matrix
        self.cfg = cfg

        # label-encode the genre strings and keep the encoder
        # so predictions can be mapped back to genre names later
        self.encoder = LabelEncoder()
        self.y = self.encoder.fit_transform(labels)

        # populated in .train_kfold()
        self.best_estimator = None
        self.holdout_val_set = None
        self.holdout_test_set = None

I’m using this area to define every instance attribute I ultimately want to live in this object (which I can later save to disk via pickle for easy re-loading).

The feature matrix gets stored in self.X, and the genre labels are label-encoded and stored in self.y; the fitted LabelEncoder is kept as self.encoder so predictions can be mapped back to genre names later. The config dict (cfg) gets saved in self.cfg, which lets me see exactly how I configured each Model instance.

Moving down, after I call .train_kfold() I’ll populate the self.best_estimator attribute, which saves the best model found during cross-validation.

I’ll also store my holdout test set and holdout dev set, in case I want to do error analysis on individual examples and examine whether their distribution differs from the training data distribution.

The train_kfold method

Here’s where the model gets trained:

def train_kfold(self):
    """
    Using Pipeline objects as they don't leak transformations
    into the validation folds as shown here: https://bit.ly/2N7rdQ0,
    and here: https://bit.ly/346THQL

    Note that return_train_score=True and verbose=3 in GridSearchCV
    is useful for debugging.
    """

    # Save a holdout test set that WON'T go through the KFold CV
    # We will not fit any parameter choices to the holdout test set
    X_cv, X_test, y_cv, y_test = train_test_split(
        self.X,
        self.y,
        random_state=42,
        stratify=self.y,
        **self.cfg['tt_test_dict'])

    self.holdout_test_set = (X_test, y_test)

    # From the non-holdout-test data, split off a validation piece
    X_train, X_val, y_train, y_val = train_test_split(
        X_cv,
        y_cv,
        random_state=42,
        stratify=y_cv,
        **self.cfg['tt_val_dict'])

    # Note these val sets won't go into GridSearchCV
    # We'll predict on these via .predict(holdout_type="val")
    self.holdout_val_set = (X_val, y_val)

    pipe = Pipeline([
        ('scaler', self.cfg['scaler']),
        ('model', self.cfg['base_model'])
    ])

    # Use stratification within the KFold split inside GridSearchCV
    kf = StratifiedKFold(**self.cfg['kf_dict'])

    # Perform KFold many times according to our param grid search
    grid_search = GridSearchCV(estimator=pipe,
                               param_grid=self.cfg['param_grid'],
                               cv=kf,
                               return_train_score=True,
                               verbose=3,
                               **self.cfg['grid_dict'])

    # refit the best estimator on the FULL train set
    grid_search.fit(X_train, y_train)
    self.best_estimator = grid_search.best_estimator_

I’m intending to write another full post soon on this cross validation process in depth, but here’s a bullet point rundown of this approach:

  • I want a holdout test set that never touches the cross-validation inner loop. The CV takes place within GridSearchCV from sklearn.
  • Note that what gets passed to the GridSearchCV step is the result of a 2nd train_test_split!
  • So now we have train / dev / test. All of these are stored as attributes in the class.
  • The Pipeline is passed to GridSearchCV so that we don’t leak info from the transformations into the CV training, see here: https://stackoverflow.com/questions/57651455/are-the-k-fold-cross-validation-scores-from-scikit-learns-cross-val-score-and
  • Note that GridSearchCV will find the best params from a param dict that we pass in. This dict (in this case) was created in main.py (shown in the first code snippet of this post); a small sketch of the key-naming convention follows this list.
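
The double-underscore prefix in those param_grid keys is sklearn’s convention for routing each grid parameter to a named Pipeline step; here’s a minimal sketch:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(random_state=42)),
])

# 'model__criterion' is forwarded to the criterion parameter of the 'model'
# step; a 'scaler__...' key would similarly target the scaler step
param_grid = dict(
    model__criterion=['entropy', 'gini'],
    model__max_features=['log2', 'sqrt'],
)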

The predict() method

At this point we have a trained model and saved dev and test sets. These all live as attributes in the class. So now from main.py we can call other methods on the Model object to get predictions:

model = Model(feature_matrix, genre_labels, model_cfg)
model.train_kfold()
model.predict(holdout_type="val")
model.predict(holdout_type="test")

Here’s what that looks like:

def _parse_conf_matrix(self, cnf_matrix):
    TP = np.diag(cnf_matrix)
    FP = cnf_matrix.sum(axis=0) - np.diag(cnf_matrix)
    FN = cnf_matrix.sum(axis=1) - np.diag(cnf_matrix)
    TN = cnf_matrix.sum() - (FP + FN + TP)

    TP = TP.astype(float)
    FP = FP.astype(float)
    TN = TN.astype(float)
    FN = FN.astype(float)

    return TP, FP, TN, FN


def _predict_(self, holdout_type):
    if holdout_type == "val":
        X_holdout, y_holdout = self.holdout_val_set

    elif holdout_type == "test":
        X_holdout, y_holdout = self.holdout_test_set

    scaler = self.best_estimator['scaler']
    model = self.best_estimator['model']

    X_holdout_scaled = scaler.transform(X_holdout)
    y_pred = model.predict(X_holdout_scaled)
    cnf_matrix = confusion_matrix(y_holdout, y_pred)

    TP, FP, TN, FN = self._parse_conf_matrix(cnf_matrix)

    return TP, FP, TN, FN


def predict(self, holdout_type):
    """
    Specify either "val" or "test" as a string arg
    """
    TP, FP, TN, FN = self._predict_(holdout_type)

    print(f'{holdout_type} Set, per class:')
    print(f'TP:{TP}, FP:{FP}, TN:{TN}, FN:{FN}')

    print(f'{holdout_type} False Positive Rate per Class: {FP / (FP + TN)}')
    print(f'{holdout_type} False Negative Rate per Class: {FN / (TP + FN)}')
    print(f'{holdout_type} Accuracy per Class: {(TP + TN) / (TP + TN + FP + FN)}')

Recall we stored a holdout dev set and a holdout test set as we went through the training loop. Neither of those hit the KFold CV.

Why bother with two sets?

  1. Let’s assume we only had one test set. Now also consider choosing a different model family, say logistic regression, and running it through the entire process. At the end you’d be comparing holdout test-set scores between two different model families.
  2. But now, if you choose which model family to use based on this single test set, you’re technically fitting a parameter (the model family) to your holdout!
  3. This is bad, and it’s going to hurt your chances of generalizing to unseen data. Indeed, this is why people use both a dev and a test set.
  4. You should make the model family selection (following my example) based on the dev set scores. Don’t make any decisions based on the test set scores.

Per Class Metrics

Note that since we have a multiclass task, we will get classification metrics per class. This is a bit tricky to do if you code it by hand, which is why I’ve shown a function to do that in numpy, _parse_conf_matrix().

Of course, you could just look at an sklearn classification report as well. However, I like to explicitly view the false positive and false negative rates per class in this easy-to-read format (printed by the class), as it will quickly show you where your model is failing.
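
For example, here’s a minimal sketch of the classification-report route, using attributes the Model object already stores after train_kfold has run:

from sklearn.metrics import classification_report

X_val, y_val = model.holdout_val_set

# best_estimator is a Pipeline, so the scaler is applied automatically
y_pred = model.best_estimator.predict(X_val)

print(classification_report(y_val, y_pred,
                            target_names=model.encoder.classes_))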

Results

"""
val set, per class:
TP:[ 8. 9. 11. 52. 25.], FP:[1. 7. 1. 5. 4.], TN:[113. 100. 110. 61. 90.], FN:[1. 7. 1. 5. 4.]val False Positive Rate per Class: [0.00877193 0.06542056 0.00900901 0.07575758 0.04255319]val False Negative Rate per Class: [0.11111111 0.4375 0.08333333 0.0877193 0.13793103]val Accuracy per Class: [0.98373984 0.88617886 0.98373984 0.91869919 0.93495935]

test Set, per class:
TP:[10. 16. 18. 91. 46.], FP:[ 2. 2. 4. 14. 9.], TN:[194. 182. 188. 100. 153.], FN:[ 6. 12. 2. 7. 4.]test False Positive Rate per Class: [0.01020408 0.01086957 0.02083333 0.12280702 0.05555556]test False Negative Rate per Class: [0.375 0.42857143 0.1 0.07142857 0.08 ]test Accuracy per Class: [0.96226415 0.93396226 0.97169811 0.9009434 0.93867925]
"""

It looks like overall this approach works quite well!

We can see that there is definitely a False Negative problem across the board, but particularly bad in class index = 1.

Note that we stored the encoder in the class, so we can quickly see what class it is:

print(model.encoder.classes_)
# array(['Ambient', 'Drum & Bass', 'Jazz', 'Rap', 'Techno'], dtype='<U11')

So this makes a lot of sense! Many Drum & Bass songs are very sonically similar to Techno songs, so my next step would be to check if that’s indeed what’s happening.

Recall that we stored the holdout dev set in the class, so we could go in and check if indeed this is where the confusion lies.

Then, we could assess whether we need more training data or different features in order to capture the nuance between the genres.
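
A quick way to do that check from the stored dev set is to print the full confusion matrix with the genre names attached; here’s a rough sketch:

import pandas as pd
from sklearn.metrics import confusion_matrix

X_val, y_val = model.holdout_val_set
y_pred = model.best_estimator.predict(X_val)

classes = model.encoder.classes_
cm = pd.DataFrame(confusion_matrix(y_val, y_pred),
                  index=classes, columns=classes)
print(cm)  # rows = true genre, columns = predicted genre
# a large ('Drum & Bass', 'Techno') entry would confirm the suspicion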

Using this model to classify unlabeled data

Note how easy it is now to apply this to unlabeled examples. I could take the trained model and simply return y_pred from the _predict_ method.

Then for any unlabeled examples in my music library, I can use this prediction as a new genre label!
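
Concretely, something like the following sketch would do it (paths_to_unlabeled is a hypothetical list of untagged mp3 paths):

import numpy as np

unlabeled = []
for path in paths_to_unlabeled:  # hypothetical list of untagged mp3s
    audio = AudioFeature(path, genre=None)
    audio.extract_features('mfcc', 'spectral_contrast', 'tempo')
    unlabeled.append(audio)

X_new = np.vstack([audio.features for audio in unlabeled])

# the Pipeline handles scaling; inverse_transform maps integers back to genre names
y_pred = model.best_estimator.predict(X_new)
predicted_genres = model.encoder.inverse_transform(y_pred)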

Real-world application

Say I have thousands of unlabeled songs in my music library. Maybe I don’t want to use the predictions for every example; perhaps I only want to apply a genre prediction when the model strongly believes that a particular example belongs in one of the 5 genres.

To capture that notion, I could use the probability scores from the .predict_proba method of the underlying sklearn classifier. Then I could say: “If the model predicts > 80% probability for 1 of the 5 classes, use that genre tag in the music library.”

In other words, I’d compute the per-class probability scores for each new unlabeled example, and only apply the predicted genre label in my music library if it meets my threshold criterion.
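
Building on the sketch above, that thresholding logic might look roughly like this (the 0.8 cutoff is just the example threshold):

# class probabilities from the random forest inside the Pipeline
probs = model.best_estimator.predict_proba(X_new)  # shape (n_songs, 5)

best_prob = probs.max(axis=1)
best_genre = model.encoder.inverse_transform(probs.argmax(axis=1))

for genre, prob in zip(best_genre, best_prob):
    if prob > 0.8:
        print(f'Tag as {genre} (p={prob:.2f})')
    else:
        print('Leave genre as unknown')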

Thanks for reading!
