Urban Sound Classification with Librosa — tricky cross-validation

Featuring the Leave One Group Out strategy using scikit-learn

Marc Kelechava
Towards Data Science


Image by Author [Marc Kelechava]

Outline

The goal of this post is two-fold:

  1. I’ll show an example of implementing the results of an interesting research paper on classifying audio clips based on their sonic content. This will include applications of the librosa library, a Python package for music and audio analysis. The clips are short recordings of city sounds, and the classification task is to predict the appropriate category label.
  2. I’ll show the importance of a valid cross-validation scheme. Given the nuances of the audio source dataset I’m using, it is very easy to accidentally leak information about the underlying recordings, which will overfit your model and prevent it from generalizing. The solution is somewhat subtle, so it seemed like a nice opportunity for a blog post.

Original research paper

http://www.justinsalamon.com/uploads/4/3/9/4/4394963/salamon_urbansound_acmmm14.pdf

Source dataset, by paper authors

https://urbansounddataset.weebly.com/urbansound8k.html

Summary of their dataset

“This dataset contains 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music. The classes are drawn from the urban sound taxonomy.”

I’ll extract features from these sound excerpts and fit a classifier to predict one of the 10 classes. Let’s get started!

Note on my Code

I’ve created a repo that allows you to re-create my example in full:

  1. Script runner: https://github.com/marcmuon/urban_sound_classification/blob/master/main.py
  2. Feature extraction module: https://github.com/marcmuon/urban_sound_classification/blob/master/audio.py
  3. Model module: https://github.com/marcmuon/urban_sound_classification/blob/master/model.py

The script runner handles loading the source audio from disk, parsing the metadata about the source audio, and passing this information to the feature extractor and the model.

Downloading the data

You can download the data, which extracts to 7.09GB, using this form from the research paper authors: https://urbansounddataset.weebly.com/download-urbansound8k.html

Directory structure [optional section — if you want to run this yourself]

Obviously you can fork the code and re-map it to whatever directory structure you want, but if you want to follow mine:

  • In your home directory: create a folder called datasets, and in there place the unzipped UrbanSound8K folder [from link in ‘Downloading the Data’]
  • Also in your home directory: create a projects folder and put the cloned repo there ending up with ~/projects/urban_sound_classification/…

Within the code, I use some methods to automatically write the extracted feature vectors for each audio file into ~/projects/urban_sound_classification/data

I do this because the feature extraction takes a long time and you won’t want to do it twice. There’s also code that checks to see if these feature vectors exist.
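The existence check is simple. Roughly (the cached-file naming here is illustrative; the real version lives in audio.py):

import os

# Feature-vector cache location, matching the directory layout above
FEATURE_DIR = os.path.expanduser("~/projects/urban_sound_classification/data")
os.makedirs(FEATURE_DIR, exist_ok=True)

def already_extracted(clip_name):
    # Skip re-extraction if this clip's vector was saved on a previous run
    return os.path.exists(os.path.join(FEATURE_DIR, f"{clip_name}.pkl"))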

tl;dr — if you follow my directory structure, you can simply run the main.py script and everything should work!

Why this problem requires careful cross-validation

Note that the source data is split up into 10 sub-folders, labeled ‘Fold1’, ‘Fold2’, etc.

We have 8,732 four-second audio clips of various city sounds. These clips were manually created by the research paper authors, who labeled them with classes such as ‘car horn’, ‘jackhammer’, ‘children playing’, and so on. In addition to the 10 folds, there are 10 classes.

The fold numbers do not have anything to do with the class labels; rather, the folds refer to the uncut audio file(s) that these 4-second training examples were spliced from.

What we don’t want is for the model to be able to learn how to classify things based on aspects of the particular underlying recording.

We want a generalizable classifier that will work with a wide array of recording types, but that still classifies the sounds correctly.

Guidance from the paper authors on proper CV

That’s why the authors have pre-built folds for us, and offered the following guidance, which is worth quoting:

Don’t reshuffle the data! Use the predefined 10 folds and perform 10-fold (not 5-fold) cross validation…

If you reshuffle the data (e.g. combine the data from all folds and generate a random train/test split) you will be incorrectly placing related samples in both the train and test sets, leading to inflated scores that don’t represent your model’s performance on unseen data. Put simply, your results will be wrong.

Summary of the proper approach

  • Train on folds 1–9, then test on fold 10 and record the score. Then train on folds 2–10, and test on fold 1 and record the score.
  • Repeat this until each fold has served as the holdout one time.
  • The overall score will be the average of the 10 accuracy scores from the 10 different holdout sets.

Re-creating the paper results

Note that the research paper does not have any code examples. What I want to do is first see if I can re-create (more or less) the results from the paper with my own implementation.

Then if that looks in line, I’ll work on some model improvements to see if I can beat it.

Here’s a snapshot of their model accuracy across folds from the paper [their image, not mine]:

Image from Research Paper Authors — Justin Salamon, Christopher Jacoby, and Juan Pablo Bello

Thus we’d like to get into the high-60%/low-70% accuracy range across the folds, as shown in their Figure 3a.

Audio Feature Extraction

Image by Author [Marc Kelechava]

Librosa is an excellent and easy-to-use Python library that implements music information retrieval techniques. I recently wrote another blog post on a model using the librosa library here. The goal of that exercise was to train an audio genre classifier on labeled audio files (label = music genre) from my personal library. I then used that trained model to predict the genre of other untagged files in my music library.

I will use some of the music information retrieval techniques I learned from that exercise and apply them to audio feature extraction for the city sound classification problem. In particular I’ll use:

  • MFCCs (Mel-frequency cepstral coefficients)
  • Spectral contrast
  • Chromagrams

A quick detour on audio transformations [optional]

[My other blog post expands on some of this section in a bit more detail if any of this is of particular interest]

Note that it is technically possible to convert a raw audio source to a numerical vector and train on that directly. However, a (downsampled) 7-minute audio file will yield a time series vector roughly 9,000,000 floating point numbers in length!

Even for our 4-second clips, the raw time series representation is a vector of roughly 88,000 samples (at librosa’s default 22,050 Hz sample rate). Given we only have 8,732 training examples, this is likely too high-dimensional to be workable.
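As a quick sanity check, you can load a single clip and look at its shape (the path below is hypothetical; librosa resamples to 22,050 Hz by default):

import os
import librosa

# Hypothetical path to one 4-second clip from the dataset
clip_path = os.path.expanduser("~/datasets/UrbanSound8K/audio/fold1/some_clip.wav")

y, sr = librosa.load(clip_path)   # default sr=22050
print(sr, y.shape)                # ~4 s * 22,050 Hz is roughly 88,000 samples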

The various music information retrieval techniques reduce the dimensionality of the raw audio vector representation and make this more tractable for modeling.

The techniques that we’ll be using to extract features seek to capture different qualities about the audio over time. For instance, the MFCCs describe the spectral envelope [amplitude spectrum] of a sound. Using librosa we get this information over time — i.e., we get a matrix!

The MFCC matrix for a particular audio file will have coefficients on the y-axis and time on the x-axis. Thus we want to summarize these coefficients over time (across the x-axis, or axis=1 in numpy land). Say we take an average over time — then we get the average value for each MFCC coefficient across time, i.e., a feature vector of numbers for that particular audio file!

What we can do is repeat this process for different music information retrieval techniques, or different summary statistics. For instance, the spectral contrast technique will also yield a matrix of different spectral characteristics for different frequency ranges over time. Again we can repeat the aggregation process over time and pack it into our feature vector.
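To make the “summarize over time” idea concrete, here’s a minimal sketch (variable names are illustrative, not from the repo):

import numpy as np
import librosa

# Load one clip as in the snippet above (hypothetical path)
y, sr = librosa.load("UrbanSound8K/audio/fold1/some_clip.wav")

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=25)          # shape (25, n_frames)
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)    # shape (7, n_frames)

# axis=1 is the time axis: collapsing it gives one number per coefficient/band
mfcc_mean = np.mean(mfcc, axis=1)           # shape (25,)
contrast_mean = np.mean(contrast, axis=1)   # shape (7,)

feature_vector = np.hstack([mfcc_mean, contrast_mean])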

What the authors used

The paper authors call out MFCC explicitly. They mention pulling the first 25 MFCC coefficients, noting:

“The per-frame values for each coefficient are summarized across time using the following summary statistics: minimum, maximum, median, mean, variance, skewness, kurtosis and the mean and variance of the first and second derivatives, resulting in a feature vector of dimension 225 per slice.”

Thus in their case they kept aggregating the 25 MFCCs over different summary statistics and packed them into a feature vector.

I’m going to implement something slightly different here, since it worked quite well for me in the genre classifier problem mentioned previously.

I take (for each snippet):

  • Mean of the MFCC matrix over time
  • Std. Dev of the MFCC matrix over time
  • Mean of the Spectral Contrast matrix over time
  • Std. Dev of the Spectral Contrast matrix over time
  • Mean of the Chromagram matrix over time
  • Std. Dev of the Chromagram matrix over time

My output (for each audio clip) will only be 82-dimensional as opposed to the 225-dim of the paper, so modeling should be quite a bit faster.

Finally some code! Audio feature extraction in action.

[Note that I’ll be posting code snippets both within the blog post and as GitHub Gist links. Sometimes Medium does not render GitHub Gists correctly, which is why I’m doing this. Also, all the in-document code can be copied and pasted into an IPython terminal, while the GitHub Gists cannot.]

Referring to my script runner here:

I parse through the metadata (given with the dataset) and grab the filename, fold, and class label for each audio file. Then this gets sent to an audio feature extractor class.
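In condensed form, the runner logic looks roughly like this (a sketch, not a copy of main.py; the AudioFeature call signature is illustrative, and the column names are those in the dataset’s metadata CSV):

import os
import pandas as pd

# Paths follow the directory layout described earlier
DATA_ROOT = os.path.expanduser("~/datasets/UrbanSound8K")
metadata = pd.read_csv(os.path.join(DATA_ROOT, "metadata", "UrbanSound8K.csv"))

audio_features = []
for _, row in metadata.iterrows():
    wav_path = os.path.join(
        DATA_ROOT, "audio", f"fold{row['fold']}", row["slice_file_name"]
    )
    # AudioFeature is sketched in the next snippet (audio.py in the repo)
    audio = AudioFeature(wav_path, fold=row["fold"], label=row["class"])
    audio.extract_features("mfcc", "spectral_contrast", "chroma")
    audio_features.append(audio)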

The AudioFeature class wraps around librosa, and extracts the features you feed in as strings as shown above. It also then saves the AudioFeature object to disk for every audio clip. The process takes a while, so I save the class label and fold number in the AudioFeature object along with the feature vector. This way you can come back and play around with the model later on the extracted features.

This class implements what I described earlier — which is aggregating the various music information retrieval techniques over time, and then packing everything into a single feature vector for each audio clip.
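Here’s a sketch of that class, simplified relative to the repo (the librosa parameters shown are illustrative; the repo’s exact settings produce the 82-dim vector mentioned earlier):

import pickle
import numpy as np
import librosa

class AudioFeature:
    """Load a clip, aggregate librosa features over time, cache to disk."""

    def __init__(self, wav_path, fold, label):
        self.wav_path = wav_path
        self.fold = fold
        self.label = label
        self.y, self.sr = librosa.load(wav_path)
        self.features = None

    def _summarize(self, matrix):
        # Collapse the time axis: mean and std for each coefficient/band
        return np.hstack([matrix.mean(axis=1), matrix.std(axis=1)])

    def extract_features(self, *feature_names):
        extractors = {
            "mfcc": lambda: librosa.feature.mfcc(y=self.y, sr=self.sr, n_mfcc=25),
            "spectral_contrast": lambda: librosa.feature.spectral_contrast(y=self.y, sr=self.sr),
            "chroma": lambda: librosa.feature.chroma_stft(y=self.y, sr=self.sr),
        }
        self.features = np.hstack(
            [self._summarize(extractors[name]()) for name in feature_names]
        )

    def save(self, out_path):
        # Keep fold and label alongside the vector for later modeling runs
        with open(out_path, "wb") as f:
            pickle.dump(self, f)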

Modeling

Image by Author [Marc Kelechava]

Since we put all the AudioFeature objects in a list above, we can do some quick comprehensions to get what we need for modeling:
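Something like the following (attribute names follow the sketch above):

import numpy as np

X = np.vstack([af.features for af in audio_features])    # shape (8732, n_features)
y = np.array([af.label for af in audio_features])         # class labels
groups = np.array([af.fold for af in audio_features])      # fold number per clip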

The Model class will implement the cross-validation loop as described by the authors (keeping the relevant pitfalls in mind!).

As a reminder, here’s a second warning from the authors:

Don’t evaluate just on one split! Use 10-fold (not 5-fold) cross validation and average the scores
We have seen reports that only provide results for a single train/test split, e.g. train on folds 1–9, test on fold 10 and report a single accuracy score. We strongly advise against this. Instead, perform 10-fold cross validation using the provided folds and report the average score.

Why?

Not all the splits are as “easy”. That is, models tend to obtain much higher scores when trained on folds 1–9 and tested on fold 10, compared to (e.g.) training on folds 2–10 and testing on fold 1. For this reason, it is important to evaluate your model on each of the 10 splits and report the average accuracy.

Again, your results will NOT be comparable to previous results in the literature.

On their latter point (this is from the paper): different recordings/folds have different distributions of whether the snippets appear in the foreground or the background, which is why some folds are easy and some are hard.

tl;dr CV

  • We need to train on folds 1–9, predict & score on fold 10
  • Then train on folds 2–10, predict & score on fold 1
  • …etc…
  • Averaging the scores on the test folds with this process will match the existing research AND ensure that we aren’t accidentally leaking data about the source recording to our holdout set.

Leave One Group Out

Initially, I coded the split process described above by hand using numpy with respect to the given folds. While it wasn’t too bad, I realized that scikit-learn provides a perfect solution in the form of its LeaveOneGroupOut splitter.

To prove to myself it is what we want, I ran a slightly altered version of the test code for the splitter from the sklearn docs:
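A minimal version of that check, using toy arrays with three groups:

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 1, 2, 1, 2])
groups = np.array([1, 1, 2, 2, 3, 3])

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups):
    print("TRAIN groups:", groups[train_idx], "TEST groups:", groups[test_idx])

# TRAIN groups: [2 2 3 3] TEST groups: [1 1]
# TRAIN groups: [1 1 3 3] TEST groups: [2 2]
# TRAIN groups: [1 1 2 2] TEST groups: [3 3]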

Note that in this toy example there are 3 groups, ‘1’, ‘2’, and ‘3’.

When I feed the group membership list for each training example to the splitter, it correctly ensures that the same group examples never appear in both train and test.

The Model Class

Thanks to sklearn this ends up being pretty easy to implement!

Here I add in some scaling, but in essence the splitter gives us the desired CV. On each iteration of the splitter I train on 9 folds and predict on the holdout fold. This happens 10 times, and then we average over the returned list of 10 holdout scores.
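Stripped down, the loop looks like this (the classifier and its parameters are stand-ins; see model.py in the repo for the real choices):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.preprocessing import StandardScaler

def cross_validate(X, y, groups):
    fold_acc = []
    logo = LeaveOneGroupOut()
    for train_idx, test_idx in logo.split(X, y, groups):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]

        # Fit the scaler on the 9 training folds only, then apply it to the holdout fold
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)

        clf = RandomForestClassifier(n_estimators=500, random_state=42)
        clf.fit(X_train, y_train)
        fold_acc.append(clf.score(X_test, y_test))
    return fold_acc

fold_acc = cross_validate(X, y, groups)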

Results

"""
In: fold_acc
Out:
[0.6632302405498282,
0.7083333333333334,
0.6518918918918919,
0.6404040404040404,
0.7585470085470085,
0.6573511543134872,
0.6778042959427207,
0.6910669975186104,
0.7230392156862745,
0.7825567502986858]

In: np.mean(fold_acc)
Out: 0.6954224928485881
"""

69.5% is right in line with what the authors report in their paper for the top models, so I’m feeling good that this was implemented as they envisioned. They also show that fold 10 was the easiest to score on (we see that too), so we’re in line there as well.

Why not run a hyperparameter search for this task? [very optional]

Here’s where things get a little tricky.

A ‘Normal’ CV Process:

If we could train/test split arbitrarily, we could do something like:

  1. Split off a holdout test set
  2. From the larger train portion, split off a validation set.
  3. Run some type of parameter search algo (say, GridSearchCV) on the train (non-val, non-test).
  4. The GridSearch will run k-fold cross-validation on the train set, splitting it into folds. At the end, an estimator will be refit on the train portion with the best params found in the inner k-fold cross-validation of GridSearchCV.
  5. Then we take that fitted best estimator and score it on the validation set.

Because we have the validation set in step 5, we can repeat steps 3 and 4 a bunch of times on different model families or different parameter search ranges.

Then when we are done we’d take our final model and see if it generalizes using the holdout test set, which we hadn’t touched to that point.

But how is this going to work within our fold-based LeaveOneGroupOut approach? Imagine we tried to set up a GridSearchCV as follows:
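For instance (hypothetical grid, same stand-in classifier as before):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut

# A naive nested setup, shown only to illustrate the problem
param_grid = {"n_estimators": [100, 500], "max_depth": [None, 20]}
logo = LeaveOneGroupOut()

for train_idx, test_idx in logo.split(X, y, groups):
    X_train, y_train = X[train_idx], y[train_idx]

    # GridSearchCV's default inner CV knows nothing about the fold/group structure,
    # so clips from the same source recording can land in both its inner train
    # and inner validation splits
    search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
    search.fit(X_train, y_train)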

But now when GridSearchCV runs the inner split, we’ll run into the same problem that we had solved by using LeaveOneGroupOut!

That is, imagine the first run of this loop, where the test set is fold 1 and the train set is folds 2–10. If we then pass that train set (folds 2–10) into the inner GridSearchCV loop, we’ll end up with inner splits where clips from the same fold appear in both the inner GridSearchCV train and the inner GridSearchCV test sets.

Thus it’s going to end up (very likely) overfitting the choice of best params within the inner GridSearchCV loop.

And hence, I’m not going to run a hyperparameter search within the LeaveOneGroupOut loop.

Next Steps

I’m pretty pleased this correctly implemented the research paper — at least in terms of very closely matching their results.

  1. I’d like to try extracting larger feature vectors per example, and then running these through a few different Keras based NN architectures following the same CV process here
  2. In terms of feature extraction, I’d also like to consider the nuances of misclassifications between classes and see if I can think up better features for the hard examples. For instance, it’s definitely getting confused on the air conditioner vs. engine idling classes. To check this, I have some code in my prior audio blog post that you can use to look at the false positive rate and false negative rate per class: https://github.com/marcmuon/audio_genre_classification/blob/master/model.py#L84-L128

Thanks for reading this far! I intend to do a 2nd part of this post addressing the Next Steps soon. Some other work that might be of interest can be found here:

https://github.com/marcmuon
https://medium.com/@marckelechava

Citations

J. Salamon, C. Jacoby and J. P. Bello, “A Dataset and Taxonomy for Urban Sound Research”, 22nd ACM International Conference on Multimedia, Orlando USA, Nov. 2014.
