
Music Genre Classification with Python

A Guide to analyzing Audio/Music signals in Python

Photo by Jean on Unsplash

Music is like a mirror, and it tells people a lot about who you are and what you care about, whether you like it or not. We love to say "you are what you stream," as Spotify puts it.

Spotify, with a net worth of $26 billion, is the reigning music streaming platform today. It currently has millions of songs in its database and claims to have the right music for everyone. Spotify’s Discover Weekly service has become a hit with millennials. Needless to say, Spotify has invested heavily in research to improve the way users find and listen to music. Machine learning is at the core of that research: from NLP to collaborative filtering to deep learning, Spotify uses them all. Songs are analyzed based on their digital signatures for factors including tempo, acoustics, energy, and danceability to answer that impossible old first-date question: What kind of music are you into?


Objective

Companies nowadays use music classification either to offer recommendations to their customers (as Spotify and SoundCloud do) or simply as a product (for example, Shazam). Determining music genres is the first step in that direction. Machine learning techniques have proved quite successful in extracting trends and patterns from large pools of data, and the same principles apply to music analysis as well.

In this article, we shall study how to analyze an audio/music signal in Python. We shall then utilize the skills learnt to classify music clips into different genres.

Audio Processing with Python

Sound is represented in the form of an audio signal having parameters such as frequency, bandwidth, decibel, etc. A typical audio signal can be expressed as a function of Amplitude and Time.


These sounds are available in many formats, making it possible for the computer to read and analyze them. Some examples are:

  • mp3 format
  • WMA (Windows Media Audio) format
  • wav (Waveform Audio File) format

Audio Libraries

Python has some great libraries for audio processing, like Librosa and PyAudio. There are also built-in modules for some basic audio functionalities.

We will mainly use two libraries for audio acquisition and playback:

1. Librosa

It is a Python module for analyzing audio signals in general, but geared more towards music. It includes the nuts and bolts needed to build a MIR (Music Information Retrieval) system. It is very well documented, with a lot of examples and tutorials.

_For a more advanced introduction that describes the package design principles, please refer to the librosa paper at SciPy 2015._

Installation

pip install librosa
or
conda install -c conda-forge librosa

To fuel more audio-decoding power, you can install FFmpeg, which ships with many audio decoders.

2. IPython.display.Audio

[IPython.display.Audio](https://ipython.org/ipython-doc/stable/api/generated/IPython.display.html#IPython.display.Audio) lets you play audio directly in a jupyter notebook.

Loading an audio file

import librosa

audio_path = '../T08-violin.wav'
x, sr = librosa.load(audio_path)

print(type(x), type(sr))
# <class 'numpy.ndarray'> <class 'int'>
print(x.shape, sr)
# (396688,) 22050

This returns the audio time series as a NumPy array with a default sampling rate (sr) of 22 kHz, mixed down to mono. We can change this behavior by passing:

librosa.load(audio_path, sr=44100)

to resample at 44.1 kHz, or

librosa.load(audio_path, sr=None)

to disable resampling.

The sample rate is the number of samples of audio carried per second, measured in Hz or kHz.
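
As a quick illustration, the duration of the clip follows directly from these two quantities (a small sanity check reusing the x and sr loaded above):

# Duration in seconds = number of samples / sample rate
duration = x.shape[0] / sr  # 396688 / 22050 ≈ 18 seconds
print(round(duration, 2))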

Playing Audio

Use IPython.display.Audio to play the audio:

import IPython.display as ipd
ipd.Audio(audio_path)

This returns an audio widget in the jupyter notebook as follows:

Screenshot of the IPython audio widget

This widget won’t work here, but it will work in your notebooks. I have uploaded the same to SoundCloud so that we can listen to it.

You can even use an mp3 or a WMA format for the audio example.

Visualizing Audio

Waveform

We can plot the audio array using [librosa.display.waveplot](https://librosa.github.io/librosa/generated/librosa.display.waveplot.html#librosa.display.waveplot):

%matplotlib inline
import matplotlib.pyplot as plt
import librosa.display
plt.figure(figsize=(14, 5))
librosa.display.waveplot(x, sr=sr)

Here, we have the plot of the amplitude envelope of a waveform.

Spectrogram

A spectrogram is a visual representation of the spectrum of frequencies of sound or other signals as they vary with time. Spectrograms are sometimes called sonographs, voiceprints, or voicegrams. When the data is represented in a 3D plot, they may be called waterfalls. In 2-dimensional arrays, the first axis is frequency while the second axis is time.

We can display a spectrogram using [librosa.display.specshow](https://librosa.github.io/librosa/generated/librosa.display.specshow.html).

X = librosa.stft(x)
Xdb = librosa.amplitude_to_db(abs(X))
plt.figure(figsize=(14, 5))
librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')
plt.colorbar()

The vertical axis shows frequencies (from 0 to 10kHz), and the horizontal axis shows the time of the clip. Since all action is taking place at the bottom of the spectrum, we can convert the frequency axis to a logarithmic one.

librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='log')
plt.colorbar()

Writing Audio

[librosa.output.write_wav](https://librosa.github.io/librosa/generated/librosa.output.write_wav.html#librosa.output.write_wav) saves a NumPy array to a WAV file.

librosa.output.write_wav('example.wav', x, sr)

Creating an audio signal

Let us now create an audio signal at 220 Hz. An audio signal is a NumPy array, so we shall create one and pass it into the audio function.

import numpy as np

sr = 22050 # sample rate
T = 5.0    # duration in seconds
t = np.linspace(0, T, int(T*sr), endpoint=False) # time variable
x = 0.5*np.sin(2*np.pi*220*t) # pure sine wave at 220 Hz

# Playing the audio
ipd.Audio(x, rate=sr) # load a NumPy array

# Saving the audio
librosa.output.write_wav('tone_220.wav', x, sr)

So here it is: the first sound signal created by you. 🙌

Feature extraction

Every audio signal consists of many features. However, we must extract the characteristics that are relevant to the problem we are trying to solve. The process of extracting features to use for analysis is called feature extraction. Let us study a few of these features in detail.

  • Zero-Crossing Rate

The zero-crossing rate is the rate of sign changes along a signal, i.e., the rate at which the signal changes from positive to negative or back. This feature has been used heavily in both speech recognition and music information retrieval. It usually has higher values for highly percussive sounds like those in metal and rock.

Let us calculate the zero-crossing rate for our example audio clip.

# Load the signal
x, sr = librosa.load('../T08-violin.wav')
#Plot the signal:
plt.figure(figsize=(14, 5))
librosa.display.waveplot(x, sr=sr)
# Zooming in
n0 = 9000
n1 = 9100
plt.figure(figsize=(14, 5))
plt.plot(x[n0:n1])
plt.grid()

There appear to be 6 zero crossings. Let’s verify with librosa.

zero_crossings = librosa.zero_crossings(x[n0:n1], pad=False)
print(sum(zero_crossings))
# 6

  • Spectral Centroid

The spectral centroid indicates where the "center of mass" of a sound is located and is calculated as the weighted mean of the frequencies present in the sound. Consider two songs, one from the blues genre and the other from metal. Compared to the blues song, which stays fairly consistent throughout its length, the metal song has more frequencies towards its end. So the spectral centroid for the blues song will lie somewhere near the middle of its spectrum, while that for the metal song will be towards its end.

[librosa.feature.spectral_centroid](https://librosa.github.io/librosa/generated/librosa.feature.spectral_centroid.html#librosa.feature.spectral_centroid) computes the spectral centroid for each frame in a signal:

import sklearn.preprocessing

spectral_centroids = librosa.feature.spectral_centroid(x, sr=sr)[0]
print(spectral_centroids.shape)
# (775,)

# Computing the time variable for visualisation
frames = range(len(spectral_centroids))
t = librosa.frames_to_time(frames)

# Normalising the spectral centroid for visualisation
def normalize(x, axis=0):
    return sklearn.preprocessing.minmax_scale(x, axis=axis)

# Plotting the spectral centroid along the waveform
librosa.display.waveplot(x, sr=sr, alpha=0.4)
plt.plot(t, normalize(spectral_centroids), color='r')

There is a rise in the spectral centroid towards the end.

  • Spectral Rolloff

It is a measure of the shape of the signal. It represents the frequency below which a specified percentage of the total spectral energy, e.g., 85%, lies.

[librosa.feature.spectral_rolloff](https://librosa.github.io/librosa/generated/librosa.feature.spectral_rolloff.html#librosa.feature.spectral_rolloff) computes the roll-off frequency for each frame in a signal:

spectral_rolloff = librosa.feature.spectral_rolloff(x+0.01, sr=sr)[0]
librosa.display.waveplot(x, sr=sr, alpha=0.4)
plt.plot(t, normalize(spectral_rolloff), color='r')

  • Mel-Frequency Cepstral Coefficients

The Mel-frequency cepstral coefficients (MFCCs) of a signal are a small set of features (usually about 10–20) that concisely describe the overall shape of a spectral envelope. They model the characteristics of the human voice.

Let’s work with a simple loop wave this time.

x, fs = librosa.load('../simple_loop.wav')
librosa.display.waveplot(x, sr=fs)

[librosa.feature.mfcc](https://bmcfee.github.io/librosa/generated/librosa.feature.mfcc.html#librosa.feature.mfcc) computes MFCCs across an audio signal:

mfccs = librosa.feature.mfcc(x, sr=fs)
print(mfccs.shape)
# (20, 97)

# Displaying the MFCCs:
librosa.display.specshow(mfccs, sr=fs, x_axis='time')

Here, mfcc computed 20 MFCCs over 97 frames.

We can also perform feature scaling such that each coefficient dimension has zero mean and unit variance:

import sklearn.preprocessing

mfccs = sklearn.preprocessing.scale(mfccs, axis=1)
print(mfccs.mean(axis=1))
print(mfccs.var(axis=1))
librosa.display.specshow(mfccs, sr=fs, x_axis='time')

  • Chroma Frequencies

Chroma features are an interesting and powerful representation for music audio in which the entire spectrum is projected onto 12 bins representing the 12 distinct semitones (or chroma) of the musical octave.

[librosa.feature.chroma_stft](https://librosa.github.io/librosa/generated/librosa.feature.chroma_stft.html#librosa.feature.chroma_stft) is used for the computation:

# Loading the file
x, sr = librosa.load('../simple_piano.wav')
hop_length = 512
chromagram = librosa.feature.chroma_stft(x, sr=sr, hop_length=hop_length)
plt.figure(figsize=(15, 5))
librosa.display.specshow(chromagram, x_axis='time', y_axis='chroma', hop_length=hop_length, cmap='coolwarm')

Case Study: Classify songs into different genres.

Now that we have an overview of the acoustic signal, its features, and the feature extraction process, it is time to put our newly developed skills to work on a machine learning problem.

Objective

In this section, we will try to model a classifier to classify songs into different genres. Let us assume a scenario in which, for some reason, we find a bunch of randomly named MP3 files on our hard disk that are assumed to contain music. Our task is to sort them by music genre into different folders such as jazz, classical, country, pop, rock, and metal.

Dataset

We will be using the famous GTZAN dataset for our case study. This dataset was used for the well-known genre classification paper "Musical genre classification of audio signals" by G. Tzanetakis and P. Cook, IEEE Transactions on Speech and Audio Processing, 2002.

The dataset consists of 1000 audio tracks, each 30 seconds long. It contains ten genres: blues, classical, country, disco, hip-hop, jazz, reggae, rock, metal, and pop. Each genre consists of 100 sound clips.

Preprocessing the Data

Before training the classification model, we have to transform the raw data from audio samples into more meaningful representations. The audio clips need to be converted from .au format to .wav format to make them compatible with Python’s wave module for reading audio files. I used the open-source SoX utility for the conversion. Here is a handy cheat sheet for SoX conversion.

sox input.au output.wav
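
To convert the whole dataset rather than a single file, a small helper script can loop over the clips and shell out to SoX (a minimal sketch; the genres/<genre>/*.au folder layout is an assumption, so adjust the path to wherever the GTZAN archive is unpacked):

import os
import subprocess

# Walk the (assumed) genres/<genre>/ folders and convert every .au clip to .wav
for root, _, files in os.walk('genres'):
    for name in files:
        if name.endswith('.au'):
            src = os.path.join(root, name)
            dst = src[:-3] + 'wav'
            subprocess.run(['sox', src, dst], check=True)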

Classification

  • Feature Extraction

We then need to extract meaningful features from the audio files. To classify our audio clips, we will choose five features: Mel-frequency cepstral coefficients, spectral centroid, zero-crossing rate, chroma frequencies, and spectral roll-off. All the features are then appended to a .csv file so that classification algorithms can be used; a minimal sketch of this step follows.
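
The sketch below assumes the clips sit in a genres/<genre>/ folder layout and that each feature is summarised by its mean per clip (the data.csv filename and the mean aggregation are illustrative choices, not a prescribed pipeline):

import csv
import os
import numpy as np
import librosa

# Column names: one mean value per feature, 20 MFCCs, and the genre label
header = ['filename', 'chroma_stft', 'spectral_centroid', 'rolloff', 'zero_crossing_rate']
header += [f'mfcc{i}' for i in range(1, 21)]
header.append('label')

with open('data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    for genre in os.listdir('genres'):
        for name in os.listdir(os.path.join('genres', genre)):
            path = os.path.join('genres', genre, name)
            y, sr = librosa.load(path, duration=30)
            row = [name,
                   np.mean(librosa.feature.chroma_stft(y=y, sr=sr)),
                   np.mean(librosa.feature.spectral_centroid(y=y, sr=sr)),
                   np.mean(librosa.feature.spectral_rolloff(y=y, sr=sr)),
                   np.mean(librosa.feature.zero_crossing_rate(y))]
            row += list(np.mean(librosa.feature.mfcc(y=y, sr=sr), axis=1))
            row.append(genre)
            writer.writerow(row)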

  • Classification

Once the features have been extracted, we can use existing classification algorithms to sort the songs into different genres. You can either use the spectrogram images directly for classification or extract the features and run classification models on them, as sketched below.
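
For example, a quick baseline on the resulting CSV could use scikit-learn (a sketch under the same data.csv assumption; the k-nearest-neighbours choice is arbitrary, and any standard classifier would do):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the features written out during feature extraction
data = pd.read_csv('data.csv')
X = data.drop(columns=['filename', 'label']).values
y = data['label'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features, then fit a simple k-nearest-neighbours baseline
scaler = StandardScaler().fit(X_train)
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(scaler.transform(X_train), y_train)

print(accuracy_score(y_test, clf.predict(scaler.transform(X_test))))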

Next Steps

Music Genre Classification is one of the many branches of Music Information Retrieval. From here, you can perform other tasks on musical data like beat tracking, music generation, recommender systems, track separation and instrument recognition, etc. Music analysis is a diverse field and also an interesting one. A music session somehow represents a moment for the user. Finding these moments and describing them is an interesting challenge in the field of Data Science.

