
Build an MFCC-Based Music Recommendation Engine in the Cloud

An approachable guide to building innovative, unstructured-data-driven apps on Microsoft Azure.

GIF by Author


In the past decade, the evolution of mobile technology and cellular networks has reshaped the world in ways no one could have predicted. We live in an era of information explosion, enjoying ever fancier mobile apps driven by ever more affordable cellular data. Walkmans and iPods left our pockets the moment smartphones arrived with music streaming apps like Spotify and Pandora. If you use or have ever used one of them, you are probably aware of the music recommendation list, a.k.a. the "guess you like" feature, that appears while streaming a soundtrack. Music recommendation is a big topic, and there are existing articles exploring the algorithms behind it – cluster analysis on genres, NLP modelling on lyrics, user-based and content-based collaborative filtering, to name but a few. But is there a more intrinsic way to make recommendations based on the audio signal itself? The answer is yes, and this article will run through some basic acoustics and explore the feasibility of a lightweight, audio-feature-based music recommendation system.


1. Sound Pitch

By definition, sound is energy produced by vibration that propagates as a sinusoidal wave at a certain frequency and amplitude through a transmission medium such as air. A piece of music is essentially a sequence of sound waves at different frequencies and amplitudes. The term "pitch" is widely used to describe the perceived audio characteristics of musical instruments; although not completely equivalent, frequency serves as a proxy for pitch. When people say "high-pitch instruments", they are usually referring to orchestral instruments that produce sound waves at higher frequencies, e.g., the trumpet, piccolo or violin. Sound frequency is also strongly associated with mood – one fun fact is that people sometimes claim pianos tuned to A432 are superior to those tuned to A440 because of the alleged "healing frequency" of 432 Hz.

Image by Alexyo.Netcom on wikimedia commons https://commons.wikimedia.org/wiki/File:Estensione_Strumenti_Musicale.jpg

2. Amplitude

The amplitude of a sound wave determines its loudness. When sound is produced, the air compression resulting from the vibration creates a pressure change that our ears can perceive. The perceived pressure change, measured in dB, is also affected by the distance to the source: generally speaking, the sound pressure level decays by about 6 dB for a point source, and about 3 dB for a line source, each time the distance is doubled.
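Quantitatively, a point source follows spherical spreading, \Delta L = 20 \log_{10}(d_2/d_1), so doubling the distance costs 20 \log_{10} 2 \approx 6\,\mathrm{dB}; a line source follows cylindrical spreading, \Delta L = 10 \log_{10}(d_2/d_1), hence roughly 3 dB per doubling.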

3. Digital Recording

When talking about digital audio, there is a key concept we cannot steer clear of – the sample rate. In digital recording, samples are taken along the sound wave at a regular interval; the frequency at which these samples are taken is what we refer to as the sample rate. If you love music, you might have noticed the "44.1kHz/16-bit" tag printed on the back of a CD cover. 44.1 kHz is the standard sample rate for consumer CDs (other digital formats, such as DVD, use 48 kHz). Why this number? Because 20 kHz is generally considered the upper limit the human ear can perceive, and according to the Nyquist–Shannon sampling theorem, the sample rate must be at least twice the maximum frequency of the original audio signal to avoid distortion or aliasing. Is a higher sample rate better? Sound-quality-wise, yes – it is quite common to see Hi-Fi enthusiasts sniff at the "inferior" 44.1kHz/16-bit and pursue premium soundtracks recorded at 96kHz/24-bit or even 192kHz/24-bit, sourced from analogue media like vinyl, in a linear-PCM-based lossless audio format, e.g., WAV, FLAC or APE.
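To see the theorem in action, here is a minimal sketch (the 5 kHz tone and the two sample rates are arbitrary choices for illustration) that samples a sine wave above and below its Nyquist rate:

import numpy as np
import matplotlib.pyplot as plt

f_tone = 5000                      # 5 kHz test tone
fs_good, fs_bad = 44100, 6000      # above vs. below the Nyquist rate of 10 kHz
duration = 0.002                   # 2 ms window

t_dense = np.linspace(0, duration, 10000)       # dense grid approximating the analogue signal
t_good = np.arange(0, duration, 1/fs_good)
t_bad = np.arange(0, duration, 1/fs_bad)

plt.plot(t_dense, np.sin(2*np.pi*f_tone*t_dense), label="original")
plt.plot(t_good, np.sin(2*np.pi*f_tone*t_good), "o", label="sampled at 44.1 kHz")
plt.plot(t_bad, np.sin(2*np.pi*f_tone*t_bad), "s--", label="sampled at 6 kHz (aliased)")
plt.xlabel("Time [s]")
plt.legend()
plt.show()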

Plot by Author

OK, time for some audio analysis. Let me plot the audio samples of the soundtrack "A Lover in Berlin" performed by my favourite Norwegian singer, Kari Bremnes. Normally we expect 2 channels for most soundtracks, and in most cases they are similar. The amplitude is normalised to the range −1 to 1.

import matplotlib.pyplot as plt
from scipy.io import wavfile
import numpy as np
file = r"(01) [Kari Bremnes] A Lover in Berlin.wav"
sample_rate,data = wavfile.read(file)
length = data.shape[0]/sample_rate
fig,axes = plt.subplots(1,2,figsize=(20,7))
time = np.linspace(.0,length,data.shape[0])
for i,j,k in zip(range(axes.size),["Left Channel","Right Channel"],["b","y"]):
    axes[i].plot(time, data[:, i]/data[:, i].max(),color=k)
    axes[i].set_xlabel("Time [s]")
    axes[i].set_ylabel("Amplitude")
    axes[i].set_title(j)
plt.show()
Plot by Author

4. Frequency Domain Signal

As sound waves are time-domain signals, a Fast Fourier Transform (FFT) needs to be performed to obtain the frequency-domain response.

from scipy.fftpack import fft
# Take the left channel
a = data.T[0]
# Normalised fast fourier transformation
c = fft(a)/len(a)
freqs = np.arange(0, (len(a)), 1.0) * (sample_rate*1.0/len(a))
plt.figure(figsize=(20,7))
y=c[:int(len(c))]
# Normalise the amplitude
plt.plot(freqs/1000,abs(y)/max(abs(y)),'r')
plt.xlabel("Frequency (kHz)")
plt.ylabel("Amplitude")
plt.show()
Plot by Author

Wait… why is this FFT plot mirrored? Technically this is called conjugate symmetry, and it is a result of the nature of the Discrete Fourier Transform (DFT). The DFT formula is written as:
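X_k = \sum_{n=0}^{N-1} x_n \, e^{-i 2\pi k n / N}, \qquad k = 0, 1, \dots, N-1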

x_n here denotes the amplitude inputs, which in this case are real numbers; k denotes the current frequency bin, while the outputs X_k are complex numbers encapsulating both amplitude and phase information. According to Euler's formula, the underlying equation can be written as:
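X_k = \sum_{n=0}^{N-1} x_n \left[ \cos\!\left(\frac{2\pi k n}{N}\right) - i\,\sin\!\left(\frac{2\pi k n}{N}\right) \right]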

X_k has either a positive or a negative frequency term at a given n. The magnitude of X_k is essentially the amplitude we want to plot, which can be derived from its real and imaginary parts:
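|X_k| = \sqrt{\operatorname{Re}(X_k)^2 + \operatorname{Im}(X_k)^2}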

As you can see, X_k at a positive frequency and X_k at the corresponding negative frequency have the same magnitude and thus respond the same way in terms of amplitude; in other words, they are complex conjugates of each other. That is to say, we only need the first half of the full plot to tell the story, but given that most features sit in the low-frequency band, I took the first 1/20th instead for an easy showcase:

Plot by Author

As expected, the acoustic character of this soundtrack is well presented, in the sense that rich spikes spread across 200~600 Hz, which is in line with a typical female vocal range. Having said that the FFT response makes a lot of sense here, we cannot rely on it alone for modelling, as the FFT discards the time dimension. Music, to some extent, is just like language: it carries information in a certain order. If you are familiar with NLP, you might know that algorithms like bidirectional LSTM and BERT generally perform better than TF-IDF simply because the input order plays a key role. Similarly, we want to capture that feature here. A spectrogram seems a good approach, as it shows how frequency and amplitude change over time.

plt.figure(figsize=(25,8))
plt.title('Spectrogram - A Lover In Berlin',fontsize=18)
spec, freqs, t, im = plt.specgram(data[:,0],Fs=sample_rate,NFFT=512)
plt.xlabel('Time [s]',fontsize=18)
plt.ylabel('Frequency [Hz]',fontsize=18)
plt.show()
Plot by Author

5. Mel-Frequency Cepstral Coefficients

Well, a new concern arises if we choose to model on the spectrum – it seems unrealistic to take raw frequency as the input, as it is far too granular. Is there any trick that generates a low-cardinality feature from what we have? MFCC (Mel-Frequency Cepstral Coefficients) is one of the ways to go! MFCCs are widely used in voice analytics. Although there are existing articles covering this topic, I still want to briefly go through the concept of the cepstrum. "Cepstrum" is "spectrum" with the first syllable reversed; technically it is obtained by taking the inverse Fourier transform of the logarithm of the original FFT spectrum, and it describes the rate of change across the different spectrum bands. The resulting cepstrum is a signal in the "quefrency" domain.
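As a minimal sketch of that definition (reusing the left-channel samples a from the FFT section), the real cepstrum can be computed in a couple of lines:

from scipy.fftpack import fft, ifft
import numpy as np

# Real cepstrum: inverse FFT of the log-magnitude spectrum
spectrum = fft(a)
log_magnitude = np.log(np.abs(spectrum) + 1e-10)   # small epsilon avoids log(0)
cepstrum = np.real(ifft(log_magnitude))
# The resulting axis is "quefrency", expressed in samples (divide by sample_rate for seconds)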

Plot by Author

The Mel-frequency cepstrum is derived by passing the initial FFT response through a set of band-pass filters known as the Mel filter bank before taking the logarithm. These filters are engineered to mimic the human ear, which is naturally a low-pass filter – this is why most of us cannot perceive high-frequency sounds above 20 kHz. The Mel scale maps the original frequency input through the following formula:
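m = 2595 \, \log_{10}\!\left(1 + \frac{f}{700}\right), where f is the frequency in Hz and m is the corresponding Mel value.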

In a nutshell, the Mel scale is optimised for the human auditory system. The resulting Mel-frequency cepstrum is represented by 13 coefficients (the default in python_speech_features):

from python_speech_features import mfcc
from matplotlib import cm
plt.figure(figsize=(25,8))
mfcc_feat = mfcc(a,sample_rate)
mfcc_data= mfcc_feat.T
plt.imshow(mfcc_data, cmap=plt.cm.jet, aspect='auto',origin='lower')
plt.title('MFC - A Lover In Berlin',fontsize=18)
plt.xlabel('Time [s]',fontsize=18)
plt.ylabel('MFCC',fontsize=18)
plt.show()
Plot by Author

6. Dynamic Time Warping

Given that our ultimate goal is to develop a music recommendation engine, recommendations can be made based on the similarity of these coefficients between songs. Well… how is the similarity calculated? As soundtrack length varies from one song to another, an apples-to-apples comparison isn't straightforward, is it? Dynamic Time Warping (DTW) is designed to solve exactly this problem. By definition, DTW is a time-series alignment algorithm that aligns two sequences of feature vectors by warping the time axis iteratively until an optimal match is found. That means we can use it to calculate the similarity (or distance) between any two input vectors without worrying about their lengths.
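A minimal sketch with fastdtw illustrates the point – two sequences with the same shape but different lengths still score as close, while an unrelated sequence does not (the sequences here are synthetic toy data):

import numpy as np
from fastdtw import fastdtw

x = np.sin(np.linspace(0, 4*np.pi, 120))   # 120 samples
y = np.sin(np.linspace(0, 4*np.pi, 90))    # same waveform, only 90 samples
z = np.random.rand(100)                    # unrelated noise, 100 samples

dis_xy, path_xy = fastdtw(x, y)
dis_xz, path_xz = fastdtw(x, z)
print(dis_xy, dis_xz)   # dis_xy should be much smaller than dis_xz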

Plot by Author

I created a mini music library containing 151 soundtracks in MP3 format at 190kbps covering a wide range of genres like pop, jazz, folk, metal, rock, R&B etc. Let’s take a look at the distribution of all 13 coefficients across my music lib.

from python_speech_features import mfcc
import seaborn as sns
import librosa
import os
tracks = os.listdir(r"soundtracks")
L=[]
for i in tracks:
    data,sample_rate = librosa.load(os.path.join("soundtracks",i),sr=44100)
    # Cut off the first and the last 500 samples
    a = data.T[500:-500]
    a = a/a.max()
    mfcc_feat = mfcc(a,sample_rate)
    mfcc_data= mfcc_feat.T
    L.append(mfcc_data)
L2 = np.array([i.mean(axis=1) for i in L])
fig,axes = plt.subplots(5,3,figsize=(30,10))
for i in range(L2.shape[1]):
    sns.distplot(L2.T[i],ax=axes.ravel()[i])
    axes.ravel()[i].set_title("Coe "+str(i))
    plt.tight_layout()
Plot by Author

It seems all 13 coefficients are roughly normally distributed across my music lib, so I am going to sum the calculated DTW distances over all 13 coefficient vectors as a proxy for the overall similarity of any two soundtracks. Let's see if this approach works:

from fastdtw import fastdtw
import numpy as np
c=[]
for i in range(len(L)):
    group = []
    for n in range(13):
        dis,path=fastdtw(L[2][n],L[i][n])
        group.append(dis)
    total_dis = np.sum(group)
    c.append([total_dis,i])
c.sort(key=lambda x:x[0])
fig,axes = plt.subplots(1,2,figsize=(25,8))
tracks = os.listdir("soundtracks")
for i,j in enumerate([tracks[2],tracks[c[1][1]]]):
    title = "MFC-"+j.replace(".mp3","")
    data,sample_rate = librosa.load(os.path.join("soundtracks",j),sr=44100)
    a = data.T[500:-500]
    a = a/a.max()
    mfcc_feat = mfcc(a,sample_rate)
    mfcc_data= mfcc_feat.T
    axes[i].set_title(title,fontsize=18)
    axes[i].set_xlabel('Time [s]',fontsize=18)
    axes[i].set_ylabel('MFCC',fontsize=18)
    axes[i].imshow(mfcc_data, cmap=plt.cm.jet, aspect='auto',origin='lower')
plt.tight_layout()
Plot by Author

OK… the result indicates that the closest soundtrack to "A Lover In Berlin" is "Samba De Verao" performed by Ono Lisa (小野リサ). Folk pop vs. bossa nova – both stylish female vocal music. Not too bad!

Image by JesterWr, Danniel Shen on wikimedia commons https://commons.wikimedia.org/wiki/File:Lisa_Ono2005(cropped).jpg https://commons.wikimedia.org/wiki/File:Kari-Bremnes2012.jpg

7. Cloud Solution Development

Now let's develop a full solution on Microsoft Azure to host this service. A mini lambda architecture is adopted: the speed layer extracts the metadata of the uploaded soundtrack – the one we want to make recommendations against – and loads it into Azure CosmosDB; the batch layer performs the recommendation logic and loads the resulting recommendation list into a SQL database. All service components are developed in Python and Spark.

Architecture diagram by Author

7.1 Main Workflow

  1. A Flask app is developed and deployed to Azure App Service as the main UI for the initial audio file upload and the subsequent recommended-music streaming.
  2. An Azure SQL database is built to store the music lib metadata, e.g., title, artist, album, genre, release year, soundtrack path and artwork path.
  3. Five blob containers are created within the same Azure Storage account.
  4. Container A catches the initial upload and triggers an Azure Function that collects the metadata via a 3rd-party music recognition API and loads the results, as JSON, into CosmosDB, where they are queried by another Azure Function serving as an API endpoint for the web app.
  5. The initial upload is duplicated to container B as it lands in container A. The duplication is seen as a blob-change event captured by Azure Event Grid, which in turn triggers an Azure Databricks notebook placed within an Azure Data Factory pipeline. The notebook performs the recommendation logic and loads the resulting recommendation list into another table in the Azure SQL database, where it is joined in a view with all the metadata queried by the web app.
  6. Container C stores a parquet file containing the Mel-frequency cepstra of the music lib. The parquet file is registered as a Hive table referenced by Azure Databricks in step 5.
  7. Containers D and E store the music lib soundtracks and artworks, to be streamed and displayed via the web app.
  8. All service credentials, including the CosmosDB connection string, the SQL database connection string, the blob storage connection string and the 3rd-party music recognition API token, are stored in and secured by Azure Key Vault.

7.2 Master Data Preparation

Load soundtracks and artworks to the designated containers.

Image by Author

Create a new table in the SQL database to store the music lib metadata.

Image by Author

Extract the metadata from the audio files and load it into a data frame. The soundtrack and artwork paths are generated with shared access signatures (SAS) via the blob client API.

from mutagen.easyid3 import EasyID3
import pandas as pd
import os
from azure.storage.blob import generate_container_sas
from datetime import datetime,timedelta
songs = os.listdir("soundtracks")
l=[]
for song in songs:
    audio = EasyID3(os.path.join("soundtracks",song))
    meta=[]
    for e in ["title","album","artist","genre","date"]:
        try:
            if e=="date":
                attr = audio[e][0][:4]
            elif e=="title":
                attr = song.replace(".mp3","")
            else:
                attr = audio[e][0]
            meta.append(attr)
        except:
            meta.append(None)
    l.append(meta)
df = pd.DataFrame(l,columns=["TITLE","ALBUM","ARTIST","GENRE","RELEASE_YEAR"])
key=my_key
sas_sound=generate_container_sas('xwstorage', 'soundtracks',key,expiry=datetime.utcnow()+timedelta(days=30),permission='r')
sas_art=generate_container_sas('xwstorage', 'artworks',key,expiry=datetime.utcnow()+timedelta(days=30),permission='r')
df["SOUNDTRACK_PATH"] = "https://xwstorage.blob.core.windows.net/soundtracks/"+df["TITLE"]+".mp3"+"?"+sas_sound
df["ARTWORK_PATH"] = "https://xwstorage.blob.core.windows.net/artworks/"+df["TITLE"]+".jpeg"+"?"+sas_art
Image by Author

Load the resulting data frame to the SQL table we created before.

import sqlalchemy
import pyodbc
from sqlalchemy.engine import URL
cnxn = my_odbc_connection_string
connection_url = URL.create("mssql+pyodbc", query={"odbc_connect": cnxn})
engine = sqlalchemy.create_engine(connection_url)
df.to_sql("SOUNDTRACKS", engine,if_exists="append",index=False)
Image by Author

Dump the music lib Mel-frequency cepstrum into a parquet file and then upload the parquet file to the designated container.

import pandas as pd
from python_speech_features import mfcc
import matplotlib.pyplot as plt
import librosa
import os
tracks = os.listdir(r"soundtracks")
L=[]
for i in tracks:
    print(i)
    data,sample_rate = librosa.load(os.path.join("soundtracks",i),sr=44100)
    a = data.T[500:-500]
    a = a/a.max()
    mfcc_feat = mfcc(a,sample_rate)
    mfcc_data= mfcc_feat.T
    L.append(mfcc_data)
columns = ["COE_"+str(i) for i in range(1,14)]
L2 = [pd.DataFrame(i.T,columns=columns) for i in L]
titles = [i.replace(".mp3","") for i in tracks]
for i,j in zip(L2,titles):
    i["Title"]=j
df=pd.concat(L2)
df.reset_index(drop=True).reset_index().to_parquet("soundtracks.parquet")
Image by Author

7.3 Encryption

As mentioned earlier, all service credentials are stored in and secured by Azure Key Vault.

Image by Author

7.4 Speed Layer

Create a CosmosDB instance to store the metadata of the uploaded audio file.

Create a blob-trigger Azure Function as the starting point. The function cuts a 200 KB sample out of the input blob and POSTs it to a 3rd-party music recognition API for metadata collection.

Have the function bind its outputs to CosmosDB and to another blob container, so that the original input blob is duplicated to where the batch layer needs it while the resulting metadata, in JSON, is loaded into CosmosDB at the same time.
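As a rough sketch of what the function body could look like under the Python programming model (the recognition endpoint, binding names and payload format are placeholders; the real bindings are declared in function.json):

# __init__.py of the blob-trigger function
import requests
import azure.functions as func

RECOGNITION_URL = "https://<music-recognition-service>/identify"   # placeholder endpoint

def main(inputblob: func.InputStream,
         outputblob: func.Out[bytes],
         outputdoc: func.Out[func.Document]) -> None:
    audio = inputblob.read()
    sample = audio[:200 * 1024]                            # first ~200 KB is enough for recognition
    resp = requests.post(RECOGNITION_URL, files={"sample": sample})
    outputdoc.set(func.Document.from_dict(resp.json()))    # metadata -> CosmosDB output binding
    outputblob.set(audio)                                  # duplicate blob -> batch-layer container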

Image by Author
Image by Author

Create an HTTP-trigger Azure Function, with the GET method only, to query the latest record in CosmosDB. This function serves as an API endpoint consumed by the web app.
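A matching sketch for the GET endpoint, assuming a CosmosDB input binding in function.json that selects the most recent document (e.g. SELECT TOP 1 * FROM c ORDER BY c._ts DESC):

import json
import azure.functions as func

def main(req: func.HttpRequest, documents: func.DocumentList) -> func.HttpResponse:
    if documents:
        latest = dict(documents[0])                    # most recent metadata record
        return func.HttpResponse(json.dumps(latest), mimetype="application/json")
    return func.HttpResponse("No record found", status_code=404)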

Image by Author

Once both Azure Functions are deployed, we can simply drop an MP3 file into the target blob container and quickly test in the browser whether the API endpoint works.

Image by Author

7.5 Batch Layer

Create a cluster for Azure Databricks. Load the cepstrum parquet file into a Spark data frame and register it as a Hive table. Given that the cepstrum table has nearly 4 million rows but only 14 columns (13 MFCCs + title), a columnar format like parquet is better suited than a row-oriented format like CSV in this case.
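Inside the notebook, the load itself is short; the storage path and table name below are illustrative:

# Read the cepstrum parquet from blob storage and register it as a Hive table
# (assumes the storage account key has been set in the cluster's Spark config)
cepstrum_df = spark.read.parquet("wasbs://<container>@xwstorage.blob.core.windows.net/soundtracks.parquet")
cepstrum_df.write.mode("overwrite").saveAsTable("mfcc_soundtracks")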

A secret scope needs to be created to interact with Azure Key Vault.

Image by Author

Create a new notebook, rewrite the DTW-based MFCC similarity logic from Python list iteration to Spark data frame manipulation, and load the results into the Azure SQL database via the JDBC connector.

Given that the similarity computation only requires large table aggregations and doesn't involve any machine learning algorithm, a multi-node computing environment has a clear advantage here; this is the fundamental reason why I chose Azure Databricks over Azure ML Service for this operation.
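The write-back to Azure SQL can then go through the JDBC connector, roughly as follows (server, table and secret names are placeholders, with credentials pulled from the Key Vault-backed secret scope):

jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<database>"
user = dbutils.secrets.get(scope="keyvault-scope", key="sql-user")
password = dbutils.secrets.get(scope="keyvault-scope", key="sql-password")

(recommendation_df.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "RECOMMENDATIONS")
    .option("user", user)
    .option("password", password)
    .mode("overwrite")
    .save())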

Image by Author

If the operation succeeds, the recommendation list is loaded into a SQL table and joined into a view.

Place the notebook within an Azure Data Factory pipeline and make it triggerable by a blob-change event in the target container. In this fashion, the notebook is activated as soon as the initial upload is duplicated from the speed-layer container to the batch-layer container.

Image by Author

7.6 Web App

Develop and deploy a Flask app to Azure App Service. The app renders two HTML pages: the first for music file upload, the second for recommended music streaming.
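A stripped-down sketch of the app; the route, templates and container name are placeholders for the real implementation:

import os
from flask import Flask, render_template, request
from azure.storage.blob import BlobClient

app = Flask(__name__)
conn_str = os.environ["BLOB_CONNECTION_STRING"]   # injected from Azure Key Vault via app settings

@app.route("/", methods=["GET", "POST"])
def upload():
    if request.method == "POST":
        file = request.files["soundtrack"]
        blob = BlobClient.from_connection_string(
            conn_str, container_name="uploads", blob_name=file.filename)
        blob.upload_blob(file.read(), overwrite=True)   # lands in container A and triggers the speed layer
        return render_template("recommendations.html")
    return render_template("upload.html")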

Image by Author

7.7 Web App Test

Finally… demo time. I uploaded a soundtrack with no embedded metadata, "Lunchbox" by Marilyn Manson. The song was recognised precisely, along with the top 5 similar songs – and one of them is performed by the same artist!

