Audio Classification with Pre-trained VGG-19 (Keras)

Asad Mahmood
In this post, I’ll tackle the problem of audio classification. I’ll train an SVM classifier on features that a pre-trained VGG-19 extracts from waveform plots of the audio clips. The main idea is to show the power of pre-trained models and the ease with which they can be applied.

I wanted to evaluate this approach on real-world data, so I chose to classify the sounds of supercars and heavy bikes. These are the steps I followed to build the classifier:

Downloading Audio Files from YouTube

First, I selected the YouTube videos I wanted the audio from, and then I used the following piece of code to download the audio in .mp3 format.

from __future__ import unicode_literals
import youtube_dl

# Download the best available audio stream and let FFmpeg convert it to mp3.
ydl_opts = {
    'format': 'bestaudio/best',
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'mp3',
        'preferredquality': '192',
    }],
}

# for bike sounds : https://www.youtube.com/watch?v=sRdRwHPjJPk
# for car sounds : https://www.youtube.com/watch?v=PPdNb-XQXR8
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    # Replace the placeholder with one of the links above.
    ydl.download(['<youtube video link>'])

Converting Audio Files from .mp3 to .wav

After downloading the .mp3 files, I converted them to .wav files using the following piece of code.

from pydub import AudioSegment
sound = AudioSegment.from_mp3("car.mp3")
sound.export("car.wav", format="wav")
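
After converting, it can help to confirm that the .wav file has the sample rate and duration you expect before cutting it into chunks. The following is a minimal sketch using only Python's standard `wave` module. The helper names `wav_info` and `write_test_tone` are hypothetical, and the synthetic tone merely stands in for a converted file such as car.wav so that the snippet runs on its own.

```python
import math
import os
import struct
import tempfile
import wave


def wav_info(path):
    """Return (channels, sample_rate, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
        frames = wf.getnframes()
    return channels, rate, frames / float(rate)


def write_test_tone(path, seconds=2, rate=8000, freq=440.0):
    """Write a mono 16-bit sine tone (a stand-in for a real converted file)."""
    n = int(seconds * rate)
    samples = (
        int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / rate))
        for i in range(n)
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes = 16-bit samples
        wf.setframerate(rate)
        wf.writeframes(b"".join(struct.pack("<h", s) for s in samples))


# Hypothetical temporary file; point wav_info at your own .wav instead.
tmp_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
write_test_tone(tmp_path)
info = wav_info(tmp_path)
print(info)
```

Calling `wav_info("car.wav")` on the real file should report how many seconds of audio are available for the chunking step.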

Extracting Chunks of Audios

In the next step, I cut the .wav files into 15-second chunks.

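from pydub import AudioSegment
import os

# Create the output directory if it does not exist yet.
if not os.path.exists("bike"):
    os.makedirs("bike")

# Load the source file once instead of on every iteration.
audio = AudioSegment.from_wav("bikes.wav")

count = 1  # running index used to name the chunk files
for i in range(1, 1000, 15):
    t1 = i * 1000  # pydub works in milliseconds
    t2 = (i + 15) * 1000
    chunk = audio[t1:t2]
    # Exports to a wav file inside the bike/ directory.
    chunk.export('bike/' + str(count) + '.wav', format="wav")
    count += 1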
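
The loop above always runs 67 times (`range(1, 1000, 15)`), so it implicitly assumes the recording is at least about 1,006 seconds long. As far as I know, pydub does not raise an error when a slice runs past the end of the audio; it simply returns a shorter (possibly empty) segment, which would then be exported as a short or empty chunk. A minimal sketch of computing only the full-length windows from the actual duration follows. The function name `chunk_bounds_ms` is hypothetical; with pydub, the total duration in milliseconds is available as `len(audio)`.

```python
def chunk_bounds_ms(total_ms, chunk_s=15, start_s=1):
    """Return (start, end) millisecond pairs for full-length chunks only.

    start_s=1 mirrors the original loop, which skips the first second.
    """
    step = chunk_s * 1000
    t1 = start_s * 1000
    bounds = []
    while t1 + step <= total_ms:
        bounds.append((t1, t1 + step))
        t1 += step
    return bounds


# Example with a hypothetical 50-second recording.
bounds = chunk_bounds_ms(50000)
print(bounds)
```

Each pair can be used directly as a slice, for example `audio[t1:t2]`, so no chunk is shorter than 15 seconds.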

Plotting Amplitude Waveforms

The next step was to plot the waveforms of these audio chunks. This was done with the following code.

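from scipy.io.wavfile import read
import matplotlib.pyplot as plt
from os import walk
import os

# Create the output directory if it does not exist yet.
if not os.path.exists("carPlots"):
    os.makedirs("carPlots")

# Collect the chunk filenames from the top level of car/.
car_wavs = []
for (_, _, filenames) in walk('car'):
    car_wavs.extend(filenames)
    break

for car_wav in car_wavs:
    # read audio samples; read() returns (sample_rate, data)
    input_data = read("car/" + car_wav)
    audio = input_data[1]
    # plot the first 1024 samples
    plt.plot(audio[0:1024])
    # label the axes
    plt.ylabel("Amplitude")
    plt.xlabel("Time")
    # set the title
    # plt.title("Sample Wav")
    # save the plot
    plt.savefig("carPlots/" + car_wav.split('.')[0] + '.png')
    # start a fresh figure so the plots do not pile up on each other
    plt.close()
    # plt.show()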
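
Note that each plot covers only the first 1024 samples of its chunk. A quick back-of-the-envelope calculation shows how short that window is. The 44,100 Hz sample rate below is an assumption (a common rate for audio); the real value is the first element returned by `read()`.

```python
def window_duration_ms(n_samples, sample_rate):
    """Duration in milliseconds covered by n_samples at sample_rate Hz."""
    return 1000.0 * n_samples / sample_rate


ASSUMED_RATE = 44100  # assumption: use the rate returned by read() instead
plotted_ms = window_duration_ms(1024, ASSUMED_RATE)

# Fraction of a 15-second chunk that ends up in the plot.
chunk_ms = 15 * 1000
fraction = plotted_ms / chunk_ms
print(plotted_ms, fraction)
```

Under that assumption the image shows roughly 23 ms of audio, a small fraction of the 15-second chunk, so plotting a longer slice is an easy variation to experiment with.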

Extracting Features and Training LinearSVM

Once I had these waveform plots for both cars and bikes, I extracted features from the images to feed into a linear SVM (scikit-learn's LinearSVC) for classification. To extract the features, I passed each image through the pre-trained VGG-19 model and took the output of its flatten layer as the feature vector. I then created a 70/30 train-test split and trained the classifier. The code for that follows.

import os
from keras.applications.vgg19 import VGG19
from keras.preprocessing import image
from keras.applications.vgg19 import preprocess_input
from keras.models import Model
import numpy as np

base_model = VGG19(weights='imagenet')
model = Model(inputs=base_model.input, outputs=base_model.get_layer('flatten').output)

def get_features(img_path):
    # Load the plot at the input size VGG-19 expects.
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)  # add the batch dimension
    x = preprocess_input(x)
    flatten = model.predict(x)
    return list(flatten[0])

X = []
y = []

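# Label convention: 0 = car, 1 = bike.
car_plots = []
for (_, _, filenames) in os.walk('carPlots'):
    car_plots.extend(filenames)
    break

for cplot in car_plots:
    X.append(get_features('carPlots/' + cplot))
    y.append(0)

bike_plots = []
for (_, _, filenames) in os.walk('bikePlots'):
    bike_plots.extend(filenames)
    break

for cplot in bike_plots:
    X.append(get_features('bikePlots/' + cplot))
    y.append(1)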

from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42, stratify=y)

clf = LinearSVC(random_state=0, tol=1e-5)
clf.fit(X_train, y_train)

predicted = clf.predict(X_test)

# get the accuracy
print (accuracy_score(y_test, predicted))
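
To get a feel for what the SVM is trained on, it helps to work out the length of each feature vector. VGG-19 applies five 2x2 max-pooling stages, each halving the spatial size, and its last convolutional block has 512 channels. The small sketch below does that arithmetic; the function name `vgg_flatten_size` is hypothetical and is not part of Keras.

```python
def vgg_flatten_size(input_side=224, n_pools=5, channels=512):
    """Length of the flatten-layer output for a square input image."""
    side = input_side // (2 ** n_pools)  # 224 -> 7 after five halvings
    return side * side * channels


size = vgg_flatten_size()
print(size)  # 25088
```

So every waveform image is represented by 25,088 numbers, which is far more than the number of chunks produced earlier. A linear classifier such as LinearSVC is a reasonable fit for this kind of high-dimensional, small-sample setting.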

This simple model achieved an accuracy of 97% on the test set, which shows how powerful these pre-trained models are and how easily anyone can build a tool with them.

One application I can think of is a Chrome extension that tells whether the audio of a video on a webpage contains explicit noises. I encourage beginners reading this post to think of a new problem and solve it using the method presented here.

