Sound Event Classification: A to Z

Chathuranga Siriwardhana
Towards Data Science
7 min read · Sep 26, 2019


Audio (sound) is one of the main forms of sensory information we use to perceive our environment. Almost every action or event in our surroundings has its own unique sound. Audio has three main attributes that help us distinguish between two sounds.

  • Amplitude — Loudness of the sound
  • Frequency — The pitch of the sound
  • Timbre — The quality or identity of the sound (e.g. the difference in sound between a piano and a violin)

Let’s say a sound event is an audio clip generated by an action. The action can be speaking, humming, finger-snapping, walking, pouring water, etc. As humans, we have been trained since early childhood to recognize events by their sounds. One could say that humans are very efficient at learning and recognizing new sound events. Listening to a podcast, for example, relies on sound recognition alone to make sense of it. At other times, we combine sound recognition with other sensory information to perceive our surroundings.

However, recognizing audio events systematically (preferably with a computer program or an algorithm) is very challenging. This is mainly because of:

  • The noisiness of recorded sound clips — transducer noise and background noise.
  • An event can occur at various loudness levels and over various time durations.
  • Having only a limited number of examples to feed into an algorithm.

Now let’s come to the sound classification part. Say the problem is to classify a given audio clip as one of the following events using sound alone.

  • Calling — Talking on a phone
  • Clapping
  • Falling — A person falls on the ground
  • Sweeping — Sweeping the floor using a broom
  • Washing Hands — Washing hands at a sink with a water tap
  • Watching TV — TV sound
  • Entering / Exiting — Door opening or closing sound
  • Other — None of the above events

The audio data used in this post come from the Assisted Living Dataset collected at the University of Moratuwa, Sri Lanka. The audio clips were recorded as .wav files with two MOVO USB omnidirectional microphones. Sound event classification is done in three steps: audio preprocessing, feature extraction and classification.

Audio preprocessing

First, we need a way to represent the audio clips (.wav files). Then, the audio data should be preprocessed so that it can be used as input to machine learning algorithms. The Librosa library provides some useful functionality for processing audio with Python. Each audio file is loaded into a NumPy array using Librosa. The array contains the amplitudes of the audio clip sampled at a rate called the ‘sampling rate’ (usually 22050 or 44100 Hz).

Having loaded an audio clip into an array, the first challenge, noisiness, should be addressed. Audacity uses a ‘spectral gating’ algorithm to suppress noise in audio. A good implementation of this algorithm can be found in the noisereduce Python library, and the algorithm itself is described in the library’s documentation.

import librosa
import noisereduce as nr

# Load audio file
audio_data, sampling_rate = librosa.load(<audio_file_path.wav>)

# Noise reduction: use the first 25,000 samples as the noise profile
noisy_part = audio_data[0:25000]
reduced_noise = nr.reduce_noise(audio_clip=audio_data, noise_clip=noisy_part, verbose=False)

# Visualize
print("Original audio file:")
plotAudio(audio_data)
print("Noise removed audio file:")
plotAudio(reduced_noise)
Visualization of the Noise reduction with waveform plots
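
The plotAudio helper used above is not part of Librosa; it is simply a small waveform-plotting utility. A minimal sketch of such a function, assuming Matplotlib and a default sampling rate of 22050 Hz, could look like this:

import numpy as np
import matplotlib.pyplot as plt

def plotAudio(signal, sr=22050):
    # Hypothetical waveform-plotting helper (not from Librosa); sr is assumed to be 22050 Hz
    times = np.arange(len(signal)) / sr
    plt.figure(figsize=(12, 3))
    plt.plot(times, signal)
    plt.xlabel("Time (s)")
    plt.ylabel("Amplitude")
    plt.show()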

The resulting audio clip still contains stretches of unnecessary silence. Let’s trim the leading and trailing parts that are quieter than a threshold loudness level.

trimmed, index = librosa.effects.trim(reduced_noise, top_db=20, frame_length=512, hop_length=64)

print("Trimmed audio file:")
plotAudio(trimmed)
Noise Reduced and silence trimmed audio file waveform

Feature Extraction

The preprocessed audio files cannot be used directly to classify sound events. We have to extract features from the audio clips to make the classification process more efficient and accurate. Let’s extract the absolute values of the Short-Time Fourier Transform (STFT) from each audio clip. To calculate the STFT, a Fast Fourier Transform window size (n_fft) of 512 is used. According to the equation n_stft = n_fft/2 + 1, 257 frequency bins (n_stft) are calculated over a window size of 512. The window is moved by a hop length of 256 so that consecutive windows overlap when calculating the STFT.

import numpy as np

stft = np.abs(librosa.stft(trimmed, n_fft=512, hop_length=256, win_length=512))

The number of frequency bins, the window length and the hop length were determined empirically for this dataset. There is no universal set of values for these feature-generation parameters. This will be discussed again in the Tuning and Enhancing Results section.

Let’s review what the absolute STFT features of an audio clip mean. Consider an audio clip with t samples, and say we obtain f frequency bins in the STFT. Let the window length be w and the window’s hop length be h. When calculating the STFT, a series of windows is obtained by sliding a fixed window of length w in steps of h. This produces 1 + (t − w)/h windows. For each such window, the amplitudes of frequency bins covering the range 0 to sampling_rate/2 (in Hz) are recorded. The frequency range is divided equally to determine the values of the frequency bins. For example, consider STFT bins with n_fft=16 and a sampling rate of 22050. Then there will be 9 frequency bins with the following values (in Hz).

[0, 1378.125, 2756.25, 4134.375, 5512.5, 6890.625, 8268.75, 9646.875, 11025]

The absolute STFT features of an audio clip therefore form a 2-dimensional array containing these frequency-amplitude bins for each window.
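
As a quick check of the example above, Librosa can compute these bin frequencies directly with librosa.fft_frequencies:

import librosa

# Frequency values of the STFT bins for n_fft=16 at a 22050 Hz sampling rate
freqs = librosa.fft_frequencies(sr=22050, n_fft=16)
print(len(freqs))  # 9 bins, i.e. n_fft/2 + 1
print(freqs)       # [0., 1378.125, 2756.25, ..., 11025.]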

Since sound events have different durations (numbers of samples), the 2-D feature arrays are flattened by taking the mean across the windows, producing one value per frequency bin. Thus, every audio clip is represented by an array of fixed size 257 (the number of STFT frequency bins). This may seem like a poor representation of the audio clip since it discards temporal information, but each audio event has its own characteristic frequency range. For example, the scratching sound of the Sweeping event gives its feature vector larger high-frequency amplitudes than the Falling event. Finally, the feature vector is normalized with min-max normalization. Normalization is applied so that every sound event lies on a common loudness level. (As introduced earlier, amplitude is an audio attribute, but we should not use amplitude to differentiate sound events in this use case.) The following figure shows the normalized STFT frequency signatures of sound events captured by the absolute STFT features.

Normalized STFT Frequency signatures of Sweeping (right) and Falling (left) sound events
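
A minimal sketch of the flattening and min-max normalization step described above (the function name and the use of NumPy here are my own choices, not taken from the original code):

import numpy as np
import librosa

def extract_features(samples, n_fft=512, hop_length=256):
    # Absolute STFT, averaged over the windows -> fixed-length vector of size n_fft/2 + 1 (here 257)
    stft = np.abs(librosa.stft(samples, n_fft=n_fft, hop_length=hop_length, win_length=n_fft))
    mean_stft = np.mean(stft, axis=1)
    # Min-max normalization so that loudness differences do not influence the classifier
    return (mean_stft - mean_stft.min()) / (mean_stft.max() - mean_stft.min())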

Event Classification

Now the sound events are preprocessed and represented efficiently by STFT features. These features are used to train a fully connected neural network (NN), which is then used to classify new sound events. The NN consists of 5 fully connected layers with 256, 256, 128, 128 and 8 neurons respectively. The hidden layers use the ReLU activation function, the 4ᵗʰ layer is followed by a dropout layer to reduce overfitting to the training data, and the output layer uses a softmax activation for the 8 classes. The model can be easily built using Keras Dense layers and is compiled with the ‘Adam’ optimizer and categorical cross-entropy loss.

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

num_labels = 8  # number of sound event classes

# build model
model = Sequential()
model.add(Dense(256, input_shape=(257,)))
model.add(Activation('relu'))

model.add(Dense(256))
model.add(Activation('relu'))

model.add(Dense(128))
model.add(Activation('relu'))

model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(num_labels))
model.add(Activation('softmax'))  # softmax output for the 8 classes

model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
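
The training call itself is not shown in the article; a hypothetical sketch, assuming the features and one-hot encoded labels have already been collected into NumPy arrays X_train, y_train, X_val and y_val, could be:

# X_train/X_val: feature vectors of shape (n_samples, 257)
# y_train/y_val: one-hot encoded labels of shape (n_samples, 8)
# epochs and batch_size are illustrative values, not from the article
model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=100, batch_size=32, verbose=1)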

The above NN model was trained on 880 samples and validated on 146 samples. It classifies unseen audio events with an average accuracy of 93%. The following figure shows the normalized confusion matrix of the model’s predictions.

Confusion Matrix of the NN model’s prediction
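
A row-normalized confusion matrix like the one above can be produced with scikit-learn; a sketch assuming the validation arrays from the previous snippet:

import numpy as np
from sklearn.metrics import confusion_matrix

# Predicted and true class indices for the validation set (array names are assumptions)
y_pred = np.argmax(model.predict(X_val), axis=1)
y_true = np.argmax(y_val, axis=1)

cm = confusion_matrix(y_true, y_pred)
cm_normalized = cm / cm.sum(axis=1, keepdims=True)  # each row sums to 1
print(cm_normalized)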

Tuning and Enhancing Results

We have obtained a fairly good model by now. Can we do even better? There will always be a better model; the limit is how good we want the classification to be. Looking at the confusion matrix, we can observe that some activities, such as Watching TV, are misclassified with an error of 50%. Therefore, we should consider tuning the model as well as the preprocessing and feature extraction steps.

We can programmatically check which values work best for the number of frequency bins, the window length and the hop length in feature extraction. A visualization method can also be used to choose good values for these parameters. The feature type (here, the absolute values of the STFT) can likewise be chosen empirically. The machine learning model itself also needs to be tuned: the number of layers, dropout layer properties, activation functions, optimizers and learning rates can all be determined in the same manner.
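
One way to do this programmatically is to loop over a small grid of STFT parameters, regenerate the features and re-evaluate the model for each combination. The skeleton below only shows the loop structure on a placeholder clip; in practice the print statement would be replaced by training and validating a model on the regenerated features (the parameter grids are illustrative, not from the article):

import numpy as np
import librosa

sr = 22050
clip = np.random.randn(sr * 2)  # 2-second placeholder clip; use the real dataset in practice

for n_fft in (256, 512, 1024):
    for hop_length in (n_fft // 4, n_fft // 2):
        stft = np.abs(librosa.stft(clip, n_fft=n_fft, hop_length=hop_length))
        feature = np.mean(stft, axis=1)  # fixed-length vector of size n_fft/2 + 1
        print(f"n_fft={n_fft}, hop_length={hop_length}, feature length={len(feature)}")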

Note: Tuning has been done to some extent to improve the accuracy of the above neural network. You can still improve it further.

As a next step, we can implement a continuous sound event classifier which accepts a stream of audio and classifies sound events in real time. I hope you found this article useful. A companion article on continuous audio classification will be published in the near future.

Find the full Python code in this Git repository.
