Decoding the Symphony of Sound: Audio Signal Processing for Musical Engineering

The Ultimate Guide to Time and Frequency Domain Audio Feature Extraction using Python

Naman Agrawal
Towards Data Science


Image by OpenClipart-Vectors from Pixabay

Contents

  1. Introduction
  2. Time Domain Feature Extraction
    2.1 Basics of Audio Signal Processing: Frame Size and Hop Length
    2.2 Feature 1: Amplitude Envelope
    2.3 Feature 2: Root Mean Square Energy
    2.4 Feature 3: Crest Factor
    2.5 Feature 4: Zero Crossing Rate
  3. Frequency Domain Feature Extraction
    3.1 Feature 5: Band Energy Ratio
    3.2 Feature 6: Spectral Centroid
    3.3 Feature 7: Spectral Bandwidth
    3.4 Feature 8: Spectral Flatness
  4. Conclusion
  5. References

Introduction

The ability to process and analyze data of different kinds to obtain practical insights is one of the most vital skills of the information age. Data is all around us: from the books we read to the movies we watch, from the Instagram posts we like to the music we listen to. In this article, we will try to understand the basics of Audio Signal Processing:

  1. How does a computer read an audio signal?
  2. What are time and frequency domain features?
  3. How are these features extracted?
  4. Why do these features need to be extracted?

In particular, we will cover the following features in detail:

  • Time Domain features: Amplitude envelope, root mean square energy, crest factor (and peak-to-average power ratio), zero crossing rate.
  • Frequency Domain Features: Band energy ratio, spectral centroid, spectral bandwidth (spread), spectral flatness.

We will describe the theory and write Python codes from scratch to extract each of these features for audio signals from 3 different musical instruments: Acoustic Guitar, Brass, and Drum Set. The sample audio data files used can be downloaded here: https://github.com/namanlab/Audio-Signal-Processing-Feature-Extraction

The entire code file is also available at the above repository or can be accessed via this link: https://github.com/namanlab/Audio-Signal-Processing-Feature-Extraction/blob/main/Audio_Signal_Extraction.ipynb

Time Domain Feature Extraction

Let’s begin by recalling what sound is and how we perceive it. As some of you might remember from your high school lessons, sound is the propagation of vibrations through a medium. The production of sound sets the surrounding air molecules into vibration, which manifests in alternating regions of compression (high pressure) and rarefaction (low pressure). These compressions and rarefactions travel through the medium and reach our ears, allowing us to perceive sound. Thus, the propagation of sound involves the transmission of these pressure variations over time.

The time domain representation of sound captures and analyzes these pressure variations by sampling the sound wave at discrete points in time (typically using a digital audio recording technique). Each sample represents the sound pressure level at a specific moment. By plotting these samples, we obtain a waveform that shows how the sound pressure level changes over time. The horizontal axis represents time, while the vertical axis represents the amplitude or intensity of the sound, usually scaled to lie between -1 and 1, where positive values indicate compression and negative values indicate rarefaction. This gives us a visual representation of the sound wave’s characteristics, such as its amplitude, frequency, and duration.

Basics of Sound Propagation [Image by Author]

In order to extract the waveform of a given audio using Python, we begin by loading the required packages:

import numpy as np
import matplotlib.pyplot as plt

import librosa
import librosa.display
import IPython.display as ipd
import scipy as spp

NumPy is a popular Python package for processing and working with arrays and matrices. It offers a vast range of tools, from linear algebra routines to utilities that make many everyday tasks simpler!

librosa is a Python package for audio processing and analysis, and it contains several functions and tools that make it quite easy to extract different kinds of audio features. As mentioned earlier, we’ll be analyzing the waveforms of 3 different musical instruments: Acoustic Guitar, Brass, and Drum Set. You may download the audio files from the link shared earlier and place them in your local repository. To listen to the audio files, we use IPython.display. The code is given below:

# Listen to the audio files
# Ensure correct relative / absolute path to the sound files.

acoustic_guitar_path = "acoustic_guitar.wav"
ipd.Audio(acoustic_guitar_path)

brass_path = "brass.wav"
ipd.Audio(brass_path)

# Keep volume low!
drum_set_path = "drum_set.wav"
ipd.Audio(drum_set_path)

Next, we load the music files in librosa using the function librosa.load(). This function parses the audio file and returns two objects:

  1. y (NumPy Array): Contains the amplitude values for different time intervals. Try printing the array to see it yourself!
  2. sr (number > 0): Sampling Rate

The sampling rate refers to the number of samples taken per unit of time while converting an analog signal into its digital representation. As discussed above, the pressure variation across the medium constitutes an analog signal, one that has a waveform that continuously varies in time. Theoretically, storing continuous data would require an infinite amount of space. Thus, in order to process and store these analog signals digitally, they need to be converted into a discrete representation. This is where sampling steps in to capture snapshots of the sound wave at discrete (uniformly spaced) time intervals. The spacing between these intervals is captured by the inverse of the sampling rate.

The sampling rate determines how frequently samples are taken from the analog signal and is therefore measured in samples per second or hertz (Hz). A higher sampling rate means more samples are taken every second, resulting in a more accurate representation of the original analog signal, but requiring more memory resources. In contrast, a lower sampling rate means fewer samples are taken every second, resulting in a less accurate representation of the original analog signal, but requiring fewer memory resources.
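
For instance, the number of samples returned by librosa.load(), divided by the sampling rate, gives the duration of the clip in seconds. A quick sanity-check sketch (using the guitar file path defined above and the imports from earlier):

# Quick check: number of samples / sampling rate = duration in seconds.
y, sr = librosa.load(acoustic_guitar_path, sr=22050)
duration_sec = len(y) / sr
print(f"{len(y)} samples at {sr} Hz correspond to {duration_sec:.2f} seconds of audio")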

The default sampling rate in librosa is 22050 Hz. However, depending on the application and memory constraints, the user may choose a lower or higher sampling rate, which can be specified via the sr argument of librosa.load(). While choosing an appropriate sampling rate for analog-to-digital conversion, it is important to know about the Nyquist-Shannon sampling theorem, which states that in order to accurately capture and reconstruct an analog signal, the sampling rate must be at least twice the highest frequency component present in the audio signal (this minimum rate is called the Nyquist rate).

By sampling at or above the Nyquist rate, we can avoid a phenomenon called aliasing, which can distort the original signal. The discussion on aliasing isn’t particularly relevant for the purpose of this article. If you’re interested in reading more about it, here’s an excellent source: https://thewolfsound.com/what-is-aliasing-what-causes-it-how-to-avoid-it/

Following is the code to read the audio signals:

# Load music in librosa
sr = 22050
acoustic_guitar, sr = librosa.load(acoustic_guitar_path, sr = sr)
brass, sr = librosa.load(brass_path, sr = sr)
drum_set, sr = librosa.load(drum_set_path, sr = sr)

In the above example, the sampling rate is set to 22050 (which is also the default rate). Executing the above code returns 3 arrays, each of which stores the amplitude values at discrete time intervals (determined by the sampling rate). Next, we visualize the waveforms for each of the 3 audio samples using librosa.display.waveshow(). Some transparency has been added (by setting alpha = 0.5) for clearer visualization of the amplitude density across time.

def show_waveform(signal, name=""):
    # Create a new figure with a specific size
    plt.figure(figsize=(15, 7))
    # Display the waveform of the signal using librosa
    librosa.display.waveshow(signal, alpha=0.5)
    # Set the title of the plot
    plt.title("Waveform for " + name)
    # Show the plot
    plt.show()

show_waveform(acoustic_guitar, "Acoustic Guitar")
show_waveform(brass, "Brass")
show_waveform(drum_set, "Drum Set")
Waveform for Acoustic Guitar [Image by Author]
Waveform for Brass [Image by Author]
Waveform for Drum Set [Image by Author]

Take some time to review the above plots. Think about the patterns that you see. In the waveform of the acoustic guitar, we can identify a periodic pattern characterized by regular oscillations in the amplitude, which reflects the harmonically rich nature of the guitar’s sound. The oscillations correspond to the vibrations produced by the plucked strings that generate a complex waveform consisting of multiple harmonics contributing to the characteristic tone and timbre of the guitar.

Likewise, the brass waveform also exhibits a periodic pattern resulting in a consistent pitch and timbre. Brass instruments produce sound through the buzzing of the musician’s lips into a mouthpiece. This buzzing action generates a waveform with distinct harmonics and a regular pattern of amplitude variations.

In contrast, the waveform of a drum set does not display a clear periodic pattern as drums produce sound through the impact of a drumstick or hand on a drumhead or other percussion surfaces creating complex and irregular waveforms, with varying amplitudes and duration. The absence of a discernible periodic pattern reflects the percussive and non-tonal nature of drum sounds.

Basics of Audio Signal Processing: Frame Size and Hop Length

Before discussing the vital time-domain audio features, it is imperative to talk about two vital feature extraction parameters: Frame Size and Hop Length. Usually, once a signal has been digitally processed, it is split into frames (a set of discrete time intervals that may or may not be overlapping). The frame length describes the size of these frames, while the hop length encapsulates information about how much the frames overlap. But, why is framing important?

The purpose of framing is to capture the time variation in different features of the signal. Usual feature extraction methods give a one-number summary of the input signal (e.g., mean, min, or max). The problem with using these feature extraction methods directly is that this completely annihilates any information associated with time. For example, if you’re interested in computing the mean amplitude of your signal, you get a single-number summary, say x. However, naturally, there are intervals when the mean is lower and others when it is higher. Taking a single-number summary eliminates any information about the time variation of the mean. The solution is to split the signal into frames, e.g., [0 ms, 10 ms), [10 ms, 20 ms), … The mean is subsequently computed for the portion of the signal in each of these time frames, and this collective set of values gives the final extracted feature vector, a time-dependent feature summary, ain’t that cool!

Now, let’s talk about the two parameters in detail:

  • Frame Size: Describes the size of each frame. For example, if the frame size is 1024, you include 1024 samples in every frame and compute the necessary features for each of these sets of 1024 samples. In general, it is recommended to have the frame size as a power of 2. The reasoning behind this isn’t important for the purpose of this article. But if you’re curious, it’s because the Fast Fourier Transform (a very efficient algorithm to transform a signal from the time domain to the frequency domain) is fastest when the frame size is a power of 2. We will be talking more about Fourier transforms in the subsequent sections.
  • Hop Length: Refers to the number of samples by which a frame is advanced at each step across the sequence of data i.e., the number of samples we shift to the right before generating a new frame. It may be useful to think of the frame as a sliding window that moves across the signal in steps defined by the hop length. At each step, the window is applied to a new section of the signal or sequence, and feature extraction is performed on that segment. The hop length, therefore, determines the overlap between consecutive audio frames. A hop length equal to the frame size means there is no overlap, as each frame starts exactly where the previous one ends. However, to mitigate the impact of a phenomenon called spectral leakage (which occurs when converting a signal from its time domain to the frequency domain), a windowing function is applied, resulting in the loss of data around the edges of each frame (the technical explanation is beyond the purpose of this article, but if you’re curious, feel free to check this link: https://dspillustrations.com/pages/posts/misc/spectral-leakage-zero-padding-and-frequency-resolution.html). Thus, often intermediate hop lengths are chosen to preserve the edge samples, resulting in varying degrees of overlap between frames.

In general, a smaller hop length provides a higher temporal resolution allowing us to capture more details and rapid changes in the signal. However, it also increases memory requirements. Conversely, a larger hop length reduces the temporal resolution but also helps reduce space complexity.

Frame Size and Hop Length [Image by Author]

Note: For clearer visualization, the frame size is shown to be quite large in the above image. For practical purposes, the chosen frame size is much smaller (typically on the order of a thousand samples, i.e., a few tens of milliseconds).
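
To make the idea concrete, here is a minimal sketch of how a signal can be split into frames. The function split_into_frames() and the dummy signal are only illustrative (they are not part of the feature extraction code later in this article), but they use the same looping pattern:

# A toy signal: one second of random "audio" at 22050 Hz.
dummy_signal = np.random.uniform(-1, 1, 22050)

def split_into_frames(signal, frame_size=1024, hop_length=512):
    # Slide a window of frame_size samples across the signal in steps of hop_length samples.
    # (The last few frames are shorter because they run off the end of the signal.)
    return [signal[i:i + frame_size] for i in range(0, len(signal), hop_length)]

frames = split_into_frames(dummy_signal)
print(len(frames), "frames; the first frame contains", len(frames[0]), "samples")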

Before proceeding to the different time-domain feature extraction methods, let’s clarify some mathematical notations. We will use the following notations throughout this article:

  • xᵢ: the amplitude of the ith sample
  • K: Frame Size
  • H: Hop Length

Feature 1: Amplitude Envelope

First, let’s talk about the amplitude envelope. This is one of the easiest to compute (yet quite useful) features in time domain analysis. The amplitude envelope for a frame of an audio signal is simply the maximum value of its amplitude in that frame. Mathematically, the amplitude envelope (for non-overlapping frames) of the kth frame is given by:

In general, for any frame k containing samples xⱼ₁ , xⱼ₂ , · · · , xⱼₖ, the amplitude envelope is:
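
AEₖ = max{ xⱼ₁ , xⱼ₂ , · · · , xⱼₖ }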

The Python code for computing the amplitude envelope of a given signal is given below:

FRAME_SIZE = 1024
HOP_LENGTH = 512

def amplitude_envelope(signal, frame_size=1024, hop_length=512):
    """
    Computes the Amplitude Envelope of a signal using a sliding window.

    Args:
        signal (array): The input signal.
        frame_size (int): The size of each frame in samples.
        hop_length (int): The number of samples between consecutive frames.

    Returns:
        np.array: An array of Amplitude Envelope values.
    """
    res = []
    for i in range(0, len(signal), hop_length):
        # Get a portion of the signal
        cur_portion = signal[i:i + frame_size]
        # Compute the maximum value in the portion
        ae_val = max(cur_portion)
        # Store the amplitude envelope value
        res.append(ae_val)
    # Convert the result to a NumPy array
    return np.array(res)


def plot_amplitude_envelope(signal, name, frame_size=1024, hop_length=512):
    """
    Plots the waveform of a signal with the overlay of Amplitude Envelope values.

    Args:
        signal (array): The input signal.
        name (str): The name of the signal for the plot title.
        frame_size (int): The size of each frame in samples.
        hop_length (int): The number of samples between consecutive frames.
    """
    # Compute the amplitude envelope
    ae = amplitude_envelope(signal, frame_size, hop_length)
    # Generate the frame indices
    frames = range(0, len(ae))
    # Convert frames to time
    time = librosa.frames_to_time(frames, hop_length=hop_length)
    # Create a new figure with a specific size
    plt.figure(figsize=(15, 7))
    # Display the waveform of the signal
    librosa.display.waveshow(signal, alpha=0.5)
    # Plot the amplitude envelope over time
    plt.plot(time, ae, color="r")
    # Set the title of the plot
    plt.title("Waveform for " + name + " (Amplitude Envelope)")
    # Show the plot
    plt.show()

plot_amplitude_envelope(acoustic_guitar, "Acoustic Guitar")
plot_amplitude_envelope(brass, "Brass")
plot_amplitude_envelope(drum_set, "Drum Set")

In the above code, we have defined a function called amplitude_envelope which takes in the input signal array (generated using librosa.load()), frame size (K), and hop length (H), and returns an array of size equal to the number of frames. The kth value in the array corresponds to the value of the amplitude envelope for the kth frame. The computation is done using a simple for loop that iterates through the entire signal with steps determined by the hop length. A list (res) is defined to store these values and is finally converted to a NumPy array before being returned. Another function called plot_amplitude_envelope is defined which takes in the same set of inputs (along with a name argument) and overlays the plot of the amplitude envelope on the original waveform. In order to plot the waveform, librosa.display.waveshow() has been used, as explained in the previous section.

To plot the amplitude envelope, we need the time stamps and the corresponding amplitude envelope values. The time values are obtained using the very helpful function librosa.frames_to_time(), which takes in two inputs: an iterable corresponding to the frame indices (defined using the range function) and the hop length, and returns the time corresponding to each frame. Subsequently, matplotlib.pyplot is used to overlay the red plot. The above-described process will be used consistently for all time-domain feature extraction methods.

The following figures show the computed amplitude envelope for each of the musical instruments. They have been added as a red line over the original waveform and tend to approximate the upper boundary of the waveform. The amplitude envelope not only preserves the periodic pattern but also indicates the general difference in the audio amplitudes as reflected in the lower intensities of brass compared to acoustic guitar and drum sets.

Amplitude Envelope for Acoustic Guitar [Image by Author]
Amplitude Envelope for Brass [Image by Author]
Amplitude Envelope for Drum Set [Image by Author]

Feature 2: Root Mean Square Energy

Next, let’s talk about the root mean square energy (RMSE), another vital feature in time domain analysis. The root mean square energy for a frame of an audio signal is obtained by taking the square root of the mean of the square of all the amplitude values in a frame. Mathematically, the root mean square energy (for non-overlapping frames) of the kth frame is given by:

In general, for any frame k containing samples xⱼ₁ , xⱼ₂ , · · · , xⱼₖ, the RMSE is:
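
RMSEₖ = √[ (1/K) · ( xⱼ₁² + xⱼ₂² + · · · + xⱼₖ² ) ]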

The RMS energy provides a representation of the overall intensity or strength of a sound signal by taking into account both positive and negative excursions of the waveform, providing a more accurate measure of the signal’s power compared to other measures such as peak amplitude. The Python code for computing the RMSE of a given signal is given below. The structure of the code is the same as that for generating the amplitude envelope. The only change is in the function used for extracting the feature. Instead of the max, the RMSE value is calculated by taking the mean of the squared values in the current portion of the signal followed by a square root.

def RMS_energy(signal, frame_size=1024, hop_length=512):
    """
    Computes the RMS (Root Mean Square) energy of a signal using a sliding window.

    Args:
        signal (array): The input signal.
        frame_size (int): The size of each frame in samples.
        hop_length (int): The number of samples between consecutive frames.

    Returns:
        np.array: An array of RMS energy values.
    """
    res = []
    for i in range(0, len(signal), hop_length):
        # Extract a portion of the signal
        cur_portion = signal[i:i + frame_size]
        # Compute the RMS energy for the portion
        rmse_val = np.sqrt(1 / len(cur_portion) * sum(i**2 for i in cur_portion))
        res.append(rmse_val)
    # Convert the result to a NumPy array
    return np.array(res)

def plot_RMS_energy(signal, name, frame_size=1024, hop_length=512):
    """
    Plots the waveform of a signal with the overlay of RMS energy values.

    Args:
        signal (array): The input signal.
        name (str): The name of the signal for the plot title.
        frame_size (int): The size of each frame in samples.
        hop_length (int): The number of samples between consecutive frames.
    """
    # Compute the RMS Energy
    rmse = RMS_energy(signal, frame_size, hop_length)
    # Generate the frame indices
    frames = range(0, len(rmse))
    # Convert frames to time
    time = librosa.frames_to_time(frames, hop_length=hop_length)
    # Create a new figure with a specific size
    plt.figure(figsize=(15, 7))
    # Display the waveform of the signal
    librosa.display.waveshow(signal, alpha=0.5)
    # Plot the RMS energy values
    plt.plot(time, rmse, color="r")
    # Set the title of the plot
    plt.title("Waveform for " + name + " (RMS Energy)")
    plt.show()

plot_RMS_energy(acoustic_guitar, "Acoustic Guitar")
plot_RMS_energy(brass, "Brass")
plot_RMS_energy(drum_set, "Drum Set")

The following figures show the computed RMS Energy for each of the musical instruments. They have been added as a red line over the original waveform and tend to approximate the center of mass of the waveform. Just as before, this measure not only preserves the periodic pattern but also approximates the overall intensity level of the sound wave.

RMSE for Acoustic Guitar [Image by Author]
RMSE for Brass [Image by Author]
RMSE for Drum Set [Image by Author]

Feature 3: Crest Factor

Now, let’s talk about the crest factor, a measure of the extremeness of the peaks in the waveform. The crest factor for a frame of an audio signal is obtained by dividing the peak amplitude (the largest absolute value of the amplitude) by the RMS Energy. Mathematically, the crest factor (for non-overlapping frames) of the kth frame is given by:

In general, for any frame k containing samples xⱼ₁ , xⱼ₂ , · · · , xⱼₖ, the crest factor is:
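
CFₖ = max{ |xⱼ₁| , |xⱼ₂| , · · · , |xⱼₖ| } / RMSEₖ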

The crest factor indicates the ratio of the highest peak level and the average intensity level of a waveform. The Python code for computing the crest factor of a given signal is given below. The structure follows as above, involving the computation of the RMSE value (the denominator) and the highest peak value (the numerator), which is then used to obtain the desired fraction (the crest factor!).

def crest_factor(signal, frame_size=1024, hop_length=512):
    """
    Computes the crest factor of a signal using a sliding window.

    Args:
        signal (array): The input signal.
        frame_size (int): The size of each frame in samples.
        hop_length (int): The number of samples between consecutive frames.

    Returns:
        np.array: An array of crest factor values.
    """
    res = []
    for i in range(0, len(signal), hop_length):
        # Get a portion of the signal
        cur_portion = signal[i:i + frame_size]
        # Compute the RMS energy for the portion
        rmse_val = np.sqrt(1 / len(cur_portion) * sum(i ** 2 for i in cur_portion))
        # Compute the crest factor
        crest_val = max(np.abs(cur_portion)) / rmse_val
        # Store the crest factor value
        res.append(crest_val)
    # Convert the result to a NumPy array
    return np.array(res)


def plot_crest_factor(signal, name, frame_size=1024, hop_length=512):
    """
    Plots the crest factor of a signal over time.

    Args:
        signal (array): The input signal.
        name (str): The name of the signal for the plot title.
        frame_size (int): The size of each frame in samples.
        hop_length (int): The number of samples between consecutive frames.
    """
    # Compute the crest factor
    crest = crest_factor(signal, frame_size, hop_length)
    # Generate the frame indices
    frames = range(0, len(crest))
    # Convert frames to time
    time = librosa.frames_to_time(frames, hop_length=hop_length)
    # Create a new figure with a specific size
    plt.figure(figsize=(15, 7))
    # Plot the crest factor over time
    plt.plot(time, crest, color="r")
    # Set the title of the plot
    plt.title(name + " (Crest Factor)")
    # Show the plot
    plt.show()

plot_crest_factor(acoustic_guitar, "Acoustic Guitar")
plot_crest_factor(brass, "Brass")
plot_crest_factor(drum_set, "Drum Set")

The following figures show the computed crest factor for each of the musical instruments:

Crest Factor for Acoustic Guitar [Image by Author]
Crest Factor for Brass [Image by Author]
Crest Factor for Drum Set [Image by Author]

A higher crest factor, as seen for acoustic guitar and brass, indicates a larger difference between peak levels and average levels suggesting a more dynamic or peaky signal with greater variations in amplitude. A lower crest factor, as seen for a drum set, suggests a more uniform or compressed signal with smaller variations in amplitude. The crest factor is particularly relevant in cases where it is imperative to consider the headroom or available dynamic range of the system. For instance, a high crest factor in a music recording may require careful consideration to prevent distortion or clipping when played back on equipment with limited headroom.

In fact, there is another feature called peak-to-average power ratio (PAPR), closely related to the crest factor. PAPR is simply the squared value of the crest factor, usually converted to a decibel power ratio. In general, for any frame k containing samples xⱼ₁ , xⱼ₂ , · · · , xⱼₖ, the peak-to-average power ratio is:
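
PAPRₖ = CFₖ² = ( max{ |xⱼ₁| , · · · , |xⱼₖ| } )² / [ (1/K) · ( xⱼ₁² + · · · + xⱼₖ² ) ]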

As a fun challenge, try modifying the above code to generate a plot of the PAPR for each of the 3 musical instruments and analyze your findings.
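
If you want a starting point, one possible sketch (building on the crest_factor() function defined above; the function name PAPR here is just illustrative) squares the crest factor values and converts them to decibels:

def PAPR(signal, frame_size=1024, hop_length=512):
    # PAPR is the squared crest factor, expressed here as a decibel power ratio.
    crest = crest_factor(signal, frame_size, hop_length)
    return 10 * np.log10(crest ** 2)

Plotting the result over time then follows exactly the same pattern as plot_crest_factor().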

Feature 4: Zero Crossing Rate

Finally, we’ll talk about the Zero Crossing Rate (ZCR). The zero crossing rate for a frame of an audio signal is simply the number of times the signal crosses zero (the x/time axis). Mathematically, the ZCR (for non-overlapping frames) of the kth frame is given by:
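
ZCRₖ = (1/2) · Σ | sgn(xᵢ) − sgn(xᵢ₊₁) |,  where the sum runs over consecutive samples xᵢ, xᵢ₊₁ within the kth frame and sgn(·) denotes the sign function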

If the consecutive values have the same sign, the expression inside the absolute value cancels out, giving 0. If they have the opposite signs (indicating that the signal has crossed the time axis), the values add yielding 2 (after taking absolute value). Since each zero crossing gives a value of 2, we multiply the result by a factor of half to obtain the required count. In general, for any frame k containing samples xⱼ₁ , xⱼ₂ , · · · , xⱼₖ, the ZCR is:
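
ZCRₖ = (1/2) · [ | sgn(xⱼ₁) − sgn(xⱼ₂) | + | sgn(xⱼ₂) − sgn(xⱼ₃) | + · · · + | sgn(xⱼₖ₋₁) − sgn(xⱼₖ) | ]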

Note that, in the above expression, the zero crossing rate is calculated by simply adding up the number of times the signal crosses the axis. However, depending on the application, one may also normalize the values (by dividing by the length of the frame). The Python code for computing the zero crossing rate of a given signal is given below. The structure follows from above, with the definition of another function called num_sign_changes, which determines the number of times the sign changes in the given signal.

def ZCR(signal, frame_size=1024, hop_length=512):
    """
    Computes the Zero Crossing Rate (ZCR) of a signal using a sliding window.

    Args:
        signal (array): The input signal.
        frame_size (int): The size of each frame in samples.
        hop_length (int): The number of samples between consecutive frames.

    Returns:
        np.array: An array of ZCR values.
    """
    res = []
    for i in range(0, len(signal), hop_length):
        # Get a portion of the signal
        cur_portion = signal[i:i + frame_size]
        # Compute the number of sign changes in the portion
        zcr_val = num_sign_changes(cur_portion)
        # Store the ZCR value
        res.append(zcr_val)
    # Convert the result to a NumPy array
    return np.array(res)


def num_sign_changes(signal):
    """
    Computes the number of sign changes in a signal.

    Args:
        signal (array): The input signal.

    Returns:
        int: The number of sign changes.
    """
    res = 0
    for i in range(0, len(signal) - 1):
        # Check if there is a sign change between consecutive samples
        if (signal[i] * signal[i + 1] < 0):
            res += 1
    return res


def plot_ZCR(signal, name, frame_size=1024, hop_length=512):
    """
    Plots the Zero Crossing Rate (ZCR) of a signal over time.

    Args:
        signal (array): The input signal.
        name (str): The name of the signal for the plot title.
        frame_size (int): The size of each frame in samples.
        hop_length (int): The number of samples between consecutive frames.
    """
    # Compute the ZCR
    zcr = ZCR(signal, frame_size, hop_length)
    # Generate the frame indices
    frames = range(0, len(zcr))
    # Convert frames to time
    time = librosa.frames_to_time(frames, hop_length=hop_length)
    # Create a new figure with a specific size
    plt.figure(figsize=(15, 7))
    # Plot the ZCR over time
    plt.plot(time, zcr, color="r")
    # Set the title of the plot
    plt.title(name + " (Zero Crossing Rate)")
    # Show the plot
    plt.show()

plot_ZCR(acoustic_guitar, "Acoustic Guitar")
plot_ZCR(brass, "Brass")
plot_ZCR(drum_set, "Drum Set")

The following figures show the computed zero crossing rate for each of the musical instruments.

Zero Crossing for Acoustic Guitar [Image by Author]
Zero Crossing for Brass [Image by Author]
Zero Crossing for Drum Set [Image by Author]

A higher zero crossing rate shows that the signal changes its direction frequently, suggesting the presence of higher-frequency components or a more dynamic waveform. Conversely, a lower zero crossing rate indicates a relatively smoother or constant waveform.

Zero crossing rate is particularly useful in applications of speech and music analysis due to its ability to provide insights into properties such as timbre and rhythmic patterns. For instance, in speech analysis, the zero crossing rate helps distinguish between voiced and unvoiced sounds: unvoiced sounds (such as fricatives) tend to have a higher zero crossing rate, while voiced sounds, driven by the periodic vibration of the vocal cords, have a lower one. It’s vital to note that, while zero crossing rate is a simple and computationally efficient feature, it may not capture all aspects of a signal’s complexity (as you can see in the above figures, the periodicity is completely lost). Therefore, it is often used in conjunction with other features for a more comprehensive analysis of audio signals.
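
As a quick cross-check, librosa also provides a built-in librosa.feature.zero_crossing_rate(), which returns a rate normalized by the frame length (and pads/centers frames by default), so the comparison below is only approximate:

# librosa's built-in returns a *rate* (fraction of sign changes per frame), not a raw count.
zcr_builtin = librosa.feature.zero_crossing_rate(acoustic_guitar, frame_length=1024, hop_length=512)[0]
zcr_ours = ZCR(acoustic_guitar, frame_size=1024, hop_length=512)
# Dividing our raw counts by the frame size puts both on (approximately) the same scale.
print(zcr_builtin[:5])
print((zcr_ours / 1024)[:5])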

Frequency Domain Feature Extraction

The frequency domain offers an alternative representation of an audio wave. Unlike the time domain, where the signal is represented as a function of time, in the frequency domain, the signal is decomposed into its constituent frequencies, revealing the amplitude and phase information associated with each frequency i.e., the signal is represented as a function of frequency. Instead of looking at the signal’s amplitude at various points in time, we examine the amplitudes of the different frequency components that constitute the signal. Each frequency component represents a sinusoidal wave of a particular frequency and by combining these components we can reconstruct the original signal in the time domain.

The (most common) mathematical tool used to convert a signal from the time domain to the frequency domain is the Fourier transformation. The Fourier transformation takes the signal as input and decomposes it into a sum of sine and cosine waves of varying frequencies having their own amplitude and phase. The resulting representation is what constitutes the frequency spectrum. Mathematically, the Fourier transform of a continuous signal in its time domain g(t) is defined as follows:
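
ĝ(f) = ∫ g(t) · e^(−2πi·f·t) dt  (the integral taken over all time t)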

where i = √−1 is the imaginary number. Yes, the Fourier transformation yields a complex output, with the phase and the magnitude corresponding to those of the constituent sine wave! However, for most applications, we care only about the magnitude of the transformation and simply ignore the associated phase. Since digitally processed sound is discrete, we can define the analogous discrete Fourier transform (DFT):
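
ĝ(f) = Σₙ g(nT) · e^(−2πi·f·nT),  summed over the sample indices n = 0, 1, …, N − 1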

where T is the duration of one sample. Since T = 1/Sᵣ, we can rewrite this in terms of the sampling rate:
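
ĝ(f) = Σₙ g(n/Sᵣ) · e^(−2πi·f·n/Sᵣ),  n = 0, 1, …, N − 1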

Since frequency representations are also continuous, we evaluate the Fourier transform on discretized frequency bins to obtain a discrete frequency domain representation of the audio wave. This is called the short-term Fourier transform. Mathematically,
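
ĥ(k) = | Σₙ g(n/Sᵣ) · e^(−2πi·k·n/N) |,  k = 0, 1, …, N − 1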

Don’t get anxious! Let’s review it carefully. Here, ĥ(k) is the function that maps an integer k ∈ {0, 1, · · · , N − 1} to the magnitude of the frequency k · Sᵣ/N. Notice that we consider only the discrete frequency bins that are integral multiples of Sᵣ/N, where N is the number of samples in the signal. If you’re still unsure about how this works, here’s an excellent explanation of Fourier transforms: https://www.youtube.com/watch?v=spUNpyF58BY&t=393s

The Fourier transform is one of the most beautiful mathematical innovations, so it’s worth knowing about it, even though the discussion isn’t specifically relevant to the purpose of this article. In Python, you can easily obtain the short-term Fourier transform using librosa.stft().

Note: For large audio data, there is a more efficient way to compute the Fourier transform, called the Fast Fourier Transform (FFT); feel free to check it out if you’re curious!

As before, we aren’t just interested in knowing which frequencies are more dominant: we also want to show when these frequencies are dominant. So, we seek a simultaneous frequency-time representation that shows which frequencies dominate at what point in time. This is where framing steps in: we split the signal into time frames and obtain the magnitude of the resulting Fourier transform in each frame. This gives us a matrix of values, where the number of rows is given by the number of frequency bins (Φ, usually equal to K/2 + 1, where K is the frame size), and the number of columns is given by the number of frames. Since the Fourier transform gives complex-valued output, the matrix generated is complex-valued. In Python, the frame size and hop length parameters can easily be specified as arguments and the resultant matrix can be simply computed using librosa.stft(signal, n_fft=frame_size, hop_length=hop_length). Since we only care about the magnitude, we may use numpy.abs() to convert the complex-valued matrix into a real-valued one. It is quite convenient to plot the obtained matrix to have a visually captivating representation of the signal that provides valuable insights into the frequency content and temporal characteristics of the given sound. The resulting representation is called a Spectrogram.

Spectrograms are obtained by plotting time frames on the x-axis and the frequency bins on the y-axis. Colors are then used to indicate the intensity or the magnitude of the frequency for the given time frame. Usually, the frequency axis is converted to a log scale (since human perception of frequency is roughly logarithmic), and the magnitude is expressed in decibels.

The Python code for generating a Spectrogram is shown below:

FRAME_SIZE = 1024
HOP_LENGTH = 512

def plot_spectrogram(signal, sample_rate, frame_size=1024, hop_length=512):
    """
    Plots the spectrogram of an audio signal.

    Args:
        signal (array-like): The input audio signal.
        sample_rate (int): The sample rate of the audio signal.
        frame_size (int): The size of each frame in samples.
        hop_length (int): The number of samples between consecutive frames.
    """
    # Compute the STFT
    spectrogram = librosa.stft(signal, n_fft=frame_size, hop_length=hop_length)
    # Convert the STFT to dB scale
    spectrogram_db = librosa.amplitude_to_db(np.abs(spectrogram))
    # Create a new figure with a specific size
    plt.figure(figsize=(15, 7))
    # Display the spectrogram
    librosa.display.specshow(spectrogram_db, sr=sample_rate, hop_length=hop_length, x_axis='time', y_axis='log')
    # Add a colorbar to show the magnitude scale
    plt.colorbar(format='%+2.0f dB')
    # Set the title of the plot
    plt.title('Spectrogram')
    # Set the label for the x-axis
    plt.xlabel('Time')
    # Set the label for the y-axis
    plt.ylabel('Frequency (Hz)')
    # Adjust the layout of the plot
    plt.tight_layout()
    # Show the plot
    plt.show()

plot_spectrogram(acoustic_guitar, sr)
plot_spectrogram(brass, sr)
plot_spectrogram(drum_set, sr)

In the above code, we have defined a function called plot_spectrogram that takes in 4 arguments: the input signal array, sampling rate, frame size, and hop length. First, librosa.stft() is used to obtain the Spectrogram matrix. Subsequently, np.abs() is used to extract the magnitude, followed by a conversion of the amplitude values to decibels using the function librosa.amplitude_to_db(). Finally, the function librosa.display.specshow() is used to plot the Spectrogram. This function takes in the transformed Spectrogram matrix, the sampling rate, hop length, and the specifications for the x and the y axes. A log-transformed y-axis can be specified using the y_axis='log' argument. An optional color bar can be added using plt.colorbar(). The resultant Spectrograms for the 3 musical instruments are shown below:

Spectrogram for Acoustic Guitar [Image by Author]
Spectrogram for Brass [Image by Author]
Spectrogram for Drum Set [Image by Author]

Spectrograms offer a unique way of visualizing the time-frequency trade-off. The time domain gives us a precise representation of how a signal evolves over time, while the frequency domain allows us to see the distribution of energy across different frequencies. This enables us to identify not only the presence of specific frequencies but also their duration and temporal variations. Spectrograms are one of the most useful ways to represent sound and are frequently used in audio signal machine learning applications (for example, feeding the Spectrogram of a sound wave into a deep convolutional neural network to make predictions).

Before proceeding to the different frequency-domain feature extraction methods, let’s clarify some mathematical notations. We will use the following notations for the subsequent sections:

  • mₖ(i): the amplitude of the ith frequency of the kth frame.
  • K: Frame Size
  • H: Hop Length
  • Φ: The number of frequency bins (= K/2 + 1)

Feature 5: Band Energy Ratio

First, let’s talk about the band energy ratio. The band energy ratio is a metric used to quantify the ratio of the energies of the lower frequencies to that of the higher frequencies in a given time frame. Mathematically, for any frame k, the band energy ratio is:
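
BERₖ = [ Σ mₖ(i)² over bins i < σբ ] / [ Σ mₖ(i)² over bins i ≥ σբ ]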

where σբ denotes the split frequency bin: a parameter to distinguish the lower frequencies from the higher frequencies. While calculating the band energy ratio, all frequencies lower than the frequency corresponding to σբ (called the split frequency) are treated as lower frequencies, and the sum of the squares of their magnitudes (i.e., their energies) determines the numerator. Similarly, all frequencies at or above the split frequency are treated as higher frequencies, and the sum of the squares of their magnitudes determines the denominator. The Python code for computing the band energy ratio of a signal is shown below:

def find_split_freq_bin(spec, split_freq, sample_rate, frame_size=1024, hop_length=512):
    """
    Calculate the bin index corresponding to a given split frequency.

    Args:
        spec (array): The spectrogram.
        split_freq (float): The split frequency in Hz.
        sample_rate (int): The sample rate of the audio.
        frame_size (int, optional): The size of each frame in samples. Default is 1024.
        hop_length (int, optional): The number of samples between consecutive frames. Default is 512.

    Returns:
        int: The bin index corresponding to the split frequency.
    """
    # Calculate the range of frequencies
    range_of_freq = sample_rate / 2
    # Calculate the change in frequency per bin
    change_per_bin = range_of_freq / spec.shape[0]
    # Calculate the bin corresponding to the split frequency
    split_freq_bin = split_freq / change_per_bin
    return int(np.floor(split_freq_bin))


def band_energy_ratio(signal, split_freq, sample_rate, frame_size=1024, hop_length=512):
    """
    Compute the band energy ratio (BER) of a signal.

    Args:
        signal (array): The input signal.
        split_freq (float): The split frequency in Hz.
        sample_rate (int): The sample rate of the audio.
        frame_size (int, optional): The size of each frame in samples. Default is 1024.
        hop_length (int, optional): The number of samples between consecutive frames. Default is 512.

    Returns:
        ndarray: The band energy ratios for each frame of the signal.
    """
    # Compute the spectrogram of the signal
    spec = librosa.stft(signal, n_fft=frame_size, hop_length=hop_length)
    # Find the bin corresponding to the split frequency
    split_freq_bin = find_split_freq_bin(spec, split_freq, sample_rate, frame_size, hop_length)
    # Extract the magnitude and transpose it
    modified_spec = np.abs(spec).T
    res = []
    for sub_arr in modified_spec:
        # Compute the energy in the low-frequency range
        low_freq_density = sum(i ** 2 for i in sub_arr[:split_freq_bin])
        # Compute the energy in the high-frequency range
        high_freq_density = sum(i ** 2 for i in sub_arr[split_freq_bin:])
        # Compute the band energy ratio
        ber_val = low_freq_density / high_freq_density
        res.append(ber_val)
    return np.array(res)


def plot_band_energy_ratio(signal, split_freq, sample_rate, name, frame_size=1024, hop_length=512):
    """
    Plot the band energy ratio (BER) of a signal over time.

    Args:
        signal (ndarray): The input signal.
        split_freq (float): The split frequency in Hz.
        sample_rate (int): The sample rate of the audio.
        name (str): The name of the signal for the plot title.
        frame_size (int, optional): The size of each frame in samples. Default is 1024.
        hop_length (int, optional): The number of samples between consecutive frames. Default is 512.
    """
    # Compute the band energy ratio (BER)
    ber = band_energy_ratio(signal, split_freq, sample_rate, frame_size, hop_length)
    # Generate the frame indices
    frames = range(0, len(ber))
    # Convert frames to time
    time = librosa.frames_to_time(frames, hop_length=hop_length)
    # Create a new figure with a specific size
    plt.figure(figsize=(15, 7))
    # Plot the BER over time
    plt.plot(time, ber)
    # Set the title of the plot
    plt.title(name + " (Band Energy Ratio)")
    # Show the plot
    plt.show()

plot_band_energy_ratio(acoustic_guitar, 2048, sr, "Acoustic Guitar")
plot_band_energy_ratio(brass, 2048, sr, "Brass")
plot_band_energy_ratio(drum_set, 2048, sr, "Drum Set")

The structure of the above code is quite similar to that of the time domain extraction. The first step is to define a function called find_split_freq_bin(), which takes in the spectrogram, the value of the split frequency, and the sample rate to determine the split frequency bin (σբ) corresponding to the split frequency. The process is rather simple. It involves finding the range of frequencies (which, as explained earlier, extends up to the Nyquist frequency, Sᵣ/2). The number of frequency bins is given by the number of rows of the spectrogram, which is extracted as spec.shape[0]. Dividing the total range of frequencies by the number of frequency bins gives the change in frequency per bin; the given split frequency is then divided by this value to determine the split frequency bin.

Next, we use this function to calculate the band energy ratio vector. The function band_energy_ratio() takes in the input signal, split frequency, sample rate, frame size, and hop length. First, it uses librosa.stft() to extract the spectrogram, followed by a calculation of the split frequency bin. Next, the magnitude of the spectrogram is calculated using np.abs(), followed by transposition to facilitate iteration across every frame. During the iteration, the band energy ratio for each frame is calculated using the defined formula and the found split frequency bin. The values are stored in a list, res, which is finally returned as a NumPy array. Finally, the values are plotted using the plot_band_energy_ratio() function.

The band energy ratio plots for the 3 musical instruments are shown below. For these plots, the split frequency is chosen to be 2048 Hz i.e., frequencies below 2048 Hz are treated as the lower frequencies, and the ones above are treated as the higher frequencies.

Band Energy Ratio for Acoustic Guitar [Image by Author]
Band Energy Ratio for Brass [Image by Author]
Band Energy Ratio for Drum Set [Image by Author]

A high band energy ratio (for brass) indicates the larger presence of lower-frequency components relative to the higher-frequency components. Thus, we observe that brass instruments produce a significant amount of energy in the lower frequency bands compared to the higher frequency bands. The acoustic guitar has a lower BER compared to brass instruments, indicating a relatively lesser energy contribution in the lower frequency bands compared to the higher frequency bands. In general, acoustic guitars tend to have a more balanced energy distribution across the frequency spectrum, with relatively less emphasis on the lower frequencies compared to other instruments. Finally, the drum set has the lowest BER among the three, suggesting a comparatively lower energy contribution in the lower frequency bands relative to other instruments.

Feature 6: Spectral Centroid

Next, we will talk about the spectral centroid, a measure that quantifies information about the center of mass or the average frequency of a signal’s spectrum in a given time frame. Mathematically, for any frame k, the spectral centroid is:
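
SCₖ = [ Σ i · mₖ(i) ] / [ Σ mₖ(i) ],  with both sums running over all Φ frequency bins i of the kth frame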

Think of it like a weighted sum of the frequency bin indexes, where the weight is determined by the energy contribution of the bin in the given time frame. Normalization is also done by dividing the weighted sum by the sum of all the weights to facilitate uniform comparison across different signals. The Python code for computing the spectral centroid of a signal is shown below:

def spectral_centroid(signal, sample_rate, frame_size=1024, hop_length=512):
    """
    Compute the Spectral Centroid of a signal.

    Args:
        signal (array): The input signal.
        sample_rate (int): The sample rate of the audio.
        frame_size (int, optional): The size of each frame in samples. Default is 1024.
        hop_length (int, optional): The number of samples between consecutive frames. Default is 512.

    Returns:
        ndarray: The spectral centroids for each frame of the signal.
    """
    # Compute the spectrogram of the signal
    spec = librosa.stft(signal, n_fft=frame_size, hop_length=hop_length)
    # Extract the magnitude and transpose it
    modified_spec = np.abs(spec).T
    res = []
    for sub_arr in modified_spec:
        # Compute the spectral centroid
        sc_val = sc(sub_arr)
        # Store the value of spectral centroid for current frame
        res.append(sc_val)
    return np.array(res)


def sc(arr):
    """
    Computes the spectral centroid in a signal.

    Args:
        arr (array): Frequency domain array for current frame.

    Returns:
        float: The spectral centroid value for current frame.
    """
    res = 0
    for i in range(0, len(arr)):
        # Compute weighted sum
        res += i * arr[i]
    return res / sum(arr)

def bin_to_freq(spec, bin_val, sample_rate, frame_size=1024, hop_length=512):
    """
    Calculate the frequency corresponding to a given bin value.

    Args:
        spec (array): The spectrogram.
        bin_val (float or ndarray): The bin value(s).
        sample_rate (int): The sample rate of the audio.
        frame_size (int, optional): The size of each frame in samples. Default is 1024.
        hop_length (int, optional): The number of samples between consecutive frames. Default is 512.

    Returns:
        float or ndarray: The frequency (in Hz) corresponding to the bin value(s).
    """
    # Calculate the range of frequencies
    range_of_freq = sample_rate / 2
    # Calculate the change in frequency per bin
    change_per_bin = range_of_freq / spec.shape[0]
    # Calculate the frequency corresponding to the bin
    split_freq = bin_val * change_per_bin
    return split_freq


def plot_spectral_centroid(signal, sample_rate, name, frame_size=1024, hop_length=512, col="black"):
    """
    Plot the spectral centroid of a signal over time.

    Args:
        signal (ndarray): The input signal.
        sample_rate (int): The sample rate of the audio.
        name (str): The name of the signal for the plot title.
        frame_size (int, optional): The size of each frame in samples. Default is 1024.
        hop_length (int, optional): The number of samples between consecutive frames. Default is 512.
        col (str, optional): The color of the overlaid centroid curve. Default is "black".
    """
    # Compute the STFT
    spectrogram = librosa.stft(signal, n_fft=frame_size, hop_length=hop_length)
    # Convert the STFT to dB scale
    spectrogram_db = librosa.amplitude_to_db(np.abs(spectrogram))
    # Compute the Spectral Centroid
    sc_arr = spectral_centroid(signal, sample_rate, frame_size, hop_length)
    # Compute corresponding frequencies:
    sc_freq_arr = bin_to_freq(spectrogram_db, sc_arr, sample_rate, frame_size, hop_length)
    # Generate the frame indices
    frames = range(0, len(sc_arr))
    # Convert frames to time
    time = librosa.frames_to_time(frames, hop_length=hop_length)
    # Create a new figure with a specific size
    plt.figure(figsize=(15, 7))
    # Display the Spectrogram
    librosa.display.specshow(spectrogram_db, sr=sample_rate, hop_length=hop_length, x_axis='time', y_axis='log')
    # Add a colorbar to show the magnitude scale
    plt.colorbar(format='%+2.0f dB')
    # Plot the Spectral Centroid over time
    plt.plot(time, sc_freq_arr, color=col)
    # Set the title of the plot
    plt.title(name + " (Spectral Centroid)")
    # Show the plot
    plt.show()

plot_spectral_centroid(acoustic_guitar, sr, "Acoustic Guitar")
plot_spectral_centroid(brass, sr, "Brass", col = "white")
plot_spectral_centroid(drum_set, sr, "Drum Set")

In the above code, the spectral_centroid() function is defined to produce the array of spectral centroids for all time frames. The sc() function calculates the spectral centroid of one frame through a simple iterative process that multiplies the bin index values with the magnitudes, followed by normalization to obtain the average frequency bin. Before plotting the spectral centroid values returned by spectral_centroid(), an additional function called bin_to_freq() is defined as a helper function for plotting. This function converts the average bin values to the corresponding frequency values, which can then be plotted over the original spectrogram to give a consistent idea of the variation of the spectral centroid across time. The output plots (with the overlay of centroid variation over the spectrogram) are shown below:

Spectral Centroid for Acoustic Guitar [Image by Author]
Spectral Centroid for Brass [Image by Author]
Spectral Centroid for Drum Set [Image by Author]

The spectral centroid is quite analogous to the RMSE metric for time-domain analysis and is commonly used as a descriptor for sound timbre and brightness. Sounds with higher spectral centroids tend to have a brighter or more treble-oriented quality, while lower centroid values are associated with a rather darker or bass-oriented character. Spectral centroid is one of the most important features for audio machine learning, often used in applications involving audio/music genre classification.
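
As a rough cross-check, librosa’s built-in librosa.feature.spectral_centroid() reports the centroid directly in Hz; converting our bin-index values with bin_to_freq() puts the two on a comparable scale (the values differ slightly because of how the bin spacing is computed):

# Built-in spectral centroid, reported in Hz.
sc_builtin = librosa.feature.spectral_centroid(y=acoustic_guitar, sr=sr, n_fft=1024, hop_length=512)[0]
# Our version returns average bin indices; convert them to Hz for comparison.
spec = librosa.stft(acoustic_guitar, n_fft=1024, hop_length=512)
sc_hz = bin_to_freq(spec, spectral_centroid(acoustic_guitar, sr), sr)
print(sc_builtin[:5])
print(sc_hz[:5])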

Feature 7: Spectral Bandwidth

Now, we will talk about spectral bandwidth/spread, a measure that quantifies information about the spread of energies across the component frequencies of a signal’s spectrum in a given time frame. Think of it this way: If the spectral centroid is the mean/average value, the spectral bandwidth is a measure of its spread/variance about the centroid. Mathematically, for any frame k, the spectral bandwidth is:
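
SBₖ = [ Σ |i − SCₖ| · mₖ(i) ] / [ Σ mₖ(i) ],  with both sums running over all Φ frequency bins i of the kth frame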

where SCₖ denotes the spectral centroid of the kth frame. As before, normalization is done by dividing the weighted sum by the sum of all the weights to facilitate uniform comparison across different signals. The Python code for computing the spectral bandwidth of a signal is shown below:

def spectral_bandwidth(signal, sample_rate, frame_size=1024, hop_length=512):
    """
    Compute the Spectral Bandwidth of a signal.

    Args:
        signal (array): The input signal.
        sample_rate (int): The sample rate of the audio.
        frame_size (int, optional): The size of each frame in samples. Default is 1024.
        hop_length (int, optional): The number of samples between consecutive frames. Default is 512.

    Returns:
        ndarray: The spectral bandwidths for each frame of the signal.
    """
    # Compute the spectrogram of the signal
    spec = librosa.stft(signal, n_fft=frame_size, hop_length=hop_length)
    # Extract the magnitude and transpose it
    modified_spec = np.abs(spec).T
    res = []
    for sub_arr in modified_spec:
        # Compute the spectral bandwidth
        sb_val = sb(sub_arr)
        # Store the value of spectral bandwidth for current frame
        res.append(sb_val)
    return np.array(res)


def sb(arr):
    """
    Computes the spectral bandwidth in a signal.

    Args:
        arr (array): Frequency domain array for current frame.

    Returns:
        float: The spectral bandwidth value for current frame.
    """
    res = 0
    sc_val = sc(arr)
    for i in range(0, len(arr)):
        # Compute weighted sum
        res += (abs(i - sc_val)) * arr[i]
    return res / sum(arr)


def plot_spectral_bandwidth(signal, sample_rate, name, frame_size=1024, hop_length=512):
    """
    Plot the spectral bandwidth of a signal over time.

    Args:
        signal (ndarray): The input signal.
        sample_rate (int): The sample rate of the audio.
        name (str): The name of the signal for the plot title.
        frame_size (int, optional): The size of each frame in samples. Default is 1024.
        hop_length (int, optional): The number of samples between consecutive frames. Default is 512.
    """
    # Compute the Spectral Bandwidth
    sb_arr = spectral_bandwidth(signal, sample_rate, frame_size, hop_length)
    # Generate the frame indices
    frames = range(0, len(sb_arr))
    # Convert frames to time
    time = librosa.frames_to_time(frames, hop_length=hop_length)
    # Create a new figure with a specific size
    plt.figure(figsize=(15, 7))
    # Plot the Spectral Bandwidth over time
    plt.plot(time, sb_arr)
    # Set the title of the plot
    plt.title(name + " (Spectral Bandwidth)")
    # Show the plot
    plt.show()

plot_spectral_bandwidth(acoustic_guitar, sr, "Acoustic Guitar")
plot_spectral_bandwidth(brass, sr, "Brass")
plot_spectral_bandwidth(drum_set, sr, "Drum Set")

As before, in the above code, the spectral_bandwidth() function is defined to produce the array of spectral spread values for all time frames using the sb() helper function, which iteratively calculates the bandwidth of one frame. Finally, the plot_spectral_bandwidth() function is employed to plot these bandwidth values. The output plots are shown below:

Spectral Bandwidth for Acoustic Guitar [Image by Author]
Spectral Bandwidth for Brass [Image by Author]
Spectral Bandwidth for Drum Set [Image by Author]

Spectral bandwidth can be used in various audio analysis/classification tasks owing to its ability to provide information about the spread or width of frequencies present in a signal. A higher spectral bandwidth (as seen for brass and drum sets) indicates a wider range of frequencies, suggesting a more diverse or complex signal. On the other hand, a lower bandwidth suggests a narrower range of frequencies, indicating a more focused or tonally pure signal.

Feature 8: Spectral Flatness

Finally, we will talk about spectral flatness (also known as Wiener entropy), a measure of the 'flatness' or uniformity of the spectrum of an audio signal. It helps us gauge how close the audio signal is to a pure tone as opposed to being noise-like, and it is therefore also called the tonality coefficient. For any frame k, the spectral flatness is the ratio of the geometric mean of the spectral magnitudes to their arithmetic mean. Mathematically,
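
SFₖ = [ mₖ(1) · mₖ(2) · · · mₖ(Φ) ]^(1/Φ) / [ (1/Φ) · ( mₖ(1) + mₖ(2) + · · · + mₖ(Φ) ) ]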

The Python code for computing the spectral flatness of a signal is shown below:

def spectral_flatness(signal, sample_rate, frame_size=1024, hop_length=512):
    """
    Compute the Spectral Flatness of a signal.

    Args:
        signal (array): The input signal.
        sample_rate (int): The sample rate of the audio.
        frame_size (int, optional): The size of each frame in samples. Default is 1024.
        hop_length (int, optional): The number of samples between consecutive frames. Default is 512.

    Returns:
        ndarray: The spectral flatness for each frame of the signal.
    """
    # Compute the spectrogram of the signal
    spec = librosa.stft(signal, n_fft=frame_size, hop_length=hop_length)
    # Extract the magnitude and transpose it
    modified_spec = np.abs(spec).T
    res = []
    for sub_arr in modified_spec:
        # Compute the geometric mean
        geom_mean = np.exp(np.log(sub_arr).mean())
        # Compute the arithmetic mean
        ar_mean = np.mean(sub_arr)
        # Compute the spectral flatness
        sl_val = geom_mean / ar_mean
        # Store the value of spectral flatness for current frame
        res.append(sl_val)
    return np.array(res)

def plot_spectral_flatness(signal, sample_rate, name, frame_size=1024, hop_length=512):
    """
    Plot the spectral flatness of a signal over time.

    Args:
        signal (ndarray): The input signal.
        sample_rate (int): The sample rate of the audio.
        name (str): The name of the signal for the plot title.
        frame_size (int, optional): The size of each frame in samples. Default is 1024.
        hop_length (int, optional): The number of samples between consecutive frames. Default is 512.
    """
    # Compute the Spectral Flatness
    sl_arr = spectral_flatness(signal, sample_rate, frame_size, hop_length)
    # Generate the frame indices
    frames = range(0, len(sl_arr))
    # Convert frames to time
    time = librosa.frames_to_time(frames, hop_length=hop_length)
    # Create a new figure with a specific size
    plt.figure(figsize=(15, 7))
    # Plot the Spectral Flatness over time
    plt.plot(time, sl_arr)
    # Set the title of the plot
    plt.title(name + " (Spectral Flatness)")
    # Show the plot
    plt.show()

plot_spectral_flatness(acoustic_guitar, sr, "Acoustic Guitar")
plot_spectral_flatness(brass, sr, "Brass")
plot_spectral_flatness(drum_set, sr, "Drum Set")

The structure of the above code is the same as those for other frequency domain extraction methods. The only difference lies in the feature extraction function inside the for loop, which computes the arithmetic and geometric mean using NumPy functions and calculates their ratio to produce the spectral flatness values for each time frame. The output plots are shown below:

Spectral Flatness for Acoustic Guitar [Image by Author]
Spectral Flatness for Brass [Image by Author]
Spectral Flatness for Drum Set [Image by Author]

A high spectral flatness value (one that is closer to 1) indicates a more uniform or balanced distribution of energy across different frequencies in the signal. This is seen consistently for the drum set, suggesting a sound that is more "noise-like" or broadband, without prominent peaks or emphasis on specific frequencies (as noticed earlier from the lack of periodicity).

On the other hand, a low spectral flatness value (especially for acoustic guitar and somewhat for brass) implies a more uneven power spectrum, with energy being concentrated around a few specific frequencies. This shows the presence of tonal or harmonic components in the sound (as reflected in their periodic time domain structure). In general, music with distinct pitch/frequencies tends to have lower spectral flatness values, while noisier (and non-tonal) sounds exhibit higher spectral flatness values.
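
For reference, librosa’s built-in librosa.feature.spectral_flatness() computes the same geometric-to-arithmetic mean ratio; by default it works on the power spectrum, so passing power=1.0 makes it roughly comparable to the magnitude-based implementation above:

# Built-in spectral flatness on the magnitude spectrum (power=1.0), first few frames.
flatness_builtin = librosa.feature.spectral_flatness(y=drum_set, n_fft=1024, hop_length=512, power=1.0)[0]
flatness_ours = spectral_flatness(drum_set, sr)
print(flatness_builtin[:5])
print(flatness_ours[:5])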

Conclusion

In this article, we dived deep into the different strategies and techniques for feature extraction, which constitutes an integral part of audio signal processing in musical engineering. We started by learning about the basics of sound production and propagation, which can be effectively translated into pressure variations over time, giving rise to the time domain representation of sound. We discussed the digital representation of sound and its vital parameters, including sampling rate, frame size, and hop length. Time domain features such as the amplitude envelope, root mean square energy, crest factor, peak-to-average power ratio, and zero crossing rate were discussed theoretically and evaluated computationally on 3 musical instruments: acoustic guitar, brass, and drum set. Subsequently, the frequency-domain representation of sound was presented and analyzed through various theoretical discussions on the Fourier transform and spectrograms. This paved the way for a wide array of frequency-domain features, including the band energy ratio, spectral centroid, spectral bandwidth, and spectral flatness (tonality coefficient), each of which can be efficiently utilized to gauge a particular characteristic of the input audio. There’s a lot more to signal processing applications, including mel-spectrograms, cepstral coefficients, noise control, audio synthesis, and more. I hope this explanation will serve as a foundation for further exploration of advanced concepts in the field.

Hope you enjoyed reading this article! In case you have any doubts or suggestions, do reply in the comment box.

Please feel free to contact me via mail.

If you liked my article and want to read more of them, please follow me.

Note: All images (except for the cover image) have been made by the author.

References

Crest factor. (2023). In Wikipedia. https://en.wikipedia.org/w/index.php?title=Crest_factor&oldid=1158501578

librosa — Librosa 0.10.1dev documentation. (n.d.). Retrieved June 5, 2023, from https://librosa.org/doc/main/index.html

Spectral flatness. (2022). In Wikipedia. https://en.wikipedia.org/w/index.php?title=Spectral_flatness&oldid=1073105086

The Sound of AI. (2020, August 1). Valerio Velardo. https://valeriovelardo.com/the-sound-of-ai/
