
How to identify bird species by their songs?

A Kick-off for Applying ML to Sounds

Photo by Jan Meeus on Unsplash

Did you know that sounds can be transformed into images, so that standard Convolutional Neural Networks can tell what those sounds are?

We’ll talk about the tools you need to quickly kick off a new sound classification task and give a big-picture view of how it all fits together.

About the data used in this article

To illustrate sound classification, we will use bird sounds recorded from https://xeno-canto.org/. Xeno-canto is a community-driven collection of bird sounds from around the world, hosting a large database of recordings.

Before using their data, be aware that each recording may be under a different license; for this article, I picked recordings released under the CC BY 4.0 license.


A bit of basics about sound and signal processing

A sound is the result of pressure waves created in the air when an object vibrates. These pressure waves travel from the source to us. The ear then processes two parameters: the frequency of the pressure wave and its amplitude.

The frequency of the pressure wave received by the ear is translated to a pitch: the higher the frequency, the higher the pitch.

The amplitude of the wave, on the other hand, is responsible for the intensity of the sound: the larger the amplitude, the louder the sound.
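To make this concrete, here is a minimal NumPy/SciPy sketch that synthesizes a pure tone and writes it to a WAV file; the 440 Hz frequency, the 0.5 amplitude, and the file name are arbitrary choices for illustration:

```python
import numpy as np
from scipy.io import wavfile

fs = 44_100                                # sampling rate (samples per second)
t = np.arange(0, 1.0, 1 / fs)              # 1 second of time steps
tone = 0.5 * np.sin(2 * np.pi * 440 * t)   # 440 Hz sine: pitch from f, loudness from A

# Scale to 16-bit integers before writing to disk.
wavfile.write("tone.wav", fs, (tone * 32767).astype(np.int16))
```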

Going through an example

Below is an example of a bird call recorded by Camille Vacher, viewed as an air-pressure signal over time. We say that we visualize the wave in the "time domain" because we look at the evolution of the pressure wave over time. The signal has been truncated to keep only the first "pew" from the bird.

Bird song recorded by Camille Vacher (Creative Commons Attribution 4.0), transcribed as an air pressure time series, Author Illustration

And for reference, this is what we will actually hear:

For a human, the characteristics of a sound that we would like a bird song classifier to pick up on are:

  • The duration of the call (some birds produce longer sounds, while others produce shorter ones)
  • The pitch, and its evolution during the call
  • The evolution of the amplitude over time
  • More subtle patterns in how the frequencies and amplitudes of the call evolve together

Let’s explore some of the characteristics of the bird call above.

Evolution of the frequency of the call

If we zoom in exactly at the beginning of the song, we can observe the change in the time series regime, which highlights the main frequency of the bird sound at the beginning of the call.

Bird song recorded by Camille Vacher (Creative Commons Attribution 4.0), transcribed as an air pressure time series, Author Illustration

If we look a bit further into the signal, we can see differences in the dominant frequency as well as in the amplitude. Let's have a look at 5 periods of oscillation of the signal above at two different timestamps. To calculate the frequency of oscillation, we measure the time elapsed over one full oscillation and invert it; here we average over 5 periods to get a slightly more robust measure.

Around t=0.0918s, the frequency of oscillation is roughly 1/[(0.0922–0.0913)/5] ~ 5.55kHz.

Bird song recorded by Camille Vacher (Creative Commons Attribution 4.0), transcribed as an air pressure time series, Author Illustration

On the other hand, around t=0.229s, the frequency is lower: 1/[(0.2295–0.2284)/5] ~ 4.55kHz. We can also notice that this second slice has an amplitude roughly 4 times higher than the previous one.

Bird song recorded by Camille Vacher (Creative Commons Attribution 4.0), transcribed as an air pressure time series, Author Illustration
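These back-of-the-envelope calculations are easy to reproduce in a few lines of Python; the timestamps below are simply read off the second zoomed-in plot:

```python
# 5 full oscillations span t = 0.2284 s to t = 0.2295 s in the slice above.
t_start, t_end, n_periods = 0.2284, 0.2295, 5
frequency = 1 / ((t_end - t_start) / n_periods)
print(f"{frequency:.0f} Hz")  # ~4545 Hz, i.e. ~4.55 kHz
```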

What about the tone color of the sound?

The frequency that we observed above is not enough to fully characterize a bird song. Think of two instruments playing the same musical note: the main frequency of the air-pressure signal can be the same (same note, ergo same pitch), but your ear can still tell the instruments apart.

So far we have considered sounds as pure mono-frequency signals. In reality, a sound is made of a multitude of signals that add up to each other and give a particular tone color to the sound, which our brain also uses to differentiate sounds from each other. Those other frequencies are often referred to as "harmonics", and they are particularly difficult to spot in our "time domain" representation.

One clue can still be identified: if our bird song were a perfect single-frequency sine, we would not see any variation in amplitude, as in the figure below:

Example of a mono-frequency periodic signal (f(t) = A * sin(2 pi f t))

If we add another signal with a different frequency to the previous time series, we will observe a modulation of the amplitude while the dominant frequency is preserved:

Adding up two periodic signals with different frequencies: (f(t) = A1 * sin(2 pi f1 t) + A2 * sin(2 pi f2 t)), Author Illustration
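As a quick sketch, this kind of modulated signal can be reproduced with NumPy; the amplitudes and frequencies below are illustrative choices (matching the spectral diagram shown later), not values extracted from the bird recording:

```python
import numpy as np
import matplotlib.pyplot as plt

fs = 44_100                       # sampling rate (Hz)
t = np.arange(0, 0.005, 1 / fs)   # 5 ms of signal
f1, a1 = 5_500, 1.0               # dominant component
f2, a2 = 2_250, 0.5               # second component, modulating the amplitude
signal = a1 * np.sin(2 * np.pi * f1 * t) + a2 * np.sin(2 * np.pi * f2 * t)

plt.plot(t, signal)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.show()
```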

The frequencies responsible for the modulation of the amplitude in the time domain are essential, as they give a "color" to the sound. Going back to our example, those modulations can be observed when we look at the right level of detail:

Bird song recorded by Camille Vacher (Creative Commons Attribution 4.0), transcribed as an air pressure time series, Author Illustration

From sound to image

The Fourier Transform

The Fourier Transform is a mathematical tool that helps us deconstruct a complex signal into a series of simpler sine and cosine waves, each characterized by a specific frequency and amplitude. Here is a simplified version of the Fourier Transform:

A simplified version of the Fourier Transform, neglecting the phase information

In short, the Fourier Transform tells us that any time series can be decomposed into a continuous sum (the integral in the formula) of primary sines/cosines with different amplitudes. This is exactly what we are looking for, because a sound and its frequency spectrum are very closely related.
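For reference, the full (complex) Fourier Transform, which also keeps the phase information that the simplified version above neglects, is written:

```latex
X(f) = \int_{-\infty}^{+\infty} x(t)\, e^{-2\pi i f t}\, \mathrm{d}t
```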

This method allows us to "break" the signal and identify each of its frequency components, providing a more thorough understanding of the overall sound.

The Fourier Transform for a mixture of two cosine signals

To make this clearer, let's consider our previous example where we combined two periodic signals with different frequencies. The output, as we observed, was a signal with a modulated amplitude.

Adding up two periodic signals with different frequencies, Author Illustration

The idea behind the Fourier Transform is simply to identify the different components of our combined signal, which can then be displayed on a diagram showing the amplitude and frequency of each primary cosine identified.

The original signal is broken into primary cosines. Their amplitudes and frequencies are displayed in a diagram, Author Illustration

The resultant diagram showing the primary components of a complex signal with their frequencies and amplitudes is called a spectral diagram, and it contains the "features" composing our signal.

The spectral diagram for our made-up example

What this diagram tells us is simply that our made-up periodic signal is composed of two "primary" cosine signals:

  • One cosine at 2250Hz with an amplitude of 0.5
  • One cosine at 5500Hz with an amplitude of 1

The diagram is sufficient on its own to completely describe our periodic signal: we have jumped from the "time domain" to the "frequency domain".
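We can verify this numerically. Here is a minimal NumPy sketch that builds the two-cosine signal and recovers its two components with a Fast Fourier Transform (the discrete counterpart of the Fourier Transform used here):

```python
import numpy as np

# Rebuild the made-up signal: two cosines at 2250 Hz (amplitude 0.5)
# and 5500 Hz (amplitude 1), sampled at 44.1 kHz for 0.1 s.
fs = 44_100
t = np.arange(0, 0.1, 1 / fs)
x = 1.0 * np.cos(2 * np.pi * 5_500 * t) + 0.5 * np.cos(2 * np.pi * 2_250 * t)

# Real-input FFT, normalized so a cosine of amplitude A peaks at height A.
spectrum = np.abs(np.fft.rfft(x)) * 2 / len(x)
freqs = np.fft.rfftfreq(len(x), d=1 / fs)

print(freqs[spectrum > 0.1])  # -> [2250. 5500.]
```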

Extending the concept with a real-life signal

Our bird call, however, is much more complex than this two-frequency time series. It can contain a multitude of frequencies mixed together, each contributing to the call's unique tone color, and its frequency composition can evolve over time.

Instead of applying the Fourier Transform to the whole signal, we will apply it only locally, on a slice that is short enough for the signal to be "regular enough", yet long enough to contain a sufficient number of oscillation periods.

For example, let's zoom back in on the signal at t=0.0918s and t=0.229s and look at the spectral diagrams. The Fourier Transforms obtained are this time continuous, but they peak at certain frequencies, which match the calculations made in the previous chapter of this article.

Time Domain (left) and Spectral Diagram (right) (t=0.0918s), Bird song recorded by Camille Vacher (Creative Commons Attribution 4.0), transcribed as an air pressure time series, Author Illustration
Time Domain (left) and Spectral Diagram (right) (t=0.229s), Bird song recorded by Camille Vacher (Creative Commons Attribution 4.0), transcribed as an air pressure time series, Author Illustration

We can also determine in more detail the composition of each portion of the signal. In particular, we see that the second slice contains multiple frequency peaks and is "richer" from a harmonic point of view, giving us new information about the "color" we talked about earlier.

Applying the Fourier Transform to a sub-part of the signal as we did above is usually referred to as the Short-Time Fourier Transform (STFT). It's a powerful tool that is particularly useful for describing the sound locally and following its evolution over time.

From STFT to Spectrogram

We now have a tool that can identify the primary components (amplitudes/frequencies) of a slice of a temporal signal. We can apply this method to the whole signal using a sliding window, which extracts the features of the sound over time. Note that instead of showing the spectral diagram as a scatter plot, we will now represent it as a heatmap, with the frequency axis displayed vertically and each pixel encoding the intensity of that frequency.

From Temporal Signal to STFT Heatmap, Author Illustration

Using this representation, we can now stack horizontally the STFTs calculated using the rolling window on the entire signal and visualize the evolution of the frequency spectrum over time through an image. The generated figure is called a spectrogram.

The bird's call in the time domain, and the associated Spectrogram, Bird song recorded by Camille Vacher (Creative Commons Attribution 4.0), transcribed as an air pressure time series, Author Illustration

In the spectrogram above, each column of pixels represents the STFT of a small portion of the signal centered on a given timestamp.
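In practice you rarely compute this by hand. Here is a minimal sketch using the librosa library, where "bird_call.wav" is a placeholder path for a recording of your own, and the window size (n_fft) and hop length are common default choices:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load the recording at its native sampling rate.
y, sr = librosa.load("bird_call.wav", sr=None)

# STFT with a sliding window: n_fft samples per window, hop_length between windows.
D = librosa.stft(y, n_fft=2048, hop_length=512)
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)  # log scale for readability

# Each column of the heatmap is the STFT of one window of the signal.
librosa.display.specshow(S_db, sr=sr, hop_length=512, x_axis="time", y_axis="hz")
plt.colorbar(format="%+2.0f dB")
plt.show()
```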

There are many types of spectrograms, with different scales and different hyper-parameters for the time-frequency transformation. Among them, we can mention the Log-Frequency Power Spectrogram, the Constant-Q Transform (CQT), and the Mel Spectrogram. Each has its own subtleties, but all work on the same basis: extracting frequency features and representing them in a (time x frequency) heatmap that can be interpreted as an image.
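For instance, a Mel Spectrogram can be obtained with one extra librosa call; the parameters below mirror the previous sketch, and "bird_call.wav" is again a placeholder:

```python
import numpy as np
import librosa

y, sr = librosa.load("bird_call.wav", sr=None)

# Frequencies are aggregated into 128 Mel bands, a perceptually motivated scale,
# before the (time x frequency) heatmap is built.
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max)
```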

A few examples

The advantage of the spectrogram is that it condenses all the important features of a sound into one simple image. Analyzing this image tells you about variations over time in the amplitude, pitch, and color of a sound, which is exactly what we (or an ML/DL algorithm) need to recognize its emitter.

Let's have a look at a few 5-second recordings with their associated spectrograms.

The first sample is a European Chaffinch, recorded by Benoît Van Hecke

European Chaffinch call in the time domain, and the associated Spectrogram, by Benoît Van Hecke

The second recording is a European Robin by Benoît Van Hecke

European Robin call in the time domain, and the associated Spectrogram, by Benoît Van Hecke

One last example: a 5s recording of a Song Thrush, again by Benoît Van Hecke

Song Thrush call in the time domain, and the associated Spectrogram, by Benoît Van Hecke

Why Spectrograms + CNNs outperform LSTMs on sound classification

If you have never worked on a sound classification system before, you might consider using a recurrent neural network like an LSTM to extract the relevant features directly from the sound time series.

This would be a bad idea: even though those models are designed to capture temporal dependencies, they are not efficient at extracting frequency features which, as we saw, are crucial for sound identification. LSTMs are also computationally expensive by nature, because the data is processed sequentially, which means far less data processed in a given amount of time compared to a standard CNN.

On the other hand, converting the time series into a spectrogram, which is essentially a visual representation of frequency information over time, allows us to use Convolutional Neural Networks (CNNs). CNNs are designed for image data and are very effective at capturing spatial patterns, which in the case of a spectrogram correspond to frequency patterns over time. This step can be seen as a natural "feature engineering" step, and it usually leads to better efficiency on the sound classification task.
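To give an idea of what this looks like in practice, here is a minimal PyTorch sketch of a CNN that takes spectrograms as single-channel images; the layer sizes, the input shape, and the three classes are illustrative assumptions, not a reference architecture:

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """A small CNN that treats a spectrogram as a one-channel image."""

    def __init__(self, n_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),  # global pooling copes with variable-length clips
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

# e.g. a batch of 8 Mel spectrograms of shape (128 bands x 431 frames)
model = SpectrogramCNN(n_classes=3)  # e.g. Chaffinch, Robin, Song Thrush
logits = model(torch.randn(8, 1, 128, 431))
```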


Conclusion & Further Challenges

In this article, we explored the mechanisms used to extract all the relevant features from a sound and transform a time series into an image.

We only covered the preliminary preprocessing steps, and even though the spectrogram is a powerful transformation, it is not a silver bullet. Several challenges still need to be addressed.

One of the main challenges is the issue of weakly labeled data. In many cases, the exact time of the sound of interest within the recording is not known, and the label only indicates the presence of a sound somewhere in the recording.

Another challenge is the presence of background noise in the recordings. Techniques such as noise reduction or filtering can be used to mitigate this issue, but they also risk distorting the bird song and potentially losing important information. Other alternatives include data augmentation techniques specific to sound classification, such as "mixup" (a method that creates new sounds as linear combinations of original samples) or adding new background noises.
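As an illustration, a minimal mixup sketch for raw waveforms could look like this; it assumes same-length waveforms at the same sampling rate, and labels encoded as one-hot vectors:

```python
import numpy as np

def mixup(wave_a, wave_b, label_a, label_b, alpha=0.4):
    """Create a new training sample as a linear combination of two recordings.

    Assumes same-length waveforms at the same sampling rate, and labels
    encoded as one-hot (or multi-hot) vectors of equal length.
    """
    lam = np.random.beta(alpha, alpha)  # mixing coefficient in (0, 1)
    return lam * wave_a + (1 - lam) * wave_b, lam * label_a + (1 - lam) * label_b
```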

Finally, the length of recordings can also be a problem, especially when coupled with the weakly labeled issue. While some sounds are only a few seconds long, others can last for minutes. This variation can make it difficult to standardize the input to the model, and can also affect the performance of the model, as it may be more difficult to accurately identify species from shorter recordings.

Despite these challenges, the use of spectrograms and Convolutional Neural Networks offers a promising approach to sound identification, and I hope this exploration will serve as a valuable starting point for those embarking on new Machine Learning projects in sound classification. With the right tools and understanding, we can navigate these challenges and unlock the vast potential of audio data.

