Last week, I presented a conference paper at ICSEE2018 on a neural network system that can quickly differentiate between speakers in dual microphone audio. This is related to the cocktail party problem. In my work, the idea is that a neural network learns how to separate voices into bins using a small amount of clean training data for each speaker (a very optimistic assumption). A further simplifying assumption is that there is normal background noise, but nothing difficult like background music or loud noises. The hard part here (or, depending on your point of view, the easy part) is that there should be minimal preprocessing. The neural network makes predictions on each segment of the audio, and when each voice signature is detected in a given conversation, the speaker is recognized.
In the more general cocktail party problem with multiple speakers, which is much harder, the idea is that speakers can be recognized partly by where they are and how loud they talk. In that scenario, the location of the speakers is relevant, whereas in our work the people talking are in similar positions relative to the microphones, and so we are focusing in on the speaker’s voice, rather than the voice power or relative location.

There is other work in this field on extracting features from single and dual microphone signals, but let’s skip over all of that here. Our goal was to see if a certain kind of neural network could do the job with only basic preprocessing.
My coauthors Mohamed Asni, Tony Mathew, Dr. Miodrag Bolic, and Leor Grebler worked with me on this paper for a really long time. I feel like it took us about a year, and counting back to the grant and the idea stages, more than a year. The project stretched from requirements gathering to solution architecture, then data collection, and finally development, analysis, drafting, submission of the paper, and presentation at the conference. These things happen slowly. The paper should show up in IEEE Xplore and Google Scholar before you know it.
At the end of the day, we developed a Deep Learning solution for differentiating human voices in audio originating from two microphone sources simultaneously. To understand the solution better, let’s briefly talk about autoencoders, convolution, MFCC, and more. I’m not going to cover everything we did in the paper, or present the prior art. Instead I want to give you an idea of what we did from a solution architecture perspective.
An Autoencoder (AE) reduces the dimensions of an input to a latent space representation (the encoder part of the AE) and then attempts to reconstruct the original input from that compressed data (the decoder part). This encoder-decoder system is meant to compress the input into some lower dimensional latent space that preserves the essential information of the input. Put another way, it can be seen as an automated feature extractor. Convolutional AEs (CAEs) are AEs where convolutional layers are used for encoding and decoding, rather than the fully connected layers of an MLP.
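To make that a bit more concrete, here is a minimal convolutional autoencoder sketch in Keras. This is not the architecture from our paper; the input shape and layer sizes are placeholders, just to show the encoder/decoder idea.

```python
# A minimal convolutional autoencoder (CAE) sketch in Keras.
# The input shape and layer sizes are placeholders, not the paper's architecture.
from tensorflow.keras import layers, models

def build_cae(input_shape=(64, 64, 1)):
    inputs = layers.Input(shape=input_shape)

    # Encoder: convolution + pooling compress the input into a latent representation.
    x = layers.Conv2D(16, (3, 3), activation="relu", padding="same")(inputs)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(8, (3, 3), activation="relu", padding="same")(x)
    latent = layers.MaxPooling2D((2, 2), name="latent")(x)

    # Decoder: convolution + upsampling try to reconstruct the original input.
    x = layers.Conv2D(8, (3, 3), activation="relu", padding="same")(latent)
    x = layers.UpSampling2D((2, 2))(x)
    x = layers.Conv2D(16, (3, 3), activation="relu", padding="same")(x)
    x = layers.UpSampling2D((2, 2))(x)
    outputs = layers.Conv2D(1, (3, 3), activation="linear", padding="same")(x)

    cae = models.Model(inputs, outputs)
    cae.compile(optimizer="adam", loss="mse")  # trained to reproduce its own input
    return cae
```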
So, we used a CAE for automatic feature extraction and generation of accurate latent space representations, but we didn’t do that on raw audio. Mel-frequency cepstral coefficients (MFCCs) provide a short-term spectral representation of audio. MFCC is a compact form of the amplitude spectrum representation of audio. It reduces computational cost when used as a preprocessing step for feature extraction, and it is widely used for human speech applications. And so, as shown in the image below, we preprocessed the raw audio into MFCC before classifying it using a CAE.
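If you want to see what that MFCC step looks like in code, here is a minimal sketch using librosa (not necessarily the tooling we used; the filename and parameter values are placeholders):

```python
# A sketch of the MFCC step using librosa ("clip.wav" is a placeholder file,
# and the parameter values are illustrative, not the paper's exact settings).
import librosa

audio, sr = librosa.load("clip.wav", sr=44100)

# 20 mel-frequency cepstral coefficients per analysis frame: a far more compact
# spectral summary than the raw waveform's 44100 samples per second.
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)

print(audio.shape, mfcc.shape)  # e.g. (n_samples,) vs (20, n_frames)
```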

The basic aim of this work was to evaluate the CAE’s accuracy as the number of buckets at the network’s output increases. We wanted to do this without heavy preprocessing, using input from two microphone sources simultaneously. Our expectation going into this project was that no matter the size and quality of the dataset, the model’s accuracy would eventually decrease as the number of speakers (buckets) increases. We expected this because the problem gets harder as the number of possible output labels grows. The intuition is that with 2 speakers you have a 50% chance of guessing right by chance, but with 10 speakers you have only a 10% chance. And so separating out who is talking is harder when it could be one of many people.
The following confusion matrix shows the dual microphone result for differentiating between 3 speakers. The system recognized 2 of the 3 speakers correctly on all 12 tries, but the third speaker was confused with the first speaker twice and with the second speaker once.
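If you prefer numbers to pictures, those counts can be laid out as a matrix and the overall accuracy worked out from them. This is just a reconstruction from the description above, assuming 12 tries per speaker:

```python
# Reconstruction of the 3-speaker confusion matrix described above
# (rows = true speaker, columns = predicted speaker), assuming 12 tries each.
import numpy as np

confusion = np.array([
    [12,  0, 0],   # speaker 1: correct on all 12 tries
    [ 0, 12, 0],   # speaker 2: correct on all 12 tries
    [ 2,  1, 9],   # speaker 3: confused with speaker 1 twice, speaker 2 once
])

overall_accuracy = np.trace(confusion) / confusion.sum()
print(overall_accuracy)  # about 0.92 under the assumptions above
```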

Let’s talk a bit more about how the above experiment was performed, and how the CAE was designed. The data was collected using two microphones simultaneously and saved into separate WAV files. We had to collect our own data, as dual microphone datasets are hard to find. We copied the phrases of a common voice dataset, effectively extending it into the dual microphone domain for our narrow application and small dataset. We recorded at a sampling rate of 44100 Hz, with each audio snippet lasting 10 seconds. The average room noise level across the recordings was 47 dB. Our recordings included 6 speakers under the age of 30: 3 male and 3 female. The collected samples were converted to an MFCC representation for each microphone signal, so that we could compare single microphone and dual microphone performance.
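As a rough sketch of that preprocessing step, here is how two simultaneous recordings of one snippet could be turned into a stacked MFCC input. The filenames and MFCC settings are placeholders, not the exact values from the paper:

```python
# A sketch of the preprocessing: two simultaneous 10 second WAV recordings of the
# same snippet are converted to MFCC and stacked so the model sees both microphones.
# Filenames and MFCC settings are placeholders, not the paper's exact values.
import numpy as np
import librosa

SAMPLE_RATE = 44100   # Hz, matching the recording setup
DURATION = 10         # seconds per audio snippet

def snippet_to_mfcc(path, n_mfcc=20):
    audio, _ = librosa.load(path, sr=SAMPLE_RATE, duration=DURATION)
    return librosa.feature.mfcc(y=audio, sr=SAMPLE_RATE, n_mfcc=n_mfcc)

mic1 = snippet_to_mfcc("mic1_snippet.wav")
mic2 = snippet_to_mfcc("mic2_snippet.wav")

# Stack the two microphone MFCCs as channels of a single input, which makes it
# easy to compare single microphone and dual microphone versions of the model.
dual_input = np.stack([mic1, mic2], axis=-1)   # shape: (n_mfcc, n_frames, 2)
```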
In our system, the decoder (a DNN) uses relevant features from the encoded data generated by the encoder (a CNN) in order to differentiate speakers in the original audio by placing them within buckets.
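Here is a minimal sketch of that arrangement in Keras: a convolutional encoder feeding a dense decoder that ends in a softmax over the speaker buckets. Again, the layer sizes and input shape are illustrative, not the paper’s exact configuration:

```python
# A sketch of the encoder/decoder arrangement: a convolutional encoder feeding a
# dense decoder that ends in a softmax over the speaker buckets. Sizes are
# illustrative only, not the configuration from the paper.
from tensorflow.keras import layers, models

def build_speaker_classifier(input_shape=(20, 400, 2), n_speakers=3):
    inputs = layers.Input(shape=input_shape)   # MFCC coefficients x frames x 2 mics

    # Encoder (CNN): compress the MFCC input into a latent feature map.
    x = layers.Conv2D(16, (3, 3), activation="relu", padding="same")(inputs)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(x)
    x = layers.MaxPooling2D((2, 2))(x)

    # Decoder (DNN): dense layers pick out the relevant features and place
    # the audio segment into one of the speaker buckets.
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(n_speakers, activation="softmax")(x)

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```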

For each experiment we used K-fold cross validation to make sure the results were valid. The results for 1 microphone are shown below.
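Before getting to those results: for anyone curious, a minimal version of that cross-validation loop looks something like this (the fold count, epochs, and model builder are placeholders, not necessarily what we used):

```python
# A minimal K-fold cross-validation loop. The fold count, epochs, and model
# builder are placeholders, not necessarily the settings used in the paper.
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(X, y, build_model, k=5, epochs=20):
    scores = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True).split(X):
        model = build_model()                    # fresh, untrained model per fold
        model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
        _, accuracy = model.evaluate(X[test_idx], y[test_idx], verbose=0)
        scores.append(accuracy)
    return np.mean(scores), np.std(scores)
```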

And now let’s look at the results for 2 microphones:

The first thing we notice when comparing single and dual microphone results is that the model performed better when given audio from 2 microphone sources than from a single microphone source. That’s good news. It means our idea to use 2 microphones is not dumb. We also see in both sets of results that performance degrades as the number of possible speakers (buckets) goes up: as the number of speaker classes increases, the model’s accuracy decreases. Digging for a more general conclusion, we found that a CAE can differentiate speakers in audio, given a small sample of audio collected from 2 microphones simultaneously.
Hopefully this article gave you a better sense of what the paper was about, what we figured out, and how we did it. This work was generously funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) and Unified Computer Intelligence Corporation (UCIC.io). Since this project started, Mohamed has come to work at lemay.ai and stallion.ai.
It was really useful to attend this biennial IEEE conference. It was more technical than the last conference I went to (TMLS2018 – more on that trip here) and a bit more than one third the size of TMLS. But I did enjoy both conferences. I met some really interesting people, saw some great talks, and I can tell you from this experience that a lot is changing in the signal processing field. There is still excellent feature engineering work going on, and also a whole whack of new papers on ML/AI approaches to speech and signal processing. There were also some excellent talks in the special session on deep learning. Halfway through some talks I recognized the presenters as people I had seen lecturing in YouTube videos. It’s like nerdy celebrity watching. Very exciting times.
If you liked this article on my recent paper, then have a look at some of my most read past articles, like "How to Price an AI Project" and "How to Hire an AI Consultant."
Until next time!
-Daniel