IN-DEPTH ANALYSIS

Training A Rudimentary Speaker Verification Model With Contrastive Learning

Speaker Verification, Deep Learning, Contrastive Learning

OngKoonHan
Towards Data Science
11 min read · May 8, 2020


Our Android App

For the group project component of my Android development course in university, our team built and deployed an authentication system that authenticates via a speaker’s voice profile.

Following up from my previous article (linked below), which describes the high-level architecture of the voice authentication system, this article goes in-depth into the development process of the Deep Learning model used.

My previous article can be found here (A Rudimentary Voice Authentication System with Mobile Deployment).

Don’t be afraid to use your voice (Photo by Jason Rosewell on Unsplash)

In this short article, I will describe the different stages involved in developing the voice authentication model and also discuss some personal learning moments gleaned during the process.

Here is an overview of the article:

  • Problem Statement
  • High-level Model Design
  • Data Preprocessing
  • Voice Encoder via Contrastive Learning
  • Binary Classifier for Authentication
  • Model Performance

Problem Statement

Before we begin, we need to be clear about how we are trying to frame the voice authentication problem.

Voice authentication can be done in two main ways (broadly speaking): Speaker Identification and Speaker Verification. These two methods, though very closely related, result in systems with vastly different characteristics when their associated security risks are compared.

Problem Definitions:

  • Speaker Identification: An n-class classification task: given an input utterance, identify the right speaker out of n known speakers (classes).
  • Speaker Verification: A binary classification task: given an input utterance by a speaker with a claimed identity, determine whether that claimed identity is correct.

We can see that in Speaker Identification, we assume that the given input utterance belongs to a speaker that we already know (an office environment perhaps) and we are trying to pick out the closest match from the n known classes/speakers.

In contrast, in Speaker Verification, we assume that we do not know who the given input utterance belongs to (in fact we don’t need to). What we care about is whether or not a given pair of input utterances come from the same person.

To be precise, from the perspective of the overall Speaker Verification system, the system “knows” that the speaker is claiming to be someone “known”, while from the perspective of the model, the model only receives a pair of speech samples and determines if they come from the same person/source.

Imagine using a regular username and password authentication system. The system knows who you are claiming to be (username) and retrieves a stored copy of the reference password. The password checker, however, just checks if the reference password matches the input password and returns the authentication result back to the verification system.

In this project, the Voice Authentication problem is framed as a Speaker Verification problem.

Aside:

At this point, it is worth noting that there is another type of speech analysis problem called Speaker Diarisation, which seeks to determine who is speaking, and when, in a recording containing multiple speakers.

Speaker Diarisation involves a scene where multiple people are speaking, possibly over one another (imagine placing a mic in-between two tables full of people in conversation), and we are trying to attribute each stretch of speech audio to a unique speaker. This might involve multiple mics capturing the same scene from different perspectives, or just a single mic (the hardest case).

We can see how this can easily get very complicated. An example would be identifying who is speaking in an audio recording of a given conversation: one would have to isolate the speech of each person in the conversation (Speaker Diarisation) before trying to put a name to each speaker (Speaker Identification). Of course, hybrid approaches exist, and this is an active area of research.

High-level Model Design

Transform Data For Transfer Learning — For this task, I wanted to leverage transfer learning as much as possible to avoid architecting a complex and performant model on my own.

To achieve this goal, the speech audio signals were transformed into spectrograms (more precisely, melspectrograms) which resembled images of some sort. With the audio converted into melspectrograms, I could then use any of the popular image models available in PyTorch like MobileNetV2, DenseNet, etc.

A sample slice of the melspectrogram from one of the speakers

In hindsight, I realized that I could have used a Wavelet Transform (WT) based method instead of the melspectrogram, which is based on the Fourier Transform (FT), to get a more well-defined spectrogram image. One of the demonstrations in this YouTube video compares the “image quality” of a WT and an FT of a heart ECG signal, and the spectrogram resulting from the WT is visually much more well-defined than that from the FT.

Leverage Contrastive Learning — The classic example of contrastive learning that everyone is familiar with is the setup which uses the triplet loss. This setup encodes 3 samples each time: a reference sample, a positive sample, and a negative sample (2 candidate samples). The goal is then to decrease the distance between the encoded vectors of the reference and positive samples, while increasing the distance between the encoded vectors of the reference and negative samples.
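
For reference, this classic triplet setup is available directly in PyTorch. Below is a minimal usage sketch; the encodings here are random placeholders standing in for the outputs of a shared encoder network.

```python
import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=1.0)

# Placeholder encodings; in practice these come from the shared encoder network
anchor = torch.randn(8, 128)    # reference samples
positive = torch.randn(8, 128)  # same-speaker samples
negative = torch.randn(8, 128)  # different-speaker samples

# Penalizes the reference-positive distance being larger than
# the reference-negative distance minus the margin
loss = triplet_loss(anchor, positive, negative)
```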

In my approach, I did not want to limit the number of candidate samples to just 2. Instead, I used an approach similar to SimCLR (Chen et al., 2020), where multiple candidate samples are used, with exactly one of them being the positive sample. The “contrastive classifier” is then forced to pick the positive sample out of the bunch of candidates. More details on this are given in a later section.

2 Stage Transfer Learning Approach — To solve the speaker verification problem, we will train the model in 2 stages.

First, a speaker voice encoder will be trained via contrastive learning. As mentioned, the contrastive learning will involve multiple candidate samples instead of the usual positive-negative pair for the triplet loss setup.

Second, a binary classifier is then trained on top of the pre-trained voice encoder. This allows the voice encoder to be trained separately before being reused, via transfer learning, for the binary classification task.

Data Preprocessing

VoxCeleb1 Dataset — To train a model to recognize a speaker’s voice profile (whatever that means), I have chosen to use the VoxCeleb1 public dataset.

The VoxCeleb1 dataset contains audio segments of multiple speakers in the wild, that is, speakers talking in a “natural” or “regular” setting. The speakers in the dataset are being interviewed, and the audio segments are curated such that each segment contains the snippet of the interview where the speaker is talking. The dataset contains multiple interviews per speaker, recorded under different interview settings and with various types of equipment, giving me the kind of variability that I would like my voice authentication system to work with.

For this project, only the audio data was used (video data was available). There exist other authentication systems which try to incorporate multiple modes of data (such as video combined with audio to detect if the speech is being produced live) but I decided that it was outside the scope of my project.

Audio Waveform To Spectrogram — To leverage popular image model architectures, the speech audio signals were transformed into melspectrograms, which resemble images of some sort.

First, multiple short audio samples from the same speaker were combined into one long audio sample. As the melspectrograms are based on the Short-Time Fourier Transform (STFT), the whole long audio sample can be converted into melspectrograms in one go and smaller spectrogram slices can be obtained from the long spectrogram. As the long audio sample is made up of smaller audio samples from a single unique speaker, taking slices from the long sample should force the model to focus on picking out unique traits of each individual speaker’s voice profile.

A concatenated audio sample from one speaker

Next, the long audio samples are converted into melspectrograms using the librosa library. The following are the key parameters used:

  • Target sampling rate: 22050
  • STFT window: 2048
  • STFT hop length: 512
  • Mels: 128

The power spectrogram is then converted to decibels, which puts it on a log scale. Since we are analyzing the spectrogram with an image network, we would like the “image” features in the spectrogram to be somewhat evenly spread out.
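
As a reference, here is a minimal sketch of this conversion using librosa with the parameters listed above; the file path and function name are placeholders, not the project's exact code.

```python
import librosa
import numpy as np

def audio_to_melspectrogram(path, sr=22050, n_fft=2048, hop_length=512, n_mels=128):
    # Load and resample the audio to the target sampling rate
    y, _ = librosa.load(path, sr=sr)
    # Mel-scaled power spectrogram based on the STFT
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    # Convert the power spectrogram to decibels (log scale)
    return librosa.power_to_db(mel, ref=np.max)
```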

A melspectrogram for one speaker

Creating “images” — Sampling is done by slicing off smaller spectrograms from the long spectrogram.

The spectrogram “images” are created as 128 x 128 x 3 arrays, in the format of an RGB image. A random start point was chosen on the long spectrogram, and three 128 x 128 spectrogram slices were obtained by sliding the window by half a step (128/2=64) for each new slice. The “images” are then normalized by their largest absolute value so that the values lie in [-1, 1].

Initially, I copied the same spectrogram slice 3 times to convert the “greyscale” image into an “RGB” image. However, I decided to pack more information into each spectrogram “image” by putting 3 slightly different slices into each “RGB” image, since the 3 channels do not carry the usual meaning that a normal RGB image has. Performance seemed to improve slightly after using this sliding technique.
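
A rough sketch of this slicing scheme is shown below; the array layout, helper name, and random start logic are assumptions, not the project's exact code.

```python
import numpy as np

def sample_spectrogram_image(mel_db, size=128, rng=np.random):
    """Cut three half-step-shifted slices from a long (n_mels x time) melspectrogram
    and stack them into a 128 x 128 x 3 "RGB"-style image."""
    step = size // 2  # 64-frame shift between the three channels
    start = rng.randint(0, mel_db.shape[1] - (size + 2 * step))  # random start point
    # Three overlapping slices, each shifted by half a window
    channels = [mel_db[:size, start + i * step: start + i * step + size] for i in range(3)]
    img = np.stack(channels, axis=-1).astype(np.float32)
    # Normalize by the largest absolute value so the values lie in [-1, 1]
    return img / np.max(np.abs(img))
```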

Voice Encoder via Contrastive Learning

Data Sampling — The following are some details on the data sampling used for the contrastive learning of the voice encoder. (Spectrogram slices will be referred to as “images”.)

For each epoch, 200 of the 1,000 full-length spectrograms are randomly loaded into memory as “sub-samples” (due to resource constraints, not all 1,000 full-length spectrograms could be loaded at once).

For each row, 1 reference image and 5 candidate images were used. The candidates consist of 4 negative images and 1 positive image (randomly shuffled), and all images are randomly generated from the sub-samples.

Each epoch had 2,000 rows, and the batch size was 15 (MobileNetV2) and 6 (DenseNet121).
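
Below is a simplified sketch of how one such row might be assembled; the speaker bookkeeping and the image sampler are assumed helpers, not the actual project code.

```python
import random

def make_contrastive_row(sub_samples, sample_image, n_candidates=5):
    """Build one row: a reference image plus n_candidates candidate images,
    exactly one of which comes from the same speaker as the reference.

    sub_samples:  dict mapping speaker_id -> full-length spectrogram
    sample_image: function taking a spectrogram and returning one spectrogram "image"
    """
    speakers = list(sub_samples)
    pos_speaker = random.choice(speakers)
    neg_speakers = random.sample([s for s in speakers if s != pos_speaker], n_candidates - 1)

    reference = sample_image(sub_samples[pos_speaker])
    candidates = [(sample_image(sub_samples[pos_speaker]), 1)]               # positive
    candidates += [(sample_image(sub_samples[s]), 0) for s in neg_speakers]  # negatives
    random.shuffle(candidates)  # hide the position of the positive candidate

    images, labels = zip(*candidates)
    return reference, list(images), labels.index(1)  # positive index = training target
```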

Voice Encoder Contrastive Learning Setup

Multi-Siamese Encoder Network — For the encoder network, the base models used were MobileNetV2 and DenseNet121. The encoder layer size was 128 (replacing the ImageNet classifier in the base models).

To facilitate the contrastive learning, a multi-siamese encoder model wrapper was built as a torch module. The wrapper applied the same encoder to each image and facilitated the cosine similarity computations between the reference image and the candidate images.
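
A condensed sketch of such a wrapper is given below; the use of torchvision's MobileNetV2 with a 128-dimensional encoding head follows the description above, while the exact forward signature and tensor shapes are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class MultiSiameseEncoder(nn.Module):
    """One shared encoder applied to the reference image and all candidate images."""

    def __init__(self, enc_dim=128):
        super().__init__()
        backbone = models.mobilenet_v2(pretrained=True)
        # Replace the ImageNet classifier with a 128-dimensional encoding layer
        backbone.classifier = nn.Linear(backbone.last_channel, enc_dim)
        self.encoder = backbone

    def forward(self, reference, candidates):
        # reference:  (batch, 3, 128, 128)
        # candidates: (batch, n_cand, 3, 128, 128)
        b, n, c, h, w = candidates.shape
        ref_enc = self.encoder(reference)                                   # (batch, enc_dim)
        cand_enc = self.encoder(candidates.reshape(b * n, c, h, w)).reshape(b, n, -1)
        # Cosine similarity of each candidate against the reference, used as logits
        return F.cosine_similarity(ref_enc.unsqueeze(1).expand_as(cand_enc), cand_enc, dim=-1)
```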

Contrastive Loss + Intra-Class Variance Reduction — The two objectives, the contrastive loss and the intra-class variance reduction, are minimized alternately rather than jointly (ideally, the losses from the two objectives would be summed and minimized together, but this was not feasible due to resource constraints). The contrastive loss was computed for every batch, while the variance reduction (intra-class MSE) was run every 2 batches.

For the contrastive loss, following Chen et al. (2020), the problem is formulated as an n-class classification problem where the model tries to identify the positive image from among all the candidates. The cosine similarity is computed for all candidate encodings against the reference encoding, and a softmax over the cosine similarities yields probabilities. The cross-entropy loss is then minimized, as in a standard n-class classification problem.
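
Concretely, a minimal sketch of this loss (assuming the encoder wrapper returns one row of cosine similarities per reference, as in the sketch above):

```python
import torch.nn.functional as F

def contrastive_loss(sims, pos_idx):
    """sims:    (batch, n_cand) cosine similarities of each candidate vs. the reference
    pos_idx: (batch,)        index of the positive candidate in each row
    """
    # cross_entropy applies the softmax over the candidate similarities internally,
    # so this is exactly "softmax over cosine similarities + cross entropy"
    return F.cross_entropy(sims, pos_idx)
```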

For the intra-class variance reduction, the aim is to push images from the same class closer together in the encoding space. Images from the same class/speaker are sampled, and their mean encoding vector is computed. The MSE loss of the encodings against this mean (the intra-class variance) is computed and scaled by 0.20 before backpropagation.
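
A minimal sketch of this intra-class variance term follows; the 0.20 scaling comes from the description above, while the tensor layout and function name are assumptions.

```python
import torch

def intra_class_variance_loss(encodings, scale=0.20):
    """Pull encodings of the same speaker towards their mean.

    encodings: (n_images, enc_dim) tensor of encodings sampled from a single speaker
    """
    mean_enc = encodings.mean(dim=0, keepdim=True)  # class centroid in encoding space
    # MSE of each encoding against the class mean approximates the intra-class variance
    return scale * torch.mean((encodings - mean_enc) ** 2)
```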

Binary Classifier for Authentication

Data Sampling — The following are some details on the data sampling used for the speaker verification binary classifier.

For each epoch, 200 of the 1,000 full-length spectrograms are randomly loaded into memory as “sub-samples” (due to resource constraints, not all 1,000 full-length spectrograms could be loaded at once).

For each reference image, 2 test images are generated, 1 positive and 1 negative image. This yields 2 pairs/rows for each reference image, the Genuine pair (positive) and the Impostor pair (negative).

Images are randomly generated from the sub-samples. Each epoch had 4,000 rows, and the batch size was 320 (MobileNetV2 and DenseNet121).

Speaker Verification Binary Classifier Setup

Verification Binary Classifier Network — The underlying encoder network is the pre-trained encoder from the contrastive learning step, with its weights frozen during classifier training.

The binary classifier is set up as a siamese network, where the absolute difference between the encoding vectors of the input pair is computed. The binary classifier is then built on top of this absolute-difference layer.
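
A sketch of this verification head is shown below; the frozen encoder, the absolute-difference layer, and the sigmoid output follow from the description above, while the hidden layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VerificationClassifier(nn.Module):
    """Siamese verification head on top of a frozen, pre-trained voice encoder."""

    def __init__(self, encoder, enc_dim=128):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():  # freeze the contrastively trained encoder
            p.requires_grad = False
        self.classifier = nn.Sequential(     # illustrative head sizes
            nn.Linear(enc_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, img_a, img_b):
        # Absolute difference of the two encodings feeds the binary classifier
        diff = torch.abs(self.encoder(img_a) - self.encoder(img_b))
        return torch.sigmoid(self.classifier(diff))  # P(is_genuine)
```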

Model Training (Additional Details)

Learning Rate Cycling — Cycling the learning rate improved the model accuracy in both the contrastive learning step and the binary classifier step.

The torch.optim.lr_scheduler.CyclicLR() scheduler was used, with the default step size of 2000, the default “triangular” cycling mode, and no momentum cycling (since the Adam optimizer was used). A configuration sketch follows the learning rate ranges below.

The range of the learning rate used for cycling is as follows:

  • Voice Encoder With Contrastive Learning: 0.0001 to 0.001
  • Binary Classifier: 0.0001 to 0.01
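
Here is a minimal configuration sketch; the voice encoder stage range is used for illustration, and the placeholder model stands in for the actual encoder or classifier network.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 1)  # placeholder for the encoder / classifier network

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer,
    base_lr=1e-4,          # 0.0001 for the voice encoder stage
    max_lr=1e-3,           # 0.001 for the voice encoder stage
    step_size_up=2000,     # default step size
    mode="triangular",     # default cycling mode
    cycle_momentum=False,  # Adam has no momentum parameter to cycle
)

# Inside the training loop, call scheduler.step() after each optimizer.step()
```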

Model Performance

As expected, the model's performance was not as good as that of state-of-the-art models. I believe the contributing factors were:

  • The melspectrogram was not the best signal transformation that I could have used. A wavelet transform based method might have produced higher-quality spectrogram “images”.
  • The base image models used were not state-of-the-art models, as I could not fit extremely large models on my modestly sized GPU (3GB VRAM). A more powerful model like ResNet or ResNeXt might have produced better results.
  • Only one speech dataset was used (VoxCeleb1). A larger quantity and variety of data could definitely have been used (but alas, the deadline was looming closer).

Here are the Equal Error Rates (EER) of comparable models:

  • My best model EER 19.74% (VoxCeleb)
  • Le and Odobez (2018), best model trained from scratch, EER 10.31% (VoxCeleb)
  • Jung et al. (2017), EER 7.61% (RSR2015 dataset)

Other Findings

Base Model Size — Using a larger base model (DenseNet121 vs MobileNetV2) improved classification performance, unsurprisingly.

Effect Of Intra-Class Variance Reduction — The intra-class variance reduction improved classification performance across both base models. In fact, with variance reduction, the performance of MobileNetV2 improved to a level comparable to that of DenseNet121.

ROC (left) and DET (right) Curves

Below are the binary classification score distributions [P(is_genuine)] using MobileNetV2 and DenseNet121, with and without variance reduction.

MobileNetV2 Base Model — Without / With Variance Reduction (left / right)
DenseNet121 Base Model — Without / With Variance Reduction (left / right)

References

Chen, T., Kornblith, S., Norouzi, M. and Hinton, G., 2020. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709

Jung, J., Heo, H., Yang, I., Yoon, S., Shim, H. and Yu, H., 2017, December. D-vector based speaker verification system using Raw Waveform CNN. In 2017 International Seminar on Artificial Intelligence, Networking and Information Technology (ANIT 2017). Atlantis Press.

Le, N. and Odobez, J.M., 2018, September. Robust and Discriminative Speaker Embedding via Intra-Class Distance Variance Regularization. In Interspeech (pp. 2257–2261).
