
X-Ray for Podcasts

How to identify different speakers without labels

Jordan Ryda
Towards Data Science
9 min read · Oct 20, 2019


Amazon offers a tool for Kindle and Prime Video it calls “X-Ray” that gives you access to extra information about the scene at hand. The information includes the characters present, and helps you navigate between sections of a book or drill into the filmography of a vaguely familiar actor.

One might imagine that producing such a service requires some form of labelling (either exhaustively by hand or to train a classifier). However, wouldn’t it be cool if we could infer the splits without any labelling labour at all? With this we could calculate the extent to which characters steal the limelight, when and for how long, as well as how the conversation dynamics change.

X-Ray for Video on Amazon Prime. Pretty pretty good

I chose to look at audio — simpler than video, but also one component of a similar video task. To test a simple base case I selected a podcast containing slowly alternating dialogue between two people and landed upon Sam Harris’s chat with Ezra Klein in #123 - Identity & Honesty — two different male voices that our ears have no trouble discerning. The final product should look something like this:

The final audio X-Ray

1. Creating “chunks” to label

Reading audio in is a breeze with scipy — monophonic audio is just a 1D array of chronological sample values. We now consider the tradeoffs in varying the window size used to store the “chunks” of data we will label. The sample frequency Fs here is the standard 44.1 kHz, which means our arrays will be of size 44,100 * N for a chunk time of N seconds. The smaller the window, the more precise the location in time and the less likely the voices are to interrupt one another within a chunk.
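A minimal sketch of the read-and-chunk step, assuming a 16-bit WAV copy of the episode (the filename is a placeholder, not the actual file from my notebook):

```python
import numpy as np
from scipy.io import wavfile

# Read the episode (placeholder filename) and collapse to mono if needed
Fs, audio = wavfile.read("harris_klein_123.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)

chunk_seconds = 5                        # the window size N discussed above
chunk_size = Fs * chunk_seconds          # 44,100 * 5 = 220,500 samples per chunk

n_chunks = len(audio) // chunk_size
chunks = np.array_split(audio[:n_chunks * chunk_size], n_chunks)
print(f"{n_chunks} chunks of {chunk_seconds} s each")
```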

But longer time windows can also help in identifying a voice — we want more than just a sharp vowel sound. We might suspect this involves looking at some sort of average timbre over a sufficiently long period of time. Five seconds ended up giving good results given the lengthy speaking turns in this discussion (if there were more interruptions this could have been too large).

There is also a more fundamental compromise in frequency resolution (think shorter time windows give broader spectral peaks). But this phenomenon is only important if we extract the spectrum for the entire chunk. We can steer clear of this if we analyse sub-chunks that collectively inform the chunk label (and this is exactly what I do).

Chunking audio has the same effect as applying a rectangular window — it changes the green peak to the blue peaks. The frequency resolution in a discrete Fourier transform is proportional to Fs/N. Source http://saadahmad.ca/fft-spectral-leakage-and-windowing/

The Fourier transform is the mathematics that connects “reciprocal coordinates”. Its scaling property captures the tradeoff we observe in many physical phenomena: spectral analysis (resolution in frequency vs precision in time); Fraunhofer diffraction (the narrower the slit, the wider the diffraction pattern); uncertainty in quantum mechanics (the more precisely we determine the position, the greater the uncertainty in momentum).

2. Audio features

The time component of a signal doesn’t reveal much about the intrinsic character of a sound. Sound waves are just pressure disturbances in air and the periodicity of these disturbances does reveal a signature of sorts in the frequency domain. A Fast Fourier Transform (FFT) of each chunk does the trick and produces the frequency spectra depicted below.

Frequency spectra for Ezra Klein (left) at 30:00–30:05 and Sam Harris (right) at 16:40–16:45. We actually only need the positive terms: for purely real input the output coefficients are “Hermitian-symmetric”, i.e. the negative-frequency terms are complex conjugates of the positive ones and therefore redundant
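Computing such a spectrum is a one-liner with numpy’s real FFT, which returns only the non-redundant positive-frequency half. A sketch continuing from the chunking code above (the chunk index is illustrative; chunk 360 starts at 30:00 for 5-second chunks):

```python
import numpy as np

def chunk_spectrum(chunk, Fs):
    """Magnitude spectrum of one chunk, positive frequencies only."""
    spectrum = np.fft.rfft(chunk)                  # rfft keeps the non-redundant half
    freqs = np.fft.rfftfreq(len(chunk), d=1 / Fs)  # bin centres in Hz
    return freqs, np.abs(spectrum)

freqs, mags = chunk_spectrum(chunks[360], Fs)      # chunk starting at 30:00
```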

These spectra are clearly pretty dense and high dimensional and so we might use the bandwidth of the male vocal range to apply a crude cutoff.

Fs/N represents the frequency spacing between adjacent elements (bins) in the FFT spectrum
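A crude cutoff then just means keeping the bins below some frequency. The 4 kHz value below is my own assumption for illustration, not a figure from the analysis:

```python
# Keep only the bins below f_cut; continues from the spectrum sketch above
f_cut = 4000                   # Hz; an assumed cutoff for illustration
bin_width = Fs / chunk_size    # Fs/N, the spacing between FFT bins
keep = int(f_cut / bin_width)  # number of bins below the cutoff
reduced_spectrum = mags[:keep]
```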

This reduces the dimensionality of the spectrum, but recall there are actually two values for each frequency component — a real and an imaginary part. When we view a spectrum we’re looking at the magnitude, which accounts for both parts. However, it’s the phase of a Fourier transform (the angle given by the ratio of the imaginary to the real part) that carries much of the information, for audio just as for images.

We can test this by crossing the magnitudes and phases between the Ezra and Sam clips; below are a handful of (unnormalised) mutant clips.

Doing naughty things with audio
Original and mutant clips of crossed real/imaginary parts and magnitudes/phases between two chunks
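A sketch of how such a mutant clip might be made: take the magnitudes of one chunk, the phases of another, and invert the transform (the chunk indices and output filename are placeholders):

```python
import numpy as np
from scipy.io import wavfile

def cross_clips(clip_a, clip_b, Fs, out_path="mutant.wav"):
    """Combine the magnitude of clip_a with the phase of clip_b."""
    A, B = np.fft.rfft(clip_a), np.fft.rfft(clip_b)
    mutant_spectrum = np.abs(A) * np.exp(1j * np.angle(B))
    mutant = np.fft.irfft(mutant_spectrum, n=len(clip_a))
    wavfile.write(out_path, Fs, mutant.astype(np.int16))  # unnormalised, as above
    return mutant

# e.g. Ezra's magnitudes (30:00 chunk) with Sam's phases (16:40 chunk)
cross_clips(chunks[360], chunks[200], Fs)
```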

At this point we could pass the structure of magnitudes and phases to a dimensionality reduction algorithm (see section 4) to visualise the output. This is what I did initially; it produced two very fuzzy groups and it wasn’t quick either.

The librosa library has a number of functions to extract features from the frequency spectrum (e.g. the spectral “centre of mass”, the energy distribution, etc.) or the time series (e.g. zero-crossing rate). We might also consider detecting frequency peaks and peak widths using scipy.signal. The features we choose, however, must be specific to the task of separating voices by their timbre in a way that is pitch-invariant. If the problem centred on pitch detection we would instead choose features that distinguish pitch and not timbre.
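For a flavour of what those calls look like, here is a sketch of a few such features on a single chunk (the int16-to-float scaling and the peak-picking thresholds are assumptions for illustration):

```python
import librosa
import numpy as np
from scipy.signal import find_peaks

y = chunks[360].astype(np.float32) / 32768.0  # librosa expects float audio in [-1, 1]; assumes int16 input

centroid = librosa.feature.spectral_centroid(y=y, sr=Fs)  # spectral "centre of mass" per frame
zcr = librosa.feature.zero_crossing_rate(y)               # time-domain zero-crossing rate per frame
peaks, props = find_peaks(mags, height=mags.max() * 0.1, width=1)  # crude spectral peak picking
```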

Examples of audio features as described in Eronen, Antti & Klapuri, Anssi. (2000), Musical instrument recognition using cepstral features and temporal features

In other words, we must ensure the features chosen don’t precisely “fingerprint” the data — we need something fuzzier. We turn to the feature that proved to be the most successful at this — Mel-Frequency Cepstral Coefficients (MFCCs).

3. Extracting Mel-Frequency Cepstral Coefficients (MFCCs)

MFCCs were originally developed to represent sounds made by the human vocal tract. They make use of the Mel scale, a logarithmic transformation of frequency that aims to capture the non-linear human perception of sound, which is less sensitive to differences at higher frequencies. This means that a large distance between MFCC vectors relates to a large perceptual change, and so they capture timbre more robustly and not simply pitch. This and this dive into the mechanics of their computation much better than I can, so I will refrain from doing so here.

Automatic speech recognition tasks typically look at 12–20 cepstral coefficients, and the 0th coefficient can be dropped as it only conveys a constant offset of the average power, not relevant to the overall spectral shape. My results turned out just fine picking 15 coefficients and dropping the 0th order.

MFCC spectrogram for Ezra Klein (left) at 30:00–30:05 and Sam Harris (right) at 16:40–16:45. Removing the 0th order coefficient in the top row gives the bottom row which exhibits a smaller range between the coefficients

MFCC extraction involves “framing” or “windowing”, which applies a sliding window over the input, much like a short-time Fourier transform, to produce a few hundred sets of 15 coefficients (within each audio chunk), one for each consecutive frame. The rationale here is that we want to preserve the frequency contours of the signal over short windows in which it is approximately stationary. These can be lost if we Fourier transform the entire chunk (as I did initially).

Framing within a single “chunk” of audio being identified. Consecutive frames are often set to overlap partially

Some additional Fourier theory is worth considering here when analysing small windows of audio. Modelling discontinuities such as the sharp corners of a square wave requires many high-frequency terms. If we cut the audio into frames with hard edges we can expect discontinuities at those edges, which produce spurious high-frequency contributions as well as spectral leakage, so it makes sense to taper the edges with a window function.
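As a small illustration of the idea (a Hann taper is one common choice; librosa applies a window internally when it frames the signal for MFCCs):

```python
import numpy as np

frame_size = 2048
frame = chunks[360][:frame_size].astype(np.float64)

window = np.hanning(frame_size)                  # Hann taper softens the frame edges
tapered_spectrum = np.fft.rfft(frame * window)
raw_spectrum = np.fft.rfft(frame)                # rectangular window: more spectral leakage
```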

The process of calculating MFCCs from a spectrum has the added benefit of greatly reducing the dimensionality of the input. In every frame we take a large number of samples (2048 here) and reduce them to 15 coefficients. The final feature vector is built up from the mean of the coefficients across all of these frames. Adding the variance didn’t improve results.
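Putting this together, a sketch of the per-chunk feature extraction with librosa (the int16 scaling is an assumption; 16 coefficients are requested so that 15 remain after dropping the 0th):

```python
import librosa
import numpy as np

def chunk_features(chunk, Fs, n_mfcc=16, n_fft=2048):
    """Mean MFCC vector for one chunk, dropping the 0th coefficient."""
    y = chunk.astype(np.float32) / 32768.0                               # assumes int16 input
    mfcc = librosa.feature.mfcc(y=y, sr=Fs, n_mfcc=n_mfcc, n_fft=n_fft)  # shape (16, n_frames)
    return mfcc[1:].mean(axis=1)                                         # 15 coefficients averaged over frames

features = np.vstack([chunk_features(c, Fs) for c in chunks])            # shape (n_chunks, 15)
```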

The process for extracting MFCCs. Mel filter banks have a triangular bandwidth response on a Mel-scale and there are typically 40 of them. Source: https://www.ijser.org/paper/Performance-analysis-of-isolated-Bangla-speech-recognition-system-using-Hidden-Markov-Model.html

4. Dimensionality reduction

PCA is the de-facto workhorse of dimensionality reduction — a quick, deterministic, linear transformation that preserves the global structure of the data by projecting it along the eigenvectors of maximal variance. For example, if we were to group MNIST digits we would see that the centroid for the cluster of 1s is far from the 0s, while the 4, 7 and 9 clusters are relatively close. However, we can’t assume that points that appear close in the projection are also close in the high-dimensional space.

A more modern approach is tSNE, a stochastic technique that benefits from preserving local structure — it places neighbours in the high-dimensional space close to each other, but at the cost of some global structure. Grouping MNIST digits, for example, we see cleanly isolated clusters while PCA produces a continuous smear.

UMAP and tSNE on different datasets. The global structure in the MNIST dataset is apparent

UMAP is a newer technique able to capture both the global structure (à la PCA) and the local structure (à la tSNE). It’s also significantly faster than the best multicore tSNE implementations (13X faster on MNIST), and custom distance functions can be supplied too. In this two-voice problem the choice of dimensionality reduction technique is probably less critical, but it’s also nice to get some hands-on experience. With the MFCCs doing the first stage of reduction, the processing time on 3600 chunks x 15 coefficients is very fast.
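The reduction itself is a couple of lines with umap-learn; a sketch using one of the distance metrics discussed below and otherwise default parameters:

```python
import umap

reducer = umap.UMAP(n_components=2, metric="braycurtis", random_state=42)
embedding = reducer.fit_transform(features)   # shape (n_chunks, 2)
```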

5. Results

The graph below shows not two distinct clusters but three! The colour of each point indicates the time into the podcast and edges connect consecutive chunks. The dark blue points therefore indicate that the introduction was probably recorded at a different time — with different equipment or in a different setting. What’s nice here is that the introduction cluster (dark blue) where Sam opens is closer to his dialogue cluster. These two clusters can actually be subsumed if we remove some of the higher MFCC coefficients (8+) although at a cost to the global separation between Sam and Ezra.

Of the many distance metrics UMAP has built in, the three I saw perform most reliably for this task were manhattan, braycurtis and cosine. Silhouette scores above each plot were used as light guidance on clustering performance. A small dark blue cluster sometimes appears under different MFCC parameter configurations

From here we can apply kmeans with k=3 clusters to map each point to one of three labels, naming each cluster by listening to a few of its points. But wouldn’t it be nice to link each point to its audio chunk for playback on click? This would help with validation of the clusters and answer questions such as “do points in one cluster closest to another cluster represent a chunk with both voices?” It turned out to be surprisingly challenging, but not insurmountable, using Dash and plotly as well as some javascript and a 3rd party library.
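A sketch of the clustering step with scikit-learn, including the silhouette score mentioned above (the mapping from cluster ids to speakers still has to be done by ear):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(embedding)
labels = kmeans.labels_

print("silhouette:", silhouette_score(embedding, labels))
print("chunks per cluster:", np.bincount(labels))   # which id is which speaker is decided by listening
```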

Anyhow…the results!

Dash app with plotly and some javascript. Note that the final point selected, which is being tugged by both clusters, features both Ezra and Sam (braycurtis distance metric used)

Not too shabby? Now that we have our kmeans labels we can compute the oft-quoted debate statistic, duration spoken.

Sam Harris    365 chunks
Ezra Klein    272 chunks
Intro          85 chunks

Sam spoke for 57% of the time excluding the intro! Well, it is his podcast after all. Adding back in the temporal component of each chunk, here is a rather satisfying final product: the X-Ray timeline for the conversation. The frequency of exchange during Sam’s phase at around 2500 s suggests this section is particularly interesting/heated.
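A sketch of how such a timeline might be drawn from the chunk labels with matplotlib (the cluster-id-to-speaker mapping and the colours are placeholders, assigned by listening):

```python
import matplotlib.pyplot as plt

sam, ezra = 365, 272                                                 # chunk counts from above
print(f"Sam spoke {sam / (sam + ezra):.0%} of the non-intro time")   # ~57%

names = {0: "Sam Harris", 1: "Ezra Klein", 2: "Intro"}    # placeholder id-to-speaker mapping
colours = {0: "tab:blue", 1: "tab:orange", 2: "tab:gray"}

fig, ax = plt.subplots(figsize=(12, 1.5))
for i, label in enumerate(labels):                        # one coloured band per 5-second chunk
    ax.broken_barh([(i * chunk_seconds, chunk_seconds)], (0, 1), facecolors=colours[label])
ax.set_yticks([])
ax.set_xlabel("Time into the podcast (s)")
handles = [plt.Rectangle((0, 0), 1, 1, color=colours[k]) for k in names]
ax.legend(handles, names.values(), loc="upper right", ncol=3)
plt.show()
```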

The final audio X-Ray, again

Code

I have spruced up my Jupyter notebook, which can be found in this gist.
