
Speech Dereverberation using Coherent to Diffuse Power Ratio Estimators (CDR)

A Preprocessing Technique for Automatic Speech Recognition systems

Introduction

When we record a conversation in a room, several acoustic effects occur. For example, the recording may pick up unwanted background noise and reflections of the speech from surfaces in the room.

These reflections are called reverberation. They build up on each other and decay over time as sound is absorbed by the surfaces of objects. Reverberant environments have been shown to reduce speech intelligibility.

In addition, reverberation has been shown to be a notable source of error in Automatic Speech Recognition (ASR) systems [1]. Some ASR systems therefore include a stage that removes reverberation before the recognition task starts.

Dereverberation, the process of removing reverberation from a recorded signal, is therefore an important task in signal processing.

In the past decade, a variety of algorithms have been proposed to suppress reverberation. This article presents a multichannel dereverberation technique based on estimating the Coherent to Diffuse Power Ratio (CDR) [2], which improves performance in ASR systems.

The CDR is the ratio between the power of the coherent (desired) and the diffuse (undesired) signal components. It allows the construction of a postfilter (typically called a CDR-based postfilter) capable of attenuating reverberation. The ratio tells us how much clean speech is present relative to reverberation. It takes values from zero to infinity, where zero means that only reverberation is present and infinity means that only clean speech is present.

It is similar to the signal-to-noise ratio (SNR), but the noise is mainly reverberation.
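Written as a formula, and borrowing notation along the lines of [2] (the symbols below are not used elsewhere in this article, so treat them as assumptions for illustration), the CDR in each time frame l and frequency bin f is the power of the coherent component divided by the power of the diffuse component:

```latex
\mathrm{CDR}(l, f) = \frac{\Phi_s(l, f)}{\Phi_n(l, f)},
\qquad
\mathrm{CDR}(l, f) \in [0, \infty)
```

Here Φ_s stands for the power spectral density of the coherent (direct) speech and Φ_n for that of the diffuse reverberation. As Φ_n approaches zero the ratio grows without bound, and as Φ_s approaches zero the ratio tends to zero, which matches the two extremes described above.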

Methods that estimate the CDR are called CDR estimators. Reference [2] describes several of them, with the original derivations and more in-depth explanations.

The interesting aspect of this technique is that it is simple to apply and does not require training any machine learning model.

Dereverberation with CDR estimators in an array of two microphones

How can CDR estimators be used for speech dereverberation with an array of two microphones? Figure 1 shows a standard dereverberation pipeline based on CDR estimators.

Here G(l, f), Y(l, f), and Z(l, f) are the CDR-based postfilter, the preprocessed signal, and the dereverberated signal, respectively, in the STFT domain, where l is the time frame index and f is the frequency bin. They are related by Z(l, f) = Y(l, f) G(l, f).
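As a minimal sketch of this relation, the postfilter can be viewed as a real-valued gain that multiplies the complex STFT of the preprocessed signal bin by bin. The function name `apply_gain` and the random test data are hypothetical and only serve the illustration; NumPy is assumed to be available.

```python
import numpy as np

def apply_gain(Y, G):
    """Return Z(l, f) = Y(l, f) * G(l, f) for matching time-frequency grids."""
    Y = np.asarray(Y, dtype=complex)
    G = np.asarray(G, dtype=float)
    if Y.shape != G.shape:
        raise ValueError("Y and G must have the same shape")
    # Element-wise product: each bin is scaled independently.
    return Y * G

# Illustrative data only: 4 frequency bins, 3 frames.
rng = np.random.default_rng(0)
Y = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
G = rng.uniform(0.1, 1.0, size=Y.shape)
Z = apply_gain(Y, G)
```

Because the gain is real and positive, it scales the magnitude of each bin and leaves the phase of Y(l, f) untouched.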

Moreover, the CDR(l, f) estimate (shown with a hat in Figure 1 to denote an estimate) depends on other variables (refer to [2]) that are computed inside the CDR estimation block. Explaining them is beyond the scope of this article.

The preprocessed signal Y(l, f) combines the average of the squared magnitudes of the two input recordings with the phase from one of the microphones, as equation 1 shows. This computation takes place inside the Preprocessing block of Figure 1.
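The preprocessing step can be sketched in a few lines of NumPy. The sketch assumes the form used in [2], where the magnitude is the square root of the averaged squared magnitudes and the phase is taken from the first microphone; the choice of the first microphone and the function name `preprocess` are assumptions made for illustration.

```python
import numpy as np

def preprocess(X1, X2):
    """Combine two microphone STFTs into one preprocessed STFT Y(l, f).

    Assumed form: |Y| = sqrt((|X1|^2 + |X2|^2) / 2), phase(Y) = phase(X1).
    """
    X1 = np.asarray(X1, dtype=complex)
    X2 = np.asarray(X2, dtype=complex)
    # Root of the mean squared magnitude across the two channels.
    magnitude = np.sqrt(0.5 * (np.abs(X1) ** 2 + np.abs(X2) ** 2))
    # Phase taken from the first microphone (an assumption of this sketch).
    phase = np.angle(X1)
    return magnitude * np.exp(1j * phase)

# Single-bin example: channel 1 is 3j, channel 2 is 4.
Y_example = preprocess([[3j]], [[4.0]])
```

In this single-bin example the output magnitude is the square root of (9 + 16) / 2 and the phase is that of the first channel, a quarter turn.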

In addition, the CDR-based postfilter is computed from the CDR(l, f) estimate produced by the CDR estimation block and from two parameters. This is shown in equation 2:
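A form of equation 2 that is consistent with the limiting behaviour described in this section and with the formulation in [2] is the following. Treat it as a reconstruction under those assumptions rather than a verbatim copy of the original equation:

```latex
G(l, f) = \max\left\{ G_{\min},\; 1 - \sqrt{\frac{\mu}{\widehat{\mathrm{CDR}}(l, f) + 1}} \right\}
```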

Where µ is the oversubtraction factor and G_min is the minimum gain, both set by the user. In practical applications these parameters are usually tuned for a better listening experience.

When the CDR(l, f) estimate tends to infinity, G(l, f) takes the value 1. When CDR(l, f) is zero, G(l, f) takes the maximum of G_min and one minus the square root of µ. Therefore, the values of G(l, f) lie between G_min and 1.
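These limits can be checked numerically with a small sketch. It assumes the gain rule G = max(G_min, 1 - sqrt(µ / (CDR + 1))), which matches the limiting behaviour described above and the formulation in [2]. The function name `cdr_postfilter` and the default values of `mu` and `g_min` are illustrative assumptions, not recommended settings.

```python
import numpy as np

def cdr_postfilter(cdr, mu=1.0, g_min=0.1):
    """Map CDR estimates to gains under the assumed spectral-subtraction rule."""
    # Clamp negative estimates to zero so the square root stays real.
    cdr = np.maximum(np.asarray(cdr, dtype=float), 0.0)
    gain = 1.0 - np.sqrt(mu / (cdr + 1.0))
    # Never attenuate below the minimum gain.
    return np.maximum(gain, g_min)

gain_clean = cdr_postfilter(np.inf)                       # only clean speech
gain_reverb = cdr_postfilter(0.0, mu=1.0, g_min=0.1)      # only reverberation
gain_partial = cdr_postfilter(0.0, mu=0.25, g_min=0.1)    # smaller oversubtraction
gain_mid = cdr_postfilter(3.0, mu=1.0, g_min=0.1)         # intermediate CDR
```

With an infinite CDR the gain is 1. With a CDR of zero and µ = 1 the raw gain would be 0, so the minimum gain takes over. With µ = 0.25 the gain at a CDR of zero is one minus the square root of 0.25.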

In a nutshell, the CDR-based postfilter attenuates each time-frequency bin whose CDR is low, that is, each bin dominated by reverberation.

Furthermore, this approach assumes an array of two microphones, but it can be applied to larger arrays by taking pairs of microphones and averaging, as suggested in this open issue.
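One way to read this suggestion is sketched below: estimate the CDR for every pair of microphones and average the estimates per time-frequency bin. The text does not say which quantity is averaged, so averaging the per-pair CDR is an assumption. The callable `estimate_cdr_pair` is a hypothetical stand-in for any two-microphone CDR estimator, not a function from the repository [3].

```python
from itertools import combinations

import numpy as np

def average_pairwise_cdr(stfts, estimate_cdr_pair):
    """Average two-microphone CDR estimates over all microphone pairs.

    stfts: sequence of STFT arrays, one per microphone, all the same shape.
    estimate_cdr_pair: hypothetical callable (X_i, X_j) -> CDR array.
    """
    pairs = list(combinations(range(len(stfts)), 2))
    if not pairs:
        raise ValueError("at least two microphones are required")
    estimates = [estimate_cdr_pair(stfts[i], stfts[j]) for i, j in pairs]
    # Mean across pairs, computed independently for each bin.
    return np.mean(estimates, axis=0)

# Toy check with a dummy estimator (not a real CDR estimator).
toy_stfts = [np.full((2, 2), k, dtype=complex) for k in (1, 2, 3)]
toy_cdr = average_pairwise_cdr(toy_stfts, lambda a, b: np.abs(a) + np.abs(b))
```

In a real array, each pair has its own microphone spacing, so the estimator passed in would need to account for the geometry of the pair it is given.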

Visual example of dereverberation

What does dereverberation look like? Figure 2 shows an example, in the STFT domain, of the preprocessed signal and its dereverberated version after CDR-based postfiltering. The two images were obtained by plotting the corresponding spectrograms from the public repository [3].

The reverberation is visible in the preprocessed signal Y(l, f) as spectral smearing. In other words, the dominant high-energy frequency components (red color) tend to last longer: because of reflections, they persist for more time than they should. In the dereverberated signal, these tails are shorter than in the preprocessed signal, which indicates that the CDR-based postfilter is working properly.

Sound example of dereverberation with CDR estimator

The original recording and the dereverberated version shown in Figure 2 are available for listening in the repository [3].

The original reverberant speech is in the file roomC-2m-75deg.wav and the dereverberated version is in out.wav.

Application in Automatic Speech Recognition Systems

As mentioned before, one potential application of speech dereverberation with CDR estimators is in ASR systems. Several studies report how detrimental reverberation is to speech recognition performance.

The study in [4] shows that CDR estimators considerably improved the Word Error Rate (WER), the standard metric for speech recognition performance. The same study highlights that CDR estimators are also competitive with other traditional dereverberation methods in ASR systems.
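For readers unfamiliar with the metric, WER is commonly computed as the word-level edit distance (substitutions, deletions, and insertions) between the recognized text and the reference transcript, divided by the number of words in the reference. The sketch below is a plain-Python illustration of that definition; it is not the scoring tool used in [4], and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the processed reference prefix
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over six words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Lower values are better, and a value of zero means the recognized text matches the reference word for word.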

Figure 3 compares WER values for reverberant unprocessed signals (red color) and their dereverberated versions after applying a CDR estimator (green color).

Room A is a lecture hall with a reverberation time of 1 second, and Room B is a large foyer with a reverberation time of 3.5 seconds. For the remaining experimental conditions and the tuned parameters, refer to [4].

Conclusions

If you are building an application with an array of microphones and you need a preprocessing step that removes reverberation at low computational cost, you should consider this technique.

Moreover, as Figure 3 suggests, using CDR estimators for speech dereverberation reduces the WER by nearly 30% in ASR.

A big advantage of this technique is that it achieves a high-quality dereverberation result without a trained model. However, behind its simple application lie complex derivations that take effort to understand.

References

[1] Marc Delcroix et al. Strategies for distant speech recognition in reverberant environments. EURASIP Journal on Advances in Signal Processing, 2015(1):1–15, 2015.

[2] Andreas Schwarz and Walter Kellermann. Coherent-to-diffuse power ratio estimation for dereverberation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(6):1006–1018, 2015.

[3] Andreas Schwarz. cdr-dereverb. https://github.com/andreas12345/cdr-dereverb, 2019.

[4] Andreas Schwarz, Andreas Brendel, and Walter Kellermann. Coherence-based dereverberation for automatic speech recognition. In Proc. DAGA, 2014.

