Part III. Dereverberation


1 Part III. Dereverberation
Tomohiro Nakatani: I would now like to start Part III, on dereverberation.

2 Speech recording in reverberant environments
When we capture speech using distant microphones, reverberation, which is composed of reflections from walls and ceilings, is inevitably included and degrades the quality of the recorded speech. Dereverberation is a technique to reduce this reverberation and thereby improve the quality of the recorded speech. Dereverberation is needed to enhance the quality of recorded speech by reducing the reverberation included in it. Haeb-Umbach and Nakatani, Speech Enhancement - Dereverberation

3 Effect of reverberation
Non-reverberant speech captured by a headset; reverberant speech captured by a distant mic. These are example spectrograms of non-reverberant and reverberant speech signals. As you can see, the time-frequency structure of non-reverberant speech is severely smeared by reverberation. Due to this effect, speech becomes less intelligible and ASR becomes very hard.

4 Table of contents in part III
Goal of dereverberation; approaches to dereverberation (signal processing based approaches, a DNN-based approach); integration of signal processing and DNN approaches (DNN-WPE). In the remainder of this part, I will first explain the goal of dereverberation, and then present different approaches to dereverberation, namely signal-processing-based approaches and a DNN-based approach. After that, I will discuss an integration of the two approaches that achieves better dereverberation.

5 Goal of dereverberation: time domain
Preserve the desired signal (direct sound + early reflections); reduce the late reverberation. A reverberant speech signal is in general modeled by the time-domain convolution of clean speech s and a room impulse response a. As in this figure, the room impulse response can be decomposed into three parts: the direct sound, the early reflections that arrive within 30-50 ms after the direct sound, and the late reverberation that arrives after the early reflections. With this decomposition, the reverberant speech is also decomposed into two parts: the first part is composed of the direct sound plus the early reflections, and the second part corresponds to the late reverberation. It is known that while the first part improves the intelligibility of speech and ASR performance, the second part degrades both of them [Bradley et al., 2003]. Therefore, the goal of dereverberation is to preserve the first part while reducing the second part. Hereafter, we refer to the first part as the desired signal d, and to the second part as the late reverberation r.

6 Model of reverberation: STFT domain
Time domain convolution is approximated by frequency domain convolution at each frequency [Nakatani et al., 2008], if the frame shift is much smaller than the analysis window (e.g., frame shift <= analysis window / 4). Next let me explain the STFT-domain model of reverberation. It has been shown that time-domain convolution can be approximated by a convolution within each frequency bin of the STFT if the frame shift of the STFT analysis is much shorter than the analysis window. Under this assumption, we can use the convolutional model for reverberation also in the STFT domain at each frequency, and again decompose the signal into the desired signal and the late reverberation. With vector representations of the channels, the multichannel observed signal x is modeled by the same kind of equation, with a convolutional transfer function per frequency.
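As a concrete illustration, the per-frequency convolutive model and its split into desired signal and late reverberation can be sketched in a few lines of numpy. This is a toy sketch: the frame count, tap count, and early/late boundary D are arbitrary values chosen for the example, not the tutorial's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-frequency convolutive model at one STFT bin (toy values):
#   x[t] = sum_{l=0}^{L-1} a[l] * s[t-l]
# with taps l < D forming the desired signal d (direct + early)
# and taps l >= D forming the late reverberation r.
T, L, D = 200, 8, 2
s = rng.standard_normal(T) + 1j * rng.standard_normal(T)   # clean STFT frames
a = (rng.standard_normal(L) + 1j * rng.standard_normal(L)) * 0.5 ** np.arange(L)

def conv(h, sig):
    """Causal subband convolution of filter h with frame sequence sig."""
    return np.convolve(sig, h)[: len(sig)]

x = conv(a, s)        # reverberant observation
d = conv(a[:D], s)    # desired signal: direct sound + early reflections
r = x - d             # late reverberation
assert np.allclose(x, d + r)   # the decomposition is exact by construction
```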

7 Approaches to dereverberation
Beamforming (multi-ch): enhance desired signal from speaker direction; mostly the same as denoising. Blind inverse filtering (multi-ch): cancel late reverberation; multi-channel linear prediction; weighted prediction error (WPE) method. DNN-based spectral enhancement (1ch): estimate clean spectrogram; mostly the same as denoising autoencoder. In this tutorial, I present three approaches to dereverberation. The first one is beamforming. This approach enhances the desired signal as the one that comes from the target speaker's direction, and it is based mostly on the same mechanism as beamforming for denoising. The second approach is blind inverse filtering, which directly cancels the late reverberation; multi-channel linear prediction is a powerful tool in this approach. The last one is the DNN-based approach, which uses a DNN to map the magnitude spectra of reverberant speech to those of clean speech; it is mostly the same as neural-network-based denoising. In the following slides, I will explain these approaches one by one.

8 Approaches to dereverberation
Beamforming (multi-ch): enhance desired signal from speaker direction; mostly the same as denoising. Blind inverse filtering (multi-ch): cancel late reverberation; multi-channel linear prediction; weighted prediction error (WPE) method. DNN-based spectral enhancement (1ch): estimate clean spectrogram; mostly the same as denoising autoencoder. The first one is the beamforming approach.

9 Dereverberation based on beamforming
Time domain model of desired signal: convolution of the clean speech with the initial part (30-50 ms) of the impulse response. Assume this initial part is much shorter than the STFT window; then, in the STFT domain, the desired signal is the product of the clean speech and an acoustic transfer function, and beamforming is applicable to reduce the late reverberation. Techniques for estimating the spatial covariances of the desired signal and the late reverberation: maximum-likelihood estimators [Schwartz et al., 2016]; eigenvalue-decomposition-based estimators [Heymann, 2017b, Kodrasi and Doclo, 2017, Nakatani et al., 2019a]. As I explained, in the time domain, the desired signal is modeled by convolution of the clean speech with the initial part of the room impulse response, of the order of 30 to 50 ms. So, if we assume that the duration of this initial response is short in comparison with the STFT analysis window, the desired signal in the STFT domain can be modeled as the product of the clean speech s and the STFT of the initial response, denoted by a. Then, taking this a as an acoustic transfer function and the late reverberation r as additive noise, we can simply apply beamforming to the observed signal x to reduce the late reverberation. Techniques for estimating the spatial covariance matrices of the desired signal and the late reverberation have been proposed, including maximum-likelihood estimation approaches and eigenvalue decomposition approaches. With these techniques, we can perform beamforming for dereverberation. Later I will show the effectiveness of this approach in comparison with the other dereverberation approaches.
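To make the filtering step concrete, here is a minimal numpy sketch of an MVDR beamformer at a single frequency bin, treating late reverberation as additive noise, as described above. Unlike the ML and eigendecomposition estimators cited on the slide, the acoustic transfer function and the interference covariance are simply computed from the simulated signals here, so this illustrates only the beamforming itself, under toy signal assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
M, T = 4, 2000                      # microphones, frames (one frequency bin)

a = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # acoustic transfer function (assumed known)
s = rng.standard_normal(T) + 1j * rng.standard_normal(T)  # desired-source STFT coefficients
# "Late reverberation" modeled as spatially white additive interference.
r = 0.5 * (rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T)))
x = np.outer(a, s) + r

# MVDR: minimize output interference power subject to a distortionless
# response toward a:  w = R_r^{-1} a / (a^H R_r^{-1} a).
R_r = r @ r.conj().T / T            # spatial covariance of the interference
w = np.linalg.solve(R_r, a)
w /= a.conj() @ w
y = w.conj() @ x                    # beamformer output

assert np.isclose(w.conj() @ a, 1.0)          # desired signal passes unchanged
# Residual interference is reduced relative to a single microphone.
assert np.mean(np.abs(y - s) ** 2) < np.mean(np.abs(x[0] - a[0] * s) ** 2)
```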

10 Approaches to dereverberation
Beamforming (multi-ch): enhance desired signal from speaker direction; mostly the same as denoising. Blind inverse filtering (multi-ch): cancel late reverberation; multi-channel linear prediction; weighted prediction error (WPE) method. DNN-based spectral enhancement (1ch): estimate clean spectrogram; mostly the same as denoising autoencoder. Next, let me present the blind inverse filtering approach. Because this is a unique approach developed specifically for dereverberation, I will spend relatively more time describing it in the following slides.

11 What is inverse filtering?
Reverberant speech (multi-ch) is dereverberated by an inverse filter to recover the clean speech. Convolution with RIRs is viewed as a linear transformation (= matrix multiplication), so the inverse filter is viewed as a matrix inversion. Let me first explain what inverse filtering is. Because convolution with RIRs can be viewed as a linear transformation, or more specifically as a matrix multiplication, the inverse filter can be viewed simply as a matrix inversion. Therefore, as long as the RIRs are invertible, inverse filtering can completely cancel the reverberation. Let me make one note here: the goal of inverse filtering is in general to estimate the direct signal s. We adopt this as a tentative goal for the purpose of the explanation, but will later change it to our final goal, namely estimation of the desired signal d.

12 Represent RIR convolution by matrix multiplication
1-ch representation; multi-ch representation. This slide explains how we can view the convolution of RIRs as a matrix multiplication. In the upper figure, H is called a convolution matrix. Each row contains the impulse response, and its position shifts to the right by one sample from the top row to the bottom row. By multiplying this matrix with the time series of clean speech s, we actually calculate a convolution and obtain the time series of the reverberant observation x. For the multi-ch representation, as in the lower figure, we simply stack the multi-ch observed signals x and the corresponding convolution matrices, so that the reverberant observation x is modeled by the multiplication of H and s.
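The convolution-matrix view can be checked directly in numpy. The impulse response and signal below are random toy data; the assertion confirms that the matrix product equals ordinary convolution.

```python
import numpy as np

rng = np.random.default_rng(2)
T, L = 16, 4
s = rng.standard_normal(T)
h = rng.standard_normal(L)            # a short toy "room impulse response"

# Convolution matrix H: (H @ s)[t] = sum_l h[l] * s[t-l].
# Each row holds the impulse response, shifted right by one per row.
H = np.zeros((T, T))
for t in range(T):
    for l in range(min(L, t + 1)):
        H[t, t - l] = h[l]

x = H @ s
assert np.allclose(x, np.convolve(h, s)[:T])   # matrix product == convolution
```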

13 Existence of inverse filter [Miyoshi and Kaneda, 1988]
Given H, the inverse filter W should satisfy W H = I (the identity matrix). A solution exists and is obtained in closed form when H is full column rank (roughly, #mics > 1). When the convolution matrix H is given, the inverse filter W is defined as one that satisfies this equation: when it is multiplied with H, we obtain an identity matrix. The solution exists when H is full column rank, a condition that roughly corresponds to the case where the number of mics is more than one. So, when H is given, we can perform inverse filtering. However, H is not given in general, so the problem of blind inverse filtering is how to estimate W without knowing H. How can we estimate W without knowing H?
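This existence argument can be verified numerically with the least-squares inverse W = (H^T H)^{-1} H^T, computed here via numpy's pseudo-inverse. The two-channel stacked convolution matrix is a toy construction (leading tap fixed to 1 so each channel is well conditioned), not a real RIR.

```python
import numpy as np

rng = np.random.default_rng(3)
T, L, M = 16, 4, 2                    # frames, RIR length, microphones
s = rng.standard_normal(T)

def conv_matrix(h, T):
    """T x T lower-triangular convolution matrix of impulse response h."""
    H = np.zeros((T, T))
    for t in range(T):
        for l in range(min(len(h), t + 1)):
            H[t, t - l] = h[l]
    return H

# Stack per-microphone convolution matrices: x = H s, with H of size (M*T, T).
H = np.vstack([conv_matrix(np.r_[1.0, 0.4 * rng.standard_normal(L - 1)], T)
               for _ in range(M)])
x = H @ s

# Full column rank => the pseudo-inverse W satisfies W H = I.
W = np.linalg.pinv(H)
assert np.allclose(W @ H, np.eye(T))
assert np.allclose(W @ x, s)          # inverse filtering recovers clean speech
```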

14 Approaches to blind inverse filtering
Blind RIR estimation + robust inverse filtering. Blind RIR estimation is still an open issue: eigen-decomposition [Gannot, 2010]; ML estimation approaches [Juang and Nakatani, 2007, Schmid et al., 2012]. Robust inverse filtering: regularization [Hikichi et al., 2007]; partial multichannel equalization [Kodrasi et al., 2013]. Blind and direct estimation of the inverse filter: multichannel linear prediction (LP) based methods - prediction error (PE) method [Abed-Meraim et al., 1997], delayed linear prediction [Kinoshita et al., 2009], weighted prediction error (WPE) method [Nakatani et al., 2010], multi-input multi-output (MIMO) WPE method [Yoshioka and Nakatani, 2012]; higher-order decorrelation approaches - kurtosis maximization [Gillespie et al., 2001]. Two approaches have been extensively studied for blind inverse filtering. One is blind RIR estimation followed by robust inverse filtering. However, blind RIR estimation is still an open issue, and we do not yet have a satisfactory solution to it. In contrast, the second approach directly estimates the inverse filter without estimating the RIR. A series of studies has shown that multi-channel linear prediction is very effective for achieving blind inverse filtering, so in the following slides I will focus on this approach.

15 Multichannel LP [Abed-Meraim et al., 1997]
Predict the reverberation in the current signal from the past signal (multi-ch); the direct signal is what remains. This figure illustrates the basic idea of multichannel linear prediction. The vertical bars represent a time series of observed signals from the past to the present, and the last bar to the right represents the current signal. With simple multichannel linear prediction, the goal of dereverberation is set to estimating the direct signal (not the desired signal). Then, assuming that the reverberation included in the current signal can be predicted from the past signals by linear prediction, dereverberation is performed by subtracting the predictable components from the current observed signal. Dereverberation: subtract predictable components from the observation.

16 Definition of multichannel LP
Multichannel autoregressive model: the current observed signal is generated by linear prediction from the past observed signals plus the direct signal, where the W are called prediction matrices. Assuming stationary white noise, the ML solution minimizes the squared prediction error; with the estimated prediction matrices, the direct signal is estimated by subtracting the predicted signal (= inverse filtering). Let me give a formal definition of multichannel linear prediction. With this approach, the observed signal is assumed to follow a multichannel autoregressive model: the current observed signal x is generated by linear prediction from the past observed signals plus the direct signal, where W are the prediction matrices. Then, assuming the direct signal is stationary white noise, the ML solution for the prediction matrices W is the one that minimizes the squared prediction error between the current and predicted signals. Once the prediction matrices are obtained, the direct signal is estimated by subtracting the predicted signal from the current observed signal.
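Under the white stationary assumption, the ML solution is an ordinary least-squares regression of the current frame on the stacked past frames. The sketch below draws the data from a synthetic multichannel AR process, so the model assumptions hold exactly; all sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(4)
M, T, K = 2, 500, 3                     # channels, frames, prediction order

# Synthetic multichannel AR process: x[t] = sum_l A[l] x[t-1-l] + d[t],
# with d[t] white noise playing the role of the "direct signal".
A = [0.3 * rng.standard_normal((M, M)) / (l + 1) for l in range(K)]
d = rng.standard_normal((T, M))
x = np.zeros((T, M))
for t in range(T):
    x[t] = d[t] + sum(A[l] @ x[t - 1 - l] for l in range(K) if t - 1 - l >= 0)

# ML (= least-squares) estimate of the prediction matrices: regress x[t]
# on the stacked past frames [x[t-1], ..., x[t-K]].
past = np.hstack([x[K - 1 - l : T - 1 - l] for l in range(K)])   # (T-K, M*K)
target = x[K:]                                                   # (T-K, M)
W, *_ = np.linalg.lstsq(past, target, rcond=None)

d_hat = target - past @ W        # prediction error = estimated direct signal
assert np.mean(d_hat ** 2) < np.mean(target ** 2)   # prediction removes energy
```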

17 Problems in conventional LP
Speech is not stationary white noise. LP assumes the target signal d to be temporally uncorrelated, but speech exhibits short-term correlation (30-50 ms), so LP distorts the short-time correlation of speech. LP also assumes the target signal d to be stationary, but speech is not stationary over long durations (200-1000 ms), so LP destroys the time structure of speech. Solutions: use of a prediction delay [Kinoshita et al., 2009]; use of a better speech model [Nakatani et al., 2010]. However, because speech is not stationary white noise, there are two major problems when we apply linear prediction to speech dereverberation. First, linear prediction assumes that the target signal is temporally uncorrelated, but in reality speech exhibits short-term correlations of the order of 30 to 50 ms; applying linear prediction to speech dereverberation therefore distorts the short-time correlations of speech. Second, linear prediction assumes that the target signal is stationary, but speech is not stationary over durations of the order of 200 to 1000 ms; applying LP to speech dereverberation therefore destroys the time structure of speech. To overcome these problems, two solutions have been proposed, namely the use of a prediction delay and the use of a better source model. I will explain them in the following slides.

18 Delayed LP (DLP) [Kinoshita et al., 2009]
Predict the current signal only from past signals older than a delay D (= 30-50 ms); the desired signal then becomes unpredictable, and only the late reverberation remains predictable. The first solution is the use of delayed LP. With delayed LP, we insert a delay period D, of the order of 30 to 50 ms, just before the current observed signal, and when we predict the current observed signal from the past observed signals, we skip this delay period. With this modification, linear prediction can only predict the late reverberation included in the current signal; the desired signal can no longer be predicted. As a consequence, delayed LP reduces only the late reverberation included in the observed signal, without distorting the inherent temporal correlations of speech.
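A minimal single-channel sketch of delayed LP: the design matrix simply skips the D most recent frames, so only components older than the delay can be predicted and subtracted. The exponentially decaying "room response" and the white source are toy stand-ins; D here counts frames, standing in for the 30-50 ms delay.

```python
import numpy as np

rng = np.random.default_rng(5)
T, K, D = 400, 10, 3       # frames, prediction taps, prediction delay

s = rng.standard_normal(T)                 # toy "clean" frame sequence (white)
a = 0.7 ** np.arange(25)                   # decaying toy reverberation tail
x = np.convolve(a, s)[:T]                  # reverberant observation

# Delayed LP: predict x[t] from x[t-D], ..., x[t-D-K+1], skipping the D most
# recent frames so short-term source correlation is not predicted away.
ts = range(K + D, T)
past = np.array([x[t - D - K + 1 : t - D + 1][::-1] for t in ts])
w, *_ = np.linalg.lstsq(past, x[K + D:], rcond=None)
d_hat = x[K + D:] - past @ w               # late reverberation subtracted

# The output is closer to the desired (direct + early) signal than the
# raw observation is.
desired = np.convolve(a[:D], s)[:T][K + D:]
assert np.mean((d_hat - desired) ** 2) < np.mean((x[K + D:] - desired) ** 2)
```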

19 Introduction of a better source model [Nakatani et al., 2010, Yoshioka et al., 2011]
Model of the desired signal: time-varying Gaussian (local Gaussian) with time-varying variance (the source PSD). ML estimation for the time-varying Gaussian source corresponds to minimization of the weighted prediction error (WPE). The second solution introduces a better model for the desired signal, defined by a time-varying Gaussian distribution with time-varying variance sigma_{t,f} squared, which is also called the source PSD. With this formulation, maximum-likelihood estimation of the prediction matrices and the source PSD is conducted such that maximization of the likelihood function corresponds to minimization of the squared prediction error weighted by the inverse of the source PSD. For this reason, the method is called the weighted prediction error method, or WPE. With these solutions, blind inverse filtering can be achieved based on only a few seconds of observation.

20 Processing flow of WPE
Dereverberation; source PSD estimation; prediction matrix estimation. This slide shows the processing flow of WPE. Because it is difficult to obtain a closed-form solution to the ML estimation, iterative estimation is introduced, in which the source PSD and the prediction matrices are alternately updated. It has been experimentally confirmed that a few iterations are sufficient to reach a good solution. Dereverberation is then performed using the estimated prediction matrices.
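The alternating updates can be sketched for one channel and one frequency bin: the source PSD is re-estimated from the current dereverberated output, and the prediction filter is re-fit by weighted least squares. The signal and filter below are synthetic toys with a time-varying source variance, not the tutorial's data.

```python
import numpy as np

rng = np.random.default_rng(6)
T, K, D, n_iter = 600, 8, 2, 3  # frames, taps, prediction delay, WPE iterations

# Toy subband signal: time-varying source variance through a decaying filter.
sigma = 0.5 + np.abs(np.sin(np.arange(T) / 20.0))
s = sigma * (rng.standard_normal(T) + 1j * rng.standard_normal(T))
a = 0.6 ** np.arange(20)
x = np.convolve(a, s)[:T]

# Delayed-prediction design matrix: row t holds x[t-D], ..., x[t-D-K+1].
ts = np.arange(K + D, T)
X = np.array([x[t - D - K + 1 : t - D + 1][::-1] for t in ts])
y = x[K + D:]

d_hat = y.copy()
for _ in range(n_iter):
    lam = np.maximum(np.abs(d_hat) ** 2, 1e-6)  # source PSD from current output
    Xw = X / lam[:, None]
    # Weighted least squares: minimize sum_t |y[t] - X[t] w|^2 / lam[t]
    w = np.linalg.solve(X.conj().T @ Xw, Xw.conj().T @ y)
    d_hat = y - X @ w                           # dereverberation step

desired = np.convolve(a[:D], s)[:T][K + D:]
assert np.mean(np.abs(d_hat - desired) ** 2) < np.mean(np.abs(y - desired) ** 2)
```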

21 Why does WPE achieve inverse filtering?
Assumption: the desired signal d is not correlated with the late reverberation r or with the signal predicted from the past observations. Here, let me briefly explain why WPE can achieve inverse filtering. The cost function of WPE is defined as the weighted squared prediction error. Then, assuming that the desired signal is uncorrelated with the late reverberation r and with the signal predicted from the past observed signals, the lower bound of the cost function is shown to be the weighted power of the desired signal, and this minimum is achieved only when the late reverberation included in the observed signal is equal to the signal predicted by linear prediction. This means that linear prediction can perform inverse filtering, and the existence of such prediction matrices W is guaranteed whenever the inverse filter exists.

22 Extensions
Elaboration of probabilistic models: sparse prior for the speech PSD [Jukic et al., 2015]; Bayesian estimation with a Student-t speech prior [Chetupalli and Sreenivas, 2019]. Frame-by-frame online estimation: recursive least squares [Yoshioka et al., 2009], [Caroselli et al., 2017]; Kalman filter for joint denoising and dereverberation [Togami and Kawaguchi, 2013], [Braun and Habets, 2018], [Dietzen et al., 2018]. Some important extensions have already been proposed for WPE. For example, elaborations of the probabilistic model have been proposed, such as a sparse prior for the speech PSD and Bayesian estimation with a Student-t speech prior. Frame-by-frame online processing has also been extensively studied, including recursive least squares and Kalman filter based approaches.

23 Approaches to dereverberation
Beamforming (multi-ch): enhance desired signal while reducing late reverberation; mostly the same as denoising. Blind inverse filtering (multi-ch): cancel late reverberation; (multi-channel) linear prediction; weighted prediction error method. DNN-based spectral enhancement (1ch): estimate clean spectrogram; mostly the same as denoising autoencoder. OK, next I will very briefly explain the DNN-based approach.

24 Neural networks based dereverberation
Train neural networks on a huge amount of parallel data to map features of the observed signal (magnitude spectrum, etc.) to the clean speech PSD, a time-frequency mask, etc. Many variations have been proposed, depending on the task (masking / regression), the cost function, and the network structure [Weninger et al., 2014, Williamson and Wang, 2017]. Similar to other neural-network-based speech enhancement approaches, for dereverberation we train neural networks to map the observed signal to the clean speech signal.

25 REVERB Challenge task [Kinoshita et al., 2016]
Speech enhancement and ASR evaluation. Acoustic conditions: reverberation (reverberation time 0.2 to 0.7 s), stationary noise (SNR ~20 dB), speaker-to-microphone distance 0.5 to 2.5 m. Now I would like to compare the performance of the three approaches using the REVERB challenge dataset. With this dataset, we can evaluate speech enhancement approaches in terms of both speech enhancement and ASR performance. In the dataset, the reverberation time varies from 0.2 to 0.7 s, and a small amount of stationary noise is also added. 1ch, 2ch, and 8ch scenarios are provided; here we use only the 8ch scenario.

26 Comparison of three approaches
                           Simu data                          Real data
                           FWSSNR    CD        PESQ   WER      WER
Observed                   3.62 dB   3.97 dB   1.48   5.23 %   18.41 %
MVDR                       6.59 dB   3.43 dB   1.75   6.65 %   14.85 %
WPE                        4.79 dB   3.74 dB   2.33   4.35 %   13.24 %
WPE+MVDR                   7.30 dB   3.01 dB   2.38   3.85 %    9.90 %
DNN (soft mask estimation) 7.52 dB   3.11 dB   1.46   7.98 %   23.38 %
These are the evaluation results. An MVDR beamformer, WPE dereverberation, the combination of WPE and MVDR, and the DNN-based approach are compared. FWSSNR and CD are used as signal-level distortion metrics, PESQ is used as an objective speech quality metric, and WER is used for the ASR evaluation. As you can see, MVDR and WPE both greatly improved the quality of speech compared with the observed signal, and their combination improved the quality further. On the other hand, the neural network approach greatly improved FWSSNR and CD, but it was not effective for the other metrics. (The following can be skipped.) The reason why the NN-based approach does not contribute much to ASR is that the NN-based acoustic model used in ASR can play almost the same role as the NN-based dereverberation frontend, so adding the frontend does not help much. FWSSNR: frequency-weighted segmental SNR. CD: cepstral distortion. PESQ: perceptual evaluation of speech quality. WER: word error rate (obtained with the Kaldi REVERB baseline).

27 Demonstration
These are spectrograms of speech signals. The first two are signals captured by a headset and by a distant mic, and the remaining four are signals dereverberated by MVDR, WPE, WPE+MVDR, and DNN. All the methods effectively reduced the reverberation. Among them, WPE+MVDR worked the most effectively: it reduced not only the reverberation but also the background noise.

28 Pros and cons of three approaches
Beamforming: low computational complexity; capable of simultaneous denoising and dereverberation; high contribution to ASR; less effective dereverberation. WPE: effective dereverberation; no denoising capability; computationally demanding; iteration required for source PSD estimation. Neural networks: effective dereverberation (source PSD estimation with no iterations); sensitive to mismatched conditions; low contribution to ASR. This slide summarizes the pros and cons of the three approaches. While beamforming can perform dereverberation and denoising simultaneously, WPE is more effective at dereverberation; when we combine them, we achieve the best performance for simultaneous denoising and dereverberation. In addition, both beamforming and WPE contribute substantially to ASR. A disadvantage of WPE is that it is relatively computationally demanding and requires iterative estimation of the source PSD. In contrast, the neural network approach reduces signal-level distortion effectively without iterative estimation, but it does not contribute much to ASR improvement. In the following slides, I will explain how we can compensate for these pros and cons by integrating the approaches for better speech enhancement.

29 DNN-WPE [Kinoshita et al., 2017, Heymann et al., 2019]
Dereverberation feeds ASR, with backpropagation of the ASR-level loss; source PSD estimation is performed by a DNN, followed by prediction matrix estimation. Here, let me show an example of integrating WPE and a DNN. The method is called DNN-WPE. It performs dereverberation based mostly on WPE, but the source PSD estimation is replaced by a DNN-based estimator. One advantage of this modification is that iterative estimation is no longer necessary for WPE, which further allows us to extend it to effective online processing. Another advantage of this approach is that the DNN can be optimized jointly with ASR by backpropagating the ASR-level loss through the dereverberation filtering, which allows us to construct a more effective ASR frontend. Advantages: 1. No iterative estimation -> effective for online processing. 2. The DNN can be optimized jointly with an ASR system.
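The one-shot idea can be sketched by replacing WPE's alternating loop with a single weighted least-squares solve whose weights come from an external PSD estimator. The `psd_estimator` below is a hypothetical stand-in that simply returns the observed power, not the trained DNN of the cited papers; signal and filter are the same kind of toy as in the WPE sketch.

```python
import numpy as np

rng = np.random.default_rng(7)
T, K, D = 600, 8, 2       # frames, taps, prediction delay (toy values)

sigma = 0.5 + np.abs(np.sin(np.arange(T) / 20.0))
s = sigma * (rng.standard_normal(T) + 1j * rng.standard_normal(T))
a = 0.6 ** np.arange(20)
x = np.convolve(a, s)[:T]

def psd_estimator(frames):
    """Stand-in for the trained DNN: here simply the observed power.
    In DNN-WPE this would be a network mapping |x| to the source PSD."""
    return np.maximum(np.abs(frames) ** 2, 1e-6)

ts = np.arange(K + D, T)
X = np.array([x[t - D - K + 1 : t - D + 1][::-1] for t in ts])
y = x[K + D:]

lam = psd_estimator(y)                 # one shot: no alternating updates
Xw = X / lam[:, None]
w = np.linalg.solve(X.conj().T @ Xw, Xw.conj().T @ y)
d_hat = y - X @ w                      # dereverberated output

desired = np.convolve(a[:D], s)[:T][K + D:]
assert np.mean(np.abs(d_hat - desired) ** 2) < np.mean(np.abs(y - desired) ** 2)
```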

30 Effectiveness of DNN-WPE [Heymann et al., 2019]
Training of DNN-WPE: the DNN feeds WPE, whose output feeds the acoustic model (AM) for ASR. PSD loss: MSE of the PSD estimates. ASR loss: cross entropy of the acoustic model (AM) output. WER (%):
                        REVERB (real)        WSJ+VoiceHome
                        Offline   Online     Offline   Online
Unprocessed             17.6                 24.3
WPE                     13.0      16.2       18.6      20.0
DNN-WPE (PSD loss)      10.8      14.6       18.1      19.3
DNN-WPE (ASR loss)      11.8      13.4       17.7      18.4
This slide shows the effectiveness of DNN-WPE in terms of ASR improvement. We compared WPE and two versions of DNN-WPE: one trained to minimize the source PSD estimation error (PSD loss), and the other trained on the cross entropy of the acoustic model output (ASR loss). Offline and online implementations were tested for all methods. Two datasets were used for the evaluation: the REVERB challenge dataset and the WSJ+VoiceHome dataset; the latter simulates home environments and contains much more background noise than the former. As the table shows, DNN-WPE greatly outperformed conventional WPE under all conditions, and DNN-WPE with the ASR loss performed best in most cases. Note: denoising is not performed, and a different ASR backend is used.

31 Frame-online framework for simultaneous denoising and dereverberation
WPD*1: a convolutional beamformer that integrates WPE, a beamformer, and DNN-based mask estimation (*1: Weighted Power minimization Distortionless response convolutional beamformer). At the end of this part, let me introduce one of our papers presented at this Interspeech. The paper presents a frame-online framework for simultaneous denoising and dereverberation. The proposed method, called WPD, integrates all three approaches presented in this tutorial, that is, WPE, beamforming, and neural-network-based mask estimation, to achieve reliable online processing. If you are interested, please join the presentation at Interspeech 2019: 12:40-13:00, Mon, Sep. 16 [Nakatani et al., 2019b].

32 Software
WPE: Matlab p-code for iterative offline and block-online processing; Python code with and without TensorFlow for iterative offline, block-online, and frame-online processing. WPE, DNN-WPE: Python code with PyTorch for offline and frame-online processing; joint optimization of beamforming and dereverberation with end-to-end ASR, enabled with ESPnet. Here I have listed some useful open software with which you can easily test up-to-date dereverberation approaches, including WPE and DNN-WPE.

33 Table of contents
Introduction by Tomohiro; noise reduction by Reinhold; dereverberation by Tomohiro; break (30 min); source separation by Reinhold; meeting analysis by Tomohiro; other topics by Reinhold; summary by Reinhold & Tomohiro; Q&A. This is the end of Part III. We would now like to have a coffee break; please return in 30 minutes. Then we will restart the tutorial with Part IV, source separation.

