1 A Tutorial on Bayesian Speech Feature Enhancement
Friedrich Faubel, SCALE Workshop, January 2010

2 I Motivation

3 Speech Recognition System Overview
A speech recognition system converts speech to text. It basically consists of two components:
Front End: extracts speech features from the audio signal
Decoder: finds the sentence (sequence of acoustic states) that is the most likely explanation for the observed sequence of speech features
Speech → Front End → Decoder → Text

4 Speech Feature Extraction: Windowing
The audio signal is cut into short, overlapping frames by sliding a window function along the waveform.

8 Speech Feature Extraction: Time-Frequency Analysis
Performing spectral analysis separately for each frame yields a time-frequency representation.

10 Speech Feature Extraction: Perceptual Representation
Emulation of the logarithmic frequency and intensity perception of the human auditory system: a Mel filter bank followed by logarithmic compression yields log Mel features.
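To make the front end concrete, here is a minimal log Mel feature extraction sketch in Python/NumPy. The sampling rate, frame length, shift, FFT size, and filter count are typical values assumed for illustration; they are not taken from the slides.

import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_features(signal, fs=16000, frame_len=400, shift=160, n_fft=512, n_mel=24):
    # 1. Windowing: cut the signal into short, overlapping frames
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // shift
    frames = np.stack([signal[t * shift : t * shift + frame_len] * window
                       for t in range(n_frames)])
    # 2. Time-frequency analysis: power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # 3. Perceptual representation: triangular Mel filters ...
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_mel + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_mel, n_fft // 2 + 1))
    for k in range(n_mel):
        lo, mid, hi = bins[k], bins[k + 1], bins[k + 2]
        fbank[k, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[k, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    # ... followed by logarithmic compression of the filter bank energies
    return np.log(power @ fbank.T + 1e-10)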

11 Background Noise
Background noise distorts the speech features.
Result: the features no longer match the features used during training.
Consequence: severely degraded recognition performance.

12 Overview of the Tutorial
I - Motivation
II - The effect of noise on speech features
III - Transforming probabilities
IV - The MMSE solution to speech feature enhancement
V - Model-based speech feature enhancement
VI - Experimental results
VII - Extensions

13 II The Effect of Noise: The Interaction Function

14 Interaction Function
Principle of superposition: signals are additive. Clean speech + noise = noisy speech.

15 Interaction Function
In the signal domain we have the following relationship between noisy speech y[t], clean speech s[t] and noise n[t]:
y[t] = s[t] + n[t]
After Fourier transformation, this becomes:
Y(f) = S(f) + N(f)
Taking the magnitude square on both sides, we get:
|Y(f)|² = |S(f)|² + |N(f)|² + 2·Re(S(f)·N*(f))

33 Interaction Function
Hence, in the power spectral domain we have:
|Y(f)|² = |S(f)|² + |N(f)|² + 2·|S(f)|·|N(f)|·cos θ(f)
where 2·|S(f)|·|N(f)|·cos θ(f) is the phase term and θ(f) is the relative phase between the speech and the noise signal.

37 Interaction Function
The relative phase between two waves describes their relative offset in time (delay).

38 Interaction Function
When two sound sources are present, the following can happen: constructive interference (amplification), destructive interference (attenuation), or complete cancellation.

39 Interaction Function
Since the relative phase is random, the phase term is zero in the average. Hence, in the power spectral domain we have, on average:
|Y(f)|² ≈ |S(f)|² + |N(f)|²

41 Interaction Function
In the log power spectral domain, with y, s and n denoting the log power spectra, that becomes:
y = log(exp(s) + exp(n)) = s + log(1 + exp(n − s))
(Acero, 1990)
But is that really right?

47 Interaction Function
Not exactly: the mean of a nonlinearly transformed random variable is not necessarily equal to the nonlinear transform of the random variable's mean, i.e. E[g(X)] ≠ g(E[X]) in general.
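A quick numeric check of this point, sketched in Python/NumPy; the Gaussian parameters for the log power domain are illustrative assumptions, not values from the slides.

import numpy as np

rng = np.random.default_rng(0)

def interact(s, n):
    # log power spectral interaction function, phase term neglected:
    # y = log(exp(s) + exp(n)) = s + log(1 + exp(n - s))
    return np.logaddexp(s, n)

# illustrative Gaussian log power distributions for speech and noise
s = rng.normal(2.0, 1.0, size=1_000_000)
n = rng.normal(0.0, 1.0, size=1_000_000)

print(np.mean(interact(s, n)))  # E[g(s, n)]: mean of the transformed variable
print(interact(2.0, 0.0))       # g(E[s], E[n]): transform of the means, noticeably different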

49 Interaction Function
Phase-averaged relationship between clean and noisy speech:

50 III Transforming Probabilities

51 Transforming Probabilities: Motivation
In the signal domain we have the relationship y[t] = s[t] + n[t]. In the log Mel domain that translates to the nonlinear interaction function:
y = s + log(1 + exp(n − s))

52 Transforming Probabilities: Motivation
(Figure: noisy speech power as a function of clean speech power and noise power.)

56 Transforming Probabilities: Motivation
The nonlinear transformation results in a non-Gaussian probability distribution for the noisy speech features.

57 Transforming Probabilities: Introduction
Transformation of a random variable x with probability density function p_x(x): the transformation y = g(x) maps each x to a y. Conversely, each y can be identified with x = g⁻¹(y).
Idea: use g⁻¹ to map the distribution of y to the distribution of x (change of variables). This yields the fundamental transformation law of probability:
p_y(y) = p_x(g⁻¹(y)) · |det J_{g⁻¹}(y)|
where the second factor is the absolute value of the Jacobian determinant of g⁻¹.
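As an illustration of the transformation law (not specific to speech), here is a small Python/NumPy sketch for y = exp(x) with x standard normal; in one dimension the Jacobian determinant is just the derivative d g⁻¹/dy = 1/y. The example function and parameters are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)

# x ~ N(0, 1), y = g(x) = exp(x); g^{-1}(y) = log(y)
x = rng.normal(size=500_000)
y = np.exp(x)

def p_x(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def p_y(y):
    # fundamental transformation law: p_y(y) = p_x(g^{-1}(y)) * |d g^{-1}/dy|
    return p_x(np.log(y)) / y

# compare against a histogram of the transformed samples
hist, edges = np.histogram(y, bins=100, range=(0.01, 5.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - p_y(centers))))  # small for large sample sizes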

64 Transforming Probabilities: Monte Carlo
Idea: approximate the probability distribution by samples drawn from the distribution (a discrete probability mass approximating the pdf; samples can be drawn by inverting the cumulative distribution function). Then transform each sample; a histogram of the transformed samples approximates the transformed pdf.
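A minimal Monte Carlo sketch in Python/NumPy, again with assumed Gaussian parameters and the phase-neglected interaction function:

import numpy as np

rng = np.random.default_rng(1)

# draw samples from the clean speech and noise models (single Gaussians here)
s = rng.normal(2.0, 1.0, size=100_000)
n = rng.normal(0.0, 0.5, size=100_000)

# transform each sample through the interaction function
y = np.logaddexp(s, n)

# a histogram of the transformed samples approximates the transformed pdf
pdf, edges = np.histogram(y, bins=80, density=True)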

68 Transforming Probabilities: Local Linearization
Idea: locally linearize the interaction function around the means of speech and noise, using a first-order Taylor series expansion (the vector Taylor series approach; Moreno, 1996).
Note: a linear transformation of a Gaussian random variable results in a Gaussian random variable.
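A scalar sketch of this linearization in Python/NumPy: for the interaction function y = log(e^s + e^n), the partial derivatives at the expansion point are sigmoids of the mean difference. This is a per-band, diagonal-covariance sketch for illustration, not the full vector Taylor series implementation.

import numpy as np

def g(s, n):
    # interaction function in the log power domain
    return np.logaddexp(s, n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def linearized_noisy_gaussian(mu_s, var_s, mu_n, var_n):
    # first-order Taylor expansion of y = g(s, n) around the means;
    # a linear map of Gaussians is Gaussian, so y stays Gaussian
    a = sigmoid(mu_s - mu_n)      # dg/ds at the expansion point
    b = sigmoid(mu_n - mu_s)      # dg/dn at the expansion point
    mu_y = g(mu_s, mu_n)
    var_y = a**2 * var_s + b**2 * var_n
    return mu_y, var_y

print(linearized_noisy_gaussian(2.0, 1.0, 0.0, 0.25))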

73 Transforming Probabilities: The Unscented Transform
Idea: as in Monte Carlo, select points, but in a deterministic fashion and in such a way that they capture the mean and covariance of the distribution. Transform the points through the nonlinearity, then re-estimate the parameters of the Gaussian distribution from the transformed points.
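A generic unscented transform sketch in Python/NumPy, using the standard Julier-Uhlmann sigma points; the weighting parameter kappa is an assumption, not specified on the slides.

import numpy as np

def unscented_transform(mu, cov, f, kappa=1.0):
    # select 2d + 1 sigma points that capture the mean and covariance
    d = len(mu)
    L = np.linalg.cholesky((d + kappa) * cov)
    points = np.vstack([mu, mu + L.T, mu - L.T])
    w = np.full(2 * d + 1, 1.0 / (2.0 * (d + kappa)))
    w[0] = kappa / (d + kappa)
    # transform the points through the nonlinearity f
    fy = np.array([np.atleast_1d(f(p)) for p in points])
    # re-estimate the parameters of the Gaussian distribution
    mu_y = w @ fy
    diff = fy - mu_y
    cov_y = (w[:, None] * diff).T @ diff
    return mu_y, cov_y

# example: propagate a 2-D Gaussian of (s, n) through the interaction function
mu = np.array([2.0, 0.0])
cov = np.array([[1.0, 0.0], [0.0, 0.25]])
print(unscented_transform(mu, cov, lambda x: np.logaddexp(x[0], x[1])))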

77 Transforming Probabilities: The Unscented Transform
(Figure: comparison of local linearization and the unscented transform.)

78 Transforming Probabilities: The Unscented Transform
The points selected by the unscented transform lie on lines through the center point. After the nonlinear transformation, the points might no longer lie on a line. Hence, we can measure the degree of nonlinearity as the average distance of each triple of points from a linear fit of those three points. This can be shown to be closely related to the R² measure used in linear regression.

89 Transforming Probabilities: The Unscented Transform
With a high degree of nonlinearity, the Gaussian fit does not represent the true transformed distribution well.

90 Transforming Probabilities: An Adaptive Level of Detail Approach
Idea: splitting a Gaussian into two Gaussian components decreases the covariance of each component and thereby the degree of nonlinearity each component is exposed to.

92 Transforming Probabilities: An Adaptive Level of Detail Approach
Algorithm, Adaptive Level of Detail Transform (ALoDT):
1. Start with one Gaussian g.
2. Transform that Gaussian with the UT.
3. Identify the Gaussian component with the highest degree of nonlinearity.
4. Split that component into two Gaussians g1, g2.
5. Transform g1 and g2 with the UT.
6. While #(Gaussians) < N: repeat from step 3.
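A sketch of the splitting step in Python/NumPy, using a common moment-preserving split along the principal axis; the exact splitting scheme used in the original work may differ. With alpha < 1, the shrunken covariance stays positive semidefinite, and the mixture of the two halves reproduces the mean and covariance of the original Gaussian.

import numpy as np

def split_gaussian(w, mu, cov, alpha=0.5):
    # split one Gaussian into two equally weighted components along its
    # principal axis, preserving the overall mean and covariance
    eigval, eigvec = np.linalg.eigh(cov)
    v = eigvec[:, -1]                          # principal axis
    offset = alpha * np.sqrt(eigval[-1]) * v
    cov_new = cov - np.outer(offset, offset)   # shrink along the split axis
    return [(w / 2, mu + offset, cov_new), (w / 2, mu - offset, cov_new)]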

93 Transforming Probabilities: An Adaptive Level of Detail Approach
(Figures: density approximation with the Adaptive Level of Detail Transform, for the plain unscented transform and for ALoDT-2, ALoDT-4, ALoDT-8, ALoDT-16 and ALoDT-32.)

99 Transforming Probabilities: An Adaptive Level of Detail Approach
Kullback-Leibler divergence (KLD) between the approximated and the true distribution (Monte Carlo with 10M samples), for the Adaptive Level of Detail Transform with N components:
N:    1      2      4      8      16     32
KLD:  0.190  0.078  0.025  0.017  0.007  0.004
From N = 1 to N = 32, the KLD decreases by a factor of about 48.

100 IV Speech Feature Enhancement: The MMSE Solution

101 Speech Feature Enhancement: The MMSE Solution
Idea: train the speech recognition system on clean speech, then try to map distorted features to clean speech features.
Systematic approach: derive an estimator for clean speech given noisy speech.

102 Speech Feature Enhancement: The MMSE Solution
Let ŝ(y) be an estimator for the clean speech s, given the noisy speech y. Then the expected mean square error introduced by using ŝ(y) instead of the true s is:
MSE(ŝ) = E[ ||s − ŝ(y)||² ]
Minimizing the MSE with respect to ŝ yields the optimal estimator with respect to the MMSE criterion, the conditional expectation:
ŝ_MMSE(y) = E[s | y] = ∫ s · p(s | y) ds
But how to obtain this distribution p(s | y)?

110 Speech Feature Enhancement: The MMSE Solution
Idea: assume that the joint distribution of S and Y is Gaussian (as in stereo-based stochastic mapping; Afify, 2007). Then the conditional distribution of S|Y is again Gaussian,
p(s | y) = N(s; μ_{S|Y}, Σ_{S|Y})
with conditional mean and covariance matrix
μ_{S|Y} = μ_S + Σ_{SY} Σ_{YY}⁻¹ (y − μ_Y)
Σ_{S|Y} = Σ_{SS} − Σ_{SY} Σ_{YY}⁻¹ Σ_{YS}
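These conditioning formulas translate directly into code; a minimal Python/NumPy sketch follows, with block means and covariances passed as arguments (the names are illustrative):

import numpy as np

def conditional_gaussian(mu_s, mu_y, cov_ss, cov_sy, cov_yy, y):
    # conditional distribution of S given Y = y for a jointly Gaussian (S, Y)
    gain = cov_sy @ np.linalg.inv(cov_yy)      # Sigma_SY Sigma_YY^{-1}
    mu_cond = mu_s + gain @ (y - mu_y)         # conditional mean
    cov_cond = cov_ss - gain @ cov_sy.T        # conditional covariance
    return mu_cond, cov_cond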

117 Speech Feature Enhancement: The MMSE Solution
Under the Gaussian assumption, this integral is easily obtained:
ŝ_MMSE(y) = E[s | y] = μ_{S|Y} = μ_S + Σ_{SY} Σ_{YY}⁻¹ (y − μ_Y)
This is exactly what you get with the vector Taylor series approach (Moreno, 1996).
Problem: speech is known to be multimodal.

121 Speech Feature Enhancement: The MMSE Solution
To account for multimodality, model clean speech with a Gaussian mixture and introduce the index k of the mixture component as a hidden variable. Then the MMSE estimator can be rewritten as:
ŝ_MMSE(y) = ∫ s · p(s | y) ds
          = ∫ s · Σ_k p(s, k | y) ds
          = Σ_k ∫ s · p(k | y) p(s | k, y) ds     (pull the sum out of the integral)
          = Σ_k p(k | y) ∫ s · p(s | k, y) ds     (p(k | y) is independent of s)
Here p(k | y) is the probability that the clean speech originated from the k-th Gaussian, given the noisy speech spectrum y; it follows from the joint distribution via Bayes' theorem:
p(k | y) = p(y | k) p(k) / Σ_j p(y | j) p(j)
The inner integral ∫ s · p(s | k, y) ds = E[s | k, y] is the clean speech estimate of the k-th Gaussian.
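Putting the pieces together, here is a sketch of the resulting GMM-based MMSE estimate in Python: per-component noisy speech likelihoods weight the per-component clean speech estimates. All parameter names are illustrative, and the per-component estimates E[s | k, y] are assumed to have been computed already (e.g. with the conditional Gaussian formulas above).

import numpy as np
from scipy.stats import multivariate_normal

def mmse_estimate(y, weights, mu_y_k, cov_y_k, s_hat_k):
    # GMM-based MMSE estimate: posterior-weighted sum of per-component
    # clean speech estimates s_hat_k[k] = E[s | k, y]
    lik = np.array([multivariate_normal.pdf(y, mu_y_k[k], cov_y_k[k])
                    for k in range(len(weights))])
    post = weights * lik
    post /= post.sum()                 # p(k | y) via Bayes' theorem
    return sum(post[k] * s_hat_k[k] for k in range(len(weights)))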

136 V Model-Based Speech Feature Enhancement

137 Model-Based Speech Feature Enhancement
The distribution of clean speech is modeled as a Gaussian mixture; the noise is modeled as a single Gaussian. The presence of noise changes the clean speech distribution according to the interaction function. The joint distribution of clean and noisy speech is then constructed based on this model.

148 Model-Based Speech Feature Enhancement
Noise estimation: find the noise distribution (mean and covariance of the noise) that is the most likely explanation for the observed noisy speech features.
Problem: the observations also depend on speech!

151 Model-Based Speech Feature Enhancement
Since the mixture component from which the clean speech originated is a hidden variable, the Expectation Maximization algorithm is used (Rose, 1994; Moreno, 1996).

156 Model-Based Speech Feature Enhancement
Expectation step: construct the joint distribution of clean and noisy speech using the current noise parameter estimate; then calculate the posterior p(k | y_t) for each frame t and mixture component k.
Maximization step: re-estimate the noise parameters by accumulating statistics of the instantaneous noise estimates E[n | k, y_t] for each possible k, weighted by the probability p(k | y_t) that the clean speech originated from that Gaussian.
But how to obtain the required distribution p(n | k, y)? We have the joint distribution of noise and noisy speech; what we need is the conditional. But that is just the conditional Gaussian distribution, with conditional mean and covariance computed exactly as for p(s | y) above.
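A sketch of the M-step accumulation in Python/NumPy; the array shapes are assumptions for illustration (T frames, K mixture components, D feature dimensions):

import numpy as np

def m_step_noise_mean(posteriors, inst_noise_means):
    # posteriors: (T, K) array of p(k | y_t)
    # inst_noise_means: (T, K, D) array of instantaneous estimates E[n | k, y_t]
    weighted = posteriors[:, :, None] * inst_noise_means
    # accumulate over frames and components; the posteriors sum to T
    return weighted.sum(axis=(0, 1)) / posteriors.sum()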

170 VI Experimental Results

171 Experimental Results: Speech Recognition Experiments
- clean speech from the MC-WSJ-AV corpus
- noise from the NOISEX-92 database (artificially added)
- MFCC with 13 components, stacking of 15 frames, LDA
- cepstral mean and variance normalization
- 1743 acoustical states; Gaussians

172 Experimental Results: WER, destroyer engine noise

173 Experimental Results: WER, factory noise

174 VII Extensions

175 Extensions
Sequential noise estimation:
- Sequential expectation maximization (SEM), Kim, 1998
- Interacting Multiple Model (IMM) Kalman filter, Kim, 1999
- Particle filter, Yao, 2001
Improve speech recognition through:
- Combination with Joint Uncertainty Decoding, Shinohara, 2008
- Combination with bounded conditional mean imputation?


Download ppt "A Tutorial on Bayesian Speech Feature Enhancement"

Similar presentations


Ads by Google