Microphone Array Post-filter based on Spatially-Correlated Noise Measurements for Distant Speech Recognition
Kenichi Kumatani, Disney Research, Pittsburgh
Bhiksha Raj, Carnegie Mellon University
Rita Singh, Carnegie Mellon University
John McDonough, Carnegie Mellon University
Organization of Presentation
- Our Goal: Distant Speech Recognition (DSR)
- Backgrounds
- Conventional Post-filtering Methods
- Motivations
- Our Post-filtering Method
- DSR Experiments on Real Array Data
- Conclusions
Our Goal: Distant Speech Recognition (DSR) System
Goal: Replace the close-talking microphone with far-field sensors to make human-machine interfaces more interactive.
Overview of our DSR system: Microphone array (distant speech) -> Speaker Tracking (speaker's position) -> Beamforming -> Post-filtering (enhanced speech) -> Speech Recognition (recognition result).
Merits of this approach: Because it uses the geometry of the microphone array and the speaker's position, our system offers stable performance in real environments and extends straightforwardly to other information sources. Avoid being blind!
Backgrounds of this Work
Backgrounds: Beamforming alone does not provide the optimal solution in the minimum mean square error (MMSE) sense. Post-filtering can further improve speech recognition performance.
Basic block chart: Multi-channel input vector X -> Time Delay Compensation -> Beamforming -> Post-filtering H, with a Post-filter Estimation block supplying H.
Key issue: Estimate the power spectral densities (PSDs) of the target and noise signals to build the Wiener filter.
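To make the key issue concrete, here is a minimal sketch of how a Wiener gain follows from the two PSDs once they are available. The function name `wiener_gain`, the `floor` parameter, and the numeric PSD values are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def wiener_gain(phi_ss, phi_nn, floor=1e-12):
    """Per-frequency Wiener gain H = phi_ss / (phi_ss + phi_nn).

    phi_ss: target PSD estimate, phi_nn: noise PSD estimate.
    `floor` (an assumption) only guards against division by zero.
    """
    phi_ss = np.asarray(phi_ss, dtype=float)
    phi_nn = np.asarray(phi_nn, dtype=float)
    return phi_ss / np.maximum(phi_ss + phi_nn, floor)

# Hypothetical PSD values for three frequency bins.
phi_ss = np.array([4.0, 1.0, 0.0])
phi_nn = np.array([1.0, 1.0, 2.0])
gain = wiener_gain(phi_ss, phi_nn)   # bins get gains 0.8, 0.5 and 0.0
```

The gain stays near 1 where the target dominates and drops toward 0 where the noise dominates, which is why the quality of the two PSD estimates decides the quality of the post-filter.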
Conventional Post-filter Design Method 1
Zelinski post-filter: Zelinski assumed that
- the target and noise signals are uncorrelated,
- the noise signals are uncorrelated between different channels, and
- the noise PSD is the same among all the channels.
Under these assumptions, the cross- and auto-spectral densities between two channels simplify, because the terms involving these uncorrelated signals become zero. Substituting the simplified densities into the Wiener filter formulation gives the Zelinski post-filter.
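Under Zelinski's three assumptions, the cross-spectral density between two channels retains only the target PSD, while each auto-spectral density holds target plus noise. The sketch below uses the commonly cited form of the resulting gain (average of the real cross terms over all sensor pairs, divided by the average auto term). It is written from the standard literature form rather than from the slide's own equation, and the function name, the frame averaging, and the synthetic data are assumptions.

```python
import numpy as np

def zelinski_gain(X):
    """Zelinski-style post-filter gain for one frequency bin.

    X: complex array (n_mics, n_frames), assumed already time-aligned
    to the target so that the target is identical in every channel.
    """
    n_mics, n_frames = X.shape
    phi = (X @ X.conj().T) / n_frames        # spectral density matrix estimate
    i, j = np.triu_indices(n_mics, k=1)      # all sensor pairs with i < j
    target_psd = np.mean(phi[i, j].real)     # cross terms: target only
    total_psd = np.mean(np.diag(phi).real)   # auto terms: target + noise
    return float(np.clip(target_psd / total_psd, 0.0, 1.0))

# Synthetic check: one shared target plus independent noise per channel.
rng = np.random.default_rng(0)
n_mics, n_frames = 4, 20000
s = rng.standard_normal(n_frames) + 1j * rng.standard_normal(n_frames)
n = rng.standard_normal((n_mics, n_frames)) + 1j * rng.standard_normal((n_mics, n_frames))
X = s[None, :] + n
gain = zelinski_gain(X)   # target and noise PSDs are equal, so roughly 0.5
```

If the noise were correlated between channels, the cross terms would no longer isolate the target, and this estimate would be biased.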
Conventional Post-filter Design Method 2
Issue of the Zelinski post-filter: In many situations, the noise signals are spatially correlated.
McCowan post-filter: McCowan and Bourlard introduced the coherence of the diffuse noise field, an indicator of the similarity of signals at different positions, and used it to compute the cross- and auto-spectral densities. The McCowan post-filter is then written in terms of a PSD estimate of the target signal for each sensor pair, which differs from the Zelinski method.
Lefkimmiatis post-filter: Lefkimmiatis et al. model the diffuse noise field more accurately by applying the coherence to the denominator of the McCowan post-filter.
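A minimal sketch of the diffuse-field idea follows. It assumes the textbook sinc model for the coherence of an ideal diffuse field and the usual per-pair target PSD estimate derived from it. The slide's own equations are not preserved here, so the function names, the `gamma_max` guard, and the numeric values are assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def diffuse_coherence(freq_hz, distance_m, c=SPEED_OF_SOUND):
    """Coherence of an ideal diffuse field: sin(x)/x with x = 2*pi*f*d/c."""
    # np.sinc(t) computes sin(pi*t)/(pi*t), so pass t = 2*f*d/c.
    return np.sinc(2.0 * freq_hz * distance_m / c)

def mccowan_target_psd(phi_ii, phi_jj, phi_ij, gamma, gamma_max=0.99):
    """Target PSD estimate for one sensor pair under the diffuse-noise model."""
    gamma = min(float(gamma), gamma_max)   # keep the denominator away from zero
    return (np.real(phi_ij) - 0.5 * gamma * (phi_ii + phi_jj)) / (1.0 - gamma)

# Hypothetical exact statistics: target PSD 3, noise PSD 2, coherence 0.4.
phi_ss, phi_nn, gamma = 3.0, 2.0, 0.4
phi_ii = phi_jj = phi_ss + phi_nn     # auto-spectral density: target + noise
phi_ij = phi_ss + gamma * phi_nn      # cross-spectral density: noise is partly coherent
estimate = mccowan_target_psd(phi_ii, phi_jj, phi_ij, gamma)   # recovers the target PSD
gain = estimate / (0.5 * (phi_ii + phi_jj))                    # 3 / 5
```

With `gamma = 0` the estimate reduces to the real part of the cross-spectral density, which is the Zelinski case.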
Motivation of our Method
Common problem of conventional methods: A static noise field model will not match every situation.
Example of noise coherence in a car: [Figures: magnitude-squared coherence observed in a car, with the engine idling and while driving at a speed of 65 mph.] It is clear that the actual noise field is neither an uncorrelated nor a diffuse field.
Our motivation: Measure the most dominant noise signal instead of relying on those static noise field assumptions.
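For readers who want to reproduce this kind of coherence plot on their own recordings, the sketch below estimates the magnitude-squared coherence between two channels by averaging windowed FFT frames. The frame length, hop size, Hann window, and the synthetic test signals are all assumptions; the slides do not state how the car measurements were processed.

```python
import numpy as np

def msc(x, y, frame_len=256, hop=128):
    """Magnitude-squared coherence per frequency bin, averaged over frames."""
    window = np.hanning(frame_len)
    starts = range(0, len(x) - frame_len + 1, hop)
    X = np.array([np.fft.rfft(window * x[s:s + frame_len]) for s in starts])
    Y = np.array([np.fft.rfft(window * y[s:s + frame_len]) for s in starts])
    phi_xy = np.mean(X * Y.conj(), axis=0)       # cross-spectral density
    phi_xx = np.mean(np.abs(X) ** 2, axis=0)     # auto-spectral densities
    phi_yy = np.mean(np.abs(Y) ** 2, axis=0)
    return np.abs(phi_xy) ** 2 / (phi_xx * phi_yy)

rng = np.random.default_rng(1)
common = rng.standard_normal(64000)
independent = rng.standard_normal(64000)
msc_same = msc(common, common)         # identical channels: coherence 1 in every bin
msc_indep = msc(common, independent)   # independent noise: coherence near 0
```

An uncorrelated noise field would look like `msc_indep` at every frequency, and a diffuse field would follow a squared sinc curve, so a measured curve that matches neither indicates that neither static model applies.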
Our Strategy: How can we measure a noise signal?
1. Estimate the speaker's position,
2. Build a beamformer and steer a beam toward the target source,
3. Find where the most dominant interfering source is, and
4. Build another beamformer to measure a noise signal.
[Diagram: the microphones pick up the speaker and the noise. Beamformer 1, for the target speech, feeds the post-filter, which performs further noise removal and outputs the enhanced speech. Beamformer 2 (noise extractor), steered toward the noise source, supplies the separated noise to the post-filter.]
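Our Post-filter System
We build a maximum negentropy beamformer for the target source and a null-steering beamformer for extracting the noise signal.
[Block diagram: the input $X$ feeds two branches. For the target source, the maximum negentropy beamformer combines $w_{SD}^H$ with the blocking matrix $B^H$ followed by $w_a^H$. For the noise source, the null-steering beamformer applies $w_{null}^H$. Both outputs go to the post-filter estimation, which yields the post-filter $H_p$.]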
Our Post-filter System: Maximum Negentropy (MN) Beamformer (Speech Emphasizer)
Maximum negentropy beamformer (for the target source):
- Build a super-directive beamformer for the quiescent vector $w_{SD}$.
- Compute the blocking matrix $B$ to maintain the distortionless constraint for the look direction, $B^H w_{SD} = 0$.
- Find the active weight vector that maximizes the negentropy of the output: $w_a = \arg\max_{w_a} J(Y_{SDMN})$, where $Y_{SDMN} = (w_{SD} - B w_a)^H X$ and $J(\cdot)$ denotes negentropy.
Maximum negentropy criterion: The distribution of clean speech is non-Gaussian, whereas that of noisy and reverberant speech becomes closer to Gaussian. Negentropy indicates how far the distribution of a signal is from Gaussian.
Advantage: We can enhance a structured-information signal coming from the direction of interest without signal cancellation or distortion.
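The structure of this beamformer can be sketched as follows. The code only builds a blocking matrix satisfying $B^H w_{SD} = 0$ and evaluates the output $(w_{SD} - B w_a)^H X$. It does not design a super-directive beamformer or run the negentropy maximization, and the random quiescent vector, the SVD construction, and the function names are assumptions made for illustration.

```python
import numpy as np

def blocking_matrix(w_q):
    """Orthonormal B of shape (n_mics, n_mics - 1) with B^H w_q = 0."""
    w_q = np.asarray(w_q, dtype=complex).reshape(-1, 1)
    # The left singular vectors after the first one span the
    # orthogonal complement of w_q.
    u, _, _ = np.linalg.svd(w_q, full_matrices=True)
    return u[:, 1:]

def gsc_output(w_q, B, w_a, X):
    """Beamformer output Y = (w_q - B w_a)^H X for snapshots X (n_mics, n_frames)."""
    w = w_q - B @ w_a
    return w.conj() @ X

rng = np.random.default_rng(3)
w_q = (rng.standard_normal(4) + 1j * rng.standard_normal(4)) / 4   # stand-in for w_SD
B = blocking_matrix(w_q)
X = rng.standard_normal((4, 10)) + 1j * rng.standard_normal((4, 10))
y = gsc_output(w_q, B, np.zeros(3, dtype=complex), X)   # w_a = 0 gives the quiescent output
```

In the full method, `w_a` would be found by an iterative optimizer that maximizes the negentropy of `y`; only the active weights change, while `w_q` and `B` stay fixed.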
Our Post-filter System: Null-Steering Beamformer (Noise Extractor)
Null-steering beamformer (for the noise source): Place a null in the direction of interest (DOI) while maintaining unity gain in the direction of the noise source. Given the array manifold vectors $v$ for the target source and $v_N$ for the noise source, we obtain the beamformer's weights by solving the linear equation $[\,v \;\; v_N\,]^H w_{null} = [\,0 \;\; 1\,]^T$.
Advantage: By eliminating the target signal arriving directly from the source position, we can extract the noise signal alone.
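The linear equation for the null-steering weights can be solved directly. With more than two microphones the system is underdetermined, so the sketch below picks the minimum-norm solution, which is an assumption, as are the uniform linear array geometry, the far-field manifold model, and all numeric values.

```python
import numpy as np

def manifold_vector(angle_deg, n_mics, spacing_m, freq_hz, c=343.0):
    """Far-field array manifold vector of a uniform linear array (assumed geometry)."""
    delays = np.arange(n_mics) * spacing_m * np.cos(np.deg2rad(angle_deg)) / c
    return np.exp(-2j * np.pi * freq_hz * delays)

def null_steering_weights(v_target, v_noise):
    """Minimum-norm w solving [v_target v_noise]^H w = [0, 1]^T."""
    C = np.column_stack([v_target, v_noise])
    g = np.array([0.0, 1.0], dtype=complex)
    # w = C (C^H C)^{-1} g satisfies C^H w = g with the smallest norm.
    return C @ np.linalg.solve(C.conj().T @ C, g)

v = manifold_vector(90.0, n_mics=4, spacing_m=0.05, freq_hz=1000.0)    # target direction
v_n = manifold_vector(30.0, n_mics=4, spacing_m=0.05, freq_hz=1000.0)  # noise direction
w_null = null_steering_weights(v, v_n)
target_response = np.vdot(v, w_null)    # v^H w_null, close to 0
noise_response = np.vdot(v_n, w_null)   # v_N^H w_null, close to 1
```

The two constraints are met exactly (up to rounding), so the direct path of the target is cancelled while the noise direction passes with unity gain.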
Our Post-filter System
Our post-filter design: Now that we have an estimate of the target signal, $Y_{SDMN} = (w_{SD} - B w_a)^H X$, and a noise observation, $Y_{null} = w_{null}^H X$, we can design the post-filter $H_p$ from these two signals.
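One plausible way to turn the two beamformer outputs into a gain is a Wiener-style ratio of recursively smoothed powers. This is an illustrative assumption rather than the authors' formula, which is not preserved in this transcript. The smoothing constant `alpha`, the function name, and the treatment of $|Y_{SDMN}|^2$ as the target PSD estimate are all hypothetical.

```python
import numpy as np

def postfilter_gains(y_target, y_noise, alpha=0.9, floor=1e-12):
    """Frame-by-frame Wiener-style gain for one frequency bin.

    y_target: output of the target beamformer (stand-in for Y_SDMN).
    y_noise:  output of the noise extractor (stand-in for Y_null).
    alpha:    recursive smoothing constant (an assumed value).
    """
    phi_ss = 0.0
    phi_nn = 0.0
    gains = np.empty(len(y_target))
    for t, (ys, yn) in enumerate(zip(y_target, y_noise)):
        # Update both PSD estimates adaptively from the current frame.
        phi_ss = alpha * phi_ss + (1.0 - alpha) * abs(ys) ** 2
        phi_nn = alpha * phi_nn + (1.0 - alpha) * abs(yn) ** 2
        gains[t] = phi_ss / max(phi_ss + phi_nn, floor)
    return gains

# Hypothetical constant-level outputs: target power 4, noise power 1.
gains = postfilter_gains(np.full(50, 2.0 + 0.0j), np.full(50, 1.0j))
```

Because the noise PSD is updated from a measured noise signal in every frame, no fixed coherence model enters the gain.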
Speech Recognition Results
[Figure: word error rates in different conditions; axis: word error rate.]
Conclusions
- We used actual noise measurements for the microphone array post-filter.
- The noise fields in car conditions turned out to be neither uncorrelated nor spherically isotropic (diffuse).
- Our post-filter method provided the best recognition performance among the popular post-filter methods, because it can update the noise PSD adaptively without any static noise coherence assumption.
Speech Samples (65-Wind)
[Audio samples: single distant channel; post-filtered speech; extracted noise signal.]
Actual Speech Distribution: Super-Gaussian
[Figure: distributions of clean speech together with super-Gaussian distributions. The histograms are computed from the real part of actual subband samples.]
The distribution of speech is not Gaussian; it is super-Gaussian, with "spiky" and "heavy-tailed" characteristics.
How about maximizing a degree of super-Gaussianity?
Why do we need non-Gaussianity measures?
The reasoning rests on two points:
1. The distribution of a sum of independent random variables (r.v.s) approaches Gaussian in the limit as more components are added.
2. Information-bearing signals have a structure that makes them predictable.
If we want the original independent components that bear information, we have to look for a signal that is not Gaussian.
[Figures: distributions of clean and noise-corrupted speech; distributions of clean and reverberated speech.] The distributions of noise-corrupted and reverberated speech are closer to Gaussian than that of clean speech.
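The first point can be checked numerically. The sketch below sums independent Laplacian variables, used here as a simple stand-in for a super-Gaussian source, and tracks the normalized excess kurtosis, which is zero for a Gaussian. The choice of the Laplacian, the sample size, and the component counts are assumptions.

```python
import numpy as np

def normalized_kurtosis(y):
    """Excess kurtosis E[y^4] / E[y^2]^2 - 3 of a real signal (0 for a Gaussian)."""
    y = y - np.mean(y)
    return float(np.mean(y ** 4) / np.mean(y ** 2) ** 2 - 3.0)

rng = np.random.default_rng(2)
n_samples = 200000
kurt_by_count = {}
for n_components in (1, 4, 16):
    # Sum of n independent super-Gaussian (Laplacian) components.
    mix = rng.laplace(size=(n_components, n_samples)).sum(axis=0)
    kurt_by_count[n_components] = normalized_kurtosis(mix)
```

The excess kurtosis shrinks as components are added, which mirrors how additive noise and overlapping reflections push observed speech toward a Gaussian distribution.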
Negentropy Criterion for Super-Gaussianity
Definition of entropy: The entropy of an r.v. $Y$ is defined as $H(Y) = -\int p_Y(v) \log p_Y(v)\, dv$. Entropy indicates the degree of uncertainty of information.
Definition of negentropy: Negentropy is defined as the difference between the entropy of a Gaussian r.v. and the entropy of the super-Gaussian r.v. Negentropy indicates how far the distribution of the r.v. is from Gaussian: the higher it is, the less Gaussian the distribution.
Negentropy is generally more robust than other non-Gaussianity criteria such as kurtosis.
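As a small worked example, the negentropy of a real-valued Laplacian variable can be computed in closed form as the entropy of a Gaussian with the same variance minus the Laplacian's own entropy. The Laplacian is only a convenient stand-in: the slides work with complex subband samples and do not name their super-Gaussian pdf in this transcript, so the distribution and function names are assumptions.

```python
import math

def gaussian_entropy(variance):
    """Differential entropy (nats) of a real Gaussian with the given variance."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)

def laplace_entropy(variance):
    """Differential entropy (nats) of a Laplacian with the given variance."""
    scale = math.sqrt(variance / 2.0)   # Laplacian variance = 2 * scale**2
    return 1.0 + math.log(2.0 * scale)

def laplace_negentropy(variance):
    """Entropy of the matching Gaussian minus entropy of the Laplacian."""
    return gaussian_entropy(variance) - laplace_entropy(variance)

j_unit = laplace_negentropy(1.0)   # positive: the Laplacian is not Gaussian
j_loud = laplace_negentropy(5.0)   # same value: negentropy ignores the scale
```

The Gaussian has the largest entropy among all distributions of a given variance, so the difference is never negative and equals zero only for a Gaussian.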
Analysis of the MN Beamforming Algorithm
Simulated environment by the image method: [Diagram labels: target source, image, reflection; 30°, 70.9°, 4 m.] Signal cancellation will occur because of the strong reflection.
[Figure panels: 650 Hz and 1600 Hz.] Observe that MN beamforming can enhance the target signal by strengthening the reflection, which suggests that it does not suffer from signal cancellation.
Measures for Non-Gaussianity
Definition of kurtosis: Kurtosis can measure the degree of non-Gaussianity of an r.v.
- Super-Gaussian pdfs have positive kurtosis,
- sub-Gaussian pdfs have negative kurtosis, and
- the Gaussian pdf has zero kurtosis.
Empirical approximation of kurtosis: The expectations in the definition are replaced by averages over $K$ frames, where $K$ is the number of frames; the constant in the formula takes a positive value.
[Figure panels: negentropy; empirical kurtosis.]
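The sign behaviour of kurtosis can be illustrated with one common empirical form, the frame average of $|Y|^4$ minus a positive constant times the squared frame average of $|Y|^2$. The slide's own formula and constant are not preserved in this transcript. The value `beta = 3.0` used below is the one that gives zero for a real zero-mean Gaussian and is an assumption, as are the test distributions.

```python
import numpy as np

def empirical_kurtosis(y, beta=3.0):
    """Average of |y|^4 minus beta times the squared average of |y|^2, over K frames."""
    power = np.abs(np.asarray(y)) ** 2
    return float(np.mean(power ** 2) - beta * np.mean(power) ** 2)

rng = np.random.default_rng(4)
K = 200000
kurt_laplace = empirical_kurtosis(rng.laplace(size=K))             # super-Gaussian: positive
kurt_gauss = empirical_kurtosis(rng.standard_normal(K))            # Gaussian: near zero
kurt_uniform = empirical_kurtosis(rng.uniform(-1.0, 1.0, size=K))  # sub-Gaussian: negative
```

Because the fourth power weights large samples heavily, a few outliers can dominate this estimate, which is one reason a criterion such as negentropy may be preferred.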