ROBUST SIGNAL REPRESENTATIONS FOR AUTOMATIC SPEECH RECOGNITION
Richard Stern, Department of Electrical and Computer Engineering and School of Computer Science, Carnegie Mellon University



Slide 1: ROBUST SIGNAL REPRESENTATIONS FOR AUTOMATIC SPEECH RECOGNITION
Richard Stern
Department of Electrical and Computer Engineering and School of Computer Science
Carnegie Mellon University, Pittsburgh, Pennsylvania
Institute for Mathematics and its Applications, University of Minnesota, September 19, 2000

Slide 2: Introduction
As speech recognition is transferred from the laboratory to the marketplace, robust recognition is becoming increasingly important.
"Robustness" in 1985:
– Recognition in a quiet room using desktop microphones
Robustness in 2000:
– Recognition
» over a cell phone
» in a car
» with the windows down
» and the radio playing
» at highway speeds

Slide 3: What I'll talk about today...
– Why we use cepstral-like representations
– Some "classical" approaches to robustness
– Some "modern" approaches to robustness
– Some alternate representations
– Some remaining open issues

Slide 4: The source-filter model of speech
A useful model for representing the generation of speech sounds:
[Figure: block diagram in which a pitch-controlled pulse-train source and a noise source, scaled by an amplitude control, drive a vocal tract model to produce p[n]]

Slide 5: Implementation of MFCC processing
– Compute magnitude-squared of Fourier transform
– Apply triangular frequency weights that represent the effects of peripheral auditory frequency resolution
– Take log of outputs
– Compute cepstra using discrete cosine transform
– Smooth by dropping higher-order coefficients
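These five steps map directly onto a few lines of numpy. Below is a minimal per-frame sketch, not CMU's actual front end: the filter count, cepstral order, and mel formula are common defaults (my assumptions), and the frame is assumed to be pre-emphasized and windowed already.

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy mel formula, a common (assumed) choice
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs, n_filters=24, n_ceps=13):
    """MFCC for one pre-windowed frame, following the slide's steps."""
    # 1. Magnitude-squared of the Fourier transform
    power = np.abs(np.fft.rfft(frame)) ** 2
    # 2. Triangular mel-spaced weights (peripheral frequency resolution)
    n_bins = len(power)
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    bin_freqs = np.linspace(0.0, fs / 2.0, n_bins)
    fbank = np.zeros(n_filters)
    for i in range(n_filters):
        lo, ctr, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        up = (bin_freqs - lo) / (ctr - lo)
        down = (hi - bin_freqs) / (hi - ctr)
        fbank[i] = np.maximum(0.0, np.minimum(up, down)) @ power
    # 3. Log of the filter outputs (small floor as a numerical guard)
    log_mel = np.log(fbank + 1e-10)
    # 4-5. DCT, keeping only the low-order ("smooth") coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return dct @ log_mel
```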

Slide 6: Implementation of PLP processing
– Compute magnitude-squared of Fourier transform
– Apply triangular frequency weights that represent the effects of peripheral auditory frequency resolution
– Apply compressive nonlinearities
– Compute discrete cosine transform
– Smooth using autoregressive modeling
– Compute cepstra using linear recursion
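The final step, cepstra via linear recursion, is the standard conversion from autoregressive coefficients to cepstral coefficients. A sketch, assuming the convention H(z) = G / (1 − Σ a_k z^(−k)); sign conventions differ across toolkits, so check yours.

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Cepstra c_1..c_n from AR coefficients a_1..a_p by linear recursion.

    c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k}  (a_n = 0 for n > p);
    c_0 = ln(G^2) is omitted here.
    """
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c
```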

Slide 7: Rationale for cepstral-like parameters
– The cepstrum is the inverse transform of the log of the magnitude of the spectrum
– Useful for separating convolved signals (like the source and filter in the speech production model): "homomorphic filtering"
– Alternatively, cepstral processing can be thought of as the Fourier series expansion of the magnitude of the Fourier transform
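The definition reduces to two numpy calls; the small floor before the log is a numerical guard I have added, not part of the definition.

```python
import numpy as np

def real_cepstrum(x):
    """Inverse transform of the log magnitude of the spectrum.

    Low-quefrency coefficients capture the slowly varying envelope
    (the vocal tract filter); high-quefrency structure reflects the
    excitation (the source), which is why the two can be separated.
    """
    return np.fft.irfft(np.log(np.abs(np.fft.rfft(x)) + 1e-10))
```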

Slide 8: An example
[Figure]

Slide 9: The vowel /uh/ in "one" after windowing
[Figure]

Slide 10: The raw spectrum
[Figure]

Slide 11: Signal representations in MFCC processing
[Figure: panels labeled ORIGINAL SPEECH, MEL LOG MAGS, and AFTER CEPSTRA]

Slide 12: Additional parameters typically used
– Delta cepstra and delta-delta cepstra
– Power and delta power
Comment: these features restore (some) temporal dependencies; more heroic approaches exist as well (e.g. Alwan, Hermansky)
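A sketch of one common way to compute the deltas: an HTK-style regression over a ±2-frame window (the window size is a typical choice, not taken from the talk). Delta-deltas are the same operator applied twice, and delta power is the same operator applied to the power feature.

```python
import numpy as np

def delta(feats, w=2):
    """Regression deltas d_t = sum_t' t'(c_{t+t'} - c_{t-t'}) / (2 sum t'^2).

    feats: (T, D) array of cepstra; returns a (T, D) array of deltas.
    """
    T = len(feats)
    padded = np.pad(feats, ((w, w), (0, 0)), mode="edge")  # replicate edges
    denom = 2 * sum(t * t for t in range(1, w + 1))
    return sum(t * (padded[w + t:w + t + T] - padded[w - t:w - t + T])
               for t in range(1, w + 1)) / denom

# delta_delta = delta(delta(ceps))
```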

Slide 13: Challenges in robust recognition
"Classical" problems:
– Additive noise
– Linear filtering
"Modern" problems:
– Transient degradations
– Very low SNR
"Difficult" problems:
– Highly spontaneous speech
– Speech masked by other speech

Slide 14: "Classical" robust recognition: A model of the environment
[Figure: "clean" speech x[m] → linear filtering h[m] → + (additive noise n[m]) → degraded speech z[m]]

Slide 15: AVERAGED FREQUENCY RESPONSE FOR SPEECH AND NOISE
[Figure: two panels, close-talking microphone and desktop microphone]

Slide 16: Representation of environmental effects in the cepstral domain
[Figure: x[m] → h[m] → + (with n[m]) → z[m]]
Power spectra: $|Z(\omega)|^2 = |X(\omega)|^2 |H(\omega)|^2 + |N(\omega)|^2$
Effect of noise and filtering on cepstral or log spectral features:
$z = x + q + \log\left(1 + e^{\,n - x - q}\right)$, or $z = x + q + f(x, n, q)$,
where $f(x, n, q)$ is referred to as the "environment function" and x, n, q, and z are the log spectral (or cepstral) representations of the clean speech, additive noise, linear filter, and degraded speech.
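A numerically safe sketch of the environment function in the log-spectral domain, using the reconstruction above; the limiting behavior in the comment follows directly from the formula.

```python
import numpy as np

def environment_function(x, n, q):
    """f(x, n, q) = log(1 + exp(n - x - q)) in the log-spectral domain.

    x: clean log spectrum, n: noise log spectrum, q: channel log response.
    """
    return np.log1p(np.exp(n - x - q))

# The degraded observation is z = x + q + f(x, n, q): at high SNR
# (x >> n) the correction vanishes and z ~ x + q (pure filtering);
# at low SNR (n >> x) it dominates and z ~ n (pure noise).
```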

Slide 17: Another look at environmental distortions: Additive environmental compensation vectors
Environment functions for the PCC-160 cardioid desktop mic:
[Figure]
Comment: the functions depend on SNR and on phoneme identity

Slide 18: Highpass filtering of cepstral features
[Figure: z → highpass filter → x̂]
Examples: CMN (CMU et al.), RASTA and J-RASTA (OGI/ICSI/IDIAP et al.), multi-level CMN (Microsoft et al.)
Comments:
– Application to cepstral features compensates for linear filtering; application to spectral features compensates for additive noise
– "Great value for the money"

Slide 19: Two common cepstral highpass filters
CMN (Cepstral Mean Normalization): $\hat{x}_t = z_t - \frac{1}{T}\sum_{\tau=1}^{T} z_\tau$
RASTA (Relative Spectral Processing, 1994 version): $H(z) = 0.1\,z^4\,\frac{2 + z^{-1} - z^{-3} - 2z^{-4}}{1 - 0.98\,z^{-1}}$
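Both filters are one-liners over the cepstral trajectories. A sketch using the formulas above; the pure 4-frame advance (the z⁴ factor) in the published RASTA filter is dropped here, which only shifts the output in time.

```python
import numpy as np
from scipy.signal import lfilter

def cmn(ceps):
    """Cepstral mean normalization: remove the per-utterance mean."""
    return ceps - ceps.mean(axis=0)

def rasta_filter(ceps):
    """RASTA bandpass filtering along time in each cepstral coefficient.

    H(z) = 0.1 (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1).
    The numerator coefficients sum to zero, so (like CMN) the filter
    has zero DC response, as the next slide notes.
    """
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    a = np.array([1.0, -0.98])
    return lfilter(b, a, ceps, axis=0)
```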

Slide 20: "Frequency response" of CMN and RASTA filters
[Figure: magnitude responses of the two filters]
Comment: both RASTA and CMN have zero DC response

Slide 21: Principles of model-based environmental compensation
Attempt to estimate the parameters characterizing the unknown filter and noise that, when applied in inverse fashion, will maximize the likelihood of the observations.
[Figure: x[m] → h[m] → + (with n[m]) → z[m]]

Slide 22: Model-based compensation for noise and filtering: The VTS algorithm
The VTS algorithm (Moreno, Raj, Stern, 1996):
– Approximate f(x, n, q) by the first several terms of its Taylor series expansion, assuming that n and q are known
– The effects of f(x, n, q) on the statistics of the speech features can then be obtained analytically
– The EM algorithm is used to find the values of n and q that maximize the likelihood of the observations
– The statistics of the incoming cepstral vectors are re-estimated using MMSE techniques
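A much-simplified sketch of the expansion step for a single diagonal Gaussian in the log-spectral domain: linearizing z = x + q + f(x, n, q) around the clean mean gives the adapted mean analytically and scales the variance by the squared Jacobian. The EM re-estimation of n and q, the noise-variance contribution, and the MMSE reconstruction of the features are all omitted here.

```python
import numpy as np

def vts_adapt(mu_x, var_x, n, q):
    """First-order VTS adaptation of one diagonal Gaussian.

    mu_x, var_x: clean-speech mean and variance (log-spectral);
    n, q: current noise and channel estimates.
    """
    f = np.log1p(np.exp(n - mu_x - q))          # f(x, n, q) at mu_x
    dz_dx = 1.0 / (1.0 + np.exp(n - mu_x - q))  # diagonal Jacobian dz/dx
    mu_z = mu_x + q + f                         # adapted mean
    var_z = (dz_dx ** 2) * var_x                # noise-variance term omitted
    return mu_z, var_z
```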

Slide 23: The good news: VTS improves recognition accuracy in "stationary" noise
[Figure: recognition accuracy vs. SNR]
Comment: the more accurate modeling of VTS improves recognition accuracy at all SNRs compared to CDCN and CMN (1990)

Slide 24: But the bad news: Model-based compensation doesn't work very well in transient noise
[Figure: recognition accuracy vs. SNR for speech in music]
CDCN does not reduce speech recognition errors in music very much.

Slide 25: So what can we do about transient noises?
Two major approaches:
– Sub-band recognition (e.g. Bourlard, Morgan, Hermansky et al.)
– Missing-feature recognition (e.g. Cooke, Green, Lippmann et al.)
At CMU we've been working on a variant of the missing-feature approach.

Slide 26: MULTI-BAND RECOGNITION
Basic approach:
– Decompose speech into several adjacent frequency bands
– Train separate recognizers to process each band
– Recombine information (somehow)
Comment: motivated by the observation of Fletcher (and Allen) that the auditory system processes speech in separate frequency bands
Some implementation decisions:
– How many bands?
– At what level to do the splits and merges?
– How to recombine and weight the separate contributions?

Slide 27: MISSING-FEATURE RECOGNITION
General approach:
– Determine which cells of a spectrogram-like display are unreliable (or "missing")
– Ignore the missing features, or make a best guess about their values based on the data that are present

Slide 28: ORIGINAL SPEECH SPECTROGRAM
[Figure]

Slide 29: SPECTROGRAM CORRUPTED BY WHITE NOISE AT SNR 15 dB
[Figure]
Some regions are affected far more than others.

Slide 30: IGNORING REGIONS IN THE SPECTROGRAM THAT ARE CORRUPTED BY NOISE
[Figure]
– All regions with SNR less than 0 dB are deemed missing (dark blue)
– Recognition is performed based on the colored regions alone
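With oracle access to the clean and noise spectrograms, as in this experiment, the mask is a one-line threshold; blind identification of the missing regions, discussed on a later slide, is the hard part. A sketch, assuming natural-log power spectrograms:

```python
import numpy as np

def oracle_mask(clean_logspec, noise_logspec, thresh_db=0.0):
    """True where the local SNR is at least thresh_db (reliable cells).

    Cells below the threshold are deemed "missing"; the 0 dB default
    matches the criterion on the slide.
    """
    # convert the natural-log power ratio to dB
    snr_db = (10.0 / np.log(10.0)) * (clean_logspec - noise_logspec)
    return snr_db >= thresh_db
```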

Slide 31: Filling in missing features at CMU (Raj)
We modify the incoming features rather than the internal models (which is what has been done at Sheffield).
Why modify the incoming features?
– More flexible feature set (can use cepstral rather than log spectral features)
– Simpler processing
– No need to modify the recognizer
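The fill-in step can be sketched as a conditional Gaussian estimate: given a cluster (mean and covariance trained on clean speech) and the reliable cells of a frame, replace the missing cells with their conditional mean. This shows only the single-cluster core; the actual cluster-based method also selects a cluster per frame and bounds the estimates by the observed noisy values.

```python
import numpy as np

def fill_missing(frame, mask, mu, cov):
    """Conditional-mean fill-in of missing log-spectral values.

    frame: (D,) observed vector; mask: True where reliable;
    mu, cov: mean and covariance of one clean-speech cluster.
    """
    m, r = ~mask, mask
    # E[x_m | x_r] = mu_m + C_mr C_rr^-1 (x_r - mu_r)
    c_mr = cov[np.ix_(m, r)]
    c_rr = cov[np.ix_(r, r)]
    est = mu[m] + c_mr @ np.linalg.solve(c_rr, frame[r] - mu[r])
    out = frame.copy()
    out[m] = est
    return out
```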

Slide 32: Recognition accuracy using compensated cepstra, speech corrupted by white noise
[Figure: Accuracy (%) vs. SNR (dB) for Cluster Based Recon., Temporal Correlations, Spectral Subtraction, and Baseline]
– Large improvements in recognition accuracy can be obtained by reconstructing the corrupted regions of noisy speech spectrograms
– Knowledge of the locations of the "missing" features is needed

Slide 33: Recognition accuracy using compensated cepstra, speech corrupted by music
[Figure: Accuracy (%) vs. SNR (dB) for Cluster Based Recon., Temporal Correlations, Spectral Subtraction, and Baseline]
Recognition accuracy goes up from 7% to 69% at 0 dB with cluster-based reconstruction.

Slide 34: So how can we detect "missing" regions?
Current approach:
– Pitch detection to comb out harmonics in voiced segments
– Multivariate Bayesian classifiers using several features, such as
» the ratio of power at the harmonics relative to neighboring frequencies
» the extent of temporal synchrony to the fundamental frequency
How well we're doing now with blind identification:
– About halfway between baseline results and results using perfect knowledge of which data are missing
– About 25% of the possible improvement for background music

Slide 35: Missing features versus multi-band recognition
– Multi-band approaches are typically implemented with a relatively small number of channels, while with missing-feature approaches every time-frequency point can be considered or ignored
– The full-combination method for multi-band recognition considers every possible combination of present or missing bands, eliminating the need for blind identification of the optimal combination of inputs
– Nevertheless, if the identification problem could be solved, missing-feature approaches could provide superior recognition accuracy because they enable a finer partitioning of the observation space

Slide 36: Some other types of representations
– Physiologically-motivated representations ("ear models"): Seneff, Ghitza, Lyon/Slaney, Patterson, etc.
– Feature extraction using "smart" nonlinear transformations: Hermansky et al.

Slide 37: Physiologically-motivated speech processing
In recent years, signal processing motivated by knowledge of human auditory perception has become more popular.
– The abilities of human audition form a powerful existence proof

Slide 38: Some auditory principles that system developers consider
Structure of the auditory periphery (a crude sketch of these stages follows this list):
– Linear bandpass filtering
– Nonlinear rectification with saturation/gain control
– Further analysis
Dependence of the bandwidth of the peripheral filters on center frequency
Nonlinear phenomena:
– Saturation
– Lateral suppression
Temporal response:
– Synchrony and phase locking at low frequencies
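A deliberately crude single-channel cartoon of the bandpass-rectify-compress chain above. This is not any published auditory model: the Butterworth shapes, the compression exponent, and the smoothing cutoff are placeholders, and real front ends use gammatone-like filters whose ERB bandwidths grow with center frequency.

```python
import numpy as np
from scipy.signal import butter, lfilter

def peripheral_channel(x, fs, f_lo, f_hi):
    """One cartoon auditory channel: bandpass, rectify, compress, smooth."""
    # linear bandpass filtering (stand-in for a gammatone filter)
    b, a = butter(2, [f_lo / (fs / 2.0), f_hi / (fs / 2.0)], btype="band")
    band = lfilter(b, a, x)
    # nonlinear rectification
    rect = np.maximum(band, 0.0)
    # compressive nonlinearity (stand-in for saturation/gain control)
    comp = rect ** 0.33
    # smoothing to a "mean rate"-like envelope
    b_lp, a_lp = butter(1, 100.0 / (fs / 2.0))
    return lfilter(b_lp, a_lp, comp)
```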

Slide 39: An example: The Seneff model
[Figure: block diagram of the Seneff auditory model]

Slide 40: Timing information in the Seneff model
– The Seneff model includes the effects of synchrony at low frequencies
– The synchrony detector in the Seneff model records the extent to which the response in a frequency band is phase-locked with the channel's center frequency
– Local synchrony has been shown to represent vowels more robustly in the peripheral auditory system in the presence of additive noise (e.g. Young and Sachs)
– Related work by Ghitza, DeMori, and others shows improvements in recognition accuracy relative to features based on mean rate, but at the expense of much more computation

Slide 41: COMPUTATIONAL COMPLEXITY OF AUDITORY MODELS
Number of multiplications per ms of speech:
[Table: multiplication counts for several auditory models]
Comment: auditory computation is extremely expensive

Slide 42: Some other comments on auditory models
– "Correlogram"-type representations (channel-by-channel running autocorrelation functions) are being explored by some researchers (Slaney, Patterson, et al.): much more information in the display
– Auditory models have not yet realized their full potential because...
» the feature set must be matched to the classification system (the features are generally not Gaussian)
» all aspects of the available features must be used
» research groups need both auditory and ASR experts

Slide 43: "Smart" feature extraction using non-linear transformations (Hermansky group)
Complementary approaches using temporal slices (mostly):
– Temporal linear discriminant analysis (LDA) to obtain maximally-discriminable basis functions over a ~1-sec interval in each critical band
» The three vectors with the greatest eigenvalues are used as RASTA-like filters in each of 15 critical bands
» A Karhunen-Loeve transform is used to reduce the dimensionality down to 39, based on training data
– TRAP features
» Use an MLP to provide a nonlinear mapping from temporal trajectories to phoneme likelihoods
– Modulation-filtered spectrogram (MSG)
» Pass spectrogram features through two temporal modulation filters (0-8 Hz and 8-16 Hz), as sketched below
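A sketch of the modulation-filtering idea only: bandpass each spectral channel's trajectory over time. The Butterworth filter shapes and the 100 frames/s rate are my assumptions; the published MSG front end also includes gain adaptation that is omitted here.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def modulation_filter(logspec, frame_rate, lo_hz, hi_hz):
    """Bandpass one modulation band along time in each spectral channel.

    logspec: (T, D) log-spectrogram; frame_rate in frames per second.
    """
    nyq = frame_rate / 2.0
    if lo_hz <= 0.0:
        b, a = butter(2, hi_hz / nyq)  # 0-hi Hz band: plain lowpass
    else:
        b, a = butter(2, [lo_hz / nyq, hi_hz / nyq], btype="band")
    return filtfilt(b, a, logspec, axis=0)  # zero-phase, along time

# The two MSG bands over 100 frames/s features:
# slow = modulation_filter(logspec, 100.0, 0.0, 8.0)
# fast = modulation_filter(logspec, 100.0, 8.0, 16.0)
```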

Slide 44: Use of nonlinear feature transformations in the Aurora evaluation
– Multiple feature sets are combined by averaging the feature values after nonlinear mapping
» The best system combines transformed PLP features, transformed MSG features, plus TRAP features (a 63% improvement over the baseline!)
– The Aurora evaluation system used a reduced temporal span and other shortcuts to meet the delay, processing-time, and memory specs of the evaluation (a 40% net improvement over the baseline)
Comment: the procedure effectively moves some of the "training" to the level of the features; generalization to larger tasks remains to be verified

Slide 45: Feature combination versus compensation combination: The CMU SPINE system

Slide 46: SPINE evaluation conditions
[Figure]

Slide 47: The CMU SPINE system (Singh)
Three feature sets considered:
– Mel cepstra
– PLP cepstra
– Mel cepstra of lowpass-filtered speech
Four compensation schemes:
– Codeword-Dependent Cepstral Normalization (CDCN)
– Vector Taylor Series (VTS)
– Singular Value Decomposition (SVD)
– Karhunen-Loeve Transform-based noise cancellation (KLT)
Additional features from ICSI/OGI:
– PLP cepstra subjected to an MLP and a KL transform for orthogonalization

Slide 48: Summary of CMU and CMU-ICSI-OGI SPINE results (MFCC)
[Figure: results labeled ICSI/OGI, 4 Comp., 3 Feat., and 4 Feat.]

Slide 49: Comments
Some techniques we haven't discussed:
– VTLN
– Microphone arrays
– Time-frequency representations (e.g. wavelets)
– Robustness to Lombard speech, speaking style, etc.
– Many others
Some hard problems not addressed:
– Very low SNR ASR
– Highly spontaneous speech (!)
» A representation or pronunciation-modeling issue?

Slide 50: Summary
– Despite many shortcomings, cepstral-based features are well motivated; they are typically augmented by cepstral highpass filtering
– "Classical" model-based robustness techniques work reasonably well in combating quasi-stationary degradations
– "Modern" multiband and missing-feature techniques show great promise in coping with transient interference, etc.
– Auditory models remain appealing, although their potential has not yet been realized
– "Smart" features can provide dramatic improvements, at least in small tasks
– Feature combination will be a key component of future systems


