
1 SPOKEN LANGUAGE SYSTEMS, MIT Computer Science and Artificial Intelligence Laboratory
Relating Reliability in Phonetic Feature Streams to Noise Robustness
Alex Park, August 26th, 2003

2 Overview

Motivation for using a layered, phonetic feature stream approach
Building a recognizer based on phonetic features
– MFCC-based GMM feature detectors (baseline)
– Sample feature stream outputs
– Training a digit recognizer using concatenated feature streams as input
Robust alternatives for the voicing feature stream module
– Saul sinusoid detector
– Autocorrelation
– GMM classifier using alternative features
Evaluation of stream reliability using the distortion between clean and noisy speech
– Hard question: what is ground truth for continuous measurements?
Relating stream extraction reliability to word recognition accuracy
Conclusions and future work

3 Motivation

Failure of recognizers in noise is due to mismatch between the features observed in training and testing.
– To reduce this mismatch, we can evaluate and optimize the reliability of the features presented to the acoustic models at a "middle layer".
Current recognizers typically use one set of front-end features to train acoustic models at the phone level.
– Typical front-end features can only be evaluated by looking at WER, which is influenced by many factors; global optimization can mask serious inconsistencies in the speech representation under noise.
– Phonetic features can change asynchronously, especially in spontaneous speech.
Why phonetic features?
– They are perceivable by humans and relevant to speech.
– Several examples of phonetic feature/phone class detection exist:
* Bursts (Niyogi 2002), nasality (Glass 1986), voicing (Saul 2003)
– Other researchers have recently proposed acoustic modelling frameworks based on related feature streams (articulatory, acoustic, distinctive):
* Articulatory (Livescu 2003, Metze 2002), acoustic (Kirchhoff 2002)
– Why not?

4 Training MFCC GMM Feature Classifiers

Sparse set of 6 phonetic features chosen for simplicity.
– For a less constrained task, more features should probably be used.
– More extensive training data would also improve the quality of each feature detector.
For each feature F, train two GMMs, p(x|+F) and p(x|-F), using frame-level MFCC feature vectors.
Trained on 410 TIMIT sentences from 40 speakers (126k frames).
Use Bayes' rule (with equal priors) to determine the posterior probability, computed every 10 ms:

    p(+F|x) = p(x|+F) / (p(x|+F) + p(x|-F))

Feature       TIMIT labels
Frication     s, sh, z, zh, f, th, ...
Rounding      w, ow, uw, ...
Nasal         n, m, ng, ...
Liquid/Glide  el, l, uw, ...
Burst         g, k, p, ...
Voice         aa, ae, ah, ...

[Diagram: transcribed speech (training data) is used to train the +F and -F GMMs; at test time, an unknown frame x is scored by both to produce a posterior probability. A sketch of this step follows this slide.]
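As a concrete illustration of the per-feature classifier described above, here is a minimal sketch using scikit-learn GMMs. The library choice, mixture count, and function names are assumptions for illustration; the slides do not specify which tools were used.

```python
# Minimal sketch of a per-feature GMM detector, assuming scikit-learn.
# The mixture count is an illustrative choice, not a value from the slides.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_feature_detector(pos_frames, neg_frames, n_mix=10):
    """Train p(x|+F) and p(x|-F) on labelled MFCC frames (n_frames x dim)."""
    gmm_pos = GaussianMixture(n_components=n_mix, covariance_type="diag")
    gmm_neg = GaussianMixture(n_components=n_mix, covariance_type="diag")
    gmm_pos.fit(pos_frames)  # frames whose TIMIT label carries the feature
    gmm_neg.fit(neg_frames)  # all other frames
    return gmm_pos, gmm_neg

def feature_posterior(gmm_pos, gmm_neg, frames):
    """Bayes' rule with equal priors: p(+F|x) = p(x|+F) / (p(x|+F) + p(x|-F))."""
    ll_pos = gmm_pos.score_samples(frames)  # log p(x|+F), one per 10 ms frame
    ll_neg = gmm_neg.score_samples(frames)  # log p(x|-F)
    # Evaluate the ratio in the log domain for numerical stability.
    return 1.0 / (1.0 + np.exp(ll_neg - ll_pos))
```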

5 Sample Outputs: MFCC-based Streams

[Plot: the six MFCC-based feature stream outputs on an Aurora utterance, "six three five seven one zero four"; the voicing stream, the focus of the following slides, is highlighted.]

6 Recognizer Training

Phonetic feature posterior probability outputs are used as feature vectors to train an Aurora HMM recognizer (the vector assembly is sketched below).
– Standard training script included with the Aurora 2 evaluation (8440 clean training utterances)
– Eleven whole-word models and one silence model
– 18 states each, 3 mixtures, 6-dimensional diagonal Gaussian emission probabilities; probably not an optimal model structure for the given feature set
– Also, used HCompV instead of HInit with time-aligned transcriptions

[Diagram: the feature extraction modules (frication, burst, voicing, rounding, nasal, glide) produce posteriors that are concatenated into feature vectors on the clean training data and used to train whole-word HMMs ("one", "two", ..., "oh").]
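Here is a sketch of the concatenation step, reusing the hypothetical feature_posterior helper from the previous sketch; the actual whole-word HMMs were trained with the HTK Aurora 2 scripts, not in Python.

```python
# Sketch of assembling the 6-dimensional observation vectors from the
# per-feature detectors (feature_posterior is the hypothetical helper
# from the previous sketch). The real system handed these vectors to
# the HTK Aurora 2 training scripts.
import numpy as np

FEATURES = ["frication", "burst", "voicing", "rounding", "nasal", "glide"]

def stream_features(frames, detectors):
    """Stack per-feature posteriors into an (n_frames, 6) observation matrix.

    detectors maps a feature name to its trained (gmm_pos, gmm_neg) pair.
    """
    cols = []
    for name in FEATURES:
        gmm_pos, gmm_neg = detectors[name]
        cols.append(feature_posterior(gmm_pos, gmm_neg, frames))
    return np.stack(cols, axis=1)  # one 6-dim vector per 10 ms frame
```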

7 Preliminary Recognition Results

Tested across all 4 noise conditions and 7 SNR levels on Aurora test set A.
– Accuracy is 88% on clean data (91% was obtained earlier using 9 feature streams, but the set was reduced to 6 for simplicity).
Performance is poor compared to the Aurora baseline, but interesting considering the sparsity of the feature set used to train the HMMs.
Many factors should be addressed to improve the stream-based recognizer:
– More feature streams
– Deltas and delta-deltas (sketched after this list)
– Relationships between feature streams
– Discriminative lexical ability for different word models
– Noise compensation in feature extraction
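Of the items above, deltas and delta-deltas are the most mechanical to add; here is a minimal sketch using the standard regression formula. The plus/minus 2 frame window is a common convention, not a value from the slides.

```python
# Standard regression deltas appended to the stream features. The
# window size is a common default, not a parameter from the slides.
import numpy as np

def delta(x, window=2):
    """First-order deltas of an (n_frames, d) matrix, edge-padded."""
    n = len(x)
    pad = np.pad(x, ((window, window), (0, 0)), mode="edge")
    num = sum(k * (pad[window + k:window + k + n] - pad[window - k:window - k + n])
              for k in range(1, window + 1))
    den = 2 * sum(k * k for k in range(1, window + 1))
    return num / den

def add_deltas(obs):
    """Extend (n_frames, d) observations to (n_frames, 3d) with delta-deltas."""
    d1 = delta(obs)
    return np.hstack([obs, d1, delta(d1)])
```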

8 A Closer Look: Stream Corruption Under Noise

Effect of noise on the output of the MFCC-based voicing feature module.

[Plot: p(Voice) over time, comparing the clean utterance against 15 dB, 5 dB, and -5 dB SNR versions.]
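The noisy utterances here come from the Aurora test sets, so no mixing code was needed; purely for readers reproducing this on their own data, here is a generic sketch of what "speech at X dB SNR" means.

```python
# Generic sketch of mixing noise into speech at a target SNR. Aurora
# already provides the noisy versions used in the slides; this is only
# for reproducing the corruption on other data.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that 10*log10(P_speech / P_noise) == snr_db."""
    noise = np.resize(noise, speech.shape)  # loop the noise if too short
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```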

9 In Search of a Better Voicing Module

Several possible alternatives to the MFCC-based voicing module:
– Autocorrelation (AutoCorr), sketched below
– Sinusoid uncertainty (Saul, 2003)
– Alternative GMM classifier (AltGMM): trained like the MFCC classifier, but using the above features; 6-dimensional, 10-mixture diagonal Gaussians each for p(x|+F) and p(x|-F)
Evaluated voicing detection using the phonetic transcription as reference.
In clean conditions, the MFCC GMM has the best detection performance. Is this the best module to use?

Method    Equal Error Rate
GMM       11.14%
Sinusoid  18.11%
AutoCorr  16.78%
AltGMM    24.84%
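A minimal sketch of the autocorrelation-style voicing score listed above; the pitch range, normalization, and frame-length requirement are illustrative assumptions, not the parameters actually used.

```python
# Minimal sketch of a frame-level autocorrelation voicing score. The
# pitch range and normalization are illustrative assumptions. The frame
# should span at least two pitch periods (e.g. 25-32 ms at 8 kHz).
import numpy as np

def autocorr_voicing(frame, fs=8000, f0_min=60.0, f0_max=400.0):
    """Peak normalized autocorrelation over plausible pitch lags.

    Values near 1 indicate strong periodicity (voiced); near 0, unvoiced.
    """
    frame = frame - frame.mean()
    energy = np.dot(frame, frame)
    if energy == 0.0:
        return 0.0
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    return float(np.max(ac[lo:hi + 1]) / energy)
```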

10 Evaluating Stream Robustness

Several problems with using global frame detection accuracy to rate module performance:
– We would like a continuous measure of voicing (degree of voicing) instead of a binary decision.
– Ground truth is hard to come by: voiced phone labels are not necessarily voiced!
To evaluate reliability, try using the distortion between the clean and noisy voicing probability for the same utterance (sketched below):
– For each frame t, measure the difference between the clean estimate f_c(t) and the noisy estimate f_n(t).
– If |f_c(t) - f_n(t)| > 0.2, label f(t) as a gross error.
– If |f_c(t) - f_n(t)| < 0.2, use |f_c(t) - f_n(t)| as a measure of the distortion caused by noise.
N.B. Consistency does not guarantee accuracy; we still need to check the voicing score against the truth in clean conditions.

[Diagram: outputs of Module A and Module B against a voiced/unvoiced truth track, asking "Which module is better?" on cleaner signals and "What about now?" as the signals get noisier, with a particular frame k highlighted.]
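Here is a minimal sketch of that distortion measure; the function and variable names are mine, not from the slides.

```python
# Sketch of the clean-vs-noisy distortion measure described above:
# frames whose voicing probability moves by more than 0.2 count as
# gross errors, and the remaining frames contribute to the average
# distortion.
import numpy as np

def stream_distortion(f_clean, f_noisy, threshold=0.2):
    """Return (gross error rate, mean distortion of the remaining frames).

    f_clean, f_noisy: per-frame voicing probabilities for the same
    utterance under clean and noisy conditions.
    """
    diff = np.abs(np.asarray(f_clean) - np.asarray(f_noisy))
    gross = diff > threshold
    gross_rate = float(gross.mean())
    avg_distortion = float(diff[~gross].mean()) if (~gross).any() else 0.0
    return gross_rate, avg_distortion
```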

11 Distortion Comparison

Compared the frame distortion for the voicing modules at each noise level:
– Percentage of frames labelled as gross errors (distortion > 0.2)
– Average distortion for the remaining frames (distortion < 0.2)
Despite its higher performance on clean data, the MFCC module is the most erratic.
In terms of consistency, the AltGMM module outperforms the MFCC module in noise.

[Plot: gross error rate and average distortion versus noise level for the four modules, with annotated values of 76% and 30%.]

12 A "Better" Voicing Module?

Output of the AltGMM module trained on AutoCorr and SinUn features.

[Plot: p(Voice) over time, comparing the clean utterance against 15 dB, 5 dB, and -5 dB SNR versions.]

13 Recognition Performance Comparison

Trained 3 additional recognizers, one for each alternative voicing module.
Performed recognition experiments to compare the voicing modules.
No significant difference in accuracy at any noise level.
Need to perform additional experiments to understand the effect of the voicing modules on recognition.

[Diagram: each test utterance passes through the six feature extraction modules, with the voicing module swapped out, before being concatenated into the feature vector.]

14 Oracle Experiment

What happens if we assume the voicing module is perfectly reliable?
– i.e., the same output under any noise condition
Accuracy is not improved over the normal scenario.
– Having a robust voicing feature alone is not enough to improve recognition.
– Corruption of the other feature streams is likely skewing the overall acoustic model scores.
How can we isolate the contribution of this feature stream?

[Diagram: the voicing stream is extracted from the clean utterance while the other five streams are extracted from the noisy test utterance.]

15 Inverse Oracle Experiment

Assume the other feature streams are computed consistently (from clean speech); allow the voicing module to contribute its actual output on the noisy speech. Both substitutions are sketched below.
Significant difference in performance between the 4 voicing modules:
– Even with 5 of 6 clean features, the MFCC voicing module degrades quickly in noise.
– Recognition performance of each method is correlated with the distortion results.

[Diagram: five streams extracted from the clean utterance, voicing extracted from the noisy test utterance; the plot annotations show accuracies of 67% and 22%.]
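To make the two oracle setups concrete, here is a minimal sketch of the stream substitutions they perform, building on the hypothetical stream_features matrices from the earlier sketch.

```python
# Sketch of the oracle and inverse-oracle feature assembly. VOICING_DIM
# is the column of the 6-dim observation vector holding the voicing
# stream; all names here are illustrative.
import numpy as np

VOICING_DIM = FEATURES.index("voicing")

def oracle_features(clean_obs, noisy_obs):
    """Oracle: noisy streams, except a perfectly reliable (clean) voicing stream."""
    obs = noisy_obs.copy()
    obs[:, VOICING_DIM] = clean_obs[:, VOICING_DIM]
    return obs

def inverse_oracle_features(clean_obs, noisy_obs):
    """Inverse oracle: clean streams, except the actual noisy voicing output."""
    obs = clean_obs.copy()
    obs[:, VOICING_DIM] = noisy_obs[:, VOICING_DIM]
    return obs
```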

16 Conclusions and Future Work

A small set of phonetic features can obtain fairly high (~88%) recognition accuracy on a constrained digit task, even when integrated in a non-optimal manner (HMM).
Reliable extraction of feature streams is essential for robust recognition.
Combining statistical training with feature-specific measurements can improve the reliability of feature stream extraction.
Even if the other 5 streams are computed perfectly, corrupting the voicing stream can drastically degrade recognition accuracy.
Future work:
– Integrate the feature streams with a more appropriate acoustic modelling layer (e.g. feature-based graphical models or DBNs).
– Optimize the individual feature stream modules with relevant measurements:
* Nasality: broad F1 bandwidth, low spectral slope in the F1:F2 region, stable low-frequency energy
* Rounding: low F1 and F2
* Retroflexion: low F3, rising formants
– Combine the feature streams with an SNR-based measure of reliability.
Lots to be done!

17 References

J. R. Glass and V. W. Zue (1986). "Detection and Recognition of Nasal Consonants in American English," in Proc. ICASSP '86, Tokyo, Japan.
P. Niyogi and M. M. Sondhi (2002). "Detecting Stop Consonants in Continuous Speech," J. Acoust. Soc. Am., vol. 111, p. 1063.
L. K. Saul, D. D. Lee, C. L. Isbell, and Y. LeCun (2003). "Real time voice processing with audiovisual feedback: toward autonomous agents with perfect pitch," in S. Becker, S. Thrun, and K. Obermayer (eds.), Advances in Neural Information Processing Systems 15. MIT Press: Cambridge, MA.
K. Kirchhoff, G. A. Fink, and G. Sagerer (2002). "Combining acoustic and articulatory feature information for robust speech recognition," Speech Communication, May 2002.
K. Livescu, J. R. Glass, and J. Bilmes (2003). "Hidden Feature Models for Speech Recognition Using Dynamic Bayesian Networks," to be presented at Eurospeech '03, Geneva, Switzerland.
F. Metze and A. Waibel (2002). "A Flexible Stream Architecture for ASR Using Articulatory Features," in Proc. ICSLP '02, Denver, Colorado.

18 Band-limited Sinusoid Fitting (Saul 2003)

Filter bandwidths allow at least one filter to resolve single harmonics.
Frames of the filtered signals are fit with a sinusoid of frequency ω and fitting error ε.
At each step, the lowest ε gives the voicing probability and the corresponding ω gives the pitch estimate.
The algorithm is fast and gives accurate pitch tracks. (A simplified sketch of the per-frame fit follows.)

[Diagram: signal preconditioning (low-pass filter, half-wave rectification max(x, 0)) feeds an 8-band octave filterbank (25-75 Hz, ..., 134-407 Hz, ..., 264-800 Hz); sinusoids are fit over a sliding temporal window, and the band i* with minimal error ε_i* yields the outputs p(V) = f(ε_i*) and F0 = ω_i* / 2π.]
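Here is a rough sketch of the sinusoid fit that underlies the method. This simplifies Saul's algorithm considerably: the octave filterbank, preconditioning, and the mapping f from error to probability are omitted, and the least-squares grid search is my own illustrative choice.

```python
# Rough sketch of fitting a single sinusoid to a windowed frame and
# using the residual as a voicing cue. This is a simplification of
# Saul's algorithm: the filterbank, preconditioning, and the mapping
# from error to p(V) are omitted; the grid search is illustrative.
import numpy as np

def sinusoid_fit_error(frame, fs, freqs):
    """Least-squares fit of a*cos(2*pi*f*t) + b*sin(2*pi*f*t) per candidate f.

    Returns (best frequency in Hz, normalized residual error in [0, 1]);
    a low error suggests voicing, and the best frequency tracks pitch.
    """
    t = np.arange(len(frame)) / fs
    energy = np.dot(frame, frame) + 1e-12
    best_f, best_err = None, np.inf
    for f in freqs:
        basis = np.column_stack([np.cos(2 * np.pi * f * t),
                                 np.sin(2 * np.pi * f * t)])
        coef, *_ = np.linalg.lstsq(basis, frame, rcond=None)
        residual = frame - basis @ coef
        err = float(np.dot(residual, residual) / energy)
        if err < best_err:
            best_f, best_err = f, err
    return best_f, best_err
```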

19 Supplementary Recognition Results I (Actual Streams)

20 Supplementary Recognition Results II (Oracle Voicing)

21 Supplementary Recognition Results III (Inverse Oracle Voicing)

22 Supplementary Distortion Results I (Gross Error Rate)

23 Supplementary Distortion Results II (Average Frame Distortion)

