An Analysis of the Aurora Large Vocabulary Evaluation. Authors: Naveen Parihar and Joseph Picone, Inst. for Signal and Info. Processing, Dept. of Electrical and Computer Eng., Mississippi State University.


An Analysis of the Aurora Large Vocabulary Evaluation
Authors: Naveen Parihar and Joseph Picone
Inst. for Signal and Info. Processing, Dept. of Electrical and Computer Eng., Mississippi State University
Contact Information: Box 9571, Mississippi State University, Mississippi State, Mississippi
Tel: Fax:
EUROSPEECH 2003
URL: isip.msstate.edu/publications/conferences/eurospeech/2003/evaluation/

INTRODUCTION ABSTRACT In this paper, we analyze the results of the recent Aurora large vocabulary evaluations (ALV). Two consortia submitted proposals on speech recognition front ends for this evaluation: (1) Qualcomm, ICSI, and OGI (QIO), and (2) Motorola, France Telecom, and Alcatel (MFA). These front ends used a variety of noise reduction techniques including discriminative transforms, feature normalization, voice activity detection, and blind equalization. Participants used a common speech recognition engine to post-process their features. In this paper, we show that the results of this evaluation were not significantly impacted by suboptimal recognition system parameter settings. Without any front end specific tuning, the MFA front end outperforms the QIO front end by 9.6% relative. With tuning, the relative performance gap increases to 15.8%. Both the mismatched microphone and additive noise evaluation conditions resulted in a significant degradation in performance for both front ends.

INTRODUCTION MOTIVATION
ALV goal: at least a 25% relative improvement over the baseline MFCC front end
Two consortia participated:
- QIO: QualComm, ICSI, OGI
- MFA: Motorola, France Telecom, Alcatel
Generic baseline LVCSR system with no front end specific tuning
Would front end specific tuning change the rankings?

ALV Evaluation Results (WER):

            Overall    8 kHz (TS1 / TS2)       16 kHz (TS1 / TS2)
  MFCC      50.3%      49.6% (58.1% / 41.0%)   51.0% (62.2% / 39.8%)
  QIO       37.5%      38.4% (43.2% / 33.6%)   36.5% (40.7% / 32.4%)
  MFA       34.5%      34.5% (37.5% / 31.4%)   34.4% (37.2% / 31.5%)
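The "relative improvement" figures quoted throughout the evaluation follow from a simple formula; a minimal sketch (the function name is mine, the WER values are the ones tabulated above and in the tuning comparison):

```python
def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction of a new system vs. a baseline, in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Overall WERs from the evaluation table:
print(round(relative_improvement(50.3, 37.5), 1))  # QIO vs. MFCC baseline -> 25.4
print(round(relative_improvement(50.3, 34.5), 1))  # MFA vs. MFCC baseline -> 31.4
# Average WERs without tuning (34.7% vs. 38.4%):
print(round(relative_improvement(38.4, 34.7), 1))  # MFA vs. QIO -> 9.6
```

Both consortia clear the 25% relative target against the MFCC baseline, which is the evaluation goal stated above.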

EVALUATION PARADIGM THE AURORA-4 DATABASE
Acoustic Training:
- Derived from the 5000-word WSJ0 task
- TS1 (clean) and TS2 (multi-condition)
- Clean plus 6 noise conditions
- Randomly chosen SNR between 10 and 20 dB
- 2 microphone conditions (Sennheiser and secondary)
- 2 sample frequencies: 16 kHz and 8 kHz
- G.712 filtering at 8 kHz and P.341 filtering at 16 kHz
Development and Evaluation Sets:
- Derived from the WSJ0 Evaluation and Development sets
- 14 test sets each: 7 recorded on the Sennheiser mic, 7 on the secondary mic
- Clean plus 6 noise conditions
- Randomly chosen SNR between 5 and 15 dB
- G.712 filtering at 8 kHz and P.341 filtering at 16 kHz
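Adding noise "at a randomly chosen SNR" means scaling the noise so the speech-to-noise power ratio hits the target before summing. A minimal sketch of that step (function name is mine; Aurora-4's actual corpus preparation also applies the G.712/P.341 filtering listed above):

```python
import math
import random

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals snr_db,
    then add it to the speech sample-by-sample."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Gain that brings the noise to the power implied by the target SNR.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]

# Aurora-4 training style: draw an SNR uniformly in [10, 20] dB per utterance.
snr = random.uniform(10.0, 20.0)
```

For the development and evaluation sets, the draw would be over [5, 15] dB instead.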

EVALUATION PARADIGM BASELINE LVCSR SYSTEM
Standard context-dependent cross-word HMM-based system:
- Acoustic models: state-tied 4-mixture cross-word triphones
- Language model: WSJ0 5K bigram
- Search: Viterbi one-best using lexical trees for N-gram cross-word decoding
- Lexicon: based on CMUlex
- Real-time: 4 xRT for training and 15 xRT for decoding on an 800 MHz Pentium
[Diagram: acoustic model training flow: Training Data, Monophone Modeling, CD-Triphone Modeling, State-Tying, CD-Triphone Modeling, Mixture Modeling (2, 4)]

EVALUATION PARADIGM WI007 ETSI MFCC FRONT END
The baseline HMM system used an ETSI standard MFCC-based front end:
- Zero-mean debiasing
- 10 ms frame duration
- 25 ms Hamming window
- Absolute energy
- 12 cepstral coefficients
- First and second derivatives
[Diagram: Input Speech, Zero-mean and Pre-emphasis, Fourier Transform Analysis, Cepstral Analysis, plus an Energy term]
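The framing numbers above pin down the first stage of any of these front ends: at 8 kHz, a 10 ms frame shift is 80 samples and a 25 ms window is 200 samples. A minimal sketch of the framing and Hamming windowing step (function name is mine; the real WI007 chain continues with the Fourier and cepstral analysis shown in the diagram):

```python
import math

def frames(signal, sample_rate=8000, frame_ms=10, window_ms=25):
    """Slice a signal into overlapping analysis windows and apply a
    Hamming window (10 ms frame shift, 25 ms window, as in WI007)."""
    shift = int(sample_rate * frame_ms / 1000)    # 80 samples at 8 kHz
    width = int(sample_rate * window_ms / 1000)   # 200 samples at 8 kHz
    hamming = [0.54 - 0.46 * math.cos(2 * math.pi * n / (width - 1))
               for n in range(width)]
    out = []
    for start in range(0, len(signal) - width + 1, shift):
        frame = signal[start:start + width]
        out.append([x * w for x, w in zip(frame, hamming)])
    return out
```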

FRONT END PROPOSALS QIO FRONT END
Qualcomm, ICSI, OGI (QIO) front end:
- 10 ms frame duration
- 25 ms analysis window
- 15 RASTA-like filtered cepstral coefficients
- MLP-based VAD
- Mean and variance normalization
- First and second derivatives
[Diagram: Input Speech through Fourier Transform, Mel-scale Filter Bank, RASTA filtering, DCT, and Mean/Variance Normalization, with an MLP-based VAD branch]
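The mean and variance normalization stage is simple to state precisely: each cepstral dimension is shifted to zero mean and scaled to unit variance over the utterance. A minimal per-utterance sketch (function name is mine; streaming DSR implementations typically use running estimates instead):

```python
import math

def mvn(features):
    """Per-utterance cepstral mean and variance normalization: each
    feature dimension gets zero mean and unit variance."""
    dims = len(features[0])
    n = len(features)
    means = [sum(f[d] for f in features) / n for d in range(dims)]
    # `or 1.0` guards against a zero standard deviation (constant dimension).
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in features) / n) or 1.0
            for d in range(dims)]
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)]
            for f in features]
```

Normalization of this kind removes static channel offsets and gain differences, which is one reason the QIO front end degrades less than the MFCC baseline under the mismatched-microphone condition reported later.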

FRONT END PROPOSALS MFA FRONT END
Motorola, France Telecom, Alcatel (MFA) front end:
- 10 ms frame duration
- 25 ms analysis window
- Mel-warped Wiener filter based noise reduction
- Energy-based VADNest
- Waveform processing to enhance SNR
- Weighted log-energy
- 12 cepstral coefficients
- Blind equalization (cepstral domain)
- VAD based on acceleration of various energy-based measures
- First and second derivatives
[Diagram: Input Speech through Noise Reduction, Waveform Processing, Cepstral Analysis, Blind Equalization, and Feature Processing, with VADNest and VAD branches]
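At the core of Wiener-filter noise reduction is a per-frequency gain G = SNR/(1+SNR) derived from signal and noise power estimates. The toy sketch below shows only that core idea under the assumption that a noise power spectrum estimate is available (e.g. from VADNest-flagged non-speech frames); the MFA front end's actual design is a two-stage, mel-warped variant:

```python
def wiener_gain(signal_psd, noise_psd, floor=0.01):
    """Per-bin Wiener gain G = SNR / (1 + SNR), with a small gain floor
    to limit musical-noise artifacts. Inputs are power spectra."""
    gains = []
    for s, n in zip(signal_psd, noise_psd):
        if n <= 0.0:
            gains.append(1.0)          # no noise in this bin: pass it through
            continue
        snr = max(s - n, 0.0) / n      # SNR estimate via power subtraction
        gains.append(max(snr / (1.0 + snr), floor))
    return gains
```

Multiplying each noisy spectral bin by its gain attenuates low-SNR bins strongly and leaves high-SNR bins nearly untouched.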

EXPERIMENTAL RESULTS FRONT END SPECIFIC TUNING
Pruning beams (word, phone, and state) were opened during the tuning process to eliminate search errors.
Tuning parameters:
- State-tying thresholds: address the sparsity of training data by sharing state distributions among phonetically similar states
- Language model scale: controls the influence of the language model relative to the acoustic models (more relevant for WSJ)
- Word insertion penalty: balances insertions and deletions (always a concern in noisy environments)
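The last two tuning parameters enter the decoder as terms in the hypothesis score. A minimal sketch of the usual log-domain combination (names and default values are mine, not the evaluation's actual settings):

```python
def hypothesis_score(acoustic_logprob, lm_logprob, num_words,
                     lm_scale=15.0, word_ins_penalty=0.0):
    """Total log score of a hypothesis as typically combined in Viterbi
    decoding: the LM scale weights the language model against the acoustic
    model; the word insertion penalty is charged once per word, trading
    insertion errors against deletion errors."""
    return acoustic_logprob + lm_scale * lm_logprob + word_ins_penalty * num_words
```

Raising the LM scale favors hypotheses the language model likes; making the per-word penalty more negative suppresses insertions, which matters in noise where spurious short words are common.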

EXPERIMENTAL RESULTS FRONT END SPECIFIC TUNING - QIO
Parameter tuning:
- clean data recorded on the Sennheiser mic. (corresponds to Training Set 1 and Devtest Set 1 of the Aurora-4 database)
- 8 kHz sampling frequency
7.5% relative improvement
[Table: baseline vs. tuned QIO configurations: number of tied states, state-tying thresholds (split, merge, occupancy), LM scale, word insertion penalty, and WER]

EXPERIMENTAL RESULTS FRONT END SPECIFIC TUNING - MFA
Parameter tuning:
- clean data recorded on the Sennheiser mic. (corresponds to Training Set 1 and Devtest Set 1 of the Aurora-4 database)
- 8 kHz sampling frequency
9.4% relative improvement
Ranking is still the same (14.9% vs. 12.5%)!
[Table: baseline vs. tuned MFA configurations: number of tied states, state-tying thresholds (split, merge, occupancy), LM scale, word insertion penalty, and WER]

EXPERIMENTAL RESULTS COMPARISON OF TUNING

  Front End   Train Set   Tuning   Average WER over 14 Test Sets
  QIO         1           No       43.1%
  QIO         2           No       38.1%
  Avg. WER                No       38.4%
  QIO         1           Yes      45.7%
  QIO         2           Yes      35.3%
  Avg. WER                Yes      40.5%
  MFA         1           No       37.5%
  MFA         2           No       31.8%
  Avg. WER                No       34.7%
  MFA         1           Yes      37.0%
  MFA         2           Yes      31.1%
  Avg. WER                Yes      34.1%

Same ranking: relative performance gap increased from 9.6% to 15.8%
On TS1, MFA FE significantly better on all 14 test sets (MAPSSWE, p = 0.1%)
On TS2, MFA FE significantly better only on test sets 5 and 14
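All the averages above are word error rates, i.e. word-level Levenshtein distance divided by the reference length. The evaluation used standard scoring tools for this; the toy function below (mine) just shows the metric itself:

```python
def wer(ref, hyp):
    """Word error rate: minimum substitutions + insertions + deletions
    needed to turn the hypothesis into the reference, over reference length."""
    r, h = ref.split(), hyp.split()
    # Dynamic-programming edit-distance table.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

Note that insertions can push WER above 100%, which is one reason the insertion-sensitive word insertion penalty is tuned separately.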

EXPERIMENTAL RESULTS MICROPHONE VARIATION
Train on the Sennheiser mic.; evaluate on the secondary mic.
Matched conditions result in optimal performance
Significant degradation for all front ends on mismatched conditions
Both QIO and MFA provide improved robustness relative to the MFCC baseline
[Chart: WER for ETSI, QIO, and MFA under Sennheiser vs. secondary microphone conditions]

EXPERIMENTAL RESULTS ADDITIVE NOISE
Performance degrades on noisy conditions when systems are trained only on clean data
Both QIO and MFA deliver improved performance
Exposing systems to noise and microphone variations (TS2) improves performance
[Charts: WER for ETSI, QIO, and MFA on test sets TS2 through TS7, trained on clean vs. multi-condition data]

SUMMARY AND CONCLUSIONS WHAT HAVE WE LEARNED?
Front end specific parameter tuning did not significantly change overall performance (MFA still outperforms QIO)
Both QIO and MFA front ends handle convolutional and additive noise better than the ETSI baseline
Both QIO and MFA front ends achieved the ALV evaluation goal of improving performance by at least 25% relative over the ETSI baseline
WER is still high (~35%); further research on noise-robust front ends is needed

SUMMARY AND CONCLUSIONS AVAILABLE RESOURCES
Speech Recognition Toolkits: compare front ends to standard approaches using a state-of-the-art ASR toolkit
ETSI DSR Website: reports and front end standards
Aurora Project Website: recognition toolkit, multi-CPU scripts, database definitions, publications, and a performance summary of the baseline MFCC front end

SUMMARY AND CONCLUSIONS BRIEF BIBLIOGRAPHY
N. Parihar, Performance Analysis of Advanced Front Ends, M.S. Dissertation, Mississippi State University, December 2003.
N. Parihar, J. Picone, D. Pearce, and H.G. Hirsch, "Performance Analysis of the Aurora Large Vocabulary Baseline System," submitted to Eurospeech 2003, Geneva, Switzerland, September 2003.
N. Parihar and J. Picone, "DSR Front End LVCSR Evaluation - AU/384/02," Aurora Working Group, European Telecommunications Standards Institute, December 6, 2002.
D. Pearce, "Overview of Evaluation Criteria for Advanced Distributed Speech Recognition," ETSI STQ-Aurora DSR Working Group, October 2001.
G. Hirsch, "Experimental Framework for the Performance Evaluation of Speech Recognition Front-ends in a Large Vocabulary Task," ETSI STQ-Aurora DSR Working Group, December 2002.
"ETSI ES v1.1.2 Distributed Speech Recognition; Front-end Feature Extraction Algorithm; Compression Algorithm," ETSI, April 2000.

SUMMARY AND CONCLUSIONS BIOGRAPHY
Naveen Parihar is an M.S. student in Electrical Engineering in the Department of Electrical and Computer Engineering at Mississippi State University. He currently leads the Core Speech Technology team developing a state-of-the-art public-domain speech recognition system. Mr. Parihar's research interests lie in the development of discriminative algorithms for better acoustic modeling and feature extraction. Mr. Parihar is a student member of the IEEE. Joseph Picone is currently a Professor in the Department of Electrical and Computer Engineering at Mississippi State University, where he also directs the Institute for Signal and Information Processing. For the past 15 years he has been promoting open source speech technology. He has previously been employed by Texas Instruments and AT&T Bell Laboratories. Dr. Picone received his Ph.D. in Electrical Engineering from Illinois Institute of Technology. He is a Senior Member of the IEEE and a registered Professional Engineer.