
SRI 2001 SPINE Evaluation System
Venkata Ramana Rao Gadde, Andreas Stolcke, Dimitra Vergyri, Jing Zheng, Kemal Sonmez, Anand Venkataraman

Talk Overview
• System Description
  – Components: segmentation, features, acoustic models, acoustic adaptation, language models, word posteriors, system combination
  – Processing steps
• Results
  – Dry run, evaluation
  – What worked, what didn't
  – Fourier cepstrum revisited
• Evaluation Issues
• Future Work
• Conclusions

System Description

Segmentation
• Segmentation is done in multiple steps (a sketch of the foreground/background pass follows below):
  – Classify and segment the waveform into foreground/background using a 2-class HMM
  – Recognize the foreground segments
  – Compute word posterior probabilities (from confusion networks derived from N-best lists)
  – Resegment the foreground segments, eliminating word hypotheses with posteriors below a threshold (optimized on the dry run data)
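A minimal sketch (not SRI's implementation) of the 2-class HMM pass: Viterbi decoding over per-frame speech/background log-likelihoods with a switching penalty that discourages rapid class changes. The per-frame log-likelihoods are assumed inputs here; they would come from, e.g., foreground and background GMMs.

    import numpy as np

    def segment(loglik_speech, loglik_bg, switch_penalty=10.0):
        """Return one 0/1 label per frame (1 = foreground speech)."""
        ll = np.stack([loglik_bg, loglik_speech])   # (2, T) frame log-likelihoods
        T = ll.shape[1]
        best = ll[:, 0].copy()                      # Viterbi scores at frame 0
        back = np.zeros((2, T), dtype=int)          # backpointers
        for t in range(1, T):
            new = np.empty(2)
            for s in (0, 1):
                stay = best[s]
                switch = best[1 - s] - switch_penalty
                back[s, t] = s if stay >= switch else 1 - s
                new[s] = max(stay, switch) + ll[s, t]
            best = new
        labels = np.empty(T, dtype=int)
        labels[-1] = int(best[1] > best[0])         # better final state
        for t in range(T - 1, 0, -1):
            labels[t - 1] = back[labels[t], t]
        return labels

Contiguous runs of label 1 become the foreground segments passed to recognition.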

Acoustic Features
• 3 feature streams with separate acoustic models:
  – Mel cepstrum
  – PLP cepstrum (implementation from ICSI)
  – Fourier cepstrum
• Each feature stream has 39 dimensions: 13 cepstra, 13 deltas, and 13 delta-deltas (see the sketch below)
• Features were normalized for each speaker:
  – Cepstral mean and variance normalization
  – Vocal tract length normalization
  – Transforms estimated using constrained MLLR
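A minimal sketch (an assumption about the usual recipe, not SRI's front end) of how such a 39-dimensional stream is assembled: append delta and delta-delta coefficients to the 13 base cepstra, then apply per-speaker cepstral mean and variance normalization.

    import numpy as np

    def deltas(x, win=2):
        """Standard regression deltas over a (T, D) feature matrix."""
        T = len(x)
        num = sum(k * (x[np.minimum(np.arange(T) + k, T - 1)]
                       - x[np.maximum(np.arange(T) - k, 0)])
                  for k in range(1, win + 1))
        return num / (2 * sum(k * k for k in range(1, win + 1)))

    def make_stream(cepstra_per_utt):
        """cepstra_per_utt: list of (T, 13) arrays, all from one speaker."""
        feats = [np.hstack([c, deltas(c), deltas(deltas(c))])  # (T, 39)
                 for c in cepstra_per_utt]
        pooled = np.vstack(feats)                  # pool the speaker's frames
        mean, std = pooled.mean(0), pooled.std(0) + 1e-8
        return [(f - mean) / std for f in feats]   # per-speaker CMVN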

Acoustic Models
• 6 different acoustic models: 3 front ends × (crossword + non-crossword)
• All models gender-independent
• Trained on SPINE1 training + eval data and SPINE2 training data
• Bottom-up clustered triphone states ("genones")
• Non-crossword models contained about 1000 genones with 32 Gaussians per genone
• Crossword models contained about 1400 genones with 32 Gaussians per genone

Discriminative Acoustic Training
• All models were first trained using standard maximum likelihood (ML) training
• Subsequently, one additional iteration of discriminative training using maximum mutual information estimation (MMIE) was applied (the criterion is given below)

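The slide does not spell out the objective; for reference, the standard MMIE criterion raises the likelihood of the reference transcript relative to all competing word sequences:

    F_{\mathrm{MMIE}}(\lambda) = \sum_{r} \log
        \frac{p_\lambda(O_r \mid M_{w_r})\, P(w_r)}
             {\sum_{w} p_\lambda(O_r \mid M_w)\, P(w)}

where O_r is the r-th training utterance, w_r its reference transcript, M_w the acoustic model for word sequence w, and P(w) the language model probability. In practice the denominator is approximated with lattices or N-best lists.
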
Acoustic Adaptation
• Adaptation was applied in two different ways
  – Feature normalization using constrained MLLR (the underlying algebra is sketched below):
    • Feature normalization transforms were computed using a reference model trained from VTL- and cepstral mean/variance-normalized data
    • A global model transform was computed using the constrained MLLR algorithm, and its inverse was used as the feature transform
    • Equivalent to speaker-adaptive training (Jin et al., 1998)
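For reference, the standard constrained-MLLR algebra behind the "inverse" remark: constraining the means and covariances to share one transform,

    \hat{\mu} = A\mu + b, \qquad \hat{\Sigma} = A\,\Sigma\,A^{\top},

is equivalent, up to a constant Jacobian factor |A|^{-1}, to leaving the model untouched and transforming every feature vector by

    \hat{x} = A^{-1}(x - b),

since \mathcal{N}(x;\hat{\mu},\hat{\Sigma}) = |A|^{-1}\,\mathcal{N}(A^{-1}(x-b);\mu,\Sigma). This is why the inverse of the estimated model transform can be applied directly to the features.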

Acoustic Adaptation (continued)
• Model adaptation using modified MLLR:
  – Acoustic models were adapted using a variant of MLLR that does variance scaling in addition to mean transformation
  – 7 phone classes were used to compute the transforms

Language Models
• 3 language models (4 evaluation systems):
  – SRI LM1: trained on SPINE1 training + eval data and SPINE2 training + dry run data (SRI1, SRI2)
  – SRI LM2: trained on SPINE1 training + eval data and SPINE2 training data (SRI3)
  – CMU LM: modified to include multiword n-grams (SRI4)
• Trigrams used in decoding, 4-grams in rescoring
• Note: SRI4 had a bug in LM conversion
  – Official result: 42.1%; corrected result: 36.5%

Class-based Language Model
• Goal: overcome the mismatch between the 2000 and 2001 task vocabularies (new grid vocabulary)
• Approach (similar to CU and IBM; a sketch of the expansion follows below):
  – Map the 2000 and 2001 grid vocabularies to word classes
  – 2 classes: grid words and spelled grid words
  – Expand the word classes with uniform probabilities over the 2001 grid vocabulary
• The eval system used only a single word class for non-spelled grid words (unlike IBM, CU)
• X/Y labeling of grid words gives an additional 0.5% win over SRI2 (27.2% final WER)
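A minimal sketch of the class expansion (the interface and word list are hypothetical, chosen only for illustration): the n-gram model predicts a class token, and members of the 2001 grid vocabulary share its probability uniformly.

    import math

    GRID_WORDS_2001 = {"bad", "need", "flow", "dark"}  # illustrative only

    def class_logprob(ngram_logprob, word, history):
        """ngram_logprob(token, history) -> log10 P(token | history)."""
        if word in GRID_WORDS_2001:
            # P(w | h) = P(GRIDWORD | h) * P(w | GRIDWORD), uniform members
            return (ngram_logprob("GRIDWORD", history)
                    + math.log10(1.0 / len(GRID_WORDS_2001)))
        return ngram_logprob(word, history)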

Automatic Grid Word Tagging
• Problem: grid words are ambiguous:
  – "We are at bad and need, bad and need" (grid coordinates)
  – versus "That's why we missed so bad" (ordinary usage)
• Solution (a toy version is sketched below):
  – Build an HMM tagger for grid words
  – Ambiguous grid words are generated by two states: GRIDLABEL or self
  – State transitions given by a trigram LM
  – HMM parameters estimated from unambiguous words
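A toy sketch of the tagging decision (interfaces hypothetical): for a short utterance one can exhaustively score every labeling of the ambiguous words under a trigram LM and keep the best; the actual HMM tagger would perform the equivalent search efficiently with Viterbi dynamic programming.

    from itertools import product

    def tag_grid_words(words, is_ambiguous, trigram_logprob):
        """trigram_logprob(sequence) -> total LM log-probability."""
        slots = [i for i, w in enumerate(words) if is_ambiguous(w)]
        best, best_score = list(words), float("-inf")
        for choice in product((False, True), repeat=len(slots)):
            cand = list(words)
            for i, use_class in zip(slots, choice):
                if use_class:
                    cand[i] = "GRIDLABEL"   # the class state generated it
            score = trigram_logprob(cand)
            if score > best_score:
                best, best_score = cand, score
        return best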

Other LM Issues
• Interpolating the SPINE1 + SPINE2 models with optimized weights is better than pooling the data (a sketch follows below)
• Automatic grid word tagging is better than blindly replacing grid words with classes ("naïve" classes)
• Dry run performance, first decoding pass:
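A minimal sketch of the interpolation idea (hypothetical interfaces; assumes smoothed models with nonzero probabilities): mix the two n-gram models at the probability level and pick the weight that maximizes held-out log-likelihood, i.e. minimizes perplexity.

    import math

    def interp_logprob(lam, p1, p2, word, history):
        """p1, p2: functions (word, history) -> P(word | history)."""
        return math.log(lam * p1(word, history)
                        + (1 - lam) * p2(word, history))

    def best_weight(p1, p2, heldout, grid=21):
        """heldout: list of (word, history) pairs; simple grid search."""
        lams = [i / (grid - 1) for i in range(grid)]
        return max(lams, key=lambda lam: sum(interp_logprob(lam, p1, p2, w, h)
                                             for w, h in heldout))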

Word Posterior-based Decoding
• Word posterior computation:
  – N-best hypotheses obtained for each acoustic model
  – Hypotheses rescored with new knowledge sources: pronunciation probabilities and a class 4-gram LM
  – Hypotheses aligned into word confusion "sausages"
  – Score weights and posterior scaling factors jointly optimized for each system for minimum WER
• Decoding from sausages (sketched below):
  – Pick the highest-posterior word at each position
  – Reject words with posteriors below a threshold (likely an incorrect word, noise, or background speech)
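A minimal sketch of the sausage-decoding step (data layout and threshold value are illustrative): each slot maps candidate words, plus None for a skip, to posteriors; take the argmax per slot and reject low-posterior words.

    def decode_sausage(slots, reject_below=0.3):
        """slots: list of dicts mapping word (or None) to posterior."""
        out = []
        for dist in slots:
            word, post = max(dist.items(), key=lambda kv: kv[1])
            if word is not None and post >= reject_below:
                out.append(word)
        return out

    # decode_sausage([{"we": 0.9, None: 0.1},
    #                 {"are": 0.5, "were": 0.45, None: 0.05}])
    # -> ["we", "are"]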

Word Posterior-based Adaptation and System Combination
• System combination (sketched below):
  – Two or more systems are combined by aligning their N-best lists into a single sausage (N-best ROVER)
  – Word posteriors are weighted averages over all systems
  – The final combination weights all three systems equally
• Adaptation:
  – 2 of the 3 systems were combined round-robin to generate improved hypotheses for model readaptation of the third system
  – This maintains system diversity for the next combination step
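A minimal sketch of the combination rule (it assumes the systems' sausages are already aligned slot-for-slot, which the real N-best ROVER alignment takes care of): the combined posterior for each word is a weighted average of the per-system posteriors.

    def combine(sausages, weights=None):
        """sausages: list of aligned sausages; each is a list of {word: p}."""
        n = len(sausages)
        weights = weights or [1.0 / n] * n   # equal weights, as on the slide
        combined = []
        for slot in zip(*sausages):          # one slot per word position
            dist = {}
            for wt, d in zip(weights, slot):
                for word, post in d.items():
                    dist[word] = dist.get(word, 0.0) + wt * post
            combined.append(dist)
        return combined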

Processing Steps
1. Segment waveforms.
2. Compute VTL and cepstral mean and variance normalizations.
3. Recognize using gender-independent (GI), non-crossword (non-CW) acoustic models and 3-gram multiword language models.
The following steps are done for all 3 feature streams:
4. Compute feature transformations for all speakers.
5. Recognize using transformed features.

Processing Steps
6. Adapt the CW and non-CW acoustic models for each speaker.
7. Use the non-CW acoustic models and 2-gram language models to generate lattices. Expand the lattices using 3-gram language models.
8. Dump N-best hypotheses from the lattices using the CW speaker-adapted acoustic models.
9. Rescore the N-best lists using multiple knowledge sources (KSs) and combine them using ROVER to produce the 1-best hypothesis.

Processing Steps
10. Readapt the acoustic models using the hypotheses from Step 9. For each feature model, use the hypotheses from the other two feature models.
11. Dump N-best lists from the lattices using the acoustic models from Step 10.
12. Combine the N-best lists using N-best ROVER.

Processing Steps
The following steps are for SRI1 only:
13. Adapt acoustic models trained on all data, including the dry run data, using the hypotheses from Step 12.
14. Dump N-best hypotheses.
15. Combine all systems to generate the final hypotheses. Do forced alignment to generate the CTM file.

Results

SPINE 2001 Dry Run Results

SPINE 2001 Evaluation Results

What Worked?
• Improved segmentation:
  – New segments were less than 1% absolute worse in recognition than true (reference) segments
  – Last year, we lost 5.4% to segmentation

What Worked? (continued)
• Feature SAT:
  – Typical win was 4% absolute or more
• 3-way system combination:
  – WER reduced by 3% absolute or more
• Class-based language model:
  – Improvement of 2% (4-5% in early decoding stages)
• Acoustic model parameter optimization:
  – Win of 2% absolute or more

What Worked? (continued) " MMIE training – MMIE trained acoustic models were about 1% abs. better than ML trained models. " Word rejection with posterior threshold – 0.5% win in segmentation – 0.1% win in final system combination " Acoustic readaptation after system combination – 0.4% absolute win. " SPINE2001 system was about 15% absolute better than our SPINE2000 system.

SPINE1 Performance
• SPINE1 evaluation result: 46.1%
• SPINE1 workshop result: 33.7%
  – Energy-based segmentation
  – Cross-word acoustic models
• Current system on the SPINE1 eval set: 18.5%
  – Using only SPINE1 training data

What Did Not Work
• Spectral subtraction
• Duration modeling:
  – Marginal improvement, unlike our Hub5 results
  – Too little training data?
• Dialog modeling:
  – Small win observed in initial experiments, but no improvement on the dry run

Fourier Cepstrum Revisited
• Fourier cepstrum = IDFT(log(spectral energy)) (sketched below)
• Past research (Davis & Mermelstein, 1980) showed that the Fourier cepstrum is inferior to the mel-frequency cepstrum (MFC)
• None of the current ASR systems use Fourier cepstra
• Our experiments support this, but we also found that adaptation can improve its performance significantly
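A minimal sketch of the definition on the slide (frame length and coefficient count are illustrative choices): the Fourier cepstrum of a windowed frame is the inverse DFT of the log spectral energy, with no mel filterbank in between, unlike MFCC.

    import numpy as np

    def fourier_cepstrum(frame, n_coeffs=13, eps=1e-10):
        """frame: 1-D array of speech samples (one analysis frame)."""
        spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
        log_energy = np.log(np.abs(spectrum) ** 2 + eps)  # log spectral energy
        cepstrum = np.fft.irfft(log_energy)               # IDFT -> quefrency
        return cepstrum[:n_coeffs]                        # keep 13 coefficients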

Fourier cepstral features (continued)

• Why does feature adaptation produce a significant performance improvement?
  – Does the DCT decorrelate features better than the DFT?
  – What is the role of frequency warping in the MFC?
• Can we reject any new feature based on a single recognition experiment?

Evaluation Issues
• System development was complicated by the lack of a proper development set (one that is not part of the training set)
• Suggestion: use the previous year's eval set for development (assuming the task stays the same)
• Make a standard segmenter available to sites that want to focus on recognition

Future Work
• Noise modeling
• Optimize front ends and system combination for noise conditions
• New features
• The language model is very important but task-specific: how do we "discover" structure in the data?
• Model interaction between conversants

Conclusions
• 15% absolute improvement since the SPINE1 workshop
• Biggest winners:
  – Segmentation
  – Acoustic adaptation
  – System combination
  – Class-based language modeling
• Contrary to popular belief, the Fourier cepstrum performs as well as MFCC or PLP
• New features need to be tested in a full system!