An overview of decoding techniques for LVCSR

An overview of decoding techniques for LVCSR. Author: Xavier L. Aubert

Outline
Introduction
General formulation of the decoding problem
Representation of the knowledge sources
Classification of decoding methods
Heuristic techniques to further reduce the search space
Experimental results
Conclusion

Introduction
Decoding is basically a search process that uncovers the word sequence with the maximum posterior probability for a given acoustic observation.
Why decoding strategies are needed: the size of the search space and the computation time.
This study has been structured along two main axes:
Static vs. dynamic expansion of the search space
Time-synchronous vs. asynchronous decoding

General formulation of the decoding problem Heuristic LM factor
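
In standard notation, with a heuristic scale factor applied to the language model (the exponent symbol alpha is conventional usage, not taken verbatim from the slide), the decoding criterion the slide refers to can be written as:

```latex
\hat{W} \;=\; \operatorname*{argmax}_{W} \; P(W)^{\alpha} \, P(X \mid W)
```

where $X$ is the sequence of acoustic observations, $P(X \mid W)$ the acoustic model likelihood and $P(W)$ the language model probability; in the log domain the product becomes a weighted sum of the two scores.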

General formulation of the decoding problem (cont.)
Recombination principle: select the "best" among several paths in the network as soon as it appears that these paths have identically scored extensions.
Pruning principle: discard the unpromising paths.

General formulation of the decoding problem (cont.)
Main actions to be performed by any decoder:
Generating hypothetical word sequences, usually by successive extensions.
Scoring the "active" hypotheses using the knowledge sources.
Recombining, i.e. merging, paths according to the knowledge sources.
Pruning to discard the most unpromising paths.
Creating "back-pointers" to retrieve the best sentence.

Representation of the knowledge sources
Use of a stochastic m-gram LM
Prefix-tree organization of the lexicon
Context-dependent phonetic constraints

Use of a stochastic m-gram LM
The LM introduces constraints upon the word sequences. Two implications:
The search network is fully branched at the word level, each word being possibly followed by any other;
The word probabilities depend on the m-1 predecessor words: $P(w_n \mid w_{n-m+1}, \ldots, w_{n-1})$.

Prefix-tree organization of the lexicon
Figure 1. Prefix-tree structure of the lexicon with LM look-ahead (Aubert 2002).
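
To make the prefix-tree idea concrete, here is a minimal Python sketch (all names and the toy data are hypothetical, not taken from the paper): it builds a phonetic prefix tree from a pronunciation dictionary and attaches a unigram LM look-ahead value to each node, i.e. the best LM probability of any word reachable below that node.

```python
class TreeNode:
    def __init__(self):
        self.children = {}       # phone symbol -> TreeNode
        self.words = []          # words whose pronunciation ends here
        self.lm_lookahead = 0.0  # best LM prob of any word reachable below

def build_prefix_tree(lexicon, unigram):
    """lexicon: {word: [phone, ...]}, unigram: {word: prob}."""
    root = TreeNode()
    for word, phones in lexicon.items():
        node = root
        for ph in phones:
            node = node.children.setdefault(ph, TreeNode())
        node.words.append(word)
    attach_lookahead(root, unigram)
    return root

def attach_lookahead(node, unigram):
    """LM look-ahead: best LM probability of any word below this node."""
    best = max((unigram.get(w, 0.0) for w in node.words), default=0.0)
    for child in node.children.values():
        best = max(best, attach_lookahead(child, unigram))
    node.lm_lookahead = best
    return best

# Tiny illustrative lexicon (hypothetical data)
lexicon = {"cat": ["k", "ae", "t"], "can": ["k", "ae", "n"], "dog": ["d", "ao", "g"]}
unigram = {"cat": 0.5, "can": 0.3, "dog": 0.2}
root = build_prefix_tree(lexicon, unigram)
print(root.children["k"].lm_lookahead)   # 0.5: best word below the "k" branch
```

During the search, a hypothesis sitting on a tree node can be pruned against the sum of its acoustic score and this look-ahead value, long before the word identity is known.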

Context-dependent phonetic constraints
The last triphone arc of the predecessor word must be replicated.
Figure 2. Cross-word (CW) vs. non-CW triphone transitions (Aubert 2002).

Classification of decoding methods
Figure 3. Classification "tree" of decoding techniques (Aubert 2002).

Static network expansion
Sparsity of knowledge sources and network redundancies
Central role of the m-gram LM in the search network
Weighted finite-state transducer (WFST) method

Sparsity of knowledge sources and network redundancies
There are two main sources of potential reduction of the network size:
exploiting the sparsity of the knowledge sources;
detecting and taking advantage of the redundancies.

Central role of the m-gram LM in the search network
Figure 4. Interpolated backing-off bigram using a null node (Aubert 2002).

Central role of the m-gram LM in the search network (cont.)
Figure 5. Bigram network with null node and successor trees (Antoniol et al., 1995).

Weighted finite-state transducer (WFST) method
Semiring: a ring that may lack negation. It has two associative operations, $\oplus$ and $\otimes$, that are closed over the set $K$; they have identities $\bar{0}$ and $\bar{1}$, respectively, and $\otimes$ distributes over $\oplus$. For example, $(\mathbb{R}_+, +, \times, 0, 1)$ is the probability semiring.
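
As an illustration (a minimal sketch, not from the paper), the probability and tropical semirings can be expressed as plain Python objects; a generic WFST algorithm only needs the two operations and their identities.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Semiring:
    plus: Callable[[float, float], float]   # "collect" operation (the semiring plus)
    times: Callable[[float, float], float]  # "extend" operation (the semiring times)
    zero: float                             # identity of plus
    one: float                              # identity of times

# Probability semiring: (R+, +, x, 0, 1)
PROBABILITY = Semiring(plus=lambda a, b: a + b,
                       times=lambda a, b: a * b,
                       zero=0.0, one=1.0)

# Tropical semiring: (R+ U {inf}, min, +, inf, 0), used for Viterbi-style best-path scores
TROPICAL = Semiring(plus=min,
                    times=lambda a, b: a + b,
                    zero=float("inf"), one=0.0)
```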

WFST (cont.)
Formally, a WFST over the semiring $K$ is given by an input alphabet $\Sigma$, an output alphabet $\Delta$, a finite set of states $Q$, a finite set of transitions $E \subseteq Q \times (\Sigma \cup \{\epsilon\}) \times (\Delta \cup \{\epsilon\}) \times K \times Q$, an initial state $i \in Q$, a set of final states $F \subseteq Q$, an initial weight $\lambda$ and a final weight function $\rho$.

WFST (cont.)
Figure 6. Weighted finite-state transducer examples (Mohri et al., 2000); transitions are labeled input:output, with the initial and final states marked.

WFST (cont.)
Three operations:
Composition/Intersection
Determinization
Minimization

WFST (cont.): Composition/Intersection
Figure 7. Example of transducer composition (Mohri et al., 2000).
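
A minimal sketch of epsilon-free composition over a semiring (the dictionary representation and names are mine, not the paper's): the composed machine explores pairs of states, matches the output label of the first transducer against the input label of the second, and combines the weights with the semiring times operation.

```python
def compose(t1, t2, sr):
    """Epsilon-free WFST composition (sketch).

    A transducer is represented as
      {'start': q0, 'finals': {q: final_weight},
       'arcs': {q: [(in_label, out_label, weight, next_q), ...]}}
    and sr is a Semiring as in the earlier sketch.
    """
    start = (t1['start'], t2['start'])
    arcs, finals = {}, {}
    stack, seen = [start], {start}
    while stack:
        q1, q2 = q = stack.pop()
        arcs[q] = []
        # Match: output label of t1 must equal input label of t2.
        for i1, o1, w1, n1 in t1['arcs'].get(q1, []):
            for i2, o2, w2, n2 in t2['arcs'].get(q2, []):
                if o1 == i2:
                    nxt = (n1, n2)
                    arcs[q].append((i1, o2, sr.times(w1, w2), nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
        if q1 in t1['finals'] and q2 in t2['finals']:
            finals[q] = sr.times(t1['finals'][q1], t2['finals'][q2])
    return {'start': start, 'finals': finals, 'arcs': arcs}
```

In a static-expansion decoder, repeated application of this operation is what combines the HMM, context, lexicon and grammar transducers into a single search network, which is then determinized and minimized.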

WFST (cont.): Determinization
Figure 8. Example of transducer determinization (Mohri et al., 2000).

WFST (cont.): Minimization
Figure 9. Example of transducer minimization (Mohri et al., 2000).

Dynamic search network expansion
Re-entrant lexical tree (word-conditioned search)
Start-synchronous tree (time-conditioned search)
A comparison of time-conditioned and word-conditioned search techniques
Asynchronous stack decoding

Re-entrant lexical tree
$Q_{uv}(t,s)$: score of the best path up to time $t$ that ends in state $s$ of the lexical tree for the two-word history $(u,v)$.
$B_{uv}(t,s)$: starting time of the best path up to time $t$ that ends in state $s$ of the lexical tree for the two-word history $(u,v)$.
The dynamic programming recursion within the tree copies:
$Q_{uv}(t,s) = \max_{s'} \{\, q(x_t, s \mid s') \cdot Q_{uv}(t-1, s') \,\}$
$B_{uv}(t,s) = B_{uv}(t-1, s^{\max}_{uv}(t,s))$
where $q(x_t, s \mid s')$ is the product of transition and emission probabilities of the underlying HMM and $s^{\max}_{uv}(t,s)$ denotes the optimum predecessor state for the hypothesis $(t,s)$ and two-word history $(u,v)$.

Re-entrant lexical tree (cont.)
Recombination equation at the word boundaries:
$H(v, w; t) = \max_u \{\, p(w \mid u, v) \cdot Q_{uv}(t, S_w) \,\}$
where $p(w \mid u, v)$ is the conditional trigram probability for the word triple $(u,v,w)$ and $S_w$ denotes a terminal state of the lexical tree for the word $w$.
To start up new words, we have to pass on the score and the time index:
$Q_{vw}(t, s=0) = H(v, w; t)$, $B_{vw}(t, s=0) = t$.
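
A compact Python sketch of one time frame of this word-conditioned search, in the log domain (so products become sums); the containers and helper functions are hypothetical, not the paper's. Each active $(u,v)$ history owns a copy of the lexical tree, the within-tree step follows the HMM transitions, and terminal states are recombined into new $(v,w)$ histories with the trigram score.

```python
def frame_update(active, tree, trigram, emission, t):
    """One time frame of word-conditioned (re-entrant tree) search, log domain.

    active:  {(u, v): {state: (score, start_time)}}  one tree copy per LM history
    tree:    {'pred':     {s: [s_prev, ...]},         static lexical prefix tree
              'loglik':   {(s_prev, s): log_prob},    HMM transition log-probs
              'word_end': {s: word}}                  terminal states
    trigram(w, u, v) -> log p(w | u, v); emission(s, t) -> acoustic log-likelihood.
    """
    new_active = {}
    # Within-tree DP: Q_uv(t, s) = max_s' [ q(x_t, s | s') + Q_uv(t-1, s') ]
    for (u, v), states in active.items():
        updated = {}
        for s, preds in tree['pred'].items():
            best = None
            for sp in preds:
                if sp in states:
                    cand = states[sp][0] + tree['loglik'][(sp, s)] + emission(s, t)
                    if best is None or cand > best[0]:
                        best = (cand, states[sp][1])   # propagate the word start time
            if best is not None:
                updated[s] = best
        new_active[(u, v)] = updated
    # Word-boundary recombination: Q_vw(t, 0) = max_u [ log p(w|u,v) + Q_uv(t, S_w) ]
    for (u, v), states in list(new_active.items()):
        for s, (score, _) in list(states.items()):
            w = tree['word_end'].get(s)
            if w is None:
                continue
            h = score + trigram(w, u, v)
            root = new_active.setdefault((v, w), {})
            if 0 not in root or h > root[0][0]:
                root[0] = (h, t)   # state 0 = tree root, new start time t
    return new_active
```

Beam pruning and back-pointer bookkeeping, omitted here, would be applied on top of exactly these two steps.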

Re-entrant lexical tree (cont.)
The word-history-conditioned DP organization
The per-state stack organization
Integration of CW contexts

The word-history-conditioned DP organization
The order of the three dependent coordinates: LM-State -> Arc-Id -> State-Id
Figure 9. Search organization conditioned on word histories (Ney et al., 1992).

The per-state stack organization
The order of the three dependent coordinates: Arc-Id -> State-Id -> LM-State
Figure 10. Search organization using the per-state stack (Alleva et al., 1996).
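
The difference between the two organizations is essentially the nesting order of the same three coordinates; a schematic sketch with toy, hypothetical values:

```python
# Toy coordinates (hypothetical): one tree arc, one HMM state, one hypothesis
arc_id, state_id = 7, 2
hyp = (-123.4, 56)   # (path score, word start time)

# Word-history-conditioned organization: LM-State -> Arc-Id -> State-Id
# (one lexical-tree copy per active LM history)
history_conditioned = {("we", "are"): {arc_id: {state_id: hyp}}}

# Per-state stack organization: Arc-Id -> State-Id -> LM-State
# (one shared tree; each HMM state keeps a small stack of LM histories)
per_state_stack = {arc_id: {state_id: {("we", "are"): hyp}}}
```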

Integration of CW contexts
Figure 11. CW transitions with an optional "long" pause (Aubert 2002).

Start-synchronous tree
$G_{uv}(\tau)$: probability that the acoustic vectors $x_1 \ldots x_\tau$ are generated by a word/state sequence with $uv$ as the last two words and $\tau$ as the word boundary.
The dynamic programming equation propagates, for each start time $\tau$, the score $h(w; \tau, t)$ of word $w$ over the acoustic vectors $x_{\tau+1} \ldots x_t$ within the start-synchronous tree.
Recombination equation:
$G_{vw}(t) = \max_{\tau} \max_{u} \{\, p(w \mid u, v) \cdot G_{uv}(\tau) \cdot h(w; \tau, t) \,\}$
where $p(w \mid u, v)$ is the conditional trigram probability.

A comparison of time-conditioned and word-conditioned search techniques
For the same number of active states, the average number of active trees per time frame in the time-conditioned method is typically much lower than in the word-conditioned method.
However, the computational effort for the LM recombination is much greater in the time-conditioned search.

Asynchronous stack decoding
Implements a best-first tree search which proceeds by extending, word by word, one or several selected hypotheses, without the constraint that they all end at the same time.
Three problems to be solved in a stack decoder:
Which theory(ies) should be selected for extension
How to efficiently compute one-word continuations
How to get "reference" score values for pruning

Asynchronous stack decoding (cont.)
Which theory(ies) should be selected for extension: essentially depends on which information is available regarding the not-yet-decoded part of the sentence.
How to efficiently compute one-word continuations: computed with the start-synchronous tree method or using a fast-match algorithm.
How to get "reference" score values for pruning: by progressively updating the best likelihood scores that can be achieved along the time axis by a path having complete word extensions.
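
A minimal best-first (stack) decoding loop in Python, again a sketch with hypothetical helper functions: partial theories live in a priority queue ordered by their score plus an optimistic estimate of the remaining part of the utterance, and the selected theory is extended by one word at a time.

```python
import heapq

def stack_decode(T, one_word_extensions, remaining_estimate):
    """Best-first word-level (stack) search, sketched with hypothetical helpers.

    one_word_extensions(words, t): yields (next_word, end_time, delta_logprob),
        e.g. computed with a start-synchronous tree or a fast match.
    remaining_estimate(t): optimistic log-prob for frames t..T (pruning reference).
    """
    # Entries: (-(score + estimate of the rest), score, end_time, word_sequence)
    stack = [(-remaining_estimate(0), 0.0, 0, ())]
    best_complete = (float("-inf"), ())
    while stack:
        _, score, t, words = heapq.heappop(stack)
        if t == T:                                    # theory covers the whole utterance
            if score > best_complete[0]:
                best_complete = (score, words)
            continue
        if score + remaining_estimate(t) <= best_complete[0]:
            continue                                  # cannot beat the best complete theory
        for w, t_end, delta in one_word_extensions(words, t):
            s = score + delta
            heapq.heappush(stack, (-(s + remaining_estimate(t_end)), s, t_end, words + (w,)))
    return best_complete
```

The quality of remaining_estimate is exactly the third problem listed above: the tighter the reference scores, the fewer theories need to be popped before the best complete sentence is found.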

Heuristic techniques to further reduce the search space
Decoupling the LM from the acoustic-phonetic constraints
Acoustic look-ahead pruning

Decoupling the LM from the acoustic-phonetic constraints
Interaction between the LM contribution and the word boundaries: the word-boundary optimization step.
Re-entrant tree: the boundary optimization is carried out implicitly by the dynamic programming recombination.
Start-synchronous tree: carried out explicitly over all start times that are produced by different "start trees".
Delayed LM incorporation with heuristic boundary optimization: the LM is applied after the word expansion has been completed, where $\tau(w; t)$ is the start time of $w$ for each m-tuple ending with $w$ at $t$ and $h(w; \tau, t)$ is the score of the word model $w$ ending at $t$.

Acoustic look-ahead pruning
Principle of a fast acoustic match: providing a short list of word candidates to extend the most likely theories.
Phoneme look-ahead in time-synchronous decoders:
Figure 12. Combined acoustic and LM look-ahead in time-synchronous search (Aubert & Blasig, 2000); LM stands for language model, LA for look-ahead and AC for acoustic model.
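
A sketch of how the two look-ahead terms enter the pruning decision (the function names are hypothetical): a hypothesis inside the tree survives only if its score, boosted by the LM look-ahead of its tree node and by a phoneme look-ahead estimate over the next few frames, stays within a beam of the current best anticipated score.

```python
def survives_pruning(hyp_score, node, t, best_anticipated, beam,
                     lm_lookahead, acoustic_lookahead, horizon=5):
    """Combined LM + acoustic look-ahead pruning test (sketch, log domain).

    lm_lookahead(node): best LM log-prob of any word reachable below this tree node.
    acoustic_lookahead(node, t, horizon): optimistic acoustic log-score for the next
        `horizon` frames, obtained from a simplified fast-match model.
    """
    anticipated = (hyp_score
                   + lm_lookahead(node)
                   + acoustic_lookahead(node, t, horizon))
    return anticipated >= best_anticipated - beam
```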

Experimental results

Conclusion
Pros and cons of decoding techniques:
static network expansion using WFST
time-synchronous dynamic search
stack decoders

Conclusion (cont.)
Avenues that are currently being studied and appear definitely worth pursuing:
hybrid expansion strategies
increasing importance of word graphs
integration of very-long-range syntactical constraints