An overview of decoding techniques for LVCSR Author: Xavier L. Aubert
Outline Introduction General formulation of the decoding problem Representation of the knowledge sources Classification of decoding methods Heuristic techniques to further reduce the search space Experimental results Conclusion
Introduction Decoding is basically a search process to uncover the word sequence that has the maximum posterior probability for the given acoustic observation. Why decoding strategies are needed: the size of the search space and the time needed to explore it. This study has been structured along two main axes: Static vs. dynamic expansion of the search space Time-synchronous vs. asynchronous decoding
General formulation of the decoding problem
The Bayes decision rule with a heuristic LM scaling factor α:
W* = argmax over W of { p(W)^α · Pr(X | W) }
where W is a word sequence, X the acoustic observation, p(W) the language model probability, and Pr(X | W) the acoustic likelihood.
General formulation of the decoding problem (cont) Recombination principle: select the "best" among several paths in the network as soon as it appears that these paths have identically scored extensions. Pruning principle: discard the unpromising paths.
General formulation of the decoding problem (cont) Main actions to be performed by any decoder: Generating hypothetical word sequences, usually by successive extensions. Scoring the "active" hypotheses using the knowledge sources. Recombining, i.e. merging paths according to the knowledge sources. Pruning to discard the most unpromising paths. Creating "back-pointers" to retrieve the best sentence.
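The five actions above can be sketched in a toy time-synchronous Viterbi beam search over an HMM (a minimal illustration with hypothetical names and data, not the implementation of any particular decoder):

```python
import math

def viterbi_beam(obs, states, log_trans, log_emit, beam=10.0):
    # hyps: state -> (score, back-pointer path) -- the active hypotheses
    hyps = {s: (0.0, []) for s in states}
    for o in obs:
        new_hyps = {}
        for s, (score, path) in hyps.items():
            # extension: generate successor hypotheses for each state
            for s2 in states:
                step = (log_trans.get((s, s2), -math.inf)
                        + log_emit[s2].get(o, -math.inf))
                cand = score + step  # scoring with the knowledge sources
                # recombination: keep only the best path into state s2
                if s2 not in new_hyps or cand > new_hyps[s2][0]:
                    new_hyps[s2] = (cand, path + [s2])
        # pruning: discard hypotheses far below the current best score
        best = max(v[0] for v in new_hyps.values())
        hyps = {s: v for s, v in new_hyps.items() if v[0] >= best - beam}
    # back-pointers: recover the best state sequence
    best_state = max(hyps, key=lambda s: hyps[s][0])
    return hyps[best_state][1], hyps[best_state][0]
```

The recombination step is what keeps the number of hypotheses bounded by the number of states rather than the number of paths.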
Representation of the knowledge sources Use of stochastic m-gram LM Prefix-tree organization of the lexicon Context-dependent phonetic constraints
Use of stochastic m-gram LM Introducing constraints upon the word sequences. Two implications: The search network is fully branched at the word level, each word being possibly followed by any other; The word probabilities depend on the m-1 predecessors: p(w_n | w_(n-m+1) … w_(n-1))
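As a concrete illustration (toy maximum-likelihood counts, hypothetical class name), a minimal m-gram model that conditions each word on its m-1 predecessors:

```python
from collections import defaultdict

class NGramLM:
    """Toy m-gram LM: p(w_n | w_{n-m+1} ... w_{n-1}) from raw counts."""

    def __init__(self, m, sentences):
        self.m = m
        self.counts = defaultdict(int)  # m-gram counts
        self.hist = defaultdict(int)    # (m-1)-gram history counts
        for sent in sentences:
            words = ["<s>"] * (m - 1) + sent
            for i in range(m - 1, len(words)):
                h = tuple(words[i - m + 1:i])
                self.counts[h + (words[i],)] += 1
                self.hist[h] += 1

    def prob(self, word, history):
        # condition only on the last m-1 words of the history
        h = tuple(history[-(self.m - 1):])
        if self.hist[h] == 0:
            return 0.0
        return self.counts[h + (word,)] / self.hist[h]
```

A real LM would add smoothing/backing-off (see the null-node bigram network later in these slides); this sketch only shows the history dependence.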
Prefix-tree organization of the lexicon Figure 1. Prefix tree structure of the lexicon with LM look-ahead (Aubert 2002)
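A minimal sketch of such a prefix tree: words sharing an initial phone sequence share tree arcs, and each node carries an LM look-ahead score (here the best unigram probability of any word below it; the names and the unigram-based look-ahead are illustrative assumptions):

```python
class TreeNode:
    def __init__(self):
        self.children = {}     # phone -> TreeNode
        self.words = []        # words ending at this node
        self.lm_lookahead = 0.0

def build_lexical_tree(lexicon, unigram):
    """lexicon: word -> phone list; unigram: word -> probability."""
    root = TreeNode()
    for word, phones in lexicon.items():
        node = root
        for ph in phones:
            node = node.children.setdefault(ph, TreeNode())
            # look-ahead: best LM score among words reachable below this node
            node.lm_lookahead = max(node.lm_lookahead, unigram[word])
        node.words.append(word)
    return root
```

The shared prefixes reduce the number of arcs to evaluate, but the word identity is only known at the leaves, which is why the LM look-ahead scores are useful for early pruning.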
Context-dependent phonetic constraints The last triphone arc of the predecessor word must be replicated Figure 2. cross-word (CW) vs. non-CW triphone transitions (Aubert 2002)
Classification of decoding methods Figure 3. Classification "tree" of decoding techniques (Aubert 2002)
Static network expansion Sparsity of knowledge sources and network redundancies Central role of the m-gram LM in the search network Weighted finite state transducer method (WFST)
Sparsity of knowledge sources and network redundancies There are two main sources of potential reduction of the network size: exploiting the sparsity of the knowledge sources, and detecting and taking advantage of the redundancies.
Central role of the m-gram LM in the search network Figure 4. Interpolated backing-off bigram using a null node (Aubert 2002)
Central role of the m-gram LM in the search network(cont) Figure 5. Bigram network with null node and successor trees (Antoniol et al., 1995).
Weighted finite state transducer method (WFST) Semiring: a ring that may lack negation. It has two associative operations ⊕ and ⊗ that are closed over the set K; they have identities 0̄ and 1̄, respectively, and ⊗ distributes over ⊕. For example, (ℝ₊, +, ×, 0, 1) is the probability semiring.
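To make the definition concrete, two common semirings can be written as (⊕, ⊗, 0̄, 1̄) tuples of operations and used to weight paths (an illustrative sketch, not a toolkit API):

```python
# Probability semiring: sum over paths, multiply along a path.
probability = (lambda a, b: a + b,   # ⊕
               lambda a, b: a * b,   # ⊗
               0.0,                  # 0̄, identity of ⊕
               1.0)                  # 1̄, identity of ⊗

# Tropical semiring: best (lowest) cost over paths, add costs along a path.
tropical = (min,                     # ⊕
            lambda a, b: a + b,      # ⊗
            float("inf"),            # 0̄
            0.0)                     # 1̄

def total_weight(paths, semiring):
    """⊕-sum over paths of the ⊗-product of each path's arc weights."""
    plus, times, zero, one = semiring
    total = zero
    for path in paths:
        w = one
        for arc in path:
            w = times(w, arc)
        total = plus(total, w)
    return total
```

The same shortest-distance algorithms work for both: only the semiring operations change.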
WFST (cont) Formally, a WFST over the semiring K is given by an input alphabet Σ, an output alphabet Δ, a finite set of states Q, a finite set of transitions E ⊆ Q × (Σ ∪ {ε}) × (Δ ∪ {ε}) × K × Q, an initial state i ∈ Q, a set of final states F ⊆ Q, an initial weight λ, and a final weight function ρ.
WFST (cont) Figure 6. Weighted finite-state transducer examples; transitions are labeled input:output, and the initial and final states are marked. (Mohri et al. 2000)
WFST (cont) Three operations: Composition/Intersection, Determinization, Minimization
WFST(cont) Composition/Intersection Figure 7. Example of transducer composition. (Mohri et al. 2000)
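A sketch of epsilon-free weighted composition in the tropical semiring: states of the result are pairs (q1, q2), an arc exists when the output label of the first transducer matches the input label of the second, and weights add. The dict-based representation is a toy assumption; real toolkits also handle ε-labels with composition filters.

```python
def compose(t1, t2):
    """Each transducer: dict state -> list of (in, out, weight, next_state).
    State 0 is assumed initial in both; returns arcs keyed by pair states."""
    result = {}
    stack = [(0, 0)]
    seen = {(0, 0)}
    while stack:
        q1, q2 = stack.pop()
        arcs = []
        for (i1, o1, w1, n1) in t1.get(q1, []):
            for (i2, o2, w2, n2) in t2.get(q2, []):
                if o1 == i2:  # output of T1 must match input of T2
                    arcs.append((i1, o2, w1 + w2, (n1, n2)))
                    if (n1, n2) not in seen:
                        seen.add((n1, n2))
                        stack.append((n1, n2))
        result[(q1, q2)] = arcs
    return result
```

Composition is what cascades the LM, lexicon, and context-dependency transducers into a single search network.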
WFST (cont) Determinization Figure 8. Example of transducer determinization. (Mohri et al. 2000)
WFST (cont) Minimization Figure 8. Example of transducer minimization. (Mohri et al. 2000)
Dynamic search network expansion Re-entrant lexical tree (word-conditioned search) Start-synchronous tree (time-conditioned search) A comparison of time-conditioned and word-conditioned search techniques Asynchronous stack decoding
Re-entrant lexical tree
Q_uv(t, s): score of the best path up to time t that ends in state s of the lexical tree copy for the two-word history (u, v).
B_uv(t, s): starting time of the best path up to time t that ends in state s of the lexical tree copy for the two-word history (u, v).
The dynamic programming recursion within the tree copies:
Q_uv(t, s) = max over s' of { q(x_t, s | s') · Q_uv(t-1, s') }
B_uv(t, s) = B_uv(t-1, s'_max(t, s; u, v))
where q(x_t, s | s') is the product of transition and emission probabilities of the underlying HMM, and s'_max(t, s; u, v) denotes the optimum predecessor state for the hypothesis (t, s) and two-word history (u, v).
Re-entrant lexical tree (cont)
Recombination equation at the word boundaries:
H(v, w; t) = max over u of { p(w | u, v) · Q_uv(t, S_w) }
where p(w | u, v) is the conditional trigram probability for the word triple (u, v, w) and S_w denotes the terminal state of the lexical tree for the word w.
To start up new words, we have to pass on the score and the time index:
Q_vw(t, s = 0) = H(v, w; t)
B_vw(t, s = 0) = t
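The word-boundary recombination can be sketched as follows (toy data structures: for each two-word history (u, v), the best score of the tree copy's terminal state for each ending word w; all names are illustrative):

```python
def recombine(tree_scores, lm):
    """tree_scores: (u, v) -> {w: best terminal-state score for word w}.
    lm(w, u, v): conditional trigram probability p(w | u, v).
    Returns the start-up scores for the new tree copies, keyed by (v, w)."""
    new_starts = {}
    for (u, v), finals in tree_scores.items():
        for w, q in finals.items():
            h = lm(w, u, v) * q
            key = (v, w)  # the oldest history word u is maximized out
            if h > new_starts.get(key, 0.0):
                new_starts[key] = h
    return new_starts
```

The maximization over u is the step that keeps the number of active tree copies bounded by the number of distinct (v, w) pairs.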
Re-entrant lexical tree (cont) The word history conditioned DP organization The per-state stack organization Integration of CW contexts
The word history conditioned DP organization The order of the three dependent coordinates : LM-State->Arc-Id->State-Id Figure 9. Search organization conditioned on word histories (Ney et al., 1992).
The per-state stack organization The order of the three dependent coordinates : Arc-Id->State-Id->LM-State Figure 10. Search organization using the per-state stack (Alleva et al., 1996).
Integration of CW contexts Figure 11. CW transitions with an optional “long” pause. (Aubert 2002)
Start-synchronous tree
H(uv; τ): probability that the acoustic vectors x_1 … x_τ are generated by a word/state sequence with uv as the last two words and τ as the word boundary.
The dynamic programming equation computes, within each start tree, the single-word scores
h(w; τ, t) = Pr(x_(τ+1) … x_t | w)
which are shared by all word histories. Recombination equation:
H(vw; t) = max over u of { p(w | u, v) · max over τ of [ h(w; τ, t) · H(uv; τ) ] }
where p(w | u, v) is the conditional trigram probability.
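The time-conditioned recombination can be sketched with toy dictionaries (illustrative names; h holds the history-independent single-word scores per start time, H the history-conditioned path scores):

```python
def time_conditioned_recombine(H, h, lm, w, t, taus, histories):
    """H[(u, v, tau)]: best score of a path ending at tau with history (u, v).
    h[(w, tau, t)]: acoustic score of word w spanning (tau, t].
    Maximizes jointly over the predecessor u and the boundary time tau."""
    best = {}
    for (u, v) in histories:
        for tau in taus:
            score = (lm(w, u, v)
                     * h.get((w, tau, t), 0.0)
                     * H.get((u, v, tau), 0.0))
            if score > best.get((v, w), 0.0):
                best[(v, w)] = score
    return best
```

Because h(w; τ, t) does not depend on the history, it is computed once per start time and reused, which is the key saving of the time-conditioned organization.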
A comparison of time-conditioned and word-conditioned search techniques For the same number of active states, the average number of active trees per time frame in the time-conditioned method is typically much lower than in the word-conditioned method. The computational effort for the LM recombination, however, is much greater in the time-conditioned search.
Asynchronous stack decoding Implementing a best-first tree search that proceeds by extending, word by word, one or several selected hypotheses, without the constraint that they all end at the same time. Three problems to be solved in a stack decoder: Which theory (or theories) should be selected for extension? How to efficiently compute one-word continuations? How to get "reference" score values for pruning?
Asynchronous stack decoding (cont) Which theory (or theories) should be selected for extension: essentially depends on which information is available regarding the not-yet-decoded part of the sentence. How to efficiently compute one-word continuations: with the start-synchronous tree method or using a fast-match algorithm. How to get "reference" score values for pruning: by progressively updating the best likelihood scores that can be achieved along the time axis by paths with complete word extensions.
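A minimal best-first stack decoder over a toy search space (all names are illustrative; the `extend` callback, which returns scored one-word continuations, stands in for the start-synchronous tree or fast-match step, and the toy scores play the role of normalized log-likelihoods):

```python
import heapq

def stack_decode(extend, init, is_final):
    """Best-first search: the stack is a heap ordered by negated score,
    so the most promising partial hypothesis is always popped first.
    extend(hyp) -> iterable of (word, log_score_delta, new_hyp)."""
    stack = [(-0.0, init, [])]
    while stack:
        neg, hyp, words = heapq.heappop(stack)
        if is_final(hyp):
            return words, -neg  # first final hypothesis popped is the best
        for word, delta, new_hyp in extend(hyp):
            heapq.heappush(stack, (neg - delta, new_hyp, words + [word]))
    return None, float("-inf")
```

Note that hypotheses on the stack may end at different times; without comparable "reference" scores across end times, best-first ordering would be biased toward shorter hypotheses.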
Heuristic techniques to further reduce the search space Decoupling the LM from the acoustic-phonetic constraints Acoustic look-ahead pruning
Decoupling the LM from the acoustic-phonetic constraints
Interaction between LM contribution and word boundaries: the word-boundary optimization step.
Re-entrant tree: the boundary optimization is carried out implicitly in the DP recombination at word ends.
Start-synchronous tree: carried out explicitly over all start times that are produced by the different "start trees".
Delayed LM incorporation with heuristic boundary optimization: the LM is applied only after the word expansion has been completed, where τ(w, t) is the start time of w for each m-tuple ending with w at t, and h(w; τ, t) is the score of the word model w ending at t.
Acoustic look-ahead pruning Principle of a fast acoustic match: providing a short list of word candidates with which to extend the most likely theories. Phoneme look-ahead in time-synchronous decoders. Figure 12. Combined acoustic and LM look-ahead in time-synchronous search (Aubert & Blasig, 2000); LM stands for language model, LA for look-ahead and AC for acoustic model.
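The combined pruning can be sketched as follows: each active hypothesis is ranked by its exact score plus cheap acoustic and LM look-ahead estimates, and the beam is applied to the augmented scores (illustrative names and toy log scores):

```python
def lookahead_prune(hyps, acoustic_la, lm_la, beam):
    """hyps: id -> exact log score so far.
    acoustic_la / lm_la: id -> estimated log score of the look-ahead part
    (fast-match over upcoming frames, best reachable LM probability).
    Returns the set of hypothesis ids surviving the augmented beam."""
    ranked = {h: s + acoustic_la[h] + lm_la[h] for h, s in hyps.items()}
    best = max(ranked.values())
    return {h for h, s in ranked.items() if s >= best - beam}
```

Because the look-ahead terms anticipate upcoming evidence, hypotheses that look good now but lead nowhere can be discarded earlier, allowing tighter beams at the same accuracy.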
Experimental results
Conclusion Pros and cons of decoding techniques: static network expansion using WFST; time-synchronous dynamic search; stack decoders.
Conclusion (cont) Avenues that are currently being studied and appear definitely worth pursuing: hybrid expansion strategies; the increasing importance of word graphs; the integration of very-long-range syntactic constraints.