HIERARCHICAL SEARCH FOR LARGE VOCABULARY CONVERSATIONAL SPEECH RECOGNITION
Authors: Neeraj Deshmukh, Aravind Ganapathiraju and Joseph Picone

2 Outline
The speech recognition problem
Search algorithms
A time-synchronous Viterbi-based decoder
Performance analysis
Summary

3 The speech recognition problem
Many of the fundamentals of the speech communication process are still not clearly understood and defy rigorous mathematical description.
It is important that the system be able to handle a large vocabulary and be independent of speaker and language characteristics such as accents, speaking styles, disfluencies, syntax, and grammar.

4 Search algorithms
Typical search algorithms:
1. Viterbi search
2. Stack decoders
3. Multi-pass search
4. Forward-backward search

5 Search algorithms (cont)
Viterbi search:
A class of breadth-first search techniques.
Time-synchronous Viterbi beam search is used to reduce the search space.
The main problem is that the state-level information cannot readily be merged to reduce the number of required computations.
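To make the breadth-first, frame-by-frame character of the algorithm concrete, here is a minimal sketch of a time-synchronous Viterbi beam search over a generic HMM. It is an illustration only, not the decoder described in these slides; the state set, transition scores and observation scorer are hypothetical inputs.

```python
def viterbi_beam_search(observations, states, trans_logprob, obs_logprob, beam=200.0):
    """Simplified time-synchronous Viterbi search with a score-based beam.

    observations : list of feature frames
    states       : iterable of HMM state identifiers
    trans_logprob: dict mapping (prev_state, state) -> log transition probability
    obs_logprob  : function (state, frame) -> log observation likelihood
    beam         : hypotheses scoring more than `beam` below the best are pruned
    """
    states = list(states)
    # Active hypotheses: state -> (log score, state sequence so far).
    active = {s: (0.0, [s]) for s in states}

    for frame in observations:
        new_active = {}
        for prev, (score, path) in active.items():
            for s in states:
                t = trans_logprob.get((prev, s))
                if t is None:
                    continue
                cand = score + t + obs_logprob(s, frame)
                # Path recombination: only the best path into each state survives.
                if s not in new_active or cand > new_active[s][0]:
                    new_active[s] = (cand, path + [s])

        # Beam pruning: drop hypotheses far below the best score in this frame.
        best = max(v[0] for v in new_active.values())
        active = {s: v for s, v in new_active.items() if v[0] >= best - beam}

    return max(active.values(), key=lambda v: v[0])
```

In a real LVCSR decoder the states are instances of phone model states embedded in a lexical network, but the frame-synchronous propagate-then-prune structure is the same.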

6 Search algorithms (cont)
Stack decoders:
A class of depth-first search techniques.
The score of a path needs to be normalized based on the number of frames of data it spans.
They suffer from problems of speed, size, accuracy and robustness for large-vocabulary spontaneous speech applications.
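The normalization is needed because partial paths of different lengths are not directly comparable. One common choice, shown here only as an illustrative sketch and not necessarily the criterion used by any particular stack decoder, is to compare per-frame log scores:

```latex
% Per-frame normalized score of a partial word sequence w_1^k spanning the
% first T_k frames of the observation sequence o (illustrative form only).
\tilde{S}(w_1^k) = \frac{1}{T_k}\,\log\!\left[ P\!\left(o_1^{T_k} \mid w_1^k\right) P\!\left(w_1^k\right) \right]
```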

7 Search algorithms (cont)
Multi-pass search:
Computationally inexpensive models (e.g., a bigram language model and context-independent phones) are initially used to produce a list of likely word hypotheses.
These hypotheses are then refined using more detailed and computationally demanding models (e.g., trigram language models and cross-word triphones).

8 Search algorithms (cont)
Multi-pass search:
Figure 1: An example of an N-best list of hypotheses generated for a simple utterance.

9 Search algorithms (cont)
Multi-pass search:
Figure 2: The word graph resulting from the N-best list in Figure 1.

10 Search algorithms (cont)
Forward-backward search:
An approximate time-synchronous search in the forward direction is used to facilitate a more complex and expensive search in the backward direction.
The forward pass can be made extremely efficient, even though it is suboptimal.

11 A time-synchronous Viterbi-based decoder
Complexity of search
Search space organization
Search space reduction

12 Complexity of search
Lexicon
Language model: network decoding, N-gram decoding
Acoustic model: context-independent models, context-dependent (word-internal) models, cross-word models

13 Complexity of search (cont)
Network decoding:
The search space is defined by a grammar that specifies the structure of the language in terms of words, or by a word graph generated by a previous recognition pass.
Instances of the same triphone cannot be merged into a single path if they correspond to different nodes in the network.
The complexity of the search and the memory requirements are directly proportional to the size of the expanded network.

14 Complexity of search (cont)
Network decoding:
Figure 3: An example of network decoding using word-internal context-dependent models.

15 Complexity of search (cont)
N-gram decoding: the language model typically consists of only a subset of the possible N-grams, and the likelihood of the remaining words can be estimated using a back-off model (e.g., a bigram).
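For reference, a generic back-off bigram in the spirit of Katz back-off looks as follows; the slides do not specify which particular back-off scheme was used.

```latex
% Generic back-off bigram: d is a discount factor and \alpha(w_{i-1}) the
% back-off weight that renormalizes the distribution.
P(w_i \mid w_{i-1}) =
\begin{cases}
  d \,\dfrac{C(w_{i-1}, w_i)}{C(w_{i-1})} & \text{if } C(w_{i-1}, w_i) > 0, \\[1.5ex]
  \alpha(w_{i-1})\, P(w_i)                & \text{otherwise.}
\end{cases}
```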

16 Complexity of search (cont)
N-gram decoding:
Paths with very different origins can be merged later in time if they have the same current instance, which is now defined by the phone model and the N-gram history (for a bigram, the previous word).
One way to implement the language model is to cache the N-gram scores of all the active words in memory and leave the rest of the language model on disk.
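A minimal sketch of such a cache, assuming a hypothetical `read_bigram_from_disk` accessor for the on-disk language model:

```python
class BigramCache:
    """Keep N-gram scores of the currently active words in memory and fall
    back to the on-disk language model for everything else."""

    def __init__(self, read_bigram_from_disk):
        self._read = read_bigram_from_disk   # (history_word, word) -> log probability
        self._cache = {}

    def activate(self, history_word, active_words):
        """Preload scores for words that have just become active."""
        for w in active_words:
            key = (history_word, w)
            if key not in self._cache:
                self._cache[key] = self._read(history_word, w)

    def score(self, history_word, word):
        key = (history_word, word)
        if key not in self._cache:           # cache miss: read from disk
            self._cache[key] = self._read(history_word, word)
        return self._cache[key]

    def release(self, history_word, words):
        """Drop entries for words that are no longer active."""
        for w in words:
            self._cache.pop((history_word, w), None)
```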

17 Complexity of search (cont)
Cross-word acoustic models:
Figure 4: A small part of the expanded network from Figure 3 using cross-word triphones.

18 Complexity of search (cont)
Cross-word acoustic models:
Figure 5: An overview of the relative complexity of the search problem, showing the impact of various types of acoustic and language models.

19 Search space organization
Lexical trees:
Figure 6: An example lexical tree used in the decoder. The dark circles represent starts and ends of words; the word identity is unknown until a word-end lexical node is reached.
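A minimal sketch of how such a lexical (pronunciation prefix) tree might be built from a pronunciation dictionary; the node layout is an assumption for illustration, not the decoder's actual data structure.

```python
class LexNode:
    """A node of the lexical (pronunciation prefix) tree."""
    def __init__(self, phone=None):
        self.phone = phone      # phone label on the arc into this node
        self.children = {}      # phone -> LexNode
        self.words = []         # words whose pronunciation ends at this node

def build_lexical_tree(lexicon):
    """lexicon: dict mapping word -> list of phones (one pronunciation per word)."""
    root = LexNode()
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node.children.setdefault(phone, LexNode(phone))
        node.words.append(word)  # the word identity is only known at this node
    return root

# Toy example: three words sharing the prefix "k ae".
tree = build_lexical_tree({
    "cat": ["k", "ae", "t"],
    "can": ["k", "ae", "n"],
    "cab": ["k", "ae", "b"],
})
```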

20 Search space organization (cont)
Language model lookahead:
Delaying the application of the LM score until the word end allows undesirable growth in the complexity of the search.
To overcome this problem, the nodes internal to a word store the maximum LM score of all the words covered by that lexical node.
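Continuing the LexNode sketch above, the lookahead scores can be pushed up the tree with a single recursive pass. In practice the lookahead depends on the current language model history (or is approximated, e.g. with unigram scores); this sketch ignores that detail.

```python
import math

def compute_lm_lookahead(node, lm_score):
    """Annotate each LexNode with the best LM score of any word reachable
    through it (language model lookahead).

    lm_score : function word -> log language model score for the current history
    """
    best = max((lm_score(w) for w in node.words), default=-math.inf)
    for child in node.children.values():
        best = max(best, compute_lm_lookahead(child, lm_score))
    node.lookahead = best   # used as the LM contribution at this internal node
    return best
```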

21 Search space organization (cont)
Acoustic evaluation:
The likelihood score, once evaluated, is stored locally with the state information and reused whenever that state is revisited in the same frame.
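This is simple memoization keyed on the state within the current frame; a minimal sketch with a hypothetical scoring callback:

```python
class AcousticScoreCache:
    """Cache state log-likelihoods within a frame so that revisited states
    are not re-evaluated; the cached scores are only valid for one frame."""

    def __init__(self, compute_log_likelihood):
        self._compute = compute_log_likelihood   # (state_id, frame) -> log likelihood
        self._cache = {}

    def new_frame(self):
        self._cache.clear()                      # scores from the previous frame are stale

    def score(self, state_id, frame):
        if state_id not in self._cache:
            self._cache[state_id] = self._compute(state_id, frame)
        return self._cache[state_id]
```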

22 Search space reduction
Pruning
Path merging
Word graph compaction

23 Search space reduction (cont)
Pruning: identify low-scoring partial paths that have a very low probability of getting any better, and stop propagating them further. Some commonly used heuristics are:
Setting pruning beams based on the hypothesis score
Limiting the total number of model instances active at a given time (maximum active phone model instance pruning)
Setting an upper bound on the number of words allowed to end at a given frame (maximum active word-end pruning)

24 Search space reduction (cont)
Setting pruning beams based on the hypothesis score: separate beam thresholds are applied at the state level, the phone level, and the word level.
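In the usual formulation (not necessarily the exact equations from the original slides), a hypothesis at a given level survives only if its score lies within a fixed beam of the best score at that level in the current frame:

```latex
% Generic beam pruning at level x (state, phone or word): keep hypothesis h
% at time t only if its score is within the beam width B_x of the best score.
S_x(h, t) \;\geq\; \max_{h'} S_x(h', t) \;-\; B_x
```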

25 Search space reduction (cont)
Setting pruning beams based on the hypothesis score:
The identity of a word is known with much higher confidence at the end of the word than at its beginning, so it is beneficial to curb the fan-out caused by the language model's list of possible next words.
The word-level threshold is therefore usually tighter than the state-level and phone-level beams.

26 Search space reduction (cont)
Setting pruning beams based on the hypothesis score:
Figure 7: Effect of beam widths on the recognition accuracy and complexity of the search.

27 Search space reduction (cont)
Maximum active phone model instance (MAPMI) pruning:
Each partial path (instance) is identified by the current node in the lexical tree, the identity of the phone model being evaluated, and the last completely evaluated word defined by that path.
The idea is to limit the number of these instances active at any time.
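A minimal sketch of this kind of instance-count pruning, assuming each active instance is keyed as described above and carries its best partial-path log score:

```python
import heapq

def mapmi_prune(instances, max_instances):
    """Maximum active phone model instance (MAPMI) pruning.

    instances     : dict mapping an instance key
                    (lexical tree node, phone model, last completed word)
                    to its best partial-path log score
    max_instances : upper bound on the number of instances kept active
    """
    if len(instances) <= max_instances:
        return instances
    # Keep only the highest-scoring instances; the rest are deactivated.
    survivors = heapq.nlargest(max_instances, instances.items(), key=lambda kv: kv[1])
    return dict(survivors)
```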

28 Search space reduction (cont)
Maximum active phone model instance (MAPMI) pruning:
Figure 8: Effect of MAPMI pruning on memory usage, illustrated on a 68-frame utterance from the SWB corpus during word graph generation.

29 Search space reduction (cont)
Maximum active phone model instance (MAPMI) pruning:
Figure 9: Effect of MAPMI pruning on the recognition accuracy and complexity of the search, on a subset of the SWB corpus.

30 Search space reduction (cont)
Maximum active word-end pruning: only the few word ends associated with the highest-likelihood path scores are propagated.

31 Search space reduction (cont)
Path merging: by sharing the evaluation of similar parts of different hypotheses, the decoder can reduce the computational load.
At the word level, for example, if more than one active path leads to the end of a word, then only the best path among them is propagated further.

32 Search space reduction (cont)
Path merging:
The lexical tree automatically ensures that all the partial hypotheses it represents have an identical linguistic context.
During word graph generation, only one path is propagated, but multiple path histories are preserved through a sorted backpointer list.
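A minimal sketch of this bookkeeping with hypothetical record types: the best-scoring path is the one propagated further, while the alternatives are kept as sorted backpointers so they can still appear as word graph edges.

```python
from dataclasses import dataclass, field

@dataclass
class WordEnd:
    word: str
    score: float                  # total log score of the path ending here
    backpointers: list = field(default_factory=list)  # alternative histories, best first

def merge_word_ends(candidates):
    """Merge all active paths reaching the end of the same word at the same frame.

    Returns the single WordEnd that is propagated further; the losing paths
    are retained (sorted by score) for later word graph construction.
    """
    ranked = sorted(candidates, key=lambda w: w.score, reverse=True)
    best = ranked[0]
    best.backpointers.extend(ranked[1:])
    return best
```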

33 Search space reduction (cont)
Word graph compaction: a word graph often contains multiple instances of the same word sequence, each with a different alignment with respect to time.
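One simple way to realize this compaction (an illustration only, not necessarily the exact algorithm used here) is to key word graph edges by their word label and surrounding word context rather than by their exact start and end times, so that duplicates differing only in time alignment collapse into a single edge:

```python
def compact_word_graph(edges):
    """Collapse word graph edges that differ only in their time alignment.

    edges : iterable of (from_history, word, to_history, start, end, score)
            tuples; the time stamps are dropped and duplicate edges are
            merged, keeping the best score.
    """
    compact = {}
    for from_hist, word, to_hist, _start, _end, score in edges:
        key = (from_hist, word, to_hist)
        if key not in compact or score > compact[key]:
            compact[key] = score
    return compact
```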

34 Search space reduction (cont)
Word graph compaction:
Figure 10: An illustration of the original word graph.

35 Search space reduction (cont)
Word graph compaction: the word graph with the time stamps removed.
Figure 11: An illustration of the compacted word graph from Figure 10.

36 Performance analysis
Several experiments were run on two different corpora, the OGI Alphadigits and SWITCHBOARD, using both word-internal and cross-word triphone models.
Hardware: a 333 MHz Pentium II processor with 512 MB of memory.

37 Performance analysis (cont)
OGI Alphadigits corpus (OGI-AD):
A database of telephone speech collected from approximately 3000 subjects.
The vocabulary consisted of the letters of the alphabet as well as the digits 0 through 9.
Each subject spoke a list of either 19 or 29 alphanumeric strings.

38 Performance analysis (cont)
OGI-AD language model:
Figure 12: The language model for the Alphadigits corpus is a fully connected grammar.

39 Performance analysis (cont) OGI-AD results Table 1. An analysis of performance on the OGI-AD task for network decoding.

40 Performance analysis (cont)
Memory usage varies with the length of the utterance.
Figure 13: Memory and run time for word graph rescoring as a function of utterance length, using word-internal models.

41 Performance analysis (cont)
Memory usage varies with the length of the utterance.
Figure 14: Memory and run time for word graph rescoring as a function of utterance length, using cross-word models.

42 Performance analysis (cont)
Memory usage varies with the length of the utterance.
Figure 15: Memory and run time for word graph generation as a function of utterance length, using word-internal models.

43 Performance analysis (cont)
Switchboard (SWB) corpus:
The task consists of recognizing spontaneous conversational speech collected over standard telephone lines.
Decoding with cross-word acoustic models is a challenge on this task.

44 Performance analysis (cont)
SWB is currently one of the most challenging benchmarks for recognition systems. The reasons are:
Acoustics: a variety of transducers and noisy channels
Language modeling
Pronunciation variation

45 Performance analysis (cont)
Switchboard (SWB) results:
Table 2. An analysis of performance on the LDC-SWB task for rescoring word graphs generated using a bigram language model.

46 Switchboard (SWB) results:
Table 3. Summary of the decoder performance on the LDC-SWB task for word graph generation using a bigram language model.

47 Summary
Future directions in search can be summarized in one word: real-time.
More intelligent pruning algorithms
Multi-pass systems
Fast-matching strategies within the acoustic model
Vector quantization-like approaches