Presentation on theme: "The SRI 2006 Spoken Term Detection System Dimitra Vergyri, Andreas Stolcke, Ramana Rao Gadde, Wen Wang Speech Technology & Research Laboratory SRI International,"— Presentation transcript:
The SRI 2006 Spoken Term Detection System Dimitra Vergyri, Andreas Stolcke, Ramana Rao Gadde, Wen Wang Speech Technology & Research Laboratory SRI International, Menlo Park, CA
Dec. 14, 2006STD-06 Workshop2 Outline STD system overview –STT systems BNews system description CTS system description ConfMtg system description –Indexing N-gram index from word lattices NNet based posterior estimation –Retrieval Time and memory requirements ATWV Results Future work
Dec. 14, 2006STD-06 Workshop3 SRI STD System STT Audio Word Lattices INDEXER N-gram Index with posteriors RETRIEVER Search Terms with Times and Probabilities Indexing step
Dec. 14, 2006STD-06 Workshop4 English BN STT System Single front-end : PLP (52 39 dim) HLDA, feature-space SAT Gender-independent acoustic modeling Decision-tree clustered within-word and cross-word triphones MLE followed by alternating MPE-MMIE acoustic training Acoustic training: Hub4, TDT2+TDT4+TDT4a, BNr1234 subset –MLE training: 3300 hours, MPE training: 1700 hours –2500 x 200 Gaussian for nonCW triphones –3000 x 160 Gaussians for CW triphones Word bigram and 5-gram LM trained on Hub4, TDT, BNr1234 transcripts, Hub4 LM training data, and NABN (cutoff date Nov. 30, 2003) –62k words, 29M bigrams, 27M trigrams, 15M 4-grams, 2.4M 5-grams Duration rescoring (word-specific phone durations) Two-pass decoding First decoding stage unadapted nonCW model with bigram LM Adapted CW models to nonCW output after 5-gram LM and duration model lattice rescoring Lattice constrained decoding with MLLR adapted, SAT, cw model
Dec. 14, 2006STD-06 Workshop5 English BN STT System 2-gram Lattices PLP MPE nonCW PLP MPE CW Adapted 5-gram Lattices Legend Decoding/rescoring step Hyps for MLLR or output Lattice generation/use Lattice or 1-best output Runtimes: 2.5xRT for unadapted lattices 5.4xRT for adapted lattices ~10% relative WER improvement after adaptation Both decoding stages use shortlists. 4-gram Lattices
Dec. 14, 2006STD-06 Workshop6 English CTS STT System Two front-ends: –MFCC + voicing + MLP-features ( dim) –PLP (52 39 dim) HLDA, feature-space SAT Gender-dependent acoustic modeling Decision-tree clustered within-word and cross-word triphones MLE followed by alternating MPE-MMIE acoustic training Acoustic training: all Hub5 + Fisher training –2500 x 128 x 2 Gaussians for nonCW triphones –3000 x 128 x 2 Gaussians for CW triphones Prosodic rescoring (word-specific phone durations, pause trigram) Word bigram and 4-gram LM Interpolated + pruned LM trained on CTS, BN, and Web data –48k words, 16M bigrams, 16M trigrams, 12M 4grams First lattice generation uses phone-loop MLLR nonCW MFCC and 2-gram LM Second constrained lattice generation uses cross-adapted CW SAT PLP models.
Dec. 14, 2006STD-06 Workshop7 English Meeting STT System [Stolcke et al., MLMI’05; Janin et al., MLMI’06] Based on CTS system architecture (2-pass system) Combination of CTS (narrow-band) and BN (wide-band) base models Acoustic models adapted to distant-mic meeting recordings using MMI-MAP MLP features adapted for meeting recordings by incremental training Mixture language model trained on meetings, CTS, and Web data System used in RT-06S meeting evaluation, co- developed with ICSI
Dec. 14, 2006STD-06 Workshop8 English CTS & Confmtg STT Systems Legend Decoding/rescoring step Hyps for MLLR or output Lattice generation/use Lattice or 1-best output 2-gram Lattices MFCC-MLP MPE nonCW PLP MPE CW Adapted 4-gram lattices 3-gram Lattices CTS runtime: 1.8xRT for unadapted lattices 2.5xRT for adapted lattices Confmtg runtime: 5.4xRT for unadapted lattices 6.8xRT for adapted lattices CTS system uses Gaussian shortlists in first pass only Confmtg system does not use shortlists.
Dec. 14, 2006STD-06 Workshop9 English STT Result Summary (WER) 10.5% eval %10.7%BN STD-dev06eval % eval % eval %17.0%CTS STD-dev06dev % dev % eval04s 44.2%Confmtg STD-dev06 STD-dev06 WER measured using references constructed from RTTM files Systematic differences compared to standard STT references For example, BN scoring does not exclude commercial segments Note: STT systems were not especially tuned for STD; used configurations inherited from STT evaluations.
Dec. 14, 2006STD-06 Workshop10 Indexing of Word Lattices SRILM lattice-tool dumps all word 1-grams to 5-grams in lattices, along with side information –Posterior probabilities based on normalized recognizer scores –Start/end times, channel, waveform name –0.5s time tolerance to merge same N-grams with different times –Pronunciations (to detect OOV words, not used yet) –N-grams with posterior < are omitted to keep index size reasonable Index = term occurrence table sorted by N-gram Indexing function incorporated in SRILM release –Lattice-tool –write-ngram-index option –Downloadable from
Dec. 14, 2006STD-06 Workshop11 Score Calibration Neural net maps posteriors to unbiased STD scores –Input features used: audio source (bnews/cts/confmtg), LM joint probability, LM N-gram length, #words, duration, lattice posterior –Used LnkNet software for training MLP to predict correctness of hypothesized term (1 hidden layer with 10 nodes) –Cross entropy objective function Neural net trained using the dev06 term list Training on raw data improved Occurrence Weighted Value, not A-Term Weighted Value –Also required re-tuning the posterior threshold. Resample training data to approximate ATWV –Downsample/upsample within occurrences of each term to have equal number of training samples for each term. –Posterior threshold 0.5 ended up being optimal for ATWV (at least on the training data).
Dec. 14, 2006STD-06 Workshop12 Searching & Retrieval Convert the search terms into a sorted list Run the Unix “join” command between the index list obtained in indexing and the term list YES/NO decision based on the posterior threshold 0.5 Run time almost independent of the size of the search list (depends on the index size)
Dec. 14, 2006STD-06 Workshop13 Time and Memory Requirements The system was run on 3GB, 3.4 GHz Intel hyperthreading CPU Both index size and search time can be significantly reduced if we keep only candidates with high posterior STT runtimes were incorrectly measured in submitted sysdesc s26760 s58560 sSTT run time Search time needed for all terms Indexing 13 s 530K 37Mb 602K 37Mb 944K 74Mb Index size (# terms/MB) 2711 sNNet run time 493 sIndex from lattice Confmtg (2h) CTS (3h) BN (3h)
Dec. 14, 2006STD-06 Workshop14 STD Results Extra dev consists of RT02, RT03 (BN+CTS), dev04 (CTS+ConfMtg), RT04s (ConfMtg) Difficult to debug eval06 (no references were given), but the result on meetings seems much lower than on dev sets. Possibly overtrained neural net on meetings condition. Occ.WV/ATWV 0.790/ / / / / / / /0.817 Extra dev --- / / / eval / / / / / / / /0.818 dryrun / / / / / / / /0.865 dev No NNet With NNet Confmtg No NNet With NNet No NNet With NNet No NNet With NNet All CTS BN Thres.
Dec. 14, 2006STD-06 Workshop15 Future Work Current system does not cover detection of terms with OOVs. Possible approaches: –Map the unknown search terms to the known vocabulary (OGI work, gave about 2-3% improvement on BNews). –Use of phone recognition and phone-based indexing for OOVs –Hybrid word+graphone recognizer outputs both words and “graphone” units that can match OOVs (Bisani & Ney 2005) Improve the score mapper –Bigger devset needed to avoid overtraining –Other models (decision tree, logistic regression) Found some mismatch between ASR vocabulary and term lists. Apply normalization rules to fix common problems (found about 0.3% relative improvement with few simple rules) Tune STT systems for indexing speed