Progress of Sphinx 3.X: From X=5 to X=6
Arthur Chan, Evandro Gouvea, David J. Huggins-Daines, Alex I. Rudnicky, Mosur Ravishankar, Yitao Sun



If you want to leave now……
Take-home message 1: Sphinx 3.6 rocks!

Here is another one……
Take-home message 2: We need better acoustic models.

This talk (~37 pages)
- Overview (6 pages)
- Better Software Architecture (9 pages)
- Speed of Sphinx 3.6 (3 pages)
- Accuracy Improvement (7 pages)
- Functionalities Improvement (3 pages)
- Documentation (4 pages)
- Sphinx 3.X (X>6) and Conclusion (~5 pages)
- Discussion (10 mins?)

Overview of CMU Sphinx

What is CMU Sphinx?
- Definition 1: Large-vocabulary speech recognizers with high accuracy and speed.
- Definition 2: A collection of tools and resources that enables developers and researchers to build successful speech recognition systems.

Family of CMU Sphinx
- Decoders
  - Sphinx {II – IV}
  - PocketSphinx (by Dave, Oct 2005)
- Acoustic Model Trainer
  - SphinxTrain
- Documentation
  - Hieroglyphs
  - Robust/SphinxTrain Tutorial

Sphinx Developers
- Sphinx is maintained by volunteer programmers/researchers who like speech recognition
  - Funded by different projects
  - Motivated by different reasons
- All contributions go to the same codebase
- Goal: sustainable development of Sphinx
- Sphinx Developer Meetings are held regularly (secretly) to decide the direction of Sphinx

What is Sphinx 3.X?
- An extension of Sphinx 3’s recognizers
- “Sphinx 3.X (X=6)” means “Sphinx 3.6”
- Provides more functionality, such as
  - Real-time speech recognition
  - Speaker adaptation
  - Developer application interfaces (APIs)
  - Different search algorithms
- 3.X (X>3) is motivated by Projects CALO and GALE

Development History of Sphinx 3.X
- S3: Sphinx 3 flat-lexicon recognizer (s3 slow)
- S3.2: Sphinx 3 tree-lexicon recognizer (s3 fast)
- S3.3: live-mode demo
- S3.4: fast GMM, class-based LM, dynamic LM
- S3.5: some support for speaker adaptation; live-mode APIs
- 3.6
  - 3.X/3.0 merge
  - Better search architecture/implementation
  - More support for speaker adaptation
  - Gentle re-factoring of the code base
  - Some support for FSG decoding and confidence
  - Better documentation/tutorial
  - lm_convert (lm3g2dmp)
  - dp

This talk – Progress of Sphinx 3.6
- From the perspective of a developer / an observer
- Sphinx 3.6: Where are we now? Where will we go?
- Summary of 5 talks

Software Architecture of Sphinx 3.X (X=6)

Motivation of Re-Architecting Sphinx 3.X
- We are starting to need new search algorithms
  - Developing a new search algorithm carries risk
  - We don’t want to throw away the old one
  - Mere replacement could cause backward-compatibility problems
- The code has grown to a stage where
  - Some changes could be very hard
  - Multiple programmers are active at the same time
  - CVS conflicts could become frequent if things are controlled by “if-else” structures

Architecture of Sphinx 3.X (X<6)
- Batch sequential architecture (Shaw 96)
- Each executable has customized sub-routines
  - decode, livepretend: Command Line 1; Initialization 1 (kb and kbcore); GMM Computation 1 (approx_cont_mgau); Search 1; Process Controller 1
  - decode_anytopo: Command Line 2; Initialization 2; GMM Computation 2 (using gauden & senone, Method 1); Search 2; Process Controller 2
  - align: Command Line 3; Initialization 3; GMM Computation 3 (using gauden & senone, Method 2); Search 3; Process Controller 3
  - allphone: Command Line 4; Initialization 4; GMM Computation 4 (using gauden & senone, Method 3); Search 4; Process Controller 4

Architecture Diagram of Sphinx 3.6
- Applications: decode, livepretend, align, allphone, dag, astar, decode (anytopo), User Defined Applications
- Controllers/Abstractions: livedecode API, Search Controller, Process Controller, Search Initializer, Command Line Processor
- Implementations: Fast Single Stream GMM Computation, Multi Stream GMM Computation, FSG Search, Flat Lexicon Search, Tree Lexicon Search
- Libraries: Dictionary Library, Search Library, LM Library, AM Library, Utility Library, Feature Library, Miscellaneous Library

Separation of Mechanism and Implementation
- Search Mechanism Module (srch.c)
  - A class that provides Atomic Search Operations (ASOs) in the form of function pointers
  - Configured by just setting function pointers
  - A single interface for applications
- Search Implementation Modules (srch_????.c)
  - Could have many of them
  - Possibilities:
    A. Decoding with different implementations
    B. A concept of search that includes alignment, phoneme recognition and keyword spotting
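The function-pointer arrangement on this slide can be pictured with a small sketch. It is written in Python rather than the C of srch.c, and every name in it (SearchOps, run_search, the two toy implementations) is hypothetical; the callables only stand in for C function pointers.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SearchOps:
    """Hypothetical table of search operations, standing in for a C struct
    of function pointers. None of these field names come from srch.c."""
    init: Callable[[], dict]
    search_one_frame: Callable[[dict, List[float]], None]
    finish: Callable[[dict], str]


def run_search(ops: SearchOps, frames: List[List[float]]) -> str:
    """The 'mechanism': one driver loop that knows nothing about the
    implementation behind the table."""
    state = ops.init()
    for frame in frames:
        ops.search_one_frame(state, frame)
    return ops.finish(state)


# Toy implementation 1: only counts the frames it sees.
def _count_init() -> dict:
    return {"n": 0}


def _count_step(state: dict, frame: List[float]) -> None:
    state["n"] += 1


def _count_finish(state: dict) -> str:
    return f"{state['n']} frames"


# Toy implementation 2: tracks the largest feature value seen so far.
def _peak_init() -> dict:
    return {"peak": float("-inf")}


def _peak_step(state: dict, frame: List[float]) -> None:
    state["peak"] = max(state["peak"], max(frame))


def _peak_finish(state: dict) -> str:
    return f"peak {state['peak']:.1f}"


# "Configuring" an implementation is just filling in the table.
COUNT_FRAMES = SearchOps(_count_init, _count_step, _count_finish)
TRACK_PEAK = SearchOps(_peak_init, _peak_step, _peak_finish)
```

The application calls run_search the same way regardless of which table it passes in, which is the "single interface" idea in miniature.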

Search Mechanism Module – What does it do?
Computation of one frame:
- GMM compute
  - Select active CD senones
  - Compute approximate GMM scores (CI senones)
  - Compute detailed GMM scores (CD senones)
- Search for one frame
  - Compute detailed HMM scores (CD)
  - Propagate graph (phone level)
  - Rescore at word ends using high-level KS (e.g. LM)
  - Propagate graph (word level)

Search Implementations
- Implemented (-op_mode)
  - Finite State Grammar Search (Mode 2)
  - Flat Lexicon Search (Mode 3)
  - Tree Search (Mode 4)
- Not in 3.6
  - Aligner (Mode 0)
  - Phoneme recognition (Mode 1)
  - A new tree search (Mode 5)

Different ways to implement search implementations
1. Use the default implementation
   - Just specify all the atomic search operations (ASOs) provided
2. Override “search_one_frame”
   - Only need to specify GMM computation and how to “search_one_frame”
3. Override the whole mechanism
   - For people who dislike the default so much
   - Override how to “search”
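The three levels of customization can be sketched roughly as below. This is an assumption-laden Python illustration, not the srch.c interface: the class SearchImpl, the function run, and the ASO names compute_gmm and propagate are all made up for the example, and the real default path involves more operations than two.

```python
from typing import Callable, List, Optional

Frame = List[float]


class SearchImpl:
    """Hypothetical bundle of optional hooks; unset hooks fall back to the
    default mechanism in run()."""

    def __init__(
        self,
        compute_gmm: Optional[Callable[[Frame], List[float]]] = None,
        propagate: Optional[Callable[[list, List[float]], None]] = None,
        search_one_frame: Optional[Callable[[list, List[float]], None]] = None,
        search: Optional[Callable[[List[Frame]], list]] = None,
    ) -> None:
        self.compute_gmm = compute_gmm
        self.propagate = propagate
        self.search_one_frame = search_one_frame
        self.search = search


def run(impl: SearchImpl, frames: List[Frame]) -> list:
    # Way 3: the implementation replaces the whole mechanism.
    if impl.search is not None:
        return impl.search(frames)

    state: list = []
    for frame in frames:
        scores = impl.compute_gmm(frame)
        if impl.search_one_frame is not None:
            # Way 2: GMM computation plus a custom per-frame search.
            impl.search_one_frame(state, scores)
        else:
            # Way 1: the default per-frame search built from the ASOs.
            impl.propagate(state, scores)
    return state


def double(frame: Frame) -> List[float]:
    """Toy 'GMM computation': just doubles every feature value."""
    return [2.0 * x for x in frame]


way1 = SearchImpl(compute_gmm=double,
                  propagate=lambda st, sc: st.append(max(sc)))
way2 = SearchImpl(compute_gmm=double,
                  search_one_frame=lambda st, sc: st.append(min(sc)))
way3 = SearchImpl(search=lambda frames: [len(frames)])
```

The design point is that an implementer only writes as much as they disagree with: the further down the list, the more of the default loop is replaced.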

Consequence of Re-factoring
- Calling decode
  - Could use flat-lexicon decoding as well
- decode_anytopo still exists
  - For backward compatibility
  - decode_anytopo = decode
- allphone, align and decode_anytopo could use fast GMM computation
- decode could use S3’s SCHMM
- The command line is now synchronized

Summary on the Architecture of Sphinx 3.6
- A gentle re-factoring has been carried out
- A more flexible architecture
- A better playground for AM and search people
  - S2 SCHMM computation routine?
  - NN, SVM, ML techniques for AM?

Speed of Sphinx 3.6

Speed in Sphinx 3.6
- Further work on Context-Independent Senone-based GMM Selection (CIGMMS)
  - 20-30% speed-up
- 3 tricks were proposed
  - Fixed amount of CD senone computation
  - Use of the best Gaussian index
  - Tightening factor for the CI-phone beam
- Published in “On Improvements of CI-based GMM Selection” (Chan 2005)
  - but not very well received
  - Admittedly, there is some accuracy loss
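As a rough illustration of the idea behind CI-based GMM selection, the sketch below scores a CD senone exactly only when its parent CI senone is within a beam of the best CI score, and otherwise backs off to the CI score. The optional cap mimics the "fixed amount of CD senone computation" trick. This is a simplified reading in Python, not the Sphinx 3.6 code; the function name, arguments and senone labels are hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple


def cigmms_frame(
    ci_scores: Dict[str, float],
    cd_parent: Dict[str, str],
    exact_cd_score: Callable[[str], float],
    beam: float,
    max_cd: Optional[int] = None,
) -> Tuple[Dict[str, float], int]:
    """Score all CD senones for one frame (log-domain scores, higher is better).

    ci_scores      : score of every CI senone for this frame
    cd_parent      : maps each CD senone to its parent CI senone
    exact_cd_score : expensive function computing a CD senone's true score
    beam           : positive width; parents further than this from the best
                     CI score are treated as unpromising
    max_cd         : optional cap on how many CD senones are scored exactly
    """
    best_ci = max(ci_scores.values())
    # Visit CD senones with the most promising parents first, so that the
    # cap (if any) is spent where it matters most.
    ranked = sorted(cd_parent, key=lambda s: ci_scores[cd_parent[s]], reverse=True)

    scores: Dict[str, float] = {}
    computed = 0
    for senone in ranked:
        parent_score = ci_scores[cd_parent[senone]]
        within_beam = parent_score >= best_ci - beam
        under_cap = max_cd is None or computed < max_cd
        if within_beam and under_cap:
            scores[senone] = exact_cd_score(senone)
            computed += 1
        else:
            # Back off to the cheap CI score instead of computing the GMM.
            scores[senone] = parent_score
    return scores, computed


# Toy data: two CI senones, three CD senones.
CI = {"AA": -10.0, "B": -50.0}
PARENT = {"AA_1": "AA", "AA_2": "AA", "B_1": "B"}
EXACT = {"AA_1": -8.0, "AA_2": -12.0, "B_1": -45.0}
```

With a beam of 20 the two AA senones are scored exactly and B_1 is backed off; the back-off is also where the accuracy loss mentioned on the slide would come from.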

A note on Sphinx 3.6 Speed Performance
- Sphinx 3.X runs under 1xRT in most tasks
  - E.g. Smartnote/Sphinx integration
  - Broadcast News, UNTUNED RESULT: 1.5xRT
- Sphinx 3.X is still slower than Sphinx 2
  - Fast setup of Sphinx 2: use 256-codeword SCHMM
  - Fast setup of Sphinx 3: use senone FCHMM
  - Historical note: a comparable SCHMM setup has 4096 codewords
  - Need benchmarking to truly judge
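For readers unfamiliar with the xRT unit: the real-time factor is processing time divided by audio duration, so values below 1 mean faster than real time. A minimal sketch, with a hypothetical function name and illustrative timings that are not measurements from the talk:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return the real-time factor (xRT): decoding time per second of audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# Illustrative only: 90 s of decoding for a 60 s recording.
example_xrt = real_time_factor(90.0, 60.0)
```

In this made-up case the decoder would be running at 1.5xRT, i.e. slower than real time.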

Speed - Conclusion
- Sphinx 3.X is at a reasonable level
- Sphinx 2 should still be used in speed-critical conditions
- Further work
  - GALE/CALO will still be around in 3.6/3.7
  - Accuracy becomes a stronger motivation than speed

Accuracy Improvement During Sphinx 3.6

Our Immediate Problem
- What helps us more in accuracy?
  - Acoustic modeling?
  - Speaker adaptation?
  - Search improvement?

Accuracy Improvement of Sphinx 3.6 – Speaker Adaptation
- Speaker adaptation techniques are shown to be crucial
  - Even in a tough task (e.g. CALO)
  - 10-15% relative improvement
  - Gain similar to LM/AM modeling work
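A relative improvement is measured against the baseline error rate, not in absolute points. The sketch below shows the arithmetic; the function name is hypothetical and the word error rates are invented for illustration, not results from the talk.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline error rate that was removed."""
    if baseline_wer <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative only: going from 40% to 35% WER is 5 points absolute,
# which is 12.5% relative.
example_gain = relative_improvement(40.0, 35.0)
```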

Accuracy Improvement of Sphinx 3.6 – Speaker Adaptation (cont.)
- Dave has done a great job on
  - Multiple-class MLLR
  - MAP adaptation
- Things to watch
  - Ziad’s VTLN implementation
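For orientation, the textbook forms of the two techniques named here can be sketched as follows: MLLR moves each Gaussian mean through an affine transform shared by its regression class, and MAP interpolates between the prior mean and the adaptation data. This is a generic Python sketch of those formulas, not Sphinx's implementation; all function names, the class labels and the prior weight tau are assumptions for the example.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def mllr_adapt_mean(mean: Vector, A: Matrix, b: Vector) -> Vector:
    """MLLR mean update: mean' = A * mean + b."""
    return [sum(a * m for a, m in zip(row, mean)) + b_i for row, b_i in zip(A, b)]


def adapt_means_by_class(
    means: Dict[str, Vector],
    class_of: Dict[str, str],
    transforms: Dict[str, Tuple[Matrix, Vector]],
) -> Dict[str, Vector]:
    """Multiple-class MLLR: each mean uses the transform of its own
    regression class (the class labels here are hypothetical)."""
    return {
        name: mllr_adapt_mean(mu, *transforms[class_of[name]])
        for name, mu in means.items()
    }


def map_adapt_mean(
    prior_mean: Vector, frames: List[Vector], posteriors: List[float], tau: float
) -> Vector:
    """MAP mean update:
    (tau * prior + sum_t gamma_t * x_t) / (tau + sum_t gamma_t)."""
    occupancy = sum(posteriors)
    dim = len(prior_mean)
    weighted = [sum(g * x[d] for g, x in zip(posteriors, frames)) for d in range(dim)]
    return [(tau * prior_mean[d] + weighted[d]) / (tau + occupancy) for d in range(dim)]
```

With little adaptation data the MAP estimate stays close to the prior mean, while a shared MLLR transform can still move every mean in its class, which is one reason the two are often combined.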

Conclusion in Speaker Adaptation
- Observation in 3.6
  - Speaker adaptation is very important
- What we still need:
  - Maximum likelihood linear transformation (MLLT)
  - Combination of MLLT, MLLR, MAP and VTLN
    - Proved to be additive

Accuracy Improvement of Sphinx Search
- Our attempts in the flat-lexicon decoder
  - Full triphones: 2.5% rel. gain, but 100xRT
  - Full trigram: will give another 5-10 times slowdown
- Difference between the tree and flat-lexicon decoders: 5% relative
- Conclusion: further improvement in search is limited

Accuracy Improvement in Sphinx 3.6 - Modeling
- Mainly on
  - Addition of data (major contributor)
  - Interpolation of LM (very decent gain)
- Things to watch: Yi’s LDA
- Yet to explore
  - Speaker Adaptive Training (SAT)
  - Semi-tied Covariance (STC) matrix
- Conclusion: commodity techniques are still not widely used in Sphinx (bad sign).
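The slide does not say how the language models were interpolated; a common choice is linear interpolation, where each probability is a weighted average of the component models. The sketch below shows that on toy unigram tables. The function names, the weight and the probabilities are all invented for illustration.

```python
from typing import Dict


def interpolate(p_a: float, p_b: float, lam: float) -> float:
    """Linear interpolation of two probabilities with weight lam on the first."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("interpolation weight must be in [0, 1]")
    return lam * p_a + (1.0 - lam) * p_b


def interpolate_unigrams(
    lm_a: Dict[str, float], lm_b: Dict[str, float], lam: float
) -> Dict[str, float]:
    """Interpolate two toy unigram models over the union of their vocabularies.
    A word missing from one model contributes probability 0 from that model."""
    vocab = set(lm_a) | set(lm_b)
    return {w: interpolate(lm_a.get(w, 0.0), lm_b.get(w, 0.0), lam) for w in vocab}


# Toy models: an in-domain one and a broader one.
MEETING_LM = {"meeting": 0.5, "agenda": 0.5}
NEWS_LM = {"meeting": 0.25, "news": 0.75}
MIXED_LM = interpolate_unigrams(MEETING_LM, NEWS_LM, 0.5)
```

Because each component sums to one and the weights sum to one, the mixed model is still a proper distribution.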

Conclusion of Accuracy Improvement
- Healthy development in speaker adaptation
- Improvement in search is hard
- Need 10x effort on acoustic modeling
  - Commodity techniques are still not there
  - Three final keywords: MLLT, SAT, STC
- Priorities: Adaptation > AM, LM > 2 stage Search >> 1st Stage

Other Extensions in Sphinx 3.6

FSG search
- 3.6 supports FSG search
  - Adapted from Sphinx 2’s implementation
- Current issues
  - No lextree implementation
  - Static allocation of all HMMs; not allocated “on demand”
  - FSG transitions represented by an NxN matrix
- Other wish list
  - No histogram pruning
  - No state-based implementation
- Needs more testing
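To see why an NxN transition matrix is listed as an issue, compare it with an adjacency list for a sparse grammar. The sketch below is a generic Python illustration, not the Sphinx FSG data structure; the function names and the three-word grammar are hypothetical.

```python
from typing import List, Optional, Tuple

Arc = Tuple[int, int, str]  # (source state, destination state, word)


def dense_transitions(n_states: int, arcs: List[Arc]) -> List[List[Optional[str]]]:
    """N x N table: cell [src][dst] holds the word on that arc, or None.
    In this sketch a cell can hold only one arc per state pair."""
    matrix: List[List[Optional[str]]] = [[None] * n_states for _ in range(n_states)]
    for src, dst, word in arcs:
        matrix[src][dst] = word
    return matrix


def sparse_transitions(n_states: int, arcs: List[Arc]) -> List[List[Tuple[int, str]]]:
    """Adjacency list: each state stores only its outgoing arcs."""
    adjacency: List[List[Tuple[int, str]]] = [[] for _ in range(n_states)]
    for src, dst, word in arcs:
        adjacency[src].append((dst, word))
    return adjacency


def cell_count(table: List[list]) -> int:
    """Number of stored entries, used here as a crude memory proxy."""
    return sum(len(row) for row in table)


# A linear 4-state grammar accepting "call home now".
CHAIN = [(0, 1, "call"), (1, 2, "home"), (2, 3, "now")]
DENSE = dense_transitions(4, CHAIN)
SPARSE = sparse_transitions(4, CHAIN)
```

For this chain the dense table holds 16 cells for 3 arcs, and the gap grows quadratically with the number of states, while the adjacency list grows with the number of arcs.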

Confidence Annotation
- conf
  - Adapted from Rong with permission
  - Computes the posterior probability of a word given a lattice
  - Still in progress
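A word posterior on a lattice is usually obtained with a forward-backward pass: the total score of all paths through an edge divided by the total score of all paths. The sketch below shows that generic computation in Python; it is not the conf program, and the lattice encoding (nodes numbered in topological order, node 0 as start, last node as end, one combined log score per edge, every node reachable) is an assumption made for the example.

```python
import math
from typing import List, Tuple

Edge = Tuple[int, int, str, float]  # (src node, dst node, word, log score)


def logadd(a: float, b: float) -> float:
    """Stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    hi, lo = (a, b) if a > b else (b, a)
    return hi + math.log1p(math.exp(lo - hi))


def edge_posteriors(n_nodes: int, edges: List[Edge]) -> List[Tuple[str, float]]:
    """Posterior of each edge, assuming src < dst for every edge."""
    # Forward pass: alpha[n] is the log-sum of all partial paths start -> n.
    alpha = [-math.inf] * n_nodes
    alpha[0] = 0.0
    for src, dst, _word, score in sorted(edges, key=lambda e: e[0]):
        alpha[dst] = logadd(alpha[dst], alpha[src] + score)

    # Backward pass: beta[n] is the log-sum of all partial paths n -> end.
    beta = [-math.inf] * n_nodes
    beta[n_nodes - 1] = 0.0
    for src, dst, _word, score in sorted(edges, key=lambda e: e[1], reverse=True):
        beta[src] = logadd(beta[src], score + beta[dst])

    total = alpha[n_nodes - 1]
    return [
        (word, math.exp(alpha[src] + score + beta[dst] - total))
        for src, dst, word, score in edges
    ]


# Toy lattice: two competing first words, then one shared second word.
EXAMPLE_EDGES: List[Edge] = [
    (0, 1, "hello", math.log(0.6)),
    (0, 1, "hollow", math.log(0.4)),
    (1, 2, "world", 0.0),
]
EXAMPLE_POSTERIORS = dict(edge_posteriors(3, EXAMPLE_EDGES))
```

Competing edges over the same span share the probability mass between them, while an edge that every path must take gets a posterior of one.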

Language Model Related
- Now fully supports
  - Text-based LM reading
  - Inter-conversion of LMs between TXT and DMP formats
    - lm_convert = lm3g2dmp++
  - LM switching API in live_decode_API

Documentation/Tutorial

Hieroglyphs
- A collection of documentation on using Sphinx 3, SphinxTrain and the CMU LM Toolkit
- 1st draft is completed
  - All chapters are filled with information
- Writing the 2nd draft
- “Chief Editor”: Arthur Chan
  - Does it even exist?

Hieroglyph: An outline
- Chapter 1: Licensing of Sphinx, SphinxTrain and LM Toolkit
- Chapter 2: Introduction to Sphinx
- Chapter 3: Introduction to Speech Recognition
- Chapter 4: Recipe of Building Speech Application using Sphinx
- Chapter 5: Different Software Toolkits of Sphinx
- Chapter 6: Acoustic Model Training
- Chapter 7: Language Model Training
- Chapter 8: Search Structure and Speed-up of the Speech Recognizer
- Chapter 9: Speaker Adaptation
- Chapter 10: Research using Sphinx
- Chapter 11: Development using Sphinx
- Appendix A: Command Line Information
- Appendix B: FAQ

Book Reviews of Hieroglyphs
- “You wrote the worst preface I have ever seen in my life.” Dr. Evandro Gouvea
- “The content is o.k., but the writing is still ……” Prof. Alex I. Rudnicky
- “Wow, it is thick. And, oh…… there are no blank spaces! You are not supposed to add contents in any CMU open source manuals, don’t you know?” Dr. Alan W. Black

Other Documents
- Robust Tutorial (aka Sphinx 101)
  - Thanks to Evandro
  - Now could be used for archive_s3, Sphinx 2, Sphinx 3
- Doxygen documentation for Sphinx 3.x is fully available
  - xygen/html/

Sphinx 3.X (X>6) and Conclusion

What is important?
- Keep the current design priorities:
  1. Accuracy: We are just OK and we badly need to improve it.
  2. Speed: We are OK and it doesn’t hurt to improve it.
  3. Functionalities: Still a pain to use Sphinx 3, but it is constantly improving.
- Usability eventually implies distributing models.
- Accuracy should take priority over speed
  - No excuse in 3.7

Roadmap: In X=7……
- For GALE/CALO
  - Speaker Clustering/SAT
    - Bridging SI and SA
  - VTLN
  - LDA
- 0.5 x
  - CALO may need further speed improvement
  - BBI
  - More secret ideas in GMM computation

Roadmap (cont.)
- X=8
  - D.T.: MMIE, MCE
  - STC
  - Interface with HTK models
- X=9
  - D.T. + S.A.
- X>10
  - Time to fire Arthur Chan and hire an assistant professor

Sphinx in Other Languages?

Other Possibilities of Sphinx? [You fill in this part]

We need your help!
- Project Manager: Enable development of Sphinx
  - Translation: Kick/fix people and be kicked/fixed by Evandro
- Developers: Incorporate state-of-the-art speech technology into Sphinx
  - Translation: Fix 1 bug and generate 5 more
- Maintainer: Ensure integrity of Sphinx code and resources
  - Translation: You become the so-called “Grand Janitor of Sphinx”.
- Tester: Enable test-based development in Sphinx
  - Translation: You will learn a lot of Zen Buddhism.

Our Current Motto (Subject to Change)
“Don’t ever underestimate yourself…… You never know what kind of a mess you could make.” -Dr. Evandro Gouvea

Conclusion for Sphinx 3.X
- We have done something
  - We are making some sense in system development now
  - We have healthy growth in accuracy
- But we still need more

Q & A

Thank you
Acknowledgement
- Rich/Alan: for your constant encouragement
- Alex: for your understanding of Yin/Yang
- Rong: for contributing the confidence estimation program
- Bano: for reminding me I could die at any time when we were in Lake Arthur -> Hieroglyphs 1st draft’s progress sped up.
- Sphinx developers: without you, I wouldn’t be the “Grand Janitor”.
- Sphinx users: for your capability of giving me nightmares

Postscript, a word from my friend
“Don’t ever underestimate yourself…… You never know what a mess you could make.” -Dr. Evandro Gouvea

Reserved

Pros/Cons of Batch Sequential Architecture
- Pros: Great flexibility for individual programmers
  - No assumptions; data structures are usually optimized for the application
  - align and allphone have their own optimizations
  - Crafting in each individual application is of high quality
- Cons: Great difficulty in maintenance
  - Most changes need to be carried out 5-6 times
  - Spreads the disease of code duplication
  - Code with the same functionality was duplicated multiple times