
1 Articulatory Feature-based Speech Recognition: A Proposal for the 2006 JHU Summer Workshop on Language Engineering
Potential team members to date: Karen Livescu (presenter), Simon King, Florian Metze, Jeff Bilmes, Mark Hasegawa-Johnson, Ozgur Cetin, Kate Saenko
November 12, 2005
[Title-slide figure: parallel articulatory feature streams labeled LIP-OP, TT-OPEN, TT-LOC, TB-OPEN, VELUM, GLOTTIS]

2 Motivations
Why articulatory feature-based ASR?
–Improved modeling of co-articulatory pronunciation phenomena
–Takes advantage of knowledge about human speech perception and production
–Applicable to audio-visual modeling
–Applicable to multilingual ASR
–Evidence of improved ASR performance with feature-based models:
 in noise [Kirchhoff et al. 2002]
 for hyperarticulated speech [Soltau et al. 2002]
–Potential savings in training data
Why this workshop project?
–A growing number of sites are investigating complementary aspects of this idea; a non-exhaustive list:
 U. Edinburgh (King et al.)
 UIUC (Hasegawa-Johnson et al.)
 MIT (Livescu, Glass, Saenko)
–Recently developed tools (e.g. graphical models) allow systematic exploration of the model space

3 The challenge of pronunciation variation
Pronunciation variation is a noted obstacle for recognition of conversational speech [McAllaster et al. '98, Saraçlar et al. '00]:
–Conversational speech is recognized at twice the error rate of read speech [Weintraub et al. '98]
–Recognizer errors are correlated with reduced pronunciations [Fosler-Lussier '99]
Phonetic transcriptions of conversational pronunciations, with observed counts [Greenberg et al. '96]:
–probably: (2) p r aa b iy, (1) p r ay, (1) p r aw l uh, (1) p r ah b iy, (1) p r aa l iy, (1) p r aa b uw, (1) p ow ih, (1) p aa iy, (1) p aa b uh b l iy, (1) p aa ah iy
–sense: (1) s eh n t s, (1) s ih t s
–everybody: (1) eh v r ax b ax d iy, (1) eh v er b ah d iy, (1) eh ux b ax iy, (1) eh r uw ay, (1) eh b ah iy
–don't: (37) d ow n, (16) d ow, (6) ow n, (4) d ow n t, (3) d ow t, (3) d ah n, (3) ow, (3) n ax, (2) d ax n, (2) ax, (1) n uw, (1) n, (1) t ow, (1) d ow ax n, ...
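To make the counts above concrete, here is a minimal sketch (our own illustration, not part of the proposal) that turns the observed variant counts for "don't" into relative-frequency pronunciation probabilities:

```python
from collections import Counter

# Observed surface pronunciations of "don't" with their counts,
# as listed in the Greenberg et al. '96 transcription data above.
dont_variants = Counter({
    "d ow n": 37, "d ow": 16, "ow n": 6, "d ow n t": 4,
    "d ow t": 3, "d ah n": 3, "ow": 3, "n ax": 3,
    "d ax n": 2, "ax": 2, "n uw": 1, "n": 1,
    "t ow": 1, "d ow ax n": 1,
})

def variant_probs(counts):
    """Relative-frequency estimate of Pr(surface pronunciation | word)."""
    total = sum(counts.values())
    return {variant: c / total for variant, c in counts.items()}

probs = variant_probs(dont_variants)
# Note that the canonical baseform "d ow n t" covers only a small
# fraction of the tokens, which is why a single-pronunciation
# dictionary struggles on conversational speech.
```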

4 Approach: Main Ideas
There are many ways to use articulatory features in ASR. The approach for this project: multiple streams of hidden articulatory states that can desynchronize and stray from their target values.
–Inspired by linguistic theories, but simplified and cast in a probabilistic setting
[Slide figure: the word "everybody" is expanded from a baseform dictionary entry (per-phone target values for features such as LIP-OPEN, VEL, GLOT, indexed by position) into surface realizations via stream asynchrony plus feature substitutions.]
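The two mechanisms on this slide can be sketched generatively. The following toy sampler (our own illustration with made-up feature values and probabilities, not the workshop system) lets each feature stream advance through its baseform targets semi-independently (asynchrony, bounded by a maximum index difference) and occasionally emit a value that strays from the target (substitution):

```python
import random

# Hypothetical per-stream target sequences for one word.
BASEFORMS = {
    "LIP-OPEN": ["closed", "wide", "wide"],
    "GLOTTIS":  ["voiceless", "voiced", "voiced"],
}
MAX_ASYNC = 1       # streams may differ by at most one target index
P_ADVANCE = 0.5     # chance a stream moves to its next target each frame
P_SUBSTITUTE = 0.1  # chance the surface value strays from the target

def sample_frames(baseforms, n_frames, rng):
    """Sample per-frame surface feature values for all streams."""
    idx = {s: 0 for s in baseforms}
    frames = []
    for _ in range(n_frames):
        for s in baseforms:
            last = len(baseforms[s]) - 1
            if idx[s] < last and rng.random() < P_ADVANCE:
                # Advance only if it keeps all streams within MAX_ASYNC.
                if max(abs(idx[s] + 1 - j) for j in idx.values()) <= MAX_ASYNC:
                    idx[s] += 1
        frame = {}
        for s, targets in baseforms.items():
            value = targets[idx[s]]
            if rng.random() < P_SUBSTITUTE:
                value = "OTHER"  # stand-in for a neighboring feature value
            frame[s] = value
        frames.append(frame)
    return frames
```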

5 Dynamic Bayesian network implementation: The context-independent case
[Slide figure: an example DBN with 3 feature streams. Each stream's target values are given by the baseform pronunciations; a conditional probability table over the difference between streams' baseform indices, conditioned on an asynchrony variable, constrains how far the streams may drift apart.]
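The asynchrony mechanism in the DBN can be illustrated with a small sketch. The CPT values here are invented for illustration only (the slide's actual table is not fully recoverable): the model scores index configurations by the absolute difference between two streams' positions in their baseforms.

```python
# Illustrative CPT: Pr(|ind1 - ind2| = d) for two feature streams.
# Values are made up for this sketch; in the real model they are learned.
ASYNC_CPT = {0: 0.7, 1: 0.2, 2: 0.1}

def pair_async_prob(ind1, ind2):
    """Probability of two streams sitting at baseform indices ind1, ind2,
    under the degree-of-asynchrony distribution above. Differences beyond
    the table's support get zero probability (a hard asynchrony bound)."""
    return ASYNC_CPT.get(abs(ind1 - ind2), 0.0)
```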

6 Recent related work
–Product observation models combining phones and features, p(obs|s) = p(obs|ph_s) · ∏_i p(obs|f_si), improve ASR in some conditions [Kirchhoff et al. 2002, Metze et al. 2002, Stueker et al. 2002]
–Lexical access from manual transcriptions of Switchboard words using the DBN model above [Livescu & Glass 2004, 2005]:
 Improves over phone-based pronunciation models (~50% → ~25% error)
 Preliminary result: articulatory phonology features preferable to IPA-style (place/manner) features
–JHU WS'04 project [Hasegawa-Johnson et al. 2004]: landmarks + IPA-style features at the acoustic level can be combined with articulatory phonology features at the pronunciation level
–Articulatory recognition using DBN and ANN/DBN models [Wester et al. 2004, Frankel et al. 2005]: modeling inter-feature dependencies is useful; asynchrony may also be useful
–Lipreading using a multistream DBN model + SVM feature detectors: improves over viseme-based models in medium-vocabulary word ranking and a realistic small-vocabulary task [Saenko et al. 2005]
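The product observation model cited above is simplest in the log domain, where the product becomes a sum. A minimal sketch (the stream weight is our own addition, a common tuning knob rather than something stated on the slide):

```python
import math

def log_product_obs(log_p_phone, log_p_features, feature_weight=1.0):
    """Log of the product observation model
        p(obs|s) = p(obs|ph_s) * prod_i p(obs|f_si),
    combining a phone-based log-likelihood with per-feature
    log-likelihoods. feature_weight is a hypothetical stream weight.
    """
    return log_p_phone + feature_weight * sum(log_p_features)
```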

7 Ongoing work: Audio-visual ASR
[Slide figure: two DBN structures are contrasted. A phoneme-viseme baseline couples an audio state stream (phonemes) with a visual state stream (visemes); the articulatory feature-based model instead uses lip, tongue, and glottis/velum feature streams with explicit asynchrony variables (async_LT, async_TG) and synchrony-checking nodes between them. A sample alignment from a prototype feature-based system is shown against a spectrogram and mouth images.]

8 A partial taxonomy of design issues
[Slide figure: a decision tree over model design choices, with example systems at the leaves; many leaves are still unexplored (marked "???").]
Key dimensions:
–Factored state (multistream structure)? If not: factored observation model? Choice of observation model (GM, SVM, NN) [Deng '97, Richardson '00, Kirchhoff '02, Metze '02, Juneja '04]
–If factored state: how is state asynchrony handled? Free within a unit, soft asynchrony within a word, coupled state transitions (CHMMs, FHMMs), or cross-word soft asynchrony [Kirchhoff '96, Wester et al. '04, Livescu '04, '05, WS04]
–At each leaf: context-dependent or not? Factored observations or not?
(Not to mention the choice of feature sets... and whether the same set is used in the hidden structure and the observation model.)

9 Goals for the 2006 workshop
To build complete articulatory feature-based ASR systems:
–Using multistream DBN structures
–For both audio-only and audio-visual ASR
To develop a thorough understanding of the design issues involved:
–Asynchrony modeling
–Context modeling
–Speaker dependency
–Generative observation modeling vs. discriminative feature classification

10 Potential participants and contributors
Local participants:
–Karen Livescu, MIT: feature-based ASR structures, graphical models, GMTK
–Mark Hasegawa-Johnson, U. Illinois at Urbana-Champaign: discriminative feature classification, JHU WS'04
–Simon King, U. Edinburgh: articulatory feature recognition, ANN/DBN structures
–Ozgur Cetin, ICSI Berkeley: multistream/multirate modeling, graphical models, GMTK
–Florian Metze: articulatory features in an HMM framework
–Jeff Bilmes, U. Washington: graphical models, GMTK
–Kate Saenko, MIT: visual feature classification, AVSR
–Others?
Satellite/advisory contributors:
–Jim Glass, MIT
–Katrin Kirchhoff, U. Washington

11 Resources
Tools:
–GMTK
–HTK
–Intel AVCSR toolkit
Data:
–Audio-only:
 Svitchboard (CSTR, Edinburgh): small-vocabulary, continuous, conversational
 PhoneBook: medium-vocabulary, isolated-word, read
 (Switchboard rescoring? LVCSR)
–Audio-visual:
 AVTIMIT (MIT): medium-vocabulary, continuous, read, added noise
 Digit strings database (MIT): continuous, read, naturalistic setting (noise and video background)
–Articulatory measurements:
 X-ray microbeam database (U. Wisconsin): many speakers, large-vocabulary, isolated-word and continuous
 MOCHA (QMUC, Edinburgh): few speakers, medium-vocabulary, continuous
 Others?
–Manual transcriptions: ICSI Berkeley Switchboard transcription project

12 Thanks! Questions? Comments?

