Context Dependent Models Rita Singh and Bhiksha Raj

Recap and Lookahead
- Covered so far:
  - String-matching based recognition
  - Introduction to HMMs
  - Recognizing isolated words
  - Learning word models from continuous recordings
  - Building word models from phoneme models
  - Exercise: training phoneme models
- Phonemes in context
  - Better characterizations of fundamental sound patterns

The Effect of Context
- Even phonemes are not entirely consistent
  - Different instances of a phoneme will differ according to its neighbours
  - E.g.: spectrograms of IY in different contexts
[Figure: spectrograms of IY in "Beam", "Geek", "Read", "Meek"]

The Effect of Context
- Every phoneme has a locus
  - The spectral shape that would be observed if the phoneme were uttered in isolation, for a long time
  - In continuous speech, the spectrum attempts to arrive at the locus of the current sound
[Figure: spectrogram of "PARTYING" (P AA R T IY IH NG), with loci indicated for AA, IY, UW and M]

The Effect of Context
- Every phoneme has a locus
  - For many phonemes, such as "B", the locus is simply a virtual target that is never reached
  - Nevertheless, during continuous speech, the spectrum for the signal tends towards this virtual locus
[Figure: spectrograms of "AA B AA" and "EE B EE"]

Variability among Sub-word Units
- The acoustic flow for phonemes: every phoneme is associated with a particular articulator configuration
- The spectral pattern produced when the articulators are exactly in that configuration is known as the locus for that phoneme
- As we produce sequences of sounds, the spectral patterns shift from the locus of one phoneme to the next
  - The spectral characteristics of the next phoneme affect the current phoneme
- The inertia of the articulators affects the manner in which sounds are produced
  - The manifestation lags behind the intention
  - The vocal tract and articulators are still completing the previous sound, even as we attempt to generate the next one
- As a result of articulator inertia, the spectral manifestations of any phoneme vary with the adjacent phonemes

Spectral trajectory of a phoneme is dependent on context
[Figure: spectral trajectories through the loci of the phoneme sequences 1-2-3 and 1-2-1; the regions corresponding to phoneme 2 in the two contexts are totally dissimilar]

Subword units with high variability are poor building blocks for words
- Phonemes can vary greatly from instance to instance
- Due to co-articulation effects, some regions of phonemes are very similar to regions of other phonemes
  - E.g. the right boundary region of a phoneme is similar to the left boundary region of the next phoneme
- The boundary regions of phonemes are highly variable and confusable
  - They do not provide significant evidence towards the identity of the phoneme
  - Only the central regions, i.e. the loci of the phonemes, are consistent
- This makes phonemes confusable among themselves
  - In turn making these sub-word units poor building blocks for larger structures such as words and sentences
- Ideally, all regions of the sub-word units would be consistent

Diphones – a different kind of unit
[Figure: trajectories for the phoneme sequences 1-2-3 and 4-2-3; the shaded regions are similar, although the phonemes to the left are different in the two cases]

Diphones – a different kind of unit
- A diphone begins at the center of one phoneme and ends at the center of the next phoneme
- Diphones are much less affected by contextual, or co-articulation, effects than phonemes themselves
  - All regions of the diphone are consistent, i.e. they all provide evidence for the identity of the diphone
  - Boundary regions represent loci of phonemes, and are consistent
  - Central regions represent transitions between consistent loci, and are consistent
- Consequently, diphones are much better building blocks for word HMMs than phonemes
- For a language with N phonemes, there are N² diphones
  - These will require correspondingly larger amounts of training data
  - However, the actual number of sub-word units remains limited and enumerable
  - As opposed to words, which are unlimited in number

The Diphone
- Phonetic representation of ROCK: R AO K
- Diphone representation: ROCK: (??-R), (R-AO), (AO-K), (K-??)
  - Each unit starts from the middle of one phoneme and ends at the middle of the next one
- Word boundaries are a problem
  - The diphone to be used in the first position (??-R) depends on the last phoneme of the previous word!
  - ?? is the last phoneme of the previous word
  - Similarly, the diphone at the end of the word (K-??) depends on the next word
- We build a separate diphone-based word model for every combination of preceding and following phoneme observed in the grammar
- The boundary units are SHARED by adjacent words
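
A minimal sketch of this decomposition in Python (the helper name and the "??" placeholder convention are illustrative, not from the original slides): a word's phoneme string maps to one boundary diphone on each side plus the word-internal diphones.

```python
def word_to_diphones(phones, left_ctx="??", right_ctx="??"):
    """Map a word's phoneme string to diphone units.
    left_ctx / right_ctx are the last phoneme of the previous word and the
    first phoneme of the next word; "??" marks an unresolved word boundary."""
    units = [f"{left_ctx}-{phones[0]}"]            # left boundary diphone
    for a, b in zip(phones, phones[1:]):           # word-internal diphones
        units.append(f"{a}-{b}")
    units.append(f"{phones[-1]}-{right_ctx}")      # right boundary diphone
    return units

print(word_to_diphones(["R", "AO", "K"]))
# ['??-R', 'R-AO', 'AO-K', 'K-??']
print(word_to_diphones(["R", "AO", "K"], left_ctx="S"))   # JAILHOUSE ROCK
# ['S-R', 'R-AO', 'AO-K', 'K-??']
```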

Building word HMMs with diphones
- Dictionary entry for ROCK: /R/ /AO/ /K/
- Word boundary units are not unique
- The specific diphone HMMs to be used at the ends of the word depend on the previous and succeeding word
  - E.g. the first diphone HMM for ROCK in the word series JAILHOUSE ROCK is /S-R/, whereas for PLYMOUTH ROCK it is /TH-R/
- As a result, there are as many HMMs for "ROCK" as there are possible left-neighbor, right-neighbor phoneme combinations
[Figure: components of the HMM for ROCK: HMM for /*-R/, HMM for /R-AO/, HMM for /AO-K/, HMM for /K-*/]

Building word HMMs with diphones
- We end up with as many word models for ROCK as the number of possible combinations of words to the right and left
[Figure: composed HMM for ROCK: HMMs for /*-R/, /R-AO/, /AO-K/, /K-*/]

Word Sequences using Diphones
- Consider this set of sentences:
  - HER ROCK NECKLACE
  - MY ROCK COLLECTION
  - A ROCK STAR
- For illustration, we will concentrate on this region (the word ROCK and its boundaries)

Building word HMMs with diphones
[Figure: the three sentences HER ROCK NECKLACE, MY ROCK COLLECTION, A ROCK STAR. ROCK's internal diphones /R-AO/ and /AO-K/ connect to left boundary units /R-R/ (after HER), /AY-R/ (after MY), /AX-R/ (after A) and right boundary units /K-N/ (before NECKLACE), /K-K/ (before COLLECTION), /K-S/ (before STAR)]

Building word HMMs with diphones
- Each instance of ROCK is a different model
  - The 3 instances are not copies of one another
- The boundary units (gray in the figure) are shared with adjacent words
[Figure: same network as above, with the boundary diphones /R-R/, /AY-R/, /AX-R/, /K-N/, /K-K/, /K-S/ shared between ROCK and its neighbours]

Building word HMMs with diphones
- We end up with as many word models for ROCK as the number of possible words to the right and left
[Figure: composed HMM for ROCK, with parallel variants of the boundary HMMs /*-R/ and /K-*/ around /R-AO/ and /AO-K/]

Building word HMMs with diphones
- Under some conditions, portions of multiple versions of the model for a word can be collapsed
  - Depending on the grammar
[Figure: composed HMM for ROCK with shared portions collapsed]

Building a Diphone-based recognizer
- Train models for all diphones in the language
  - If there are N phonemes in the language, there will be N² diphones
  - For 40 phonemes, there are thus 1600 diphones; still a manageable number
- Training: HMMs for word sequences must be built using diphone models
  - There will no longer be distinctly identifiable word HMMs, because boundary units are shared among adjacent words
- A similar problem would arise during recognition

Word-boundary Diphones
- Cross-word diphones have a somewhat different structure from within-word diphones, even when they represent the same phoneme combinations
  - E.g. the within-word diphone AA-F in the word "AFRICA" is different from the word-boundary diphone AA-F in "BURKINA FASO"
  - Stresses applied to different portions of the sounds are different
  - This is true even for cross-syllabic diphones
- It is advantageous to treat cross-word diphones as being distinct from within-word diphones
  - This results in a doubling of the total number of diphones in the language to 2N², where N is the total number of phonemes in the language
  - We will train cross-word diphones only from cross-word data, and within-word diphones only from within-word data
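
One way to realize this split in code — a small sketch; the "wi"/"xw" tags are an invented labelling convention — is simply to fold the position into the unit name, so that the two cases get separate models:

```python
def diphone_label(a, b, cross_word):
    """Tag a diphone as within-word ('wi') or cross-word ('xw') so that
    the two positions are trained as separate units."""
    return f"{a}-{b}/{'xw' if cross_word else 'wi'}"

print(diphone_label("AA", "F", cross_word=False))  # AA-F in AFRICA -> AA-F/wi
print(diphone_label("AA", "F", cross_word=True))   # AA-F in BURKINA FASO -> AA-F/xw
```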

Improving upon Diphones
- Loci may never occur in fluent speech
  - Resulting in variation even in diphones
- The actual spectral trajectories are affected by adjacent phonemes
  - Diphones do not capture all effects
[Figure: trajectories for the phoneme sequences 1-2-1 and 3-2-1; the 2-1 diphones in the two cases have different spectral structure]

Triphones: PHONEMES IN CONTEXT
- Triphones represent phoneme units that are specific to a context
  - E.g. "the kind of phoneme 2 that follows phoneme 3 and precedes phoneme 1"
- The triphone is not a triplet of phonemes – it still represents a single phoneme
[Figure: in the sequences 1-2-1 and 3-2-1, the two instances of phoneme 2 are different triphone units]

Triphones
- Triphones are actually specialized phonemes
  - The triphones AX(B,T), AX(P,D), AX(M,G) all represent variations of AX
- To build the HMM for a word, we simply concatenate the HMMs for the individual triphones in it
  - E.g. R AO K: R(??,AO) AO(R,K) K(AO,??)
  - We link the HMMs for the triphones R(??,AO), AO(R,K) and K(AO,??)
- Unlike diphones, no units are shared across words
- Like diphones, however, the boundary units R(??,AO) and K(AO,??) depend on the boundary phonemes of adjacent words
  - In fact it becomes more complex than for diphones
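
The triphone decomposition can be sketched the same way as the diphone one above (again with an illustrative helper name and "??" placeholders for unresolved word-boundary contexts):

```python
def word_to_triphones(phones, left_ctx="??", right_ctx="??"):
    """Map a word's phoneme string to triphone units p(l,r): phoneme p
    with left context l and right context r."""
    units = []
    for i, p in enumerate(phones):
        l = phones[i - 1] if i > 0 else left_ctx
        r = phones[i + 1] if i < len(phones) - 1 else right_ctx
        units.append(f"{p}({l},{r})")
    return units

print(word_to_triphones(["R", "AO", "K"]))
# ['R(??,AO)', 'AO(R,K)', 'K(AO,??)']
```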

Building HMMs for word sequences
- As in the case of diphones, we cannot compose a unique model for a word
- The model that is composed will depend on the adjacent words

Word Sequences using Triphones
- Consider this set of sentences:
  - HER ROCK NECKLACE
  - MY ROCK COLLECTION
  - A ROCK STAR
- For illustration, we will concentrate on this region (the word ROCK and its boundaries)

Building word HMMs with triphones
- As for diphones, we end up with as many word models for ROCK as the number of possible words to the right and left
[Figure: composed HMM for ROCK, built from HMMs for R(R,AO), R(AY,AO), R(AX,AO) in parallel at the left, AO(R,K) in the middle, and K(AO,N), K(AO,K), K(AO,S) in parallel at the right, flanked by *(*,R) and *(K,*)]

Building word HMMs with triphones
- As with diphones, portions of multiple versions of the model for a word can be collapsed
  - Depending on the grammar
[Figure: composed HMM for ROCK with shared portions collapsed]

It can get more complex
- Triphones at word boundaries are dependent on neighbouring words
- This results in significant complication of the HMM for the language (through which we find the best path, for recognition)
  - Resulting in larger HMMs and slower search
- Example lexicon of five "words" and their pronunciations in terms of "phones":
  - Five: F AY V
  - Four: F OW R
  - Nine: N AY N
  - (pause): SIL
  - ++breath++: +breath+
- Let us assume that these are the only words in the current speech to be recognized. The recognition vocabulary thus consists of five words. The system uses a dictionary as a reference for these mappings.

A Simple Looping Graph For Words
- We will compose an HMM for this simple grammar
  - A person may say any combination of the words five, nine and four, with silences and breaths in between
[Figure: looping graph over the words Five, Four, Nine, pause, ++breath++]
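
As a rough sketch, this loop grammar is just a graph in which any word may follow any other; the "&lt;start&gt;", "&lt;end&gt;" and "&lt;pause&gt;" node labels here are invented for illustration:

```python
# The looping grammar as an adjacency structure: any word may follow any word,
# and the utterance may begin or end at any of them.
WORDS = ["five", "four", "nine", "<pause>", "++breath++"]
successors = {w: WORDS + ["<end>"] for w in WORDS}
successors["<start>"] = list(WORDS)

print(successors["five"])
# ['five', 'four', 'nine', '<pause>', '++breath++', '<end>']
```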

Using (CI) Phonemes
- Word boundary units are not context specific. All words can be connected to (and from) null nodes
[Figure: start and end null nodes connected to the phoneme HMM chains for Five (F AY V), Four (F OW R), Nine (N AY N), SIL and +breath+; each coloured square represents an entire HMM]

Using Triphones
- Using the dictionary as reference, the system first maps each word into triphone-based pronunciations
- Each triphone further has a characteristic label or type, according to where it occurs in the word
- Context is not initially known for cross-word triphones
  - Five: F(*,AY) cross-word, AY(F,V) word-internal, V(AY,*) cross-word
  - Four: F(*,OW) cross-word, OW(F,R) word-internal, R(OW,*) cross-word
  - Nine: N(*,AY) cross-word, AY(N,N) word-internal, N(AY,*) cross-word
  - SIL, +breath+
- Each triphone is modelled by an HMM; silence and breath are each modelled by an HMM

Using Triphones
- A triphone is a single context-specific phoneme. It is not a sequence of 3 phones
- Expand the word Five:
  - All last phones (except +breath+) become left contexts for the first phone of Five
  - All first phones (except +breath+) become right contexts for the last phone of Five
- Silence can form contexts, but itself does not have any context dependency
- Filler phones (e.g. +breath+) are treated as silence when building contexts. Like silence, they themselves do not have any context dependency
[Figure: HMM for "Five", composed of 8 HMMs; each triple-box represents a triphone, and each triphone model is actually a left-to-right HMM]
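
A sketch of this expansion (the helper name is invented): enumerating the distinct triphone HMMs that the word Five needs, given the last phones {V, R, N, SIL} and first phones {F, N, SIL} of the words that can surround it, reproduces the 8 HMMs on the slide.

```python
def word_boundary_triphones(phones, left_ctxs, right_ctxs):
    """Enumerate the distinct triphone HMMs needed to expand one word
    against all possible cross-word contexts (assumes len(phones) >= 2)."""
    units = {f"{phones[0]}({l},{phones[1]})" for l in left_ctxs}       # entry variants
    units |= {f"{phones[i]}({phones[i-1]},{phones[i+1]})"
              for i in range(1, len(phones) - 1)}                      # word-internal
    units |= {f"{phones[-1]}({phones[-2]},{r})" for r in right_ctxs}   # exit variants
    return sorted(units)

units = word_boundary_triphones(["F", "AY", "V"],
                                left_ctxs=["V", "R", "N", "SIL"],
                                right_ctxs=["F", "N", "SIL"])
print(len(units), units)   # 8 distinct triphone HMMs, as on the slide
```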

The triphone-based Grammar HMM
- Even a simple looping grammar becomes very complicated because of cross-word triphones
- The HMM becomes even more complex when some of the words have only one phoneme
  - There will be as many instances of the HMM for the word as there are word contexts
  - No portion of these HMMs will be shared!

The triphone-based Grammar HMM
- Linking rule: link from the rightmost phone x with right context y to the leftmost phone y with left context x
[Figure: word models W1–W4 with entry probabilities P(W1)–P(W4), over the lexicon five, four, nine, sil, breath]

The triphone-based Grammar HMM
- All linking rules:
  - Begin goes to all silence left contexts
  - All silence right contexts go to end
  - Link from the rightmost phone x with right context y to the leftmost phone y with left context x
[Figure: the full triphone-based grammar HMM with start and end nodes for the Five/Four/Nine lexicon]

Types of triphones
- A triphone in the middle of a word sounds different from the same triphone at word boundaries
  - E.g. the word-internal triphone AX(G,T) from GUT (G AX T) vs. the cross-word triphone AX(G,T) in BIG ATTEMPT
- The same triphone in a single-phone word (e.g. when the central phoneme is a complete word) sounds different again
  - E.g. AX(G,T) in WAG A TAIL: the word A (AX) is a single-phone word, and the triphone is a single-word triphone
- We distinguish four types of triphones:
  - Word-internal
  - Cross-word at the beginning of a word
  - Cross-word at the end of a word
  - Single-word triphones
- Separate models are learned for the four cases and used appropriately
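
A tiny sketch of how the four-way distinction might be assigned by position (the function name and return labels are invented for illustration):

```python
def triphone_type(word_len, index):
    """Classify a triphone by its position in a word of word_len phones."""
    if word_len == 1:
        return "single-word"          # e.g. AX in "A" within WAG A TAIL
    if index == 0:
        return "cross-word-begin"     # left context crosses the word boundary
    if index == word_len - 1:
        return "cross-word-end"       # right context crosses the word boundary
    return "word-internal"            # e.g. AX(G,T) in GUT

print(triphone_type(word_len=1, index=0))   # single-word
print(triphone_type(word_len=3, index=1))   # word-internal
```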

Context Issues
- Phonemes are affected by adjacent phonemes
- If there is no adjacent phoneme, i.e. if a phoneme follows or precedes a silence or a pause, that too will have a unique structure
- We will treat "silence" as a context also
- "Filler" sounds, like background noises etc., are typically treated as equivalent to silence for context
- "Voiced" filler sounds, like "UM", "UH" etc., do affect the structure of adjacent phonemes and must be treated as valid contexts

Training Data Considerations for Nphone Units
- 1.53 million words of training data (~70 hours)
- All 39 phonemes are seen (100%)
- 1387 of 1521 possible diphones are seen (91%)
  - Not considering cross-word diphones as distinct units
- 42% of possible triphones are seen
  - Not maintaining the distinction between different kinds of triphones

Counts of CI Phones
- All context-independent phonemes occur in sufficient numbers to estimate HMM parameters well
- Some phonemes such as "ZH" may be rare in some corpora. In such cases, they can be merged with other similar phonemes, e.g. ZH can be merged with SH, to avoid a data insufficiency problem
[Figure: histogram of the number of occurrences of the 39 phonemes in 1.5 million words of Broadcast News]

Counts of Diphones
- Large counts (x axis) not shown; count of counts (y axis) clipped from above at 10
- The figure is much flatter than the typical trend given by Zipf's law
- The mean number of occurrences of diphones is 4900
- Most diphones can be well trained
  - Diphones with low (or 0) count, e.g. count < 100, account for less than 0.1% of the total probability mass
  - Again, contrary to Zipf's law
- Diphones with low counts, or unseen diphones, can be clustered with other similar diphones with minimal loss of recognition performance
[Figure: "count of counts" histogram for the 1387 diphones in 1.5 million words of Broadcast News]

Counts of Triphones
- Counts (x axis) > 250 not shown
- Follows the trends of Zipf's law very closely
  - Because of the large number of triphones
- The mean number of occurrences of observed triphones is 300
- The majority of the triphones are not seen, or are rarely seen
  - 58% of all triphones are never seen
  - 86% of all triphones are seen less than 10 times
  - The majority of the triphones in the language will be poorly trained
[Figure: "count of counts" histogram for the triphones in 1.5 million words of Broadcast News]
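
These coverage figures are straightforward to compute from phoneme transcripts. A minimal sketch with a toy phone inventory and toy data; here a triphone is counted simply as a 3-gram of phones, ignoring word boundaries and triphone types:

```python
from collections import Counter

PHONES = ["AA", "AE", "AH", "IY", "K", "R", "S", "T"]   # toy inventory

def nphone_coverage(transcripts, n=3):
    """Count n-phone units and report coverage against len(PHONES)**n."""
    counts = Counter()
    for phones in transcripts:
        for i in range(len(phones) - n + 1):
            counts[tuple(phones[i:i + n])] += 1
    count_of_counts = Counter(counts.values())   # units occurring exactly k times
    return len(counts), len(PHONES) ** n, count_of_counts

transcripts = [["R", "AA", "K"], ["S", "T", "AA", "R"], ["R", "AA", "K"]]
seen, possible, coc = nphone_coverage(transcripts, n=3)
print(f"{seen} of {possible} possible triphones seen ({100.0 * seen / possible:.2f}%)")
print("count of counts:", dict(coc))
```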

Higher order Nphones
- Spectral trajectories are also affected by farther contexts
  - E.g. by phonemes that are two phonemes distant from the current one
- The effect of longer contexts is much smaller than that of the immediate context, in most speech
  - The use of longer-context models only results in relatively minor improvements in recognition accuracy
- There are far too many possibilities for longer-context units
  - E.g., there are 40⁵ possible quinphone units (that consider a 2-phoneme context on either side), even if we ignore cross-word effects
- Cross-word effects get far more complicated with longer-context units
  - Cross-word contexts may now span multiple words. The HMM for any word thus becomes dependent on the previous two (or more) words, for instance

Training Tradeoffs
- Word models
  - Very effective, if they can be well trained
  - Difficult to train well when vocabularies get large
  - Cannot recognize words that are not seen in training data
- Context-independent phoneme models
  - Simple to train; data insufficiency rarely a problem
  - All phonemes are usually seen in the training data
  - If some phonemes are not seen, they can be mapped onto other, relatively common phonemes
- Diphones
  - Better characterization of sub-word acoustic phenomena than simple context-independent phonemes
  - Data insufficiency a relatively minor problem. Some diphones are usually not seen in training data; their HMMs must be inferred from the HMMs for seen diphones

Training Tradeoffs
- Triphones
  - Characterizations of phonemes that account for contexts on both sides
  - Result in better recognition than diphones
  - Data insufficiency is a problem: the majority of triphones are not seen in training data
  - Parameter sharing techniques must be used to train uncommon triphones and to derive HMMs for unseen triphones
  - Advantage: can back off to CI phonemes when HMMs are not available for the triphones themselves
- Higher order Nphone models
  - Better characterizations than triphones; however, the benefit to be obtained from going to Nphones of higher contexts is small
  - Data insufficiency a huge problem – most Nphones are not seen in the training data
  - Complex techniques must be employed for inferring the models for unseen Nphones
  - Alternatively, one can back off to triphones or CI phonemes

Context-Dependent Phonemes for Recognition
- Context-independent phonemes are rarely used by themselves
  - Recognition performance is not good enough
- Diphone models are typically used in embedded systems
  - Compact models
  - Nearly all diphones will be seen in the training data
  - Recognition accuracy still not as good as triphones
- Triphones are the most commonly used units
  - Best recognition performance
  - Model size controlled by tricks such as parameter sharing
  - Most triphones are not seen in training data; nevertheless, models for them can be constructed, as we will see later
- Quinphone models are used by some systems
  - Much increase in system complexity for small gains

HMM Topology for N-phones
- Context-dependent phonemes are modeled with the same topology as context-independent phonemes
  - Most systems use a 3-state topology
  - All C-D phonemes have the same topology
- Some older systems use a 5-state topology
  - Which permits states to be skipped entirely
  - This is not demonstrably superior to the 3-state topology
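
For concreteness, a sketch of a 3-state left-to-right (Bakis) topology as a transition matrix, with a fourth, non-emitting exit state appended; the 0.6/0.4 split is an arbitrary initialization that training would re-estimate:

```python
import numpy as np

def three_state_topology(self_loop=0.6):
    """3-state left-to-right HMM transition matrix; state 3 is a
    non-emitting exit state. No skips between emitting states."""
    stay, move = self_loop, 1.0 - self_loop
    return np.array([
        [stay, move, 0.0,  0.0 ],   # state 0: loop or advance to 1
        [0.0,  stay, move, 0.0 ],   # state 1: loop or advance to 2
        [0.0,  0.0,  stay, move],   # state 2: loop or exit
        [0.0,  0.0,  0.0,  1.0 ],   # exit (absorbing)
    ])

print(three_state_topology())
```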

Training Triphone Models
- Like training context-independent (CI) phoneme models
- First step: train CI models
  - Important: needed for initialization
- Initialize all triphone models using the corresponding CI model
  - E.g. the model for IY(B,T) [the phoneme IY that occurs after B and before T, such as the IY in BEAT] is initialized by the model for the context-independent phoneme IY
- Create HMMs for the word sequences using triphones
- Train HMMs for all triphones in the training data using Baum-Welch
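
The initialization step can be sketched as a deep copy of CI parameters for each observed triphone (the helper and data-structure names here are illustrative); Baum-Welch re-estimation would then run on these clones:

```python
import copy

def init_triphone_models(ci_models, triphones_seen):
    """Clone each CI model as the starting point for every triphone of
    that base phone, prior to Baum-Welch re-estimation.
    ci_models: dict base_phone -> HMM parameters (any copyable object)
    triphones_seen: iterable of (base, left, right) tuples"""
    return {f"{base}({l},{r})": copy.deepcopy(ci_models[base])
            for base, l, r in triphones_seen}

ci = {"IY": {"transitions": "...", "gaussians": "..."}}   # placeholder params
cd = init_triphone_models(ci, [("IY", "B", "T")])
print(list(cd))   # ['IY(B,T)'] – starts as an exact copy of the CI model for IY
```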

Training Triphone Models with SphinxTrain
- A simple exercise:
  - Train triphone models using a small corpus
  - Recognize a small test set using these models