1 0 / 27 Towards the Recovery of Targets from Coarticulated Speech for Automatic Speech Recognition
John-Paul Hosom (1), Alexander Kain, Brian O. Bush
Center for Spoken Language Understanding (CSLU), Department of Biomedical Engineering (BME), Oregon Health & Science University (OHSU)
(1) John-Paul Hosom is now with Sensory, Inc.

2 1 / 27 Outline
1. Introduction: Error Analysis of TIMIT ASR
2. Introduction: Hypothesis
3. Background: Characteristics of Clear Speech
4. Background: Formant Targets and Locus Theory
5. Objectives of Current Study
6. Corpus
7. Model
8. Methods
9. Results
10. Conclusions & Future Work

3 2 / 27 1. Introduction: Error Analysis of TIMIT
Phoneme ASR of TIMIT: HMM/ANN system trained on TIMIT, decoded with a bigram language model [Hosom et al., 2010].
Accuracy of 74% is high for HMM-based systems.
Vowel substitutions account for 35% of errors and cover all distinctive-phonetic feature dimensions. A single substitution can involve more than one dimension, so the percentages below sum to more than 100%:
– 62% of vowel substitutions have a front/back error
– 52% of vowel substitutions have a tense/lax error
– 68% of vowel substitutions have a height error
– 31% of vowel substitutions have a vowel/consonant error
Errors are not confined to a specific type.

4 3 / 27 2. Introduction: Hypothesis
Hypothesis: A main cause of ASR errors is not the feature space or the classification technique (either of which might result in more distinct error patterns), but “noise” in the probability estimates. This noise is caused by variability in the features, which can be reduced by estimating phoneme targets instead of using the observed values.
For now, we work in the formant space, although targets need not be limited to this space. Also, to control for speaking style, we are interested in both “clear” and “conversational” speech.

5 4 / 27 3. Characteristics of Clear Speech
[Figure, two panels: observed midpoints of vowels]
– Clear speech: accuracy 90%
– Conversational speech: accuracy 73%
“Clear” speech has an expanded vowel space and longer phoneme durations.

6 5 / 27 3. Characteristics of Clear Speech
Using a “hybridization” algorithm that combined features of clear (CLR) and conversational (CNV) speech, together with perceptual testing, we have shown over several experiments that the features most relevant for intelligibility are the combination of spectrum and duration [Kain, Amano-Kusumoto, and Hosom (2008); Kusumoto, Kain, Hosom, and van Santen (2007); Amano-Kusumoto and Hosom (2009)].
This has led us to study a model of coarticulation that quantitatively describes the change of formants over time.

7 6 / 27 4. Formant Targets and Locus Theory [From Klatt 1987, p. 753]

8 7 / 27 4. Formant Targets and Locus Theory
[Figure: /d/ followed by /u/; axes: time, frequency]
Most consonants (all except /j/, /l/, /r/, /w/) do not have visible formants. They have “virtual” formants identified by coarticulation in the vowel.

9 8 / 27 4. Formant Targets and Locus Theory [from Delattre et al., 1955 as reported in Johnson, 1997, p. 135]

10 9 / 27 4. Formant Targets and Locus Theory
Locus Theory Summary:
1. Vowels and consonants have formant targets; most consonants have “virtual” formants.
2. Coarticulation yields smooth change between targets when formants are visible.
3. If duration is too short, formants do not reach their targets, yielding undershoot.
4. Both the targets and the rate of change are important for intelligibility.

11 10 / 27 Outline
1. Introduction: Error Analysis of TIMIT ASR
2. Introduction: Hypothesis
3. Background: Characteristics of Clear Speech
4. Background: Formant Targets and Locus Theory
5. Objectives of Current Study
6. Corpus
7. Model
8. Methods
9. Results
10. Conclusions & Future Work

12 11 / 27 5. Objectives of Current Study
– Estimate each phoneme’s target instead of relying on observed data.
– Using targets will reduce the variance of features, yielding conversational (and clear) speech with feature overlap similar to that of clear speech.
– Given a GMM of target values (instead of observed values), compute the probability of each token’s estimated targets given each possible phoneme.
– Use a semi-Markov model to address the trajectory of features.
– In a real system, move from formants to another feature domain.
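The third objective, scoring a token's estimated targets against a per-phoneme GMM, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes diagonal covariances, and the phoneme labels, the (F1, F2) means, and the variances in `PHONEME_GMMS` are hypothetical values chosen only to make the example run.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, components):
    """Log p(x | phoneme) for a GMM given as (weight, mean, var) tuples."""
    logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in components]
    top = max(logs)
    # Log-sum-exp keeps the mixture sum numerically stable.
    return top + math.log(sum(math.exp(l - top) for l in logs))

# Hypothetical single-component models over (F1, F2) targets in Hz.
# These numbers are illustrative assumptions, not values from the study.
PHONEME_GMMS = {
    "iy": [(1.0, (300.0, 2300.0), (40.0 ** 2, 150.0 ** 2))],
    "uw": [(1.0, (320.0, 900.0), (40.0 ** 2, 150.0 ** 2))],
    "aa": [(1.0, (750.0, 1200.0), (60.0 ** 2, 150.0 ** 2))],
}

def classify(target):
    """Return the best-scoring phoneme and all log-likelihoods for a target."""
    scores = {ph: gmm_log_likelihood(target, g) for ph, g in PHONEME_GMMS.items()}
    return max(scores, key=scores.get), scores

best, scores = classify((310.0, 2200.0))
print(best, {ph: round(s, 1) for ph, s in scores.items()})
```

Because the score is computed once per token from its estimated targets, it replaces the usual per-frame observation probability rather than being multiplied across frames.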

13 12 / 27 5. Objectives of Current Study Mean and standard deviation of target values

14 13 / 27 6. Corpus
Corpus:
1. Male and female speaker.
2. Sentences contain a neutral carrier phrase (5 total) followed by a target word (242 total).
3. Target words are common English CVC words with 23 initial and final consonants and 8 vowels.
4. All sentences spoken in both clear and conversational styles.
5. Two recordings per style of each sentence.
6. Formants and phoneme boundaries automatically estimated, manually corrected with verification.

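15 14 / 27 7. Model
Formant trajectory model: [equation not preserved in transcript]
– [symbol not preserved] is the estimated formant trajectory over time t.
– T_C1, T_V, T_C2 are the target formant values for C1, V, C2.
– [symbol not preserved] is the degree of articulation of C1 or C2.
– [symbol not preserved] is a sigmoid function over time t.
– s is the maximum slope of the sigmoid, and p is the position of s.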
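The symbol definitions on this slide suggest a trajectory formed by blending the three targets with sigmoid weights. The sketch below is one plausible reading under stated assumptions, not the authors' exact equation: the C1 weight is taken as 1 − σ(t; s1, p1), the C2 weight as σ(t; s2, p2), the vowel weight as whatever remains, and the sigmoid is scaled so that s is its maximum slope, reached at t = p. The function names and the example target values are hypothetical.

```python
import math

def sigmoid(t, s, p):
    """Logistic function whose maximum slope is s, reached at t = p."""
    z = max(-60.0, min(60.0, 4.0 * s * (t - p)))  # clamp to avoid overflow
    return 1.0 / (1.0 + math.exp(-z))

def trajectory(t, t_c1, t_v, t_c2, s1, p1, s2, p2):
    """Assumed model: sigmoid-weighted blend of the C1, V and C2 targets."""
    d_c1 = 1.0 - sigmoid(t, s1, p1)   # influence of C1 fades after p1
    d_c2 = sigmoid(t, s2, p2)         # influence of C2 grows after p2
    d_v = 1.0 - d_c1 - d_c2           # vowel takes the remaining weight
    return d_c1 * t_c1 + d_v * t_v + d_c2 * t_c2

# Illustrative second-formant targets in Hz (assumed, not from the study).
T_C1, T_V, T_C2 = 1800.0, 900.0, 1800.0

# Long vowel: transitions at 50 ms and 250 ms, evaluated at the midpoint.
mid_long = trajectory(0.15, T_C1, T_V, T_C2, s1=40.0, p1=0.05, s2=40.0, p2=0.25)

# Short vowel: transitions only 20 ms apart, evaluated at the midpoint.
mid_short = trajectory(0.06, T_C1, T_V, T_C2, s1=40.0, p1=0.05, s2=40.0, p2=0.07)

print(round(mid_long, 1))   # close to the vowel target
print(round(mid_short, 1))  # undershoot: stays well above the vowel target
```

With the same targets and slopes, shortening the vowel keeps the trajectory from reaching T_V, which is the undershoot described by locus theory. The sketch assumes p1 < p2; otherwise the vowel weight can become negative.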

16 15 / 27 7. Model
Formant trajectory model: [content not preserved in transcript]

17 16 / 27 7. Model
Formant trajectory model: [content not preserved in transcript]

18 17 / 27 8. Methods
Estimating Model Parameters:
1. Two sets of parameters to estimate:
(a) s1, s2, p1, p2, estimated on a per-token basis
(b) T_C1, T_V, T_C2, estimated on a per-token basis (independent of speaking style) and then averaged
2. For one token, the error is: [equation not preserved in transcript]
3. A genetic algorithm is used to find the best estimates. The fitness function is the error summed over all tokens.

19 18 / 27 8. Methods
Estimating Model Parameters:
4. The genetic algorithm employs mutation, crossover (exchange of a group of parameters), and elitism (best solutions retained in the next generation).
5. Data partitioned into 20 folds for n-fold validation.
6. Performed 60 randomly-initialized starts in parameter estimation to get 60 points per phoneme.
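The three genetic-algorithm ingredients named above (mutation, crossover of a parameter group, elitism) can be sketched as follows. This is a simplified illustration under several assumptions: the per-token error is taken to be a sum of squared differences (the slide's own formula is not preserved), the trajectory is a sigmoid-weighted blend of targets, only s1, p1, s2, p2 are searched while the targets are held fixed, and the fit uses a single synthetic token rather than the error summed over all tokens. All names, bounds, and rates are hypothetical.

```python
import math
import random

def sigmoid(t, s, p):
    z = max(-60.0, min(60.0, 4.0 * s * (t - p)))  # max slope s at t = p
    return 1.0 / (1.0 + math.exp(-z))

def trajectory(t, params, targets):
    s1, p1, s2, p2 = params
    t_c1, t_v, t_c2 = targets
    d1 = 1.0 - sigmoid(t, s1, p1)
    d2 = sigmoid(t, s2, p2)
    return d1 * t_c1 + (1.0 - d1 - d2) * t_v + d2 * t_c2

def token_error(params, targets, times, observed):
    """Assumed error: sum of squared model-vs-observed differences."""
    return sum((trajectory(t, params, targets) - f) ** 2
               for t, f in zip(times, observed))

BOUNDS = [(5.0, 100.0), (0.0, 0.3), (5.0, 100.0), (0.0, 0.3)]  # s1, p1, s2, p2

def random_individual(rng):
    return [rng.uniform(lo, hi) for lo, hi in BOUNDS]

def mutate(ind, rng, rate=0.3):
    """Perturb each parameter with probability `rate`, staying in bounds."""
    out = []
    for x, (lo, hi) in zip(ind, BOUNDS):
        if rng.random() < rate:
            x = min(hi, max(lo, x + rng.gauss(0.0, 0.1 * (hi - lo))))
        out.append(x)
    return out

def crossover(a, b, rng):
    """Exchange a group of parameters: (s1, p1) from one parent, (s2, p2) from the other."""
    return a[:2] + b[2:] if rng.random() < 0.5 else b[:2] + a[2:]

def run_ga(fitness, rng, pop_size=40, generations=60, n_elite=2):
    pop = [random_individual(rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness)
        history.append(fitness(pop[0]))
        parents = pop[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - n_elite)
        ]
        pop = pop[:n_elite] + children  # elitism: best solutions carried over
    pop.sort(key=fitness)
    history.append(fitness(pop[0]))
    return pop[0], history

# Synthetic token generated from known parameters (illustrative values).
TARGETS = (1800.0, 900.0, 1800.0)
TRUE = (40.0, 0.05, 40.0, 0.25)
TIMES = [i * 0.01 for i in range(31)]
OBSERVED = [trajectory(t, TRUE, TARGETS) for t in TIMES]

rng = random.Random(0)
best, history = run_ga(lambda ind: token_error(ind, TARGETS, TIMES, OBSERVED), rng)
print([round(x, 3) for x in best], round(history[-1], 2))
```

Because the elite individuals are copied unchanged, the best error recorded in `history` never increases from one generation to the next. Repeating `run_ga` with different seeds would correspond to the randomly-initialized starts described in step 6.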

20 19 / 27 9. Results
Histograms of vowel contribution: maximum contribution of the vowel for both speaking styles (enforced minimum value of 0.6).
[Figure panels: Clear speech, Conversational speech]

21 20 / 27 9. Results Target Estimation Results: Estimated targets for vowels (60 points per phoneme)

22 21 / 27 9. Results Target Estimation Results: Estimated targets for C1 (60 points per phoneme)

23 22 / 27 9. Results
Target Estimation Results:
Vowel classification accuracy on training data: xx.x% (xx.x% CLR, xx.x% CNV)
Vowel classification accuracy on test data: xx.x% (xx.x% CLR, xx.x% CNV)

24 23 / 27 9. Results
Coarticulation Parameter Results: Error surface between the estimated model and the observed data as a function of s and p.
[Figure: error surface over s and p]
A low error can be obtained for many values of s.
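An error surface of this kind can be computed by evaluating the model-versus-data error on a grid of s and p values. The sketch below is an illustration under assumptions, not a reproduction of the slide's figure: it uses a single consonant-to-vowel transition with a sigmoid weight, synthetic noise-free data, RMS error, and hypothetical grid ranges and target values.

```python
import math

def sigmoid(t, s, p):
    z = max(-60.0, min(60.0, 4.0 * s * (t - p)))  # max slope s at t = p
    return 1.0 / (1.0 + math.exp(-z))

def cv_trajectory(t, s, p, t_c=1800.0, t_v=900.0):
    """Assumed consonant-to-vowel transition between two targets (Hz)."""
    w = sigmoid(t, s, p)
    return (1.0 - w) * t_c + w * t_v

def rms_error(s, p, times, observed):
    sq = [(cv_trajectory(t, s, p) - f) ** 2 for t, f in zip(times, observed)]
    return math.sqrt(sum(sq) / len(sq))

# Hypothetical search grid and a synthetic token sampled every 10 ms.
S_GRID = [10.0 * k for k in range(1, 11)]   # candidate slopes
P_GRID = [0.01 * k for k in range(3, 8)]    # candidate positions (s)
TRUE_S, TRUE_P = S_GRID[3], P_GRID[2]
TIMES = [0.01 * k for k in range(11)]
OBSERVED = [cv_trajectory(t, TRUE_S, TRUE_P) for t in TIMES]

# SURFACE[i][j] is the RMS error for slope S_GRID[i] and position P_GRID[j].
SURFACE = [[rms_error(s, p, TIMES, OBSERVED) for p in P_GRID] for s in S_GRID]

def low_error_slopes(threshold_hz):
    """Slopes whose error at the true position is within the threshold."""
    col = P_GRID.index(TRUE_P)
    return [s for s, row in zip(S_GRID, SURFACE) if row[col] <= threshold_hz]

for s, row in zip(S_GRID, SURFACE):
    print(s, [round(e, 1) for e in row])
```

Calling `low_error_slopes` with a tolerance comparable to the formant measurement error gives a rough count of how many slope values are indistinguishable for one token.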

25 24 / 27 9. Results
Coarticulation Parameter Results: As a result, s shows differences in mean for different consonants, but high variance.
[Figure: mean and standard deviation of second-formant s1 values for 10 phonemes]
Therefore, s values do not cluster well and cannot be reliably extracted for a single CVC token.

26 25 / 27 10. Conclusions & Future Work
Conclusions:
1. Estimation of consonant and vowel targets can be performed reliably when estimating over a large number of CVCs.
2. Estimation of the coarticulation parameter s cannot be performed reliably (yet) for a single CVC.
3. Therefore, formant targets cannot (yet) be reliably estimated for a single token, which is necessary to apply this work to automatic speech recognition.

27 26 / 27 10. Conclusions & Future Work
Future Work:
1. Determine and apply constraints on s, so that coarticulation parameters and formant targets can be reliably estimated for a single token.
2. Given estimated targets for a CVC, estimate the probability of these targets given each phoneme: p(target | phoneme).
3. Use these probabilities instead of the probabilities currently used in speech recognition, p(observed data at 10-msec frame | phoneme).
4. Expand to recognize arbitrary-length phoneme sequences and use non-formant features.

28 27 / 27 Acknowledgements
This material is based upon work supported by the National Science Foundation under Grant IIS-091575. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

