0 / 27 Towards the Recovery of Targets from Coarticulated Speech for Automatic Speech Recognition
John-Paul Hosom (1), Alexander Kain, Brian O. Bush
Center for Spoken Language Understanding (CSLU), Department of Biomedical Engineering (BME), Oregon Health & Science University (OHSU)
(1) John-Paul Hosom is now with Sensory, Inc.

1 / 27 Outline
1. Introduction: Error Analysis of TIMIT ASR
2. Introduction: Hypothesis
3. Background: Characteristics of Clear Speech
4. Background: Formant Targets and Locus Theory
5. Objectives of Current Study
6. Corpus
7. Model
8. Methods
9. Results
10. Conclusions & Future Work

2 / 27 Introduction: Error Analysis of TIMIT
Phoneme ASR of TIMIT: HMM/ANN system trained on TIMIT, decoded with a bigram language model [Hosom et al., 2010]. Accuracy of 74% is high for HMM-based systems.
Vowel substitutions account for 35% of errors, covering all distinctive-phonetic feature dimensions:
– 62% of vowel substitutions have a front/back error
– 52% of vowel substitutions have a tense/lax error
– 68% of vowel substitutions have a height error
– 31% of vowel substitutions have a vowel/consonant error
Errors are not confined to a specific type.
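This kind of feature-dimension breakdown can be computed mechanically from a substitution confusion matrix. A minimal sketch in Python, assuming a hypothetical feature table and substitution counts (the real analysis used the full TIMIT confusion data); note that the percentages can sum to more than 100% because one substitution may violate several dimensions at once:

    # Tally which distinctive-feature dimensions are violated by vowel
    # substitutions, given (reference, hypothesis) counts from a confusion
    # matrix.  The feature table and counts are illustrative, not TIMIT data.
    FEATURES = {  # phoneme: (front/back, tense/lax, height)
        "iy": ("front", "tense", "high"),
        "ih": ("front", "lax", "high"),
        "eh": ("front", "lax", "mid"),
        "uw": ("back", "tense", "high"),
    }
    SUBSTITUTIONS = {("iy", "ih"): 40, ("ih", "eh"): 25, ("uw", "ih"): 10}
    DIMS = ("front/back", "tense/lax", "height")

    def feature_error_rates(subs, feats):
        total = sum(subs.values())
        errors = dict.fromkeys(DIMS, 0)
        for (ref, hyp), count in subs.items():
            for dim, f_ref, f_hyp in zip(DIMS, feats[ref], feats[hyp]):
                if f_ref != f_hyp:
                    errors[dim] += count
        return {dim: n / total for dim, n in errors.items()}

    print(feature_error_rates(SUBSTITUTIONS, FEATURES))
    # {'front/back': 0.13..., 'tense/lax': 0.67..., 'height': 0.33...}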

3 / 27 Introduction: Hypothesis
Hypothesis: a main cause of ASR errors is not the feature space or the classification technique (which would yield more distinct error patterns), but "noise" in the probability estimates. This noise is caused by variability in the features, which can be reduced by estimating phoneme targets instead of using the observed values.
For now, we work in the formant space, although targets need not be limited to this space. Also, to control for speaking style, we are interested in both "clear" and "conversational" speech.

4 / 27 3. Characteristics of Clear Speech
[Vowel-space plots of observed vowel midpoints]
Clear speech: observed midpoints of vowels, accuracy 90%
Conversational speech: observed midpoints of vowels, accuracy 73%
"Clear" speech has an expanded vowel space and longer phoneme durations.

5 / 27 3. Characteristics of Clear Speech
Using a "hybridization" algorithm that combined features of CLR (clear) and CNV (conversational) speech, together with perceptual testing, we have shown over several experiments that the most relevant features for intelligibility are the combination of spectrum and duration. [Kain, Amano-Kusumoto, and Hosom (2008); Kusumoto, Kain, Hosom, and van Santen (2007); Amano-Kusumoto and Hosom (2009)]
This has led us to study a model of coarticulation, to quantitatively model the change of formants over time.

6 / 27 4. Formant Targets and Locus Theory
[Figure from Klatt 1987, p. 753]

7 / 27 4. Formant Targets and Locus Theory
[Schematic spectrogram of /d u/: formant frequency over time]
Most consonants (all except /j/, /l/, /r/, /w/) do not have visible formants. They have "virtual" formants identified by coarticulation in the vowel.

8 / 27 4. Formant Targets and Locus Theory
[Figure from Delattre et al., 1955, as reported in Johnson, 1997, p. 135]

9 / 27 4. Formant Targets and Locus Theory
Locus Theory Summary:
1. Vowels and consonants have formant targets; most consonants have "virtual" formants.
2. Coarticulation yields smooth change between targets when formants are visible.
3. If duration is too short, formants do not reach their targets, yielding undershoot.
4. Both the targets and the rate of change are important for intelligibility.

10 / 27 Outline
1. Introduction
2. Background: Error Analysis of TIMIT ASR
3. Background: Characteristics of Clear Speech
4. Background: Formant Targets and Locus Theory
5. Objectives of Current Study
6. Corpus
7. Model
8. Methods
9. Results
10. Conclusions & Future Work

11 / 27 5. Objectives of Current Study
Estimate each phoneme's target instead of relying on observed data. Using targets will reduce the variance of the features, yielding conversational (and clear) speech with feature overlap similar to that of clear speech.
Given a GMM of target values (instead of observed values), compute the probability of each token's estimated targets given each possible phoneme; a sketch of this step follows below.
Use a semi-Markov model to address the trajectory of features.
In a real system, move from formants to another feature domain.
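A minimal sketch of the intended classification step, under the assumption that each phoneme's estimated (F1, F2, F3) targets are collected into a matrix; scikit-learn's GaussianMixture stands in for whatever density model is finally chosen, and all values are illustrative:

    # Fit one GMM per phoneme on estimated target vectors, then classify a
    # token's estimated target by maximum likelihood.  Training data here is
    # synthetic; a real system would use targets estimated from the corpus.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    train = {  # phoneme -> (n_tokens, 3) array of F1, F2, F3 targets in Hz
        "iy": rng.normal([300.0, 2300.0, 3000.0], 50.0, size=(60, 3)),
        "aa": rng.normal([700.0, 1200.0, 2600.0], 50.0, size=(60, 3)),
        "uw": rng.normal([320.0, 900.0, 2400.0], 50.0, size=(60, 3)),
    }
    models = {ph: GaussianMixture(n_components=1).fit(X)
              for ph, X in train.items()}

    def classify(target):
        """Return the phoneme maximizing log p(estimated target | phoneme)."""
        return max(models,
                   key=lambda ph: models[ph].score_samples(target[None, :])[0])

    print(classify(np.array([310.0, 2250.0, 2950.0])))  # -> iy

The key difference from a conventional system is that the GMMs are trained on, and evaluated against, estimated targets rather than frame-level observations, so the within-phoneme variance, and hence the "noise" in the probability estimates, should shrink.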

12 / 27 5. Objectives of Current Study
[Figure: mean and standard deviation of target values]

13 / 27 6. Corpus
Corpus:
1. One male and one female speaker.
2. Sentences contain a neutral carrier phrase (5 total) followed by a target word (242 total).
3. Target words are common English CVC words with 23 initial and final consonants and 8 vowels.
4. All sentences spoken in both clear and conversational styles.
5. Two recordings per style of each sentence.
6. Formants and phoneme boundaries automatically estimated, then manually corrected and verified.

14 / 27 7. Model
Formant trajectory model:
\hat{F}(t) = \alpha_1 (1 - \sigma_1(t)) T_{C1} + [1 - \alpha_1 (1 - \sigma_1(t)) - \alpha_2 \sigma_2(t)] T_V + \alpha_2 \sigma_2(t) T_{C2}, \qquad \sigma_i(t) = \frac{1}{1 + e^{-s_i (t - p_i)}}
where \hat{F}(t) is the estimated formant trajectory over time t; T_{C1}, T_V, T_{C2} are the target formant values for C1, V, and C2; \alpha_i is the degree of articulation of C1 or C2; \sigma_i(t) is a sigmoid function over time t; s_i is the maximum slope of \sigma_i; and p_i is the time position of s_i.
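A runnable sketch of this model follows; the exact blending form is an assumption consistent with the variable definitions above, not copied from the paper:

    # Formant trajectory model: two sigmoids blend the C1 and C2 targets with
    # the vowel target.  All parameter values are illustrative only.
    import numpy as np

    def sigmoid(t, s, p):
        """Sigmoid over time t with maximum slope s at time p."""
        return 1.0 / (1.0 + np.exp(-s * (t - p)))

    def formant_trajectory(t, T_c1, T_v, T_c2, alpha1, alpha2, s1, p1, s2, p2):
        """Estimated formant trajectory F_hat(t) for one C1-V-C2 token."""
        w_c1 = alpha1 * (1.0 - sigmoid(t, s1, p1))  # C1 influence decays
        w_c2 = alpha2 * sigmoid(t, s2, p2)          # C2 influence grows
        w_v = 1.0 - w_c1 - w_c2                     # vowel takes the remainder
        return w_c1 * T_c1 + w_v * T_v + w_c2 * T_c2

    t = np.linspace(0.0, 0.25, 100)  # a 250-ms token
    # Hypothetical F2 values (Hz): /d/ locus ~1800, vowel /u/ target ~900
    f2 = formant_trajectory(t, 1800.0, 900.0, 1800.0, 0.9, 0.9,
                            80.0, 0.05, 80.0, 0.20)
    print(round(float(f2.min()), 1))  # stays above 900 Hz: target undershoot

In this sketch, the maximum of the vowel weight w_v over t corresponds to the "maximum contribution of the vowel" histogrammed later in the Results.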

15 / 27 7. Model
Formant trajectory model: [figure]

16 / 27 7. Model
Formant trajectory model: [figure]

17 / 27 8. Methods
Estimating Model Parameters:
1. Two sets of parameters to estimate: (a) s_1, s_2, p_1, p_2, estimated on a per-token basis; (b) T_{C1}, T_V, T_{C2}, estimated on a per-token basis (independent of speaking style) and then averaged.
2. For one token, the error is the distance between the estimated trajectory \hat{F}(t) and the observed formants F(t) (e.g., E = \sum_t (\hat{F}(t) - F(t))^2).
3. A genetic algorithm is used to find the best estimates. The fitness function is the error summed over all tokens.

18 / 27 8. Methods
Estimating Model Parameters (continued):
4. The genetic algorithm employs mutation, crossover (exchange of a group of parameters), and elitism (the best solutions are retained in the next generation); see the sketch below.
5. Data partitioned into 20 folds for n-fold validation.
6. Performed 60 randomly-initialized starts of parameter estimation to get 60 points per phoneme.
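A compact sketch of the estimation loop, reusing formant_trajectory from the Model sketch above; the GA settings, parameter bounds, and the single-token, fixed-target setup are simplifications for illustration:

    # Genetic algorithm with mutation, crossover, and elitism, fitting the
    # per-token coarticulation parameters (s1, p1, s2, p2) with targets fixed.
    import numpy as np

    rng = np.random.default_rng(1)
    t = np.linspace(0.0, 0.25, 100)
    observed = formant_trajectory(t, 1800.0, 900.0, 1800.0, 0.9, 0.9,
                                  60.0, 0.06, 70.0, 0.19)

    LO = np.array([10.0, 0.000, 10.0, 0.125])    # lower bounds: s1, p1, s2, p2
    HI = np.array([200.0, 0.125, 200.0, 0.250])  # upper bounds

    def error(ind):
        s1, p1, s2, p2 = ind
        est = formant_trajectory(t, 1800.0, 900.0, 1800.0, 0.9, 0.9,
                                 s1, p1, s2, p2)
        return float(np.sum((est - observed) ** 2))

    pop = rng.uniform(LO, HI, size=(100, 4))
    for _ in range(200):
        order = np.argsort([error(ind) for ind in pop])
        elite = pop[order[:10]]                 # elitism: best solutions survive
        parents = pop[order[:50]]
        a = parents[rng.integers(0, 50, size=90)]
        b = parents[rng.integers(0, 50, size=90)]
        children = np.where(rng.random((90, 4)) < 0.5, a, b)          # crossover
        children += rng.normal(0.0, 0.02, children.shape) * (HI - LO) # mutation
        pop = np.vstack([elite, np.clip(children, LO, HI)])

    best = min(pop, key=error)
    print(np.round(best, 3), round(error(best), 2))

In the actual setup, the targets are estimated jointly across tokens, the fitness is the error summed over all tokens, and each of the 60 randomly-initialized starts yields one target estimate per phoneme.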

19 / 27 9. Results
Histograms of vowel contribution: maximum contribution of the vowel for both speaking styles (enforced minimum value of 0.6).
[Histogram panels: clear speech; conversational speech]

20 / 27 9. Results
Target Estimation Results:
[Figure: estimated targets for vowels (60 points per phoneme)]

21 / 27 9. Results
Target Estimation Results:
[Figure: estimated targets for C1 (60 points per phoneme)]

22 / 27 9. Results
Target Estimation Results:
Vowel classification accuracy on training data: xx.x% (xx.x% CLR, xx.x% CNV)
Vowel classification accuracy on test data: xx.x% (xx.x% CLR, xx.x% CNV)

23 / 27 9. Results
Coarticulation Parameter Results:
[Plot: error surface between the estimated model and the observed data as a function of s and p]
A low error can be obtained for many values of s; see the sketch below.
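This flat valley along s is easy to reproduce with a small grid evaluation; a standalone sketch with a single C1-to-V sigmoid transition and noisy synthetic "observations" (all values illustrative):

    # Evaluate model-vs-observed error on a grid of (s, p) values for a single
    # sigmoid transition; the error varies far less along s than along p.
    import numpy as np

    rng = np.random.default_rng(2)
    t = np.linspace(0.0, 0.25, 100)

    def transition(s, p):  # F2 moving from a ~1800 Hz locus to a ~900 Hz vowel
        return 1800.0 + (900.0 - 1800.0) / (1.0 + np.exp(-s * (t - p)))

    observed = transition(60.0, 0.06) + rng.normal(0.0, 20.0, t.size)  # noisy
    s_grid = np.linspace(20.0, 200.0, 50)
    p_grid = np.linspace(0.02, 0.10, 50)
    err = np.array([[np.sum((transition(s, p) - observed) ** 2)
                     for s in s_grid] for p in p_grid])

    i, j = np.unravel_index(err.argmin(), err.shape)
    print("spread along s:", err[i, :].max() / err.min())  # small ratio
    print("spread along p:", err[:, j].max() / err.min())  # much larger ratio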

24 / 27 9. Results
Coarticulation Parameter Results:
[Figure: mean and standard deviation of second-formant s_1 values for 10 phonemes]
As a result, s shows differences in mean for different consonants, but high variance. Therefore, s values do not cluster well, and s values cannot be reliably extracted for a single CVC token.

25 / 27 10. Conclusions & Future Work
Conclusions:
1. Estimation of consonant and vowel targets can be performed reliably when estimating over a large number of CVCs.
2. Estimation of the coarticulation parameter s cannot (yet) be performed reliably for a single CVC.
3. Therefore, formant targets cannot (yet) be reliably estimated for a single token, which is necessary for applying this work to automatic speech recognition.

26 / 27 10. Conclusions & Future Work
Future Work:
1. Determine and apply constraints on s, so that coarticulation parameters and formant targets can be reliably estimated for a single token.
2. Given estimated targets for a CVC, estimate the probability of these targets given each phoneme: p(target | phoneme).
3. Use these probabilities instead of the probabilities currently used in speech recognition, p(observed data at 10-msec frame | phoneme).
4. Expand to recognize arbitrary-length phoneme sequences and to use non-formant features.

27 / 27 Acknowledgements
This material is based upon work supported by the National Science Foundation under Grant IIS. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.