Robust speaking rate estimation using broad phonetic class recognition
Jiahong Yuan and Mark Liberman, University of Pennsylvania
Mar. 16, 2010

Introduction

Speaking rate has been found to be related to many factors (Yuan et al. 2006, Jacewicz et al. 2009):
- young people > old people
- northern speakers > southern speakers (American English)
- male speakers > female speakers
- long utterances > short utterances
- emotion, style, conversation topics, foreign accent, etc.

Listeners 'normalize' speaking rate in speech perception (Miller and Liberman 1979), and speaking rate affects listeners' attitudes toward the speaker and the message (Megehee et al. 2003).

Speaking rate also affects the performance of automatic speech recognition: both fast and slow speech lead to higher word error rates (Siegler and Stern 1995, Mirghafori et al. 1996).

Introduction

The conventional method for building a robust speaking rate estimator is syllable detection based on energy measurements and peak-picking algorithms (Mermelstein 1975, Morgan and Fosler-Lussier 1998, Xie and Niyogi 2006, Wang and Narayanan 2007, Zhang and Glass 2009). These studies have utilized full-band energy, sub-band energy, and sub-band energy correlation for syllable detection. Howitt (2000) demonstrated that energy in a fixed frequency band was as good for finding vowel landmarks as the energy at the first formant. Our study of syllable detection using the convex-hull algorithm (Mermelstein 1975) also shows that this frequency band gives the best results.
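For concreteness, here is a minimal Python sketch of a detector in this peak-picking family (it is not the convex-hull algorithm itself). The band edges, frame size, smoothing width, and peak thresholds are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of an energy-based syllable-nucleus detector in the
# peak-picking family (cf. Mermelstein 1975; Wang and Narayanan 2007).
# All numeric settings below are illustrative assumptions.
import numpy as np
from scipy.signal import butter, sosfilt, find_peaks

def detect_syllable_peaks(wav, sr, lo=300.0, hi=900.0,
                          frame_ms=10.0, smooth_frames=5):
    """Return frame indices of energy peaks, a rough proxy for syllable nuclei."""
    # Band-pass to a low-frequency band where vowel energy dominates.
    sos = butter(4, [lo / (sr / 2), hi / (sr / 2)], btype="band", output="sos")
    band = sosfilt(sos, wav)

    # Short-time energy envelope.
    hop = int(sr * frame_ms / 1000)
    n_frames = len(band) // hop
    energy = np.array([np.sum(band[i * hop:(i + 1) * hop] ** 2)
                       for i in range(n_frames)])

    # Light smoothing so small ripples do not become spurious peaks.
    kernel = np.ones(smooth_frames) / smooth_frames
    envelope = np.convolve(energy, kernel, mode="same")

    # Peaks must stand clearly above the utterance's median energy and be
    # at least 100 ms apart.
    peaks, _ = find_peaks(envelope, height=2.0 * np.median(envelope),
                          distance=int(100 / frame_ms))
    return peaks
```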

Introduction

Using automatic speech recognition for speaking rate estimation would be a natural approach. However:
- The performance of ASR is strongly affected by speaking rate.
- ASR only works well when the training and test data come from the same speech genre, dialect, and language.

For speaking rate estimation, what matters is not the recognition word error rate (WER) or phone error rate: a recognizer that can robustly distinguish between vowels and consonants is sufficient.

→ broad phonetic class recognition for speaking rate estimation

Introduction

Broad phonetic classes have more distinct spectral characteristics than the individual phones within each class: almost 80% of misclassified phonemes have been found to fall within the same broad phonetic class (Halberstadt and Glass 1997). Broad phonetic classes have been applied to improve phone recognition and have been shown to be more robust in noise (Scanlon et al. 2007, Sainath and Zue 2008). They have also been used in large-vocabulary ASR to address data sparsity and robustness, e.g., decision-tree-based clustering with broad phonetic classes.

Data and Method

A broad phonetic class recognizer was built using 34,656 speaker turns from the SCOTUS corpus (~66 hours). The speaker turns were first force-aligned using the Penn Phonetics Lab Forced Aligner, and the aligned phones were then mapped to broad phonetic classes for training.

The acoustic models are three-state monophone-style broad-class HMMs; each HMM has 64 Gaussian mixture components over 39 PLP coefficients. The language model consists of broad-class bigram probabilities.

For comparison, a general monophone recognizer was also built on the same data. Training was done with the HTK toolkit, and the HVite tool in HTK was used for testing.
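One re-estimation pass of this kind of HTK training can be driven as sketched below. This is a hedged sketch only: all file names (hmm0/hmmdefs, hmm1, train.mlf, train.scp, classlist) are placeholders, not paths from the study.

```python
# Sketch of one HTK Baum-Welch re-estimation pass over the broad-class HMMs,
# assuming PLP features, a training-file list, broad-class label
# transcriptions, and an initialized model set already exist.
import subprocess

subprocess.run([
    "HERest",
    "-H", "hmm0/hmmdefs",   # current model parameters
    "-M", "hmm1",           # directory for the re-estimated models
    "-I", "train.mlf",      # broad-class label transcriptions
    "-S", "train.scp",      # list of training feature files
    "classlist",            # list of the broad-class HMM names
], check=True)
```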

Data and Method

Class | Phonetic categorization | CMU dictionary phones   | Number of tokens
V1    | Stressed vowels         | vowel classes 1 and 2   | 447,665
V0    | Unstressed vowels       | vowel class 0           | 336,278
S     | Stops and affricates    | B CH D G JH K P T       | 418,994
F     | Fricatives              | DH F HH S SH TH V Z ZH  | 352,968
N     | Nasals                  | M N NG                  | 208,178
G     | Glides and liquids      | L R W Y                 | 203,683
P     | Pauses and non-speech   | --                      | 149,268
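The table's phone-to-class mapping can be written down directly. A minimal sketch follows; the pause labels "sil" and "sp" are assumed HTK-style names, not labels taken from the paper.

```python
# Mapping from CMU dictionary phones to the seven broad classes in the table
# above. CMU vowels carry a stress digit: primary (1) and secondary (2)
# stress map to V1, and no stress (0) maps to V0.
BROAD_CLASS = {
    **{p: "S" for p in "B CH D G JH K P T".split()},
    **{p: "F" for p in "DH F HH S SH TH V Z ZH".split()},
    **{p: "N" for p in "M N NG".split()},
    **{p: "G" for p in "L R W Y".split()},
    "sil": "P", "sp": "P",   # pause/non-speech; label names are assumptions
}

def broad_class(phone: str) -> str:
    """Map one aligned phone label (e.g. 'AE1', 'ZH') to its broad class."""
    if phone[-1] in "12":    # stressed vowel
        return "V1"
    if phone[-1] == "0":     # unstressed vowel
        return "V0"
    return BROAD_CLASS[phone]
```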

Evaluation on TIMIT

There is no standard scoring toolkit for syllable detection evaluation. We follow the evaluation method of Xie and Niyogi (2006):
- Find the middle points of the vowel segments in the recognition output.
- A point is counted as correct if it is located within a syllabic segment; otherwise it is counted as incorrect.
- If two or more points are located within the same syllabic segment, only one of them is counted as correct and the others as incorrect.
- The incorrect points are insertion errors; syllabic segments that have no correct point are deletion errors.
- Deletion and insertion error rates are both calculated against the number of syllabic segments in the test data.
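A sketch of this scoring procedure, assuming both the recognizer's vowel segments and the reference syllabic segments are given as (start, end) pairs in seconds:

```python
# Scoring as described above (after Xie and Niyogi 2006): vowel midpoints
# are matched against reference syllabic segments; at most one midpoint per
# segment counts as correct.
def score(vowel_segments, syllabic_segments):
    """Both arguments: lists of (start, end) pairs in seconds."""
    midpoints = [(s + e) / 2.0 for s, e in vowel_segments]
    hits = [0] * len(syllabic_segments)   # correct midpoints per segment
    insertions = 0
    for m in midpoints:
        for i, (s, e) in enumerate(syllabic_segments):
            if s <= m <= e:
                if hits[i] == 0:
                    hits[i] = 1           # first midpoint in segment: correct
                else:
                    insertions += 1       # extra midpoints: insertions
                break
        else:
            insertions += 1               # midpoint outside every segment
    deletions = hits.count(0)             # segments with no correct midpoint
    n = len(syllabic_segments)
    return insertions / n, deletions / n  # rates against reference segments
```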

Evaluation on TIMIT

The test data contains 1,344 utterances and 17,190 syllabic segments: all utterances in the TIMIT test set, excluding the SA1 and SA2 utterances.

Effect of Language Model

The language model has a larger effect on monophone recognition than on broad phonetic class recognition. In the following experiments with broad phonetic class models, the grammar scale factor was set to 2.5.
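In HTK, the grammar scale factor is HVite's -s option. A sketch of a decoding call with it set to 2.5 follows; all file names are placeholders, not paths from the paper.

```python
# Sketch of invoking HTK's HVite decoder with the grammar scale factor 2.5.
import subprocess

subprocess.run([
    "HVite",
    "-H", "hmmdefs",        # trained broad-class HMM definitions
    "-S", "test.scp",       # list of test feature files
    "-i", "out.mlf",        # recognition output (master label file)
    "-w", "wdnet",          # bigram network over broad classes
    "-s", "2.5",            # grammar scale factor
    "dict", "classlist",    # pronunciation dict and broad-class list
], check=True)
```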

Error Analysis

There were 7,448 outside insertions in total, among which:
- /r, l, y, w/: 3,635 (48.8%)
- /q/: 1,411 (18.9%), a glottal stop that "may be an allophone of t, or may mark an initial vowel or a vowel-vowel boundary"

The syllabic nasals and laterals /el, em, en, eng/ and the schwa vowels /ax, ax-h, ax-r/ are more likely to be deleted. The diphthongs /aw, ay, ey, ow, oy/ are more likely to have inside insertions.

Evaluation on Switchboard

The ICSI manual transcription portion of the Switchboard telephone conversation speech was used for testing. We ran the broad-class recognizer on the entire utterances and let the recognizer handle the pauses and non-speech segments within them.

To calculate the detected speaking rate, we simply counted the number of vowels (both V1 and V0) in the recognition output for an utterance and divided that count by the length of the utterance.
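This rate computation is a one-liner; a sketch, assuming the recognizer's output has been reduced to a list of broad-class labels:

```python
# The speaking-rate measure described above: the number of recognized vowels
# (V1 or V0) in an utterance divided by the utterance's duration.
def speaking_rate(labels, duration_sec):
    """labels: sequence of broad-class labels for one utterance."""
    n_vowels = sum(1 for lab in labels if lab in ("V1", "V0"))
    return n_vowels / duration_sec

# e.g. two vowels in a 1.25 s utterance -> 1.6 vowels (syllables) per second
print(speaking_rate(["P", "S", "V1", "N", "V0", "F", "P"], 1.25))
```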

Evaluation on Foreign-Accented English

200 self-introductions selected from the CSLU Foreign Accented English corpus were used for testing.

Correlation: 0.898; mean error: -0.01; stddev of error: 0.36.
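These statistics can be computed as below; a sketch, assuming "error" means estimated minus reference rate (the sign convention is an assumption).

```python
# Correlation, mean error, and error standard deviation between estimated
# and reference per-utterance speaking rates (syllables per second).
import numpy as np

def rate_metrics(estimated, reference):
    est, ref = np.asarray(estimated), np.asarray(reference)
    corr = np.corrcoef(est, ref)[0, 1]   # Pearson correlation
    err = est - ref                      # assumed sign convention
    return corr, err.mean(), err.std()
```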

Evaluation on Mandarin Broadcast News

5,000 utterances randomly selected from the Hub-4 Mandarin Broadcast News corpus were used for testing. No language models were involved.

Correlation: 0.755; mean error: 0.055; stddev of error: 0.730.

Conclusion

We built a broad phonetic class recognizer and applied it to syllable detection and speaking rate estimation. Its performance is comparable to state-of-the-art syllable detection and speaking rate estimation algorithms, and it is robust across speech genres and languages without any parameter tuning.

Unlike previous algorithms, the broad phonetic class recognizer automatically handles pauses and non-speech segments, which is a great advantage for estimating speaking rate in natural speech.

Even with no language model involved, the broad-class recognizer still performs well on syllable detection and speaking rate estimation, which opens up many opportunities for application.