On the Correlation between Energy and Pitch Accent in Read English Speech
Andrew Rosenberg, Julia Hirschberg
Columbia University
Interspeech 2006, 9/14/06



Talk Outline
- Introduction to Pitch Accent
- Previous Work
- Contribution and Approach
- Corpus
- Results and Discussion
- Conclusion
- Future Work

Introduction
- Pitch accent is the way a word is made to "stand out" from its surrounding utterance, as opposed to lexical stress, which refers to the most prominent syllable within a word.
- Accurate detection of pitch accent is particularly important to many NLU tasks:
  - Identification of salient or "important" words
  - Indication of information status
  - Disambiguation of syntax/semantics
- Pitch (f0), duration, and energy are all known correlates of pitch accent.
[Figure: example contours of a word deaccented vs. accented]

Previous Work
- Sluijter and van Heuven (96, 97): accent in Dutch strongly correlates with the energy of a word extracted from the frequency subband above 500 Hz.
- Heldner et al. (99, 01) and Fant et al. (00) found that high-frequency emphasis, or spectral tilt, strongly correlates with accent in Swedish.
- Much research attention has been given to the automatic identification of prominent or accented words:
  - Tamburini (03, 05) used the energy component of the 500 Hz-2000 Hz band.
  - Tepperman (05) used the RMS energy from the 60 Hz-400 Hz band.
  - And many more...

Contribution and Approach
- There is no agreement as to the best, i.e. most discriminative, frequency subband from which to extract energy information.
- We set up a battery of analysis-by-classification experiments varying:
  - The frequency band: the lower bound ranged from 0 to 19 bark and the bandwidth from 1 to 20 bark; the upper bound was capped at 20 bark by the 8 kHz Nyquist rate. We also analyzed energy around the first and/or second formants.
  - The region of analysis: full word, only vowels, longest syllable, longest vowel.
  - The speaker: each of 4 speakers separately, and all together.
- We performed the experiments using J48, a Java implementation of C4.5.
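The subband search described above can be sketched as follows. `subband_grid` is a hypothetical helper enumerating the (lower bound, bandwidth) grid in bark; `bark_to_hz` uses Traunmüller's approximation, one common Bark-scale variant, chosen here as an assumption since the slides do not name a specific conversion.

```python
def bark_to_hz(z):
    """Inverse Bark transform (Traunmueller's approximation; an assumed variant)."""
    return 1960.0 * (z + 0.53) / (26.28 - z)

def subband_grid(max_bark=20, min_bw=1, max_bw=20):
    """Enumerate candidate (lower, upper) band edges in bark for the search:
    lower bound 0..19 bark, bandwidth 1..20 bark, capped at 20 bark."""
    bands = []
    for lo in range(0, max_bark):
        for bw in range(min_bw, max_bw + 1):
            hi = lo + bw
            if hi <= max_bark:  # 20 bark is roughly 6.4 kHz, below the 8 kHz Nyquist rate
                bands.append((lo, hi))
    return bands

for lo, hi in subband_grid()[:3]:
    print(lo, hi, round(bark_to_hz(lo)), round(bark_to_hz(hi)))
```

Each (lo, hi) pair would define one band-pass energy extraction followed by one J48 classification run.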

Contribution and Approach (cont'd)
- Local features:
  - Minimum, maximum, mean, standard deviation, and RMS of energy
  - z-score ((x - mean) / std.dev) of the maximum energy within the word
- Context-based features, using 6 windows over neighboring words:
  - The max and mean energy were normalized by z-score ((x - mean) / std.dev) and by the energy range within the window (x / (max - min)).
[Diagram: context windows spanning words i-2 through i+2]

Corpus
- Boston Directions Corpus (BDC) (Hirschberg & Nakatani 96)
- Speech elicited from a direction-giving task; only the read portion was used.
- 50 minutes; 10,825 manually segmented words; fully ToBI labeled.
- 4 speakers: 3 male, 1 female.

Variation across subbands
- Energy from different frequency regions predicts pitch accent differently.
- Across experiment configurations, the mean relative improvement of the best region over the worst was 14.8%.
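The 14.8% figure is a mean of per-configuration relative improvements; a small sketch of that computation, with made-up accuracy pairs (not the paper's actual numbers):

```python
def mean_relative_improvement(pairs):
    """Mean percent relative improvement of best over worst accuracy,
    averaged across (best, worst) pairs, one per experiment configuration."""
    return sum(100.0 * (b - w) / w for b, w in pairs) / len(pairs)

# Hypothetical per-configuration (best, worst) accuracies, for illustration only:
configs = [(78.0, 68.0), (80.0, 70.0)]
print(round(mean_relative_improvement(configs), 1))
```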

The most predictive subband
- The single most predictive subband for all speakers was 3-18 bark over full words.
  - Classification accuracy: 76% (P=71.6, R=73.4), against a 57.6% majority-class baseline (no accent).
- However, it performs significantly worse than the best subband when analyzing the speech of one speaker in particular: speaker h2 (not the female speaker).

The most robust subband
- The subband from 2-20 bark performs as well as the most discriminative subband in all but one configuration (h1, longest vowel).
  - Accuracy: 75.5% (P=70.5, R=72.5).
  - Due to its robustness, we consider this band the "best".
- The formant-based energy features perform worse than fixed bands:
  - 6.4% mean accuracy reduction from 2-20 bark.
  - Attributable to errors in the formant tracking algorithm and to the presence of discriminative information in higher formants.

Contextual windows
- The most predictive features were z-score normalized maximum energy relative to three contextual windows:
  - 1 previous and 1 following word
  - 2 previous and 1 following word
  - 2 previous and 2 following words
[Diagram: context windows spanning words i-2 through i+2]

Combining predictions
- There is a relatively small intersection of correct predictions, even among similar subbands; most words were correctly classified by at least one classifier.
- Using a majority voting scheme: accuracy 81.9% (P=76.7, R=82.5).
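The voting step itself is simple to sketch: each subband classifier casts one vote per word and the plurality label wins. This is a minimal illustration of the scheme, with assumed label names, not the authors' exact code.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-subband classifier outputs for one word by plurality vote.
    With an odd number of voters and two labels there is no tie."""
    return Counter(predictions).most_common(1)[0][0]

# e.g. three subband classifiers disagree on a word:
label = majority_vote(["accent", "none", "accent"])
```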

Region of analysis
- How do the regioning strategies perform? Full word > only vowels > longest syllable ~ longest vowel.
- Why does analysis of the full word outperform the other regioning strategies?
  - Syllable/vowel segmentation algorithms are imperfect.
  - Pitch accents are not neatly placed.
  - Duration is a crude measure of lexical stress.

Conclusion
- Using an analysis-by-classification approach, we showed:
  - Energy from different frequency bands correlates with pitch accent differently.
  - The "best" (high-accuracy, most robust) frequency region is 2-20 bark (>2 bark?).
  - A voting classifier based exclusively on energy can predict accent reliably.

Future Work
- Can we automatically identify which bands will predict accent best for a given word?
- We plan to incorporate these findings into a general pitch accent classifier with pitch and duration features.
- We plan to repeat these experiments on spontaneous speech data.

Thank you