On the Correlation between Energy and Pitch Accent in Read English Speech Andrew Rosenberg, Julia Hirschberg Columbia University Interspeech 2006 9/14/06.


2 On the Correlation between Energy and Pitch Accent in Read English Speech Andrew Rosenberg, Julia Hirschberg Columbia University Interspeech 2006 9/14/06

3 9/18/2006 Talk Outline
Introduction to Pitch Accent
Previous Work
Contribution and Approach
Corpus
Results and Discussion
Conclusion
Future Work

4 Introduction
Pitch Accent is the way a word is made to “stand out” from its surrounding utterance.
–As opposed to lexical stress, which refers to the most prominent syllable within a word.
Accurate detection of pitch accent is particularly important to many NLU tasks.
–Identification of salient or “important” words.
–Indication of Information Status.
–Disambiguation of Syntax/Semantics.
Pitch (f0), Duration, and Energy are all known correlates of Pitch Accent.
[Figure: panels labeled “deaccented” and “accented”]

5 Previous Work
(Sluijter and van Heuven 96, 97): Accent in Dutch strongly correlates with the energy of a word extracted from the frequency subband > 500Hz.
(Heldner, et al. 99, 01) and (Fant, et al. 00) found that high-frequency emphasis or spectral tilt strongly correlates with accent in Swedish.
A lot of research attention has been given to the automatic identification of prominent or accented words.
–(Tamburini 03, 05) used the energy component of the 500Hz-2000Hz band.
–(Tepperman 05) used the RMS energy from the 60Hz-400Hz band.
–And many more...

6 Contribution and Approach
There is no agreement as to the best -- most discriminative -- frequency subband from which to extract energy information.
We set up a battery of analysis-by-classification experiments varying:
–The frequency band:
  lower bound frequency ranged from 0 to 19 bark
  bandwidth ranged from 1 to 20 bark
  –upper bound was 20 bark by the 8KHz Nyquist rate
  Also analyzed the first and/or second formants.
–The region of analysis: full word, only vowels, longest syllable, longest vowel
–Speaker: each of 4 speakers separately, and all together.
We performed the experiments using J48 -- a Java implementation of C4.5.
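
The band grid described on this slide can be sketched as follows. This is only one reading of the slide: it assumes integer bark boundaries, every lower bound from 0 to 19, and every bandwidth that keeps the upper bound at or below 20 bark. The function names and the Zwicker-Terhardt Hz-to-bark approximation are assumptions for illustration; the talk does not say which conversion was used.

```python
import math

MAX_BARK = 20  # upper limit stated on the slide


def hz_to_bark(freq_hz):
    """Zwicker-Terhardt bark approximation (an assumption, not from the talk)."""
    return 13.0 * math.atan(0.00076 * freq_hz) + 3.5 * math.atan((freq_hz / 7500.0) ** 2)


def enumerate_subbands(max_bark=MAX_BARK):
    """List every (lower, upper) band in bark with upper <= max_bark."""
    bands = []
    for lower in range(0, max_bark):                  # lower bound: 0..19 bark
        for width in range(1, max_bark - lower + 1):  # bandwidth: 1..(20 - lower) bark
            bands.append((lower, lower + width))
    return bands


bands = enumerate_subbands()
print(len(bands))  # number of candidate bands under this reading
```

Under this reading the grid contains both bands singled out later in the talk, (3, 18) and (2, 20).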

7 Contribution and Approach
Local Features:
–minimum, maximum, mean, standard deviation and RMS of energy
–z score ((x – mean) / std.dev) of max energy within the word
Context-based Features:
–Using 6 windows:
  [Figure: context windows spanning word i-2, word i-1, word i, word i+1, word i+2]
–The max and mean energy were normalized by z score ((x – mean) / std.dev) and by the energy range within the window (x / (max – min))
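
A minimal sketch of the local features listed above, assuming each word is represented by a list of per-frame energy values. The function name, the dictionary keys, and the use of the population standard deviation are assumptions; the talk does not specify them.

```python
import math
import statistics


def local_energy_features(frame_energies):
    """Summary statistics over the per-frame energies of one word."""
    mean = statistics.fmean(frame_energies)
    std = statistics.pstdev(frame_energies)  # population std. dev. (assumed)
    peak = max(frame_energies)
    rms = math.sqrt(sum(e * e for e in frame_energies) / len(frame_energies))
    return {
        "min": min(frame_energies),
        "max": peak,
        "mean": mean,
        "std": std,
        "rms": rms,
        # z score of the maximum energy within the word: (x - mean) / std.dev
        "max_z": (peak - mean) / std if std > 0 else 0.0,
    }


# Hypothetical frame energies, for illustration only.
features = local_energy_features([1.0, 2.0, 3.0, 4.0])
```

Returning 0.0 when the standard deviation is zero is a convenience choice here, so that a word with constant energy does not cause a division error.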

8 Corpus
Boston Directions Corpus (BDC) [Hirschberg&Nakatani96]
–Speech elicited from a direction-giving task.
–Used only the read portion.
–50 minutes
–Fully ToBI labeled
–10825 words, manually segmented
–4 Speakers: 3 male, 1 female

9 Variation across subbands
Energy from different frequency regions predicts pitch accent differently
–Across experiment configurations, mean relative improvement of best region over worst: 14.8%
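
The slide does not define "mean relative improvement", so the sketch below assumes one plausible definition: for each experiment configuration, take (best accuracy - worst accuracy) / worst accuracy across subbands, then average over configurations. The function names and the accuracy values are hypothetical and are not results from the talk.

```python
def relative_improvement(best, worst):
    """Improvement of the best score over the worst, relative to the worst."""
    return (best - worst) / worst


def mean_relative_improvement(accuracies_by_config):
    """Average the best-over-worst relative improvement across configurations."""
    gains = [
        relative_improvement(max(scores), min(scores))
        for scores in accuracies_by_config.values()
    ]
    return sum(gains) / len(gains)


# Hypothetical per-subband accuracies for two made-up configurations.
example = {
    "config_a": [0.60, 0.75],
    "config_b": [0.50, 0.55, 0.60],
}
mean_gain = mean_relative_improvement(example)
```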

10 The most predictive subband
The single most predictive subband for all speakers was 3-18bark over full words
–Classification Accuracy: 76% (P=71.6, R=73.4)
  57.6% majority class baseline (no accent)
–However, it performs significantly worse than the best when analyzing the speech of one speaker in particular.
  Speaker h2, not the female speaker
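
The figures on this slide (accuracy, P, R, and the majority class baseline) can be computed along these lines. The sketch assumes that precision and recall are reported for the accented class, which the slide does not state; the label strings and function names are likewise assumptions.

```python
from collections import Counter


def accuracy_precision_recall(gold, predicted, positive="accent"):
    """Accuracy over all words; precision and recall for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


def majority_baseline(gold):
    """Accuracy of always predicting the most frequent gold label."""
    return Counter(gold).most_common(1)[0][1] / len(gold)


# Hypothetical labels for five words.
gold = ["accent", "no accent", "no accent", "accent", "no accent"]
pred = ["accent", "no accent", "accent", "no accent", "no accent"]
acc, prec, rec = accuracy_precision_recall(gold, pred)
baseline = majority_baseline(gold)
```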

11 The most robust subband
The subband from 2-20bark performs as well as the most discriminative subband in all but one configuration [h1-longest vowel]
–Accuracy: 75.5% (P=70.5, R=72.5)
–Due to its robustness we consider this band the “best”
The formant-based energy features perform worse than fixed bands
–6.4% mean accuracy reduction from 2-20bark
–Attributable to:
  Errors in the formant tracking algorithm
  The presence of discriminative information in higher formants

12 Contextual windows
Most predictive features were z-score normalized maximum energy relative to three contextual windows
–1 previous and 1 following word
–2 previous and 1 following word
–2 previous and 2 following words
[Figure: context windows spanning word i-2, word i-1, word i, word i+1, word i+2]
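
A rough sketch of context-window normalization for the three windows named on this slide. It assumes one maximum-energy value per word, that the window includes the current word, and that windows are clipped at utterance boundaries; none of these details are given in the talk, and the function names are hypothetical.

```python
import statistics

# (previous words, following words) for the three windows named on the slide
WINDOWS = [(1, 1), (2, 1), (2, 2)]


def window_values(values, i, n_prev, n_next):
    """Values for word i and its neighbours, clipped at the utterance edges."""
    start = max(0, i - n_prev)
    end = min(len(values), i + n_next + 1)
    return values[start:end]


def context_z_score(values, i, n_prev, n_next):
    """(x - mean) / std.dev of word i relative to its context window."""
    window = window_values(values, i, n_prev, n_next)
    mean = statistics.fmean(window)
    std = statistics.pstdev(window)
    return (values[i] - mean) / std if std > 0 else 0.0


def context_range_norm(values, i, n_prev, n_next):
    """x / (max - min) of word i relative to its context window."""
    window = window_values(values, i, n_prev, n_next)
    spread = max(window) - min(window)
    return values[i] / spread if spread > 0 else 0.0


# Hypothetical per-word maximum energies for a five-word utterance.
max_energy = [0.2, 0.9, 0.3, 0.4, 0.8]
z_scores = [context_z_score(max_energy, 1, p, n) for p, n in WINDOWS]
```

A word whose maximum energy stands well above its neighbours gets a large positive z score in each window, which matches the intuition that accented words stand out from their surroundings.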

13 Combining predictions
There is a relatively small intersection of correct predictions even among similar subbands.
10823 of 10825 words were correctly classified by at least one classifier.
Using a majority voting scheme:
–Accuracy: 81.9% (P=76.7, R=82.5)
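
A minimal sketch of the two ideas on this slide: majority voting over per-band classifiers, and the share of words that at least one classifier gets right. The tie-breaking rule (fall back to "no accent", the majority class), the label strings, and the function names are assumptions; the talk does not describe how ties were handled.

```python
from collections import Counter


def majority_vote(predictions_per_classifier, tie_label="no accent"):
    """Combine per-classifier label sequences into one sequence by majority."""
    voted = []
    for votes in zip(*predictions_per_classifier):
        counts = Counter(votes).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            voted.append(tie_label)  # assumed tie-break: majority class
        else:
            voted.append(counts[0][0])
    return voted


def oracle_coverage(gold, predictions_per_classifier):
    """Fraction of words labeled correctly by at least one classifier."""
    hits = sum(
        1
        for i, label in enumerate(gold)
        if any(preds[i] == label for preds in predictions_per_classifier)
    )
    return hits / len(gold)


# Hypothetical predictions from three classifiers over three words.
A, N = "accent", "no accent"
preds = [[A, N, A], [A, A, N], [N, N, N]]
combined = majority_vote(preds)
coverage = oracle_coverage([A, N, A], preds)
```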

14 Region of analysis
How do the regioning strategies perform?
Full Word > Only Vowels > Longest Syllable ~ Longest Vowel
Why does analysis of the full word outperform other regioning strategies?
–Syllable/Vowel segmentation algorithms are imperfect
–Pitch accents are not neatly placed
–Duration is a crude measure of lexical stress

15 Conclusion
Using an analysis-by-classification approach we showed:
–Energy from different frequency bands correlates with pitch accent differently.
–The “best” (high accuracy, most robust) frequency region is 2-20bark (>2bark?)
–A voting classifier based exclusively on energy can predict accent reliably.

16 Future Work
Can we automatically identify which bands will predict accent best for a given word?
We plan on incorporating these findings into a general pitch accent classifier with pitch and duration features.
We plan on repeating these experiments on spontaneous speech data.

17 Thank you {amaxwell, julia}@cs.columbia.edu

