On the Correlation between Energy and Pitch Accent in Read English Speech Andrew Rosenberg Weekly Speech Lab Talk 6/27/06.


1 On the Correlation between Energy and Pitch Accent in Read English Speech Andrew Rosenberg Weekly Speech Lab Talk 6/27/06

2 Talk Outline  Introduction to Pitch Accent  Previous Work  Contribution and Approach  Corpus  Results and Discussion  Conclusion  Future Work

3 Introduction  Pitch Accent is the way a word is made to “stand out” from its surrounding utterance.  This contrasts with lexical stress, which refers to the most prominent syllable within a word.  Accurate detection of pitch accent is particularly important to many NLU tasks.  Identification of “important” words.  Indication of Discourse Status and Structure.  Disambiguation of Syntax/Semantics.  Pitch (f0), Duration, and Energy are all known correlates of Pitch Accent.

4 Previous Work  Sluijter and van Heuven 96, 97 showed that accent in Dutch strongly correlates with the energy of a word extracted from the frequency subband > 500 Hz.  Heldner 99, 01 and Fant et al. 00 found that energy in a particular spectral region indicated accent in Swedish.  A lot of research attention has been given to the automatic identification of prominent or accented words.  Tamburini 03, 05 used the energy components of the 500 Hz-2000 Hz band.  Tepperman 05 used the RMS energy from the 60 Hz-400 Hz band.  Far too many others to mention here.

5 Contribution and Approach  There is no agreement as to the best -- most discriminative -- frequency subband from which to extract energy information.  We set up a battery of analysis-by-classification experiments varying:  The frequency band:  lower bound frequency ranged from 0 to 19 bark  bandwidth ranged from 1 to 20 bark  upper bound was limited to 20 bark by the 8 kHz Nyquist frequency  Also analyzed the first and/or second formants.  The region of analysis:  Full word, only syllable nuclei, longest syllable, longest syllable nuclei  Speaker:  Each of 4 speakers separately, and all together.  We performed the classification using J48 -- a Java implementation of C4.5.
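The band grid described on this slide can be sketched as follows. This is a minimal illustration, not the talk's code: the slides do not say which Hz-to-bark formula was used, so the Traunmüller (1990) approximation is assumed here, and the function names are hypothetical. If every lower-bound/bandwidth combination with an upper bound of at most 20 bark was run, the grid contains 210 bands; the slides do not state that count.

```python
def hz_to_bark(f_hz: float) -> float:
    """Traunmueller (1990) approximation; an assumption, not from the slides."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def bark_to_hz(z: float) -> float:
    """Algebraic inverse of hz_to_bark."""
    return 1960.0 * (z + 0.53) / (26.28 - z)


def enumerate_bands(max_bark: int = 20):
    """All (lower, upper) bark pairs with integer bounds and upper <= max_bark.

    Lower bound runs 0..max_bark-1, bandwidth runs 1..max_bark,
    and bands that would exceed max_bark are skipped.
    """
    return [
        (lo, lo + bw)
        for lo in range(0, max_bark)
        for bw in range(1, max_bark - lo + 1)
    ]


bands = enumerate_bands()
# Band edges in Hz for one band mentioned later in the talk (2-20 bark).
edges_hz = (bark_to_hz(2), bark_to_hz(20))
```

Each `(lower, upper)` pair would then define a band-pass region from which per-word energy is extracted before classification.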

6 Contribution and Approach  Local Features:  minimum, maximum, mean, standard deviation and RMS of energy  z-score of max energy within the word  mean slope  energy contour classification {rising, falling, peak, valley}  Context-based Features:  Use 6 contexts: (# previous words, # following words)  (2,2) (1,1) (1,0) (2,0) (0,1) (2,1)  (max_word - mean_region) / std.dev_region  (mean_word - mean_region) / std.dev_region  (max_word - max_region) / std.dev_region  max_word / (max_region - min_region)  mean_word / (max_region - min_region)
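A minimal sketch of how the local statistics and the five context-normalized features could be computed from per-word energy frames. Several details are assumptions not stated in the slides: the context region is taken to include the target word itself, the population standard deviation is used, and zero denominators fall back to 0.0. The function and key names are hypothetical.

```python
import math
from statistics import mean, pstdev

# The six (previous words, following words) contexts listed on the slide.
CONTEXTS = [(2, 2), (1, 1), (1, 0), (2, 0), (0, 1), (2, 1)]


def local_features(energy):
    """Basic statistics over the energy frames of one word."""
    return {
        "min": min(energy),
        "max": max(energy),
        "mean": mean(energy),
        "std": pstdev(energy),
        "rms": math.sqrt(mean([e * e for e in energy])),
    }


def _safe_div(num, den):
    # Assumption: a flat region (zero denominator) yields 0.0.
    return num / den if den else 0.0


def context_features(words, i, n_prev, n_next):
    """Normalize word i's energy by its surrounding region.

    words: list of per-word lists of energy frames.
    The region spans n_prev words before and n_next words after word i,
    clipped at utterance boundaries, and includes word i (an assumption).
    """
    lo = max(0, i - n_prev)
    hi = min(len(words), i + n_next + 1)
    region = [e for w in words[lo:hi] for e in w]
    w_max, w_mean = max(words[i]), mean(words[i])
    r_mean, r_std = mean(region), pstdev(region)
    r_max, r_min = max(region), min(region)
    return {
        "max_minus_mean_over_std": _safe_div(w_max - r_mean, r_std),
        "mean_minus_mean_over_std": _safe_div(w_mean - r_mean, r_std),
        "max_minus_max_over_std": _safe_div(w_max - r_max, r_std),
        "max_over_range": _safe_div(w_max, r_max - r_min),
        "mean_over_range": _safe_div(w_mean, r_max - r_min),
    }


# Toy energy frames for three words; the middle word is the target.
toy_words = [[1.0, 2.0], [3.0, 5.0], [2.0, 2.0]]
toy_features = {c: context_features(toy_words, 1, *c) for c in CONTEXTS}
```

With six contexts and five normalizations, this yields 30 context-based values per word in addition to the local statistics.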

7 Corpus  Boston Directions Corpus (BDC) [Hirschberg&Nakatani96]  Speech elicited from a direction-giving task.  Used only the read portion.  50 minutes  Fully ToBI labeled  10825 words  Manually segmented  4 Speakers: 3 male, 1 female

8 Results and Discussion  Energy from different frequency regions predicts pitch accent differently  mean relative improvement of best region over worst: 14.8%

9 Results and Discussion  Our experiments did not confirm previously reported results.  The single most predictive subband for all speakers was 3-18 bark over full words  Classification Accuracy: 76% (42.4% baseline)  p=71.6, r=73.4  However, it performs significantly worse than the best subband when analyzing a single speaker  not the female speaker
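For reference, the accuracy, precision (p), and recall (r) figures quoted in these results can be computed from per-word labels as in the sketch below. It assumes that "accented" is the positive class, which the slides do not state explicitly; the function name and toy labels are illustrative only.

```python
def accent_metrics(gold, predicted):
    """Accuracy, precision and recall with 'accented' (True) as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # accented, predicted accented
    fp = sum(1 for g, p in pairs if not g and p)      # unaccented, predicted accented
    fn = sum(1 for g, p in pairs if g and not p)      # accented, predicted unaccented
    correct = sum(1 for g, p in pairs if g == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy labels: True = accented word, False = unaccented word.
toy_gold = [True, True, False, False, True]
toy_pred = [True, False, True, False, True]
toy_metrics = accent_metrics(toy_gold, toy_pred)
```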

10 Results and Discussion  The subband from 2-20 bark performs significantly worse than the most predictive in only a single experiment (h1nucl)  Accuracy: 75.5% (p=70.5, r=72.5)  Due to its robustness we consider this band the “best”  The formant-based energy features tend to perform worse  6.4% mean accuracy reduction from 2-20 bark  Attributable to:  Errors in the formant tracking algorithm  The presence of discriminative information in higher formants

11 Results and Discussion  Most predictive features were normalized maximum energy relative to the mean and standard deviation of three contextual regions  1 previous and 1 following word  2 previous and 1 following word  2 previous and 2 following words

12 Results and Discussion  There is a relatively small intersection of correct predictions even among similar subbands.  10823 of 10825 words were correctly classified by at least one classifier.  Using a majority voting scheme:  Accuracy: 81.9% (p=76.7, r=82.5)
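The two quantities on this slide, the number of words that at least one classifier gets right and the majority-vote prediction, can be sketched as below. This is an illustration under assumptions: the slides do not describe how ties were broken (here a tie counts as unaccented) or whether votes were weighted, and the function names are hypothetical.

```python
def majority_vote(predictions):
    """Combine per-classifier boolean predictions, one list per classifier.

    A word is labeled accented only when strictly more than half of the
    classifiers say so (tie -> unaccented, an assumption).
    """
    n_classifiers = len(predictions)
    return [sum(votes) * 2 > n_classifiers for votes in zip(*predictions)]


def oracle_coverage(gold, predictions):
    """Count words that at least one classifier labels correctly."""
    return sum(
        1 for i, g in enumerate(gold) if any(p[i] == g for p in predictions)
    )


# Toy example: three classifiers (e.g. three subbands), four words.
toy_predictions = [
    [True, False, True, False],
    [True, True, False, False],
    [False, True, True, False],
]
toy_gold_labels = [True, False, True, True]
toy_vote = majority_vote(toy_predictions)
toy_coverage = oracle_coverage(toy_gold_labels, toy_predictions)
```

A high oracle coverage combined with a small overlap between classifiers is what makes voting attractive: the individual subband classifiers make different errors.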

13 Results and Discussion  How do the regioning strategies perform? Full Word > All Nuclei > Longest Syllable ~ Longest Nuclei  Why does analysis of the full word outperform other regioning strategies?  Duration is a crude measure of lexical stress  Syllable/nuclei segmentation algorithms are imperfect  Pitch accents are not neatly placed  More data has the ability to highlight distinctions more easily

14 Conclusion  Using an analysis-by-classification approach we showed:  Energy from different frequency bands correlates with pitch accent differently.  The “best” (highest accuracy, most robust) frequency region to be 2-20 bark (>2 bark?)  A voting classifier based exclusively on energy can predict accent reliably.

15 Future Work  Can we predict which bands will predict accent best for a given word?  We plan on incorporating these findings into a general pitch accent classifier with pitch and duration features.  We plan on repeating these experiments on spontaneous speech data.

