Automatic Prosody Labeling Final Presentation Andrew Rosenberg ELEN 6820 - Speech and Audio Processing and Recognition 4/27/05.


3 Overview
–Project Goal
–ToBI standard for prosodic labeling
–Previous Work
–Method
–Results
–Conclusion

4 Project Goal: Automatic assignment of tones tier elements
–Given the waveform, orthographic, and break index tiers, predict a subset/simplification of the elements in the tones tier.
–Distinct experiments for determining each of pitch accents, phrase tones, and phrase boundary tones.

5 ToBI Annotation
The Tones and Break Indices (ToBI) labeling scheme consists of a speech waveform and 4 tiers:
–Tones: annotation of pitch accents and phrasal tones
–Orthographic: transcription of the text
–Break Index: strength of the juncture between words, rated on a scale from 0-4
–Miscellaneous: notes about the annotation (e.g., ambiguities, non-speech noise)
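The four tiers above could be represented with a minimal container like the following sketch; the class name, field layout, and sample values are invented for illustration, not a format any ToBI tool prescribes:

```python
from dataclasses import dataclass, field

@dataclass
class ToBIAnnotation:
    """Hypothetical minimal container for one ToBI-annotated utterance."""
    tones: list          # (time, label) pairs, e.g. (0.41, "H*")
    orthographic: list   # (start, end, word) triples
    break_index: list    # (time, break index 0-4) pairs
    misc: list = field(default_factory=list)  # free-form annotator notes

# Toy utterance: one accented word ending a full intonational phrase.
utt = ToBIAnnotation(
    tones=[(0.41, "H*"), (0.62, "L-L%")],
    orthographic=[(0.30, 0.62, "money")],
    break_index=[(0.62, 4)],
)
```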

6 ToBI Transcription Example

7 ToBI Examples
–Pitch Accents (made3.wav): H*, L*, L+H*
–Boundary Tones (money.wav): L-H%, H-H%, L-L%, H-L%, (H-, L-)

8 Previous Work
–Ross, "Prediction of abstract prosodic labels for speech synthesis" (1996)
  BU Radio News Corpus (~48 minutes): public news broadcasts spoken by 7 speakers
  Uses decision tree output as input to an HMM for pitch accent identification; decision trees for phrase/boundary tone identification
  Employs no acoustic features
–Narayanan, "An Automatic Prosody Recognizer using a Coupled Multi-Stream Acoustic Model and a Syntactic-Prosodic Language Model" (2005)
  BU Radio News Corpus
  Detects stressed syllables (collapsed ToBI labels) and all boundaries
  Uses a coupled HMM (CHMM) on pitch, intensity, and duration to track these "asynchronous" acoustic features, plus a trigram POS/stress-boundary language model
–Wightman, "Automatic Labeling of Prosodic Patterns" (1994)
  Single-speaker subset of BNC and an ambiguous-sentence corpus (read speech)
  Like Ross, uses decision tree output as input to an HMM
  Uses many acoustic features
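The decision-tree-plus-HMM pipeline used by Ross and Wightman can be sketched as Viterbi rescoring of per-word classifier posteriors with a label transition model. All probabilities below are made up for illustration, and the two-class label set (0 = unaccented, 1 = accented) is a simplification of the full ToBI inventory:

```python
import numpy as np

def viterbi(posteriors, trans, prior):
    """posteriors: (T, K) per-word class probabilities from the tree;
    trans: (K, K) label transition matrix; prior: (K,) initial probs."""
    T, K = posteriors.shape
    delta = np.log(prior) + np.log(posteriors[0])
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(trans)      # (K, K): prev x cur
        back[t] = scores.argmax(axis=0)              # best predecessor
        delta = scores.max(axis=0) + np.log(posteriors[t])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                    # backtrace
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Three words; middle word looks accented to the classifier.
post = np.array([[0.6, 0.4], [0.2, 0.8], [0.55, 0.45]])
trans = np.array([[0.7, 0.3], [0.6, 0.4]])  # adjacent accents disfavored
print(viterbi(post, trans, prior=np.array([0.5, 0.5])))  # → [0, 1, 0]
```

The sequence model lets nearby decisions influence each other, which is exactly what a word-by-word decision tree alone cannot do.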

9 Method
–JRip
  Classification rule learner
  Better at working with nominal attributes
  Easier-to-read output
–Corpus: Boston Directions Corpus
  4 speakers, ~65 minutes of semi-spontaneous speech
–Original plan: HMMs and SVMs
  SVMs took a prohibitive amount of time to learn and performed worse
  HMM implementation problems, and not enough time to implement my own
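Once trained, JRip produces an ordered list of IF-THEN rules with a default class; the first matching rule wins. A hypothetical sketch of how such a rule list classifies a word; the feature names and thresholds here are invented for illustration, not the rules these experiments actually learned:

```python
def classify(word, rules, default="none"):
    """rules: ordered list of (condition, label); first match wins."""
    for cond, label in rules:
        if cond(word):
            return label
    return default

# Invented rules in the spirit of JRip output: high phrase-relative F0
# on a content word, or a long word, suggests a pitch accent.
rules = [
    (lambda w: w["f0_zscore"] >= 1.0 and w["pos"] in {"NN", "JJ"}, "accented"),
    (lambda w: w["duration"] >= 0.35, "accented"),
]

print(classify({"f0_zscore": 1.4, "pos": "NN", "duration": 0.2}, rules))
```

Rule lists of this shape are easy to read off, which is the "easier-to-read output" advantage noted above.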

10 Method - Features
–Min, max, mean, std. dev. of F0 and intensity
–# syllables, duration, approximate vowel length, POS
–F0 slope (weighted)
–zscore of max F0 and intensity
–Phrase-length F0, intensity, and vowel-length features
–Phrase position
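A sketch of how the word-level F0 features above might be computed, assuming one F0 value per frame for the word and for its enclosing phrase; the frame values are fabricated, and the slide's weighted slope is simplified here to a plain least-squares fit:

```python
import numpy as np

def f0_features(word_f0, phrase_f0):
    """Word-level F0 statistics plus a phrase-relative zscore."""
    word_f0 = np.asarray(word_f0, dtype=float)
    phrase_f0 = np.asarray(phrase_f0, dtype=float)
    # Unweighted linear fit; the actual features also use weighted slopes.
    slope = np.polyfit(np.arange(len(word_f0)), word_f0, 1)[0]  # Hz/frame
    return {
        "min": word_f0.min(), "max": word_f0.max(),
        "mean": word_f0.mean(), "std": word_f0.std(),
        "slope": slope,
        # phrase-level zscore of the word's F0 peak
        "zmax": (word_f0.max() - phrase_f0.mean()) / phrase_f0.std(),
    }

# Fabricated frames: a rising-falling word inside a lower-pitched phrase.
feats = f0_features([180, 195, 210, 200], [150] * 20 + [180, 195, 210, 200])
```

Normalizing against the phrase (the `zmax` line) is what makes an F0 peak comparable across speakers and phrase positions.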

11 Results - Tasks
–Pitch Accent: identification, detection
–Phrase Tone: identification
–Boundary Tone: identification
–Phrase/Boundary Tone: identification, detection

12 Results - Pitch Accent Identification Accuracy
Relevant features: # syllables, duration (previous 2), vowel length (prev, next 2), POS, max & stdev F0, slope F0, max & stdev intensity, zscore of F0, phrase-level zscore of F0 and intensity

      Best    No Breaks  Base    Ross*
Acc.  79.2%   78.0%      58.8%   80.2%

*Ross identifies a different subset of ToBI pitch accents

13 Results - Pitch Accent Detection
Baseline: 58.9%. On BNC, human agreement is 91% (86-88% in general).
Identical relevant features as the identification task.

      Best        No Breaks  Ross   Narayanan   Wightman
Acc.  85.7%       83.9%      82.5%  -           -
T/F   83.2/12.4%  80.1/14%   -      79.5/13.2%  83/14%
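The T/F figures in these tables can be read as a correct-detection rate over truly accented words and a false-alarm rate over unaccented words; a sketch under that assumption, using invented toy labels:

```python
def detection_scores(gold, pred):
    """gold/pred: parallel 0/1 sequences (1 = accented).
    Returns (accuracy, true-detection rate, false-alarm rate)."""
    tp = sum(g and p for g, p in zip(gold, pred))        # hits
    fp = sum((not g) and p for g, p in zip(gold, pred))  # false alarms
    pos = sum(gold)
    neg = len(gold) - pos
    acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return acc, tp / pos, fp / neg

# Toy labels: 6 words, one miss and one false alarm.
gold = [1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 1, 0]
print(detection_scores(gold, pred))
```

Reporting T and F separately matters because a high-baseline task like this one can show strong accuracy while still over- or under-detecting accents.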

14 Results - Phrase Tone Accuracy
Relevant features: duration of next word; max, min, and mean F0; linear slope of F0; zscore of intensity; phrase zscores of F0 and intensity

      Best    Base    No Break  Base
Acc.  72.4%   57.9%   86.7%     77.4%

15 Results - Boundary Tone Identification Accuracy
Relevant features: quadratically weighted F0 slope

      Best    Base    No Break  Base
Acc.  73.2%   65.1%   91.3%     84.5%

16 Results - Phrase/Boundary Tone Identification Accuracy
Relevant features: duration of next two words; POS (current and next 2); max, mean, and slope (all weightings) of F0; mean intensity; phrase zscores of F0 and intensity; zscore of the difference between max intensity in the current word and in the phrase

      Best    Base    Ross    Base
Acc.  54.7%   33.8%   66.9%   56.3%

17 Results - Phrase/Boundary Tone Detection Accuracy
Human agreement (in general): 95%
Best agreement: 93.0% over a 77% baseline
Relevant features: vowel length (current and next word); POS of the next word

      Best       Narayanan   Wightman
T/F   82.5/3.9%  80.9/16.0%  77/3%

18 Conclusion
–Relatively low-tech acoustic features and ML algorithms can perform competitively with more complicated NLP approaches.
–Break index information was not as helpful as initially suspected.
–Potential improvements:
  Sequential modeling (HMM)
  Different features: a more sophisticated pitch contour feature; content-based features (similar to Ross)
