2 Prosody in Spoken Language Understanding Gina Anne Levow University of Chicago January 4, 2008 NLP Winter School 2008

3 U: Give me the price for AT&T.
U: Give me the price for American Telephone and Telegraph.

4 Error Correction Spiral
U: Give me the price for AT&T.
S: Hewlett Packard was 83 ¾, up 2 ½ since yesterday.
U: Give me the price for AT&T.
S: Data General was at 10 ½, up a quarter.
U: Give me the price for AT&T.
S: Hewlett Packard was 83 ¾, up 2 ½.
U: Give me the price for American Telephone and Telegraph.
S: Excuse me?

5 Roadmap
- Corrections: a motivating example
- Defining prosody
- Why prosody?
- Challenges in prosody
- Prosody in language understanding
  - Recognizing tone and pitch accent
  - Spoken corrections, topic segmentation
- Conclusions

6 Defining Prosody
- Prosody: phonetic phenomena in speech that span more than a single segment ("suprasegmental")
- Prosody includes: stress, focus, tone, intonation, length/pause, rhythm
- Prosodic features include:
  - Pitch: perceptual correlate of fundamental frequency (f0), the rate of vocal fold vibration
  - Loudness/intensity, duration, segment quality
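As a minimal sketch of extracting these acoustic correlates in Python, one can use the parselmouth bindings to Praat (the talk itself uses Praat for feature extraction later; the file name here is a placeholder):

```python
# Sketch: extract f0 and intensity contours via Praat's algorithms,
# using the parselmouth package (pip install praat-parselmouth).
import parselmouth

snd = parselmouth.Sound("utterance.wav")    # hypothetical input file

pitch = snd.to_pitch()                      # autocorrelation-based f0 tracking
f0 = pitch.selected_array['frequency']      # Hz; 0 where unvoiced
times = pitch.xs()

intensity = snd.to_intensity()              # intensity contour in dB
print(f0[:10], intensity.values[0, :10])
```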

7 Why Prosody?
Prosody plays a crucial role:
- At all levels of language: lexical, syntactic, pragmatic/discourse
- In establishing meaning: disambiguates sense and structure
- Across language families: common physiological, articulatory basis
- In synthesis and recognition of fluent speech

8 Prosody and the Lexicon
- Lexical: prosody determines word identity
  - Prosodic effect at the syllable level (minimal unit)
- Lexical stress: syllable prominence
  - Combination of length, pitch movement, loudness
  - REcord (N) vs. reCORD (V)
- Pitch accent can differentiate words in some languages
- Lexical tone: tone languages, e.g. Chinese, Punjabi
  - Pitch height (register) and/or shape (contour)
  - Ma (high): mother; ma (rising): hemp; ma (low): horse; ma (falling): scold

9 Prosody and Syntax
- Prosody can disambiguate structure
  - Associated with chunking and attachment
- Not identical with syntactic phrase boundaries
  - "Prosody is predictable from syntax, except when it isn't"
- Prosodic phrasing indicated by some combination of pause and change in pitch

10 Chunking, or "phrasing"
A1: I met Mary and Elena's mother at the mall yesterday.
A2: I met Mary and Elena's mother at the mall yesterday.
(The same word string, spoken with two different phrasings; see slides 44 and 45.)
Example from Jennifer Venditti

11 Punctuation & Prosody: Humor
A panda goes into a restaurant and has a meal. Just before he leaves, he takes out a gun and fires it. The irate restaurant owner says, 'Why did you do that?' The panda replies, 'I'm a panda. Look it up.' The restaurateur goes to his dictionary and under 'panda' finds: 'black and white arboreal, bear-like creatures; eats, shoots and leaves.'

12 Prosody in Pragmatics & Discourse
- Focus: prominence, new information: pitch accent ("October eleventh")
- Sentence type, dialogue act: statement vs. declarative question ("It's raining." vs. "It's raining?")
- Discourse structure (topic), emotion
from Shih, Prosody Learning and Generation

13 Challenges in Prosody I
- Highly variable: actual realization differs from ideal
  - Speaker variation: gender, vocal tract differences, idiosyncrasy
- Tonal coarticulation: neighboring tones influence realization (as with segmental coarticulation)
  - An underlying fall can surface as a rise
- Parallel encoding: effects at multiple levels realized simultaneously

14 Challenges in Prosody II
Challenges for learning:
- Lack of training data
  - Sparseness: many prosodic phenomena are infrequent (e.g., non-declarative utterances, topic boundaries, contrastive accents), which is challenging for machine learning methods; a large corpus is needed to attest them
- Costs of labeling: many prosodic events require expert labeling; time-consuming and expensive

15 Context and Learning in Multilingual Tone and Pitch Accent Recognition

16 Strategy: Context
- Common model across languages
  - Pure acoustic-prosodic model: no word label, POS, or lexical stress information
  - English, Mandarin Chinese (also Cantonese, isiZulu)
- Exploit contextual information
  - Features from adjacent syllables, phrase contour
- Analyze impact of context position, context encoding, context type
- > 12.5% reduction in error over no context

17 Data Collections
- English (Ostendorf et al., 95): Boston University Radio News Corpus, f2b
  - Manually annotated, aligned, syllabified
  - 4 pitch accent labels, aligned to syllables
- Mandarin: TDT2 Voice of America Mandarin Broadcast News
  - Automatically aligned, syllabified
  - 4 main tones, plus neutral

18 Local Feature Extraction
- Uniform representation for tone and pitch accent
  - Motivated by the pitch target approximation model: tone/pitch accent target exponentially approached; linear target with height and slope (Xu et al., 99)
- Base features:
  - Pitch and intensity: max, mean, min, range (Praat, speaker-normalized)
  - Pitch at 5 points across the voiced region
  - Duration
  - Initial, final position in phrase
  - Slope: linear fit to the last half of the pitch contour
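A minimal numpy sketch of these base features computed from one syllable's voiced f0 contour (the function and feature names are assumptions made for illustration, following the slide's list):

```python
import numpy as np

def base_features(f0, duration):
    """Slide 18's base features from one syllable's voiced f0 contour.

    f0: 1-D array of pitch samples (Hz) across the voiced region.
    duration: syllable duration in seconds.
    """
    feats = {
        "pitch_max": f0.max(),
        "pitch_mean": f0.mean(),
        "pitch_min": f0.min(),
        "pitch_range": f0.max() - f0.min(),
        "duration": duration,
    }
    # Pitch at 5 evenly spaced points across the voiced region.
    idx = np.linspace(0, len(f0) - 1, 5).round().astype(int)
    for i, j in enumerate(idx):
        feats[f"pitch_pt{i}"] = f0[j]
    # Slope: linear fit to the last half of the pitch contour.
    half = f0[len(f0) // 2:]
    feats["slope"] = np.polyfit(np.arange(len(half)), half, 1)[0]
    return feats
```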

19 Context Features
- Local context:
  - Extended features: pitch max, mean, and adjacent points of preceding and following syllables
  - Difference features: difference in pitch max, mean, mid, slope and intensity max, mean between the preceding/following and the current syllable
- Phrasal context:
  - Compute collection-average phrase slope
  - Compute scalar pitch values, adjusted for slope
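A sketch of the difference encoding and a simple phrasal-slope compensation. The slide does not spell out the exact compensation formula; the assumption here is that it amounts to detrending each syllable's pitch by the collection-average phrase slope:

```python
import numpy as np

def difference_features(prev, cur, nxt):
    """Difference features between neighboring syllables' local features.

    prev, cur, nxt: dicts as produced by base_features() above;
    prev/nxt may be None at phrase edges (assumption: use 0.0 there).
    """
    keys = ["pitch_max", "pitch_mean", "pitch_pt2", "slope"]
    diffs = {}
    for name, other in (("prev", prev), ("next", nxt)):
        for k in keys:
            diffs[f"d_{name}_{k}"] = cur[k] - other[k] if other else 0.0
    return diffs

def compensate_phrase_slope(pitch_means, times, avg_slope):
    """Remove the collection-average phrase slope (Hz/s) from per-syllable
    pitch means, so tones are compared against a flattened phrase contour."""
    return np.asarray(pitch_means) - avg_slope * np.asarray(times)
```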

20 Classification Experiments
- Classifier: support vector machine
  - Linear kernel, multiclass formulation
  - SVMlight (Joachims), LibSVM (Chang & Lin 01)
- 4:1 training/test splits
- Experiments: effects of
  - Context position: preceding, following, none, both
  - Context encoding: extended/difference
  - Context type: local, phrasal
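An illustrative scikit-learn version of this setup (the talk used SVMlight/LibSVM directly; sklearn's SVC wraps LibSVM, and the 4:1 split maps to test_size=0.2; the data here is a random placeholder):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# X: one row of local + context features per syllable;
# y: tone or pitch-accent label per syllable (placeholder random data).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = rng.integers(0, 4, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = SVC(kernel="linear")   # LibSVM under the hood; one-vs-one multiclass
clf.fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```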

21 Results: Local Context

Context          Mandarin Tone   English Pitch Accent
Full             74.5%           81.3%
Extend PrePost   74%             80.7%
Extend Pre       74%             79.9%
Extend Post      70.5%           76.7%
Diffs PrePost    75.5%           80.7%
Diffs Pre        76.5%           79.5%
Diffs Post       69%             77.3%
Both Pre         76.5%           79.7%
Both Post        71.5%           77.6%
No context       68.5%           75.9%


24 Discussion: Local Context
- Any context information improves over none
- Preceding context consistently improves over none or following context
- English: generally, more context features are better
- Mandarin: following context can degrade performance
- Little difference in encoding (Extend vs. Diffs)
- Consistent with phonetic analysis (Xu) that carryover coarticulation is greater than anticipatory

25 Results & Discussion: Phrasal Context

Phrase Context   Mandarin Tone   English Pitch Accent
Phrase           75.5%           81.3%
No Phrase        72%             79.9%

- Phrase contour compensation enhances recognition
- This is a simple strategy; non-linear slope compensation may improve it further

26 Context: Summary
- Employ common acoustic representation: tone (Mandarin), pitch accent (English)
  - Cantonese: ~64%; 68% with RBF-kernel SVM
- SVM classifiers, linear kernel: 76%, 81%
- Local context effects: up to > 20% relative reduction in error
  - Preceding context makes the greatest contribution (carryover vs. anticipatory)
- Phrasal context effects: compensation for phrasal contour improves recognition

27 Strategy: Training
- Challenge: can we use the underlying acoustic structure of the language, through unlabeled examples, to reduce the need for expensive labeled training data?
- Exploit semi-supervised and unsupervised learning
  - Semi-supervised Laplacian SVM
  - K-means and asymmetric k-lines clustering
- Both substantially outperform baselines and can approach supervised levels

28 Data Collections & Processing
- English (as before): Boston University Radio News Corpus, f2b
  - Binary: unaccented vs. accented
  - 4-way: unaccented, high, downstepped high, low
- Mandarin:
  - Lab speech data (Xu, 1999): 5-syllable utterances varying tone and focus position; in-focus, pre-focus, post-focus
  - TDT2 Voice of America Mandarin Broadcast News; 4-way: high, mid-rising, low, high falling
- isiZulu (as before): read web sentences; 2-way: high vs. low

29 Semi-supervised Learning
- Approach: employ a small amount of labeled data; exploit information from additional, presumably more available, unlabeled data
  - Few prior examples; several weakly supervised (Wong et al., '05)
- Classifier: Laplacian SVM (Sindhwani, Belkin & Niyogi '05)
  - Semi-supervised variant of the SVM; exploits unlabeled examples
  - RBF kernel, typically 6 nearest neighbors, transductive
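The Laplacian SVM is not in the common Python toolkits; as an illustrative stand-in (not the talk's actual classifier), scikit-learn's graph-based LabelSpreading shows the same ingredients: a handful of labels, a nearest-neighbor graph over all examples, and transductive predictions for the unlabeled points. Data and label counts are placeholders echoing slide 30:

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))           # acoustic feature vectors
y_true = rng.integers(0, 4, size=400)    # 4 tone classes (placeholder data)

# Keep 160 labels and mark the rest unlabeled (-1), as in slide 30's setup.
y = y_true.copy()
unlabeled = rng.permutation(400)[160:]
y[unlabeled] = -1

# kNN graph with 6 neighbors, echoing the slide's "6 nearest neighbors".
model = LabelSpreading(kernel="knn", n_neighbors=6)
model.fit(X, y)
acc = (model.transduction_[unlabeled] == y_true[unlabeled]).mean()
print("transductive accuracy on unlabeled points:", acc)
```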

30 Experiments
- Pitch accent recognition: binary classification (unaccented/accented)
  - 1000 instances, proportionally sampled; labeled training: 200 unaccented, 100 accented
  - 80% accuracy (cf. 84% for an SVM with 15x the labeled data)
- Mandarin tone recognition: 4-way classification via n(n-1)/2 binary classifiers
  - 400 instances, balanced; 160 labeled
  - Clean lab speech, in-focus: 94% (cf. 99% for an SVM with 1000s of training samples; 85% for an SVM with 160 training samples)
  - Broadcast news: 70% (cf. < 50% for an SVM with 160 training samples)
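The n(n-1)/2 decomposition is the standard one-vs-one reduction of a multiclass problem to pairwise binary classifiers; a short sketch with scikit-learn's wrapper, again on placeholder data:

```python
import numpy as np
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = rng.integers(0, 4, size=400)    # 4 tone classes (placeholder data)

# 4 classes -> 4*3/2 = 6 binary classifiers, one per pair of classes;
# the final label is chosen by voting among the pairwise decisions.
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)
print(len(ovo.estimators_))         # 6
```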

31 Unsupervised Learning
- Question: can we identify the tone structure of a language from the acoustic space without training? (Analogous to language acquisition)
- Significant recent research in unsupervised clustering
  - Established approaches: k-means; spectral clustering (Shi & Malik '97; Fischer & Poland 2004): asymmetric k-lines
- Little research for tone:
  - Self-organizing maps (Gauthier et al., 2005): tones identified in lab speech using f0 velocities
  - Cluster-based bootstrapping (Narayanan et al., 2006)
  - Prominence clustering (Tamburini '05)

32 Contrasting Clustering
- Clustering: 2-16 clusters; label each cluster with its most frequent class
- 3 spectral approaches (perform spectral decomposition of the affinity matrix):
  - Asymmetric k-lines (Fischer & Poland 2004)
  - Symmetric k-lines (Fischer & Poland 2004)
  - Laplacian Eigenmaps (Belkin, Niyogi & Sindhwani 2004): binary weights, k-lines clustering
- K-means: standard Euclidean distance
- Best results: > 78%
  - 2 clusters: asymmetric k-lines best; > 2 clusters: k-means
  - At larger numbers of clusters, all approaches are similar
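A sketch of the evaluation scheme described here: cluster without labels, then score by assigning each cluster its most frequent true class. K-means stands in for the k-lines variants, which have no standard Python implementation; data is a random placeholder:

```python
import numpy as np
from sklearn.cluster import KMeans

def majority_label_accuracy(X, y_true, n_clusters):
    """Cluster unlabeled data, then label each cluster with its most
    frequent true class and report the resulting accuracy (slide 32)."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=0).fit_predict(X)
    y_pred = np.empty_like(y_true)
    for c in range(n_clusters):
        members = clusters == c
        if members.any():
            # Most frequent true class among this cluster's members.
            y_pred[members] = np.bincount(y_true[members]).argmax()
    return (y_pred == y_true).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 4, size=1000)
for k in (2, 4, 8, 16):
    print(k, majority_label_accuracy(X, y, k))
```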

33 Contrasting Learners

34 Tone Clustering: I
- Mandarin four tones: 400 samples, balanced
- 2-phase clustering: 2-5 clusters each; asymmetric k-lines, k-means clustering
- Clean read speech:
  - In-focus syllables: 87% (cf. 99% supervised)
  - In-focus and pre-focus: 77% (cf. 93% supervised)
- Broadcast news: 57% (cf. 74% supervised)
- K-means requires more clusters to reach the k-lines level

35 Tone Structure
- First phase of clustering splits high/rising from low/falling by slope
- Second phase splits by pitch height

36 Conclusions
- Common prosodic framework for tone and pitch accent recognition
- Contextual modeling enhances recognition
  - Local context and broad phrase contour
  - Carryover coarticulation has a larger effect for Mandarin
- Exploiting unlabeled examples for recognition
  - Semi- and unsupervised approaches
  - Best cases approach supervised levels with less training
  - Exploits acoustic structure of the tone and accent space

37 Error Correction Spiral
U: Give me the price for AT&T.
S: Hewlett Packard was 83 ¾, up 2 ½ since yesterday.
U: Give me the price for AT&T.
S: Data General was at 10 ½, up a quarter.
U: Give me the price for AT&T.
S: Hewlett Packard was 83 ¾, up 2 ½.
U: Give me the price for American Telephone and Telegraph.
S: Excuse me?

38 Recognizing Spoken Corrections
- Spoken corrections: recognize user attempts to correct ASR failures
- Compare original inputs to repeat corrections; significant differences:
  - Corrections: increases in duration, pause number/length, final fall
  - Increases in pitch accent for misrecognitions
- Automatic recognition with decision trees, boosting
  - Distinguish corrective/not (human-level performance); key features: raw/normalized duration, pause
  - Identify the specific word being corrected; key features: highest pitch, widest pitch range
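A sketch of the "decision trees, boosting" recipe on correction-style features. The feature names are assumptions drawn from the slide's key cues, and the data is a random placeholder:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Per-utterance features named after the slide's cues.
# Columns: normalized duration, pause count, pause length, final-fall degree.
X = rng.normal(size=(500, 4))
y = rng.integers(0, 2, size=500)    # 1 = correction, 0 = not (placeholder)

# Boosted shallow decision trees, as on slide 38.
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2), n_estimators=50)
clf.fit(X[:400], y[:400])
print("held-out accuracy:", clf.score(X[400:], y[400:]))
```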

39 The Problem: Speech Topic Segmentation
Separate the audio stream into component topics:
On "World News Tonight" this Thursday, another bad day on stock markets, all over the world global economic anxiety. || Another massacre in Kosovo, the U.S. and its allies prepare to do something about it. Very slowly. || And the millennium bug, Lubbock, Texas prepares for catastrophe, Bangalore, in India, sees only profit. ||

40 Is It Possible in Mandarin?

41 Recognizing Shifts in Topic & Turn
- Topic and turn boundaries in English and Mandarin
  - Initial syllables: significantly higher pitch, loudness than final
- Lexical and prosodic cues: cue words, tf*idf similarity; pitch, loudness, silence
- Automatic recognition with decision trees, boosting
  - Voting to combine text, prosody, silence: 97% accuracy
  - Key features: pause; pitch and loudness contrast between syllables
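A sketch of the voting combination: three independent classifiers, one per evidence source (text, prosody, silence), whose per-boundary predictions are merged by majority vote. The three feature arrays are stand-ins for the real cue extractors:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 600
# One classifier per evidence source, each with its own features
# (stand-in arrays: text similarity, prosodic contrasts, pause length).
sources = {name: rng.normal(size=(n, 3))
           for name in ("text", "prosody", "silence")}
y = rng.integers(0, 2, size=n)      # 1 = topic boundary (placeholder)

train, test = slice(0, 500), slice(500, n)
votes = []
for name, X in sources.items():
    clf = DecisionTreeClassifier(max_depth=3).fit(X[train], y[train])
    votes.append(clf.predict(X[test]))

# Majority vote across the three single-source classifiers (slide 41).
combined = (np.sum(votes, axis=0) >= 2).astype(int)
print("voted accuracy:", (combined == y[test]).mean())
```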

42 Conclusions & Opportunities
- Prosody is a rich source of information across languages
  - Challenging due to variation and paucity of data
- Can be successfully employed, with learning, to improve language understanding
  - Pitch accent, tone, dialogue act, turn, topic, ...
- Unrestricted conversational, multi-party, multimodal speech is much more challenging
  - Increased variability; interaction with non-verbal evidence

43 Thanks
Dinoj Surendran, Siwei Wang, Yi Xu
V. Sindhwani, M. Belkin & P. Niyogi; I. Fischer & J. Poland; T. Joachims; C.-C. Chang & C.-J. Lin
This work supported by NSF Grant #0414919
http://people.cs.uchicago.edu/~levow/tai

44 Phrasing can disambiguate
I met Mary and Elena's mother at the mall yesterday
[Figure: pitch track with phrase labels "Mary & Elena's mother", "mall"]
One intonation phrase with relatively flat overall pitch range.

45 Phrasing can disambiguate
I met Mary and Elena's mother at the mall yesterday
[Figure: pitch track with phrase labels "Mary", "Elena's mother", "mall"]
Separate phrases, with expanded pitch movements.

46 Lists of numbers, nouns
twenty.eight.five
ninety.four.three
seventy.three.seven
forty.seven.seven
seventy.seven.seven
coffee cake and cream
chocolate ice cream and cake
fish fingers and bottles
cheese sandwiches and milk
cream buns and chocolate
[from Prosody on the Web tutorial on chunking]

47 Clustering
- Pitch accent clustering: 4-way distinction
  - 1000 samples, proportional
  - 2-16 clusters constructed; assign the most frequent class label to each cluster
- Classifier: asymmetric k-lines: context-dependent kernel radii, non-spherical
- > 78% accuracy; 2 clusters: asymmetric k-lines best
- Context effects: vectors with preceding context vs. with no context are comparable

