Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News Gina-Anne Levow University of Chicago SIGHAN July 25, 2004.

1 Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News Gina-Anne Levow University of Chicago SIGHAN July 25, 2004

2 Roadmap
The Problem: Mandarin Story Segmentation
The Tools: Prosodic and Text Cues
–Mandarin Chinese
Individual Results
Integrating Cues
Conclusion & Future Work

3 The Problem: Mandarin Speech Topic Segmentation
Separate audio stream into component topics

4 Why Segment?
Enables language understanding tasks
–Information Retrieval: only regions of interest
–Summarization: cover all main topics
–Reference Resolution: pronouns tend to refer within segments

5 The Challenge
How do we define/measure topicality?
–Are two regions on the same topic?
–Fundamentally requires full understanding
How can we approach it with partial understanding?
How do we identify boundaries sharply?
–Association of sentences may be ambiguous, especially “filler”

6 The Tools: Prosodic and Text Cues
Represent local changes at boundaries with audio
–Silence!, speaker change, pitch, loudness, rate (GHN, AT&T00)
Represent topicality with text
–Component words in audio stream (possibly noisy)
–Many possible models (Hearst 94, Beeferman 99, ..)
Combining Prosody and Text
–Human annotators are more accurate and more confident if they use BOTH transcribed text and original audio (Swerts 97)
–English broadcast news (Tur et al., 2001)

7 Data and Processing
Broadcast News
–Topic Detection and Tracking TDT3 corpus
–Voice of America broadcast news, ASR transcription
–Manually segmented: known boundaries
–~4,000 stories, ~750K words
Acoustic analysis (Praat)
–Automatic pitch, intensity tracking: smoothed, speaker-normalized, per-word
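The slide says pitch and intensity are speaker-normalized per word but does not say how. A minimal sketch, assuming z-score normalization against each speaker's own mean and standard deviation (the function name and the `speaker`/`pitch` keys are hypothetical, not taken from the talk):

```python
from collections import defaultdict
from statistics import mean, pstdev


def speaker_normalize(words):
    """Z-score each word's pitch against its speaker's own distribution.

    words: list of dicts with hypothetical keys 'speaker' and 'pitch'
    (for example, mean f0 over the word). Z-scoring is an assumption;
    the slides only say "speaker-normalized".
    """
    by_speaker = defaultdict(list)
    for w in words:
        by_speaker[w["speaker"]].append(w["pitch"])

    # Per-speaker mean and population standard deviation.
    stats = {s: (mean(v), pstdev(v)) for s, v in by_speaker.items()}

    normalized = []
    for w in words:
        mu, sigma = stats[w["speaker"]]
        # A speaker with no pitch variation gets a neutral score of 0.
        normalized.append((w["pitch"] - mu) / sigma if sigma > 0 else 0.0)
    return normalized
```

The same function could be applied to intensity by swapping the key; normalizing per speaker keeps a naturally low-pitched voice from looking like a segment end.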

8 Acoustic-Prosodic Cues
Languages differ in use of intonation
–E.g. English: declarative fall, question rise
–Chinese: pitch contour determines word meaning
At segment boundaries?
–Surprisingly similar, though not identical
–Significantly lower pitch at end of segment
–Significantly lower amplitude at end of segment
–Significantly longer duration at end of segment

9 Acoustic-Prosodic Contrasts
[Figure panels: Mandarin Normalized Pitch; Mandarin Normalized Intensity]

10 Learning Boundaries
Decision tree classifier (Quinlan C4.5)
–Classification problem: for each word, classify as final/non-final
Features
–Acoustic-Prosodic: Duration, Pitch, Loudness, Silence
–Word average, Between-word difference
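The per-word feature rows fed to the classifier could be assembled roughly as follows. This is only a sketch: the dictionary keys, the choice to difference against the following word, and the zero fill for the last word are all assumptions, and the C4.5 training step itself is not shown.

```python
def boundary_features(words):
    """Build one feature row per word for final/non-final classification.

    words: list of dicts with hypothetical keys 'pitch', 'intensity',
    'duration' (per-word averages) and 'silence_after' (pause length).
    Between-word differences are taken against the next word, which is
    an assumption about how the slide's "between-word difference" is defined.
    """
    rows = []
    for i, w in enumerate(words):
        nxt = words[i + 1] if i + 1 < len(words) else None
        rows.append({
            # Word-average features.
            "pitch": w["pitch"],
            "intensity": w["intensity"],
            "duration": w["duration"],
            "silence_after": w["silence_after"],
            # Between-word differences (0.0 for the last word, by assumption).
            "pitch_diff": nxt["pitch"] - w["pitch"] if nxt else 0.0,
            "intensity_diff": nxt["intensity"] - w["intensity"] if nxt else 0.0,
        })
    return rows
```

Each row would then be paired with a final/non-final label taken from the manual segmentation.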

11 Text Boundary Features
Text: information retrieval style
–Cosine similarity between weighted term vectors
»tf*idf in 50-word windows
Cue phrases
–N-gram features
»Identified by BoosTexter (Schapire & Singer, 2000)
–E.g. “Voice of America”, “Audience”, “Reporting”
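The similarity feature can be sketched as the cosine between tf*idf vectors of the 50 words before and the 50 words after a candidate boundary. The function names, the precomputed `idf` dictionary, and the exact window placement are assumptions for illustration; the slides give only the window size and the weighting scheme.

```python
import math
from collections import Counter


def tfidf_vector(tokens, idf):
    """Sparse tf*idf vector; terms missing from `idf` get weight 0."""
    tf = Counter(tokens)
    return {term: count * idf.get(term, 0.0) for term, count in tf.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def window_similarity(tokens, i, idf, window=50):
    """Similarity between the window ending at word i and the one starting at i + 1."""
    left = tokens[max(0, i + 1 - window): i + 1]
    right = tokens[i + 1: i + 1 + window]
    return cosine(tfidf_vector(left, idf), tfidf_vector(right, idf))
```

A low value at word `i` suggests the vocabulary changes there, which is the kind of topical shift the text features are meant to capture.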

12 Classification Results
Balanced training and test sets
–Results on held-out subsets
Acoustic cues only
–95.6% accuracy
Text cues (+ silence)
–95.6% accuracy
Combined text and prosody
–96.4% accuracy
Typically, false alarms twice as common as misses

13 Joint Decision Tree
[Figure: decision tree diagram]

14 Feature Assessment
Role of silence
–Useful in both text and acoustic classifiers
–More necessary for text
Text captures topicality, not locality
–Cannot identify boundaries sharply
Prosodic cues: localize boundaries
–Multiple supporting cues: intensity, pitch (contrastive use)

15 Issue: False Alarms
Evaluate representative sample
–Boundaries far rarer than non-boundaries (Boundary <<< Non-boundary)
–95.6% accuracy: 2% miss, 4.4% false alarms
Non-boundary frequent, so false alarms frequent
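The error figures above can be read against a standard confusion-count breakdown. The sketch below assumes the usual definitions (miss rate over true boundaries, false alarm rate over true non-boundaries); the slides do not state which denominators were used, so this is an illustration, not a reconstruction of the reported numbers.

```python
def error_rates(gold, pred):
    """Accuracy, miss rate, and false alarm rate for boundary labels.

    gold, pred: equal-length sequences of booleans, True meaning the word
    is segment-final. The rate definitions here are assumed, not sourced.
    """
    if len(gold) != len(pred) or len(gold) == 0:
        raise ValueError("gold and pred must be non-empty and the same length")

    hits = sum(1 for g, p in zip(gold, pred) if g and p)
    misses = sum(1 for g, p in zip(gold, pred) if g and not p)
    false_alarms = sum(1 for g, p in zip(gold, pred) if not g and p)
    correct_rejections = sum(1 for g, p in zip(gold, pred) if not g and not p)

    n_boundary = hits + misses
    n_non_boundary = false_alarms + correct_rejections
    return {
        "accuracy": (hits + correct_rejections) / len(gold),
        "miss_rate": misses / n_boundary if n_boundary else 0.0,
        "false_alarm_rate": false_alarms / n_non_boundary if n_non_boundary else 0.0,
    }
```

Because non-boundary words vastly outnumber boundaries in a representative sample, even a small false alarm rate produces many spurious boundaries in absolute terms.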

16 Voting Against False Alarms
Error analysis:
–Construct per-feature classifiers: prosody-only, text-only, silence-only
–Compare classifiers: per-feature, joint
–Joint + 0 or 1 per-feature classifier: FALSE ALARM
Approach: Voting
–Require joint + 2 per-feature classifiers
Result: 1/3 reduction in false alarms
–~97% accuracy: 2.8% miss, 3.15% false alarm
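The voting rule can be written as a small predicate: a word is accepted as a boundary only when the joint classifier and at least two of the per-feature classifiers agree. The function and argument names are hypothetical; only the "joint + 2" requirement comes from the slide.

```python
def vote_boundary(joint, per_feature, min_support=2):
    """Accept a boundary only if the joint classifier and enough
    per-feature classifiers agree.

    joint: bool decision of the joint classifier.
    per_feature: dict mapping a classifier name (e.g. 'prosody', 'text',
    'silence') to its bool decision. Names are illustrative.
    """
    support = sum(1 for decision in per_feature.values() if decision)
    return bool(joint) and support >= min_support


# Usage: the joint classifier fires, but only silence agrees, so the
# candidate is rejected as a likely false alarm.
decision = vote_boundary(True, {"prosody": False, "text": False, "silence": True})
```

Raising `min_support` trades fewer false alarms for more misses, which matches the shift in the reported error rates after voting.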

17 Conclusion
Mandarin broadcast news segmentation
–Identify topicality and boundary locality
Integrate text and acoustic cues
–Text similarity: vector space model, n-gram cues
–Prosodic cues: silence, intensity, pitch, duration
»Robust across range of languages
Provide supporting and orthogonal information
Majority agreement of per-feature classifiers:
–1/3 fewer false alarms

18 Current & Future Work
Improving the model of topicality
–Richer text similarity models; broader acoustic models
Alternative classifiers
–Preliminary experiments: Boosting, Boosted Decision trees, MaxEnt (comparable results)
–Alternative integration strategies
Hierarchical subtopic segmentation
–Broadcast news
–Dialogue: human-computer, human-human
Integration with multi-modal features: e.g. gesture, gaze

19 Acoustic-Prosodic Contrasts
[Figure panels: Mandarin Normalized Pitch; Mandarin Normalized Intensity; English Normalized Pitch; English Normalized Intensity]

20 Text Decision Tree

21 Prosodic Decision Tree

22 The Problem: Speech Topic Segmentation
Separate audio stream into component topics
On "World News Tonight" this Thursday, another bad day on stock markets, all over the world global economic anxiety. || Another massacre in Kosovo, the U.S. and its allies prepare to do something about it. Very slowly. || And the millennium bug, Lubbock Texas prepares for catastrophe, India sees only profit. ||

