Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News Gina-Anne Levow University of Chicago SIGHAN July 25, 2004.


Roadmap
The Problem: Mandarin Story Segmentation
The Tools: Prosodic and Text Cues
  –Mandarin Chinese
Individual Results
Integrating Cues
Conclusion & Future Work

The Problem: Mandarin Speech Topic Segmentation
Separate audio stream into component topics

Why Segment?
Enables language understanding tasks
  –Information Retrieval
    »Only regions of interest
  –Summarization
    »Cover all main topics
  –Reference Resolution
    »Pronouns tend to refer within segments

The Challenge
How do we define/measure topicality?
  –Are two regions on the same topic?
  –Fundamentally requires full understanding
    »How can we approach it with only partial understanding?
How do we identify boundaries sharply?
  –Association of sentences may be ambiguous
    »Especially “filler”

The Tools: Prosodic and Text Cues
Represent local changes at boundaries with audio
  –Silence!, speaker change, pitch, loudness, rate (GHN, AT&T00)
Represent topicality with text
  –Component words in audio stream
    »Possibly noisy
    »Many possible models (Hearst 94, Beeferman 99, ...)
Combining Prosody and Text
  –Human annotators are more accurate and more confident when they use BOTH the transcribed text and the original audio (Swerts 97)
  –English broadcast news (Tur et al., 2001)

Data and Processing
Broadcast News
  –Topic Detection and Tracking TDT3 corpus
  –Voice of America broadcast news
    »ASR transcription
    »Manually segmented – known boundaries
  –~4,000 stories, ~750K words
Acoustic analysis (Praat)
  –Automatic pitch, intensity tracking
    »Smoothed, speaker-normalized, per-word
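
The speaker-normalized, per-word values mentioned above could be computed roughly as follows. This is only a sketch: the slides do not say which normalization was used, so z-scoring against each speaker's own mean and standard deviation is an assumption, and the function name and input layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_per_speaker(words):
    """words: list of (speaker_id, word, raw_value) tuples, one per word.

    Returns (speaker_id, word, z_score) tuples in the same order.
    Z-scoring is an assumed normalization, not necessarily the talk's.
    """
    by_speaker = defaultdict(list)
    for speaker, _, value in words:
        by_speaker[speaker].append(value)

    stats = {}
    for speaker, values in by_speaker.items():
        sd = pstdev(values)
        # Guard against a zero spread (e.g. a speaker with a single word).
        stats[speaker] = (mean(values), sd if sd > 0 else 1.0)

    return [(s, w, (v - stats[s][0]) / stats[s][1]) for s, w, v in words]
```

The same function would apply to pitch or intensity, as long as each word already carries one averaged value.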

Acoustic-Prosodic Cues
Languages differ in use of intonation
  –E.g. English: declarative fall, question rise
  –Chinese: pitch contour determines word meaning
At segment boundaries?
  –Surprisingly similar, though not identical
  –Significantly lower pitch at end of segment
  –Significantly lower amplitude at end of segment
  –Significantly longer duration at end of segment

Acoustic-Prosodic Contrasts
[Figure: two panels, Mandarin Normalized Pitch and Mandarin Normalized Intensity]

Learning Boundaries
Decision tree classifier (Quinlan C4.5)
  –Classification problem
    »For each word, classify as final/non-final
Features
  –Acoustic-Prosodic: Duration, Pitch, Loudness, Silence
  –Word average, Between-word difference
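
One way to turn the per-word measurements into the feature rows described above is sketched here. The field names, the choice to difference against the following word, and the zero fill for the last word are all assumptions made for illustration; the slides only name the feature types.

```python
def boundary_features(words):
    """words: list of dicts with per-word averages under 'pitch',
    'intensity' and 'duration', plus 'silence' (pause after the word).

    Returns one feature dict per word, ready for a classifier that
    labels each word as segment-final or non-final.
    """
    feats = []
    for i, w in enumerate(words):
        nxt = words[i + 1] if i + 1 < len(words) else None
        row = {
            "pitch": w["pitch"],
            "intensity": w["intensity"],
            "duration": w["duration"],
            "silence": w["silence"],
        }
        for key in ("pitch", "intensity"):
            # Between-word difference: next word minus current word.
            # The last word has no successor, so the difference is 0.0.
            row["d_" + key] = (nxt[key] - w[key]) if nxt is not None else 0.0
        feats.append(row)
    return feats
```

Each row, paired with a final/non-final label, could then be passed to any decision tree learner.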

Text Boundary Features
  –Text
Information retrieval style
  –Cosine similarity between weighted term vectors
    »tf*idf in 50-word windows
Cue phrases
  –N-gram features
    »Identified by BoosTexter (Schapire & Singer, 2000)
  –E.g. “Voice of America”, “Audience”, “Reporting”
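
The cosine similarity feature can be sketched as below. The slides give only the window size and the tf*idf weighting; placing one window immediately before and one immediately after the candidate boundary is an assumption, as are the function names and the precomputed `idf` dictionary.

```python
import math
from collections import Counter


def tfidf_vector(window, idf):
    """Weight raw term counts in a window by a precomputed idf table."""
    counts = Counter(window)
    return {term: n * idf.get(term, 0.0) for term, n in counts.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def boundary_similarity(tokens, i, idf, width=50):
    """Similarity between the `width` words ending at position i
    (inclusive) and the `width` words that follow it."""
    left = tokens[max(0, i + 1 - width): i + 1]
    right = tokens[i + 1: i + 1 + width]
    return cosine(tfidf_vector(left, idf), tfidf_vector(right, idf))
```

A low value after word `i` suggests a vocabulary shift, which is the topicality signal the text features are meant to capture.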

Classification Results
Balanced training and test sets
  –Results on held-out subsets
Acoustic cues only
  –95.6% accuracy
Text cues (+ silence)
  –95.6% accuracy
Combined text and prosody
  –96.4% accuracy
Typically, false alarms are twice as common as misses

Joint Decision Tree
[Figure: joint decision tree]

Feature Assessment
Role of silence
  –Useful in both text and acoustic classifiers
  –More necessary for text
Text captures topicality, not locality
  –Cannot identify boundaries sharply
Prosodic cues: localize boundaries
  –Multiple supporting cues: intensity, pitch: contrastive use

Issue: False Alarms
Evaluate representative sample
  –Boundary <<< Non-boundary
  –95.6% accuracy
    »2% miss, 4.4% false alarms
Non-boundaries are frequent, so false alarms are frequent
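
A rough calculation shows why a modest false alarm rate matters when boundaries are rare. It assumes that the miss rate is measured over boundary words and the false alarm rate over non-boundary words, and that the sample has the corpus proportions of roughly 4,000 story-final words in 750K words. Neither assumption is stated on the slides, and the function name is hypothetical.

```python
def error_counts(n_boundary, n_nonboundary, miss_rate, false_alarm_rate):
    """Expected misses, false alarms and overall accuracy, given
    class sizes and per-class error rates."""
    misses = miss_rate * n_boundary
    false_alarms = false_alarm_rate * n_nonboundary
    total = n_boundary + n_nonboundary
    accuracy = 1.0 - (misses + false_alarms) / total
    return misses, false_alarms, accuracy


# Assumed class sizes: ~4,000 final words out of ~750K words.
misses, false_alarms, accuracy = error_counts(4_000, 746_000, 0.02, 0.044)
print(round(misses), round(false_alarms), round(accuracy, 3))
```

Under these assumptions there are about 80 misses against roughly 32,800 false alarms, and the overall accuracy comes out near the reported 95.6%.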

Voting Against False Alarms
Error analysis:
  –Construct per-feature classifiers: prosody-only, text-only, silence-only
  –Compare classifiers: per-feature, joint
    »Joint classifier supported by only 0 or 1 per-feature classifiers: FALSE ALARM
Approach: Voting
  –Require joint + 2 per-feature classifiers
Result: 1/3 reduction in false alarms
  –~97% accuracy: 2.8% miss, 3.15% false alarm
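
The voting rule can be written in a few lines. This is a minimal sketch of the rule as stated on the slide (joint classifier plus at least two per-feature classifiers); the function name and the boolean interface are hypothetical.

```python
def vote(joint, per_feature, min_support=2):
    """Accept a boundary only if the joint classifier proposes it and
    at least `min_support` per-feature classifiers agree.

    joint: bool, the joint classifier's boundary decision.
    per_feature: iterable of bools, e.g. the prosody-only, text-only
    and silence-only decisions.
    """
    if not joint:
        return False
    support = sum(1 for decision in per_feature if decision)
    return support >= min_support
```

For example, `vote(True, [True, False, False])` rejects the boundary, since only one per-feature classifier backs the joint decision.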

Conclusion
Mandarin broadcast news segmentation
  –Identify topicality and boundary locality
Integrate text and acoustic cues
  –Text similarity: vector space model, n-gram cues
  –Prosodic cues: silence, intensity, pitch, duration
    »Robust across range of languages
Provide supporting and orthogonal information
Majority agreement of per-feature classifiers:
  –1/3 fewer false alarms

Current & Future Work
Improving the model of topicality
  –Richer text similarity models; broader acoustic models
Alternative classifiers
  –Preliminary experiments: Boosting, Boosted Decision Trees, MaxEnt – comparable
  –Alternative integration strategies
Hierarchical subtopic segmentation
  –Broadcast news
  –Dialogue: human-computer, human-human
Integration with multi-modal features: e.g. gesture, gaze

Acoustic-Prosodic Contrasts
[Figure: four panels, Mandarin Normalized Pitch, Mandarin Normalized Intensity, English Normalized Pitch and English Normalized Intensity]

Text Decision Tree

Prosodic Decision Tree

The Problem: Speech Topic Segmentation
Separate audio stream into component topics
Example (|| marks a topic boundary):
On "World News Tonight" this Thursday, another bad day on stock markets, all over the world global economic anxiety. ||
Another massacre in Kosovo, the U.S. and its allies prepare to do something about it. Very slowly. ||
And the millennium bug, Lubbock Texas prepares for catastrophe, India sees only profit. ||