Context and Prosody in the Interpretation of Cue Phrases in Dialogue Julia Hirschberg Columbia University and KTH 11/22/07 Spoken Dialog with Humans and Machines

2 In collaboration with Agustín Gravano, Stefan Benus, Héctor Chávez, Shira Mitchell, and Lauren Wilcox. With thanks to Gregory Ward and Elisa Sneed German.

3 Managing Conversation How do speakers indicate conversational structure in human/human dialogue? How do they communicate varying levels of attention, agreement, acknowledgment? What role does lexical choice play in these communicative acts? Phonetic realization? Prosodic variation? Prior context? Can human/human behavior be modeled in Spoken Dialogue Systems?

4 Cue Phrases/Discourse Markers/Cue Words/ Discourse Particles/Clue Words Linguistic expressions that can be employed  to convey information about the discourse structure, or  to make a semantic (literal?) contribution. Examples:  now, well, so, alright, and, okay, first, on the other hand, by the way, for example, …

5 Some Examples
“that’s pretty much okay”
Speaker 1: between the yellow mermaid and the whale
Speaker 2: okay
Speaker 1: and it is okay we gonna be placing the blue moon

6 A Problem for Spoken Dialogue Systems How do speakers produce and hearers interpret such potentially ambiguous terms?  How important is acoustic/prosodic information?  Phonetic variation?  Discourse context?

7 Research Goals Learn which features best characterize the different functions of single affirmative cue words. Determine how these can be identified automatically. Important in Spoken Dialogue Systems:  Understand user input.  Produce output appropriately.

8 Overview Previous research The Columbia Games Corpus  Collection paradigm  Annotations Perception Study of Okays  Experimental design  Analysis and results Machine Learning Experiments on Okay Future work: Entrainment and Cue Phrases

9 Previous Work General studies  Schiffrin ’82, ’87; Reichman ’85; Grosz & Sidner ’86 Cues to cue phrase disambiguation  Hirschberg & Litman ’87, ’93; Hockey ’93; Litman ’94 Cues to Dialogue Act identification  Jurafsky et al ’98; Rosset & Lamel ’04 Contextual cues to the production of backchannels  Ward & Tsukahara ’00; Sanjanhar & Ward ’06

10 The Columbia Games Corpus: Collection
- 12 spontaneous task-oriented dyadic conversations in Standard American English (9h 8m of speech).
- 2 subjects playing a series of computer games, with no eye contact (45m 39s mean session time); 2 sessions per subject, with different partners.
- Several types of games, designed to vary the way discourse entities became old, or ‘given’, in the discourse, in order to study variation in the intonational realization of information status.

11 Cards Game #1: Player 1 (Describer), Player 2 (Searcher). Short monologues. Vary frequency and order of occurrence of objects on the cards.

12 Cards Game #2: Player 1 (Describer), Player 2 (Searcher). Dialogue. Vary frequency and order of occurrence of objects on the cards across speakers.

13 Objects Game: The Follower must place the target object where it appears on the Describer’s screen, solely via the description provided (4h 19m).

14 The Columbia Games Corpus: Recording and Logging. Recorded on separate channels in a soundproof booth, digitized and downsampled to 16 kHz. All user and system behaviors were logged.

15 The Columbia Games Corpus Annotation Orthographic transcription and alignment (~73k words). Laughs, coughs, breaths, smacks, throat-clearings. Self-repairs. Intonation, using ToBI conventions. Function (10 categories) of affirmative cue words (alright, mm-hm, okay, right, uh-huh, yeah, yes, …). Question form and function. Turn-taking behaviors.

16 The Columbia Games Corpus: ToBI Labeling
- Tones: pitch accents (L*, H*, L*+H, H+!H*, …); phrase accents (L-, H-, !H-); boundary tones (L%, H%).
- Break indices: degrees of juncture, from 0 (no word boundary) to 4 (full intonational phrase boundary).
- Miscellaneous: disfluencies, non-speech sounds, …

17 The Columbia Games Corpus: ToBI Example [Figure: waveform and fundamental frequency (F0) track with ToBI labels]

18 Perception Study: Selection of Materials
Target functions of ‘okay’: Acknowledgment / Agreement, Backchannel, and Cue beginning discourse segment. Sample exchanges containing the target word:
Speaker 1: but it's gonna be below the onion
Speaker 2: okay
Speaker 1: okay alright I'll try it okay
Speaker 2: okay the owl is blinking
Speaker 1: yeah um there's like there's some space there's
Speaker 2: okay I think I got it

19 Perception Study: Experiment Design
54 instances of ‘okay’ (18 for each function); 2 tokens for each ‘okay’:
- Isolated condition: only the word ‘okay’.
- Contextualized condition: 2 full speaker turns: the turn containing the target ‘okay’, and the previous turn by the other speaker.

20 Perception Study: Experiment Design
Tokens balanced by original labeler agreement: one third where all 3 labelers agreed, one third where 2 agreed, and one third where none did.
Two conditions:
- Part 1: 54 isolated tokens
- Part 2: 54 contextualized tokens
Subjects were asked to classify each token of ‘okay’ as: Acknowledgment / Agreement, Backchannel, or Cue beginning discourse segment.

21 Perception Study: Definitions Given to the Subjects
- Acknowledgment/Agreement: the function of okay that indicates “I believe what you said” and/or “I agree with what you say”.
- Backchannel: the function of okay in response to another speaker's utterance that indicates only “I’m still here” or “I hear you and please continue”.
- Cue beginning discourse segment: the function of okay that marks a new segment of a discourse or a new topic. This use of okay could be replaced by now.

22 Perception Study Subjects and Procedure Subjects:  20 paid subjects (10 female, 10 male).  Ages between 20 and 60.  Native speakers of English.  No hearing problems. GUI on a laboratory workstation with headphones.

23 Results: Inter-Subject Agreement
Kappa measure of agreement with respect to chance (Fleiss ’71), computed in both the Isolated and Contextualized conditions: overall, and for each binary split (Ack / Agree vs. Other; Backchannel vs. Other; Cue beginning vs. Other).
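For reference, Fleiss’ kappa compares the mean observed agreement across tokens, $\bar{P}$, with the agreement expected by chance, $\bar{P}_e$:

\[ \kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e} \]

so that $\kappa = 1$ indicates perfect agreement and $\kappa = 0$ indicates agreement no better than chance.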

24 Results: Cues to Interpretation
Phonetic transcription of ‘okay’:
- Isolated condition: strong correlation between the realization of the initial vowel and the perceived function (one variant associated with Backchannel; another with Ack/Agree and Cue Beginning).
- Contextualized condition: no strong correlations found for phonetic variants.

25 Results: Cues to Interpretation

Function       | Isolated Condition          | Contextualized Condition
Ack / Agree    | Shorter /k/                 | Shorter latency between turns;
               |                             | shorter pause before okay
Backchannel    | Higher final pitch slope;   | Higher final pitch slope;
               | longer 2nd syllable;        | more words by S2 before okay;
               | lower intensity             | fewer words by S1 after okay
Cue beginning  | Lower final pitch slope;    | Lower final pitch slope;
               | lower overall pitch slope   | longer latency between turns;
               |                             | more words by S1 after okay

S1 = utterer of the target ‘okay’. S2 = the other speaker.

26 Results: Cues to Interpretation
Phrase-final intonation (ToBI), in both the isolated and contextualized conditions:
- H-H% → Backchannel
- H-L%, L-H% → Ack/Agree, Backchannel
- L-L% → Ack/Agree, Cue beginning

27 Perception Study: Conclusions
- Agreement: availability of context improves inter-subject agreement; cue beginnings are easier to disambiguate than the other two functions.
- Cues to interpretation: contextual features override word features, with one exception: the final pitch slope of ‘okay’ is a cue in both conditions.

28 Machine Learning Experiments: Okay Can we identify the different functions of okay in our larger corpus reliably? What features perform best?  How do these compare to those that predict human judgments?

29 Method
- ML algorithm: JRip, Weka’s implementation of the propositional rule learner Ripper (Cohen ’95).
- We also tried J4.8, Weka’s implementation of the decision tree learner C4.5 (Quinlan ’93, ’96), with similar results.
- 10-fold cross-validation in all experiments.
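A minimal sketch of this setup against Weka’s Java API is below. The ARFF file name and feature layout are placeholder assumptions, not the study’s actual pipeline:

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OkayFunctionClassifier {
    public static void main(String[] args) throws Exception {
        // Load feature vectors; "okay_features.arff" is a hypothetical file
        // with one instance per token of 'okay' and the class label last.
        Instances data = DataSource.read("okay_features.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // JRip is Weka's implementation of the RIPPER rule learner.
        JRip ripper = new JRip();

        // 10-fold cross-validation, as in the experiments.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(ripper, data, 10, new Random(1));

        System.out.printf("Error rate: %.3f%n", eval.errorRate());
        System.out.println(eval.toClassDetailsString()); // per-class F-measure
    }
}
```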

30 Units of Analysis
- IPU (inter-pausal unit): maximal sequence of words delimited by pauses > 50 ms.
- Conversational turn: maximal sequence of IPUs by the same speaker, with no contribution from the other speaker.
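As an illustration of the IPU definition, grouping time-aligned words into IPUs reduces to a single pass over the word list with the 50 ms pause threshold. The Word record here is a hypothetical representation of the corpus alignment:

```java
import java.util.ArrayList;
import java.util.List;

public class IpuSegmenter {
    // Hypothetical representation of one time-aligned word.
    record Word(String text, double startSec, double endSec) {}

    static final double PAUSE_THRESHOLD_SEC = 0.050; // pause > 50 ms splits IPUs

    // A new IPU starts whenever the silence between the previous word's end
    // and the current word's start exceeds the threshold.
    static List<List<Word>> segment(List<Word> words) {
        List<List<Word>> ipus = new ArrayList<>();
        List<Word> current = new ArrayList<>();
        for (Word w : words) {
            if (!current.isEmpty()
                    && w.startSec() - current.get(current.size() - 1).endSec() > PAUSE_THRESHOLD_SEC) {
                ipus.add(current);
                current = new ArrayList<>();
            }
            current.add(w);
        }
        if (!current.isEmpty()) ipus.add(current);
        return ipus;
    }
}
```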

31 Experimental Features
- Text-based features (from transcriptions): word identity, POS tags (automatic); position of word in IPU / turn; IPU and turn length in words; whether the previous turn was by the same speaker.
- Timing features (from time alignment): word / IPU / turn duration; amount of speaker overlap; time to word beginning/end within IPU and turn.
- Acoustic features: {min, mean, max, stdev} × {pitch, intensity}; slope of pitch, stylized pitch, and intensity over the whole word and over its last 100, 200, 300 ms; acoustic features from the last IPU of the prior speaker’s turn.
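As a sketch of one of the acoustic features above, the pitch slope over a word’s final 100/200/300 ms window can be computed by a least-squares fit of F0 against time. The array-based F0 representation is an assumption:

```java
public class PitchSlope {
    /**
     * Least-squares slope (Hz/s) of F0 over the last windowMs of a word.
     * times[i] is the timestamp (s) of f0[i]; unvoiced frames are assumed
     * to have been removed already.
     */
    static double finalSlope(double[] times, double[] f0, double wordEndSec, double windowMs) {
        double windowStart = wordEndSec - windowMs / 1000.0;
        double sumT = 0, sumF = 0, sumTT = 0, sumTF = 0;
        int n = 0;
        for (int i = 0; i < times.length; i++) {
            // Keep only frames inside [wordEnd - window, wordEnd].
            if (times[i] < windowStart || times[i] > wordEndSec) continue;
            sumT += times[i]; sumF += f0[i];
            sumTT += times[i] * times[i]; sumTF += times[i] * f0[i];
            n++;
        }
        if (n < 2) return 0.0; // not enough voiced frames in the window
        // Ordinary least-squares slope of f0 vs. time.
        return (n * sumTF - sumT * sumF) / (n * sumTT - sumT * sumT);
    }
}
```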

32 Results: Classification of Individual Words
Classification of each individual word into its most common functions:
- alright → Ack/Agree, Cue Begin, Other
- mm-hm → Ack/Agree, Backchannel
- okay → Ack/Agree, Backchannel, Cue Begin, Ack+CueBegin, Ack+CueEnd, Other
- right → Ack/Agree, Check, Literal Modifier
- yeah → Ack/Agree, Backchannel

33 Majority-Labeled Functions of Okay (n = 2434)
- 1137  Acknowledgment / Agreement
- 548   Cue beginning discourse segment
- 232   Pivot ending (A/A + Cue end)
- 121   Backchannel
- 68    Pivot beginning (A/A + Cue beg)
- 33    Check with the interlocutor
- 29    Literal modifier
- 15    Stall / Filler
- 10    Cue ending discourse segment
- 6     Back from task

34 Results: Classification of ‘okay’
Feature sets evaluated: Text-based; Acoustic; Text-based + Timing; Full set; Baseline (1); Human labelers (2). For each: error rate, and F-measure for Ack / Agree, Backchannel, Cue Begin, Ack/Agree + Cue Begin, Ack/Agree + Cue End, and Majority Label.
(1) Majority-class baseline: ACK/AGREE.
(2) Calculated with respect to each labeler’s agreement with the majority labels.

35 Conclusions: ML Experiments
- Context and timing features mattered, as in the perception-in-context results: pause after okay (not before); number of succeeding words.
- Acoustic features were impoverished: no phonetic features, no pitch slope; and ToBI labels (where available) didn’t help.

36 Future Work
- Experiments with full ToBI labeling and other features.
- Lexical, acoustic-prosodic, and discourse entrainment and dis-entrainment: positive correlations for affirmative cue words; affirmative cue word entrainment vs. game scores; affirmative cue word entrainment vs. overlaps and interruptions in turn-taking.

Thanks!

38 Other Work
- Benus et al., 2007. “The prosody of backchannels in American English”, ICPhS 2007, Saarbrücken, Germany, August 2007.
- Gravano et al., 2007. “Classification of discourse functions of affirmative words in spoken dialogue”, Interspeech 2007, Antwerp, Belgium, August 2007.

39 Importance for Spoken Dialogue Systems
- Convey ambiguous terms with the intended meaning.
- Interpret the user’s input correctly.

40 Experiment Design
Goal: study the relation between the downstepped contour and information status, syntactic position, and discourse position. Spontaneous speech, in both monologue and dialogue.

41 Experiment Design Three computer games. Two players, each on a different computer. They collaborate to perform a common task. Totally unrestricted speech.

42 Objects Game: Player 1 (Describer), Player 2 (Searcher). Dialogue. Vary target and surrounding objects (subject and object position).

43 Games Session
- Repeat 3 times: Cards Game #1, then Cards Game #2.
- Short break (optional).
- Repeat 3 times: Objects Game.
Each subject participated in 2 sessions, for 12 sessions in total.

44 Subjects
Postings: Columbia’s webpage for temporary job ads; Craigslist (category: Gigs > Event gigs).
Problem: people are unreliable; ~50% did not show up or cancelled on short notice.

45 Subjects
Possible solutions:
- Give precise instructions and ask for ALL required info: name, native speaker?, hearing impairments?, etc.
- Ask for a phone number.
- Call them and explain why it is so important for us that they show up (or cancel with adequate notice).
- Increase the pay after each session, e.g. $5, $10, $15 instead of $10, $10, $10.

46 Recording
- Sound-proof booth: 2 subjects + 1 or 2 confederates; head-mounted mics; Digital Audio Tape (DAT), one channel per speaker.
- Wav files: one mono file per speaker, downsampled to 16 kHz (but original files kept!).
- ~20 hours of speech; 2.8 GB at 16 kHz.

47 Logs
Log everything the subjects do to a text file. Example:
17:03:55:234  BEGIN_EXECUTION
17:04:04:868  NEXT_TURN
17:04:31:837  RESULTS  97 points awarded.
17:04:38:426  NEXT_TURN
17:05:03:873  RESULTS  92 points awarded.
...
Later, this may be used (e.g.) to divide each session into smaller tasks or conversations.
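A sketch of how such log lines could be parsed to recover event timestamps for that kind of segmentation; the whitespace separator and field layout are assumptions about the log format:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogParser {
    // Hypothetical parsed event: ms since midnight, event name, optional detail.
    record Event(long timeMs, String name, String detail) {}

    // Matches lines like "17:04:31:837  RESULTS  97 points awarded."
    private static final Pattern LINE =
            Pattern.compile("(\\d{2}):(\\d{2}):(\\d{2}):(\\d{3})\\s*(\\w+)\\s*(.*)");

    static List<Event> parse(List<String> lines) {
        List<Event> events = new ArrayList<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (!m.matches()) continue; // skip malformed lines
            // Convert HH:MM:SS:mmm to milliseconds since midnight.
            long ms = ((Long.parseLong(m.group(1)) * 60
                      + Long.parseLong(m.group(2))) * 60
                      + Long.parseLong(m.group(3))) * 1000
                      + Long.parseLong(m.group(4));
            events.add(new Event(ms, m.group(5), m.group(6)));
        }
        return events;
    }
}
```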