2 Turn-Taking and Affirmative Cue Words in Task-Oriented Dialogue. Agustín Gravano, Columbia University. Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences.

3 Agustín Gravano - Thesis Defense - Jan 28, Special thanks to: Julia Hirschberg. Committee members: Maxine Eskenazi, Kathy McKeown, Becky Passonneau, Amanda Stent. The Speech Lab: Stefan Benus, Fadi Biadsy, Sasha Caskey, Bob Coyne, Frank Enos, Martin Jansche, Jackson Liscombe, Sameer Maskey, Andrew Rosenberg. Collaborators: Gregory Ward and Elisa Sneed German (Northwestern U); Ani Nenkova (UPenn); Héctor Chávez, David Elson, Michel Galley, Enrique Henestroza, Hanae Koiso, Shira Mitchell, Michael Mulley, Kristen Parton, Ilia Vovsha, and Lauren Wilcox. Family and friends.

4 Introduction: Interactive Voice Response (IVR) Systems. Quickly spreading; mostly simple functionality. Examples of IVR systems: Let's Go!, bus scheduling information (CMU); GOOG-411, local information (Google). Most visible components of IVR systems: Automatic Speech Recognition (ASR) and Text-To-Speech (TTS).

5 Introduction: Interactive Voice Response Systems. ASR and TTS account for most IVR problems. ASR: up to 60% word error rate. TTS: described as odd or mechanical. As ASR and TTS improve, other problems begin to show: coordination of system-user exchanges, and frequent words overloaded with multiple functions.

6 Introduction: Coordination of Exchanges. Let's Go! demo (http://www.speech.cs.cmu.edu/letsgo/):
S: Thank you for calling the CMU Let's Go! Bus Information System. […] What can I do for you?
U: I would like to go to the airport tomorrow morning. [silence]
S: To the airport. When do you want to arrive?
U: I'd like to arrive at 10:30. [silence]
S: Arriving at around 10:30 AM. Where do you want to leave from?
U: I'd like to leave from Carnegie Mellon. [silence]
S: From Carnegie Mellon. There is a 28X leaving Forbes Avenue […]
Turn boundary detection is currently based on silence detection. Problems: latencies and false positives.
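The silence-based strategy named on this slide can be illustrated with a minimal sketch. The function name, the 10 ms frame size, and the 700 ms threshold are assumptions chosen for illustration; they are not taken from Let's Go! or any other real system.

```python
def detect_turn_end(is_speech, frame_ms=10, silence_ms=700):
    """Return the time (ms) at which a silence-based endpointer declares the turn over.

    is_speech: sequence of booleans, one per frame (True = voice activity).
    The turn is declared finished once `silence_ms` of uninterrupted silence
    follows some speech. Returns None if that never happens.
    """
    needed = silence_ms // frame_ms   # consecutive silent frames required
    heard_speech = False
    silent_run = 0
    for i, speech in enumerate(is_speech):
        if speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return (i + 1) * frame_ms
    return None


# Hypothetical user turn: 1 s of speech, a 300 ms hesitation,
# 500 ms more speech, then 1 s of silence.
frames = [True] * 100 + [False] * 30 + [True] * 50 + [False] * 100

print(detect_turn_end(frames))                  # 2500: 700 ms after speech ended at 1800 ms
print(detect_turn_end(frames, silence_ms=200))  # 1200: fires inside the hesitation
```

A long threshold adds latency to every exchange, while a short one cuts the user off during a hesitation; these are the two problems listed on the slide.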

7 Introduction: Overloaded Cue Words. Cue words: expressions such as by the way, however, after all. Frequent in dialogue; used for structuring discourse and shaping conversation. Affirmative cue words: okay, alright, etc. They convey acknowledgment, start a new topic, or display continued attention, inter alia, and are frequent in task-oriented dialogue. Relevant to IVR systems for both understanding and generation.

8 Introduction: Motivation. Understand and incorporate these and other phenomena into IVR systems, aiming at gradually approaching human-like behavior. The work describes associations between observed phenomena (e.g. turn exchange types) and measurable events (e.g. variations in acoustic features). No strong claims are made about the degree of awareness of speakers and listeners.

9 (1) Columbia Games Corpus (2) Study of Turn-Taking (3) Study of Affirmative Cue Words

10 Columbia Games Corpus. Task-oriented spontaneous dialogues. Two subjects, each with a laptop computer. Series of collaborative computer games. Soundproof booth; head-mounted mics. No eye contact; only verbal communication. No restrictions; subjects could speak freely.

11 Columbia Games Corpus: Cards Game, Part 1. Player 1: Describer. Player 2: Searcher.

12 Columbia Games Corpus: Cards Game, Part 2. Player 1: Describer. Player 2: Searcher.

13 Columbia Games Corpus: Objects Game. Player 1: Describer. Player 2: Follower.

14 Columbia Games Corpus. 12 sessions, 13 subjects (6 female, 7 male). 9 hours of dialogue. Orthographic transcription and alignment: 70K words, 2K unique words; non-word vocalizations (laughs, coughs, etc.). Prosodic transcription (ToBI conventions). Automatically generated session logs.

15 (1) Columbia Games Corpus (2) Study of Turn-Taking (3) Study of Affirmative Cue Words

16 Turn-Taking: Goals. Speech understanding: detection of the end of the user's turn; detection of points in the user's turn where a backchannel response would be welcome. Speech generation: display of cues signalling the end of the system's turn; display of cues inviting the user to produce a backchannel response.

17 Turn-Taking: Previous Work. Sacks, Schegloff & Jefferson: general characterization of turn-taking in conversation between two or more persons. Transition-relevance place: the current speaker may either yield the turn or continue speaking. Duncan 1972, 1973, 1974, inter alia: six turn-yielding cues in face-to-face dialogue; linear relation between the number of displayed cues and the likelihood of a turn-taking attempt.

18 Turn-Taking: Previous Work. Corpus and perception studies: formalized and verified some of the turn-yielding cues hypothesized by Duncan (Ford & Thompson 1996; Wennerstrom & Siegel 2003; Cutler & Pearson 1986; Wichmann & Caspers). Implementations of turn-boundary detection: simulations (Ferrer et al. 2002, 2003; Edlund et al.; Schlangen 2006; Atterer et al. 2008; Baumann 2008) and actual systems (Raux & Eskenazi 2008, on Let's Go!). Exploiting turn-yielding cues improves performance.

19 Turn-Taking: Turn-Yielding Cues. Cues displayed by the speaker when approaching a potential turn boundary.

20 Turn-Yielding Cues: Method. IPU (Inter-Pausal Unit): maximal sequence of words from the same speaker surrounded by silence ≥ 50 ms. Smooth switch: Speaker A finishes her utterance; speaker B takes the turn with no overlapping speech. Trained annotators distinguished Smooth switches from Interruptions and Backchannels using a scheme based on Ferguson 1977, Beattie. [Diagram: Speaker A produces IPU1 and IPU2, separated by a Hold; Speaker B produces IPU3 after a Smooth switch.]
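The IPU definition on this slide can be sketched in code. This is a minimal illustration that assumes time-aligned words for a single speaker and treats a gap of 50 ms or more as an IPU boundary; the `Word` class and the function name are hypothetical, and the exact comparison at the threshold is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def split_into_ipus(words, min_silence=0.050):
    """Group one speaker's time-aligned words into inter-pausal units (IPUs).

    A new IPU starts whenever the silence between the previous word's end
    and the current word's start is at least `min_silence` seconds.
    """
    ipus = []
    for w in sorted(words, key=lambda w: w.start):
        if ipus and w.start - ipus[-1][-1].end < min_silence:
            ipus[-1].append(w)   # gap too short: same IPU
        else:
            ipus.append([w])     # first word, or gap long enough: new IPU
    return ipus


# Hypothetical alignment for one speaker.
words = [
    Word("okay", 0.00, 0.30),
    Word("so", 0.32, 0.50),
    Word("the", 0.90, 1.00),
    Word("lion", 1.01, 1.40),
]
for ipu in split_into_ipus(words):
    print([w.text for w in ipu])
```

With this alignment the 400 ms pause after "so" splits the words into two IPUs, while the 20 ms and 10 ms gaps do not.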

21 Turn-Yielding Cues: Method. Compare IPUs preceding Holds and IPUs preceding Smooth switches. Assumption: cues are more likely to occur before Smooth switches than before Holds. [Diagram: Speaker A produces IPU1 and IPU2, separated by a Hold; Speaker B produces IPU3 after a Smooth switch.]

22 Individual Turn-Yielding Cues: 1. Final intonation 2. Speaking rate 3. Intensity level 4. Pitch level 5. Textual completion 6. Voice quality 7. IPU duration

23 Individual Turn-Yielding Cues: 1. Final Intonation.
                    Smooth switch   Hold
H-H%                22.1%           9.1%
[!]H-L%             13.2%           29.9%
L-H%                14.1%           11.5%
L-L%                47.2%           24.7%
No boundary tone    0.7%            22.4%
Other               2.6%            2.4%
Total               100%            100%
(χ² test: p ≈ 0)
Falling, high-rising: turn-final. Plateau: turn-medial. Examination of final pitch slope shows the same results.

24 Individual Turn-Yielding Cues: 2. Speaking Rate. Reduced final lengthening before turn boundaries. [Figure: z-scores for Smooth switch vs. Hold, over the final word and over the entire IPU; (*) ANOVA: p < 0.01.]

25 Individual Turn-Yielding Cues: 3/4. Intensity and Pitch Levels. Lower intensity and pitch levels before turn boundaries. [Figure: z-scores of intensity and pitch for Smooth switch vs. Hold; (*) ANOVA: p < 0.01.]
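The plots on these slides report z-scores. A minimal sketch of z-score normalization follows; normalizing against each speaker's own mean and standard deviation is an assumption here (the slides only say "z-score"), and the function name and sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_by_speaker(values):
    """Normalize raw feature values against each speaker's own distribution.

    values: list of (speaker_id, raw_value) pairs, e.g. mean pitch per IPU.
    Returns the z-scores in the same order as the input.
    """
    by_speaker = defaultdict(list)
    for speaker, value in values:
        by_speaker[speaker].append(value)

    # Per-speaker mean and population standard deviation.
    stats = {s: (mean(vs), pstdev(vs)) for s, vs in by_speaker.items()}

    z = []
    for speaker, value in values:
        mu, sd = stats[speaker]
        z.append((value - mu) / sd if sd > 0 else 0.0)
    return z


# Hypothetical mean pitch (Hz) per IPU for two speakers with different ranges.
pitch = [("A", 100.0), ("A", 120.0), ("B", 200.0), ("B", 240.0)]
print(zscore_by_speaker(pitch))
```

After normalization, a "low" value means low for that speaker, which makes features such as pitch and intensity comparable across speakers with different ranges.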

26 Individual Turn-Yielding Cues: 5. Textual Completion. Syntactic/semantic/pragmatic completion, independent of intonation and gesticulation. Automatic computation of textual completion: (1) Manually annotated a portion of the data: 3 labelers; 400 IPUs; Fleiss' κ = […]. (2) Trained an SVM classifier: 80% accuracy; baseline: 55%; human: 91%.

27 Individual Turn-Yielding Cues: 5. Textual Completion. Labeled all IPUs in the corpus with the SVM model.
              Smooth switch   Hold
Incomplete    18%             47%
Complete      82%             53%
(χ² test, p ≈ 0)
Textual completion seems to be almost a necessary condition before switches, but not before holds.

28 Individual Turn-Yielding Cues: 6. Voice Quality. Higher jitter, shimmer, and NHR (noise-to-harmonics ratio) before turn boundaries. [Figure: z-scores of jitter, shimmer, and NHR for Smooth switch vs. Hold; (*) ANOVA: p < 0.01.]

29 Individual Turn-Yielding Cues: 7. IPU Duration. Longer IPUs before turn boundaries. [Figure: z-scores of IPU duration for Smooth switch vs. Hold; (*) ANOVA: p < 0.01.]

30 Turn-Yielding Cues: Individual Cues. 1. Final intonation 2. Speaking rate 3. Intensity level 4. Pitch level 5. Textual completion 6. Voice quality 7. IPU duration

31 Turn-Yielding Cues: Combined Cues. [Figure: percentage of turn-taking attempts against the number of cues conjointly displayed; r² = 0.969.]
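The relation plotted on this slide can be computed along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, the fit is an ordinary least-squares line, and the observations are made-up placeholders rather than corpus figures.

```python
from collections import Counter


def attempt_rate_by_cue_count(observations):
    """observations: iterable of (n_cues, turn_taken) pairs, one per IPU.

    Returns {n_cues: percentage of IPUs with that many cues that were
    followed by a turn-taking attempt}.
    """
    totals, taken = Counter(), Counter()
    for n_cues, turn_taken in observations:
        totals[n_cues] += 1
        taken[n_cues] += int(turn_taken)
    return {n: 100.0 * taken[n] / totals[n] for n in sorted(totals)}


def linear_fit_r2(xs, ys):
    """Least-squares line through (xs, ys); returns (slope, intercept, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept, (sxy * sxy) / (sxx * syy)


# Placeholder observations (NOT corpus data): (number of cues, turn taken?).
obs = [(0, False), (0, False), (1, True), (1, False), (1, False),
       (2, True), (2, False), (3, True), (3, True), (3, False)]
rates = attempt_rate_by_cue_count(obs)
slope, intercept, r2 = linear_fit_r2(list(rates), list(rates.values()))
print(rates, round(r2, 3))
```

An r² close to 1 indicates that the percentage of turn-taking attempts grows roughly linearly with the number of cues displayed together.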

32 Turn-Taking: Backchannel-Inviting Cues. Cues displayed by the speaker inviting the listener to produce a backchannel response.

33 Backchannel-Inviting Cues: Method. Compare IPUs preceding Holds and IPUs preceding Backchannels. Assumption: cues are more likely to occur before Backchannels than before Holds. [Diagram: Speaker A and Speaker B timelines with IPU1 to IPU4, marking a Hold and a Backchannel.]

34 Backchannel-Inviting Cues: Individual Cues. 1. Final rising intonation: H-H% or L-H%. 2. Higher intensity level. 3. Higher pitch level. 4. Longer IPU duration. 5. Lower NHR. 6. Final POS bigram: DT NN, JJ NN, or NN NN.
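The six cues listed on this slide can be turned into a simple cue counter. This is a sketch only: the dictionary layout, the field names, and the use of z-scores compared against a threshold of 0 to decide what counts as "higher", "longer", or "lower" are all assumptions made for illustration, not the definitions used in the thesis.

```python
RISING_TONES = {"H-H%", "L-H%"}
BC_POS_BIGRAMS = {("DT", "NN"), ("JJ", "NN"), ("NN", "NN")}


def count_backchannel_cues(ipu, z_threshold=0.0):
    """Count how many of the six backchannel-inviting cues an IPU displays.

    ipu: dict with hypothetical keys
      boundary_tone : ToBI phrase accent + boundary tone, e.g. "L-H%"
      intensity_z, pitch_z, duration_z, nhr_z : z-scored features
      pos_tags      : list of POS tags for the words in the IPU
    """
    cues = [
        ipu["boundary_tone"] in RISING_TONES,            # 1. final rising intonation
        ipu["intensity_z"] > z_threshold,                # 2. higher intensity
        ipu["pitch_z"] > z_threshold,                    # 3. higher pitch
        ipu["duration_z"] > z_threshold,                 # 4. longer IPU
        ipu["nhr_z"] < -z_threshold,                     # 5. lower NHR
        tuple(ipu["pos_tags"][-2:]) in BC_POS_BIGRAMS,   # 6. final POS bigram
    ]
    return sum(cues)


# Hypothetical IPU ending in "... the yellow lion" with a rising contour.
example = {
    "boundary_tone": "L-H%",
    "intensity_z": 0.4,
    "pitch_z": 0.7,
    "duration_z": -0.2,
    "nhr_z": -0.5,
    "pos_tags": ["IN", "DT", "JJ", "NN"],
}
print(count_backchannel_cues(example))  # 5 (every cue except duration)
```

A system could compare such a count against a tuned cut-off to decide whether a backchannel is likely to be welcome at that point.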

35 Backchannel-Inviting Cues: Combined Cues. [Figure: percentage of IPUs followed by a backchannel (BC) against the number of cues conjointly displayed; r² = […]; r² = 0.993.]

36 Turn-Taking: Overlapping Speech. 95% of overlaps start during the turn-final intermediate phrase (ip). We look for turn-yielding cues in the second-to-last intermediate phrase (e.g., ip2). [Diagram: Speaker A and Speaker B timelines with ip1, ip2, ip3, marking a Hold and an Overlap.]

37 Turn-Taking: Overlapping Speech. Cues found in second-to-last ips: higher speaking rate; lower intensity; higher jitter, shimmer, NHR. All cues match the corresponding cues found in (non-overlapping) smooth switches. Cues seem to extend further back in the turn, becoming more prominent toward turn endings. Future research: generalize the model of discrete turn-yielding cues.

38 (1) Columbia Games Corpus (2) Study of Turn-Taking (3) Study of Affirmative Cue Words

39 Affirmative Cue Words. 8% of the words in the Columbia Games Corpus: okay, right, yeah, mm-hm, alright, uh-huh, gotcha, huh, yep, yes, yup. 10 discourse/pragmatic functions: Acknowledgment/agreement, Literal modifier, Backchannel, Cue beginning/ending discourse segment, Check with the interlocutor, Stall/Filler, Back from a task, Pivot beginning/ending (Ack+Cue). Labeled by 3 trained annotators. Fleiss' κ = 0.69: substantial agreement.
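Inter-annotator agreement of the kind reported on this slide is commonly measured with Fleiss' κ. A minimal sketch of the standard formula follows; the toy ratings are invented for illustration and do not come from the corpus.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a fixed number of annotators per item.

    ratings: list of rows, one per item; row[j] is the number of annotators
    who assigned category j to that item. Every row must sum to the same n.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])

    # Proportion of all assignments that went to each category.
    p_cat = [sum(row[j] for row in ratings) / (n_items * n_raters)
             for j in range(n_cats)]

    # Observed agreement on each item, then averaged over items.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in ratings]
    p_observed = sum(p_item) / n_items

    # Agreement expected by chance.
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy example: 3 annotators, 3 categories (say Ack, Backchannel, Cue beginning).
toy = [
    [3, 0, 0],
    [0, 3, 0],
    [2, 1, 0],
    [0, 0, 3],
    [1, 2, 0],
]
print(round(fleiss_kappa(toy), 3))
```

Values between 0.61 and 0.80 are conventionally read as substantial agreement, which matches the interpretation given on the slide for κ = 0.69.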

40 Affirmative Cue Words: Examples.
Literal modifier: "that's pretty much okay"
Backchannel: Speaker 1: "between the yellow mermaid and the whale"; Speaker 2: "okay"; Speaker 1: "and it is"
Cue beginning discourse segment: "okay we're gonna be placing the blue moon"

41 Affirmative Cue Words: Interactive Voice Response Systems. Speech understanding: must interpret the user's input correctly. Speech generation: need to convey potentially ambiguous terms with the appropriate parameters for the intended meaning.

42 Affirmative Cue Words: Previous Work. Disambiguation of single-word cue phrases (well, now, say, so, like, really, …): discourse vs. sentential senses. Hirschberg & Litman 1987, 1993; Litman 1994, 1996; Zufferey & Popescu-Belis 2004; Lai. Affirmative cue words: Hockey 1991, 1992 and Kowtko 1997: intonational differences across discourse/pragmatic functions. Jurafsky et al. 1998: lexical identity is a strong cue to word function.

43 Affirmative Cue Words: Descriptive statistics. Large contextual differences. Backchannels always occur as separate turns. Cue beginnings occur mostly in turn-initial position. Modifier instances of right occur in all positions within the turn, but rarely as separate turns. Acknowledgments occur in turn-initial, medial and final positions, and also as separate turns.

44 Affirmative Cue Words: Descriptive statistics. Final intonation: Backchannel: rising (H-H%, L-H%); Cue beginning: falling (L-L%); Check: high-rising (H-H%). Intensity: Backchannel: high; Cue beginning: high; Cue ending: low.

45 Affirmative Cue Words: Perception study of okay. Okay is the most frequent affirmative cue word (ACW) in the corpus. How do hearers disambiguate its meaning? Acoustic/prosodic/phonetic vs. contextual information? 20 subjects classified 54 tokens of okay into {Ack, BC, CueBeg} in two conditions. No context available: only the word okay. Context available: 2 full speaker turns. [Diagram: Speaker A and Speaker B turns surrounding the contextualized okay.]

46 Affirmative Cue Words: Perception study of okay. No context available: very low inter-subject agreement; correlations of word function with acoustic/prosodic/phonetic features. Context available: higher inter-subject agreement; contextual features trump acoustic/prosodic/phonetic features of okay. Exception: final intonation of okay.

47 Affirmative Cue Words: Automatic Classification. Goal: automatically identify the function of ACWs. Classification into discourse vs. sentential function is insufficient for ACWs: right: 15% discourse, 85% sentential; all other ACWs: 99% discourse, 1% sentential. New classification tasks: (1) Detection of an acknowledgment function: Acknowledgment vs. No acknowledgment. (2) Detection of a discourse segment boundary function: SegBeg vs. SegEnd vs. None.

48 Affirmative Cue Words: Automatic Classification. Lexical features: lexical id, POS tags, n-grams. Discourse features: position of target word in IPU, turn, conversation. Timing features: duration of word, IPU, turn; amount of overlaps; latencies. Acoustic features: pitch, intensity, pitch slope, voice quality. Phonetic features: id and duration of each phone.

49 Affirmative Cue Words: Automatic Classification. Error rate:
                                  Discourse Boundary   Acknowledgment
Baseline (1)                      18.6%                15.3%
SVM: Word-only                    14.4%                15.0%
SVM: Online (up to current IPU)   10.1%                6.7%
SVM: Full model                   6.9%                 4.5%
Human labelers                    5.7%                 3.3%
(1) Discourse Boundary: majority class == no boundary. Acknowledgment: {right, huh} → no ACK; all others → ACK.
(*) Significantly different (Wilcoxon signed rank sum test; p < 0.05).
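The acknowledgment baseline in footnote (1) is a word-identity rule: right and huh are labeled as not conveying acknowledgment, every other affirmative cue word as conveying it. A minimal sketch of that rule and of an error-rate computation follows; the function names and the labeled tokens are hypothetical.

```python
NON_ACK_WORDS = {"right", "huh"}


def baseline_is_ack(word):
    """Word-identity baseline: right/huh -> no ACK; all other cue words -> ACK."""
    return word.lower() not in NON_ACK_WORDS


def error_rate(tokens):
    """tokens: list of (word, gold_is_ack) pairs. Returns the error rate in percent."""
    errors = sum(baseline_is_ack(word) != gold for word, gold in tokens)
    return 100.0 * errors / len(tokens)


# Invented gold labels, for illustration only.
tokens = [
    ("okay", True),
    ("right", False),
    ("right", True),    # an acknowledging "right": the baseline gets this wrong
    ("huh", False),
]
print(error_rate(tokens))  # 25.0
```

Because the rule looks only at the word itself, it cannot separate different uses of the same word; the SVM rows in the table add contextual, timing, and acoustic features to address that.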

50 Affirmative Cue Words: Speaker Entrainment. In conversation, people adapt the way they speak to match their partner: referring expressions (Brennan 1996); syntactic constructions (Reitter et al. 2006); intensity (Coulston et al. 2002, Ward & Litman 2007). Entrainment at different levels (lexical, syntactic, semantic): key for both production and understanding, and facilitates interaction (Pickering & Garrod 2004, Goleman 2006); predictor of task success (MapTask; Reitter & Moore 2007).

51 Affirmative Cue Words: Speaker Entrainment. Two novel measures of entrainment based on usage of high-frequency words (HFW), including ACW. Entrainment of HFW correlates with task success: (+) game score; and with dialogue coordination: (+) proportion of overlaps, (–) proportion of interruptions, (–) latency of smooth switches. Future work: establish causality relation; impact on IVR system design and/or evaluation.
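The slide does not define its two entrainment measures. As an illustration only, one simple way to quantify similarity in word usage is the negated sum of absolute differences between the two speakers' relative frequencies for each word of interest, so that 0 means identical usage and more negative values mean less entrainment. The function name, the vocabulary, and the sample data below are assumptions.

```python
from collections import Counter


def entrainment(words_a, words_b, vocabulary):
    """Illustrative entrainment score for two speakers over a word set.

    words_a, words_b: non-empty lists of the words each speaker produced.
    vocabulary: the high-frequency words to compare (e.g. affirmative cue words).
    Returns minus the summed absolute difference in relative frequency.
    """
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    total_a, total_b = len(words_a), len(words_b)
    return -sum(abs(counts_a[w] / total_a - counts_b[w] / total_b)
                for w in vocabulary)


# Hypothetical word lists for the two speakers in one session.
speaker_a = ["okay", "yeah", "the", "okay"]
speaker_b = ["okay", "the", "the", "the"]
print(entrainment(speaker_a, speaker_b, {"okay", "yeah"}))  # -0.5
```

A per-session score of this kind could then be correlated with game score, overlap and interruption proportions, and switch latency, which is the type of analysis the slide summarizes.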

52 (1) Columbia Games Corpus (2) Study of Turn-Taking (3) Study of Affirmative Cue Words

53 Contributions. Columbia Games Corpus: valuable dataset for studying spontaneous task-oriented dialogue. Study of Turn-Taking: turn-yielding cues; backchannel-inviting cues; objective, automatically computable; combined cues; improve turn-taking decisions of IVR systems.

54 Contributions. Study of Affirmative Cue Words: descriptive statistics and perceptual results; automatic classification; speaker entrainment; understanding and generation in IVR systems. Results drawn from task-oriented dialogues, thus not necessarily generalizable, but suitable for most IVR domains. Necessary steps towards the ambitious, long-term goal of human-like speech systems.

55 Future Work. Additional turn-taking cues: voice quality? Novel ways to combine cues: weights? Study cues that extend over entire turns, increasing near potential turn boundaries. Characterize interruptions. Speaker entrainment: affirmative cue words; turn-taking behavior; acoustic/prosodic variation.
