Challenges in Dialogue Discourse and Dialogue CMSC 35900-1 October 27, 2006.

Challenges in Dialogue Discourse and Dialogue CMSC 35900-1 October 27, 2006

Roadmap Issues in Dialogue –Dialogue vs General Discourse –Dialogue Acts Modeling Recognition and Interpretation –Dialogue Management for Computational Agents

Dialogue vs General Discourse Key contrast: Two or more speakers –Primary focus on speech Issues in multi-party spoken dialogue –Turn-taking – who speaks next, when? –Collaboration – clarification, feedback,… –Disfluencies –Adjacency pairs, dialogue acts

Turn-Taking Multi-party discourse –Need to trade off speaker/hearer roles Interpret reference from sequential utterances When? –End of sentence? No: multi-utterance turns –Silence? No: little silence in smooth dialogue:< 250ms –When other starts speaking? No: relatively little overlap face-to-face: ~5%

Turn-taking: When Rule-governed behavior –Possibly multiple legal turn change times Aka transition-relevance places (TRP) Generally at utterance boundaries –Utterance not necessarily sentence –In fact, utterance/sentence boundaries not obvious in speech »Don’t necessarily pause between sentences Automatic utterance boundary detection –Cue words (okay, so,..); POS sequences; prosody

Turn-taking: Who & How At each TRP in each turn (Sacks 1974) –If speaker has selected A to speak, A must take floor –If speaker has selected no one to speak, anyone can –If no one else takes the turn, the speaker can Selecting speaker A: –By explicit/implicit mention: What about it, Bob? By gaze, function Selecting others: questions, greetings, closing –(Traum et al., 2003)

Turn-taking in HCI Human turn end: – Detected by 250ms silence System turn end: –Signaled by end of speech –Indicated by any human sound Barge-in Continued attention: –No signal

Gesture, Gaze & Voice Range of gestural signals: –head (nod,shake), shoulder, hand, leg, foot movements; facial expressions; postures; artifacts –Align with syllables Units: phonemic clause + change Study with recorded exchanges

Yielding the Floor Turn change signal –Offer floor to auditor/hearer Cues: pitch fall, lengthening, “but uh”, end gesture, amplitude drop+’uh’, end clause Likelihood of change increases with more cues Negated by any gesticulation

Taking the Floor Speaker-state signal –Indicate becoming speaker Occurs at beginning of turns Cues: –Shift in head direction AND/OR –Start of gesture

Retaining the Floor Within-turn signal –Still speaker: Look at hearer as end clause Continuation signal –Still speaker: Look away after within-turn/back Back-channel: –‘mmhm’/okay/etc; nods, sentence completion. Clarification request; restate –NOT a turn: signal attention, agreement, confusion

Segmenting Turns Speaker alone: –Within-turn signal->end of one unit; –Continuation signal -. Beginning of next unit Joint signal: –Speaker turn signal (end); auditor ->speaker; speaker- >auditor –Within-turn + back-channel + continuation Back-channels signal understanding –Early back-channel + continuation

Regaining Attention Gaze & Disfluency –Disfluency: “perturbation” in speech Silent pause, filled pause, restart –Gaze: Conversants don’t stare at each other constantly However, speaker expects to meet hearer’s gaze –Confirm hearer’s attention Disfluency occurs when realize hearer NOT attending –Pause until begin gazing, or to request attention

Improving Human-Computer Turn-taking Identifying cues to turn change and turn start Meeting conversations: –Recorded, natural research meetings –Multi-party –Overlapping speech –Units = “Spurts” between 500ms silence Can predict – on-line – likely turn end

Text + Prosody Text sequence: –Modeled as n-gram language model –Implement as HMM Prosody: –Duration, Pitch, Pause, Energy –Decision trees: classify + probability Integrate LM + DT

Decision Trees A BC DE F G X=tX=f Y>1 Y<=1 Y>2 Y<=2 Disfluency Sentence End None

Interpreting Breaks For each inter-word position: –Is it a disfluency, sentence end, or continuation? Key features: –Pause duration, vowel duration 62% accuracy wrt 50% chance baseline –~90% overall Best combines LM & DT

Jump-in Points (Used) Possible turn changes –Points WITHIN spurt where new speaker starts Key features: –Pause duration, low energy, pitch fall Accuracy: 65% wrt 50% baseline Performance depends only on preceding prosodic features

Jump-in Features Do people speak differently when jump-in? –Differ from regular turn starts? Examine only first words of turns –No LM Key features: –Raised pitch, raised amplitude Accuracy: 77% wrt 50% baseline –Prosody only

Collaborative Communication Speaker tries to establish and add to “common ground” – “mutual belief” –Presumed a joint, collaborative activity Make sure “mutually believe” the same thing –Hearer can acknowledge/accept/disagree »Clark & Schaeffer: Degrees of grounding Display, Demonstrate/Reformulate, Acknowledgement, Next relevant contribution, Continued attention

Computational Models (Traum et al) revised for computation –Involves both speaker and hearer Initiate, Continue, Acknowledge, Repair, Request Repair, etc –Common phenomena “Back-Channel” – “uh-huh”, “okay”, etc –Allows hearer to signal continued attention, ack »WITHOUT taking the turn Requests for repair – common in human-human –Even more common in human-computer dialogue

Implicature & Grice’s Maxims Inferences licensed by utterances Grice’s Maxims –Quantity: Be as informative as required “There are two classes per week” – not 1, or 5 –Quality: Be truthful – don’t lie, –Relevance: Be relevant –Manner: “Be perspicuous” Don’t be obscure, ambiguous, prolix, or disorderly “Flouting” maxims: Consciously violate for effect –Humor, emphasis,

Speech & Dialogue Acts Speech Acts (Austin, Searle) –“Doing things with words” E.g. performatives: “I dub thee Sir Lancelot” –Illocutionary acts: act of asking, answering, promising, etc in saying an utterance Include: Assertives: “I propose to..”, Directives: “Stop that”, Commissives: “I promise”, Expressives: “Thank you”, Declarations: “You’re fired”

Dialogue Acts (aka Conversational moves) –Enriched set of speech acts Capture full range of conversational functions –Adjacency pairs: Many two-part structures E.g. Question-Answer, Greeting-Greeting, Request- Grant, etc… Paired for speaker-hearer dyads –Contrast with rhetorical relations in monologue

DAMSL Dialogue Act Tagging framework –Adjacency pairs+grounding+repair Forward looking functions –Statement, info-request, commit, closing, etc Backward looking functions –Focus on link to prior speaker utterance Agreement, answer, accept, etc..

Tagged Dialogue [assert] C1:... I need to travel in May. [inforeq,ack] A1: And, what day in May did you want to travel? [assert,answer] C2: OK uh I need to be there for a meeting that’s from the 12th to the 15th. [inforeq,ack] A2: And you’re flying into what city? [assert,answer]C3: Seattle. [inforeq,ack] A3: And what time would you like to leave Pittsburgh? [check,hold] C4: Uh hmm I dont think theres many options for nonstop. [accept,ack] A4: Right. [assert] There’s three non-stops today. [info-req] C5: What are they? [assert,open-option] A5: The first one departs PGH at 10:00am arrives Seattle at 12:05 their time. The second flight departs PGH at 5:55pm, arrives Seattle at 8pm. And the last flight departs PGH at 8:15pm arrives Seattle at 10:28pm. [accept,ack] C6: OK Ill take the 5ish flight on the night before on the11th. [check,ack] A6: On the 11th? [assert,ack] OK. Departing at 5:55pm arrives Seattle at 8pm, U.S. Air flight 115. [ack] C7: OK.

Dialogue Act Recognition Goal: Identify dialogue act tag(s) from surface form Challenge: Surface form can be ambiguous –“Can you X?” – yes/no question, or info-request “Flying on the 11t h, at what time?” – check, statement Requires interpretation by hearer –Strategies: Plan inference, cue recognition

Plan-inference-based Classic AI (BDI) planning framework –Model Belief, Knowledge, Desire Formal definition with predicate calculus –Axiomatization of plans and actions as well –STRIPS-style: Preconditions, Effects, Body –Rules for plan inference Elegant, but.. –Labor-intensive rule, KB, heuristic development –Effectively AI-complete

Cue-based Interpretation Employs sets of features to identify –Words and collocations: Please -> request –Prosody: Rising pitch -> yes/no question –Conversational structure: prior act Example: Check: Syntax: tag question “,right?” Syntax + prosody: Fragment with rise N-gram: argmax d P(d)P(W|d) –So you, sounds like, etc Details later ….

From Human to Computer Conversational agents –Systems that (try to) participate in dialogues –Examples: Directory assistance, travel info, weather, restaurant and navigation info Issues: –Limited understanding: ASR errors, interpretation –Computational costs: broader coverage -> slower, less accurate

Dialogue Manager Tradeoffs Flexibility vs Simplicity/Predictability –System vs User vs Mixed Initiative –Order of dialogue interaction –Conversational “naturalness” vs Accuracy –Cost of model construction, generalization, learning, etc Models: FST, Frame-based, HMM, BDI Evaluation frameworks

Challenges in Dialogue Discourse and Dialogue CMSC 35900-1 October 27, 2006.

Similar presentations

Presentation on theme: "Challenges in Dialogue Discourse and Dialogue CMSC 35900-1 October 27, 2006."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Challenges in Dialogue Discourse and Dialogue CMSC 35900-1 October 27, 2006.

Similar presentations

Presentation on theme: "Challenges in Dialogue Discourse and Dialogue CMSC 35900-1 October 27, 2006."— Presentation transcript:

Similar presentations

About project

Feedback