
1 Building & Evaluating Spoken Dialogue Systems
Discourse & Dialogue, CS 359
November 27, 2001

2 Agenda
How to get started
– System bootstrapping: “Wizard-of-Oz” design, its strengths & limitations
How to tell if you succeeded
– System evaluation: what you do & how you do it
– Performance = task success - task cost

3 System Bootstrapping
Question: How should we design a system? What should it be able to understand?
Key: How would people talk to it?
Suggestion 1: Like people talk to each other?
– Collect human-human interactions on the same task
– But computers are NOT like people, and users act differently with them
– Politeness, assumed knowledge, style, complexity all shift
– Speakers adapt to the needs of the hearer
– They balance the need to be understood against the effort of speaking

4 “Wizard-of-Oz” Studies
Suggestion 2: Like people talk to a computer!
– Captures application/domain-specific language
But the system is NOT built yet!
– Simulate the system, mediated through a human wizard
– Fast, rigid/consistent, no small errors/typos
– Structured simulations: automate as much as possible
– E.g. a response editor: hierarchical menus/templates, access to different apps, a query creator, time-stamped logging (toy sketch below)
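A toy sketch of that response-editor idea: the wizard fires a canned template instead of typing free text, which keeps responses fast and consistent. The template names, slots, and log format here are invented for illustration, not taken from the slides.

```python
import time

# Hypothetical canned responses the wizard can select from a menu
TEMPLATES = {
    "confirm_dest": "Flying to {dest}. Is that correct?",
    "ask_day": "What day would you like to travel?",
    "not_understood": "Sorry, I did not understand. Please rephrase.",
}

def wizard_say(template_id, **slots):
    """Render a template and write a time-stamped log line."""
    text = TEMPLATES[template_id].format(**slots)
    print(f"{time.time():.2f}\tSYSTEM\t{text}")  # time-stamped logging
    return text

wizard_say("confirm_dest", dest="Boston")
```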

5 Good Wizard Studies
Requirements:
– Background system: fully implemented or simulated; allows some user initiative
– Task: a somewhat open “scenario”; not too complex or private
– Must be piloted: both the task scenario and the simulation

6 Comparing Styles
Human-human versus human-computer:
– H-H: more complex; H-C: simpler structure
– Variability across domains is greater than variability across individuals
– Differences in vocabulary choice
– Differences in use of anaphora
Question: Should you lie to the user?
– Often the only way to get realistic behavior
– Debrief afterwards: explain the protocol, offer to destroy the data

7 System Evaluation
Question: Which design is better?
Approach 1: Content-based measures
– Task completion
– Concept accuracy (sketch below)
– Reference answer: query result versus key
Limitation: evaluates only one strategy, when many alternatives exist
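As a rough sketch, concept accuracy can be scored by comparing the attribute-value pairs the system extracted against the reference key. The dictionary layout and attribute names below are assumptions for illustration, not from the slides.

```python
def concept_accuracy(hypothesis, key):
    """Fraction of the key's attribute-value pairs the system understood."""
    correct = sum(1 for attr, value in key.items()
                  if hypothesis.get(attr) == value)
    return correct / len(key)

key        = {"dest": "Boston", "day": "Friday", "time": "morning"}
hypothesis = {"dest": "Austin", "day": "Friday", "time": "morning"}
print(concept_accuracy(hypothesis, key))  # 2 of 3 concepts correct -> ~0.67
```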

8 System Evaluation (cont’d)
Not just accuracy, but efficiency
Approach 2: Cost-based measures (sketch below)
– Time to completion: # of utterances, # of turns, duration in seconds
– Error measures: # of corrections, # of repetitions
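A minimal sketch of computing these cost measures from a time-stamped transcript. The log format and the correction heuristic are assumptions made purely for illustration.

```python
# Each entry: (speaker, start_time_in_seconds, text)
log = [
    ("user",   0.0, "I want a flight to Boston"),
    ("system", 2.1, "Flying to Austin. What day?"),
    ("user",   4.0, "No, Boston"),                   # a correction
    ("system", 5.5, "Flying to Boston. What day?"),  # a repetition
    ("user",   7.2, "Friday morning"),
]

n_utterances = len(log)
# A turn boundary falls wherever the speaker changes
n_turns = 1 + sum(prev[0] != cur[0] for prev, cur in zip(log, log[1:]))
duration_seconds = log[-1][1] - log[0][1]
# Crude proxy for corrections: user utterances that begin with "no"
n_corrections = sum(spk == "user" and text.lower().startswith("no")
                    for spk, _, text in log)
print(n_utterances, n_turns, duration_seconds, n_corrections)  # 5 5 7.2 1
```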

9 Combining Measures
Issues:
– Generalization: which factors affect performance?
– Sub-dialogues: evaluate segments, not just the WHOLE task
PARADISE:
– Separates what the agent does from how it does it
– Performance = task success & dialogue costs (formalized below)
– Performance => usability => user satisfaction
– Task success: operationalized as the kappa coefficient (K)
– Costs: efficiency and qualitative measures
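The PARADISE performance function makes this combination explicit, weighing normalized task success against a weighted sum of normalized costs (this is the form given in Walker et al.'s PARADISE framework):

```latex
\mathrm{Performance} = \alpha \cdot \mathcal{N}(\kappa) - \sum_{i=1}^{n} w_i \cdot \mathcal{N}(c_i)
```

Here κ is the task-success kappa (slide 10), the c_i are the cost measures (slide 11), N is z-score normalization, and the weights α and w_i are estimated by regressing user satisfaction on the normalized predictors (slide 12).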

10 Measuring Task Success
AVM: Attribute-Value Matrix
– Captures the information to be exchanged between user & system
– “Key”: the AVM instantiation for a given scenario
Kappa coefficient, calculated from the confusion matrix (computed in the sketch below)
– On-diagonal entries match the key; off-diagonal entries were misunderstood
– K = (P(A) - P(E)) / (1 - P(E))
– P(A): proportion of actual agreement; P(E): proportion of agreement expected by chance
– I.e., actual agreement corrected for chance agreement
Pros: corrects for chance, so results are comparable across tasks
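A minimal sketch of that computation in Python, assuming a confusion matrix whose rows are the key's values and whose columns are the values the system understood. The use of numpy and the example matrix are mine, not from the slides.

```python
import numpy as np

def kappa(confusion):
    """Kappa from a confusion matrix: (actual - chance) / (1 - chance)."""
    total = confusion.sum()
    p_a = np.trace(confusion) / total  # P(A): on-diagonal, observed agreement
    # P(E): chance agreement, from the row and column marginals
    p_e = (confusion.sum(axis=0) * confusion.sum(axis=1)).sum() / total ** 2
    return (p_a - p_e) / (1 - p_e)

# Toy 3-value attribute, mostly understood correctly
m = np.array([[20, 2, 1],
              [3, 15, 2],
              [1, 1, 18]])
print(round(kappa(m), 3))  # ~0.76
```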

11 Measuring Task Costs
Define cost measures:
– E.g. # of utterances, # of repairs
Can compute across sub-dialogues (sketch below):
– Match each segment to its purpose
– Hierarchical structure: link segments to subtasks
– Tag by AVM information goals
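A small sketch of the per-subtask aggregation, assuming each utterance has already been tagged with the AVM goal it serves. The tags and repair flags below are invented.

```python
from collections import Counter

# (AVM goal tag, is_repair) for each utterance in one dialogue
tagged = [("dest", False), ("dest", True), ("dest", False),
          ("day", False), ("time", False), ("time", True)]

utterances_per_goal = Counter(tag for tag, _ in tagged)
repairs_per_goal = Counter(tag for tag, is_repair in tagged if is_repair)
print(utterances_per_goal)  # Counter({'dest': 3, 'time': 2, 'day': 1})
print(repairs_per_goal)     # Counter({'dest': 1, 'time': 1})
```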

12 Estimating the Performance Function
Predicted measure: performance
– User satisfaction rating: 1-6 on a single question, or an average over several questions
Predictor measures: success & costs
– Normalize each to a z-score, to handle their varying scales
– Apply multiple linear regression to compute the weights (sketch below)
Can also calculate for a sub-dialogue: restrict K and the costs to that segment
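A compact sketch of this estimation with numpy: z-score the predictors, then fit ordinary least squares with user satisfaction as the dependent variable. The six data points are fabricated purely so the code runs end to end.

```python
import numpy as np

# Hypothetical per-dialogue data
satisfaction = np.array([5.0, 4.5, 2.0, 3.5, 5.5, 1.5])        # 1-6 ratings
success      = np.array([0.90, 0.80, 0.30, 0.60, 0.95, 0.20])  # kappa
n_utterances = np.array([12, 15, 30, 20, 10, 35])               # a cost measure
n_repairs    = np.array([0, 1, 5, 2, 0, 6])                     # another cost

def z(x):
    return (x - x.mean()) / x.std()  # z-score: zero mean, unit variance

# Normalizing puts all predictors on one scale, so the weights are comparable
X = np.column_stack([np.ones(len(satisfaction)),
                     z(success), z(n_utterances), z(n_repairs)])
weights, *_ = np.linalg.lstsq(X, satisfaction, rcond=None)
print(weights)  # [intercept, weight on success, cost weights...]
```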

13 Evaluation
Applied to multiple tasks
– Travel, reservation/purchase, Circuit-Fix-It
– Define new AVM attributes to match the discourse structure
Compare dialogue strategies
– Explicit vs. implicit confirmation
– System/user/mixed initiative

14 Summary
Building for HCI
– Human-human and human-computer interaction differ
– Acquire vocabulary, structure, and style from a “Wizard-of-Oz” simulation
Evaluating strategies
– Performance = task success - dialogue cost
– Task success: agreement between response & key, with the success level corrected for chance
– Costs: number of repairs, number of utterances

