Chapter 6. Data Collection in a Wizard-of-Oz Experiment, in Reinforcement Learning for Adaptive Dialogue Systems by Rieser & Lemon. Course: Autonomous Machine Learning.


Chapter 6. Data Collection in a Wizard-of-Oz Experiment, in Reinforcement Learning for Adaptive Dialogue Systems, by Rieser & Lemon. Course: Autonomous Machine Learning. Presenter: Hanan Al Rahbi (hanan.alrahbi@snu.ac.kr), Interdisciplinary Program in Cognitive Science, Seoul National University.

The Turk https://youtu.be/RdT4yG8wczQ

Contents
- Introduction
- Experimental Setup
  - Recruited Subjects
  - Experimental Procedure and Task Design
- Noise Simulation
  - Related Work
  - Method
  - Results and Discussion
- Corpus Description
- Analysis
  - Qualitative Measures
  - Subjective Ratings from the User Questionnaires
- Summary and Discussion

What is a WOZ (Wizard-of-Oz) study?
- A research method in which a human being simulates the intelligent behavior of a machine, before a fully working system can be built.
What is it used for?
- Collecting initial data before a system is designed.
- Eliciting behavior that is more intelligent than current machines can produce on their own.

Why is it important?
- Allows data-driven development for domains where no working prototype is available.
- Helps create effective policies with the highest rewards.
- Saves effort and time.

The Experiment

Experimental Setup: A multimodal Wizard-of-Oz data collection setup for an in-car music player application.

Recruited Subjects: Wizards & Users
This experiment focuses on the behavior of users and wizards.
- Wizards: quota 5 (2F, 1M); age 20~35; language: German (native), English (good); no experience with dialogue systems.
- Users: quota 21 (11F, 10M); age 20~30; field of study: Social Science 23.8%, Languages 23.8%, Natural Sciences 28.6%, Arts 17%.

Experimental Procedure and Task Design
- Wizards were trained (database, interaction with users).
- User and wizard were placed in separate rooms.
- The user received a sheet of instructions upon arrival.
- The user was introduced to the driving simulator (and tested on it).
- Users could solve the tasks in any order they preferred.
- After each task the user filled in a task-specific questionnaire.
- The user was interviewed by the experiment leader.

Screen output options available to the wizard (see the sketch below):
- A simple text message conveying how many results were found.
- A list of just the names (album, song, or artist).
- A table of complete search results.
- A table of complete search results, but only displaying a subset of columns.
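To make this option set concrete, here is a minimal sketch in Python (not from the chapter) of how these four screen-output actions could be represented for a presentation policy to choose among; the identifiers are illustrative assumptions, not the book's own names.

from enum import Enum

class ScreenOutput(Enum):
    # Illustrative labels for the four screen-output options listed above;
    # the names are assumptions made for this sketch.
    TEXT_SUMMARY = "text message with the number of results"
    NAME_LIST = "list of names only (album, song, or artist)"
    FULL_TABLE = "table of complete search results"
    PARTIAL_TABLE = "table restricted to a subset of columns"

# A learned presentation policy would pick one such action per system turn,
# e.g. (hypothetically) policy.choose(state) -> ScreenOutput.PARTIAL_TABLE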

Experimental Procedure and Task Design
- 10 task sets were designed.
- Every task set was used at least twice.
- Each set contains 4 tasks of 2 different types:
  - Search for a specific title/album.
  - Build a playlist.

Noise Simulation: HCI vs. WOZ
Related work: Skantze (2003, 2005); Stuttle et al. (2004); Williams and Young (2004a).
Even with high noise levels, wizards are able to interpret the ASR output well and to assimilate contextual knowledge about which user actions are likely to follow.

Noise Simulation: Method
- To approximate speech recognition errors, a tool was used to randomly delete parts of the transcribed user utterances.
- Wizards also build up their own hypotheses about what the user really said (misunderstandings).
- The word deletion rate varied:
  - 20% of the utterances were weakly corrupted (deletion rate of 20%).
  - 20% were strongly corrupted (deletion rate of 50%).
  - In 60% of the cases the wizard saw the transcribed speech uncorrupted.
(A minimal sketch of such a deletion scheme is given below.)
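The following is a minimal sketch, in Python, of what such a word-deletion noise simulator could look like. It only assumes the rates quoted on the slide; the actual tool used in the experiment is not described here, so the function names and the way the 60/20/20 split is applied per utterance are assumptions.

import random

def corrupt_transcript(utterance, deletion_rate, rng=random):
    # Drop each word independently with probability `deletion_rate`,
    # mimicking ASR deletions on the transcribed user utterance.
    words = utterance.split()
    kept = [w for w in words if rng.random() >= deletion_rate]
    return " ".join(kept)

def simulate_noise(utterance, rng=random):
    # Per-utterance corruption levels as stated on the slide:
    # 60% uncorrupted, 20% weakly corrupted (20% deletions),
    # 20% strongly corrupted (50% deletions).
    draw = rng.random()
    if draw < 0.60:
        return utterance
    elif draw < 0.80:
        return corrupt_transcript(utterance, 0.20, rng)
    else:
        return corrupt_transcript(utterance, 0.50, rng)

# Example: simulate_noise("please play something by the beatles")
# might return "please play something the beatles".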

Noise Simulation

Results and Discussion
- 30% of the corrupted utterances had a noticeable effect on the interaction.
- 7% of all user turns led to a communication error (much lower than the current word error rate for spoken dialogue systems, around 30%).
- On the other hand, the error rate is higher than for human-human communication.

Results and Discussion: Shortcomings of the deletion method
- Deleting words is a rather crude simulation of real-world acoustic problems (though justified for this purpose).
- A time delay is introduced by transcribing the utterances (of both user and wizard).
- The method is not suitable for studying detailed error handling; however, it is sufficient for studying natural presentation strategies in the presence of noise.

Corpus Description
- 21 sessions, containing 72 dialogues with about 1,600 turns, were gathered.
- Data for each session include video and audio recordings, questionnaire data, transcripts, and a log file.
- The logging information per session consists of OAA (Open Agent Architecture) messages in chronological order.
- The corpus is marked up and annotated using the NITE XML Toolkit (NXT).

Analysis: Qualitative Measures
Results of the corpus analysis for multimodal presentation strategies:
- 22.3% of the 793 wizard turns were annotated as presentation strategies (793 x 22.3% ≈ 177 instances for learning).
- 48% used screen output (78.6% the table option, 17% the list, 0.04% text only).
- Verbal presentations contained only 1.6 items on average; here the wizard summarized the results by presenting the options for the most distinctive feature to the user.
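As an illustration only (not the chapter's code), statistics like these can be derived by counting annotation labels over the wizard turns; the label names below are assumptions mirroring the categories on the slide.

from collections import Counter

def presentation_stats(turn_labels):
    # turn_labels: one label per wizard turn, e.g. "table", "list", "text",
    # "verbal", or None for turns that are not presentation strategies.
    total_turns = len(turn_labels)
    presentations = [lab for lab in turn_labels if lab is not None]
    counts = Counter(presentations)
    share_of_turns = len(presentations) / total_turns   # e.g. ~22.3% of 793 turns
    modality_shares = {lab: n / len(presentations) for lab, n in counts.items()}
    return share_of_turns, modality_shares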

Analysis: Did the wizards apply significantly different strategies?
- It is important to compare, since the data will be used for learning.
- Dialogue length is about the same, with only slight differences between wizards.
- Most wizards were equally successful in completing tasks; only one was better, with 100% task success, while another scored 78%.
- Therefore we can say they applied similar strategies (this does not mean they react in the same way).
- However, the multimodal behavior of the wizards is very limited: only 3 users selected an item by clicking.
(A sketch of how such a between-wizard comparison could be tested statistically follows below.)
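As an illustration only (not reported in the chapter), whether task success differs across wizards could be tested with a chi-square test on a wizards-by-outcome contingency table; the counts below are made-up placeholders, not the experiment's actual figures.

from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = wizards, columns = (tasks succeeded, tasks failed).
observed = [
    [14, 0],   # placeholder for the wizard with 100% task success
    [11, 3],   # placeholder for the wizard with ~78% task success
    [12, 2],
    [13, 1],
    [12, 2],
]

chi2, p_value, dof, expected = chi2_contingency(observed)
if p_value < 0.05:
    print("Task success differs significantly between wizards")
else:
    print("No significant difference in task success between wizards")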

Analysis Subjective Ratings from the User Questionnaire

Analysis Subjective Ratings from the User Questionnaire

Discussion
Common mistakes: the wizards either
- displayed too much information on the screen, or
- failed to present results early enough.
Implications:
- Screen output should display an appropriate amount of information.
- There is a need for a strategy that decides how many database search results to present to the user, when to present them, and which modality to use, in an adaptive and optimal manner (see the sketch below).
- Also needed is a strategy that helps minimize the large lists displayed, cuts the length of the dialogue, and reduces the effect of noise.
- Including information about the user's driving performance is very important.
- There should be a better and more realistic in-car simulation (e.g. the screen size).
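To illustrate the kind of strategy called for above, here is a minimal rule-based sketch in Python; the thresholds and names are assumptions standing in for the adaptive policy that the book later learns with reinforcement learning, not a reproduction of it.

def choose_presentation(num_results, driving_load_high):
    # Toy heuristic for deciding how many results to present and in which
    # modality; the thresholds are illustrative, not learned values.
    if num_results == 0:
        return ("verbal", "report that nothing was found")
    if driving_load_high:
        # Under high driving load, avoid screen output and keep it short.
        return ("verbal", f"summarise the top result of {num_results}")
    if num_results <= 3:
        return ("verbal", f"list all {num_results} items")
    if num_results <= 15:
        return ("screen", "show a table restricted to the most distinctive columns")
    return ("verbal", "ask the user to refine the query before presenting results")

# Example: choose_presentation(8, driving_load_high=False)
# -> ("screen", "show a table restricted to the most distinctive columns")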