Modeling User Satisfaction and Student Learning in a Spoken Dialogue Tutoring System with Generic, Tutoring, and User Affect Parameters Kate Forbes-Riley and Diane Litman University of Pittsburgh

Outline
• Overview
• PARADISE
• System and Corpora
• Interaction Parameters
• Prediction Models
• Conclusions and Future Work

Overview
• Goals:
  • PARADISE: model performance in our spoken dialogue tutoring system in terms of interaction parameters
  • Focus design efforts on improving the parameters that predict better performance for future users
  • Use the model to predict simulated users' performance as different system versions are designed

Overview
• What is Performance in our spoken dialogue tutoring system?
  • User Satisfaction: the primary metric for many spoken dialogue systems, e.g. travel planning (measured by user surveys). Hypothesis: less useful here.
  • Student Learning: the primary metric for tutoring systems (measured by student pre/post tests). Hypothesis: more useful here.

Overview
• What Interaction Parameters for our spoken dialogue tutoring system?
  • Spoken Dialogue System-Generic (e.g. time): shown useful in non-tutoring PARADISE applications modeling User Satisfaction
  • Tutoring-Specific (e.g. correctness). Hypothesis: task-specific parameters impact performance.
  • User Affect (e.g. uncertainty). Hypothesis: affect impacts performance, for generic applications too.

Overview
• Are the resulting Performance Models useful?
  • Generic and Tutoring parameters yield useful Student Learning models
  • Affect parameters increase their usefulness
  • Generic and Tutoring parameters yield less useful User Satisfaction models than in prior non-tutoring applications
    • (Bonneau-Maynard et al., 2000), (Walker et al., 2002), (Möller, 2005) obtained better models with generic parameters only
  • Too little data to include Affect parameters in the User Satisfaction models

PARADISE Framework (Walker et al., 1997)
• Measure parameters (interaction costs and benefits) and performance in a system corpus
• Train a model via multiple linear regression (MLR) over the parameters to predict performance (R² = variance predicted):

  System Performance = Σ_{i=1}^{n} w_i * p_i

• SPSS stepwise MLR determines which parameters are included (the most correlated parameter is added until R² no longer improves or the model is non-significant)
• Test model usefulness (generalization) on a new corpus (R²)
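For readers who want to experiment, a minimal sketch of this procedure in Python (with numpy): greedy forward selection by R² improvement over a least-squares fit, followed by a generalization test on held-out data. The data, parameter counts, and stopping threshold are all invented, and this only approximates SPSS's stepwise criteria (which also apply significance tests).

```python
import numpy as np

def fit_r2(X, y):
    """Least-squares fit with intercept; returns weights and R^2 on (X, y)."""
    Xb = np.column_stack([np.ones(len(y)), X])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    resid = y - Xb @ w
    return w, 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def forward_stepwise(X, y, min_gain=0.01):
    """Greedy forward selection: repeatedly add the parameter that most improves R^2."""
    selected, best_r2 = [], 0.0
    while len(selected) < X.shape[1]:
        gain, j = max(
            (fit_r2(X[:, selected + [j]], y)[1] - best_r2, j)
            for j in range(X.shape[1]) if j not in selected
        )
        if gain < min_gain:          # stop once R^2 no longer improves enough
            break
        selected.append(j)
        best_r2 += gain
    return selected, *fit_r2(X[:, selected], y)

# Hypothetical per-student data: rows = students, columns = interaction parameters.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(20, 5)), rng.normal(size=20)
chosen, w, r2_train = forward_stepwise(X_train, y_train)

# Generalization test: apply the trained weights to a new corpus.
X_test, y_test = rng.normal(size=(29, 5)), rng.normal(size=29)
pred = np.column_stack([np.ones(len(y_test)), X_test[:, chosen]]) @ w
r2_test = 1 - ((y_test - pred) ** 2).sum() / ((y_test - y_test.mean()) ** 2).sum()
print(chosen, round(r2_train, 3), round(r2_test, 3))
```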

ITSPOKE: a speech front-end for the text-based Why2-Atlas qualitative physics tutor (VanLehn et al., 2002)

Sphinx2 speech recognizer; Why2-Atlas performs natural language processing on the transcribed student speech

3 ITSPOKE Corpora
• Synthesized voice: Cepstral text-to-speech system
• Pre-recorded voice: paid voice talent

Corpus   #Students   #with Tests   #with Survey   #with Affect
SYN03    …           …             …              …
PR05     …           …             …              …
SYN05    29          …             …              0

Experimental Procedure
• Subjects without college physics:
  • Read a short background document
  • Took a pretest
  • Worked 5 problems (dialogues) with ITSPOKE
  • Took a posttest

User Satisfaction Survey (after the posttest)
• Task Ease: "It was easy to learn from the tutor."
• TTS Performance: "It was easy to understand the tutor."
• User Expertise: "I knew what I could say or do at each point in the conversations with the tutor."
• Expected Behavior: "The tutor worked the way I expected it to."
• Future Use: "Based on my experience using the tutor to learn physics, I would like to use such a tutor regularly."
Scale: Almost Always (5), Often (4), Sometimes (3), Rarely (2), Almost Never (1)

User Affect Annotation
• Each student turn labeled with 1 Certainness state: certain, uncertain, mixed, neutral
• Each student turn labeled with 1 Frustration state: frustrated, non-frustrated

SYN03 Annotated Excerpt
T: What is the NET force acting on the truck equal to?
S: I don't know (ASR: I don't know) [Uncertain, Frustrated, Incorrect]
T: Since net force is defined as the vector sum of all the forces exerted on an object, let's first find the forces acting on the truck. Try to name them.
S: The impact force, the normal force, and gravity (ASR: the impact force but normal force and gravity) [Certain, Non-Frustrated, Correct]
T: Yep.

Interaction Parameters
• Prior PARADISE applications:
  • Each dialogue between user and system represents a single task (e.g., booking airline travel)
  • Parameters calculated on a per-dialogue basis
• Our tutoring application:
  • The entire tutoring session (5 dialogues) between student and ITSPOKE represents a single task
  • Parameters calculated on a per-student basis

13 Dialogue System-Generic Parameters
• Most from prior PARADISE applications (Möller, 2005), (Walker et al., 2002), (Bonneau-Maynard et al., 2000)
• Time on Task
• Total ITSPOKE Turns, Total Student Turns
• Total ITSPOKE Words, Total Student Words
• Ave. ITSPOKE Words/Turn, Ave. Student Words/Turn
• Word Error Rate, Concept Accuracy
• Total Timeouts, Total Rejections
• Ratio of Student Words to ITSPOKE Words
• Ratio of Student Turns to ITSPOKE Turns
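A sketch of how such parameters might be derived per student from logged turns; the Turn record and field names are hypothetical, and Word Error Rate / Concept Accuracy are omitted since they require ASR alignments.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str   # "tutor" or "student"
    words: int     # word count of the turn

def generic_parameters(turns, time_on_task):
    """Compute several system-generic parameters on a per-student basis."""
    tutor = [t for t in turns if t.speaker == "tutor"]
    student = [t for t in turns if t.speaker == "student"]
    p = {
        "time_on_task": time_on_task,
        "total_tutor_turns": len(tutor),
        "total_student_turns": len(student),
        "total_tutor_words": sum(t.words for t in tutor),
        "total_student_words": sum(t.words for t in student),
    }
    # Averages and ratios, guarded against empty sessions.
    p["avg_tutor_words_per_turn"] = p["total_tutor_words"] / max(1, p["total_tutor_turns"])
    p["avg_student_words_per_turn"] = p["total_student_words"] / max(1, p["total_student_turns"])
    p["student_to_tutor_word_ratio"] = p["total_student_words"] / max(1, p["total_tutor_words"])
    p["student_to_tutor_turn_ratio"] = p["total_student_turns"] / max(1, p["total_tutor_turns"])
    return p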

12 Tutoring-Specific Parameters
• 9 parameters related to Correctness of the student turn:
  • ITSPOKE labels: Correct, Incorrect, Partially Correct
  • Total and Percent for each label
  • Ratio of each label to every other label
• Total number of essays per student
• Student pretest and posttest scores (for the User Satisfaction models)
• Similar parameters are available in most tutoring systems

25 User Affect Parameters
• For each of our 4 Certainness labels:
  • Total, Percent, and Ratio to each other label
  • Total for each sequence of identical labels (e.g. Certain:Certain)
• For each of our 2 Frustration labels:
  • Total, Percent, and Ratio to each other label
  • Total for each sequence of identical labels
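The arithmetic checks out: 4 certainness labels give 4 totals + 4 percents + 6 pairwise ratios + 4 repetition counts = 18 parameters, and 2 frustration labels give 2 + 2 + 1 + 2 = 7, for 25 in all. A sketch of deriving them from a per-turn label sequence; the data is made up, and "total for each sequence of identical labels" is counted here as adjacent same-label pairs, one plausible reading of the slide.

```python
from itertools import combinations

def affect_parameters(labels, label_set):
    """Totals, percents, pairwise ratios, and identical-label bigram counts."""
    n = len(labels)
    p = {}
    for lab in label_set:
        p[f"total_{lab}"] = labels.count(lab)
        p[f"percent_{lab}"] = labels.count(lab) / n if n else 0.0
    for a, b in combinations(label_set, 2):
        # Ratio of each label to every other label, guarded against zero counts.
        p[f"ratio_{a}_to_{b}"] = p[f"total_{a}"] / max(1, p[f"total_{b}"])
    for lab in label_set:
        # Count adjacent repetitions of the same label, e.g. Certain:Certain.
        p[f"total_{lab}:{lab}"] = sum(
            1 for x, y in zip(labels, labels[1:]) if x == y == lab
        )
    return p

certainness = ["uncertain", "certain", "certain", "neutral", "certain"]
params = affect_parameters(certainness, ["certain", "uncertain", "mixed", "neutral"])
print(params["total_certain"], params["total_certain:certain"])  # 3, 1
```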

User Satisfaction Prediction Models
• Predicted variable: Total Survey Score
  • Scored out of 25 (five 5-point statements); no difference across corpora (p = .46)
• Input parameters: Generic and Tutoring
• Do models generalize across corpora (system versions)?
  • Train on PR05 → Test on SYN05
  • Train on SYN05 → Test on PR05
• Do models generalize better within corpora?
  • Train on one half of PR05 → Test on the other half (for each half)
  • Train on one half of SYN05 → Test on the other half (for each half)
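The two generalization checks, sketched with ordinary least squares in Python; the corpus sizes and feature counts are invented, and a real replication would use the stepwise selection sketched earlier rather than fitting all parameters.

```python
import numpy as np

def fit(X, y):
    """Ordinary least squares with intercept; returns the weight vector."""
    return np.linalg.lstsq(np.column_stack([np.ones(len(y)), X]), y, rcond=None)[0]

def r2(w, X, y):
    """R^2 of weights w (trained elsewhere) on a possibly different corpus."""
    pred = np.column_stack([np.ones(len(y)), X]) @ w
    return 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()

rng = np.random.default_rng(1)
X_pr05, y_pr05 = rng.normal(size=(20, 4)), rng.normal(size=20)    # invented sizes
X_syn05, y_syn05 = rng.normal(size=(29, 4)), rng.normal(size=29)

# Inter-corpus: train on one system version, test on the other.
print(r2(fit(X_pr05, y_pr05), X_syn05, y_syn05))
print(r2(fit(X_syn05, y_syn05), X_pr05, y_pr05))

# Intra-corpus: train on each half of a corpus, test on the other half.
h = len(y_pr05) // 2
print(r2(fit(X_pr05[:h], y_pr05[:h]), X_pr05[h:], y_pr05[h:]))
print(r2(fit(X_pr05[h:], y_pr05[h:]), X_pr05[:h], y_pr05[:h]))
```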

User Satisfaction Prediction Models
• Best results (on test data):

Train          R²    Ranked Predictors                                  Test           R²
SYN05          .068  ITSPOKE Words/Turn                                 PR05           …
SYN05 half #2  .685  ITSPOKE Words/Turn, Student Words/Turn, #Correct   SYN05 half #1  .227

• Inter-corpus models are weak and don't generalize well
• Intra-corpus models generalize better, but are still weak predictors of User Satisfaction
• Generic and Tutoring parameters were selected

User Satisfaction Prediction Models
• Comparison to prior work:
  • Some of the same parameters were also selected as predictors, e.g. User Words/Turn in (Walker et al., 2002)
  • Higher best test results (R² = .3-.5) in (Möller, 2005), (Walker et al., 2002), and (Bonneau-Maynard et al., 2000)

Student Learning Prediction Models
• First experiments:
  • Data and input parameters: same as for the User Satisfaction experiments
  • Predicted variable: Posttest controlled for Pretest (learning gains); significant learning independently of corpus (p < .001)

Train          R²    Ranked Predictors             Test           R²
PR05           .556  Pretest, %Correct             SYN05          …
SYN05 half #1  .580  Pretest, Student Words/Turn   SYN05 half #2  .556
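Concretely, since Pretest itself appears among the ranked predictors, "controlled for Pretest" corresponds to a model of the form

  Posttest = β_0 + β_pre * Pretest + Σ_{i=1}^{n} w_i * p_i

so the interaction parameters p_i are credited only with posttest variance beyond what incoming knowledge explains.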

Student Learning Prediction Models
• First experiments (best results on test data in the table above):
  • All models account for ~50% of posttest variance in train and test data
  • Intra-corpus models don't show higher generalizability
  • Generic and Tutoring parameters were selected

Student Learning Prediction Models
• Further experiments:
  • Including a third corpus (SYN03) with Generic and Tutoring parameters yields similar results
  • Best result (on test data):

Train        R²    Ranked Predictors   Test    R²
PR05+SYN03   .413  Pretest, Time       SYN05   .586

Student Learning Prediction Models
• Further experiments: including User Affect parameters can improve results:

Train   R²    Ranked Predictors          Test   R²
SYN03   .644  Time, Pretest, #Neutrals   PR05   …

  Posttest = .86 * Time + .65 * Pretest - .54 * #Neutrals

• The same experiment without User Affect parameters:

Train   R²    Ranked Predictors   Test   R²
SYN03   .478  Pretest, Time       PR05   …
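PARADISE regressions are typically run over normalized (z-scored) measures, so applying the SYN03 model above to a new student would look roughly like this toy sketch; all raw values, means, and standard deviations below are invented.

```python
# Hypothetical raw values for one student, with invented corpus statistics.
raw  = {"time": 55.0, "pretest": 0.42, "neutrals": 180}
mean = {"time": 48.0, "pretest": 0.50, "neutrals": 200}
std  = {"time": 10.0, "pretest": 0.15, "neutrals": 40}

# z-score each predictor, then apply the learned standardized weights.
z = {k: (raw[k] - mean[k]) / std[k] for k in raw}
posttest_z = 0.86 * z["time"] + 0.65 * z["pretest"] - 0.54 * z["neutrals"]
print(round(posttest_z, 3))   # predicted posttest, in standard-deviation units
```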

Summary: Student Learning Models
• This method of developing a Student Learning model is useful for our tutoring application
• User Affect parameters can increase the usefulness of Student Learning models

Summary: User Satisfaction Models
• This method of developing a User Satisfaction model is less useful for our tutoring application than it was for prior non-tutoring applications
• Why are our User Satisfaction models less useful?
  • A per-student measure of User Satisfaction is not fine-grained enough
  • Tutoring systems are not designed to maximize User Satisfaction; the goal is to maximize Student Learning

Conclusions
• For the tutoring community:
  • PARADISE provides an effective method of extending single Student Learning correlations
• For the spoken dialogue community, when using PARADISE:
  • Other performance metrics may be more useful for applications not optimized for User Satisfaction
  • Task-specific and user affect parameters may be useful

Future Work
• Investigate the usefulness of additional input parameters for predicting Student Learning and User Satisfaction:
  • User Affect annotations (once complete)
  • Tutoring Dialogue Acts (e.g. Möller, 2005; Litman and Forbes-Riley, 2006)
  • Discourse Structure annotations (Rotaru and Litman, 2006)

Thank You! Questions? Further information:

Architecture (diagram): the student's web browser talks to a web server, which connects to the Java-based ITSPOKE text and spoken dialogue managers; these coordinate speech analysis (Sphinx), speech synthesis (Cepstral), essay analysis (Carmel, Tacitus-lite+), dialogue repair goals, and the Why2 content dialogue manager (Ape, Carmel), exchanging student text and tutor turns as XML.

User Satisfaction Prediction Models

Train         R²    Predictors        Test          R²
PR05          .274  INC, Essays       SYN05         …
SYN05         .068  Twpt              PR05          …
PR05:half1    .335  PCOR/INC          PR05:half2    .137
PR05:half2    .443  Strns             PR05:half1    .079
SYN05:half1   .455  Strn/Ttrn         SYN05:half2   .051
SYN05:half2   .685  Twpt, Swpt, COR   SYN05:half1   .227

(Abbreviations: Twpt/Swpt = ITSPOKE/Student Words per Turn; Strns/Ttrns = Student/ITSPOKE Turns; COR/INC/PCOR = Correct/Incorrect/Partially Correct; Pre = Pretest; Touts = Timeouts.)

Student Learning Prediction Models
• Posttest score controlled for Pretest score is the target predicted variable (learning gains)
• System-Generic and Tutoring-Specific parameters available as predictors (same corpora as for the User Satisfaction models)

Train         R²    Predictors           Test          R²
PR05          .556  Pre, %COR            SYN05         …
SYN05         .736  Pre, INC/COR, Swpt   PR05          …
PR05:half1    .840  Pre, PCOR            PR05:half2    .128
PR05:half2    .575  PCOR/INC, Pre        PR05:half1    .485
SYN05:half1   .580  Pre, Swpt            SYN05:half2   .556
SYN05:half2   .855  Pre, Touts           SYN05:half1   .384

Student Learning Prediction Models
• Further experiments:
  • Including the SYN03 corpus yields similar results (same parameters as the prior experiments, different datasets)
  • Training on the corpus most similar to the test corpus yields the highest generalizability

Train         R²    Predictors                 Test    R²
PR05+SYN03    .413  Pre, Time                  SYN05   …
PR05+SYN05    .621  Pre, INC/COR               SYN03   …
SYN05+SYN03   .590  INC/COR, %INC, Pre, Time   PR05    .244

Student Learning Prediction Models
• Including User Affect parameters can increase the generalizability of Student Learning models (more parameters than the prior experiments, different datasets)

Train                R²    Predictors           Test      R²
SYN03 (affect)       .644  Time, Pre, NEU       PR05:17   …
PR05:17 (affect)     .835  Pre, NFA:NFA, Swpt   SYN03     …
SYN03 (no affect)    .478  Pre, Time            PR05:17   …
PR05:17 (no affect)  .609  Pre, Strns/Ttrns     SYN03     .164

Student Learning Prediction Models
• Further experiments:
  • Including a third corpus (SYN03) with the same Generic and Tutoring-Specific parameters yields similar results
  • The training set most similar to the test set yields the highest generalizability

Train        R²    Predictors      Test    R²
PR05+SYN03   .413  Pretest, Time   SYN05   .586

User Satisfaction Prediction Models
• Comparison to prior work:
  • Some of the same parameters were also selected as predictors, e.g. User Words/Turn in (Walker et al., 2002)
  • Higher best test results (R² = .3-.5) in (Möller, 2005), (Walker et al., 2002), and (Bonneau-Maynard et al., 2000)
  • Similar sensitivity to changes in training data in (Möller, 2005) and (Walker et al., 2000)

Student Learning Prediction Models
• First experiments (best results on test data in the table):
  • All models account for ~50% of posttest variance in train and test data; less sensitive to changes in training data
  • Intra-corpus models don't have higher generalizability
  • Generic and Tutoring parameters were selected

Train          R²    Predictors                    Test           R²
PR05           .556  Pretest, %Correct             SYN05          …
SYN05 half #1  .580  Pretest, Student Words/Turn   SYN05 half #2  .556