
Modeling User Satisfaction and Student Learning in a Spoken Dialogue Tutoring System with Generic, Tutoring, and User Affect Parameters
Kate Forbes-Riley and Diane Litman
University of Pittsburgh

Outline
- Overview
- PARADISE
- System and Corpora
- Interaction Parameters
- Prediction Models
- Conclusions and Future Work

Overview
- Goals:
  - PARADISE: model performance in our spoken dialogue tutoring system in terms of interaction parameters
  - Focus design efforts on improving the parameters that predict better performance for future users
  - Use the model to predict simulated user performance as different system versions are designed

Overview
- What is Performance in our spoken dialogue tutoring system?
  - User Satisfaction: primary metric for many spoken dialogue systems, e.g. travel-planning (user surveys)
    - Hypothesis: less useful
  - Student Learning: primary metric for tutoring systems (student pre/post tests)
    - Hypothesis: more useful

Overview
- What Interaction Parameters for our spoken dialogue tutoring system?
  - Spoken Dialogue System-Generic (e.g. time): shown useful in non-tutoring PARADISE applications modeling User Satisfaction
  - Tutoring-Specific (e.g. correctness)
    - Hypothesis: task-specific parameters impact performance
  - User Affect (e.g. uncertainty)
    - Hypothesis: affect impacts performance (and is generic too)

Overview
- Are the resulting Performance Models useful?
  - Generic and Tutoring parameters yield useful Student Learning models
    - Affect parameters increase usefulness
  - Generic and Tutoring parameters yield less useful User Satisfaction models than prior non-tutoring applications
    - (Bonneau-Maynard et al., 2000), (Walker et al., 2002), (Möller, 2005): better models with generic parameters only
    - Too little data to include Affect parameters

PARADISE Framework (Walker et al., 1997)
- Measure parameters (interaction costs and benefits) and performance in a system corpus
- Train a model via multiple linear regression (MLR) over the parameters to predict performance (R² = variance predicted)
  - SPSS stepwise MLR determines parameter inclusion (most correlated parameters added until no better R² or a non-significant model)
  - System Performance = \sum_{i=1}^{n} w_i * p_i
- Test model usefulness (generalization) on a new corpus (R²)
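The train-then-test procedure on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' SPSS setup: the function names are hypothetical, the fit includes an intercept `w[0]` that the slide's formula does not show, and the stopping rule is a simple R² gain threshold standing in for SPSS's significance-based stepwise criterion.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_mlr(rows, y):
    """Least-squares weights [intercept, w_1, ..., w_n] from the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * t for x, t in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(w, row):
    """Performance = w_0 + sum_i w_i * p_i for one student's parameters."""
    return w[0] + sum(wi * pi for wi, pi in zip(w[1:], row))


def r_squared(w, rows, y):
    """1 - SS_res / SS_tot; can be computed on training or held-out data."""
    mean_y = sum(y) / len(y)
    ss_res = sum((t - predict(w, r)) ** 2 for r, t in zip(rows, y))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1 - ss_res / ss_tot


def forward_select(rows, y, min_gain=0.01):
    """Greedy forward selection: add the parameter that raises training R^2
    the most; stop when the gain falls below min_gain (an assumed threshold)."""
    chosen, best = [], 0.0
    remaining = list(range(len(rows[0])))
    while remaining:
        scored = []
        for j in remaining:
            cols = chosen + [j]
            sub = [[r[c] for c in cols] for r in rows]
            scored.append((r_squared(fit_mlr(sub, y), sub, y), j))
        score, j = max(scored)
        if score - best < min_gain:
            break
        chosen.append(j)
        remaining.remove(j)
        best = score
    return chosen, best
```

To test generalization in the sense of the last bullet, fit the weights on one corpus and call `r_squared` with those weights on the parameter rows and performance scores of a different corpus. With this definition, R² on held-out data can be much lower than on training data, and even negative.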

Speech front-end for text-based Why2-Atlas (VanLehn et al., 2002) Qualitative Physics Tutor

Sphinx2 speech recognizer - Why2-Atlas performs NLP on transcript

3 ITSPOKE Corpora
- Synthesized voice: Cepstral text-to-speech system
- Pre-Recorded voice: paid voice talent

Columns: Corpus, #Students, #with Tests, #with Survey, #with Affect
SYN03: 20, 0
PR05: 28, 17
SYN05: 29, 0

Experimental Procedure
- Subjects without college physics:
  - Read a small background document
  - Took a pretest
  - Worked 5 problems (dialogues) with ITSPOKE
  - Took a posttest

User Satisfaction Survey after Posttest
- Task Ease: It was easy to learn from the tutor.
- TTS Performance: It was easy to understand the tutor.
- User Expertise: I knew what I could say or do at each point in the conversations with the tutor.
- Expected Behavior: The tutor worked the way I expected it to.
- Future Use: Based on my experience using the tutor to learn physics, I would like to use such a tutor regularly.
ALMOST ALWAYS (5), OFTEN (4), SOMETIMES (3), RARELY (2), ALMOST NEVER (1)
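As a small illustration of how the five ratings could be combined into one score per student, the sketch below sums them. Summing is an assumption (it yields a possible range of 5 to 25), and the names `SCALE`, `QUESTIONS`, and `total_survey_score` are hypothetical.

```python
# Rating scale and question names as listed on the slide.
SCALE = {
    "ALMOST ALWAYS": 5,
    "OFTEN": 4,
    "SOMETIMES": 3,
    "RARELY": 2,
    "ALMOST NEVER": 1,
}
QUESTIONS = (
    "Task Ease",
    "TTS Performance",
    "User Expertise",
    "Expected Behavior",
    "Future Use",
)


def total_survey_score(answers):
    """Sum the five 1-5 ratings for one student (assumed aggregation)."""
    return sum(SCALE[answers[q]] for q in QUESTIONS)
```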

User Affect Annotation
- Each student turn labeled with 1 Certainness State:
  - certain, uncertain, mixed, neutral
- Each student turn labeled with 1 Frustration State:
  - frustrated, non-frustrated

SYN03 Annotated Excerpt
T: What is the NET force acting on the truck equal to?
S: I don’t know (ASR: I don’t know) [Uncertain, Frustrated, Incorrect]
T: Since net force is defined as the vector sum of all the forces exerted on an object, let's first find the forces acting on the truck. Try to name them.
S: The impact force, the normal force, and gravity (ASR: the impact force but normal force and gravity) [Certain, Non-Frustrated, Correct]
T: Yep.

Interaction Parameters
- Prior PARADISE applications
  - Each dialogue between user and system represents a single task (e.g., booking airline travel)
  - Parameters calculated on a per-dialogue basis
- Our tutoring application
  - Entire tutoring session (5 dialogues) between student and ITSPOKE represents a single task
  - Parameters calculated on a per-student basis

13 Dialogue System-Generic Parameters
- Most from prior PARADISE applications (Möller, 2005), (Walker et al., 2002), (Bonneau-Maynard et al., 2000)
  - Time on Task
  - Total ITSPOKE Turns, Total Student Turns
  - Total ITSPOKE Words, Total Student Words
  - Ave. ITSPOKE Words/Turn, Ave. Student Words/Turn
  - Word Error Rate, Concept Accuracy
  - Total Timeouts, Total Rejections
  - Ratio of Student Words to ITSPOKE Words
  - Ratio of Student Turns to ITSPOKE Turns
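Several of these parameters are simple counts over a student's turns. The sketch below shows one way to compute the turn and word parameters and a standard word error rate (word-level edit distance divided by reference length). The turn representation, the tokenizer, and all function names are assumptions for illustration; Time on Task, Concept Accuracy, Timeouts, and Rejections would come from system logs and are not covered.

```python
import re


def tokens(text):
    """Lowercase word tokens; punctuation is dropped (assumed tokenization)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, insertions, deletions)
    divided by the number of reference words."""
    ref, hyp = tokens(reference), tokens(hypothesis)
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


def generic_parameters(turns):
    """turns: list of (speaker, text) pairs with speaker 'tutor' or 'student',
    covering one student's whole session (per-student basis)."""
    n_turns = {"tutor": 0, "student": 0}
    n_words = {"tutor": 0, "student": 0}
    for speaker, text in turns:
        n_turns[speaker] += 1
        n_words[speaker] += len(tokens(text))
    return {
        "tutor_turns": n_turns["tutor"],
        "student_turns": n_turns["student"],
        "tutor_words": n_words["tutor"],
        "student_words": n_words["student"],
        "tutor_words_per_turn": n_words["tutor"] / n_turns["tutor"],
        "student_words_per_turn": n_words["student"] / n_turns["student"],
        "student_to_tutor_words": n_words["student"] / n_words["tutor"],
        "student_to_tutor_turns": n_turns["student"] / n_turns["tutor"],
    }
```

For the second student turn in the SYN03 excerpt, the recognizer output differs from the transcript by one substituted word out of eight, so `word_error_rate("The impact force, the normal force, and gravity", "the impact force but normal force and gravity")` gives 0.125 under this definition.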

12 Tutoring-Specific Parameters
- 9 Parameters related to Correctness of Student Turn
  - ITSPOKE labels: Correct, Incorrect, Partially Correct
  - Total and Percent for each label
  - Ratio of each label to every other label
- Total number of essays per student
- Student pretest and posttest score (for User Satisfaction models)
- Similar parameters available in most tutoring systems
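The nine correctness parameters can be derived from the per-turn labels alone. The following is a sketch under assumptions: the label codes `COR`, `INC`, and `PCOR`, the choice of one ratio per unordered label pair (which gives 3 totals + 3 percents + 3 ratios = 9), and the use of `None` for an undefined ratio are all illustrative choices, not details stated on the slide.

```python
from collections import Counter
from itertools import combinations

# Assumed codes for Correct, Incorrect, Partially Correct.
LABELS = ("COR", "INC", "PCOR")


def correctness_parameters(turn_labels):
    """turn_labels: one correctness label per student turn, for one student."""
    counts = Counter(turn_labels)
    n = len(turn_labels)
    params = {}
    for label in LABELS:
        params[f"#{label}"] = counts[label]              # total
        params[f"%{label}"] = 100 * counts[label] / n    # percent of turns
    for a, b in combinations(LABELS, 2):
        # One ratio per pair: INC/COR, PCOR/COR, PCOR/INC.
        params[f"{b}/{a}"] = counts[b] / counts[a] if counts[a] else None
    return params
```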

25 User Affect Parameters
- For each of our 4 Certainness labels:
  - Total, Percent, and Ratio to each other label
  - Total for each sequence of identical labels (e.g. Certain:Certain)
- For each of our 2 Frustration labels:
  - Total, Percent, and Ratio to each other label
  - Total for each sequence of identical labels
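A sketch of how these parameters might be computed from per-turn affect labels follows. It assumes that a "sequence of identical labels" means two adjacent student turns with the same label, and that there is one ratio per unordered label pair; under those assumptions the counts come to 18 for the four Certainness labels and 7 for the two Frustration labels, 25 in total. The function name and the `None` convention for undefined ratios are hypothetical.

```python
from collections import Counter
from itertools import combinations

CERTAINNESS = ("certain", "uncertain", "mixed", "neutral")
FRUSTRATION = ("frustrated", "non-frustrated")


def label_parameters(turn_labels, label_set):
    """turn_labels: one label per student turn, in dialogue order."""
    counts = Counter(turn_labels)
    n = len(turn_labels)
    bigrams = list(zip(turn_labels, turn_labels[1:]))
    params = {}
    for label in label_set:
        params[f"#{label}"] = counts[label]
        params[f"%{label}"] = 100 * counts[label] / n
        # Adjacent turns sharing this label, e.g. certain:certain.
        params[f"{label}:{label}"] = sum(1 for a, b in bigrams if a == b == label)
    for a, b in combinations(label_set, 2):
        params[f"{a}/{b}"] = counts[a] / counts[b] if counts[b] else None
    return params
```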

User Satisfaction Prediction Models
- Predicted Variable: Total Survey Score
  - Range: 9-24 out of a possible 5-25; no differences between corpora (p = .46)
- Input Parameters: Generic and Tutoring
- Do models generalize across corpora (system versions)?
  - Train on PR05 → Test on SYN05
  - Train on SYN05 → Test on PR05
- Do models generalize better within corpora?
  - Train on half PR05 → Test on half PR05 (for each half)
  - Train on half SYN05 → Test on half SYN05 (for each half)
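The experimental design above amounts to six train/test pairings: two across corpora and four within. The sketch below enumerates them. How the authors divided each corpus into halves is not stated, so the in-order split here is an assumption, and the function names are hypothetical.

```python
def split_halves(students):
    """Split one corpus into two halves in list order (assumed split)."""
    mid = len(students) // 2
    return students[:mid], students[mid:]


def experiment_pairs(corpora):
    """corpora: dict mapping corpus name to its list of per-student records.
    Returns (train_name, test_name, train_data, test_data) tuples."""
    pairs = []
    names = list(corpora)
    # Inter-corpus: train on one system version, test on the other.
    for train in names:
        for test in names:
            if train != test:
                pairs.append((train, test, corpora[train], corpora[test]))
    # Intra-corpus: train on each half, test on the other half.
    for name in names:
        h1, h2 = split_halves(corpora[name])
        pairs.append((f"{name}:half1", f"{name}:half2", h1, h2))
        pairs.append((f"{name}:half2", f"{name}:half1", h2, h1))
    return pairs
```

Each pairing would then be passed to whatever regression routine is in use: fit on the training records, report R² on the test records.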

User Satisfaction Prediction Models
- Best Results (on Test Data)
  - Inter-corpus models are weak and don’t generalize well
  - Intra-corpus models generalize better, but are still weak predictors of User Satisfaction
  - Generic and Tutoring parameters selected

Train | R² | Ranked Predictors | Test | R²
SYN05 | .068 | ITSPOKE Words/Turn | PR05 | .018
SYN05 half #2 | .685 | ITSPOKE Words/Turn, Student Words/Turn, #Correct | SYN05 half #1 | .227

User Satisfaction Prediction Models
- Comparison to Prior Work
  - Some of the same parameters also selected as predictors, e.g. in (Walker et al., 2002) (User Words/Turn)
  - Higher best test results (R² = .3 - .5) in (Möller, 2005), (Walker et al., 2002) and (Bonneau-Maynard et al., 2000)

Student Learning Prediction Models
- First Experiments:
  - Data and Input Parameters: same as for User Satisfaction experiments
  - Predicted Variable: Posttest controlled for Pretest (learning gains); significant learning independently of corpus (p < .001)

Train | R² | Ranked Predictors | Test | R²
PR05 | .556 | Pretest, %Correct | SYN05 | .636
SYN05 half #1 | .580 | Pretest, Student Words/Turn | SYN05 half #2 | .556

Student Learning Prediction Models
- First Experiments: (Best Results on Test Data in table)
  - All models account for ~50% of Posttest variance in train and test data
  - Intra-corpus models don’t show higher generalizability
  - Generic and Tutoring parameters selected

Train | R² | Ranked Predictors | Test | R²
PR05 | .556 | Pretest, %Correct | SYN05 | .636
SYN05 half #1 | .580 | Pretest, Student Words/Turn | SYN05 half #2 | .556

Student Learning Prediction Models
- Further experiments:
  - Including third corpus (SYN03) with Generic and Tutoring parameters yields similar results
  - Best Result (on Test Data):

Train | R² | Ranked Predictors | Test | R²
PR05+SYN03 | .413 | Pretest, Time | SYN05 | .586

Student Learning Prediction Models
- Further experiments: including User Affect Parameters can improve results:

Train | R² | Ranked Predictors | Test | R²
SYN03 | .644 | Time, Pretest, #Neutrals | PR05-17 | .411

Posttest = .86 * Time + .65 * Pretest - .54 * #Neutrals

- Same experiment without User Affect Parameters:

Train | R² | Ranked Predictors | Test | R²
SYN03 | .478 | Pretest, Time | PR05-17 | .340
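To make the regression equation on this slide concrete, the sketch below applies its three weights to one student's parameters. It assumes that inputs and the predicted Posttest are z-score normalized, as is common in PARADISE-style models; the slide does not state this. The names `WEIGHTS`, `z_scores`, and `predicted_posttest` are hypothetical.

```python
from statistics import mean, pstdev

# Weights copied from the slide's equation.
WEIGHTS = {"time": 0.86, "pretest": 0.65, "neutrals": -0.54}


def z_scores(values):
    """Normalize raw values to zero mean and unit (population) variance."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def predicted_posttest(z):
    """z: dict of z-scored Time, Pretest and #Neutrals for one student.
    Returns the predicted Posttest, also in normalized units (assumed)."""
    return sum(WEIGHTS[name] * z[name] for name in WEIGHTS)
```

Under these assumptions, a student one standard deviation above the mean on all three parameters is predicted to score .86 + .65 - .54 = .97 standard deviations above the mean Posttest. The negative weight means that, other things equal, more neutral turns go with a lower predicted Posttest.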

Summary: Student Learning Models
- This method of developing a Student Learning model:
  - useful for our tutoring application
- User Affect parameters can increase usefulness of Student Learning Models

Summary: User Satisfaction Models
- This method of developing a User Satisfaction model:
  - less useful for our tutoring application as compared to prior non-tutoring applications
- Why are our User Satisfaction models less useful?
  - Per-student measure of User Satisfaction not fine-grained enough
  - Tutoring systems not designed to maximize User Satisfaction; goal is to maximize Student Learning

Conclusions
- For the tutoring community:
  - PARADISE provides an effective method of extending single Student Learning correlations
- For the spoken dialogue community:
  - When using PARADISE:
    - other performance metrics may be more useful for applications not optimized for User Satisfaction
    - task-specific and user affect parameters may be useful

Future Work
- Investigate usefulness of additional input parameters for predicting Student Learning and User Satisfaction
  - User Affect annotations (once complete)
  - Tutoring Dialogue Acts (e.g. Möller, 2005; Litman and Forbes-Riley, 2006)
  - Discourse Structure annotations (Rotaru and Litman, 2006)

Thank You! Questions?
Further information: http://www.cs.pitt.edu/~litman/itspoke.html

Architecture (diagram)
Components: Cepstral; www server; www browser (java); ITSpoke Text Manager; Spoken Dialogue Manager; Speech Analysis (Sphinx); Essay Analysis (Carmel, Tacitus-lite+); Content Dialogue Manager (Ape, Carmel); Why2
Labels on the diagram: essay, dialogue, student text (xml), tutor turn (xml), html, xml, text, dialogue, repair goals, tutorial goals

User Satisfaction Prediction Models

Train | R² | Predictors | Test | R²
PR05 | .274 | INC, Essays | SYN05 | .001
SYN05 | .068 | Twpt | PR05 | .018
PR05:half1 | .335 | PCOR/INC | PR05:half2 | .137
PR05:half2 | .443 | Strns | PR05:half1 | .079
SYN05:half1 | .455 | Strn/Ttrn | SYN05:half2 | .051
SYN05:half2 | .685 | Twpt, Swpt, COR | SYN05:half1 | .227

Student Learning Prediction Models
- Posttest Score controlled for Pretest Score is the target predicted variable (learning gains)
- System-Generic and Tutoring-Specific parameters available as predictors (same corpora as for User Satisfaction models)

Train | R² | Predictors | Test | R²
PR05 | .556 | Pre, %COR | SYN05 | .636
SYN05 | .736 | Pre, INC/COR, Swpt | PR05 | .472
PR05:half1 | .840 | Pre, PCOR | PR05:half2 | .128
PR05:half2 | .575 | PCOR/INC, Pre | PR05:half1 | .485
SYN05:half1 | .580 | Pre, Swpt | SYN05:half2 | .556
SYN05:half2 | .855 | Pre, Touts | SYN05:half1 | .384

Student Learning Prediction Models
- Further Experiments:
  - Including SYN03 Corpus yields similar results (same parameters as prior experiments, different datasets)
  - Training on corpus most similar to test corpus yields highest generalizability

Train | R² | Predictors | Test | R²
PR05+SYN03 | .413 | Pre, Time | SYN05 | .586
PR05+SYN05 | .621 | Pre, INC/COR | SYN03 | .237
SYN05+SYN03 | .590 | INC/COR, %INC, Pre, Time | PR05 | .244

Student Learning Prediction Models
- Including User Affect Parameters can increase generalizability of Student Learning models (more parameters than prior experiments, different datasets)

Train | R² | Predictors | Test | R²
SYN03 (affect) | .644 | Time, Pre, NEU | PR05:17 | .411
PR05:17 (affect) | .835 | Pre, NFA:NFA, Swpt | SYN03 | .127
SYN03 (no affect) | .478 | Pre, Time | PR05:17 | .340
PR05:17 (no affect) | .609 | Pre, Strns/Ttrns | SYN03 | .164

Student Learning Prediction Models
- Further experiments:
  - Including third corpus (SYN03) with same Generic and Tutoring-Specific parameters yields similar results
  - Training set most similar to test set yields highest generalizability

Train | R² | Predictors | Test | R²
PR05+SYN03 | .413 | Pretest, Time | SYN05 | .586

User Satisfaction Prediction Models
- Comparison to Prior Work
  - Some of the same parameters also selected as predictors, e.g. in (Walker et al., 2002) (User Words/Turn)
  - Higher best test results (R² = .3 - .5) in (Möller, 2005), (Walker et al., 2002) and (Bonneau-Maynard et al., 2000)
  - Similar sensitivity to changes in training data in (Möller, 2005) and (Walker et al., 2000)

Student Learning Prediction Models
- First Experiments: (Best Results on Test Data in table)
  - All models account for ~50% of Posttest variance in train and test data; less sensitive to training data changes
  - Intra-corpus models don’t have higher generalizability
  - Generic and Tutoring parameters selected

Train | R² | Predictors | Test | R²
PR05 | .556 | Pretest, %Correct | SYN05 | .636
SYN05 half #1 | .580 | Pretest, Student Words/Turn | SYN05 half #2 | .556
