Student simulation and evaluation DOD meeting Hua Ai 03/03/2006.

Student simulation and evaluation DOD meeting Hua Ai (hua@cs.pitt.edu) 03/03/2006

2 Outline
- Motivations
- Backgrounds
- Corpus
- Student Simulation Model
- Comparisons
- Conclusions & Future Work

3 Motivations
- For a larger corpus
  - Reinforcement Learning (RL) is used to learn the best policy for spoken dialogue systems automatically
  - The best strategy may often not even be present in a small dataset
- For a cheaper corpus
  - Human subjects are expensive

4 Strategy learning using a simulated user (Schatzmann et al., 2005)
[Diagram; component labels: Dialog Corpus, Simulation models, Simulated User, Dialog Manager, Reinforcement Learning, Strategy]

5 Backgrounds (1)
- Education community
  - Focuses on changes in the forms of the student’s inner-brain knowledge representation
  - Usually not dialogue based
  - Simulated students (VanLehn et al., 1994) are used for
    - Tutor training
    - Collaborative learning

6 Backgrounds (2)
- Dialogue community
  - Focuses on interactions and dialogue behaviors
  - Simulated users have a limited set of actions to take (Schatzmann et al., 2005)
  - Simulation is done at the dialogue act (DA) level

7 Corpus (1)
- Spoken dialogue physics tutor (ITSPOKE)

8 Corpus (2)
- Tutoring procedure
  - Dialogue: (T) Question, (S) Answer, (T) Q, (S) A, …
  - Essay revision
  - Dialogue: (T) Question, (S) Answer, (T) Q, (S) A, …
  - Essay revision
  - Dialogue …
- 5 problems

9 Corpus (3)
- Tutor’s behaviors
  - Defined in KCDs (Knowledge Construction Dialogues)
  - The tutor branches on the student answer: Correct vs. Incorrect/Partially Correct

10 Corpus (4)

corpus (tutor voice)    #dialogues          stuWord    stuTurn    tutorWord   tutorTurn
f03 (synthesized)       100          avg    57.16      23.35      1256.92     29.64
                                     stdev  45.57638   17.44334   849.8195    19.76351
05syn (synthesized)     136          avg    91.0963    30.78519   1655.467    38.06667
                                     stdev  53.82931   14.42551   757.8744    16.32469
05pre (pre-recorded)    135          avg    87.34559   30.11765   1597.206    37.33088
                                     stdev  55.48004   16.96972   832.9845    18.20096

- f03 vs. s05: different groups of subjects
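The four per-dialogue features reported for each corpus (stuWord, stuTurn, tutorWord, tutorTurn) can be illustrated with a minimal Python sketch. The input format (a list of `(speaker, utterance)` pairs), whitespace tokenization, and the use of the sample standard deviation are assumptions made for illustration; the slides do not say how the counts were obtained.

```python
from statistics import mean, stdev

def dialogue_features(turns):
    """Count words and turns per speaker for one dialogue.

    turns: list of (speaker, utterance) pairs, where speaker is
    "student" or "tutor" (hypothetical format, not from the slides).
    """
    feats = {"stuWord": 0, "stuTurn": 0, "tutorWord": 0, "tutorTurn": 0}
    for speaker, utterance in turns:
        prefix = "stu" if speaker == "student" else "tutor"
        feats[prefix + "Word"] += len(utterance.split())  # whitespace tokens
        feats[prefix + "Turn"] += 1
    return feats

def corpus_summary(dialogues):
    """Return {feature: (avg, stdev)} over all dialogues in a corpus.

    Needs at least two dialogues, because the sample standard
    deviation is undefined for a single value.
    """
    per_dialogue = [dialogue_features(d) for d in dialogues]
    summary = {}
    for name in per_dialogue[0]:
        values = [f[name] for f in per_dialogue]
        summary[name] = (mean(values), stdev(values))
    return summary
```

Calling `corpus_summary` on each corpus would give one avg/stdev row pair per corpus, in the same layout as the table.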

11 Simulation Models (1)
- Simulating at the word level
  - Students have more complex behaviors
  - DA information alone isn’t enough for the system
- Two models trained on two corpora

               f03              s05
  ProbCorrect  03ProbCorrect    05ProbCorrect
  Random       03Random         05Random

12 Simulation Models (2)
- ProbCorrect Model
  - Simulates the average knowledge level of real students
  - Simulates meaningful dialogue behaviors
- Random Model
  - Produces nonsense answers
  - Serves as a contrast

13 ProbCorrect Model
- Real corpus
  - question1: Answer1_1 (c), Answer1_2 (ic), Answer1_3 (ic)
  - question2: Answer2_1 (c), Answer2_2 (ic)
- Candidate answers
  - For question1, c:ic = 1:2
    - c: Answer1_1
    - ic: Answer1_2, Answer1_3
  - For question2, c:ic = 1:1
    - c: Answer2_1
    - ic: Answer2_2
- ProbCorrect Model, answering Question 1:
  1) Choose to give a c/ic answer with the same average probability as real students
  2) Randomly choose one answer from the corresponding answer set
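The two ProbCorrect steps can be sketched in Python as follows. This is an illustrative sketch, not the code used in the study: the `(question, answer, is_correct)` corpus format and the function names are hypothetical, and the correct-answer probability is assumed to be the fraction of correct answers observed for that question in the training corpus.

```python
import random
from collections import defaultdict

def train_prob_correct(corpus):
    """Group real student answers by question and correctness.

    corpus: iterable of (question, answer, is_correct) tuples
    (hypothetical format). Every occurrence is kept, so the c:ic
    ratio per question mirrors the real corpus.
    """
    model = defaultdict(lambda: {"c": [], "ic": []})
    for question, answer, is_correct in corpus:
        model[question]["c" if is_correct else "ic"].append(answer)
    return dict(model)  # plain dict: unseen questions raise KeyError

def prob_correct_answer(model, question, rng=random):
    """Pick a simulated student answer for a question."""
    sets = model[question]
    n_c, n_ic = len(sets["c"]), len(sets["ic"])
    # Step 1: answer correctly with the same probability as real students.
    p_correct = n_c / (n_c + n_ic)
    label = "c" if rng.random() < p_correct else "ic"
    # Step 2: pick uniformly from the matching candidate answer set.
    return label, rng.choice(sets[label])

# The example corpus from the slide.
slide_corpus = [
    ("question1", "Answer1_1", True),
    ("question1", "Answer1_2", False),
    ("question1", "Answer1_3", False),
    ("question2", "Answer2_1", True),
    ("question2", "Answer2_2", False),
]
model = train_prob_correct(slide_corpus)
label, answer = prob_correct_answer(model, "question1", random.Random(0))
```

With this corpus, question1 is answered correctly about one time in three and question2 about half the time, matching the 1:2 and 1:1 ratios on the slide.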

14 Random Model
- HC03&05
  - Question1: Answer1_1, Answer1_2, Answer1_3, Answer1_4
  - Question2: Answer2_1, Answer2_2
- Candidate answers
  1) Answer1_1
  2) Answer1_2
  3) Answer1_3
  4) Answer1_4
  5) Answer2_1
  6) Answer2_2
- Big random model, answering Question i:
  - Any of the 6 answers, each with the same probability (regardless of the question!)
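For contrast with ProbCorrect, the Random model can be sketched in a few lines of Python. The `(question, answer)` corpus format and the function names are hypothetical; the only behavior taken from the slide is that every candidate answer is equally likely, whatever the question.

```python
import random

def train_random(corpus):
    """Pool every student answer in the corpus, ignoring its question.

    corpus: iterable of (question, answer) pairs (hypothetical format).
    """
    return [answer for _question, answer in corpus]

def random_answer(pool, question, rng=random):
    """Return any pooled answer with equal probability.

    The question argument is deliberately unused: the Random model
    answers regardless of what was asked.
    """
    return rng.choice(pool)

# The example corpus from the slide.
slide_pairs = [
    ("Question1", "Answer1_1"), ("Question1", "Answer1_2"),
    ("Question1", "Answer1_3"), ("Question1", "Answer1_4"),
    ("Question2", "Answer2_1"), ("Question2", "Answer2_2"),
]
pool = train_random(slide_pairs)
reply = random_answer(pool, "Question2", random.Random(1))
```

Here a reply to Question2 is drawn from all six answers, so it is an answer to Question1 four times out of six on average.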

15 Experiments
- Comparisons between real corpora
- Comparisons between real & simulated corpora
- Comparisons between simulated corpora

16 Real corpora comparisons (1)
- Evaluation metrics
  - High-level dialog features
  - Dialog style and cooperativeness
  - Dialog success rate and efficiency
  - Learning gains

17 Real corpora comparisons (2)
- High-level dialog features

18 Real corpora comparisons (3)
- Dialogue style features

19 Real corpora comparisons (3)
- Dialogue success rate

20 Real corpora comparisons (4)
- Learning gains features

21 Results
- The differences captured by these simple metrics can’t tell us whether a corpus is real or not (Schatzmann et al., 2005)
- The differences could be due to different user populations

22 Real Vs Simulated Corpora Comparisons

23 Results (1)
- Most of the measurements are able to distinguish between the Random and ProbCorrect models
  - The ProbCorrect model generates more realistic behaviors
  - We can’t draw conclusions about the power of these metrics, since the two simulated corpora are really different

24 Results (2)
- Differences between the real corpora and the Random models are captured clearly, but differences between the real corpora and the ProbCorrect models are not clear
  - We don’t expect this simple model to produce a very realistic corpus, so it’s surprising that the differences are small

25 Results (3)
- s05 variety > f03 variety
- 05ProbCorrect variety > 03ProbCorrect variety
- However, we don’t get significantly more variety in the simulated corpora than in the real ones
  - Could be because the computer tutor is simple (c/ic)
  - We’re using the same candidate answer set

26 Results (4)
- ProbCorrect models trained on different real corpora are quite different
- A ProbCorrect model is more similar to the real corpus it was trained on than to the other real corpus

27 Comparisons between simulated dialogues with different dialogue structure

28 Results
- Larger differences between the two simulated corpora in prob7 than in prob34
  - The dialogue structure of prob34 is more restricted
- The power of these simple metrics is restricted by the dialogue structure

29 Conclusions
- The simple measurements can distinguish between
  - Real corpora
    - Different populations
  - Simulated and real corpora
    - To different extents
  - Simulated corpora
    - Different models
    - Trained on different corpora
- Their power is limited by the dialogue structure

30 Future work
- Explore “deep” evaluation metrics
- Test simulated corpus on policy
- More simulation models
  - More human features
    - Emotion, learning
  - Special cases
    - Quick learners, slow learners
