

Presentation on theme: "Observational Methods Part Two January 20, 2010. Today’s Class Survey Results Probing Question for today Observational Methods Probing Question for next."— Presentation transcript:

1 Observational Methods Part Two January 20, 2010

2 Today’s Class Survey Results Probing Question for today Observational Methods Probing Question for next class Assignment 1

3 Survey Results Much broader response this time – thanks! – Good to go with modern technology Generally positive comments – Some contradictions in best-part, worst-part – The best sign one is doing well is when everyone wants contradictory changes? I will give another survey in a few classes

4 Today’s Class Survey Results Probing Question for today Observational Methods Probing Question for next class Assignment 1

5 Probing Question For today, you have read D'Mello, S., Taylor, R.S., Graesser, A. (2007) Monitoring Affective Trajectories during Complex Learning. Proceedings of the 29th Annual Meeting of the Cognitive Science Society, 203-208 Which used data from a lab study If you wanted to study affective transitions in real classrooms, which of the methods we discussed today would be best? Why?

6 What’s the best way? First, let’s list out the methods For now, don’t critique, just describe your preferred method – One per person, please – If someone else has already presented your method, no need to repeat it – If you propose something similar, quickly list the difference (no need to say why right now)

7 For each method What are the advantages? What are the disadvantages?

8 Votes for each method

9 Today’s Class Survey Results Probing Question for today Observational Methods Probing Question for next class Assignment 1

10 Topics Measures of agreement Study of prevalence Correlation to other constructs Dynamics models Development of EDM models (ref to later)

11 Agreement/Accuracy The easiest measure of inter-rater reliability is agreement, also called accuracy:

    Agreement = (# of agreements) / (total number of codes)

12 Agreement/ Accuracy There is general agreement across fields that agreement/accuracy is not a good metric What are some drawbacks of agreement/accuracy?

13 Agreement/Accuracy Let’s say that Tasha and Uniqua agreed on the classification of 9200 out of 10000 observations – For a coding scheme with two codes 92% accuracy Good, right?

14 Non-even assignment to categories Percent Agreement does poorly when there is non-even assignment to categories – Which is almost always the case Imagine an extreme case – Uniqua (correctly) picks category A 92% of the time – Tasha always picks category A Agreement/accuracy of 92% But essentially no information

15 An alternate metric: Kappa

    Kappa = (Agreement – Expected Agreement) / (1 – Expected Agreement)

16 Kappa Expected agreement computed from a table of the form:

                         Rater 2 Category 1   Rater 2 Category 2
    Rater 1 Category 1         Count                Count
    Rater 1 Category 2         Count                Count

17 Kappa Expected agreement computed from a table of the form Note that Kappa can be calculated for any number of categories (but only 2 raters)

                         Rater 2 Category 1   Rater 2 Category 2
    Rater 1 Category 1         Count                Count
    Rater 1 Category 2         Count                Count

18 Cohen’s (1960) Kappa The formula for 2 categories Fleiss’s (1971) Kappa, which is more complex, can be used for 3+ categories – I have an Excel spreadsheet which calculates multi-category Kappa, which I would be happy to share with you

19 Expected agreement Look at the proportion of labels each coder gave to each category To find the number of agreed category A that could be expected by chance, multiply pct(coder1/categoryA)*pct(coder2/categoryA) Do the same thing for categoryB Add these two values together and divide by the total number of labels This is your expected agreement

20 Example

                      Pablo Off-Task   Pablo On-Task
    Tyrone Off-Task        20                5
    Tyrone On-Task         15               60

21 Example What is the percent agreement?

                      Pablo Off-Task   Pablo On-Task
    Tyrone Off-Task        20                5
    Tyrone On-Task         15               60

22 Example What is the percent agreement? 80%

                      Pablo Off-Task   Pablo On-Task
    Tyrone Off-Task        20                5
    Tyrone On-Task         15               60

23 Example What is Tyrone’s expected frequency for on-task?

                      Pablo Off-Task   Pablo On-Task
    Tyrone Off-Task        20                5
    Tyrone On-Task         15               60

24 Example What is Tyrone’s expected frequency for on-task? 75%

                      Pablo Off-Task   Pablo On-Task
    Tyrone Off-Task        20                5
    Tyrone On-Task         15               60

25 Example What is Pablo’s expected frequency for on-task?

                      Pablo Off-Task   Pablo On-Task
    Tyrone Off-Task        20                5
    Tyrone On-Task         15               60

26 Example What is Pablo’s expected frequency for on-task? 65%

                      Pablo Off-Task   Pablo On-Task
    Tyrone Off-Task        20                5
    Tyrone On-Task         15               60

27 Example What is the expected on-task agreement?

                      Pablo Off-Task   Pablo On-Task
    Tyrone Off-Task        20                5
    Tyrone On-Task         15               60

28 Example What is the expected on-task agreement? 0.65*0.75 = 0.4875

                      Pablo Off-Task   Pablo On-Task
    Tyrone Off-Task        20                5
    Tyrone On-Task         15               60

29 Example What is the expected on-task agreement? 0.65*0.75 = 0.4875

                      Pablo Off-Task   Pablo On-Task
    Tyrone Off-Task        20                5
    Tyrone On-Task         15               60 (48.75)

30 Example What are Tyrone and Pablo’s expected frequencies for off-task behavior?

                      Pablo Off-Task   Pablo On-Task
    Tyrone Off-Task        20                5
    Tyrone On-Task         15               60 (48.75)

31 Example What are Tyrone and Pablo’s expected frequencies for off-task behavior? 25% and 35%

                      Pablo Off-Task   Pablo On-Task
    Tyrone Off-Task        20                5
    Tyrone On-Task         15               60 (48.75)

32 Example What is the expected off-task agreement?

                      Pablo Off-Task   Pablo On-Task
    Tyrone Off-Task        20                5
    Tyrone On-Task         15               60 (48.75)

33 Example What is the expected off-task agreement? 0.25*0.35 = 0.0875

                      Pablo Off-Task   Pablo On-Task
    Tyrone Off-Task        20                5
    Tyrone On-Task         15               60 (48.75)

34 Example What is the expected off-task agreement? 0.25*0.35 = 0.0875

                      Pablo Off-Task   Pablo On-Task
    Tyrone Off-Task        20 (8.75)         5
    Tyrone On-Task         15               60 (48.75)

35 Example What is the total expected agreement?

                      Pablo Off-Task   Pablo On-Task
    Tyrone Off-Task        20 (8.75)         5
    Tyrone On-Task         15               60 (48.75)

36 Example What is the total expected agreement? 0.4875 + 0.0875 = 0.575

                      Pablo Off-Task   Pablo On-Task
    Tyrone Off-Task        20 (8.75)         5
    Tyrone On-Task         15               60 (48.75)

37 Example What is kappa?

                      Pablo Off-Task   Pablo On-Task
    Tyrone Off-Task        20 (8.75)         5
    Tyrone On-Task         15               60 (48.75)

38 Example What is kappa? (0.8 – 0.575) / (1 – 0.575) = 0.225 / 0.425 = 0.529

                      Pablo Off-Task   Pablo On-Task
    Tyrone Off-Task        20 (8.75)         5
    Tyrone On-Task         15               60 (48.75)

39 So is that any good? What is kappa? (0.8 – 0.575) / (1 – 0.575) = 0.225 / 0.425 = 0.529

                      Pablo Off-Task   Pablo On-Task
    Tyrone Off-Task        20 (8.75)         5
    Tyrone On-Task         15               60 (48.75)
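The worked example above can be sketched in a few lines of code. This is a minimal illustration (the helper name is mine, not from the lecture) that computes Cohen’s Kappa from the same 2x2 agreement table:

```python
def kappa_2x2(table):
    """Cohen's Kappa for a 2x2 agreement table.
    table[i][j] = count of codes where rater 1 gave category i
    and rater 2 gave category j."""
    total = sum(sum(row) for row in table)
    # observed agreement: the diagonal cells
    observed = (table[0][0] + table[1][1]) / total
    # marginal proportions for each rater
    r1 = [sum(row) / total for row in table]        # rater 1 (rows)
    r2 = [sum(col) / total for col in zip(*table)]  # rater 2 (columns)
    # expected agreement: chance of agreeing on each category, summed
    expected = sum(p1 * p2 for p1, p2 in zip(r1, r2))
    return (observed - expected) / (1 - expected)

# Tyrone (rows) vs. Pablo (columns): off-task first, then on-task
print(round(kappa_2x2([[20, 5], [15, 60]]), 3))  # 0.529
```

The same marginal-proportion logic generalizes to more categories by summing over a longer diagonal, which is what multi-category Kappa does.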

40 Interpreting Kappa Kappa = 0 – Agreement is at chance Kappa = 1 – Agreement is perfect Kappa = negative infinity – Agreement is perfectly inverse Kappa > 1 – You messed up somewhere

41 Kappa<0 It does happen, but usually not in the case of inter-rater reliability Occasionally seen when Kappa is used for EDM or other types of machine learning – More on this in 2 months!

42 0<Kappa<1 What’s a good Kappa? There is no absolute standard For inter-rater reliability, – 0.8 is usually what ed. psych. reviewers want to see – You can usually make a case that values of Kappa around 0.6 are good enough to be usable for some applications Particularly if there’s a lot of data Or if you’re collecting observations to drive EDM – Remember that Baker, Corbett, & Wagner (2006) had Kappa = 0.58

43 Landis & Koch’s (1977) scale

    κ             Interpretation
    < 0           No agreement
    0.00 – 0.20   Slight agreement
    0.21 – 0.40   Fair agreement
    0.41 – 0.60   Moderate agreement
    0.61 – 0.80   Substantial agreement
    0.81 – 1.00   Almost perfect agreement

44 Why is there no standard? Because Kappa is scaled by the proportion of each category When one class is much more prevalent – Expected agreement is higher than if classes are evenly balanced

45 Because of this… Comparing Kappa values between two studies, in a principled fashion, is highly difficult A lot of work went into statistical methods for comparing Kappa values in the 1990s No real consensus Informally, you can compare two studies if the proportions of each category are “similar”

46 There is a way to statistically compare two inter-rater reliabilities… “Junior high school” meta-analysis

47 There is a way to statistically compare two inter-rater reliabilities… “Junior high school” meta-analysis – Do a 1 df Chi-squared test on each reliability, convert the Chi-squared values to Z, and then compare the two Z values using the method in Rosenthal & Rosnow (1991)

48 There is a way to statistically compare two inter-rater reliabilities… “Junior high school” meta-analysis – Do a 1 df Chi-squared test on each reliability, convert the Chi-squared values to Z, and then compare the two Z values using the method in Rosenthal & Rosnow (1991) – Or in other words, nyardley nyardley nyoo

49 Comments? Questions?

50 Topics Measures of agreement Study of prevalence Correlation to other constructs Dynamics models Development of EDM models

51 Next step… Once you have analyzed inter-rater reliability, and you “trust” your observation codes You can conduct analyses with your findings

52 One simple question What is the prevalence of each category?

53 Why might this be interesting?

54 Some examples of studies What is the prevalence of teacher behavior X in Japan versus the USA? (Stigler & Hiebert, 1997) What is the prevalence of student off-task behavior in the USA versus the Philippines? (Baker et al, submitted) Does the prevalence of gaming the system drop when we try to reduce gaming with an animated agent? (Baker et al, 2006)

55 Approach Apply same coding scheme in situations A and B – Sometimes, find previous work where coding scheme was applied to situation A – And apply the same coding scheme to situation B Find prevalence of behavior for each student in each situation – Use an unpaired t-test to compare (in your favorite stats package)
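The comparison step above can be sketched as follows. This is an illustrative Welch (unpaired, unequal-variance) t statistic in plain Python, using the first six students of each group from the example data; in practice a stats package (e.g. scipy.stats.ttest_ind) would also report the p-value.

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's unpaired t statistic comparing per-student prevalence
    of a behavior in situation A vs. situation B."""
    se2 = variance(a) / len(a) + variance(b) / len(b)
    return (mean(a) - mean(b)) / se2 ** 0.5

# First six students from each group in the example data
wpi     = [0.15, 0.07, 0.12, 0.09, 0.11, 0.08]
harvard = [0.25, 0.17, 0.22, 0.23, 0.19, 0.18]
print(welch_t(wpi, harvard) < 0)  # True: WPI prevalence is lower here
```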

56 Data might look like

    WPI students   % off-task      Harvard stus.   % off-task
    Student1       15%             Student1        25%
    Student2        7%             Student2        17%
    Student3       12%             Student3        22%
    Student4        9%             Student4        23%
    Student5       11%             Student5        19%
    Student6        8%             Student6        18%
    Student7        4%             Student7        14%
    Student8        6%             Student8        64%
    Student9       15%             Student9         8%
    Student10      10%             Student10       30%
    Student11       4%             Student11       24%
    Student12      14%             Student12       101%

57 Can also do Apply single coding scheme in situation A Find prevalence of behaviors B1 and B2 for each student – Use a paired t-test to compare (in your favorite stats package)
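The within-student version can be sketched the same way. A minimal paired t statistic in plain Python (scipy.stats.ttest_rel is the packaged equivalent):

```python
from statistics import mean, stdev

def paired_t(b1, b2):
    """Paired t statistic: does the prevalence of behavior B1 differ
    from behavior B2 within the same students?"""
    diffs = [x - y for x, y in zip(b1, b2)]
    return mean(diffs) / (stdev(diffs) / len(diffs) ** 0.5)
```

The key difference from the unpaired case is that each student serves as their own control: only the per-student differences enter the statistic.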

58 Data might look like

    WPI students   % bored   % frustrated
    Student1       15%       11%
    Student2        7%       13%
    Student3       12%       14%
    Student4        9%        1%
    Student5       11%        8%
    Student6        8%        7%
    Student7        4%        4%
    Student8        6%       18%
    Student9       15%        6%
    Student10      10%       22%
    Student11       4%        7%
    Student12      14%       19%

59 Comments? Questions?

60 Topics Measures of agreement Study of prevalence Correlation to other constructs Dynamics models Development of EDM models

61 Another question How do these behaviors we coded correlate to some other construct we’re interested in? “Correlation” = They vary together (e.g. as one goes up, the other goes up; as one goes down, the other goes down) (for more info on correlation, please attend optional session)

62 Why might this be interesting?

63 Some examples of studies What is the relationship between off-task behavior and learning? (Lahaderne, 1968; Karweit & Slavin, 1981; Baker et al, 2004; Gobel et al, 2008; Rowe et al, 2009) What is the relationship between gaming the system and learning? (Baker et al, 2004; Walonoski & Heffernan, 2006) What is the relationship between insults in collaborative learning, and learning? (Prata et al, 2008) What is the relationship between gaming the system and student attitudes, as measured by questionnaires? (Baker et al, 2005; Walonoski & Heffernan, 2006)

64 Potential Measures Knowledge Motivational/attitudinal surveys Learning Gain (we will talk about special statistical methods for correlating to this on Feb. 12) Robust Learning (discussed on Feb. 12)

65 Approach Apply coding scheme to find prevalence of behavior A for each student Collect additional measure for each student Compute correlation between prevalence of behavior and additional measure – Statistical significance can be computed in your favorite stats package, using linear regression, or using the formula on the inside cover of Rosenthal & Rosnow (1991) – Note that a different approach is needed for learning gains; will be discussed on Feb. 12
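The correlation step can be sketched in plain Python (scipy.stats.pearsonr would also return the significance directly):

```python
def pearson_r(x, y):
    """Pearson correlation between per-student behavior prevalence (x)
    and some additional per-student measure (y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# perfectly linear toy data correlates at r = 1
print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 1))  # 1.0
```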

66 Data might look like

    WPI students   % bored   grit scale (1-6)
    Student1       15%       1
    Student2        7%       3
    Student3       12%       4
    Student4        9%       1
    Student5       11%       6
    Student6        8%       6
    Student7        4%       4
    Student8        6%       5
    Student9       15%       6
    Student10      10%       2
    Student11       4%       6
    Student12      14%       6

67 Comments? Questions?

68 Topics Measures of agreement Study of prevalence Correlation to other constructs Dynamics models Development of EDM models

69 Dynamics Models Several approaches to creating dynamics models – Markov Models – Sequential Pattern Mining – D’Mello’s L (D’Mello et al, 2007) Note: This has very little relationship to System Dynamics models

70 In fact You can construct a Markov Model using D’Mello’s L Let’s take a look – I will define Markov Models when we get there

71 Step 1 Lay out all your data, in terms of what the observation is at time N, and what the observation is at time N+1

72 Data might look like

    WPI students   obs-time   category-now   category-next
    Student1       1          CONFUSED       FLOW
    Student1       2          FLOW           CONFUSED
    Student1       3          CONFUSED       FLOW
    Student1       4          FLOW           FLOW
    Student1       5          FLOW           FLOW
    Student1       6          FLOW           none
    Student2       1          FRUSTRATED     BORED
    Student2       2          BORED          BORED
    Student2       3          BORED          BORED
    Student2       4          BORED          BORED
    Student2       5          BORED          BORED
    Student2       6          BORED          none

73 Step 2 Break your data down by student

74 Step 3 For each student, compute the D’Mello’s L likelihood that category A will be followed by category B

75 D’Mello’s L

    L = (% time B follows A – expected % B) / (1 – expected % B)

76 D’Mello’s L That’s right, it’s Kappa

    L = (% time B follows A – expected % B) / (1 – expected % B)

77 D’Mello’s L Expected % B is computed as the overall % of time that B is seen after any other category

    L = (% time B follows A – expected % B) / (1 – expected % B)

78 Example % BORED after FRUSTRATED: 20% % BORED after ANYTHING: 10% % FRUSTRATED after ANYTHING: 30% What is D’Mello’s L?

79 Example % BORED after FRUSTRATED: 20% % BORED after ANYTHING: 10% % FRUSTRATED after ANYTHING: 30% (20% - 10%) / (100% - 10%) 10% / 90% 0.111111
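The worked example can be sketched directly from the formula (the % FRUSTRATED figure is context; the formula itself only needs the first two numbers):

```python
def dmello_L(pct_b_after_a, pct_b_overall):
    """D'Mello's L: Kappa-style measure of how much more likely
    category B is after category A than after anything."""
    return (pct_b_after_a - pct_b_overall) / (1 - pct_b_overall)

# BORED after FRUSTRATED (20%) vs. BORED after anything (10%)
print(round(dmello_L(0.20, 0.10), 4))  # 0.1111
```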

80 Step 4 For each transition, find the mean and standard error L across all students

81 Data might look like

    WPI students   category-now   category-next   D’Mello L
    Student1       CONFUSED       FLOW             0.4
    Student2       CONFUSED       FLOW             0.5
    Student3       CONFUSED       FLOW             0.2
    Student4       CONFUSED       FLOW             0.1
    Student5       CONFUSED       FLOW             0.05
    Student6       CONFUSED       FLOW             0.2
    Student1       CONFUSED       BORED            0.05
    Student2       CONFUSED       BORED            0.1
    Student3       CONFUSED       BORED            0.05
    Student4       CONFUSED       BORED            0.00
    Student5       CONFUSED       BORED           -0.05
    Student6       CONFUSED       BORED            0.1

82 Data might look like

    WPI students   category-now   category-next   D’Mello L
    Student1       CONFUSED       FLOW             0.4
    Student2       CONFUSED       FLOW             0.5
    Student3       CONFUSED       FLOW             0.2
    Student4       CONFUSED       FLOW             0.1
    Student5       CONFUSED       FLOW             0.05
    Student6       CONFUSED       FLOW             0.2     Mean = 0.24, Stdev = 0.17, Stderr = 0.07
    Student1       CONFUSED       BORED            0.05
    Student2       CONFUSED       BORED            0.1
    Student3       CONFUSED       BORED            0.05
    Student4       CONFUSED       BORED            0.00
    Student5       CONFUSED       BORED           -0.05
    Student6       CONFUSED       BORED            0.1     Mean = 0.04, Stdev = 0.06, Stderr = 0.02

83 Step 5 You can now determine if a transition is significantly more likely than chance with a 1- sample t-test Or you can determine if two transitions differ in likelihood with a paired t-test
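The chance test can be sketched as follows, using the CONFUSED→FLOW L values from the example data; scipy.stats.ttest_1samp is the packaged equivalent and also gives the p-value.

```python
from statistics import mean, stdev

def one_sample_t(values, mu=0.0):
    """One-sample t statistic: is the mean L across students
    different from chance (L = 0)?"""
    return (mean(values) - mu) / (stdev(values) / len(values) ** 0.5)

# CONFUSED -> FLOW L values from the example data (mean 0.24, stderr 0.07)
print(round(one_sample_t([0.4, 0.5, 0.2, 0.1, 0.05, 0.2]), 2))  # 3.39
```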

84 Step 6 Take the transitions that are significantly different than chance and graph them (nodes: BORED, FRUST, CONF, FLOW, DELIGHT)

85 This is… A Markov Model – Markov Model is a model of transitions and probabilities, which only considers single transitions

86 Similar to… Hidden Markov Models (HMMs) which you may have seen in AI classes – HMMs have latent (i.e. unobservable) states, and observable outputs, which are emitted by each state with a certain probability – But in this case, our observations tell us what the student state is.

87 Differentiated From… Sequential Pattern Mining, where sequences of more than one transition are considered In a Markov Model – P(BORED->FRUSTRATED->BORED) = P(BORED->FRUSTRATED)*P(FRUSTRATED->BORED) In multi-step Sequential Pattern Mining approaches, this assumption does not hold
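The single-step assumption above can be made concrete with a tiny sketch (the transition probabilities here are made up for illustration, not taken from the lecture's data):

```python
# Illustrative single-step transition probabilities (hypothetical values)
P = {
    ("BORED", "FRUSTRATED"): 0.3,
    ("FRUSTRATED", "BORED"): 0.2,
}

def markov_path_prob(path):
    """Under the Markov assumption, a multi-step path probability
    factors into a product of single-step transition probabilities."""
    prob = 1.0
    for now, nxt in zip(path, path[1:]):
        prob *= P[(now, nxt)]
    return prob

# P(BORED -> FRUSTRATED -> BORED) = 0.3 * 0.2
print(round(markov_path_prob(["BORED", "FRUSTRATED", "BORED"]), 2))  # 0.06
```

A multi-step sequential pattern miner would instead estimate the whole three-element sequence's frequency directly, without assuming this factorization.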

88 Comments? Questions?

89 Topics Measures of agreement Study of prevalence Correlation to other constructs Dynamics models Development of EDM models

90 Final use… Many times, observations are used to create EDM models that then are used instead of the original observations We will talk about this on March 3rd – Why you might want to do this – Advantages and drawbacks – And *how* you do this

91 Today’s Class Survey Results Probing Question for today Observational Methods Probing Question for next class Assignment 1

92 Probing Question Think of something awesome that Stigler & Hiebert could do with their coded data – What could be done? – How would one go about doing it, at a very high level? If you want, you can also pretend that Stigler & Hiebert handed out and coded any kind of paper survey or measure, as long as a student can fill it out in less than an hour

93 Today’s Class Survey Results Probing Question for today Observational Methods Probing Question for next class Assignment 1

94 Please take a moment to read through assignment 1 Any questions about the assignment?

95 The End

