Evaluation and Control of Rater Reliability: Holistic vs. Analytic Scoring EALTA, Athens May 9-11, 2008 Claudia Harsch, IQB Guido Martin, IEA DPC.

1 Evaluation and Control of Rater Reliability: Holistic vs. Analytic Scoring EALTA, Athens May 9-11, 2008 Claudia Harsch, IQB Guido Martin, IEA DPC

2 Overview
1. Background
- Standards-based assessment in Germany, here: Writing in EFL
- Writing tasks and rating approach
2. Feasibility Studies
- Feasibility Study I, May 2007: trial scales and approach
- Feasibility Study II, June 2007: trial holistic vs. analytic approach
3. Pilot Study, July/August 2007
- Training
- Comparison FS II vs. Pilot Study Training

4 Background: Assessing ES in Germany
- Evaluation of Educational Standards for grades 9 and 10 by IQB Berlin
- In Foreign Languages, standards are linked to the CEF, targeting
  - A2 for lower track of secondary school
  - B1 for middle track of secondary school
- Assessment of 4 skills: reading, listening, writing and speaking (under development)
- Tasks based on CEF levels A1 to C1; uni-level approach

5 Sample task: Keeper, targeting B1

6 Assessment of Writing Tasks
Criteria of assessment, each defined by descriptors based on CEF, Manual, Into Europe:
- task fulfilment
- organisation
- grammar
- vocabulary
- overall impression
Rating approach:
- A uni-level approach to grading the tasks in line with the specific target level
- Performance to be graded on a below / pass / pass plus basis
- "Holistic approach": Ratings are the result of a weighted assessment of several descriptors per criterion

8 Feasibility Study I, May 2007
Aims:
- Trial training / rating approach with student teachers
- Gain insight into scales and criteria
- Get feedback on accessibility of handbooks, benchmarks, coding software
Procedure:
- 2 tasks: A2 Lost dog / B1 Keeper for a day
- 6 raters: student teachers of English, proficient in writing English
- First training session (1 day): introduction to CEF, scales and tasks
- Practice 1: 30 scripts per task (over 1 week)
- Second training session (1 day): evaluation & discussion of practice results
- Practice 2: 28 scripts per task (over 1 week)
- Evaluation of results in terms of rating reliability

9 Feasibility Study I, May 2007
Evaluation: Assessing Rater Reliability
Index used: Percent Agreement with Mode
- Measures the percentage of agreement with the value most often awarded, on the level of individual ratings
- Can be aggregated on item (variable) and rater level
- Easily interpreted
- No assumptions about scale level
- No assumptions about value distributions
- No estimation errors
- Can be interpreted as a proxy for validity
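
The index described on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names, the toy ratings (0 = below, 1 = pass, 2 = pass plus) and the tie-breaking rule (the lowest value wins when two values are awarded equally often) are assumptions made for the example.

```python
from collections import Counter


def mode_value(ratings):
    """Return the value awarded most often for one script.

    Ties are broken by taking the smallest value; this rule is an
    assumption, the slides do not say how ties are handled.
    """
    counts = Counter(ratings)
    top = max(counts.values())
    return min(value for value, count in counts.items() if count == top)


def percent_agreement_with_mode(matrix):
    """Compute agreement with the mode for one item.

    matrix[s][r] is the rating given by rater r to script s.
    Returns (item_level, per_rater), both as proportions in [0, 1].
    """
    n_scripts = len(matrix)
    n_raters = len(matrix[0])
    hits = [0] * n_raters
    for row in matrix:
        m = mode_value(row)
        for r, value in enumerate(row):
            if value == m:
                hits[r] += 1
    per_rater = [h / n_scripts for h in hits]
    item_level = sum(hits) / (n_scripts * n_raters)
    return item_level, per_rater


# Hypothetical data: 4 scripts (rows) rated by 4 raters (columns).
ratings = [
    [1, 1, 1, 2],
    [0, 0, 1, 0],
    [2, 2, 2, 2],
    [1, 0, 1, 1],
]

item_rel, rater_rel = percent_agreement_with_mode(ratings)
print(item_rel)   # 0.8125
print(rater_rel)  # [1.0, 0.75, 0.75, 0.75]
```

Because the index only counts matches with the most frequent value, it needs no assumption about scale level or value distribution, which is the property the slide emphasises.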

10 Outcome Feasibility Study I, May 2007
Reliability per Item
ITEM                         REL
TaskFulfilment [Keeper]      0.759
Organisation [Keeper]        0.852
Grammar [Keeper]             0.846
Vocabulary [Keeper]          0.870
Overall [Keeper]             0.858
TaskFulfilment [Lost dog]    0.839
Organisation [Lost dog]      0.863
Grammar [Lost dog]           0.845
Vocabulary [Lost dog]        0.869
Overall [Lost dog]           0.833

11 Outcome Feasibility Study I, May 2007
Reliability per Rater & Item
ITEM                  R01     R02     R03     R04     R05     R06
Overall [Keeper]      0.852   1.000   0.741   0.889   0.704   0.963
Overall [Lost dog]    0.857   0.929   0.786   0.857   0.571   1.000
REL Average           0.847   0.931   0.770   0.826   0.757   0.931

12 Outcome Feasibility Study I, May 2007
- Approach appears feasible
- Scales seem to be usable and applicable
- BUT: We do not know what raters do on the sub-criterion level
- Need to further explore behaviour at descriptor level => Feasibility Study II

14 Feasibility Study II, June 2007
Comparison:
- Holistic scores for the five criteria (FS I)
- Scoring each descriptor on its own and, in addition, scoring the criteria holistically (FS II)
Reasons behind:
- below – pass – pass plus in a uni-level approach targeting a specific population: tendency towards the pass value
- Similar outcomes can be achieved by purely random value distributions at the descriptor level
- Data on scoring each descriptor show whether raters interpret descriptors uniformly before using them to compile the weighted overall criterion rating
- Reliable usage of descriptors is a precondition for valid ratings on the criterion level

15 Outcome Feasibility Study II, June 2007
CRITERIA                   REL
TaskFulfilment [Keeper]    0.81
Organisation [Keeper]      0.83
Grammar [Keeper]           0.85
Vocabulary [Keeper]        0.84
Overall [Keeper]           0.87

16 Outcome Feasibility Study II, June 2007
Descriptors/Criterion Organisation      REL
Organisation_01 [Keeper for a day]      0.75
Organisation_02 [Keeper for a day]      0.56
Organisation_03 [Keeper for a day]      0.73
Organisation_04 [Keeper for a day]      0.82
Organisation_05 [Keeper for a day]      0.54
Organisation_06 [Keeper for a day]      0.83
Organisation_07 [Keeper for a day]      0.84
Organisation_08 [Keeper for a day]      0.66
Organisation_09 [Keeper for a day]      0.63
Organisation [Keeper for a day]         0.83

17 Outcome Feasibility Study II, June 2007
- Fairly high agreement on criterion-level ratings is NOT the result of uniform interpretation of descriptors …
- … BUT rather results from cancellation of deviations on the descriptor level during the compilation of the criterion ratings
- Rating holistic criteria by evaluation of several pre-defined descriptors can only be valid if descriptors are understood uniformly by all raters
- Descriptors need to be revised
- Training and assessment of pilot study has to be conducted on the descriptor level in order to be able to control rating behavior
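
The cancellation effect named on this slide can be reproduced with a small constructed example. The sketch below is illustrative only: the data are invented, the scores use 0 = below, 1 = pass, 2 = pass plus, and the compilation rule (rounded unweighted mean of the descriptor scores) is a hypothetical stand-in for the weighted assessment the raters actually performed.

```python
from collections import Counter


def agreement_with_mode(values):
    """Share of ratings equal to the most frequent value for one script."""
    counts = Counter(values)
    return max(counts.values()) / len(values)


def compile_criterion(descriptor_scores):
    """Hypothetical compilation rule: rounded mean of descriptor scores."""
    return round(sum(descriptor_scores) / len(descriptor_scores))


# Invented data: scripts[s][r][d] is the score rater r gave
# descriptor d on script s (2 scripts, 3 raters, 4 descriptors).
scripts = [
    [[0, 2, 1, 1], [2, 0, 1, 1], [1, 1, 0, 2]],
    [[2, 2, 1, 2], [1, 2, 2, 2], [2, 1, 2, 2]],
]
n_descriptors = 4

# Agreement per descriptor, averaged over scripts.
descriptor_rel = []
for d in range(n_descriptors):
    per_script = [
        agreement_with_mode([rater[d] for rater in script])
        for script in scripts
    ]
    descriptor_rel.append(sum(per_script) / len(per_script))

# Agreement on the compiled criterion rating, averaged over scripts.
criterion_rel = sum(
    agreement_with_mode([compile_criterion(rater) for rater in script])
    for script in scripts
) / len(scripts)

print([round(x, 2) for x in descriptor_rel])  # [0.5, 0.5, 0.67, 0.83]
print(criterion_rel)                          # 1.0
```

In this toy case the raters agree perfectly on the criterion although they disagree on every descriptor at least once, because their deviations offset each other in the mean. That is the pattern the slide describes, and it is why agreement has to be checked at the descriptor level.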

19 Background Pilot Study
- Sample size: N = 2932
- Number of items: Listening: 349; Reading: 391; Writing: 19 tasks
- n = 300 – 370 / item (M = 330)
- All Länder
- All school types
- 8th, 9th and 10th graders

20 Summer Training
- 13 raters, selected on the basis of English language proficiency, study background and DPC coding test
- Challenge of piloting tasks, rating approach and scales simultaneously
First one-week seminar:
- Introduction of CEF, scales and tasks
- Introduction of rating procedures
- Introduction of benchmarks

21 Summer Training
6 one-day sessions:
- Weekly practice
- Discussion & evaluation of practice results
- Introduction of further tasks / levels
- Revision of scale descriptors
Five levels, 19 tasks: simultaneous introduction of several levels and tasks necessary in order to control level and task interdependencies
Three rounds of practice per task ideal:
1. Intro – practice
2. Feedback – practice
3. Feedback – practice
4. Evaluation of reliabilities …

22 Training Progress "Sports Accident", B1
Criterion/descriptors Task Fulfilment    REL Practice 4   REL Practice 5   REL Practice 6
TF 1 [Sports Accident]                   0.65             0.76             0.88
TF 2 [Sports Accident]                   0.66             0.77             0.79
TF 3 [Sports Accident]                   0.87             0.85             0.92
TF 4 [Sports Accident]                   0.80             0.72             0.77
TF 5 [Sports Accident]                   0.70             0.78             0.83
TF gen [Sports Accident]                 0.71, 0.80 (only two values in the transcript)

23 Training Progress "Sports Accident", B1
Criterion/descriptors Organisation    REL Practice 4   REL Practice 5   REL Practice 6
O 1 [Sports Accident]                 0.73             0.77             0.85
O 2 [Sports Accident]                 0.81 (only one value in the transcript)
O 3 [Sports Accident]                 0.72             0.71             0.80
O 4 [Sports Accident]                 0.77             0.79             0.82
O 5 [Sports Accident]                 0.96 (only one value in the transcript)
O gen [Sports Accident]               0.71             0.76             0.81

24 Summer Training
Second one-week seminar:
- Feedback on last round of practice
- Addition of benchmarks for borderline cases
- Addition of detailed justifications for benchmarks
- Finalisation of scale descriptors
- Revision of rating handbooks

25 Comparison FS II - Training
Criteria                     REL FS II   REL Practice 4
[Keeper - TaskFulfilment]    0.81        0.71
[Keeper - Organization]      0.83        0.74
[Keeper - Grammar]           0.85        0.76
[Keeper - Vocabulary]        0.84        0.74
[Keeper - Overall]           0.87        0.77

26 Comparison FS II - Training
FS II
ITEM              REL
O_01 [Keeper]     0.75
O_02 [Keeper]     0.56
O_03 [Keeper]     0.73
O_04 [Keeper]     0.82
O_05 [Keeper]     0.54
O_06 [Keeper]     0.83
O_07 [Keeper]     0.84
O_08 [Keeper]     0.66
O_09 [Keeper]     0.63
O_gen [Keeper]    0.83
Practice 4
ITEM              REL
O 1 [Keeper]      0.75
O 2 [Keeper]      0.73   skipped
O 3 [Keeper]      0.72
O 4 [Keeper]      0.74
O 5 [Keeper]      0.95
O gen [Keeper]    0.74

27 Conclusion
Training concept for the future:
- Materials prepared – weekly seminars not necessary
- Training and rating on descriptor level
- Multiple one-day sessions, one per week to give time for practice
  - Introduction
  - Practice: 3 rounds per task ideal
  - Feedback

28 Thank you for your attention!

29 Contact
Claudia Harsch
Phone: + 49 + (0)30 + 2093 - 5508
Telefax: + 49 + (0)30 + 2093 - 5336
E-mail: Claudia.Harsch@IQB.hu-berlin.de
Website: www.IQB.hu-berlin.de
Mail address: Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, GERMANY
Guido Martin
Phone: + 49 + (0)40 + 48 500 612
E-mail: guido.martin@iea-dpc.de
Website: www.iea-dpc.de
Mail address: IEA DPC, Mexikoring 37, D-22297 Hamburg, GERMANY

