Dávid Gergely: Building a Case for Euro Examinations or A case study
The Mission of the Study: Pilot the Manual and see how good the linking methodology is. Get initial measures for items and tasks calibrated to the CEF. Establish a link between the Euro examinations and the CEF. In sum, validate the test by following the methodology outlined in the Manual, and build a case for the CEF link by collecting validity evidence.
Initial decisions by Euro: Management decision to select GramVoc, only a question of finances. North: the most difficult task you could pick. Unpopular kind of test. The Dutch CEF Construct project focused on reading and listening. ALTE produced grids for speaking and listening. Any CEF scales relevant to the GramVoc paper?
Productive orientation of the CEF. General Linguistic Range, B2: "Has a sufficient range of language to be able to give clear descriptions, express viewpoints and develop arguments without much conspicuous searching for words, using some complex sentence forms to do so."
In retrospect: Advantages of selecting the GramVoc paper. Knowledge of the language underlies all other skills in the examination. The GramVoc project serves as a pilot for the rest of the Euro papers. As part of the efforts of the Hungarian Accreditation Board, Euro Examinations will do level-setting exercises for all skills-based Euro papers.
Process and Audience for the Case Study: Four phases of action according to the Manual: Familiarization, Specification, Standardization (of judgements), Empirical validation. Working with the team of full-time item writers as holders of standards.
The Familiarization Phase 1: Survey of familiarity with the CEF scales. Descriptors from 15 scales, 133 items, presented as in a test. Statistical analyses. Initial facility value of responses: 0.4. Low? How low? For 16 of the 133 descriptors nobody got the level right. Significantly more B1 descriptors. No descriptor -- same descriptor problem.
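The facility value on this slide is the share of responses that match the keyed CEF level. A minimal sketch of that calculation follows, assuming each judge's answers are stored as a list of level labels; the function name and the toy data are hypothetical and are not the study's data.

```python
def facility_values(responses, key):
    """Per-descriptor share of judges who assigned the keyed CEF level."""
    n_judges = len(responses)
    return [
        sum(1 for judge in responses if judge[i] == key[i]) / n_judges
        for i in range(len(key))
    ]

# Hypothetical toy data: 4 judges, 5 descriptors (NOT the study's data).
key = ["A2", "B1", "B2", "C1", "B1"]
responses = [
    ["A2", "B1", "C1", "C1", "B2"],
    ["A2", "B2", "C1", "B2", "B2"],
    ["B1", "B1", "C1", "C1", "B2"],
    ["A2", "B2", "B2", "C2", "B2"],
]

per_item = facility_values(responses, key)
overall = sum(per_item) / len(per_item)             # overall facility value
never_correct = sum(1 for p in per_item if p == 0)  # descriptors nobody got right

print(per_item, overall, never_correct)
```

In the toy data the last descriptor is placed one level too high by every judge, so it counts as a descriptor that nobody got right.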
The Familiarization Phase 2: Insights from categorizing descriptors. No correct identification of level, spread of responses: 16. Fewer than 50% of the team correctly identified the level: 55. 50% or more correctly identified the level: 62. In cases of uncertainty, there was a tendency to place the level of a descriptor higher than in the CEF. Lower Euro standards? Leniency? Chi-squares: leniency is not related to any of the scales, but it is related to level B2.
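A chi-square test of independence of the kind mentioned on this slide can be sketched as follows. The counts are hypothetical (rows: descriptor placed higher than in the CEF or not; columns: B2 descriptor or other level), and the slides do not say how the actual tables were laid out, so the layout is an assumption.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical counts (NOT the study's data).
#          B2   other levels
table = [[30, 20],   # placed higher than in the CEF
         [20, 50]]   # not placed higher

stat, df = chi_square(table)
# 3.841 is the critical value for df = 1 at the 0.05 level.
print(stat, df, stat > 3.841)
```

A statistic above the critical value would suggest that leniency and level are not independent in the tabulated data.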
The Specification Phase: a qualitative content audit. Lack of yardsticks for a test like GramVoc. The Van Ek and Trim volumes are not useful. The CEF provides descriptions of 15 categories, but without level specification (pp. 108-117). Euro specifications need attention. Two lines of work: (1) elucidating item-writers' concepts; (2) expert analysis of what (item focuses) actually goes into the test, on the basis of scope, gradation and stability between 2 consecutive test administrations.
Specification Phase 2: Elucidating item-writers' concepts. Are item-writers' conceptualisations of levels coherent? In line with the CEF? In line with Euro specifications? Item writers select the best task for each task type and level, and answer: "What is it that makes this task the best for you?" Series of workshops to bring item-writers' conceptualisations to light.
Specification Phase 3: Expert analysis of item focuses. Evidence of construct under-representation? Anything else measured, other than the construct? Items that generate construct-irrelevant variance? 2 experts identify item focuses, then jointly finalize the classification of items according to the 15 CEF categories. Predict problematic items.
Results of the Specification Phase: Item-writers' concepts broadly match the CEF. Better overall results than in the familiarization phase. Statistical analysis of expert classifications: the distribution of focuses is related to task type and author (text), but not to level or administration. Results were similar when two administrations at the same level were compared. The lack of significant focus differences by level prompted an investigation of item complexity. Statistical test inconclusive: p = 0.05.
The Standardization of Judgements, Line 1: Investigating the gap between local Euro standards and the CEF standards. On the basis of collations, item-writers identified descriptors whose content exceeded local standards. Tabulation and qualitative analysis of responses. History of descriptors taken into account. The gap does not widen up the CEF scale. It is most conspicuous at B2, but less considerable if descriptor history is accounted for. Why do the uncalibrated descriptors represent a higher level of requirements than those that went through calibration?
Standardization of Judgements, Line 2: Video rating conference. CEF performance samples: link to North's rating conference (1996/2000). A second-best option and its problems. How similar were the Euro item-writers' ratings to each other's? Encouraging results. Reliability of scale use: Cronbach's alpha 0.96, Kendall's W 0.85.
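The two agreement statistics reported on this slide can be computed from a raters-by-samples table of ratings. The sketch below uses the standard formulas (Kendall's W with tie correction; Cronbach's alpha with each rater treated as an "item"). The function names, the numeric coding of CEF levels and the toy ratings are assumptions for illustration, not the study's data or procedure.

```python
from statistics import variance

def average_ranks(scores):
    """Rank scores from 1..n, giving tied scores their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

def kendalls_w(ratings):
    """Kendall's W with tie correction; ratings[r][s] is rater r on sample s."""
    m, n = len(ratings), len(ratings[0])
    ranked = [average_ranks(row) for row in ratings]
    rank_sums = [sum(row[s] for row in ranked) for s in range(n)]
    mean_sum = m * (n + 1) / 2
    s_stat = sum((r - mean_sum) ** 2 for r in rank_sums)
    ties = sum(row.count(v) ** 3 - row.count(v) for row in ratings for v in set(row))
    return 12 * s_stat / (m ** 2 * (n ** 3 - n) - m * ties)

def cronbach_alpha(ratings):
    """Cronbach's alpha, treating each rater as an 'item'."""
    k = len(ratings)
    rater_vars = sum(variance(row) for row in ratings)
    totals = [sum(col) for col in zip(*ratings)]
    return k / (k - 1) * (1 - rater_vars / variance(totals))

# Hypothetical ratings: 3 raters x 6 samples, CEF levels coded A1=1 ... C2=6.
ratings = [
    [1, 2, 3, 4, 5, 6],
    [1, 3, 3, 4, 6, 6],
    [2, 2, 3, 5, 5, 6],
]
w = kendalls_w(ratings)
alpha = cronbach_alpha(ratings)
print(round(w, 3), round(alpha, 3))
```

Both statistics equal 1 when all raters give identical ratings; W measures agreement on rank order only, which is why the tie correction matters when ratings are made on a six-level scale.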
Standardization of Judgements, Line 3: Standard setting. About 20 scripts per level for both the 2003 and 2004 tests. An examinee-based method. Scripts carefully chosen and arranged in decreasing order of ability. Overfitting candidates. Info about items. Rating done twice, bearing in mind: Round 1: conventional Euro standards, Kendall's W ranged 0.8 - 0.83; Round 2: CEF standards, Kendall's W ranged 0.75 - 0.79. Results provided additional info about Line 1.
Empirical Validation Phase: Empirical validation started very early. Internal validation: item analyses (independent analyses; joint analyses of same-level papers). External validation: use standard setting data from the Standardization phase as ratings; calibrate overall test difficulties; anchor item means of independent analyses to calibrated overall test difficulties; use a corrected version of North's scale; compare the cutoffs obtained in this way with conventional Euro cutoffs.
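One simple reading of the anchoring step is a mean shift: item difficulties from an independent analysis are moved by a constant so that their mean equals the calibrated overall test difficulty. The sketch below illustrates only that reading. The slides do not specify the procedure actually used, and the function name and the values are hypothetical.

```python
from statistics import mean

def anchor_to_test_difficulty(item_difficulties, calibrated_test_difficulty):
    """Shift item difficulties so their mean equals the calibrated test difficulty."""
    shift = calibrated_test_difficulty - mean(item_difficulties)
    return [d + shift for d in item_difficulties]

# Hypothetical item difficulties from an independent analysis (NOT study data).
items = [-0.8, -0.1, 0.3, 0.6]
# Hypothetical calibrated overall difficulty of the same paper.
anchored = anchor_to_test_difficulty(items, 1.2)
print(anchored)
```

A constant shift leaves the differences between items unchanged, so the relative ordering and spacing found in the independent analysis are preserved on the common scale.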