The Swiss 'IEF' Project - Assessment Instruments Supporting the ELP
by Peter Lenz, University of Fribourg/CH
Voss/N, 3/06/05

2001 - EYL: Launch of ELP 15+ in Switzerland
In 2001 the Swiss Conference of Cantonal Ministers of Education recommended that the cantons:
- consider the CEFR
  - in curricula (objectives and levels)
  - in the recognition of diplomas
- facilitate wide use of the ELP 15+
  - make the ELP accessible to learners
  - help teachers integrate the ELP in their teaching
- develop ELPs for younger learners

Integrating the ELP: School Teachers' Wishlist
- More descriptors tailored to young learners' needs
- Less abstract formulations
- Self-assessment grid and checklists with finer levels
- Tools facilitating "hard" assessment:
  - Test tasks relating to descriptors
  - Marked and assessed learner texts
  - Assessed spoken learner performances on video
  - Assessment criteria relating to finer levels for Speaking and Writing

Meeting the Needs for ELP 11-15
IEF Project (2002-2005): "Instruments for the Assessment of Foreign-Language Competences - English, French"
- German-speaking cantons of Switzerland
- Principality of Liechtenstein (FL)
- Bildungsplanung Zentralschweiz
- Peter Lenz & Thomas Studer (UniFR)

IEF: Overview of Expected Products
- Bank of target-group-specific descriptors (levels A1.1-B2.1)
- (Self-)assessment checklists
- ELP
- Bank of validated test tasks (5 "skills"; C-tests)
- Tests available from publisher
- Tests for evaluations
- Assessment criteria (Speaking, Writing)
- Benchmark performances (Speaking, Writing)
- Training packages for teacher training

Phase I: A New Bank of Can-do Statements
How did the new descriptors take shape?
1) Collecting from written sources (ELPs, textbooks, other sources)
2) Validating, amending the collection in workshops
   - Teachers decide on relevance for target learners and on suitability for assessment
   - Teachers complement the collection
3) Fine-tuning and selecting descriptors
   - Make formulations unambiguous and accessible; add examples
   - Select descriptors to cover the whole range of levels A1.1 - B2.1
   - Represent a wide range of skills and tasks
   - Result: ~330 descriptors for the empirical phase

Phase I: Calibrating Additional Descriptors
Assessment questionnaires: teachers assess their pupils, following Schneider & North's methodology for the CEFR.

- Linked and anchored assessment questionnaires of 50 descriptors each for different levels
- 2 parallel sets of descriptors of similar difficulty per assumed level
- Identical descriptors as links (and sometimes CEFR anchors)
- Too few learners at B2

Phase I: Calibrating Additional Descriptors
Statistical analysis and scale-building (A1.1 - B1.2)

Phase II: Adapting Descriptors for Self-assessment
- Bank of target-group-specific descriptors (levels A1.1-B2.1)
- (Self-)assessment checklists
- ELP

Phase II: Reformulations - "Can ..." to "I can ..."
1. Some Can-do statements are transformed into I-can statements.
2. Learners are asked for feedback: they assess themselves and give feedback on doing so.
3. The whole bank of Can-do statements is transformed into I-can statements.

Phase II: Checklists for the New Swiss ELP
(Self-)assessment checklists, drawing on 3 sources
- Bank of target-group-specific descriptors (levels A1.1-B2.1)

ELP II: Self-assessment in Relation to Finer Levels

Phase III: Developing Test Tasks and Instruments
- Bank of target-group-specific descriptors (levels A1.1-B2.1)
- (Self-)assessment checklists
- ELP
- Bank of validated test tasks

Phase III: Test Tasks and Instruments
1) Test tasks relating to communicative language ability
   - Speaking tasks (production and interaction)
   - Writing tasks
   - Listening tasks
   - Reading tasks
2) C-Tests (integrative tests)
   - C-Tests (a type of cloze test) are said to provide reliable information on a learner's linguistic resources, especially for (written) production.
- Most test tasks are related to one descriptor, sometimes two. Open question: how does descriptor difficulty relate to task difficulty?
- The test tasks are field-tested and attributed to a level, at least tentatively.
- Validation: tests + teacher questionnaires
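The C-test format mentioned on this slide can be illustrated with a minimal Python sketch. It assumes the commonly described construction rule (blank the second half of every second word). The function name `make_c_test`, its parameters, and the rounding rule for odd-length words are illustrative assumptions, not the IEF project's actual procedure.

```python
import re


def make_c_test(text: str, start_word: int = 1, step: int = 2) -> str:
    """Blank the second half of every `step`-th word (C-test style).

    Assumption: for odd-length words the first half is rounded up,
    so "cat" becomes "ca_". Real C-tests may use another rule.
    """
    out = []
    for i, word in enumerate(text.split()):
        # Split a purely alphabetic core from trailing punctuation.
        match = re.match(r"^([A-Za-z]+)(\W*)$", word)
        is_target = i >= start_word and (i - start_word) % step == 0
        if match and is_target and len(match.group(1)) > 1:
            core, tail = match.groups()
            keep = (len(core) + 1) // 2  # letters left visible
            out.append(core[:keep] + "_" * (len(core) - keep) + tail)
        else:
            # Non-target words, one-letter words and words with
            # internal punctuation are left untouched.
            out.append(word)
    return " ".join(out)


if __name__ == "__main__":
    print(make_c_test("The cat sat on the mat."))
```

In practice the first sentence of a C-test passage is usually left intact to give context; that could be approximated here by raising `start_word`.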

Phase IV: Assessment Criteria for Performances
- Bank of target-group-specific descriptors (levels A1.1-B2.1)
- (Self-)assessment checklists
- ELP
- Bank of validated test tasks (mainly performance-oriented)
- Assessment criteria for Speaking and Writing

Phase IV: Developing Criteria for Speaking
How did the criteria take shape? Steps taken:
1) Collecting criteria
   - Collect criteria from various sources: CEFR, examination schemes...
2) Assessing spoken performances in workshops
   - Teachers bring video recordings
   - Teachers describe differences between learner performances they can watch on video, which yields more criteria
   - Teachers adopt and apply descriptors from the existing collection
   - Teachers agree on essential categories (e.g. Vocabulary range, Pronunciation/Intonation) and build a scale for each analytical category
3) Preparing empirical validation
   - Decide on categories to be retained
   - Revise and complete proposed scales of analytical criteria

Phase IV: Producing Video Tapes With Spoken Performances
One learner - different tasks in various settings

Phase IV: Empirical Validation of Speaking Criteria
Methodology
A total of 35 teachers (14 French, 21 English) apply
- 58 analytical criteria (some from the CEFR) belonging to 5 categories
- 28 task-based descriptors (matching the performed tasks)
to 10 or 11 video-taped learners per language, each performing 3-4 spoken tasks.
Criteria categories
- Interaction
- Vocabulary range
- Grammar
- Fluency
- Pronunciation & Intonation

Phase IV: Calibrating Criteria for Speaking
Criteria and questionnaires: a linked and anchored design
- 3 assessment questionnaires for three different learner levels
- CEFR anchors
- Links between questionnaires
For reasons of practicality, only a 3-step rating scale is used for Can descriptors/criteria:
- "Statement applies to this pupil but s/he can do clearly better"
- "Statement generally applies to this pupil"
- "Statement doesn't apply to this pupil"

Phase IV: Criteria for Speaking - Analysis (1)
The 5 analytical categories retained: Correlations and Fit

               Interaction  Vocab  Grammar  Fluency  Pronunciation  Overall
Interaction    1.00
Vocab          0.99         1.00
Grammar        1.00         0.99   1.00
Fluency        0.98         0.99            1.00
Pronunciation  0.92         0.93                     1.00
Overall        1.00                                  0.93           1.00

- Disattenuated correlations between pupil measures suggest proximity of categories/competences, except for Pronunciation/Intonation.
- FACETS indicates slight misfit for Fluency and overfit for Interaction (.88).
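For readers unfamiliar with disattenuated correlations, here is a minimal Python sketch of the standard correction for attenuation (observed correlation divided by the square root of the product of the two reliabilities). The function name and the numbers in the example are hypothetical; the slide does not report observed correlations or reliabilities, and this is not necessarily how the figures in the table were computed.

```python
import math


def disattenuate(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Correct an observed correlation for measurement error.

    r_xy  : observed correlation between two sets of measures
    rel_x : reliability of the first set (0 < rel_x <= 1)
    rel_y : reliability of the second set (0 < rel_y <= 1)
    """
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(round(disattenuate(0.72, 0.81, 0.64), 2))
```

Because the correction divides by a number no larger than 1, the disattenuated value is never smaller in magnitude than the observed one, and estimates can exceed 1 through sampling error.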

Phase IV: Criteria for Speaking - Analysis (2)
Criteria applied to French and English: diagnosing DIF (differential item functioning)

Phase IV: Criteria for Speaking - Analysis (3)
Teacher severity and consistency
- Consistency: 5 out of 35 raters were removed from the analysis due to misfit (infit mean square of up to 2.39).
- Severity: Some extreme raters (severe or lenient) show a strong need for rater training, although every criterion makes a meaningful (though somewhat abstract) statement on mostly observable aspects of competence.
Map for English
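The infit mean square used to screen raters can be illustrated with a short Python sketch. It is a deliberately simplified, dichotomous Rasch version (one ability, right/wrong responses); the study itself used FACETS with a 3-step rating scale and a rater facet, so the function names and the toy data below are assumptions for illustration only.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of success under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def infit_mean_square(responses, ability, difficulties):
    """Information-weighted mean-square fit statistic.

    Sum of squared residuals divided by the sum of model variances.
    Values near 1 indicate fit; values well above 1 indicate erratic
    (misfitting) response patterns; values well below 1 indicate overfit.
    """
    squared_residuals = 0.0
    model_variance = 0.0
    for x, d in zip(responses, difficulties):
        p = rasch_probability(ability, d)
        squared_residuals += (x - p) ** 2
        model_variance += p * (1.0 - p)
    return squared_residuals / model_variance


if __name__ == "__main__":
    # Toy pattern: failing an easy item and passing a hard one.
    print(infit_mean_square([0, 1], 0.0, [-2.0, 2.0]))
```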

Phase IV: Criteria for Speaking - Anchoring (1)
Potential anchors towards the CEFR in the data:
- 11 analytical criteria from the CEFR, in a linked design (mostly Interaction and Vocabulary); used here
- A total of 28 task-oriented descriptors from the IEF bank in a linked design
- Known scores of 3 learners of English rated in a teacher workshop on a CEFR basis

Phase IV: Criteria for Speaking - Anchoring (2)
- CEFR difficulties (x-axis) and IEF difficulties (y-axis) of the criteria are plotted (blue diamonds), using a scaling factor to equate the separate calibrations.
- Lines visualize the usual 95% confidence interval, which helps detect items that are not suitable as anchors (outliers).
- Outlier, perceived as more difficult in IEF (by over 3 logits): "Can link groups of words with simple connectors like 'and', 'but' and 'because'."
- Outlier removed
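The anchor check described on this slide can be sketched in Python as follows. The slide does not say how the scaling factor was obtained, so this sketch assumes mean-sigma linking (ratio of standard deviations plus a shift), and it assumes that a standard error is available for every difficulty estimate. The function name, parameters, and all numbers are hypothetical.

```python
import math
from statistics import mean, pstdev


def flag_anchor_outliers(cefr, ief, se_cefr, se_ief, z=1.96):
    """Return indices of items whose two calibrations disagree.

    cefr, ief       : difficulty estimates (logits) of the same items
                      from the two separate calibrations
    se_cefr, se_ief : standard errors of those estimates
    z               : 1.96 corresponds to a 95% confidence band
    """
    # Assumed linking method: put IEF values on the CEFR scale.
    scale = pstdev(cefr) / pstdev(ief)
    shift = mean(cefr) - scale * mean(ief)

    flagged = []
    for i, (x, y, sx, sy) in enumerate(zip(cefr, ief, se_cefr, se_ief)):
        y_equated = scale * y + shift
        joint_se = math.sqrt(sx ** 2 + (scale * sy) ** 2)
        if abs(y_equated - x) > z * joint_se:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Hypothetical data: the third item is much harder in the IEF set.
    cefr = [-2.0, -1.0, 0.0, 1.0, 2.0]
    ief = [-2.0, -1.0, 3.0, 1.0, 2.0]
    se = [0.5] * 5
    print(flag_anchor_outliers(cefr, ief, se, se))
```

Since an outlier distorts the scale and shift themselves, a flagged item would normally be dropped and the linking recomputed on the remaining anchors, which matches the slide's "outlier removed" step.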

Phase IV: Criteria for Speaking - Taking Stock
- The calibrations of the video-taped learners are very plausible.
- IEF now has video-taped examples of learners from below A1 (French only) to a very high B2.
- The additional, newly developed criteria are well spread across the targeted level range A1-B2 (A1.1?).
But: What will the assessment instruments to be used in schools look like?

Phase IV: Assessment instruments for Speaking
Problem: the middle category "Statement generally applies to this pupil" is desirable (because of its meaningfulness) but possibly too broad.
Range here: -2.57 to +2.57 logits
Solutions?
- Other formulations for narrower categories?
- Use e.g. a B1.2 descriptor to establish A2.1 of a learner?
- ???
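To see why a middle category spanning roughly 5 logits is broad, the following Python sketch computes category probabilities under a Rasch rating scale model. It assumes that the reported range bounds (-2.57 and +2.57) act as the two category thresholds relative to a criterion's difficulty; the slide does not state this explicitly, and the function name and parameters are illustrative.

```python
import math


def category_probabilities(ability, difficulty, thresholds):
    """Category probabilities under the Rasch rating scale model.

    thresholds: step thresholds relative to the item difficulty;
    two thresholds give three categories (0, 1, 2).
    """
    # Cumulative logits: category 0 is the reference (0.0).
    logits = [0.0]
    running = 0.0
    for tau in thresholds:
        running += ability - difficulty - tau
        logits.append(running)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    thresholds = [-2.57, 2.57]  # assumed reading of the slide's range
    for ability in (-4.0, -2.57, 0.0, 2.57, 4.0):
        probs = category_probabilities(ability, 0.0, thresholds)
        print(ability, [round(p, 2) for p in probs])
```

Under these assumptions the middle category is the most probable rating for any pupil within about 2.57 logits of the criterion's difficulty on either side, so a single "generally applies" rating says little about where in that span the pupil lies.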

Phase IV: Assessment instruments for Speaking
Narrower categories: can the middle category be divided up into three?
0 …
1 Pupil has this ability only partially.
2 Pupil generally has this ability.
3 Pupil fully has this ability.
4 …
Range of middle category: -1.2 to +0.8 logits
Main problem: Raters have the impression that they are applying modifiers upon modifiers, new restrictions upon restrictions already present in the criteria.

