Presentation on theme: "Principles and Practice in Language Testing: Compliance or Conflict?"— Presentation transcript:
1 Principles and Practice in Language Testing: Compliance or Conflict?
J Charles Alderson, Department of Linguistics and English Language, Lancaster University
2 European Standards in Language Assessment
INTO EUROPE
3 Trailer for the whole Conference
Outline
The Past
The Past becoming Present – Present Perfect?
The Future?
4 Standards? Shorter OED:
· Standard of comparison or judgement
· Definite level of excellence or attainment
· A degree of quality
· Recognised degree of proficiency
· Authoritative exemplar of perfection
· The measure of what is adequate for a purpose
· A principle of honesty and integrity
5 Standards? Report of the Testing Standards Task Force, ILTA 1995
(ILTA = International Language Testing Association)
Levels to be achieved
Principles to follow
6 Standards as Levels: FSI / ILR / ACTFL / ASLPR
FSI: Foreign Service Institute
ILR: Interagency Language Round Table
ACTFL: American Council for the Teaching of Foreign Languages
ASLPR: Australian Second Language Proficiency Ratings
7 Standards as Levels: Europe?
Beginner / False Beginner / Intermediate / Post-Intermediate / Advanced
How defined?
Threshold Level?
8 Standards as Principles
Validity
Reliability
Authenticity?
Washback?
Practicality?
9 Psychometric tradition
· Tests externally developed and administered
· National or regional agencies responsible for development, following accepted standards
· Tests centrally constructed, piloted and revised
· Difficulty levels empirically determined
· Externally trained assessors
· Empirical equating to known standards or levels of proficiency
10 Standards as Principles
In Europe:
Teacher knows best
Having a degree in a language means you are an ‘Expert’
Experience is all
But 20 years’ experience may be one year repeated twenty times, and it is never checked
11 Past (?) European tradition
Quality of important examinations not monitored
No obligation to show that exams are relevant, fair, unbiased, reliable, and measure relevant skills
University degree in a foreign language qualifies one to examine language competence, despite lack of training in language testing
In many circumstances merely being a native speaker qualifies one to assess language competence
Teachers assess students’ ability without having been trained
12 Past (?) European tradition
· Teacher-centred
· Teacher develops the questions
· Teacher's opinion the only one that counts
· Teacher-examiners are not standardised
· Assumption that by virtue of being a teacher, and having taught the student being examined, the teacher-examiner makes reliable and valid judgements
· Authority, professionalism, reliability and validity of teacher rarely questioned
· Rare for students to fail
13 Past becoming Present: Levels
Threshold 1975 / Threshold 1990
Waystage / Vantage
Breakthrough / Effective Operational / Mastery
CEFR 2001: A1 – C2
Translated into 23 languages so far, including Japanese!
14 Past becoming Present: Levels
CEFR: enormous influence since 2001
ELP contributes to spread
Claims abound
Not just exams but also curricula / textbooks
But: Alderson 2005 survey
15 Survey of use of CEFR in universities
“Which universities are trying to align their curricula for language majors and non-language majors to the CEFR?”
Consulted:
EALTA (European Association for Language Testing and Assessment)
Thematic Network Project for Languages
European Language Council
16 Survey of use of CEFR in universities
Follow-up questions about methodology:
“Exactly what process and procedures do you use to do the alignment of curricula to the CEFR?”
“How do you know when students have achieved the appropriate standard?”
17 Survey of use of CEFR in universities
Answers? “You certainly know how to ask the very tricky questions”
In general: familiarity with the CEFR is claimed, but the evidence suggests that this familiarity is extremely superficial and that little thought has been given to either question. Claims of levels are made without accompanying evidence – in universities!
18 Manual for linking exams to CEFR
Familiarisation – essential, even for “experts”, whose knowledge is usually superficial
Specification
Standard setting
Empirical validation
19 Manual for linking exams to CEFR
BUT FIRST: if an exam is not valid or reliable, it is meaningless to link it to the CEFR
20 Standards as Principles: Validity
Rational, empirical, construct
Internal and external validity
Face, content, construct
Concurrent, predictive
Construct
21 How can validity be established?
My parents think the test looks good.
The test measures what I have been taught.
My teachers tell me that the test is communicative and authentic.
If I take the SFLEB (Rigó utca) instead of the FCE, I will get the same result.
I got a good English test result, and I had no difficulty studying in English at university.
22 How can validity be established?
Does the test match the curriculum, or its specifications?
Is the test based adequately on a relevant and acceptable theory?
Does the test yield results similar to those from a test known to be valid for the same audience and purpose?
Does the test predict a learner’s future achievements?
23 How can validity be established?
Note: a test that is not reliable cannot, by definition, be valid.
All tests should be piloted, and the results analysed to see whether the test performed as predicted.
A test’s items should work well: they should be of suitable difficulty, and good students should get them right, whilst weak students are expected to get them wrong.
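The two item properties mentioned here, suitable difficulty and separation of strong from weak candidates, are what classical item analysis of pilot data measures: facility (proportion correct) and discrimination. A minimal sketch, with invented pilot data (the function name and the top-third/bottom-third comparison are illustrative choices, not a procedure from the talk):

```python
def item_analysis(responses):
    """Return (facility, discrimination) per item for 0/1 response data.
    Discrimination here is the facility gap between the strongest and
    weakest thirds of the candidates: positive means good students
    get the item right more often than weak ones, as they should."""
    n = len(responses)
    totals = [sum(row) for row in responses]
    ranked = sorted(range(n), key=lambda i: totals[i], reverse=True)
    k = n // 3
    top, bottom = ranked[:k], ranked[-k:]
    stats = []
    for item in range(len(responses[0])):
        facility = sum(row[item] for row in responses) / n
        disc = (sum(responses[i][item] for i in top) / k
                - sum(responses[i][item] for i in bottom) / k)
        stats.append((facility, disc))
    return stats

# Invented pilot data: rows = candidates, columns = items; 1 = correct.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
]
for i, (fac, disc) in enumerate(item_analysis(responses), start=1):
    print(f"Item {i}: facility={fac:.2f}, discrimination={disc:+.2f}")
```

An item with very high or very low facility, or with discrimination near zero or negative (item 4 above), is exactly what piloting is meant to catch before the live exam.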
24 Factors affecting validity
Unclear or non-existent theory
Lack of specifications
Lack of training of item / test writers
Lack of / unclear criteria for marking
Lack of piloting / pre-testing
Lack of detailed analysis of items / tasks
Lack of standard setting
Lack of feedback to candidates and teachers
25 Standards as Principles: Reliability
Over time: test – re-test
Over different forms: parallel
Over different samples: homogeneity
Over different markers: inter-rater
Within one rater over time: intra-rater
26 Standards as Principles: Reliability
If I take the test again tomorrow, will I get the same result?
If I take a different version of the test, will I get the same result?
If the test had had different items, would I have got the same result?
Do all markers agree on the mark I got?
If the same marker marks my test paper again tomorrow, will I get the same result?
27 Factors affecting reliability
Poor administration conditions – noise, lighting, cheating
Lack of information beforehand
Lack of specifications
Lack of marker training
Lack of standardisation
Lack of monitoring
28 Present: Practice and Principles
· Teacher-based assessment vs central development
· Internal vs external assessment
· Quality control of exams or no quality control
· Piloting or not
· Test analysis and the role of the expert
· The existence of test specifications – or not
· Guidance and training for test developers and markers – or not
30 Exam Reform in Europe (mainly school-leaving exams)
Slovenia
The Baltic States
Hungary
Russia
Slovakia
Czech Republic
Poland
Germany
31 Hungarian Exams Reform Teacher Support Project
Project philosophy: “The ultimate goal of examination reform is to encourage, to foster and to bring about change in the way language is taught and learned in Hungary.”
32 Hungarian Exams Reform Teacher Support Project
“Testing is about ensuring that those tests and examinations which society decides it needs, for whatever purpose, are the best possible and that they represent the best not only in testing practice but in teaching practice, and that the tests reflect the aspirations of professional language teachers. Anything less is a betrayal of teachers and learners, as is a refusal to engage in testing.”
33 Achievements of Exam Reform Teacher Support Project
Trained item writers, including class teachers
Trained teacher trainers and disseminators
Developed, refined and published Item Writer Guidelines and Test Specifications
Developed a sophisticated item production system
34 Achievements of Exam Reform Teacher Support Project
Developed sets of rating scales and trained markers
Developed Interlocutor Frame for speaking tests and trained interlocutors
Items / tasks piloted, IRT-calibrated and standard-set to the CEFR using DIALANG / Kaftandjieva procedures
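IRT calibration places items from different pilot forms onto one common difficulty scale. A full calibration of the DIALANG / Kaftandjieva kind requires specialist IRT software and anchor items, but the basic idea can be sketched: converting each item’s pilot facility into a logit difficulty, the scale on which Rasch-family models work. This is only a first-pass illustration under that simplification, not the project’s actual procedure:

```python
import math

def logit_difficulty(p_correct):
    """First-pass item difficulty in logits: 0 means 50% facility,
    positive = harder, negative = easier. Real IRT calibration refines
    these estimates jointly with candidate abilities and anchor items."""
    return math.log((1 - p_correct) / p_correct)

# Invented facility values (proportion correct) from a pilot administration.
pilot_p = [0.85, 0.62, 0.50, 0.31]
for i, p in enumerate(pilot_p, start=1):
    print(f"Item {i}: facility={p:.2f} -> "
          f"difficulty={logit_difficulty(p):+.2f} logits")
```

Because the logit scale is interval-like and (in IRT) independent of the particular pilot sample, items calibrated this way can be compared across forms and years, which is what makes equating and CEFR standard setting possible.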
35 Achievements of Exam Reform Teacher Support Project
Into Europe series: textbook series for test preparation:
many calibrated tasks
explanations of rationale for task design
explanations of correct answers
CDs of listening tasks
DVDs of speaking performances
36 Achievements of Exam Reform Teacher Support Project
Into Europe:
Reading + Use of English
Writing Handbook
Listening + CDs
Speaking Handbook + DVD
37 Achievements of Exam Reform Teacher Support Project
In-service courses for teachers in modern test philosophy and exam preparation:
Modern Examinations Teacher Training (60 hrs)
Assessing Speaking at A2/B1 (30 hrs)
Assessing Speaking at B2 (30 hrs)
Assessing Writing at A2/B1 (30 hrs)
Assessing Writing at B2 (30 hrs)
Assessing Receptive Skills (30 hrs)
38 Present Perfect: Positive features
National exams, designed, administered and marked centrally
External exam replaces locally produced, poor-quality exams
National and regional exam centres to manage the logistics
Results are comparable across schools and provinces
Exams are recognised for university entrance
39 Present Perfect: Positive features
Secondary school teachers are involved in all stages of test development
Tests of communicative skills rather than traditional grammar
Teams of testing experts firmly located in classrooms have been developed
Items developed by teams of trained item writers
Tests piloted and results analysed
Rating scales developed for rating performances
40 Present Perfect: Positive features
Scripts anonymised and marked by trained examiners, not the candidate’s own class teacher
Nature and rationale for changes communicated to teachers
Many training courses for teachers, including explicit guidance on exam preparation
Teachers largely enthusiastic about the changes
Positive washback claimed by teachers
41 Present Perfect: Positive features
Exams beginning to be related to the CEFR
Comparability across cities, provinces, countries and regions
Transparency, recognition and portability of qualifications
Valuable for employers
Yardstick for evaluating achievement of pupils and schools
42 Unprofessional
No piloting, especially of Speaking and Writing tasks
Using calibrated (speaking) tasks but then changing rubrics, aspects of items, texts
Leaving speaking tasks up to teachers to design and administer, typically without any training in task design
Administering speaking tasks to Year 9 students in front of the whole class
Administering speaking tasks to one candidate whilst four or more others are preparing their performance in the same room
43 Unprofessional
No training of markers
No double marking
No monitoring of marking
No comparability of results across schools, markers, towns, regions or years (test equating)
No guidance on how to use centrally devised scales, how to resolve differences, or how to weight different components; no guidance on what an “adequate” performance is
44 Unprofessional
No developed item production system:
Pre-setting cut scores without knowledge of test difficulty
No understanding that the difficulty of a task, item or test will affect the appropriacy of a given cut-score
Belief that a “good teacher” can write good test items, and that training, moderation, revision and discussion are not needed
Lack of feedback to item writers on how their items performed, either in piloting or in the live exam
45 Unprofessional
Failure to accept that a “good test” can be ruined by inadequate administrative conditions, lack of or inadequate training of markers, lack of monitoring of marking, or lack of double / triple marking
46 Dubious activities?
Using other people’s tasks without acknowledgement
“Calibrating” new tasks against Into Europe or UCLES specimen tasks without any reference to Into Europe or UCLES statistics
If a test is supposed to be A2/B1 (e.g. the Hungarian érettségi), when and how do you decide that a given performance is A2, not B1?
Exemption from school exams if a recognised exam has been passed: free, valid certificates should complete free, valid public education
47 Naïve?
Use of terminology, e.g. “calibration”, “validity”, “reliability”, without understanding what it means, or knowing that there are technical definitions
Doing classical item analysis and calling that “calibration”
Not using population-independent statistics with an appropriate anchor design
Lack of acknowledgement that it is impossible to know in advance how difficult an item or a task will be
48 Naïve?
No standard setting: simple and naïve belief that if an item writer says an item is B1, then it is
No problematising of “mastery”: is a test taker “at a level” if she gets 100% of B1 items right? 80%? 60%? 50%?
What if a test taker gets some items at a higher level right? At what point does that person “go up a level”?
No problematising of the conversion of a performance on a test of a given level to a grade result (1 – 5 or A – D)
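What the slide is objecting to is skipping the step that turns a raw score into a defensible level claim. Once a standard-setting panel has fixed cut scores, the conversion itself is trivial; the hard, often-skipped work is justifying the numbers. A sketch with entirely hypothetical cut scores (the values 42 and 24 out of 60 are invented for illustration, not real érettségi figures):

```python
# Hypothetical cut scores from a standard-setting panel: minimum raw
# score (out of 60) required for each CEFR level claim, checked from
# the highest claim downwards.
CUT_SCORES = [("B1", 42), ("A2", 24)]

def cefr_claim(raw_score):
    """Translate a raw score into a level claim using panel-set cut
    scores. Without standard setting, any such mapping is arbitrary:
    the code is easy, defending the cut scores is the real work."""
    for level, minimum in CUT_SCORES:
        if raw_score >= minimum:
            return level
    return "below A2"

for score in (55, 42, 30, 10):
    print(score, "->", cefr_claim(score))
```

The slide’s questions (is mastery 100%? 80%? 50%?) are exactly the questions a panel must answer before the `CUT_SCORES` table can be written down at all; the same applies to converting a level result into a 1 – 5 or A – D grade.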
49 Questions to ask any exam provider: ITEM WRITING
Who are the item writers? How are they chosen?
Do they include those who routinely teach at that level?
How and for how long are they trained?
What feedback do they get on their work?
Are there Item Writer Guidelines?
Are there Test Specifications?
50 Questions to ask any exam provider: ANALYSIS
What quality control procedures are routinely in place?
Is there a statistical manual?
Are the test items routinely piloted?
What is the normal size of the pilot sample, and how does it compare with the test population?
What is the mean facility and discrimination of the sample / population?
Is the sample / population normally distributed: are there skewed or kurtotic patterns?
51 Questions to ask any exam provider: ANALYSIS
What is the inter-rater reliability?
What is the intra-rater reliability?
What is the Cronbach alpha or equivalent for item-based tests?
If there are different versions of the test (e.g. year by year, specialisation by specialisation), what is the evidence for the equivalence of these different versions?
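The Cronbach alpha asked about here is a standard internal-consistency estimate for item-based tests, computable directly from a candidates-by-items score matrix. A minimal sketch with invented data (real analyses would use a statistics package and a much larger sample):

```python
def cronbach_alpha(responses):
    """Internal-consistency reliability (Cronbach's alpha).
    responses: rows = candidates, columns = item scores."""
    n_items = len(responses[0])
    cols = list(zip(*responses))

    def variance(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    sum_item_var = sum(variance(col) for col in cols)
    total_var = variance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - sum_item_var / total_var)

# Invented pilot data: five candidates, four dichotomous items.
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(f"Cronbach's alpha: {cronbach_alpha(data):.2f}")  # 0.80
```

An exam provider who cannot report this figure (or an equivalent such as KR-20 for dichotomous items) has, in effect, no evidence that the test score is more than noise.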
52 Questions to ask any exam provider: TEST ADMINISTRATION
What are the security arrangements?
Are test administrators trained?
Is the test administration monitored?
Is there a post-test analysis of results?
Is there an examiner’s report each year or each administration?
53 Questions to ask any exam provider: REVIEW
How often are the tests reviewed and revised?
What special validation studies are conducted?
Does the test keep pace with changes in teaching or in the curriculum?
54 Questions to ask any exam provider: WASHBACK
What is the washback effect? What studies have been conducted?
Are there preparatory materials?
How are teachers trained (encouraged) to prepare their students for the exam?
55 Present Perfect? Negative features
Political interference
Politicians want instant results, not aware of how complex test development is
Politicians afraid of public opinion as drummed up by newspapers
Poor communication with teachers and public
Resistance from some quarters, especially university “experts”, who feel threatened by and who disdain secondary teachers
56 Present Perfect? Negative features
Exam centres often unprofessional, with no grasp of basic principles and practice
Simplistic notions of what tests can achieve and measure
Variable quality and results
School league tables
57 Present Perfect? Negative features
Assessment not seen as a specialised field: “anybody can design a test”
Decisions taken by people who know nothing about testing
Lack of openness and consultation before decisions are taken
Urge to please everybody – the political is more important than the professional
60 The Future
Gradual acceptance of principles and need for standards
Revision of Manual 2008
Forthcoming Guidelines and Recommendations
Validation of claims: is self-regulation acceptable? Role of ALTE? Role of EALTA?
Validation is not rubber stamping
Claims of links will need rigorous inspection
EALTA Code of Practice? Not just for exams but also for classroom assessment
61 The Future
Gaps in the CEFR – it needs to evolve
Linguistic content parallel to CEFR action-orientation
More critical scrutiny of the CEFR needed: text types do not determine difficulty
Need much more research into what causes difficulty
Need to combine SLA research and language-testing research related to the CEFR, to know which aspects of language map onto which CEFR levels for which learners
62 The Future
Change is painful: Europe is still in the middle of change
Testing is not just a technical matter: teachers need to understand the change and the reasons for it; they need to be involved and respected
Dissemination, exemplification and explanation are crucial for acceptance
PRESET and INSET teacher training in testing and assessment is essential
63 Good tests and assessment, following European standards, cost money and time.
But bad tests and assessment, ignoring European standards, waste money, time and LIVES.
64 Internet addresses
European Association for Language Testing and Assessment (EALTA)
Dutch CEFR Construct Project (Reading and Listening)
Diagnostic testing in 14 European languages