Benchmark Advisory Test (BAT) Update BILC Conference Athens, Greece Dr. Ray Clifford and Dr. Martha Herzog 22-26 June 2008.

Presentation transcript:

Benchmark Advisory Test (BAT) Update BILC Conference Athens, Greece Dr. Ray Clifford and Dr. Martha Herzog June 2008

This Update Has Four Parts: (1) why we began the BAT project; (2) the role of proficiency standards; (3) why the BAT follows a construct-based, evidence-centered test design; and (4) when the BAT will be available.

Why we began the BAT project A survey was conducted on the desirability of a BILC-sponsored, “benchmark” test with advisory ratings.

Participation in the Survey 16 countries responded to the survey: Austria, Bulgaria, Canada, Denmark, Estonia, Finland, Germany, Hungary, Italy, Latvia, Lithuania, Poland, Romania, Spain, Sweden, and Turkey.

Survey Results 1. Would your country use a Benchmark Test if one were available? Definitely yes: 8, Probably yes: 5, Perhaps: 2, Most likely not: 0, Definitely not: 1.

Survey Results 2. Does your country use "plus levels" when assigning STANAG ratings? Definitely yes: 3, Probably yes: 0, Perhaps: 1, Most likely not: 1, Definitely not: 11.

Survey Results 3. Would you like to have plus levels incorporated into a Benchmark Test? Definitely yes: 5, Probably yes: 5, Perhaps: 2, Most likely not: 2, Definitely not: 2.

Conclusions A "benchmark" test would be welcomed by most countries. (The scores should be advisory in nature.) Providing "plus" level ratings would allow greater fidelity in making comparisons. BILC should proceed with plans to: – Develop a benchmark STANAG test of reading comprehension. – Explore internet delivery options.

The Role of Proficiency Standards Dr. Martha Herzog BILC Athens, Greece June 2008

TRAINING IN TEST DESIGN LANGUAGE TESTING SEMINAR – 20 Iterations – 265 participants – 38 nations – 4 NATO officers – Facilitators from 10 nations

BENCHMARK TESTS Tests of all four skills, measuring Level 1 through Level 3.

STANDARDS All standards have three components – Content – Tasks – Accuracy

TEAMWORK The Working Group functions as a team – 13 members from 8 nations – Contributions from many other nations

Summary STANDARDS TRAINING IN TEST DESIGN TEAMWORK TECHNOLOGY

Why does the BAT follow a construct-based, evidence-centered test design?

Because a CBT, ECD design solves a major problem encountered when testing proficiency in the receptive skills, i.e. in testing Reading and Listening.

In contrast to traditional test development procedures, CBT allows direct (rather than indirect) application of the STANAG 6001 Proficiency Scales to the development and scoring of Reading and Listening proficiency tests.

Test Development Procedures: Norm-Referenced Tests Create a table of test specifications. Train item writers in item-writing techniques. Develop items. Test the items for difficulty, discrimination, and reliability by administering them to several hundred learners. Use statistics to eliminate “bad” items. Administer the resulting test. Report results compared to other students.
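To make the "test the items, then use statistics to eliminate bad ones" step concrete, here is a minimal sketch of the kind of classical item analysis commonly used when piloting norm-referenced tests. The response matrix and the screening thresholds are invented for illustration; they are not BILC procedures or BAT data.

```python
# Illustrative sketch (invented data): classical item analysis of the kind used
# when piloting norm-referenced tests. Rows = examinees, columns = items (1 = correct).
import numpy as np

responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
])

totals = responses.sum(axis=1)           # each examinee's total score
difficulty = responses.mean(axis=0)      # p-value: proportion answering each item correctly

# Point-biserial discrimination: correlation between an item and the total score
# with that item removed (the "corrected" item-total correlation).
discrimination = np.array([
    np.corrcoef(responses[:, j], totals - responses[:, j])[0, 1]
    for j in range(responses.shape[1])
])

# A typical screening rule: flag items that nearly everyone passes or fails,
# or that correlate weakly with overall performance. Thresholds are illustrative.
flagged = (difficulty < 0.2) | (difficulty > 0.9) | (discrimination < 0.2)
for j, (p, r, bad) in enumerate(zip(difficulty, discrimination, flagged), start=1):
    print(f"Item {j}: difficulty={p:.2f}, discrimination={r:.2f}, {'drop' if bad else 'keep'}")
```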

Test Development Procedures: Norm-Referenced Tests (cont.) Each test administration yields a total score. However, setting “cut” scores or “passing” scores on norm-referenced tests is a major challenge. And relating scores on norm-referenced tests to a polytomous set of criteria (such as levels in the STANAG 6001 or other proficiency scales) is even more problematic.

A Traditional Method of Setting Cut Scores
[Diagram: the test to be calibrated is administered to Level 1, Level 2, and Level 3 groups of "known" ability.]

The Results One Hopes For
[Diagram: distinct "cut" scores fall between the score ranges of the calibration groups of "known" ability.]

The Results One Always Gets
[Diagram: the score ranges of the Level 1, Level 2, and Level 3 groups overlap, forming bands of overlapping test scores.]

No matter where the cut scores are set, they are wrong for someone.
[Diagram: where in the overlapping range should the cut score be set?]
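An illustrative simulation of the problem, with invented score distributions rather than BAT data: even the best single cut score between two overlapping "known-ability" groups misclassifies some examinees.

```python
# Illustrative simulation only - the score distributions are invented, not BAT data.
# It shows why any single cut score between two overlapping groups is wrong for someone.
import numpy as np

rng = np.random.default_rng(0)
level1 = rng.normal(55, 10, 200)   # total scores of a "known" Level 1 group (hypothetical)
level2 = rng.normal(70, 10, 200)   # total scores of a "known" Level 2 group (hypothetical)

best_cut, best_errors = None, None
for cut in range(40, 91):
    errors = int(np.sum(level1 >= cut) + np.sum(level2 < cut))   # misclassified in either group
    if best_errors is None or errors < best_errors:
        best_cut, best_errors = cut, errors

print(f"Best single cut score: {best_cut}; it still misclassifies {best_errors} of 400 examinees.")
```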

Why is this "overlap" in scores always present? A single test score on a multi-level test… – Gives equal credit for every right answer regardless of its proficiency level. – Camouflages by-level abilities. – Is a "compensatory" score. Proficiency abilities… – Are by definition "non-compensatory". – Require demonstration of sustained ability at each level.

A Better Test Design: Construct-Based Proficiency Testing Uses a "floor" and "ceiling" approach similar to that used in Speaking and Writing tests. The proficiency rating is assigned based on two separate scores: – A "floor" proficiency level of sustained ability across a range of tasks and contexts specific to that level. – A "ceiling" proficiency level of non-sustained ability at the next higher proficiency level.

Therefore Construct-Based Testing Tests each proficiency level separately. – Three tests for levels 1 through 3. – Or three subtests within a longer test. Rates each level-specific test separately. Applies the "floor" and "ceiling" criteria used in rating productive skills using a scale such as: – Sustained (consistent evidence) = 70% to 100% – Developing (present, inconsistent) = 55% to 65% – Emerging (some limited evidence) = 40% to 50% – Random (no visible evidence) = 0% to 35%
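A minimal sketch of how such level-by-level percentages could be combined into a floor/ceiling rating. The band boundaries come from the scale above; the function names, the combination rule, and the sample scores are illustrative assumptions, not the BAT's published scoring algorithm.

```python
# Illustrative sketch of level-by-level "floor/ceiling" scoring using the bands above.
# Function names and the rating rule are assumptions, not the BAT's actual algorithm.

def band(percent_correct: float) -> str:
    """Map a per-level percentage score to an evidence band (per the slide)."""
    if percent_correct >= 70:
        return "Sustained"      # consistent evidence at this level
    if percent_correct >= 55:
        return "Developing"     # present, but inconsistent
    if percent_correct >= 40:
        return "Emerging"       # some limited evidence
    return "Random"             # no visible evidence

def rating(level_scores: dict) -> str:
    """Base level = highest level with continuous Sustained performance (the 'floor');
    note Developing evidence at the next level up (the 'ceiling')."""
    floor = 0
    for level in sorted(level_scores):
        if band(level_scores[level]) == "Sustained":
            floor = level
        else:
            break                               # sustained ability must be continuous
    next_band = band(level_scores.get(floor + 1, 0.0))
    suffix = f" (developing {floor + 1})" if next_band == "Developing" else ""
    return f"Level {floor}{suffix}"

# Example with invented per-level scores (percent correct on the Level 1-3 subtests):
print(rating({1: 90, 2: 60, 3: 45}))   # -> "Level 1 (developing 2)"
print(rating({1: 85, 2: 70, 3: 40}))   # -> "Level 2"
```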

Does it make a difference? Consider the following example.

A Total Score (where 195 = Level 1) Versus Construct-Based Scoring
Alice, Bob, and Carol all earn the same total score of 195, which a single cut score would report as Level 1. Scoring each level separately tells a different story:

Learner   Level 1 Results   Level 2 Results    Level 3 Results   Total Score   True Level
Alice     85 (Sustained)    70 (Sustained)     40 (Emerging)     195           2 (Barely)
Bob       90 (Sustained)    85 (Sustained)     20 (Random)       195           2 (Clearly)
Carol     90 (Sustained)    60 (Developing)    45 (Emerging)     195           1 (+ developing 2)

Scores on Construct-Based Tests are: valid, easily explained, and informative! But how is a CBT developed?

Test Development Procedures: Construct-Based Proficiency Tests 1. Define each proficiency level as a construct to be tested. 2. Follow a construct-based, evidence-centered test design. 3. Train item writers – In the proficiency scales. – In matching text types to the tasks in the scales. – In item writing.

Test Development Procedures: Construct-Based Proficiency Tests 4. Develop items that exactly match all of the specifications for each level in the proficiency scale, with... – Examinee task aligned with the author's [or the speaker's] purpose. – Level-appropriate topics and contexts.

Test Development Procedures: Construct-Based Proficiency Tests 5. Use "alignment", "bracketing", and "modified Angoff" item review and quality control procedures. – A specifications review to ensure alignment of author purpose, text type, and reader task. – A bracketing review to check the adequacy of the item's response options for test takers at higher and at lower proficiency levels. – Modified Angoff ratings of item difficulty for "at-level" test takers to set passing levels.
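As a concrete illustration of the modified Angoff step, here is a minimal sketch with invented judge ratings: each judge estimates the probability that a borderline "at-level" test taker answers each item correctly, and the sum of the averaged estimates gives a recommended passing score.

```python
# Minimal sketch of a modified Angoff calculation, with invented judge ratings.
# Each entry is a judge's estimate of the probability that a minimally competent
# "at-level" test taker answers that item correctly.
import numpy as np

# rows = judges, columns = items (hypothetical ratings for a six-item subtest)
ratings = np.array([
    [0.70, 0.60, 0.80, 0.55, 0.65, 0.75],
    [0.75, 0.65, 0.85, 0.60, 0.60, 0.70],
    [0.65, 0.55, 0.80, 0.50, 0.70, 0.80],
])

item_expectations = ratings.mean(axis=0)   # consensus expected difficulty per item
cut_score = item_expectations.sum()        # expected raw score of a borderline examinee
print(f"Recommended cut score: {cut_score:.1f} of {ratings.shape[1]} items "
      f"({100 * cut_score / ratings.shape[1]:.0f}% correct)")
```

In this invented example the consensus lands near 68% correct, the kind of figure that step 6 below would draw on when defining "sustained ability" for a level.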

Test Development Procedures: Construct-Based Proficiency Tests 6. Use data from the Angoff reviews to define "sustained ability" for each level of the test. 7. Assemble the "good" items into level-specific tests or subtests. 8. Do validation testing. 9. Use statistical analyses to confirm reviewer ratings.
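One way step 9 could look in practice, sketched with invented numbers: comparing each item's Angoff-predicted difficulty against the proportion correct actually observed among "at-level" examinees during the validation testing of step 8.

```python
# Illustrative check (invented numbers): confirm reviewer ratings by comparing
# Angoff-predicted item difficulty with the empirical proportion correct among
# "at-level" examinees from validation testing.
predicted = [0.70, 0.60, 0.82, 0.55, 0.65, 0.75]   # from the Angoff review (hypothetical)
observed  = [0.68, 0.41, 0.79, 0.57, 0.66, 0.73]   # from validation data (hypothetical)

for i, (p, o) in enumerate(zip(predicted, observed), start=1):
    flag = "review" if abs(p - o) > 0.15 else "ok"   # illustrative tolerance
    print(f"Item {i}: predicted {p:.2f} vs observed {o:.2f} -> {flag}")
```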

Test Development Procedures: Construct-Based Proficiency Tests 10. Replace items that do not "cluster" or act like the other items at each level. 11. Score and report results for each level using "sustained" proficiency criteria. 12. Continue to build the item databases to enable: – Random selection of test items for multiple forms. – Computer adaptive testing.
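A small sketch of the "random selection of test items for multiple forms" idea in step 12, using a hypothetical item bank with invented item identifiers.

```python
# Illustrative sketch (invented item bank): assemble a fresh test form by randomly
# sampling a fixed number of items per level from a bank of items that survived review.
import random

item_bank = {
    1: [f"L1-{i:03d}" for i in range(1, 41)],   # hypothetical Level 1 item IDs
    2: [f"L2-{i:03d}" for i in range(1, 41)],
    3: [f"L3-{i:03d}" for i in range(1, 41)],
}

def assemble_form(items_per_level=10, seed=None):
    """Return one form with a random, non-repeating draw of items for each level."""
    rng = random.Random(seed)
    return {level: rng.sample(pool, items_per_level) for level, pool in item_bank.items()}

form_a = assemble_form(seed=1)
form_b = assemble_form(seed=2)
print(form_a[2][:3], form_b[2][:3])   # different Level 2 items appear on each form
```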

What do the results of a CBT Reading proficiency test look like? Here are some initial results from the BAT English Reading Proficiency Test.

[Charts: results on the Level 1 Test, the Level 2 Test, and the Level 3 Test]

When will the BAT be available? Funds have been set aside for administering and scoring 200 free advisory tests. – All four skills will be tested. – The BAT Reading, Listening, and Writing tests will be delivered online. – The Speaking test will be conducted over the telephone. These tests are to be used in test norming or calibration studies.

When will the BAT be available? We anticipate the following timeline: – About October: directions on how to apply will be sent out. – About November: applications will be submitted. – About December: applications will be reviewed and decisions made about how the 200 tests will be allocated. – Between February and June: the first round of advisory testing will be conducted.

When will the BAT be available? More specific information will be sent out after consultation with ACT.

Are there any questions?