CLEAR 2008 Annual Conference, Anchorage, Alaska. Fundamental Testing Assumptions Revisited: Examination Length and Number of Options. Karine Georges & Kelly Piasentin.

Presentation transcript:

CLEAR 2008 Annual Conference, Anchorage, Alaska. Fundamental Testing Assumptions Revisited: Examination Length and Number of Options. Karine Georges & Kelly Piasentin, Assessment Strategies Inc.

2 Overview Credentialing organizations seek to balance factors such as program validity and credibility against more tangible considerations such as cost and ease of development. Two such aspects are investigated here: (1) a method to reduce the total number of test questions while retaining validity and reliability, and (2) the effect of reducing the typical number of options per question from four (4) to three (3).

3 Part I Examination Length: A Case Study Karine Georges, MSc.

4 Case Study: Certification Program Tasked in 2007 with determining whether 180-item, 4-hour examinations could be shortened in light of a potential move to computer-based testing (CBT).

5 Validity and Examination Length Content validity: the number of items on an examination must be sufficient to ensure adequate, representative coverage of the content domain. Face validity: if the examination is shortened, stakeholder perceptions need to be considered vis-à-vis comparable professions.

6 Examination Length and Reliability What is an acceptable reliability index for credentialing? "A reliability correlation coefficient should fall in the high .80s or above for longer examinations (e.g., 150 or more items)" (NOCA, 2004). What is the range of reliability indices for the current 180-item certification examinations? Average: .84; Minimum: .78; Maximum: .92.

7 Examination Length and Practical Considerations If reliability is related to the number of items, why shorten the examination? Costs and efficiency: each item costs between $300 and $1,000 to develop (Vale, 2006), and additional items are needed for safeguard purposes and for ancillary materials such as preparation guides or readiness tests. The client's intention to move to CBT also makes shorter examinations advantageous: seat time can be reduced and more candidates accommodated within the testing period.

8 Research Approaches Two approaches: a Classical Test Theory (CTT) approach, examining the reliability coefficient using the Spearman-Brown formula, and an Item Response Theory (IRT) approach, examining the item information function using empirical data.

9 CTT Results for the Two Certification Programs Spearman-Brown formulation: Pxx' = N·pxx / (1 + (N − 1)·pxx), where pxx is the reliability of the original examination and N is the factor by which the length is changed. Results show that the examinations can be shortened by about 10% (roughly 18 items of a 180-item form) and still remain above .80. [Table: number of items and projected reliability for Programs A and B at 100%, 90%, 75%, and 50% of full length.]
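To make the projection concrete, here is a minimal sketch of the Spearman-Brown computation in Python. The 180-item length and the .84 average reliability come from the slides above; the specific shortened lengths shown are illustrative assumptions, not the programs' actual results.

```python
def spearman_brown(rho, n):
    """Spearman-Brown prophecy: projected reliability when test length is
    changed by a factor n (n < 1 shortens, n > 1 lengthens), assuming the
    removed/added items are parallel to the rest."""
    return (n * rho) / (1 + (n - 1) * rho)

# Illustrative values: a 180-item form with the average reported reliability of .84.
full_length, rho_full = 180, 0.84

for pct in (1.00, 0.90, 0.75, 0.50):
    items = round(full_length * pct)
    print(f"{items:3d} items ({pct:.0%}): projected reliability = {spearman_brown(rho_full, pct):.3f}")
```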

10 Limitations of CTT Results General limitations of Spearman-Brown: it assumes the shortened and original examinations are exactly parallel, it yields only a single value across the range of abilities, and it is strongly influenced by the candidate cohort.

11 IRT Approach: Item Information Curve Research has shown that high-stakes examinations with pass/fail decisions, such as certification examinations, can be shortened without affecting classification accuracy (Schulz & Wang, 2001). What would be the impact if the certification examinations had 10% fewer items? How about 25% or 50%?

12 IRT - Item Information Curve IRT models specify the probability of a discrete outcome, such as a correct response to an item, in terms of person and item parameters. Person parameter: candidate ability (theta). Item parameters: a = discrimination (slope), b = difficulty (location), c = guessing (lower asymptote).

13 IRT - Test Information Curve The item information curves sum to a test information curve. The amount of information differs based on the length of the examination and the quality of the items. The pass/fail decision should be made where error is minimal (ideally where the passmark is located) and where levels of ability can be clearly differentiated.
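As an illustration of how item information curves combine into a test information curve, the sketch below implements the standard 3PL item information formula and sums it over a small, hypothetical set of item parameters; the passmark location and the parameter values are assumptions for demonstration only, not data from the two programs.

```python
import math

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the three-parameter logistic model."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Fisher information contributed by one 3PL item at ability level theta."""
    p = p_3pl(theta, a, b, c)
    return (a ** 2) * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

def test_information(theta, items):
    """Test information is simply the sum of the item information curves."""
    return sum(item_information(theta, a, b, c) for a, b, c in items)

# Hypothetical item parameters (a = discrimination, b = difficulty, c = guessing).
items = [(1.2, -0.5, 0.20), (0.8, 0.0, 0.25), (1.5, 0.3, 0.20), (1.0, 0.8, 0.25)]

passmark = 0.0  # assumed passmark location on the theta scale
info = test_information(passmark, items)
print(f"Information at the passmark: {info:.2f}  (SEM ≈ {1 / math.sqrt(info):.2f})")
```

Dropping items removes their curves from the sum, so the question is how much information (and hence how much measurement precision near the passmark) is lost at each reduced length.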

14 IRT Results for Program A

15 IRT Results for Program B

16 IRT - Results and Implications The examinations can be reduced by at least 10% without significantly impacting the pass/fail decision. Other factors to take into consideration: the number of candidates and the robustness of the item bank.

17 Other Considerations What about face validity? How would an examination with 90 items be viewed by other professionals compared to a comparable examination of 180 items?

18 Other Certification Programs Review of over 75 certification programs within the same profession. Average number of items: 164 (including experimental items). Minimum: 100. Maximum: 250.

19 Summary Data suggest that the number of items can be reduced by 10% with minimal impact on the validity and reliability.

20 Part II How Many Options Are Optimal in Multiple-Choice Testing? Kelly Piasentin, PhD

21 Multiple Choice Testing Most common format used in licensure and certification examinations. Consists of a stem (i.e., the question being asked) and a series of options to choose from (usually 4). Example. Stem: In which state is the 2008 CLEAR conference being held? Options: 1. Arkansas, 2. Alaska, 3. Arizona, 4. Alabama.

22 Advantages of Multiple Choice Versatility; efficiency; scoring accuracy and economy; reliability; diagnosis; control of difficulty; amenable to item analysis.

23 Disadvantages of Multiple Choice Time consuming to write Difficult to create effective distracters (i.e., options that are plausible, but incorrect)

24 Time Spent Writing MCQs Sample of 75 item writers for 3 different licensing/certification examinations. Average time spent writing an MCQ: 52 minutes. Percentage of time spent writing: Stem: 26%; Correct response: 12%; 1st distracter: 11%; 2nd distracter: 13%; 3rd distracter: 17%; Rationales/References: 21%.

25 Effort Spent Writing Distracters Of the 75 item writers: 25% reported that it was difficult to write the 1st distracter; 40% reported that it was difficult to write the 2nd distracter; 75% reported that it was difficult to write the 3rd distracter.

26 How many options should an MCQ have? 4-option MCQs are widely used in standardized testing. But are 4 options ideal? Some item-writing guidelines say to "develop as many options as feasible" (Haladyna & Downing, 1989); more recently, to "develop as many functional distractors as are feasible" (Haladyna, Downing, & Rodriguez, 2002). There is increasing emphasis on the quality of distractors as opposed to the quantity.

27 Definition of a Functional Distracter “A functional distracter is one that has (a) a significant negative point-biserial correlation with the total test score, (b) a negative sloping item characteristic curve, and (c) a frequency of response greater than 5% for the total group.” Haladyna & Downing (1988)
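The sketch below shows how two of these criteria, the response frequency above 5% and a negative point-biserial with the total score, can be checked from candidate response data. The point-biserial formula is standard, but the data, function names, and threshold handling are illustrative; the significance test and the negatively sloping item characteristic curve (criterion b) are omitted.

```python
import math
import statistics

def point_biserial(flags, totals):
    """Point-biserial correlation between a 0/1 indicator (candidate chose this
    distracter) and the candidate's total test score."""
    p = sum(flags) / len(flags)
    sd = statistics.pstdev(totals)
    if p in (0.0, 1.0) or sd == 0:
        return 0.0
    mean_chosen = statistics.mean(t for t, f in zip(totals, flags) if f)
    return (mean_chosen - statistics.mean(totals)) / sd * math.sqrt(p / (1 - p))

def looks_functional(flags, totals, min_freq=0.05):
    """Check two of the three Haladyna & Downing (1988) criteria: response
    frequency above 5% and a negative point-biserial with the total score
    (significance testing and the ICC-slope check are not attempted here)."""
    freq = sum(flags) / len(flags)
    return freq > min_freq and point_biserial(flags, totals) < 0

# Hypothetical data: which of 10 candidates chose the distracter, and their totals.
flags = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
totals = [85, 52, 78, 90, 60, 74, 88, 55, 81, 79]
print(looks_functional(flags, totals))  # True: chosen mostly by weaker candidates, by >5% of the group
```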

28 How does the number of options impact guessing? With 4 options, candidates have a 25% chance of getting any one question correct by simply guessing. The probability drops to 20% with 5 options and rises to 33% with 3 options. BUT… if a typical examination has 25 items, each with 3 options, the chance of getting at least 70% on the examination by pure blind guessing is about 1 in 25,000. So, do you get more bang for your buck by having more options?
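The blind-guessing figure can be sanity-checked with a simple binomial model, sketched below. The 25-item, 3-option setup comes from the slide; treating "at least 70%" as 18 or more correct answers is an assumption about rounding, so the point is the order of magnitude rather than the exact figure.

```python
import math

def prob_at_least(n_items, p_guess, min_correct):
    """Binomial probability of at least `min_correct` successes out of `n_items`
    independent guesses, each correct with probability `p_guess`."""
    return sum(
        math.comb(n_items, k) * p_guess ** k * (1 - p_guess) ** (n_items - k)
        for k in range(min_correct, n_items + 1)
    )

n_items, p_guess = 25, 1 / 3        # 25 three-option items
needed = math.ceil(0.70 * n_items)  # 18 correct answers for at least 70%
p_pass = prob_at_least(n_items, p_guess, needed)
print(f"P(pass by blind guessing) = {p_pass:.6f}  (about 1 in {round(1 / p_pass):,})")
```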

29 Are 4-option MCQs optimal? Factors to consider: Time and cost it takes to develop distracters Time it takes for candidates to complete the examination Psychometric properties of examination –Item difficulty –Item discrimination –Test reliability (Coefficient alpha)

30 Arguments in favour of 3-options: Less time is needed to develop two plausible distracters More 3-option items can be administered without increasing testing time –Inclusion of additional high quality items per unit of time should improve test score reliability Having fewer options decreases the likelihood of exposing additional aspects of the domain to candidates (e.g., context clues to other questions)

31 Data from a Licensing/Certification Examination Number of MCQs: 235. Number of candidates: 5,393. Mean item difficulty: .721. Mean discrimination index: .166. Test reliability: .88. Most chosen distracter: …; 2nd most chosen distracter: .077; Least chosen distracter: .035.

32 Reducing Examination Items to 3 Options What would be the effect on item difficulty, discrimination, and reliability of reducing the items on the examination to 3 options if the least chosen distracter was: attributed to the correct answer? attributed to the 2nd least chosen distracter? randomly distributed to each of the other 3 choices?
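The sketch below illustrates one way the reallocation described on this slide could be approximated: collapse each item's least-chosen distracter according to one of the three strategies, then recompute mean difficulty and coefficient alpha. The response matrix, option labels, and function names are hypothetical, and the authors' exact procedure (e.g., handling of ties or of the discrimination index) is not specified here.

```python
import random
import statistics

def collapse_least_chosen(responses, key, strategy="to_correct", seed=0):
    """Reassign each item's least-chosen distracter to simulate a 3-option item.
    `responses`: one chosen option per item per candidate; `key`: correct options."""
    rng = random.Random(seed)
    new = [list(r) for r in responses]
    for j in range(len(key)):
        counts = {}
        for r in responses:
            if r[j] != key[j]:
                counts[r[j]] = counts.get(r[j], 0) + 1
        if not counts:
            continue  # no distracter was chosen for this item
        least = min(counts, key=counts.get)
        others = [opt for opt in counts if opt != least]
        for r in new:
            if r[j] == least:
                if strategy == "to_correct":
                    r[j] = key[j]
                elif strategy == "to_2nd_least" and others:
                    r[j] = min(others, key=counts.get)
                else:  # "random": spread over the remaining choices
                    r[j] = rng.choice(others + [key[j]])
    return new

def difficulty_and_alpha(responses, key):
    """Mean item difficulty (proportion correct) and coefficient alpha."""
    scored = [[int(r[j] == key[j]) for j in range(len(key))] for r in responses]
    totals = [sum(row) for row in scored]
    k = len(key)
    item_vars = [statistics.pvariance([row[j] for row in scored]) for j in range(k)]
    alpha = (k / (k - 1)) * (1 - sum(item_vars) / statistics.pvariance(totals))
    mean_p = statistics.mean(sum(col) / len(scored) for col in zip(*scored))
    return mean_p, alpha

# Tiny hypothetical data set: 6 candidates, 3 four-option items.
key = ["B", "C", "A"]
responses = [["B", "C", "A"], ["B", "C", "B"], ["A", "C", "A"],
             ["B", "D", "A"], ["C", "B", "B"], ["B", "C", "C"]]

for strategy in ("to_correct", "to_2nd_least", "random"):
    p, alpha = difficulty_and_alpha(collapse_least_chosen(responses, key, strategy), key)
    print(f"{strategy:12s}: mean difficulty = {p:.3f}, alpha = {alpha:.3f}")
```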

33 Reducing Examination Items to 3 Options If the least chosen distracter is attributed to the correct answer: Item difficulty: .752; Mean discrimination index: .136; Coefficient alpha: .834.

34 Reducing Examination Items to 3 Options If the least chosen distracter is attributed to the 2nd least chosen distracter: Item difficulty: .720; Mean discrimination index: .168; Reliability: .881.

35 Reducing Examination Items to 3 Options If the least chosen distracter is distributed randomly to each of the other 3 choices: Item difficulty: .731; Mean discrimination index: .158; Reliability: .868.

36 Summary (LCD = least chosen distracter)
                Difficulty   Discrimination   Reliability
4 options          .721          .166            .88
LCD → Correct      .752          .136            .834
LCD → 2nd LCD      .720          .168            .881
LCD → Random       .731          .158            .868

37 4 Options vs. 3 Options Moving from 4 options to 3 options did not have a significant impact on average item difficulty, discrimination or test reliability.

38 Summary Two primary benefits of using 3 options (as opposed to 4 options): faster item writing (better quality items, cost savings) and better testing (shorter test time, or more questions in the same amount of time, with potential for increased reliability).

39 Conclusion These two presentations demonstrate that some efficiencies can be gained by reducing test length and the number of response options without compromising test validity. Further research is needed to confirm these findings.

40 Contact Information Karine Georges, MSc; Kelly Piasentin, PhD. Assessment Strategies Inc., 1400 Blair Place, Suite 210, Ottawa, ON K1J 9B8, Canada. Telephone: