Creating Effective Multiple-choice Questions
David DiBattista, Ph.D.
Brock University, Department of Psychology
July 2012
©D. DiBattista 2012


Overview
• Some essential terminology
• The why and the how of testing
• Two challenges in MC testing
• Addressing the challenges

A well-constructed four-option multiple-choice question

You are reading an article in which the world’s major cities are ranked with respect to the quality of life for their residents. This is an example of what type of measurement scale?
A. ordinal scale
B. nominal scale
C. ratio scale
D. interval scale

• The question part of the item is the STEM.
• The alternatives A to D are the OPTIONS.
• The one correct (or best) option is the KEYED OPTION.
• The incorrect options are called DISTRACTORS.


The Why and How of Testing

A primary goal of testing: To measure the extent to which test-takers have learned the facts, concepts, procedures, and skills that have been taught in the course.

An effective test: Test-takers who have learned more will obtain higher test scores, and those who have learned less will obtain lower scores.

To be effective, a test must consist of effective items.


What is an effective test item?

For an individual MC test item to be effective, test-takers with higher test scores must be more likely to answer it correctly than those with lower scores.

Percent answering correctly (this chart is on Page 2 of the handout):

Item            Top 25% of test-takers   Bottom 25% of test-takers   Top − Bottom
A good item     90                       50                          +40
A poor item     71                       69                          +2
An awful item   5                        25                          −20

• The poor item is simply not pulling enough weight.
• Note that the good item and the poor item are equally difficult (i.e., 70% chose the keyed option in each item).
• (Top − Bottom) = Discrimination Index.
• The Discrimination Index is often expressed as a proportion: +40 becomes +0.40.
• We always want the Discrimination Index to be positive, and the bigger the better.
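The Top − Bottom calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's procedure: the function names, the 25% group fraction as a parameter, and the handling of tied scores (ties keep their input order) are all assumptions made for the example.

```python
from typing import Sequence


def di_from_percents(top_percent: float, bottom_percent: float) -> float:
    """Discrimination index as a proportion, given percent correct in each group."""
    return (top_percent - bottom_percent) / 100.0


def discrimination_index(total_scores: Sequence[float],
                         item_correct: Sequence[bool],
                         fraction: float = 0.25) -> float:
    """Top-minus-bottom discrimination index for a single item.

    total_scores: each test-taker's total test score.
    item_correct: whether each test-taker chose the keyed option on this item.
    fraction: share of test-takers in the top and bottom groups (assumed 25%).
    """
    if len(total_scores) != len(item_correct):
        raise ValueError("total_scores and item_correct must have the same length")
    n = len(total_scores)
    if n == 0:
        raise ValueError("at least one test-taker is required")
    group_size = max(1, int(n * fraction))
    # Rank test-takers from highest to lowest total score.
    order = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    top, bottom = order[:group_size], order[-group_size:]
    p_top = sum(item_correct[i] for i in top) / group_size
    p_bottom = sum(item_correct[i] for i in bottom) / group_size
    return p_top - p_bottom


# The three items from the chart, expressed as proportions.
print(di_from_percents(90, 50))  # 0.4  (good item)
print(di_from_percents(71, 69))  # 0.02 (poor item)
print(di_from_percents(5, 25))   # -0.2 (awful item)
```

With real response data, `discrimination_index` would be called once per item, using the same total scores each time.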

Interpreting the Discrimination Index (this chart is on Page 3 of the handout)

Value of DI        Interpretation
≥ +0.50            Outstanding
[…] to +0.49       Very good
[…] to +0.39       Good
[…] to […]         Acceptable (but could be better!)
0 to […]           Unsatisfactory (despite being ≥ 0)
< 0                Harmful!

A key point

The Discrimination Index tends to suffer when items are either very easy or very hard.
• Very easy: 85% or more answer correctly
• Very hard: 35% or less answer correctly
• “Just right”: 40 to 80% answer correctly

This information is on Page 3 of the handout.
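The difficulty bands above translate directly into a small classifier. The sketch below is illustrative only: the function name is hypothetical, and the "borderline" label for values that fall between the stated bands (above 35% but below 40%, or above 80% but below 85%) is an assumption, since the slide does not name those ranges.

```python
def difficulty_band(percent_correct: float) -> str:
    """Classify an item by the percentage of test-takers answering correctly."""
    if not 0 <= percent_correct <= 100:
        raise ValueError("percent_correct must be between 0 and 100")
    if percent_correct >= 85:
        return "very easy"
    if percent_correct <= 35:
        return "very hard"
    if 40 <= percent_correct <= 80:
        return "just right"
    # The slide leaves these gaps unnamed; the label is an assumption.
    return "borderline"


print(difficulty_band(70))  # just right
print(difficulty_band(90))  # very easy
print(difficulty_band(30))  # very hard
```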

[Chart: 100 MC items; the difficulty axis runs from Harder to Easier. Class mean = 62.6%. Mean Discrimination Index = […]. 10% of items are weak discriminators. Labelled example item: Difficulty Index = 0.71, Discrimination Index = 0.30. This chart is on Page 4 of the handout.]

[Chart: 211 MC items; the difficulty axis runs from Harder to Easier. Class mean = 66.0%. Mean Discrimination Index = […]. 51% of items are poor discriminators, and 6.6% are negative.]

Key points

In general, the Discrimination Index of MC items will be greatest when:
• they are in the mid-range of difficulty,
• they conform to widely accepted item-writing guidelines,
• their content is consistent with the course learning objectives, and
• the instruction provided has allowed motivated students to learn the material.

This information is on Page 3 of the handout.


“A good MCQ is difficult to write. Many will contain item writing flaws and most will do no more than test factual recall. Our study has shown that this does not necessarily have to be the case, but it cannot be assumed that (just) anyone can write a quality MCQ unaided and without peer review.” (Palmer & Devitt, 2007)

Some good news: Writing high-quality MC items is a learnable skill!

Two Challenges in MC Testing

Challenge #1: “Many will contain item-writing flaws”
• Flawed items discriminate less effectively among students who differ in achievement.

Two Challenges in MC Testing

Challenge #2: “Most will do no more than test factual recall”
• An emphasis on memory-based items over higher-level items may threaten the content validity of the test.

Overview
• Some essential terminology
• The why and the how of testing
• Two challenges in MC testing
• Addressing the challenges
    • Constructing high-quality items
    • Assessing higher-level thinking

Tips for MC Item Construction
• When writing the stem, use question format rather than sentence-completion format.

The complete list of tips is on Page 5 of the handout.

Here the stem is in question format:

You are reading an article in which the world’s major cities are ranked with respect to the quality of life for their residents. This is an example of what type of measurement scale?
A. ordinal scale
B. nominal scale
C. ratio scale
D. interval scale

Here the stem ends with an incomplete sentence:

You are reading an article in which the world’s major cities are ranked with respect to the quality of life for their residents. This is an example of a(n)
A. ordinal measurement scale.
B. nominal measurement scale.
C. ratio measurement scale.
D. interval measurement scale.

Question format works better. (Speaker note: shuttling; ESL.)

Tips for MC Item Construction
• The stem should present the issue under consideration CLEARLY and contain as much information as possible.
• Do not include irrelevant information in the stem unless it plays a role in the assessment procedure.
• Avoid using long, complex sentences.

Poor

South America
A. imports coffee from Australia.
B. is where the Gobi Desert is located.
C. was heavily colonized by people from Spain.
D. has a larger population than the United States of America.

• The stem is not informative at all, and there really is no question here.
• Having lengthy options increases the amount of reading that students must do.

Even worse

South America, which has an area of more than 17 million square kilometres,
A. imports coffee from Australia.
B. is where the Gobi Desert is located.
C. was heavily colonized by people from Spain.
D. has a larger population than the United States of America.

• Avoid window dressing.

Poor

Which of the following statements about South America is true?
A. South America imports coffee from Australia.
B. The Gobi Desert is located in South America.
C. South America was heavily colonized by people from Spain.
D. South America has a larger population than the United States of America.

• This is really a multiple true-false question.
• The stem contains little information and does not pose a question related to the topic.

Better

People from which of these countries colonized a large part of South America?
A. Spain
B. France
C. Holland
D. England

• This is a clear, straightforward question, and it is focused on a single topic.
• Even poorly constructed items can sometimes provide the inspiration for a useful item!

Original wording:

Which classical theorist’s insights were tested by Zurcher in the real-life social laboratory provided by the Kansas tornado about change and social solidarity?
A. Martineau
B. Marx
C. Durkheim
D. Weber

Revised wording:

Which classical theorist’s insights about change and social solidarity did Zurcher study in the context of the Kansas tornado?
A. Martineau
B. Marx
C. Durkheim
D. Weber

The importance of good writing!

If we suppose that John’s score on a recent Canadian History test is 80, and the distribution of test scores, which has a mean of 70 and a standard deviation of 10, contains 100 scores and is positively skewed, then what is John’s standard score?
A B C D

This question has 45 words, all in one sentence! So let’s simplify this 45-word question…

Henry David Thoreau: “Simplify, simplify.”

A set of 100 test scores is positively skewed, with a mean of 70 and a standard deviation of 10. John’s test score is 80. What is his standard score?
A B C D

• Easy reading: The stem now has 30 words in three sentences, including a straightforward question.
• Aim to make sentences shorter and simpler, rather than longer and more complex.
• Note that some information in the stem is not needed to answer the question, but that’s okay here.

For math-based problems:
• Focus on the principles.
• Keep the numbers simple.
• Put options in reader-friendly order.
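The answer to the simplified item can be checked with a one-line calculation. The sketch below assumes the usual definition of a standard score, z = (raw score − mean) / standard deviation; the function name is illustrative. It also shows why the sample size and the skew mentioned in the stem are not needed to answer the question.

```python
def standard_score(raw: float, mean: float, sd: float) -> float:
    """Standard (z) score: distance from the mean in standard-deviation units."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (raw - mean) / sd


# John's score is 80; the distribution has mean 70 and standard deviation 10.
# The number of scores (100) and the positive skew play no part in the result.
print(standard_score(80, 70, 10))  # 1.0
```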

Watch for ambiguity!

After a bad day at work, George comes home and yells at his young son, who starts to cry. What type of behaviour is he demonstrating?
A. projection
B. displacement
C. sublimation
D. reaction formation

Here “he” could refer to either George or his son. Naming the person removes the ambiguity:

After a bad day at work, George comes home and yells at his young son, who starts to cry. What type of behaviour is George demonstrating?
A. projection
B. displacement
C. sublimation
D. reaction formation

Watch for idioms and uncommon words.

Mary loves being in the limelight. On which Big Five factor would you expect her to have a very high score?
A. conscientiousness
B. extroversion
C. agreeableness
D. neuroticism

Without the idiom:

Mary enjoys talking and spending time with others, and many of her friends consider her a natural leader. On which Big Five factor would you expect Mary to have a very high score?
A. conscientiousness
B. extroversion
C. agreeableness
D. neuroticism

Of course, discipline-related technical terms are perfectly appropriate.

Tips for MC Item Construction
• Whenever possible, avoid negative wording in the stem, and be sure to emphasize it when it does occur.

Poor

Which of the following terms is not usually associated with Sigmund Freud?
A. superego
B. extinction
C. repression
D. latent content

Better

Which of the following terms is NOT usually associated with Sigmund Freud?
A. superego
B. extinction
C. repression
D. latent content

• Negation adds an extra cognitive burden, so use it only when really necessary.

Also better

Which of the following terms is usually associated with behaviourism?
A. synesthesia
B. extinction
C. repression
D. closure

• The keyed option is still the same, but the question is now positively framed.

Tips for MC Item Construction
• Check carefully for spelling and grammatical errors, giving special attention to distractors.

With errors in the distractors:

What do stamp collectors use stamp hinges for?
A. to pick up stamps
B. to fold learge stamps in half
C. to mount stamps in albums
D. to joining stamps together

• Errors like these are more likely to crop up in the distractors than in the keyed option.
• Such errors can give clues to testwise students!

Corrected:

What do stamp collectors use stamp hinges for?
A. to pick up stamps
B. to fold large stamps in half
C. to mount stamps in albums
D. to join stamps together

Tips for MC Item Construction
• All distractors should be plausible.
• Four options will usually be quite adequate, but the number used is best determined by the number of PLAUSIBLE distractors you can supply.

Which river flows through the city of Edmonton?
A. North Saskatchewan River
B. Peace River
C. Milk River
D. Athabasca River

• These four rivers are all in the same “domain.”

Which river flows through the city of Edmonton?
A. North Saskatchewan River
B. Nile River
C. Amazon River
D. Rhine River
E. Mississippi River
F. Seine River

• Distractor plausibility is a key to success!

Tips for MC Item Construction

To generate plausible distractors:
• Use students’ most common errors on constructed-response tests.
• Use distractors that are similar to the correct answer in content, length, and complexity.
• Use words that sound important or have associations to the stem.
• Use distractors that are true, but do not correctly answer the question.

Students’ errors on constructed-response questions:

Name the river that flows through the city of Edmonton.
Student answer: Athabasca River (incorrect)

What do stamp collectors use stamp hinges for?
Student answer: To join stamps together (incorrect)

Also listen carefully to the questions students ask, and watch for their misconceptions.


In severe cases of obesity, there may be a substantial increase in the number of adipocytes. Which of the following terms is used to refer to this increase?
A. hyperbole
B. hyperplasia
C. hypertrophy
D. hypertonicity

• Knowing that the answer is “hyper-something” is not enough to get this item correct.

Who developed the behavioural therapy known as systematic desensitization?
A. Barack Obama
B. Muhammad Ali
C. Martin Luther
D. Joseph Wolpe

• These four people have little in common; that is, they are not in the same domain.

Who developed the behavioural therapy known as systematic desensitization?
A. Anna Freud
B. Jean Piaget
C. Wilhelm Wundt
D. Joseph Wolpe

• More challenging: All four of these people are well known within the domain of psychology.

Who developed the behavioural therapy known as systematic desensitization?
A. Ivan Pavlov
B. Albert Ellis
C. B. F. Skinner
D. Joseph Wolpe

• Even more challenging: All four of these people have a connection to the domain of behavioural psychology.


In responding to a lengthy survey, a man answers “yes” to every yes-no question asked. It is reasonable to suspect that his responses may be influenced by which of the following?
A. response acquiescence
B. opportunistic characterization
C. the partial reinforcement effect
D. the conspicuous agreement predisposition


Which of the following events caused the Prime Minister of Canada to proclaim the War Measures Act?
A. Quebec was invaded by Germany in […].
B. The October Crisis occurred in […].
C. The first Quebec Referendum was held in […].
D. The Meech Lake Accord was defeated in […].

• Option A can be ruled out simply because it is a FALSE statement.
• Because Options C and D are TRUE, they must be considered as possible answers to the question posed in the stem.

Tips for MC Item Construction
• Avoid patterns in the length and location of correct answers that could provide clues that are unrelated to content.
• Balance the answer key so that the correct response appears in each position about the same number of times.

What characteristic of hallucinations would make their occurrence sufficient for a diagnosis of schizophrenia? A. a satanic or religious theme B. bizarre content C. derailment and neologisms D. voices providing a running commentary on the person’s behaviour, or two or more voices conversing with one another The keyed response is too often the longest, and testwise students know this!

In a four-option multiple-choice test, about how often should the correct answer appear in each of the four locations? A. 10% of the time B. 25% of the time C. 40% of the time D. 60% of the time Balance the answer key! ©D. DiBattista 2012
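One way to check whether an answer key is balanced is to tally how often each position is keyed. The following is a minimal Python sketch; the function name `key_position_shares` and the 20-item sample key are hypothetical, invented for illustration, and are not part of the original slides.

```python
from collections import Counter

def key_position_shares(answer_key, positions="ABCD"):
    """Return the share of items keyed at each option position."""
    counts = Counter(answer_key)
    total = len(answer_key)
    return {pos: counts.get(pos, 0) / total for pos in positions}

# Hypothetical 20-item answer key, made up for this example only.
sample_key = list("BCBCABCDBCCBBCACBDCB")
shares = key_position_shares(sample_key)

for pos, share in shares.items():
    # In a balanced four-option key, each share would be close to 25%.
    print(f"{pos}: {share:.0%}")
```

Shares that sit far from 25% for a four-option test suggest that the key should be rebalanced before the test is used.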

“Edge avoidance” Who invented the binaural recording system commonly known as “stereo”? When the four options appear, make your best guess as quickly as you can! A. Xxxxxxxxxxxxxx B. Xxxxxxxxxxxxxx C. Xxxxxxxxxxxxxx D. Xxxxxxxxxxxxxx Edge avoidance can be a major problem for the creators of MC tests! In one test I came across, 74% of the keyed options were either B or C. Think about those testwise students!

Who invented the binaural recording system commonly known as “stereo”? A. Alan Dower Blumlein B. Alan Dower Blumlein C. Alan Dower Blumlein D. Alan Dower Blumlein Thanks, Alan!
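One possible safeguard against edge avoidance is to assign key positions mechanically rather than by intuition. The sketch below is an assumption about how this could be done, not a procedure from the slides; `balanced_key` is a hypothetical helper that gives each position an equal share of the items (differing by at most one) and then shuffles the order.

```python
import random
from collections import Counter

def balanced_key(n_items, positions="ABCD", seed=None):
    """Build an answer key in which every position is keyed about equally often."""
    rng = random.Random(seed)
    full, extra = divmod(n_items, len(positions))
    # Every position appears `full` times; leftover items get distinct positions.
    key = list(positions) * full + rng.sample(list(positions), extra)
    rng.shuffle(key)  # remove any predictable ordering of the keyed positions
    return key

key = balanced_key(40, seed=1)
print(Counter(key))
```

The item-writer would then arrange each item's options so that the correct answer lands in its assigned position. This does not fit items whose options must stay in a fixed order, such as numerical options listed from smallest to largest.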

For numerical options, let the correct answer appear in each of the positions about the same number of times. ©D. DiBattista 2012 Tips for MC Item Construction

How many chromosomes are found in an ovum of a healthy adult woman? A. 18 B. 23 C. 37 D. 46 (Options are in reader-friendly order.) Item-writers tend NOT to let the key be either the smallest or largest value in the option list. Knowing this, testwise students discount the smallest and largest values.

Avoid having the options include a single pair of opposites, one of which is the keyed option. ©D. DiBattista 2012 Tips for MC Item Construction

A psychologist administers an aptitude test to 200 people, and then one month later she has the same people take the test again. The correlation between the two sets of scores is What should she conclude about the test? A. 91% of the items are effective. B. It has poor test-retest reliability. C. It has good test-retest reliability. D. It has poor criterion-related validity. A problem: When the options include a single pair of opposites, one member of the pair is the keyed option 75-80% of the time. ©D. DiBattista 2012

A psychologist administers an aptitude test to 200 people, and then one month later she has the same people take the test again. The correlation between the two sets of scores is What should she conclude about the test? A. It has poor test-retest reliability. B. It has good test-retest reliability. C. It has poor criterion-related validity. D. It has good criterion-related validity. Using two pairs of opposites generally solves the problem. ©D. DiBattista 2012

Do not use “none of the above.” ©D. DiBattista 2012 Tips for MC Item Construction

“None of the above” as the key: Which of these 19th century authors wrote Middlemarch? A. Jane Austen B. Anne Bronte C. Wilkie Collins D. none of the above

I’m sure Dickens wrote Middlemarch, so I’ll go with “none of the above.” ©D. DiBattista 2012

Dickens didn’t write Middlemarch. George Eliot wrote it! But here is the problem: When NOTA is the keyed option, misinformed students often earn full marks. NOTA is often used as “the distractor of last resort.”

So let’s fix this NOTA item… Which of these 19th century authors wrote Middlemarch? A. Jane Austen B. Anne Bronte C. Wilkie Collins D. George Eliot

Do not use “all of the above.” ©D. DiBattista 2012 Tips for MC Item Construction

Which of these terms is associated with Sigmund Freud? A. superego B. repression C. latent content D. all of the above ©D. DiBattista 2012

I never heard of latent content, but superego and repression are both definitely Freudian terms, so it must be “all of the above.” ©D. DiBattista 2012

Which of these terms is associated with Sigmund Freud? A. superego B. repression C. latent content ??? D. all of the above When AOTA is the keyed option, students with partial knowledge can still earn full marks. Moreover, AOTA usually serves as the keyed option, and testwise students know this!

Better Which of these terms is associated with Sigmund Freud? A. latent content B. fixed-interval schedule C. cognitive dissonance D. bulimia nervosa ©D. DiBattista 2012

Overview
• Some essential terminology
• The why and the how of testing
• Two challenges in MC testing
• Addressing the challenges
  • Constructing high-quality items
  • Assessing higher-level thinking

Two Challenges in MC Testing
Challenge #2: “Most (MCQs) do no more than test factual recall”
• An emphasis on memory-based items over higher-level items may threaten the content validity of the test.

The Original Bloom’s Taxonomy (highest level to lowest): Evaluation; Synthesis; Analysis; Application; Comprehension; Knowledge

The Revised Bloom’s Taxonomy (Anderson and Krathwohl, 2001)
Knowledge Dimension: Factual; Conceptual; Procedural; Metacognitive
Cognitive Process Dimension: Remember; Understand; Apply; Analyze; Evaluate; Create
See Pages 6-7 of the handout!

The Revised Bloom’s Taxonomy (Anderson and Krathwohl, 2001)
Cognitive Process Dimension: Remember; Understand; Apply; Analyze; Evaluate; Create
• These are all ACTION verbs, i.e., things students can DO with their knowledge.

Good tests allow us to determine what our students are capable of. Can you remember this? Can you understand this? Can you apply this? Can you analyze this? Can you evaluate this? Can you create this? MC questions are not useful for assessing creativity… but they can be used effectively to assess all of the other cognitive processes. Some thoughts about REMEMBER …

ALL assessment tasks involve using memory to at least some degree. BUT “If assessment tasks are to tap higher-order cognitive processes, they must require that students cannot answer them correctly by relying on memory ALONE.” —Anderson and Krathwohl, 2001, page 71 A simple, unfortunate fact: Creating MC items that rely on memory alone is far easier than creating higher-level items. ©D. DiBattista 2012

Let’s take a closer look at how the cognitive processes in the Revised Bloom’s Taxonomy relate to multiple-choice questions.

COGNITIVE PROCESS DIMENSION
REMEMBER: Retrieve relevant knowledge from long-term memory.
  Observable behaviours: Recognize; Recall
UNDERSTAND: Determine the meaning of instructional messages, including oral, written, and graphic communications.
  Observable behaviours: Interpret; Exemplify; Classify; Summarize; Infer; Compare; Explain
See Pages 8-9 of handout for further details.

What city is the capital of the state of California? A. Sacramento B. Los Angeles C. San Francisco D. Fresno Because MC is a selected response technique, remember-level items always involve recognition rather than recall.

What city is the capital of the state of California? If the options were not included in this item, it would involve recall rather than recognition. Remember-level items are very easy to create, which is probably why there are so many of them on classroom tests!

“If assessment tasks are to tap higher-order cognitive processes, they must require that students cannot answer them correctly by relying on memory ALONE.” (Anderson and Krathwohl, 2001, page 71) Interpret. In the graph shown below, which group has the most variability in its scores? A. Group 1 B. Group 2 C. Group 3 D. Group 4 Note the important role of NOVELTY. If the exact same chart is shown in the textbook, then this will actually be a remember-level item!

Classify. You are reading an article in which the world’s major cities are ranked with respect to the quality of life for their residents. This is an example of what type of measurement scale? Exemplify. Which of the following is an example of negative feedback? Summarize. Which of the following statements best summarizes Carol Gilligan’s response to Lawrence Kohlberg’s theory of moral development? ©D. DiBattista 2012

Infer. Which of the words listed below best completes the following analogy? Retina is to Cranial Nerve II as hair cells are to ______. Compare. In what way are a neuron and a battery similar to each other? Explain. Why is the z-test for independent samples so rarely used? ©D. DiBattista 2012

COGNITIVE PROCESS DIMENSION
APPLY: Carry out or use a procedure in a given situation.
  Observable behaviours: Execute; Implement
ANALYZE: Break material into its constituent parts and detect how the parts relate to one another and to an overall structure or purpose.
  Observable behaviours: Differentiate; Organize; Attribute

Execute. Working with an ordinal data scale, Jeff obtained the following five scores: 0, 0, 2, 5, 18. What is the value of the median for this set of scores? A. 0 B. 2 C. 3 D. 5 Execution involves being told what procedure to apply and then carrying it out.
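The procedure named in the stem can be carried out directly. As a minimal sketch using Python's standard `statistics` module (the variable names are chosen here for illustration), the median is the middle value of the sorted scores:

```python
import statistics

# Jeff's five scores, as given in the item stem.
scores = [0, 0, 2, 5, 18]

# With an odd number of scores, the median is the middle value after sorting.
median = statistics.median(scores)
print(median)
```

For these scores the middle value is 2, which corresponds to option B.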

Implement. Working with an ordinal data scale, Jeff obtained the following five scores: 0, 0, 2, 5, 18. What is the value of the most appropriate measure of central tendency for this set of scores? A. 0 B. 2 C. 3 D. 5 Implementation involves deciding what procedure to apply and then carrying it out.
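The extra step in an Implement item is the decision about which procedure fits. The sketch below encodes one conventional textbook rule of thumb (mode for nominal data, median for ordinal data, mean for interval or ratio data); that rule and the helper name `central_tendency` are assumptions added for illustration, not something stated on the slides.

```python
import statistics

def central_tendency(scores, scale):
    """Choose a measure of central tendency from the measurement scale, then compute it."""
    if scale == "nominal":
        return "mode", statistics.mode(scores)
    if scale == "ordinal":
        return "median", statistics.median(scores)
    if scale in ("interval", "ratio"):
        return "mean", statistics.mean(scores)
    raise ValueError(f"unknown measurement scale: {scale!r}")

scores = [0, 0, 2, 5, 18]
measure, value = central_tendency(scores, "ordinal")
print(measure, value)
```

Under this rule the ordinal scale points to the median. The mode (0) and the mean (5) of the same scores also appear among the options, so a student who picks the wrong procedure still finds a matching answer.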

Differentiate. Keri’s history test grade was 70. A total of 200 students took the test, and the lowest score was 30. The class mean was 60, and the variance was 100. Which of these values must you use to obtain Keri’s standard score? A. 30, 70, 100 B. 30, 70, 200 C. 60, 70, 100 D. 60, 70, 200 Differentiation involves distinguishing the parts of a whole with respect to their relevance or importance.
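To see which values are relevant, it helps to write out the usual standard-score formula, z = (score - mean) / standard deviation, where the standard deviation is the square root of the variance. The sketch below assumes that standard formula; the function name `standard_score` is hypothetical.

```python
import math

def standard_score(score, mean, variance):
    """Standard (z) score: distance from the mean in standard-deviation units."""
    return (score - mean) / math.sqrt(variance)

# Only three of the five numbers in the stem are used: 70, 60, and 100.
z = standard_score(score=70, mean=60, variance=100)
print(z)
```

The number of students (200) and the lowest score (30) never enter the calculation, which is why the option listing 60, 70, and 100 contains the relevant values.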

Organize. Suppose you are reviewing the research literature on a particular topic. Which of the following patterns would be most likely to describe the methodological progress of the research over time? A. case studies first, then experimental studies, then correlational studies B. case studies first, then correlational studies, then experimental studies C. experimental studies first, then case studies, then correlational studies D. experimental studies first, then correlational studies, then case studies Organization involves identifying the elements of a situation and recognizing how they fit together into a coherent structure.

Attribute. Which of the following would a Rogerian therapist be MOST likely to say when working with a client? A. You seem to be feeling a bit down today. B. Your dream about going to the zoo: what do you think it might signify? C. You should talk to your sister and find out if she agrees with you. D. There are some things I want you to work on before we meet again next week. Attribution involves determining the point of view, bias, values, or intent associated with a written work or an action.

COGNITIVE PROCESS DIMENSION
EVALUATE: Make judgments based on criteria and standards.
  Observable behaviours: Check; Critique
CREATE: Put elements together to form a novel, coherent whole or make an original product.
  Observable behaviours: Generate; Plan; Produce

Check. Alyssa has carried out a one-way ANOVA for independent groups and rejected the null hypothesis. Which of the following would indicate to you that Alyssa has made an error in her work? A. She says that df-total is 197. B. She says that F-critical is C. She says that the F-statistic is D. She says that eta-squared is Checking involves looking for internal contradictions, determining whether a conclusion is appropriate, and assessing whether evidence supports or disconfirms a hypothesis.

Check. Which of these research findings would suggest that differences in Trait X are influenced by genetic factors? A. Sisters reared apart have more similar scores on X than sisters reared together. B. Sisters reared together have more similar scores on X than sisters reared apart. C. Identical twins reared together have more similar scores on X than fraternal twins reared together. D. Fraternal twins reared together have more similar scores on X than identical twins reared together.

Critique. Bill wants to compare the effectiveness of two training methods for teaching people to juggle. He obtains a group of non-jugglers and randomly assigns each person to one of the two training methods. He sets alpha at 0.05, two-tailed, and he determines that beta is equal to Which of the following is a valid criticism of this research study? A. The power of the statistical test is too low. B. The probability of a Type I error is too high. C. He should use a one-tailed test. D. People should select their own training method. Critiquing involves assessing the positive and negative aspects of a product, idea or action and making a judgment based on external criteria.

CREATE: Put elements together to form a novel, coherent whole or make an original product.
  Observable behaviours: Generate; Plan; Produce
Because multiple choice is a selected response technique, it is NOT useful for assessing the ability to create. Other testing techniques are needed to do this.

Overview
• Some essential terminology
• The why and the how of testing
• Two challenges in MC testing
• Addressing the challenges
  • Constructing high-quality items
  • Assessing higher-level thinking
