Test and Scale Development


Test and Scale Development Margaret Wu

Item Development
Development of a framework and test blueprint
Draft items
Item panelling (shredding!)
Iterative process: draft items to illustrate, clarify and sharpen the framework; the framework guides item development.

Framework and Test Blueprint - 1
Clearly identify:
‘Why’ you are assessing (purpose)
‘Whom’ to assess (population)
‘What’ to assess (construct domain)
Define parameters for the test, e.g.:
Duration of the test and test administration procedures
Scoring/marking constraints; item formats
Other issues: security, feedback.

Specifying the Purpose
How will the results be used?
Determine pass/fail, satisfactory/unsatisfactory
Award prizes
Provide diagnostic information
Compare students
Set standards
Provide information to policy makers
Who will use the information?
Teachers, parents, students, managers, politicians

Specifying the Population
Grade, age level. In an industry. Profession.
Ethnicity/culture/language issues. Gender.
Notion of population and sample:
Sampling method: random, convenience
Size of sample
Validity of test results can depend on the population/sample you are assessing.

Specifying the Construct Domain - Examples
Familiarity with sport: what is meant by ‘sport’? Include “gym”? “Taichi”? “Gymnastics”? In Australian contexts?
Define problem solving:
As viewed in the workforce: workforce competencies
As viewed in the cognitive sciences: cognitive processes such as decoding, reasoning, domain-specific knowledge
Interpersonal skills: negotiation/conflict resolution skills, leadership skills, working with people from diverse backgrounds

Specifying the Construct Domain – Examples Achievement domains: Content oriented: Number, measurement, data, algebra Competency oriented: Conceptual understanding, procedural knowledge, problem solving Taxonomy of educational objectives (Bloom’s taxonomy of learning outcomes): cognitive and affective.

Bloom's Taxonomy - cognitive
Knowledge
Comprehension
Application
Analysis
Synthesis
Evaluation

Considerations in defining the construct of a test
Validity consideration: does the construct cover what the test is claimed to be assessing? E.g., language proficiency: speaking, listening, reading, writing.
Measurement consideration: how well do the specifications for a construct "hang together" to provide meaningful scores? The idea of "unidimensionality".
On-balance judgement: boundaries are never clear.

Test Blueprint
Sufficiently detailed so that test developers can work from the specifications:
Range of difficulty
Target reliability
Item format
Weights of sub-domains
Test administration procedures: timing, equipment, resources
Marking requirements

Test Blueprint – example (PISA Reading)

Aspect                    % of test   % constructed   % MC
Retrieving information        20             7          13
Broad understanding
Interpretation                30            11          19
Reflecting on content         15            10           5
Reflecting on form
Total                        100

Uses of Frameworks & Blueprints
To guide item development: don't ignore the specifications; cross-check with the specs constantly.
To ensure that there is a clear and well-defined construct that stays stable from one testing occasion to another: different item writing teams, parallel tests.

Item Writing: Science or Art?
Creativity following scientific principles:
Established procedures to guide good item development (as covered in this course)
Inspiration, imagination and originality (difficult to teach, but can be gained through experience)
The most important prerequisite is subject-area expertise: the teacher's craft.

Item Writers
Best done by a team.
A 24-hour job! Ideas emerge, not necessarily in item writing sessions, or even during office hours.
An idea appears as a rough notion, like an uncut stone. It needs shaping, polishing, and many reworks!
Keep a notebook for item ideas. Have a camera ready!

Make items interesting!

Capture Potential Item Ideas

(Photo: a lattice.)

But, no tricks
Keep materials interesting, but don't try to "trick" students: no trickery (as in trying to mislead), though items can be tricky (as in difficult).
Don't dwell on trivial points; there is no room to waste test space.
Think of the bigger picture of the meaning of "ability" in the domain being tested.
Every item should contribute one good piece of information about the overall standing of a student in the domain being tested. Collectively, all items need to provide one measure of a single "construct".

Item Types
Multiple choice:
Easiest to score
Not good face validity
Research shows MC items have good concurrent validity and reliability, despite the guessing factor
Constructed response:
High face validity
Difficult to score
Marker reliability is an issue

Writing Multiple Choice Items
What is a multiple choice item?

Is this an MC item?
If August 1st is a Monday, what day of the week is August 7th?
A. Sunday
B. Monday
C. Tuesday
D. Wednesday
E. Thursday
F. Friday
G. Saturday

Writing Multiple Choice Items
Many students think MC items are easier than open-ended items, and they often will not study as hard if they know the test consists of MC items only. They often try to memorise facts, because they think that MC items can only test facts. This promotes rote learning. We must discourage this.

Test-wise strategies for MC items
Pick the longest answer.
Pick "b" or "c"; they are more likely than "a" or "d".
Pick the scientific-sounding answer.
Pick a word related to the topic.
We must demonstrate that there are no clear strategies for guessing an answer.

Item format can make a difference to cognitive processes - 1
Make sure that we are testing what we think we are testing.
The following is a sequence: 3, 7, 11, 15, 19, 23, …
What is the 10th term in this sequence?
A. 27
B. 31
C. 35
D. 39
67% answered correctly (D). 24% chose A (the next term, 27); that is, about a quarter of students worked out the pattern of the sequence but missed the phrase "10th term".
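For reference, the intended working is a standard arithmetic-sequence calculation (added here for clarity; it is not spelled out on the original slide):

```latex
a_n = a_1 + (n-1)d, \qquad a_{10} = 3 + (10-1)\times 4 = 39 \quad \text{(option D)}
```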

Item format can make a difference to cognitive processes - 2
The following is a sequence: 2, 9, 16, 23, 30, 37, …
What is the 10th term in this sequence?
A. 57
B. 58
C. 63
D. 65
85% answered correctly (D), even though this item is arguably more difficult than the previous one (counting by 7 instead of by 4). The next number in the sequence ("44") is not among the distractors.

Item format can make a difference to cognitive processes - 3
16x - 7 = 73. Solve for x.
A. 5
B. 6
C. 7
D. 8
Substitution is one strategy: substitute 5, 6, 7 and 8 for x and see which gives 73.

Item format can make a difference to cognitive processes - 4
The fact that the answer is present in a list can alter the process of solving a problem. Students look for clues in the options. That can interfere with the cognitive processes the test setter has in mind.

Avoid confusing language - 1
Avoid using similar names:
Peter, Petra, Mary and Mark
Democratic Progressive, People's Democratic, Progressive Socialist party
Best Butter and Better Margarine
Minimise the amount of reading if the test is not about reading.
Avoid "irrelevant" material.

Avoid confusing language - 2
Square slate slabs, 1 m by 1 m, are paved around a 10 m by 8 m rectangular pool. How many such slabs are needed? Show your work.

Evidence of language confusion
Student drawings; item statistics:

Item 17: SLAB017R   Weighted MNSQ = 1.17   Disc = 0.39
Category        0 [0]     1 [0]    2 [1]    3 [2]
(Ans.)          (other)   (80)     (36)     (40)
Count           11        36       17       9
Percent (%)     15.1      49.3     23.3     12.3
Pt-Biserial     -0.39     0.06     -0.13    0.50
Mean Ability    -0.98     0.25     -0.10    2.16

While the question meant "paving around the outside of the swimming pool", many students thought it meant "around the inside of the swimming pool" (hence the answer "80").

Improve the language
Square slate slabs, 1 m by 1 m, are paved around the outside of a 10 m by 8 m rectangular pool. How many such slabs are needed? Show your work.
(Diagram: tiles around the pool.)
Added the words "the outside of", and a diagram, to clarify the meaning.

Improved item statistics

Item 3: slab04R   Weighted MNSQ = 1.02   Disc = 0.49
Category        0 [0]     1 [0]    2 [1]    3 [2]    4 [2]
(Ans.)          (other)   (80)     (36)     (32)     (40)
Count           109       111      125      35       53
Percent (%)     22.7      23.1     26.0     7.3      11.0
Pt-Biserial     -0.21     -0.09    0.04     0.23     0.39
Mean Ability    -0.41     -0.14    0.06     0.76     1.11
StDev Ability   0.96      0.77     0.73     0.66     0.81

The percentage of students who gave the answer "80" has been halved. The fit of this item to the item response model has improved, and the discrimination index has improved.
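As an illustration of how statistics like these can be produced, here is a minimal Python sketch of a distractor analysis. It assumes item responses are coded by category and that some ability proxy (e.g. an estimated ability or rest score) is available; the function name and the sample data are illustrative, not taken from the study above.

```python
import numpy as np

def distractor_stats(responses, scores, ability):
    """Distractor analysis for one item.

    responses : list of the category each student gave for this item
    scores    : dict mapping category -> score awarded (e.g. {"40": 2, "36": 1})
    ability   : list of an ability proxy for each student
    """
    responses = np.asarray(responses)
    ability = np.asarray(ability, dtype=float)
    stats = {}
    for cat in np.unique(responses):
        chose = (responses == cat).astype(float)
        stats[cat] = {
            "score": scores.get(cat, 0),
            "count": int(chose.sum()),
            "percent": 100 * chose.mean(),
            # Point-biserial: correlation between choosing this category and ability
            "pt_biserial": float(np.corrcoef(chose, ability)[0, 1]),
            "mean_ability": float(ability[responses == cat].mean()),
        }
    return stats

# Illustrative use with made-up data (not the real slab-item data)
resp = ["40", "80", "36", "80", "40", "other", "36", "40"]
abil = [1.8, 0.1, -0.2, 0.3, 2.0, -1.0, 0.0, 1.5]
print(distractor_stats(resp, {"40": 2, "36": 1}, abil))
```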

Partial Credit for MC options
Which of the following is the capital city of Australia?
A. Brisbane
B. Canberra
C. Sydney
D. Vancouver
E. Wellington
Partial credit can be awarded even for multiple choice items. In the example above, the answer "Vancouver" is clearly "worse" than the answer "Sydney". One could give a score of 2 to option B, and a score of 1 to options A and C. Do not assume that multiple choice items can only be scored "right" or "wrong".
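A minimal sketch of how such a partial-credit key could be represented; the option-to-score mapping follows the slide (B = 2; A and C = 1; others 0), while the function and constant names are illustrative.

```python
# Partial-credit key for the capital-city item above (scores as suggested on the slide).
PARTIAL_CREDIT_KEY = {"A": 1, "B": 2, "C": 1, "D": 0, "E": 0}

def score_response(option: str) -> int:
    """Return the partial-credit score for a selected option (0 if unrecognised)."""
    return PARTIAL_CREDIT_KEY.get(option.strip().upper(), 0)

print(score_response("B"), score_response("C"), score_response("D"))  # 2 1 0
```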

Avoid dependency between items
One glass holds 175 ml of water. If I pour three glasses of water into a container, how much water do I have?
If I dissolve 50 g of gelatin in the container, what is the proportion of gelatin to water?
When items are dependent, in that the answer to one item depends on the answer to another, one does not collect "independent" information from each item, and the total score becomes difficult to interpret.

Formatting MC items
Put options in a logical, alphabetical, or numerical order, e.g.:
11-13
14-17
18-22
23-40
A vertical layout is better than a horizontal one.

MC options - 1
Terminology: "key" and "distractors".
Don't use "All of the above".
Use "None of the above" with caution.
Keep the length of options similar; students like to pick the longest, more scientific-sounding ones.
Use each alternative (a, b, c, d) as the key roughly the same number of times across the test.

MC options - 2
Avoid having an odd one out.
Which word means the same as amiable in this sentence? Because Leon was an amiable person, he was nice to everyone.
A. friendly
B. strict
C. moody
D. mean

MC options - 3
How many options should be provided for an MC item? 4? 5? 3?
It is not necessary to pre-determine a fixed number of MC options; it depends on the specific item.
Which two of the primary colours red, blue and yellow make up green? (1) red and blue, (2) red and yellow, (3) blue and yellow: three options.
Which day of the week is August 3? Seven options.

Testing higher-order thinking with MC
Close the textbook when you write items: if you can't remember it, don't ask the students.
Lower-order thinking item: What is the perimeter of the following shape? (Diagram: a shape with sides labelled 15 m and 9 m.)

A better item for testing higher-order thinking skills
Which two shapes have the same perimeter? (Diagram: four shapes labelled A, B, C and D.)

MC can be useful - 1
When open-ended is too difficult:
A small hose can fill a swimming pool in 12 hours, and a large hose can fill it in 3 hours. How long will it take to fill the pool if both hoses are used at the same time?
A. 2.4 hours
B. 4.0 hours
C. 7.5 hours
D. 9.0 hours
E. 15.0 hours
This item is too difficult for Grade 6 students in terms of the mathematics involved. However, the item is intended to test students' sense-making ability: when two hoses are used, the time required to fill the pool must be less than the time taken by either hose alone, so the only plausible answer is A, 2.4 hours. If this item were open-ended, very few students would be able to carry out the correct mathematics.
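For completeness, the underlying calculation, which the item does not actually require of students and which is not shown on the slide:

```latex
\frac{1}{12} + \frac{1}{3} = \frac{5}{12}\ \text{pool per hour}
\qquad\Rightarrow\qquad
t = \frac{12}{5} = 2.4\ \text{hours (option A)}
```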

MC can be useful - 2
To avoid vague answers, e.g.:
How often do you watch sport on TV?
Answers: "When there is nothing else to watch on TV." "Once in a while." "A few times a year."

MC: problem with face validity
Music performance
IT familiarity
Pilot licence testing
Language proficiency
Problem solving – not just the MC solution format; reading gets in the way as well; general validity issues

Summary about MC items
Don't be afraid to use MC items.
Check the cognitive processes required, since the answer is given among the options.
Make sure the distractors do not distract in unintended ways.
Make sure the key is not attractive for unintended reasons.

Other Closed Constructed Item Formats

True/false
(Diagram: four garden bed designs, A to D, with dimensions 10 m and 6 m marked.)
Circle either Yes or No for each design to indicate whether the garden bed can be made with 32 metres of timber.

Garden bed design   Using this design, can the garden bed be made with 32 metres of timber?
Design A            Yes / No
Design B            Yes / No
Design C            Yes / No
Design D            Yes / No

True/false
Be aware of the high chance of guessing.
Consider an appropriate scoring rule, e.g.:
Each statement counts for 1 score point
All statements correct = 1 score point
Something in between
Examine item "model fit" to guide the scoring decision (see the sketch below).
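A minimal sketch of the three scoring rules mentioned above, assuming the true/false set is answered as a block; the answer key and the threshold of three correct are illustrative assumptions, not values from the slides.

```python
from typing import List

def score_per_statement(answers: List[bool], key: List[bool]) -> int:
    """Rule 1: each statement counts for one score point."""
    return sum(a == k for a, k in zip(answers, key))

def score_all_or_nothing(answers: List[bool], key: List[bool]) -> int:
    """Rule 2: one point only if every statement is answered correctly."""
    return int(all(a == k for a, k in zip(answers, key)))

def score_threshold(answers: List[bool], key: List[bool], min_correct: int = 3) -> int:
    """A 'something in between' rule: one point if at least min_correct are right."""
    return int(score_per_statement(answers, key) >= min_correct)

key = [True, False, True, False]           # hypothetical key for designs A-D
student = [True, False, False, False]      # hypothetical student responses
print(score_per_statement(student, key))   # 3
print(score_all_or_nothing(student, key))  # 0
print(score_threshold(student, key))       # 1
```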

Matching/Ordering
A neighbourhood committee of a city decided to create a public garden in a run-down area of about 4000 m². Arrange the actions in sequence, from 1st phase to 5th phase.
Actions:
A. Buying materials and plants.
B. Issuing the authorisations.
C. Project designing.
D. Care and maintenance.
E. Building the garden.

Matching/Ordering
Useful for testing relationships.
Easy to mark.
Needs to be treated as one single item, as there is dependency between the responses.

More generally on item writing
Are you really testing what you think you are testing? For example:
In a reading test, can you arrive at the correct answer without reading the stimulus?
In a science test, can you extract the answer from the stimulus alone, rather than from the scientific knowledge the test claims to assess?
In a maths test, is the stumbling block understanding the stimulus, or solving the problem?

Constructed Response Items

Non multiple choice formats
Examples: constructed response; performance.
Motivation: face validity, for testing higher-order thinking.
School reform: avoid teaching to the multiple choice format, and avoid testing fragmented knowledge and skills.

Caution about the performance format
Check validity carefully. E.g., an evaluation of the Vermont statewide "portfolio" assessment (1991) concluded that the assessments had low reliability and validity.
Problems with rater judgement and reliable scoring, e.g., quality of handwriting; presentation.
3-10 times more expensive.
Bennett & Ward (1993); Osterlind (1998); Haladyna (1997)

This slide may have some truth to it!
A study was carried out examining the differences in scores between a writing task administered online and one administered on paper-and-pencil. The next slide shows some results from this study.

Example - a study comparing online and paper writing tasks
A writing task was administered online and on paper. Online scores were found to be lower than paper-and-pencil scores. Low-ability students do "better" on the paper-and-pencil writing task, by about 1 score point out of 10.

Improve between-rater agreement
Clear and comprehensive marking guide
Need trialling to get a wide range of responses
Need training for markers
Need monitoring for marker leniency/harshness
Better to mark by item than by student, to reduce dependency between items
On the last point: markers should mark question 1 for all students, and then question 2 for all students, rather than marking all questions for student 1, then all questions for student 2. This is because of the so-called "halo" effect: markers form an impression about a student's work. For example, a marker often thinks of a student as an "A" student or a "B" student, and tends to award similar scores when marking all questions for that student. More independent markings are obtained if marking is done by item rather than by student booklet. Of course, sometimes the logistics may be too difficult and this will not be possible.

Work towards some middle ground?
Constructed response format with computer-assisted scoring:
(Diagram: a floor plan of a house, 6.4 m by 9.6 m.)
Estimate the floor area of the house.
Capture the raw numeric response, e.g., 61.44, 60, 6144. The computer will recode and score.

Computer-assisted scoring
Formulate firm scoring rules AFTER we examine the data.
Other examples: household spending; hours spent on homework.
The idea is to capture the maximum amount of information at the lowest cost. Capture all the different responses; categories can always be collapsed later.
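A minimal sketch of the kind of recoding a computer-assisted scorer might apply to the floor-area item on the previous slide; the tolerance band, score values and function name are illustrative assumptions, not rules stated in the slides.

```python
# Recode raw numeric responses to the floor-area item (6.4 m x 9.6 m, true area 61.44 m^2).

def score_floor_area(raw: str) -> int:
    """Recode a captured raw response into a score category."""
    try:
        value = float(raw.replace("m2", "").replace("m^2", "").strip())
    except ValueError:
        return 0                      # non-numeric response
    if value == 6144:                 # digits right, decimal point dropped
        return 1
    if abs(value - 61.44) <= 2:       # exact answer or a reasonable estimate (e.g. 60)
        return 2
    return 0

for raw in ["61.44", "60", "6144", "96"]:
    print(raw, "->", score_floor_area(raw))
```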

Scoring – formal vs psychometric
Technically correct/incorrect, versus manifestation of latent ability.
In deciding how to score a response, always think of the level of latent ability needed to produce that response.
E.g., in which sport is the Bledisloe Cup competed for? Sample answers: rugby union, rugby league, rugby, sailing. How should these be scored? Where are the corresponding levels of familiarity with sport?

Psychometric considerations in scoring - 1
Consider where you would place a person on the ability continuum, and score the response according to that location on the continuum. Measurement is about predicting a person's level of ability.
In which sport is the Bledisloe Cup competed for?
(Diagram: a scale of familiarity with sport, running from "Not familiar with sport" to "Familiar with sport", with the responses Sailing, Rugby league, Rugby and Rugby union placed along it in that order, and regions marked Score 0 and Score 1.)

Psychometric considerations in scoring - 2
It may be better to place the persons as follows:
In which sport is the Bledisloe Cup competed for?
(Diagram: the same scale of familiarity with sport, with Score 0 at the lower end and the Score 1 band stretched to cover a wider range of the responses.)

Scoring - another example
What is the area of the following shape? (Diagram: a shape with dimensions 4 m and 8 m.)
Consider these responses: 16 m²; 16 m; 16; 32 m²; 32; 12 m²; no response. How should these be scored?
Where are the levels of latent ability corresponding to these responses? Ideally, we need scoring that satisfies both technical soundness and psychometric properties.

How to decide on weights?
Should more difficult items get higher scores? Should items requiring more time get higher scores?
In general, more difficult items should not carry more weight, unless they are more discriminating. One rationale is that there should be an equal penalty whether a person fails on an easy item or on a difficult item. If all items tap into the same "latent variable", then a person high on the latent variable is unlikely to get easy items wrong, so having easy items wrong and difficult items right is an indication that the items do not tap into the same construct. On the other hand, if all items do tap into the same construct, then item difficulty should not play a part in the weighting of the score.
If a test is not speeded, that is, everyone has enough time to complete the test, then items that require more time to complete should not get more weight. If a test is speeded, students who completed items that required more time may be disadvantaged, as they did not have as much opportunity to complete shorter items. It is recommended that tests should not be speeded.

Partial Credit Scoring
If the data support partial credit scoring*, then it is better to use partial credit rather than dichotomous scoring. Information is lost if dichotomous scoring is used.
*Data support partial credit scoring when the average ability for each score category increases with increasing score, and the point-biserial increases in the order of the score categories (see the check sketched below).
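A minimal sketch of how the footnote's condition might be checked, assuming category-level statistics like those in the slab-item tables are available; the function name and the exact monotonicity test are assumptions rather than a procedure given in the slides.

```python
def supports_partial_credit(categories):
    """categories: list of (score, mean_ability, pt_biserial), one tuple per category."""
    ordered = sorted(categories, key=lambda c: c[0])          # order by assigned score
    abilities = [c[1] for c in ordered]
    pt_biserials = [c[2] for c in ordered]
    increasing = lambda xs: all(a <= b for a, b in zip(xs, xs[1:]))
    # Both mean ability and point-biserial should rise with the score category
    return increasing(abilities) and increasing(pt_biserials)

# Category statistics from the improved slab item: (score, mean ability, pt-biserial)
slab = [(0, -0.41, -0.21), (0, -0.14, -0.09), (1, 0.06, 0.04), (2, 0.76, 0.23), (2, 1.11, 0.39)]
print(supports_partial_credit(slab))  # True
```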

Practical guide to partial credit scoring
Within an item: increasing score should correspond with increasing proficiency/ability.
Across items: the maximum score for each item should correspond with the amount of "information" provided by the item about students' proficiency/ability.