Test and Scale Development Margaret Wu
Item Development Development of a Framework and Test Blueprint Draft items Item panelling (shredding!) Iterative process: –Draft items to illustrate, clarify and sharpen the framework. –The framework, in turn, guides item development.
Framework and Test Blueprint - 1 Clearly identify –Why you are assessing (Purpose) –Whom to assess (Population) –What to assess (Construct domain) Define parameters for the test, e.g.: –Duration of the test and test administration procedures –Scoring/marking constraints; item formats –Other issues: security, feedback.
Specifying the Purpose How will the results be used? –Determine pass/fail, satisfactory/unsatisfactory –Award prizes –Provide diagnostic information –Compare students –Set standards –Provide information to policy makers Who will use the information? –Teachers, parents, students, managers, politicians
Specifying the Population Grade, age level. In an industry. Profession. Ethnicity/Culture/Language issues. Gender. Notion of population and sample. Sampling method: random, convenience Size of sample Validity of test results could depend on the population/sample you are assessing
Specifying the Construct Domain - Examples Familiarity with sport: –What is meant by sport? Include Gym? Taichi? Gymnastics? In Australian contexts? Define Problem Solving: –As viewed in the workforce Workforce competencies –As viewed in the cognitive sciences Cognitive processes: decoding, reasoning, domain specific knowledge Interpersonal skills –Negotiation/conflict resolution skills –Leadership skills –Work with people from diverse backgrounds
Specifying the Construct Domain – Examples Achievement domains: –Content oriented: Number, measurement, data, algebra –Competency oriented: Conceptual understanding, procedural knowledge, problem solving Taxonomy of educational objectives (Bloom's taxonomy of learning outcomes): cognitive and affective.
Considerations in defining the Construct of a test Validity Consideration –Does the construct cover what the test is claimed to be assessing? E.g., language proficiency: speaking, listening, reading, writing Measurement Consideration –How well do the specifications for a construct hang together to provide meaningful scores? The idea of unidimensionality On-balance judgement –Boundaries are never clear
Test Blueprint Sufficiently detailed so that test developers can work from these specifications. –Range of difficulty –Target reliability –Item format. –Weights of sub-domains –Test administration procedures Timing, equipment, resources –Marking requirements
Test Blueprint – example (PISA Reading)

Aspect                   % of test   % constructed   % MC
Retrieving information      20             7          13
Broad understanding         20             7          13
Interpretation              30             –           –
Reflecting on content       15            10           5
Reflecting on form          15            10           5
Total                      100
Uses of Frameworks & Blueprints To guide item development –Don't ignore specifications. Cross-check with specs constantly. To ensure that there is a clear and well-defined construct that can be stable from one testing occasion to another. –Different item writing teams –Parallel tests
Item Writing Science or Art? –Creativity following scientific principles –Established procedures to guide good item development (as covered in this course) –Inspiration, imagination and originality (difficult to teach, but can be gained through experience) The most important pre-requisite is subject area expertise –Teacher's craft
Item Writers Best done by a team 24-hour job! –Ideas emerge, not necessarily in item writing sessions, or even during office hours. –An idea appears as a rough notion, like an uncut stone. It needs shaping and polishing, and many re-works! –Keep a notebook for item ideas. –Have a camera ready!
Make items interesting!
Capture Potential Item Ideas
But, no tricks Keep materials interesting, but don't try to trick students –i.e. no trickery (as in trying to mislead) –but items can be tricky (as in difficult) Don't dwell on trivial points. There is no room to waste test space. Think of the bigger picture of the meaning of ability in the domain of testing. Every item should contribute one good piece of information about the overall standing of a student in the domain being tested. Collectively, all items need to provide one measure on a single construct.
Item Types Multiple choice –Easiest to score –Poor face validity –Research shows MC items have good concurrent validity and reliability, despite the guessing factor Constructed –High face validity –Difficult to score –Marker reliability is an issue
Writing Multiple Choice Item What is a multiple choice item?
Is this a MC item? If August 1st is a Monday, what day of the week is August 7th? A. Sunday B. Monday C. Tuesday D. Wednesday E. Thursday F. Friday G. Saturday
Writing Multiple Choice Items Many students think MC items are easier than open-ended items, and they often will not study as hard if they know the test consists of MC items only. They often try to memorise facts, because they think that MC items can only test facts. This promotes rote-learning. We must discourage this.
Test-wise strategies for MC items Pick the longest answer Pick b or c. They are more likely than a or d. Pick the scientific sounding answer. Pick a word related to the topic. We must demonstrate that there are no clear strategies to guess an answer.
Item format can make a difference to cognitive processes -1 Make sure that we are testing what we think we are testing –The following is a sequence: 3, 7, 11, 15, 19, 23, … What is the 10th term in this sequence? A 27 B 31 C 35 D 39 67% correct (ans D). 24% chose A. That is, about ¼ of students worked out the pattern of the sequence but missed the phrase "10th term".
Item format can make a difference to cognitive processes -2 The following is a sequence: 2, 9, 16, 23, 30, 37, … What is the 10th term in this sequence? A 57 B 58 C 63 D 65 85% correct, even though this item is considered more difficult than the previous one (counting by 7 instead of by 4). The next number in the sequence (44) is not a distractor.
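A quick way to sanity-check sequence items like the two above is to compute the terms directly; the helper below is an illustrative sketch, not part of the slides (the function name is ours).

```python
def nth_term(first, step, n):
    """n-th term (1-indexed) of an arithmetic sequence."""
    return first + step * (n - 1)

# Item 1: 3, 7, 11, ...  key D (39); distractor A (27) is the NEXT term
print(nth_term(3, 4, 10))  # 39
print(nth_term(3, 4, 7))   # 27

# Item 2: 2, 9, 16, ...  key D (65); the next term, 44, is not offered
print(nth_term(2, 7, 10))  # 65
print(nth_term(2, 7, 7))   # 44
```

Building distractors from predictable wrong processes (such as "the next term") is what made the two items behave so differently.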
Item format can make a difference to cognitive processes -3 16x - 7 = 73. Solve for x. –A. 5 –B. 6 –C. 7 –D. 8 Substitution is one strategy: substitute 5, 6, 7, 8 for x and see if the answer is 73.
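The substitution strategy the slide describes can be made concrete; a minimal sketch (names are ours, not from the slides):

```python
def keys_by_substitution(options, satisfies):
    """Return the options that satisfy the equation when substituted."""
    return [x for x in options if satisfies(x)]

# 16x - 7 = 73: trying each option finds the key without doing any algebra
print(keys_by_substitution([5, 6, 7, 8], lambda x: 16 * x - 7 == 73))  # [5]
```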
Item format can make a difference to cognitive processes -4 The fact that the answer is present in a list can alter the process of solving a problem. Students look for clues in the options. That can interfere with the cognitive processes the test setter has in mind.
Avoid confusing language - 1 Avoid using similar names. –Peter, Petra, Mary and Mark. –Democratic Progressive, People's Democratic, Progressive Socialist party –Best Butter and Better Margarine Minimise the amount of reading, if the test is not about reading. Avoid irrelevant material.
Avoid confusing language - 2 Square slate slabs, 1 m by 1 m, are paved around a 10 m by 8 m rectangular pool. How many such slabs are needed? Show your work.
Improve the language Square slate slabs, 1 m by 1 m, are paved around the outside of a 10 m by 8 m rectangular pool. How many such slabs are needed? Show your work. (figure: tiles around the pool) Added the words "the outside of", and a diagram, to clarify the meaning.
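A worked check of the intended answer, assuming the usual reading of the item (a border one slab wide around the outside, corners included); the function is ours:

```python
def slabs_around_pool(length_m, width_m):
    """1 m x 1 m slabs in a one-slab-wide border around a rectangular pool."""
    outer_area = (length_m + 2) * (width_m + 2)  # pool plus 1 m on each side
    return outer_area - length_m * width_m       # border area = slab count

print(slabs_around_pool(10, 8))  # 40
```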
Partial Credit for MC options Which of the following is the capital city of Australia? A Brisbane B Canberra C Sydney D Vancouver E Wellington
Avoid dependency between items One glass holds 175 ml of water. If I pour three glasses of water into a container, how much water would I have? If I dissolve 50 g of gelatin in the container, what is the proportion of gelatin to water?
Formatting MC items Options in logical, alphabetic, or numerical order. Vertical layout is better than horizontal.
MC options - 1 Terminology: key and distractors. Don't use "All of the above". Use "None of the above" with caution. Keep the length of options similar. Students like to pick the longest, often more scientific-sounding ones. Use each alternative (a, b, c, d) as the key about the same number of times.
MC options - 2 Avoid having an odd one out. –Which word means the same as amiable in this sentence? Because Leon was an amiable person, he was nice to everyone. A.friendly B.strict C.moody D.mean
MC options - 3 How many options should be provided for a MC item? 4? 5? 3? It is not necessary to pre-determine a fixed number of MC options. It depends on the specific item –Which two of the primary colours, red, blue and yellow make up green? –(1) red and blue, (2) red and yellow, (3) blue and yellow –Which day of the week is August 3? –7 options.
Testing higher-order thinking with MC Close the textbook when you write items. If you can't remember it, don't ask the students. Lower-order thinking item: –What is the perimeter of the following shape? (figure: a shape with sides labelled 15 m and 9 m)
A better item for testing higher-order thinking skills Which two shapes have the same perimeter? (figure: four shapes labelled A, B, C and D)
MC can be useful - 1 When open-ended is too difficult A small hose can fill a swimming pool in 12 hours, and a large hose can fill it in 3 hours. How long will it take to fill the pool if both hoses are used at the same time? –A. 2.4 hours –B. 4.0 hours –C. 7.5 hours –D. 9.0 hours –E. 15.0 hours
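The intended solution, worked with exact fractions (the variable names are ours):

```python
from fractions import Fraction

# Rates: the small hose fills 1/12 of the pool per hour, the large hose 1/3.
combined_rate = Fraction(1, 12) + Fraction(1, 3)  # 5/12 of the pool per hour
hours = 1 / combined_rate                         # time to fill one pool
print(float(hours))  # 2.4 -> option A
```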
MC can be useful - 2 To avoid vague answers, e.g., How often do you watch sport on TV? Ans: –When there is nothing else to watch on TV. –Once in a while –A few times a year
MC: problem with face validity Music performance IT familiarity Pilot licence testing Language proficiency Problem solving – not just with the MC solution format; reading gets in the way as well; general validity issues
Summary about MC items Don't be afraid to use MC items. Check the cognitive processes required, as the answer is given among the options. Make sure the distractors do not distract in unintended ways. Make sure the key is not attractive for unintended reasons.
Other Closed Constructed Item Formats
True/false Circle either Yes or No for each design to indicate whether the garden bed can be made with 32 metres of timber. (figure: four garden bed designs A, B, C and D, with dimensions 10 m and 6 m)

Garden bed design   Using this design, can the garden bed be made with 32 metres of timber?
Design A            Yes / No
Design B            Yes / No
Design C            Yes / No
Design D            Yes / No
True/false Be aware of the high chance of guessing Consider an appropriate scoring rule, e.g.: –Each statement counts as 1 score point –All statements correct = 1 score point –Something in-between –Examine item model fit to guide the scoring decision
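The alternative scoring rules listed above can be sketched for a four-statement set; the responses and the "in-between" cap are made up for illustration:

```python
# 1 = statement answered correctly, 0 = incorrectly (hypothetical data)
responses = [1, 1, 0, 1]

score_per_statement = sum(responses)        # each statement counts 1 point
score_all_or_nothing = int(all(responses))  # 1 only if every statement correct
score_in_between = min(sum(responses), 2)   # e.g. cap the credit at 2 points

print(score_per_statement, score_all_or_nothing, score_in_between)  # 3 0 2
```

Which rule is best is an empirical question; as the slide says, item model fit should guide the decision.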
Matching/Ordering A neighbourhood committee of a city decided to create a public garden in a run-down area of about 4000 m². Arrange the actions in sequence (1st phase to 5th phase). Actions: A. Buying materials and plants. B. Issuing the authorisations. C. Project designing. D. Care and maintenance. E. Building the garden.
Matching/Ordering Useful to test relationships Easy to mark. Need to be treated as one single item, as there is dependency between the responses.
More generally on item writing Are you really testing what you think you are testing? For example, –in a reading test, can you arrive at the correct answer without reading the stimulus? –in a science test, can you extract the information from the stimulus alone, rather than from the scientific knowledge being tested? –in a maths test, is the stumbling block understanding the stimulus, or solving the problem?
Constructed Response Items
Non-multiple-choice formats Examples: –Constructed response –Performance Motivation: –Face validity, for testing higher-order thinking –School reform: avoid multiple-choice teaching, and avoid testing fragmented knowledge and skills.
Caution about Performance format Check validity carefully –E.g., an evaluation of the Vermont statewide portfolio assessment (1991) concluded that the assessments had low reliability and validity. Problems with rater judgement and reliable scoring –E.g., quality of handwriting; presentation 3-10 times more expensive Bennett & Ward (1993); Osterlind (1998); Haladyna (1997)
Example - a study comparing online and paper writing tasks A writing task was administered online and on paper. Online scores were found to be lower than paper-and-pencil scores. Low-ability students do better on the paper-and-pencil writing task, by about 1 score point out of 10.
Improve between-rater agreement Clear and comprehensive marking guide –Need trialling to get a wide range of responses Need training for markers Need monitoring for marker leniency/harshness Better to mark by item than by student – to reduce dependency between items
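One common statistic for monitoring marker agreement (not named on the slide) is Cohen's kappa; a minimal sketch with made-up ratings:

```python
def cohens_kappa(rater1, rater2, categories):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    expected = sum((rater1.count(c) / n) * (rater2.count(c) / n)
                   for c in categories)
    return (observed - expected) / (1 - expected)

# Hypothetical scores (0-2) given by two markers to eight responses
r1 = [0, 1, 1, 2, 2, 2, 0, 1]
r2 = [0, 1, 2, 2, 2, 1, 0, 1]
print(round(cohens_kappa(r1, r2, [0, 1, 2]), 2))  # 0.62
```

Low kappa for a marker, relative to the panel, can flag the leniency/harshness problems the slide mentions.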
Work towards some middle ground? Constructed response format with computer-assisted scoring: Estimate the floor area of the house (figure: a floor plan with sides 6.4 m and 9.6 m). Capture the raw numeric response; the computer will recode and score it.
Computer assisted scoring Formulate firm scoring rules AFTER we examine the data. Other examples: –Household spending –Hours spent on homework The idea is to capture the maximum amount of information at the lowest cost. Capture all different responses; categories can always be collapsed later.
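The "capture raw responses, recode later" idea can be sketched for the floor-area item (true area 6.4 m × 9.6 m = 61.44 m²); the tolerance thresholds are hypothetical, exactly the kind of rule the slide says to firm up only after examining data:

```python
TRUE_AREA = 6.4 * 9.6  # 61.44 m^2

def score_area_estimate(raw):
    """Recode a raw numeric estimate into 0/1/2 (hypothetical thresholds)."""
    if raw is None:                     # no response
        return 0
    error = abs(raw - TRUE_AREA) / TRUE_AREA
    if error <= 0.10:
        return 2                        # close estimate
    if error <= 0.25:
        return 1                        # rough but plausible
    return 0

print([score_area_estimate(r) for r in [61, 60.0, 50, 15.5, None]])  # [2, 2, 1, 0, 0]
```

Because the raw numbers are kept, the 0/1/2 categories can be collapsed or re-cut later without re-marking anything.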
Scoring – formal vs psychometric Technically correct/incorrect, versus manifestation of latent ability In deciding how to score a response, always think of the level of latent ability needed to produce that response. –E.g., In which sport is the Bledisloe Cup competed for? –Sample answers: rugby union, rugby league, rugby, sailing. How to score these? Where are the corresponding levels of familiarity with sport?
Psychometric considerations in scoring - 1 Consider where you would place a person on the ability continuum. Score the response according to the location on the continuum. Measurement is about predicting a person's level of ability. In which sport is the Bledisloe Cup competed for? (diagram: the answers "Rugby union!", "Rugby league!", "Rugby!" and "Sailing!" placed along a scale of familiarity with sport, with regions marked Score 1 and Score 0)
Psychometric considerations in scoring - 2 It may be better to place the persons as follows: (diagram: the same familiarity-with-sport scale, with "Rugby union!", "Rugby league!" and "Rugby!" all grouped under Score 1, and "Sailing!" under Score 0)
Scoring - another example What is the area of the following shape? (figure: a shape with sides 4 m and 8 m) Consider these responses: 16 m²; 16 m; 16; 32 m²; 32; 12 m²; no response. How to score these? Where are the levels of latent ability corresponding to these responses? Ideally, we need scoring that satisfies both technical soundness and psychometric properties.
How to decide on weights? Should more difficult items get higher scores? Should items requiring more time get higher scores?
Partial Credit Scoring If the data support partial credit scoring*, then it is better to use partial credit rather than dichotomous scoring. Information will be lost if dichotomous scoring is used. *Data support partial credit scoring when the average ability for each score category increases with increasing score, and the point-biserial increases in order of score categories.
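The footnoted check can be sketched: with made-up data, compute the mean ability (here proxied by rest-of-test score) in each score category and verify that it increases with the score:

```python
# Hypothetical data: each student's item score (0/1/2) and rest-of-test score
item_scores = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
ability =     [10, 12, 15, 18, 20, 22, 25, 27, 30, 33]

def mean_ability_by_category(scores, ability):
    """Mean ability of the students in each score category, low to high."""
    return [sum(a for s, a in zip(scores, ability) if s == c) / scores.count(c)
            for c in sorted(set(scores))]

means = mean_ability_by_category(item_scores, ability)
supports_partial_credit = all(lo < hi for lo, hi in zip(means, means[1:]))
print([round(m, 2) for m in means], supports_partial_credit)  # [12.33, 20.0, 28.75] True
```

If the ordering fails (say, category 1 students average higher ability than category 2 students), collapsing those categories into a dichotomous score loses little information.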
Practical guide to partial credit scoring Within an item –Increasing score should correspond with increasing proficiency/ability Across items –The maximum score for each item should correspond with the amount of information provided by the item about students' proficiency/ability