2 Item Development
- Development of a framework and test blueprint
- Drafting items
- Item panelling ("shredding"!)
- An iterative process: draft items to illustrate, clarify and sharpen the framework; the framework in turn guides item development.
3 Framework and Test Blueprint - 1
Clearly identify:
- 'Why' you are assessing (purpose)
- 'Whom' to assess (population)
- 'What' to assess (construct domain)
Define parameters for the test, e.g.:
- Duration of the test and test administration procedures
- Scoring/marking constraints; item formats
- Other issues: security, feedback
4 Specifying the Purpose
How will the results be used?
- Determine pass/fail, satisfactory/unsatisfactory
- Award prizes
- Provide diagnostic information
- Compare students
- Set standards
- Provide information to policy makers
Who will use the information?
- Teachers, parents, students, managers, politicians
5 Specifying the Population
- Grade or age level; an industry; a profession; ethnicity/culture/language issues; gender
- The notion of population and sample
- Sampling method: random, convenience
- Size of sample
- The validity of test results could depend on the population/sample you are assessing
6 Specifying the Construct Domain - Examples
Familiarity with sport:
- What is meant by 'sport'? Include "gym"? "Tai chi"? "Gymnastics"? In Australian contexts?
Define problem solving:
- As viewed in the workforce: workforce competencies
- As viewed in the cognitive sciences: cognitive processes such as decoding, reasoning, domain-specific knowledge
Interpersonal skills:
- Negotiation/conflict resolution skills
- Leadership skills
- Working with people from diverse backgrounds
7 Specifying the Construct Domain - Examples
Achievement domains:
- Content oriented: number, measurement, data, algebra
- Competency oriented: conceptual understanding, procedural knowledge, problem solving
- Taxonomy of educational objectives (Bloom's taxonomy of learning outcomes): cognitive and affective
10 Considerations in defining the Construct of a test
Validity consideration:
- Does the construct cover what the test is claimed to be assessing?
- E.g., language proficiency: speaking, listening, reading, writing
Measurement consideration:
- How well do the specifications for a construct "hang together" to provide meaningful scores?
- The idea of "unidimensionality"
- An on-balance judgement; boundaries are never clear
11 Test Blueprint
Sufficiently detailed so that test developers can work from these specifications:
- Range of difficulty
- Target reliability
- Item format
- Weights of sub-domains
- Test administration procedures: timing, equipment, resources
- Marking requirements
12 Test Blueprint – example (PISA Reading)

Aspect                 | % of test | % constructed | % MC
Retrieving information | 20        | 7             | 13
Broad understanding    |           |               |
Interpretation         | 30        | 11            | 19
Reflecting on content  | 15        | 10            | 5
Reflecting on form     |           |               |
Total                  | 100       |               |
13 Uses of Frameworks & Blueprints
- To guide item development: don't ignore the specifications; cross-check with them constantly.
- To ensure that there is a clear and well-defined construct that is stable from one testing occasion to another:
  - Different item writing teams
  - Parallel tests
14 Item Writing: Science or Art?
Creativity following scientific principles:
- Established procedures guide good item development (as covered in this course)
- Inspiration, imagination and originality are difficult to teach, but can be gained through experience
- The most important prerequisite is subject-area expertise
- Teacher's craft
15 Item Writers
- Best done by a team.
- A 24-hour job! Ideas emerge, not necessarily in item-writing sessions, or even during office hours.
- Ideas appear as a rough notion, like an uncut stone. They need shaping and polishing, and many re-works!
- Keep a notebook for item ideas.
- Have a camera ready!
19 But, no tricks
- Keep materials interesting, but don't try to "trick" students: no trickery (as in trying to mislead), though items can be tricky (as in difficult).
- Don't dwell on trivial points. There is no room to waste test space.
- Think of the bigger picture of what "ability" means in the domain being tested.
- Every item should contribute one good piece of information about the overall standing of a student in the domain being tested.
- Collectively, all items need to provide one measure of a single "construct".
20 Item Types
Multiple choice:
- Easiest to score
- Not good face validity
- Research shows MC items have good concurrent validity and reliability, despite the guessing factor
Constructed:
- High face validity
- Difficult to score
- Marker reliability is an issue
21 Writing Multiple Choice Items
What is a multiple choice item?
22 Is this a MC item?
If August 1st is a Monday, what day of the week is August 7th?
A. Sunday
B. Monday
C. Tuesday
D. Wednesday
E. Thursday
F. Friday
G. Saturday
23 Writing Multiple Choice Items
- Many students think MC items are easier than open-ended items, and they often will not study as hard if they know the test consists of MC items only.
- They often try to memorise facts, because they think that MC items can only test facts.
- This promotes rote learning. We must discourage this.
24 Test-wise strategies for MC items
- Pick the longest answer.
- Pick "b" or "c"; they are more likely than "a" or "d".
- Pick the scientific-sounding answer.
- Pick a word related to the topic.
We must demonstrate that there are no clear strategies for guessing an answer.
25 Item format can make a difference to cognitive processes - 1
Make sure that we are testing what we think we are testing.
The following is a sequence:
3, 7, 11, 15, 19, 23, ...
What is the 10th term in this sequence?
A 27
B 31
C 35
D 39
67% correct (answer D). 24% chose A. That is, about a quarter of students worked out the pattern of the sequence but missed the phrase "10th term".
26 Item format can make a difference to cognitive processes - 2
The following is a sequence:
2, 9, 16, 23, 30, 37, ...
What is the 10th term in this sequence?
A 57
B 58
C 63
D 65
85% correct, even though this item is considered more difficult than the previous one (counting by 7 instead of by 4). The next number in the sequence ("44") is not among the distractors.
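The keyed answers of the two sequence items follow from the arithmetic-sequence formula a_n = a_1 + (n - 1)d. A quick sketch checking both keys, and the distractor built from "the next number in the sequence":

```python
# Check the keys of the two sequence items using the arithmetic-sequence
# formula: a_n = a_1 + (n - 1) * d.
def nth_term(first, diff, n):
    return first + (n - 1) * diff

print(nth_term(3, 4, 10))  # 39: key D for the first item (3, 7, 11, ...)
print(nth_term(2, 7, 10))  # 65: key D for the second item (2, 9, 16, ...)
print(nth_term(3, 4, 7))   # 27: the number after 23, used as distractor A
```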
27 Item format can make a difference to cognitive processes - 3
16x - 7 = 73. Solve for x.
A. 5
B. 6
C. 7
D. 8
Substitution is one strategy: substitute 5, 6, 7 and 8 for x and see which gives 73.
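The substitution strategy can be sketched directly: trying each option in turn short-circuits the algebra the item was presumably written to test.

```python
# Substitution strategy: try each option for x in 16x - 7 = 73,
# rather than solving the equation algebraically.
for x in [5, 6, 7, 8]:
    if 16 * x - 7 == 73:
        print("x =", x)  # x = 5 (option A)
```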
28 Item format can make a difference to cognitive processes - 4
The fact that the answer is present in a list can alter the process of solving a problem. Students look for clues in the options, and that can interfere with the cognitive processes the test setter has in mind.
29 Avoid confusing language - 1
Avoid using similar names:
- Peter, Petra, Mary and Mark
- Democratic Progressive, People's Democratic, Progressive Socialist Party
- Best Butter and Better Margarine
Minimise the amount of reading if the test is not about reading. Avoid "irrelevant" material.
30 Avoid confusing language - 2
Square slate slabs, 1 m by 1 m, are paved around a 10 m by 8 m rectangular pool. How many such slabs are needed? Show your work.
31 Evidence of language confusion
- Student drawings
- Item statistics
Item 17: SLAB017R   Weighted MNSQ = 1.17   Disc = 0.39
[Table: response categories (other; 80; 36; 40) with count, percent, point-biserial and mean ability for each category.]
While the question meant paving "around the outside of the swimming pool", many students thought it meant "around the inside of the swimming pool" (hence the answer "80").
32 Improve the language
Square slate slabs, 1 m by 1 m, are paved around the outside of a 10 m by 8 m rectangular pool. How many such slabs are needed? Show your work.
[Figure: tiles around the pool.]
We added the words "the outside of", and a diagram, to clarify the meaning.
33 Improved item statistics
Item 3: slab04R   Weighted MNSQ = 1.02   Disc = 0.49
[Table: response categories (other; 80; 36; 32; 40) with count, percent, point-biserial, mean ability and ability SD for each category.]
The percentage of students who gave the answer "80" has halved. The fit of this item to the item response model has improved, and so has the discrimination index.
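The category statistics on these two slides (count, point-biserial, mean ability per response category) can be computed from raw responses. A minimal sketch with invented data, since the slides' actual counts are not reproduced in these notes; the `point_biserial` helper is illustrative, not software output:

```python
# Distractor analysis sketch: for each response category of the slab item,
# compute the count, point-biserial with total score, and mean ability.
# The responses and total scores below are invented for illustration.

def point_biserial(indicator, totals):
    """Correlation between a 0/1 category indicator and total test scores."""
    n = len(totals)
    mean_all = sum(totals) / n
    p = sum(indicator) / n
    mean_cat = sum(t for i, t in zip(indicator, totals) if i) / sum(indicator)
    sd = (sum((t - mean_all) ** 2 for t in totals) / n) ** 0.5
    return (mean_cat - mean_all) / sd * (p / (1 - p)) ** 0.5

responses = ["40", "80", "40", "36", "40", "80", "40", "other"]
totals    = [9,    4,    8,    3,    10,   5,    7,    2]

for category in ["40", "80", "36", "other"]:
    ind = [1 if r == category else 0 for r in responses]
    mean_ability = sum(t for i, t in zip(ind, totals) if i) / sum(ind)
    print(category, sum(ind), round(point_biserial(ind, totals), 2), mean_ability)
```

A healthy item shows a positive point-biserial and the highest mean ability for the key ("40" here), and negative point-biserials for the distractors.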
34 Partial Credit for MC options
Which of the following is the capital city of Australia?
A Brisbane
B Canberra
C Sydney
D Vancouver
E Wellington
Partial credit can be awarded even for multiple choice items. In the example above, the answer Vancouver is clearly "worse" than the answer Sydney. One can give a score of 2 to option B, and a score of 1 to options A and C. Do not assume that multiple choice items can only be scored "right" or "wrong".
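The 2/1/0 rule described above can be implemented as a simple score map. The rule itself is from the slide; the code is an illustrative sketch:

```python
# Partial-credit scoring of the capital-city item:
# the key (Canberra) scores 2; other Australian cities score 1;
# non-Australian cities score 0.
option_scores = {
    "A": 1,  # Brisbane   - Australian city, partially correct
    "B": 2,  # Canberra   - the key
    "C": 1,  # Sydney     - Australian city, partially correct
    "D": 0,  # Vancouver  - not in Australia
    "E": 0,  # Wellington - not in Australia
}

def score(response):
    return option_scores.get(response, 0)  # missing/invalid responses score 0

print(score("B"), score("C"), score("D"))  # 2 1 0
```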
35 Avoid dependency between items
- One glass holds 175 ml of water. If I pour three glasses of water into a container, how much water would I have?
- If I dissolve 50 g of gelatin in the container, what is the proportion of gelatin to water?
When items are dependent, in that the answer to one item depends on the answer to another, one does not collect "independent" information from each item, and the total score becomes difficult to interpret.
36 Formatting MC items
Put options in logical, alphabetical, or numerical order, e.g.:
11-13
14-17
18-22
23-40
Vertical layout is better than horizontal.
37 MC options - 1
Terminology: "key" and "distractors".
- Don't use "All of the above".
- Use "None of the above" with caution.
- Keep the length of options similar. Students like to pick the longest, often more scientific-sounding, ones.
- Make each alternative (a, b, c, d) the key roughly the same number of times.
38 MC options - 2
Avoid having an odd one out.
Which word means the same as amiable in this sentence?
Because Leon was an amiable person, he was nice to everyone.
A. friendly
B. strict
C. moody
D. mean
39 MC options - 3
How many options should be provided for a MC item? 4? 5? 3?
It is not necessary to pre-determine a fixed number of MC options; it depends on the specific item.
- Which two of the primary colours red, blue and yellow make up green?
  (1) red and blue, (2) red and yellow, (3) blue and yellow
- Which day of the week is August 3?
  7 options.
40 Testing higher-order thinking with MC
Close the textbook when you write items. If you can't remember it, don't ask the students.
A lower-order thinking item:
What is the perimeter of the following shape?
[Figure: a shape with sides labelled 15 m and 9 m.]
41 A better item for testing higher-order thinking skills
Which two shapes have the same perimeter?
[Figure: four shapes labelled A, B, C and D.]
42 MC can be useful - 1
When open-ended is too difficult:
A small hose can fill a swimming pool in 12 hours, and a large hose can fill it in 3 hours. How long will it take to fill the pool if both hoses are used at the same time?
A. 2.4 hours
B. 4.0 hours
C. ... hours
D. ... hours
E. ... hours
This item is too difficult for Grade 6 students in terms of the mathematics involved. However, the item intends to test students' sense-making ability: when two hoses are used, the time required to fill the pool should be less than the time with either hose alone, so there is only one possible answer, A (2.4 hours). If this item were open-ended, very few students would be able to carry out the correct mathematics.
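The key follows from combining rates: the two hoses together fill 1/12 + 1/3 = 5/12 of the pool per hour, so filling takes 12/5 = 2.4 hours, which is indeed less than the 3 hours of the faster hose alone:

```python
from fractions import Fraction

small = Fraction(1, 12)  # small hose: 1/12 pool per hour
large = Fraction(1, 3)   # large hose: 1/3 pool per hour

time_both = 1 / (small + large)  # 1 / (5/12) = 12/5 hours
print(float(time_both))          # 2.4

# Sense-making check: with both hoses, the time must be less than
# the 3 hours taken by the large hose alone.
print(time_both < 3)             # True
```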
43 MC can be useful - 2
To avoid vague answers, e.g.:
How often do you watch sport on TV?
Answers:
- When there is nothing else to watch on TV
- Once in a while
- A few times a year
44 MC: problem with face validity
- Music performance
- IT familiarity
- Pilot licence testing
- Language proficiency
- Problem solving: not just the MC solution format; reading gets in the way as well; general validity issues
45 Summary about MC items
- Don't be afraid to use MC items.
- Check the cognitive processes required, as the answer is given among the options.
- Make sure the distractors do not distract in unintended ways.
- Make sure the key is not attractive for unintended reasons.
48 True/false
[Figure: four garden bed designs labelled A, B, C and D, with dimensions 10 m and 6 m shown.]
Circle either Yes or No for each design to indicate whether the garden bed can be made with 32 metres of timber.

Garden bed design | Using this design, can the garden bed be made with 32 metres of timber?
Design A          | Yes / No
Design B          | Yes / No
Design C          | Yes / No
Design D          | Yes / No
49 True/false
- Be aware of the high chance of guessing.
- Consider an appropriate scoring rule, e.g.:
  - Each statement counts for 1 score point
  - All statements correct = 1 score point
  - Something in between
- Examine item "model fit" to guide the scoring decision.
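The candidate scoring rules can be compared on an example response pattern. The response pattern and the particular "in-between" rule are invented for illustration:

```python
# Compare three scoring rules for a set of four true/false statements.
# `correct` marks which of a student's answers were right (invented data).
correct = [True, True, False, True]
errors = len(correct) - sum(correct)

score_each = sum(correct)     # Rule 1: each statement counts 1 point -> 3
score_all = int(errors == 0)  # Rule 2: all correct = 1 point -> 0

# Rule 3 ("something in between", one illustrative choice):
# 2 points if all correct, 1 point if exactly one error, 0 otherwise.
score_between = 2 if errors == 0 else 1 if errors == 1 else 0

print(score_each, score_all, score_between)  # 3 0 1
```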
50 Matching/Ordering
Arrange the actions in sequence:
A neighbourhood committee of a city decided to create a public garden in a run-down area of about 4000 m2.
1st phase / 2nd phase / 3rd phase / 4th phase / 5th phase
Actions:
A. Buying materials and plants.
B. Issuing the authorisations.
C. Project designing.
D. Care and maintenance.
E. Building the garden.
51 Matching/Ordering
- Useful for testing relationships.
- Easy to mark.
- Needs to be treated as one single item, as there is dependency between the responses.
52 More generally on item writing
Are you really testing what you think you are testing? For example:
- In a reading test, can you arrive at the correct answer without reading the stimulus?
- In a science test, can you extract the answer from the stimulus alone, rather than from the scientific knowledge the test professes to assess?
- In a maths test, is the stumbling block understanding the stimulus, or solving the problem?
54 Non-multiple-choice formats
Examples:
- Constructed response
- Performance
Motivation:
- Face validity, for testing higher-order thinking
- School reform: avoid teaching to multiple choice, and avoid testing fragmented knowledge and skills
55 Caution about Performance format
Check validity carefully:
- E.g., an evaluation of the Vermont statewide "portfolio" assessments (1991) concluded that the assessments had low reliability and validity.
- Problems with rater judgement and scoring reliability, e.g., quality of handwriting; presentation.
- 3-10 times more expensive.
Bennett & Ward (1993); Osterlind (1998); Haladyna (1997)
56 This slide may have some truth about it!
A study was carried out examining the differences in scores between a writing task administered online and one administered on paper-and-pencil. The next slide shows some results from this study.
57 Example - a study comparing online and paper writing tasks
- A writing task was administered online and on paper.
- Online scores were found to be lower than paper-and-pencil scores.
- Low-ability students do "better" on the paper-and-pencil writing task, by about 1 score point out of 10.
58 Improve between-rater agreement
- A clear and comprehensive marking guide
- Trialling, to obtain a wide range of responses
- Training for markers
- Monitoring for marker leniency/harshness
- Better to mark by item than by student, to reduce dependency between items
On the last dot point: markers should mark question 1 for all students, and then question 2 for all students, rather than marking all questions for student 1, then all questions for student 2. This is because of the so-called "halo" effect: markers form an impression of a student's work. For example, a marker often thinks of a student as an "A" student or a "B" student, and tends to award more similar scores when marking all questions for each student. More independent markings are obtained if marking is done by item rather than by student booklet. Of course, sometimes the logistics are too difficult and this will not be possible.
59 Work towards some middle ground?
Constructed response format with computer-assisted scoring:
[Figure: floor plan of a house, 6.4 m by 9.6 m, with a north arrow.]
Estimate the floor area of the house.
Capture the raw numeric response, e.g., 61.44, 60, 6144. The computer will recode and score it.
60 Computer-assisted scoring
- Formulate firm scoring rules AFTER examining the data.
- Other examples: household spending; hours spent on homework.
- The idea is to capture the maximum amount of information at the lowest cost.
- Capture all different responses; categories can always be collapsed later.
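For the floor-area item on the previous slide (6.4 m by 9.6 m, true area 61.44 m2), recoding captured raw responses might look like the sketch below. The credit bands are an assumption for illustration; in practice the firm rules would be fixed only after examining the data, as the slide says:

```python
# Sketch of computer-assisted recoding for the floor-area item.
# Assumed rule: full credit for a close estimate, partial credit for a
# response that is right except for a decimal/unit slip (e.g. 6144).
TRUE_AREA = 6.4 * 9.6  # 61.44 square metres

def score_area(raw):
    try:
        value = float(raw)
    except ValueError:
        return 0                              # non-numeric response
    if abs(value - TRUE_AREA) <= 2:
        return 2                              # close estimate
    if abs(value / 100 - TRUE_AREA) <= 2:
        return 1                              # decimal-point slip, e.g. 6144
    return 0

print(score_area("61.44"), score_area("60"), score_area("6144"), score_area("7"))
```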
61 Scoring – formal vs psychometric
Technically correct/incorrect, versus manifestation of latent ability. In deciding how to score a response, always think of the level of latent ability needed to produce that response.
E.g., in which sport is the Bledisloe Cup competed for?
Sample answers: rugby union, rugby league, rugby, sailing. How to score these? Where are the corresponding levels of familiarity with sport?
62 Psychometric considerations in scoring - 1
Consider where you would place a person on the ability continuum, and score the response according to its location on the continuum. Measurement is about predicting a person's level of ability.
In which sport is the Bledisloe Cup competed for?
[Figure: a scale of familiarity with sport, running from "Not familiar with sport" (score 0) to "Familiar with sport" (score 1), with the responses "Sailing!", "Rugby league!", "Rugby!" and "Rugby union!" placed along it.]
63 Psychometric considerations in scoring - 2
It may be better to place the persons as follows:
In which sport is the Bledisloe Cup competed for?
[Figure: the same scale of familiarity with sport, with "Sailing!" scored 0 and "Rugby league!", "Rugby!" and "Rugby union!" grouped together under score 1.]
64 Scoring - another example
What is the area of the following shape?
[Figure: a rectangle, 4 m by 8 m.]
Consider these responses: 16 m2; 16 m; 16; 32 m2; 32; 12 m2; no response. How should these be scored? Where are the levels of latent ability corresponding to these responses?
Ideally, we need scoring that satisfies both technical soundness and psychometric properties.
65 How to decide on weights?
Should more difficult items get higher scores? Should items requiring more time get higher scores?
- In general, more difficult items should not have more weight, unless the items are more discriminating. One rationale is that there should be equal penalty whether a person fails on an easy item or a difficult item. If all items tap into the same "latent variable", then a person high on the latent variable is unlikely to get easy items wrong. So getting easy items wrong and difficult items right is an indication that the items do not tap into the same construct. On the other hand, if all items do tap into the same construct, then item difficulty should not play a part in the weight of the score.
- If a test is not speeded, that is, everyone has enough time to complete the test, then items that require more time should not get more weight. However, if a test is speeded, then students who completed items that required more time may be disadvantaged, as they did not have the opportunity to complete shorter items. It is recommended that tests should not be speeded.
66 Partial Credit Scoring
If the data support partial credit scoring*, then it is better to use partial credit rather than dichotomous scoring. Information is lost if dichotomous scoring is used.
*The data support partial credit scoring when the average ability for each score category increases with increasing score, and the point-biserials increase in the order of the score categories.
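The footnoted check can be automated: group students by item score and verify that the category mean abilities are ordered. A sketch with invented item scores and abilities:

```python
# Check whether data support partial-credit scoring: mean ability in each
# score category should increase with the score. The scores and abilities
# below are invented (e.g., abilities could be totals on the rest of the test).
from collections import defaultdict

item_scores = [0, 0, 1, 1, 1, 2, 2, 2, 0, 2]
abilities   = [3, 5, 6, 8, 7, 11, 12, 9, 4, 10]

by_category = defaultdict(list)
for s, a in zip(item_scores, abilities):
    by_category[s].append(a)

means = {s: sum(v) / len(v) for s, v in sorted(by_category.items())}
print(means)  # {0: 4.0, 1: 7.0, 2: 10.5}

supported = all(means[s] < means[s + 1] for s in range(max(means)))
print("partial credit supported:", supported)  # True
```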
67 Practical guide to partial credit scoring
Within an item:
- Increasing score should correspond with increasing proficiency/ability.
Across items:
- The maximum score for each item should correspond with the amount of "information" the item provides about students' proficiency/ability.