Test and Scale Development

Margaret Wu

2 Item Development
Development of a framework and test blueprint
Draft items
Item panelling (shredding!)
Iterative process: draft items illustrate, clarify and sharpen up the framework; the framework guides item development.

Clearly identify:
‘Why’ you are assessing (Purpose)
‘Whom’ to assess (Population)
‘What’ to assess (Construct domain)
Define parameters for the test, e.g.:
Duration of the test and test administration procedures
Scoring/marking constraints; item formats
Other issues: security, feedback.

How will the results be used?
Determine pass/fail, satisfactory/unsatisfactory
Award prizes
Provide diagnostic information
Compare students
Set standards
Provide information to policy makers
Who will use the information?
Teachers, parents, students, managers, politicians

Grade, age level.
In an industry. Profession.
Ethnicity/Culture/Language issues. Gender.
Notion of population and sample.
Sampling method: random, convenience
Size of sample
Validity of test results could depend on the population/sample you are assessing

Familiarity with sport:
What is meant by ‘sport’? Include “Gym”? “Tai chi”? “Gymnastics”? In Australian contexts?
Define Problem Solving:
As viewed in the workforce: workforce competencies
As viewed in the cognitive sciences: cognitive processes (decoding, reasoning, domain-specific knowledge)
Interpersonal skills:
Negotiation/conflict resolution skills
Leadership skills
Work with people from diverse backgrounds

Achievement domains:
Content oriented: Number, measurement, data, algebra
Competency oriented: Conceptual understanding, procedural knowledge, problem solving
Taxonomy of educational objectives (Bloom’s taxonomy of learning outcomes): cognitive and affective.

Knowledge Comprehension Application Analysis Synthesis Evaluation


Validity consideration: Does the construct cover what the test is claimed to be assessing? E.g., language proficiency: speaking, listening, reading, writing.
Measurement consideration: How well do the specifications for a construct “hang together” to provide meaningful scores? The idea of “unidimensionality”.
On-balance judgement. Boundaries are never clear.

11 Test Blueprint
Sufficiently detailed so that test developers can work from these specifications.
Range of difficulty
Target reliability
Item format
Weights of sub-domains
Test administration procedures: timing, equipment, resources
Marking requirements

Aspect                    % of test   % constructed   % MC
Retrieving information    20          7               13
Broad understanding
Interpretation            30          11              19
Reflecting on content     15          10              5
Reflecting on form
Total                     100

To guide item development: don’t ignore specifications. Cross-check with specs constantly.
To ensure that there is a clear and well-defined construct that can be stable from one testing occasion to another:
Different item writing team
Parallel tests

Creativity following scientific principles Established procedures to guide good item development (as covered in this course) Inspiration, imagination and originality (difficult to teach, but can be gained through experience) Most important pre-requisite is subject area expertise Teacher’s craft

Ideas emerge, not necessarily in item writing sessions, or even during office hours. Ideas appear as a rough notion, like an uncut stone. It needs shaping and polishing, and many re-works! Keep a notebook for item ideas. Have camera ready!

18 lattice

19 But, no tricks
Keep materials interesting, but don’t try to “trick” students, i.e., no trickery (as in trying to mislead), although items can be tricky (as in difficult).
Don’t dwell on trivial points. No room to waste test space.
Think of the bigger picture of the meaning of “ability” in the domain of testing.
Every item should contribute one good piece of information about the overall standing of a student in the domain being tested.
Collectively, all items need to provide one measure on a single “construct”.

Multiple choice: Not good face validity. Research showed MC items do have good concurrent validity and reliability, despite the guessing factor.
Constructed response: High face validity. Difficult to score. Marker reliability is an issue.

What is a multiple choice item?

22 Is this a MC item?
If August 1st is a Monday, what day of the week is August 7th?
A. Sunday B. Monday C. Tuesday D. Wednesday E. Thursday F. Friday G. Saturday

Many students think MC items are easier than open-ended items, and they often will not study as hard if they know the test consists of MC items only. They often try to memorise facts, because they think that MC items can only test facts. This promotes rote-learning. We must discourage this.

Pick the longest answer.
Pick “b” or “c”. They are more likely than “a” or “d”.
Pick the scientific-sounding answer.
Pick a word related to the topic.
We must demonstrate that there are no clear strategies to guess an answer.

Make sure that we are testing what we think we are testing.
The following is a sequence: 3, 7, 11, 15, 19, 23, … What is the 10th term in this sequence?
A 27 B 31 C 35 D 39
67% correct (ans D). 24% chose A. That is, about ¼ of students worked out the pattern of the sequence but missed the phrase “10th term”.

The following is a sequence: 2, 9, 16, 23, 30, 37, … What is the 10th term in this sequence?
A 57 B 58 C 63 D 65
85% correct, even though this item is considered more difficult than the previous one (counting by 7 instead of by 4). Here the next number in the sequence (“44”) is not offered as a distractor.
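The arithmetic behind the two sequence items can be checked with a short sketch. The function name and the option dictionaries are illustrative only; the sequences and options are those quoted in the items.

```python
def nth_term(first, step, n):
    """Return the n-th term (1-indexed) of an arithmetic sequence."""
    return first + (n - 1) * step


# Item 1: 3, 7, 11, 15, 19, 23, ... (six terms shown, counting by 4)
item1_key = nth_term(3, 4, 10)    # the 10th term, the intended answer
item1_next = nth_term(3, 4, 7)    # the term straight after those shown

# Item 2: 2, 9, 16, 23, 30, 37, ... (six terms shown, counting by 7)
item2_key = nth_term(2, 7, 10)
item2_next = nth_term(2, 7, 7)

item1_options = {"A": 27, "B": 31, "C": 35, "D": 39}
item2_options = {"A": 57, "B": 58, "C": 63, "D": 65}

# Does the "next term" misreading land on one of the printed options?
item1_next_is_option = item1_next in item1_options.values()
item2_next_is_option = item2_next in item2_options.values()

print(item1_key, item1_next, item1_next_is_option)
print(item2_key, item2_next, item2_next_is_option)
```

In the first item the "next term" misreading matches option A, so the item partly measures careful reading of the question; in the second item that misreading has no matching option.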

16x - 7 = 73. Solve for x.
A. 5 B. 6 C. 7 D. 8
Substitution is one strategy: substitute 5, 6, 7 and 8 for x and see which one gives 73.

The fact that the answer is present in a list can alter the process of solving a problem. Students look for clues in the options. That can interfere with the cognitive processes the test setter has in mind.
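A minimal sketch of the substitution strategy for the item 16x - 7 = 73 shows that the key can be found without solving the equation at all. The helper name is hypothetical.

```python
def solve_by_substitution(options, lhs, target):
    """Return the labels of options whose value makes lhs(x) equal target."""
    return [label for label, x in options.items() if lhs(x) == target]


options = {"A": 5, "B": 6, "C": 7, "D": 8}

# Try every option in the left-hand side 16x - 7 and compare with 73.
matches = solve_by_substitution(options, lambda x: 16 * x - 7, 73)
print(matches)
```

A student who works this way is showing arithmetic checking rather than algebraic manipulation, which may not be the process the test setter had in mind.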

30 Avoid confusing language - 2
Square slate slabs 1m by 1m are paved around a 10 m by 8 m rectangular pool. How many such slabs are needed? Show your work.

31 Evidence of language confusion
Student drawings
Item Statistics
Item 17: SLAB017R Weighted MNSQ = 1.17 Disc = 0.39
Categories (Ans.): (other) [0], (80) [0], (36) [1], (40) [2]
Count, Percent (%), Pt-Biserial, Mean Ability
While the question meant “paving around the outside of the swimming pool”, many students thought it meant “around the inside of the swimming pool” (hence the answer “80”).

32 Improve the language
Square slate slabs 1 m by 1 m are paved around the outside of a 10 m by 8 m rectangular pool. How many such slabs are needed? Show your work.
[Diagram: tiles around the pool]
Added the words “the outside of”, and a diagram, to clarify the meaning.

33 Improved item statistics
Item 3: slab04R Weighted MNSQ = 1.02 Disc = 0.49
Categories 0 [0] [0] [1] [2] [2] (other) (80) (36) (32) (40)
Count Percent (%) Pt-Biserial Mean Ability StDev Ability
The percentage of students who gave the answer “80” has halved. The fit of this item to the item response model has improved (weighted MNSQ 1.02, compared with 1.17 before), and the discrimination index has improved (0.49, compared with 0.39).
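The numbers behind the response categories of the slab item can be reproduced with a short sketch. Reading “36” as a ring with the corner slabs forgotten is an assumption about student reasoning, not something stated in the item statistics; the function name is illustrative.

```python
def ring_slabs(length_m, width_m):
    """Number of 1 m x 1 m slabs in a one-slab-wide ring outside the pool."""
    along_sides = 2 * (length_m + width_m)  # slabs lying against the four sides
    corners = 4                             # one extra slab at each corner
    return along_sides + corners


length_m, width_m = 10, 8

outside_ring = ring_slabs(length_m, width_m)  # intended reading of the item
perimeter_only = 2 * (length_m + width_m)     # assumed error: corners left out
pool_area = length_m * width_m                # slabs covering the pool itself

# Cross-check: the ring equals the enlarged rectangle minus the pool.
enlarged_minus_pool = (length_m + 2) * (width_m + 2) - pool_area

print(outside_ring, perimeter_only, pool_area, enlarged_minus_pool)
```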

Which of the following is the capital city of Australia?
A Brisbane B Canberra C Sydney D Vancouver E Wellington
Partial credit can be awarded even for multiple choice items. In the above example, the answer “Vancouver” is obviously “worse” than the answer “Sydney”. One can give a score of 2 to option B, and a score of 1 to options A and C. Do not think that multiple choice items can only be scored “right” or “wrong”.
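The partial credit rule described for the capital city item can be written as a simple lookup. This is only a sketch: the scores for B, A and C follow the slide, while scoring D, E and missing responses as 0 is an assumption.

```python
# Option scores for the capital city item.
# B = 2 and A, C = 1 come from the slide; D, E = 0 is assumed.
PARTIAL_CREDIT = {"A": 1, "B": 2, "C": 1, "D": 0, "E": 0}


def score_response(choice):
    """Score one response; blank or unrecognised responses score 0 (assumed)."""
    if choice is None:
        return 0
    return PARTIAL_CREDIT.get(choice.strip().upper(), 0)


responses = ["B", "c", "D", None, "A"]  # hypothetical student responses
scores = [score_response(r) for r in responses]
print(scores)
```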

One glass holds 175 ml water. If I pour three glasses of water into a container, how much water would I have?
If I dissolve 50 g of gelatin in the container, what is the proportion of gelatin to water?
When items are dependent, in that the answer to one item depends on the answer to another item, one does not collect “independent” information from each item, and the total score becomes difficult to interpret.

36 Formatting MC items
Options in logical, alphabetic, or numerical order, e.g., 11-13, 14-17, 18-22, 23-40
Vertical better than horizontal

Don’t use “All of the above”.
Use “None of the above” with caution.
Keep length of options similar. Students like to pick the longest, often more scientific-sounding ones.
Use each alternative (a, b, c, d) as the key about the same number of times.
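One way to check the last point is to tally how often each alternative is the key. The answer key below is hypothetical and the function name is illustrative.

```python
from collections import Counter


def key_balance(keys, labels="ABCD"):
    """Count how many times each option label is the key across a test."""
    counts = Counter(keys)
    return {label: counts.get(label, 0) for label in labels}


# Hypothetical answer key for an 8-item test.
answer_key = ["B", "C", "B", "A", "D", "C", "B", "C"]

balance = key_balance(answer_key)
print(balance)
```

Here B and C are each the key three times while A and D appear once each, which would reward the “pick b or c” guessing strategy.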

Which word means the same as amiable in this sentence? Because Leon was an amiable person, he was nice to everyone. A. friendly B. strict C. moody D. mean

39 MC options - 3
How many options should be provided for a MC item? 4? 5? 3?
It is not necessary to pre-determine a fixed number of MC options. It depends on the specific item.
Which two of the primary colours, red, blue and yellow, make up green? (1) red and blue, (2) red and yellow, (3) blue and yellow
Which day of the week is August 3? 7 options.

Close the textbook when you write items. If you can’t remember it, don’t ask the students.
Lower-order thinking item: What is the perimeter of the following shape? [Figure: shape with sides labelled 15 m and 9 m]

41 A better item for testing higher-order thinking skills
Which two shapes have the same perimeter? A B C D

A small hose can fill a swimming pool in 12 hours, and a large hose can fill it in 3 hours. How long will it take to fill the pool if both hoses are used at the same time?
A. 2.4 hours B. 4.0 hours C. … hours D. … hours E. … hours
This item is too difficult for Grade 6 students in terms of the mathematics involved. However, the item is intended to test students’ sense-making ability. When two hoses are used, the time required to fill the pool should be less than the time when either hose is used alone. So there is only one correct answer: A. 2.4 hours. If this item were open-ended, very few students would be able to carry out the correct mathematics.
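For reference, the mathematics that the hose item avoids asking students to do can be sketched as follows: the two fill rates add, and the combined time is the reciprocal of the combined rate. The function name is illustrative.

```python
from fractions import Fraction


def combined_fill_time(*hours_alone):
    """Time to fill one pool when all hoses run together.

    Each hose fills 1/h of the pool per hour, so the rates add.
    """
    combined_rate = sum(Fraction(1, h) for h in hours_alone)
    return 1 / combined_rate


both_hoses = combined_fill_time(12, 3)   # 1/12 + 1/3 = 5/12 pool per hour
print(both_hoses, float(both_hoses))

# Sense-making check: together must be faster than the faster hose alone.
faster_than_either = both_hoses < min(12, 3)
print(faster_than_either)
```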

How often do you watch sport on TV? Ans: When there is nothing else to watch on TV. Once in a while. A few times a year.

Music performance IT familiarity Pilot licence testing Language proficiency Problem solving – not just with the MC solution format; reading gets in the way as well; general validity issues

Check the cognitive processes required, as the answer is given among the options. Make sure the distractors do not distract in unintended ways. Make sure the key is not attractive for unintended reasons.


48 True/false
[Figure: four garden bed designs A, B, C, D, with dimensions 10 m and 6 m]
Circle either Yes or No for each design to indicate whether the garden bed can be made with 32 metres of timber.
Garden bed design    Using this design, can the garden bed be made with 32 metres of timber?
Design A             Yes / No
Design B             Yes / No
Design C             Yes / No
Design D             Yes / No

Consider an appropriate scoring rule, e.g.:
Each statement counts 1 score
All statements correct = 1 score
Something in-between
Examine item “model fit” to guide scoring decision
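The three scoring rules for a set of true/false statements can be compared side by side. This is a sketch only: the key and the student response are hypothetical, and the particular “in-between” rule (2 for all correct, 1 for one error, 0 otherwise) is just one possible choice.

```python
def score_each_statement(responses, key):
    """Rule 1: each correct statement counts 1 score."""
    return sum(r == k for r, k in zip(responses, key))


def score_all_or_nothing(responses, key):
    """Rule 2: 1 score only if every statement is correct."""
    return int(list(responses) == list(key))


def score_in_between(responses, key):
    """Rule 3 (assumed): 2 if all correct, 1 if exactly one error, else 0."""
    errors = len(key) - score_each_statement(responses, key)
    if errors == 0:
        return 2
    if errors == 1:
        return 1
    return 0


key = ["Yes", "No", "Yes", "No"]        # hypothetical key for four statements
student = ["Yes", "No", "No", "No"]     # hypothetical response, one error

print(score_each_statement(student, key))
print(score_all_or_nothing(student, key))
print(score_in_between(student, key))
```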

50 Matching/Ordering
A neighbourhood committee of a city decided to create a public garden in a run-down area of about 4000 m². Arrange the actions in sequence.
1st phase 2nd phase 3rd phase 4th phase 5th phase
Actions:
A. Buying materials and plants.
B. Issuing the authorisations.
C. Project designing.
D. Care and maintenance.
E. Building the garden.

Need to be treated as one single item, as there is dependency between the responses.

Are you really testing what you think you are testing? For example:
In a reading test, can you arrive at the correct answer without reading the stimulus?
In a science test, can you extract the information from the stimulus alone, and not from the scientific knowledge that you profess to be testing?
In a maths test, is the stumbling block to do with understanding the stimulus, or to do with solving the problem?

Examples: Constructed response Performance Motivation: Face validity, for testing higher order thinking School reform: Avoid multiple choice teaching, and avoid testing fragmented knowledge and skills.

55 Caution about Performance format
Check validity carefully. E.g., an evaluation of the Vermont statewide assessment based on collecting “portfolios” (1991) concluded that the assessments had low reliability and validity.
Problems with rater judgement and scoring reliability, e.g., quality of handwriting; presentation.
3-10 times more expensive.
Bennett & Ward (1993); Osterlind (1998); Haladyna (1997)

This slide may have some truth about it! A study was carried out examining the differences in scores between a writing task administered online and on paper-and-pencil. Next slide shows some results from this study.

57 Example - a study comparing Online and Paper writing task
A writing task was administered online and on paper. Online scores were found to be lower than paper-and-pencil scores. Low-ability students do “better” on the paper-and-pencil writing task, by about 1 score point out of 10.

Clear and comprehensive marking guide
Need trialling to get a wide range of responses
Need training for markers
Need monitoring for marker leniency/harshness
Better to mark by item than by student, to reduce dependency between items
On the last dot point: markers should mark question 1 for all students, and then question 2 for all students, rather than marking all questions for student 1, then all questions for student 2. This is because of the so-called “halo” effect: markers form an impression of a student’s work. For example, a marker often thinks of a student as an “A” student or a “B” student, and tends to award more similar scores when marking all questions for each student. More independent markings can be obtained if marking is done by item rather than by student booklet. Of course, sometimes the logistics may be too difficult and this will not be possible.

Constructed response format with computer assisted scoring:
[Figure: house plan labelled N, 6.4 m, 9.6 m]
Estimate the floor area of the house.
Capture raw numeric response, e.g., 61.44, 60, 6144
Computer will recode and score

60 Computer assisted scoring
Formulate firm scoring rules AFTER we examine the data.
Other examples: household spending; hours spent on homework.
Idea is to capture maximum amount of information with lowest cost.
Capture all different responses. Can always collapse categories later.
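A minimal sketch of this workflow for the floor area item: keep every raw response, tally them, and only then apply recoding rules. The raw responses, the category names and the 5% tolerance for an acceptable estimate are all assumptions for illustration; the actual rules would be set after examining real data.

```python
from collections import Counter

EXACT_AREA = 6.4 * 9.6  # 61.44 square metres


def parse_number(raw):
    """Turn a captured raw response into a float, or None if not numeric."""
    try:
        return float(raw.strip())
    except ValueError:
        return None


def recode(value, tolerance=0.05):
    """Assign a response category (hypothetical rules and tolerance)."""
    if value is None:
        return "missing_or_invalid"
    if abs(value - EXACT_AREA) < 1e-6:
        return "exact"
    if abs(value - EXACT_AREA) / EXACT_AREA <= tolerance:
        return "estimate"
    if abs(value - EXACT_AREA * 100) < 1e-6:
        return "decimal_point_error"
    return "other"


# Hypothetical raw responses, stored exactly as captured.
raw_responses = ["61.44", "60", "6144", "60", " 61.44 ", ""]

# Step 1: examine the raw data before fixing any scoring rule.
raw_tally = Counter(r.strip() for r in raw_responses)

# Step 2: recode; categories can always be collapsed later.
code_tally = Counter(recode(parse_number(r)) for r in raw_responses)

print(raw_tally)
print(code_tally)
```

Because the raw responses are kept, the tolerance or the categories can be changed and the recoding rerun without collecting the data again.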

Technically correct/incorrect, versus manifestation of latent ability
In deciding on how to score a response, always think of the level of latent ability needed to produce that response.
E.g., In which sport is the Bledisloe Cup competed for? Sample answers: rugby union, rugby league, rugby, sailing.
How to score? Where are the corresponding levels of familiarity with sport?

Consider where you would place a person on the ability continuum. Score the response according to the location on the continuum. Measurement is about predicting a person’s level of ability. In which sport is the Bledisloe Cup competed for? Sailing! Rugby league! Rugby! Rugby union! Not Familiar with sport Familiar with sport Score 0 Score 1 Scale of familiarity with sport

May be better to place the persons as follows: In which sport is the Bledisloe Cup competed for? Sailing! Rugby league! Rugby! Rugby union! Not Familiar with sport Familiar with sport Score 0 \---Score 1---/ Scale of familiarity with sport

What is the area of the following shape? [Figure: shape with dimensions 4 m and 8 m]
Consider these responses: 16 m²; 16 m; 16; 32 m²; 32; 12 m²; no response.
How to score these? Where are the levels of latent ability corresponding to these responses?
Ideally, we need scoring that satisfies both technical soundness and psychometric properties.

65 How to decide on weights?
66 Partial Credit Scoring
67 Practical guide to partial credit scoring
