Test and Scale Development

1 Test and Scale Development
Margaret Wu

2 Item Development Development of a Framework and Test Blueprint
Draft items Item panelling (shredding!) Iterative process: Draft items to illustrate, and clarify and sharpen up framework. Framework to guide item development.

3 Framework and Test Blueprint-1
Clearly identify: ‘Why’ you are assessing (Purpose); ‘Whom’ to assess (Population); ‘What’ to assess (Construct domain). Define parameters for the test, e.g.: duration of the test and test administration procedures; scoring/marking constraints; item formats. Other issues: security, feedback.

4 Specifying the Purpose
How will the results be used? Determine pass/fail, satisfactory/unsatisfactory Award prizes Provide diagnostic information Compare students Set standards Provide information to policy makers Who will use the information? Teachers, parents, students, managers, politicians

5 Specifying the Population
Grade, age level. In an industry. Profession. Ethnicity/Culture/Language issues. Gender. Notion of population and sample. Sampling method: random, convenience Size of sample Validity of test results could depend on the population/sample you are assessing

6 Specifying the Construct Domain - Examples
Familiarity with sport: What is meant by ‘sport’? Include “Gym”? “Taichi?” “Gymnastics”? In Australian contexts? Define Problem Solving: As viewed in the workforce Workforce competencies As viewed in the cognitive sciences Cognitive processes: decoding, reasoning, domain specific knowledge Interpersonal skills Negotiation/conflict resolution skills Leadership skills Work with people from diverse backgrounds

7 Specifying the Construct Domain – Examples
Achievement domains: Content oriented: Number, measurement, data, algebra Competency oriented: Conceptual understanding, procedural knowledge, problem solving Taxonomy of educational objectives (Bloom’s taxonomy of learning outcomes): cognitive and affective.

8 Bloom’s Taxonomy - cognitive
Knowledge Comprehension Application Analysis Synthesis Evaluation


10 Considerations in defining the Construct of a test
Validity consideration: Does the construct cover what the test is claimed to be assessing? E.g., language proficiency: speaking, listening, reading, writing. Measurement consideration: How well do the specifications for a construct “hang together” to provide meaningful scores? The idea of “unidimensionality”. On-balance judgement. Boundaries are never clear.

11 Test Blueprint Sufficiently detailed so that test developers can work from these specifications. Range of difficulty Target reliability Item format. Weights of sub-domains Test administration procedures Timing, equipment, resources Marking requirements

12 Test Blueprint – example (PISA Reading)
Aspect                   % of test   % constructed   % MC
Retrieving information   20          7               13
Broad understanding
Interpretation           30          11              19
Reflecting on content    15          10              5
Reflecting on form
Total                    100

13 Uses of Frameworks & Blueprints
To guide item development Don’t ignore specifications. Cross-check with specs constantly. To ensure that there is a clear and well-defined construct that can be stable from one testing occasion to another. Different item writing team Parallel tests

14 Item Writing Science or Art?
Creativity following scientific principles Established procedures to guide good item development (as covered in this course) Inspiration, imagination and originality (difficult to teach, but can be gained through experience) Most important pre-requisite is subject area expertise Teacher’s craft

15 Item Writers Best done by a team 24-hour job!
Ideas emerge, not necessarily in item writing sessions, or even during office hours. Ideas appear as a rough notion, like an uncut stone. It needs shaping and polishing, and many re-works! Keep a notebook for item ideas. Have camera ready!

16 Make items interesting!

17 Capture Potential Item Ideas

18 lattice

19 But, no tricks Keep materials interesting, but don’t try to “trick” students i.e. no trickery (as in trying to mislead) but items can be tricky (as in difficult) Don’t dwell on trivial points. No room to waste test space. Think of the bigger picture of the meaning of “ability” in the domain of testing. Every item should contribute one good piece of information about the overall standing of a student in the domain being tested. Collectively, all items need to provide one measure on a single “construct”

20 Item Types
Multiple choice: Easiest to score. Not good face validity. Research showed MC items do have good concurrent validity and reliability, despite the guessing factor.
Constructed: High face validity. Difficult to score. Marker reliability is an issue.

21 Writing Multiple Choice Item
What is a multiple choice item?

22 Is this a MC item? If August 1st is a Monday, what day of the week is August 7th? A. Sunday B. Monday C. Tuesday D. Wednesday E. Thursday F. Friday G. Saturday

23 Writing Multiple Choice Items
Many students think MC items are easier than open-ended items, and they often will not study as hard if they know the test consists of MC items only. They often try to memorise facts, because they think that MC items can only test facts. This promotes rote-learning. We must discourage this.

24 Test-wise strategies for MC items
Pick the longest answer Pick “b” or “c”. They are more likely than “a” or “d”. Pick the scientific sounding answer. Pick a word related to the topic. We must demonstrate that there are no clear strategies to guess an answer.

25 Item format can make a difference to cognitive processes -1
Make sure that we are testing what we think we are testing. The following is a sequence: 3, 7, 11, 15, 19, 23, … What is the 10th term in this sequence? A 27 B 31 C 35 D 39. 67% correct (answer D). 24% chose A. That is, about ¼ of students worked out the pattern of the sequence but missed the phrase “10th term”.

26 Item format can make a difference to cognitive processes -2
The following is a sequence: 2, 9, 16, 23, 30, 37, … What is the 10th term in this sequence? A 57 B 58 C 63 D 65. 85% correct, even though this item is considered more difficult than the previous one (counting by 7 instead of by 4). Here the next number in the sequence (“44”) is not offered as a distractor.
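As a quick check on the two sequence items above (slides 25 and 26), the sketch below computes the 10th term of each arithmetic sequence and tests whether the term immediately after the printed ones appears among the options. The function names `nth_term` and `next_term_is_option` are illustrative and not part of the presentation.

```python
def nth_term(first, step, n):
    """n-th term (1-indexed) of an arithmetic sequence."""
    return first + (n - 1) * step


def next_term_is_option(shown, options):
    """True if the term right after the printed terms is offered as an option."""
    step = shown[1] - shown[0]
    return shown[-1] + step in options


item_25 = {"shown": [3, 7, 11, 15, 19, 23], "options": [27, 31, 35, 39]}
item_26 = {"shown": [2, 9, 16, 23, 30, 37], "options": [57, 58, 63, 65]}

key_25 = nth_term(3, 4, 10)  # 39, option D
key_26 = nth_term(2, 7, 10)  # 65, option D

# Slide 25 offers the "next term" (27) as option A; slide 26 does not offer 44.
trap_25 = next_term_is_option(**item_25)
trap_26 = next_term_is_option(**item_26)
```

A check like this makes it easy to see when a distractor rewards a misreading of the question rather than a gap in the skill being tested.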

27 Item format can make a difference to cognitive processes -3
16x - 7 = 73. Solve for x. A. 5 B. 6 C. 7 D. 8. Substitution is one strategy: substitute 5, 6, 7 and 8 for x and see whether the left-hand side equals 73.
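The substitution strategy can be sketched in a few lines, which shows how little algebra the MC format demands for this item. The helper name `solve_by_substitution` is hypothetical.

```python
def solve_by_substitution(options):
    """Return the labels of options that satisfy 16x - 7 = 73."""
    return [label for label, x in options.items() if 16 * x - 7 == 73]


options = {"A": 5, "B": 6, "C": 7, "D": 8}

by_substitution = solve_by_substitution(options)  # tries each option in turn
by_algebra = (73 + 7) / 16                        # what an open-ended item would require
```

Both routes reach the same answer, but only the second involves solving the equation.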

28 Item format can make a difference to cognitive processes -4
The fact that the answer is present in a list can alter the process of solving a problem. Students look for clues in the options. That can interfere with the cognitive processes the test setter has in mind.

29 Avoid confusing language - 1
Avoid using similar names, e.g., Peter, Petra, Mary and Mark; Democratic Progressive, People’s Democratic, Progressive Socialist Party; Best Butter and Better Margarine. Minimise the amount of reading, if the test is not about reading. Avoid “irrelevant” material.

30 Avoid confusing language - 2
Square slate slabs 1m by 1m are paved around a 10 m by 8 m rectangular pool. How many such slabs are needed? Show your work.

31 Evidence of language confusion
Student drawings. Item statistics for Item 17 (SLAB017R): Weighted MNSQ = 1.17, Disc = 0.39. Response categories [score]: other [0], 80 [0], 36 [1], 40 [2]. [Table rows: Count, Percent (%), Pt-Biserial, Mean Ability; values not shown.] While the question meant “paving around the outside of the swimming pool”, many students thought it meant “around the inside of the swimming pool” (hence the answer “80”).
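The three numeric response categories (40, 36 and 80) can be reproduced from the pool dimensions. In the sketch below, treating 36 as the pool perimeter and 80 as the pool area is an interpretation of those categories rather than something the slide states, and the function names are illustrative.

```python
def slabs_outside(length, width):
    """1 m x 1 m slabs forming a one-slab-wide border outside the pool."""
    return (length + 2) * (width + 2) - length * width


def perimeter(length, width):
    """Perimeter of the pool in metres (forgets the four corner slabs)."""
    return 2 * (length + width)


def area(length, width):
    """Area of the pool in square metres (tiling the inside of the pool)."""
    return length * width


pool = (10, 8)
intended = slabs_outside(*pool)   # 40
no_corners = perimeter(*pool)     # 36
inside = area(*pool)              # 80
```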

32 Improve the language Square slate slabs 1m by 1m are paved around the outside of a 10 m by 8 m rectangular pool. How many such slabs are needed? Show your work. Tiles around the pool. Added the words “the outside of”, and a diagram, to clarify the meaning.

33 Improved item statistics
Item statistics for Item 3 (slab04R): Weighted MNSQ = 1.02, Disc = 0.49. Categories 0 [0] [0] [1] [2] [2] (other) (80) (36) (32) (40). [Table rows: Count, Percent (%), Pt-Biserial, Mean Ability, StDev Ability; values not shown.] The percentage of students who gave the answer “80” has halved. The fit of this item to the item response model has improved, and the discrimination index has improved.

34 Partial Credit for MC options
Which of the following is the capital city of Australia? A Brisbane B Canberra C Sydney D Vancouver E Wellington Partial credit can be awarded even for multiple choice items. In the above example, the answer Vancouver is obviously “worse” than the answer “Sydney”. One can give a score of 2 to option B, and give a score of 1 to Options A and C. Do not think that multiple choice items can only be scored “right” or “wrong”.
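A minimal sketch of the partial credit rule described above: option B scores 2, options A and C score 1. Scoring D, E and blank responses as 0 is an assumption, since the slide does not say so explicitly; `OPTION_SCORES` and `score_response` are hypothetical names.

```python
# Partial credit key for the capital-city item.
# B (Canberra) is the key; A and C are Australian cities and earn 1 point.
OPTION_SCORES = {"A": 1, "B": 2, "C": 1, "D": 0, "E": 0}


def score_response(choice):
    """Score one response; blank or unrecognised choices score 0 (an assumption)."""
    return OPTION_SCORES.get(choice, 0)


scores = [score_response(c) for c in ["B", "C", "D", None]]
```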

35 Avoid dependency between items
One glass holds 175 ml water. If I pour three glasses of water into a container, how much water would I have? If I dissolve 50 g of gelatin in the container, what is the proportion of gelatin to water? When items are dependent in that the answer to one item depends on the answer to another item, one does not collect “independent” information from each item, and the total score becomes difficult to interpret.

36 Formatting MC items Options in logical, alphabetic, or numerical order
11-13 14-17 18-22 23-40 Vertical better than horizontal

37 MC options - 1 Terminology: “key” and “distractors”
Don’t use “All of the above”. Use “None of the above” with caution. Keep length of options similar. Students like to pick the longest, often more scientific sounding ones. Across the test, make each alternative (a, b, c, d) the key about the same number of times.
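One way to check the last point is to tally how often each alternative is the key. The sketch below is illustrative only: the 12-item answer key is made up, and the tolerance of 1 is an arbitrary choice.

```python
from collections import Counter


def key_counts(keys, labels="ABCD"):
    """Number of items keyed to each alternative, including unused ones."""
    counts = Counter(keys)
    return {label: counts.get(label, 0) for label in labels}


def is_balanced(keys, labels="ABCD", tolerance=1):
    """True if the most- and least-used keys differ by at most `tolerance`."""
    counts = key_counts(keys, labels).values()
    return max(counts) - min(counts) <= tolerance


keys = list("BDACABDCCADB")  # hypothetical 12-item answer key
balanced = is_balanced(keys)
```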

38 MC options - 2 Avoid having an odd one out.
Which word means the same as amiable in this sentence? Because Leon was an amiable person, he was nice to everyone. A. friendly B. strict C. moody D. mean

39 MC options - 3 How many options should be provided for a MC item?
4? 5? 3? It is not necessary to pre-determine a fixed number of MC options. It depends on the specific item Which two of the primary colours, red, blue and yellow make up green? (1) red and blue, (2) red and yellow, (3) blue and yellow Which day of the week is August 3? 7 options.

40 Testing higher-order thinking with MC
Close the textbook when you write items. If you can’t remember it, don’t ask the students. Lower-order thinking item: What is the perimeter of the following shape? 15 m 9m

41 A better item for testing higher-order thinking skills
Which two shapes have the same perimeter? A B C D

42 MC can be useful - 1 When open-ended is too difficult
A small hose can fill a swimming pool in 12 hours, and a large hose can fill it in 3 hours. How long will it take to fill the pool if both hoses are used at the same time? A. 2.4 hours B. 4.0 hours C. … hours D. … hours E. … hours. This item is too difficult for Grade 6 students in terms of the mathematics involved. However, the item intends to test students’ sense-making ability. When two hoses are used, the time required to fill the pool should be less than the time when either hose is used alone. So there is only one correct answer: A. 2.4 hours. If this item were open-ended, very few students would be able to carry out the correct mathematics.
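The sketch below separates the two routes to the answer: the exact work-rate calculation and the sense-making check that any plausible time must be shorter than the faster hose alone. Only options A and B are tested because the values of the other options are not given in this transcript; the function names are illustrative.

```python
from fractions import Fraction


def combined_hours(*hours_alone):
    """Time to fill one pool when hoses with the given solo times run together."""
    rate = sum(Fraction(1, h) for h in hours_alone)  # pools per hour
    return 1 / rate


def passes_sense_check(option_hours, *hours_alone):
    """Both hoses together must be faster than the faster hose alone."""
    return 0 < option_hours < min(hours_alone)


exact = combined_hours(12, 3)                    # 1 / (1/12 + 1/3) = 12/5 hours
option_a_ok = passes_sense_check(2.4, 12, 3)
option_b_ok = passes_sense_check(4.0, 12, 3)
```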

43 MC can be useful - 2 To avoid vague answers, e.g.,
How often do you watch sport on TV? Ans: When there is nothing else to watch on TV. Once in a while. A few times a year.

44 MC: problem with face validity
Music performance IT familiarity Pilot licence testing Language proficiency Problem solving – not just with the MC solution format; reading gets in the way as well; general validity issues

45 Summary about MC items Don’t be afraid to ask MC items
Check the cognitive processes required, as the answer is given among the options. Make sure the distractors do not distract in unintended ways. Make sure the key is not attractive for unintended reasons.


47 Other Closed Constructed Item Formats

48 True/false
[Figure: four garden bed designs, A, B, C and D, with dimensions 10 m and 6 m.] Circle either Yes or No for each design to indicate whether the garden bed can be made with 32 metres of timber.
Garden bed design: Using this design, can the garden bed be made with 32 metres of timber?
Design A: Yes / No
Design B: Yes / No
Design C: Yes / No
Design D: Yes / No

49 True/false Be aware of high chance of guessing
Consider an appropriate scoring rule, e.g.: each statement counts as 1 score point; all statements correct = 1 score point; something in between. Examine item “model fit” to guide the scoring decision.
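The three scoring rules can be compared side by side on the same set of responses. In the sketch below the in-between rule (2 points for no errors, 1 point for exactly one error) is one hypothetical choice among many, and the key and responses are made up for illustration.

```python
def score_each(responses, key):
    """One score point per correctly judged statement."""
    return sum(r == k for r, k in zip(responses, key))


def score_all_or_nothing(responses, key):
    """One score point only if every statement is judged correctly."""
    return int(list(responses) == list(key))


def score_in_between(responses, key):
    """Hypothetical middle rule: 2 if all correct, 1 if exactly one error, else 0."""
    errors = len(key) - score_each(responses, key)
    return {0: 2, 1: 1}.get(errors, 0)


key = ["Yes", "No", "Yes", "Yes"]        # hypothetical key for four statements
responses = ["Yes", "No", "No", "Yes"]   # one statement judged wrongly

per_statement = score_each(responses, key)
all_or_nothing = score_all_or_nothing(responses, key)
in_between = score_in_between(responses, key)
```

Which rule to adopt is an empirical question; as the slide says, item fit statistics should guide the decision.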

50 Matching/Ordering Arrange the actions in sequence
A neighbourhood committee of a city decided to create a public garden in a run-down area of about 4000 m². Arrange the actions in sequence: 1st phase, 2nd phase, 3rd phase, 4th phase, 5th phase. Actions: A. Buying materials and plants. B. Issuing the authorisations. C. Project designing. D. Care and maintenance. E. Building the garden.

51 Matching/Ordering Useful to test relationships Easy to mark.
Need to be treated as one single item, as there is dependency between the responses.

52 More generally on item writing
Are you really testing what you think you are testing? For example: In a reading test, can you arrive at the correct answer without reading the stimulus? In a science test, can you extract the answer from the stimulus alone, without the scientific knowledge that you profess to be testing? In a maths test, is the stumbling block to do with understanding the stimulus, or to do with solving the problem?

53 Constructed Response Items

54 Non multiple choice format
Examples: Constructed response Performance Motivation: Face validity, for testing higher order thinking School reform: Avoid multiple choice teaching, and avoid testing fragmented knowledge and skills.

55 Caution about Performance format
Check validity carefully. E.g., an evaluation of the Vermont statewide “portfolio” assessment (1991) concluded that the assessments had low reliability and validity. Problems with rater judgement and scoring reliably, e.g., quality of handwriting; presentation. 3-10 times more expensive. Bennett & Ward (1993); Osterlind (1998); Haladyna (1997)

56 This slide may have some truth about it
This slide may have some truth about it! A study was carried out examining the differences in scores between a writing task administered online and on paper-and-pencil. Next slide shows some results from this study.

57 Example - a study comparing Online and Paper writing task
A writing task was administered online and on paper. Online scores have been found to be lower than paper-and-pencil scores. Low ability students do “better” on paper-and-pencil writing task, about 1 score point difference out of 10.

58 Improve between-rater agreement
Clear and comprehensive marking guide. Need trialling to get a wide range of responses. Need training for markers. Need monitoring for marker leniency/harshness. Better to mark by item than by student, to reduce dependency between items. On the last point: markers should mark question 1 for all students, and then question 2 for all students, rather than marking all questions for student 1, then all questions for student 2. This is because of the so-called “halo” effect: markers form an impression about a student’s work. For example, a marker often thinks of a student as an “A” student or a “B” student, and tends to award more similar scores when marking all questions for each student. More independent markings can be obtained if marking is done not by student booklet, but by item. Of course, sometimes the logistics may be too difficult and this will not be possible.

59 Work towards some middle ground?
Constructed response format with computer assisted scoring: [Figure: house plan with dimensions 6.4 m and 9.6 m.] Estimate the floor area of the house. Capture raw numeric response, e.g., 61.44, 60, 6144. Computer will recode and score.

60 Computer assisted scoring
Formulate firm scoring rules AFTER we examine the data Other examples, Household spending Hours spent on homework Idea is to capture maximum amount of information with lowest cost. Capture all different responses. Can always collapse categories later
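A minimal sketch of this capture-then-recode approach, using the floor-area responses from the previous slide (61.44, 60, 6144). The category names and thresholds are hypothetical: in practice they would be set only after looking at the full distribution of raw responses, and scores would be attached to categories at that stage.

```python
from collections import Counter


def recode(raw, exact=61.44):
    """Recode a raw numeric response into a descriptive category (hypothetical rules)."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return "non-numeric"
    if abs(value - exact) < 0.005:
        return "exact"
    if abs(value - exact * 100) < 0.5:
        return "decimal-point error"
    if abs(value - exact) <= 0.1 * exact:
        return "reasonable estimate"
    return "other"


# Raw responses are stored untouched; recoding is a separate, repeatable step.
raw_responses = ["61.44", "60", "6144"]
categories = [recode(r) for r in raw_responses]
tally = Counter(categories)
```

Because the raw responses are kept, the categories can be collapsed or rescored later without re-marking.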

61 Scoring – formal vs psychometric
Technically correct/incorrect, versus manifestation of latent ability In deciding on how to score a response, always think of the level of latent ability to produce that response. E.g., In which sport is the Bledisloe Cup competed for? sample answers: rugby union, rugby league, rugby, sailing. How to score? Where are the corresponding levels of familiarity with sport?

62 Psychometric considerations in scoring - 1
Consider where you would place a person on the ability continuum. Score the response according to the location on the continuum. Measurement is about predicting a person’s level of ability. In which sport is the Bledisloe Cup competed for? Sailing! Rugby league! Rugby! Rugby union! Not Familiar with sport Familiar with sport Score 0 Score 1 Scale of familiarity with sport

63 Psychometric considerations in scoring - 2
May be better to place the persons as follows: In which sport is the Bledisloe Cup competed for? Sailing! Rugby league! Rugby! Rugby union! Not Familiar with sport Familiar with sport Score 0 \---Score 1---/ Scale of familiarity with sport

64 Scoring - another example
What is the area of the following shape? [Figure: shape with dimensions 4 m and 8 m.] Consider these responses: 16 m²; 16 m; 16; 32 m²; 32; 12 m²; no response. How to score these? Where are the levels of latent ability corresponding to these responses? Ideally, we need scoring that satisfies both technical soundness and psychometric properties.

65 How to decide on weights?
Should more difficult items get higher scores? Should items requiring more time get higher scores? In general, more difficult items should not have more weight, unless the items are more discriminating. One rationale is that there should be an equal penalty whether a person fails on an easy item or a difficult item. If all items tap into the same “latent variable”, then a person high on the latent variable is unlikely to get easy items wrong. So the situation of having easy ones wrong and difficult ones right is an indication that the items do not tap into the same construct. On the other hand, if all items do tap into the same construct, then “item difficulty” should not play a part in the weight of the score. If a test is not speeded, that is, everyone has enough time to complete the test, then items that require more time to complete should not get more weight. However, if a test is speeded, then students who completed items that required more time may be disadvantaged, as they did not have much opportunity to complete shorter items. It is recommended that tests should not be speeded.

66 Partial Credit Scoring
If the data support partial credit scoring*, then it is better to use partial credit rather than dichotomous. Information will be lost if dichotomous scoring is used. *Data support partial credit scoring when the average ability for each score category increases with increasing score, and the point-biserial increases in order of score categories.
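The first of the two conditions in the footnote can be checked directly from item-level data. The sketch below covers only the mean-ability condition, not the point-biserial one; the (score, ability) pairs are made up for illustration and the function names are hypothetical.

```python
from statistics import mean


def mean_ability_by_score(records):
    """Map each score category to the mean ability of students who received it."""
    groups = {}
    for score, ability in records:
        groups.setdefault(score, []).append(ability)
    return {score: mean(vals) for score, vals in sorted(groups.items())}


def means_increase_with_score(records):
    """True if mean ability strictly increases from each score category to the next."""
    means = list(mean_ability_by_score(records).values())
    return all(a < b for a, b in zip(means, means[1:]))


# Hypothetical (score, ability estimate) pairs for one item.
records = [(0, -1.2), (0, -0.8), (1, -0.1), (1, 0.3), (2, 0.9), (2, 1.3)]
supported = means_increase_with_score(records)

# A disordered case: students scoring 1 are less able on average than those scoring 0.
disordered = [(0, 0.5), (1, -0.5), (2, 1.0)]
not_supported = means_increase_with_score(disordered)
```

When the means are disordered, collapsing the offending categories is usually a safer choice than keeping the partial credit scoring.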

67 Practical guide to partial credit scoring
Within an item Increasing score should correspond with increasing proficiency/ability Across items The maximum score for each item should correspond with the amount of “information” provided by the item about students’ proficiency/ability
