9 Scaling methods: Most are summative rating scales. They may be unidimensional or multidimensional.
10 Method of paired comparisons: Also known as forced choice. The test taker is forced to pick one of two items paired together.
11 Comparative scaling: Test takers sort cards or rank items from “least” to “most”.
12 Categorical scaling: Test takers sort cards into one of two or more categories. Stimuli are thought to differ quantitatively, not qualitatively.
13 Likert-type scales: Response choices are ordered on a continuum from one extreme to the other (e.g., strongly agree to strongly disagree). Likert scoring assumes an interval scale, although this assumption may not hold in practice.
14 Guttman scales: Response choices for each item are statements that lie on a continuum. Endorsing the most extreme statement implies endorsement of the milder statements as well.
15 Method of equal-appearing intervals: Presumed to yield interval-level data. For a knowledge scale: obtain true/false statements; experts rate each item. For an attitude scale: judges rate each item on a Likert scale, assuming equal intervals. For both: the test taker’s total score is based on “weighted” items, with weights determined by averaging the experts’ ratings.
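The weighting step can be sketched in Python. This is a minimal illustration, assuming each item's weight is the mean of its judges' ratings and that a test taker's total is the sum of the weights of the items they endorsed; the function names, the scoring rule, and the sample ratings are hypothetical, not taken from the slides.

```python
from statistics import mean

def item_weights(judge_ratings):
    """Map each item to the mean of its judges' ratings (its weight)."""
    return {item: mean(ratings) for item, ratings in judge_ratings.items()}

def weighted_total(endorsed_items, weights):
    """Sum the weights of the endorsed items (assumed scoring rule)."""
    return sum(weights[item] for item in endorsed_items)

# Hypothetical ratings from three judges on an 11-point scale.
ratings = {"item_a": [2, 3, 4], "item_b": [8, 9, 10], "item_c": [5, 5, 5]}
weights = item_weights(ratings)
total = weighted_total(["item_a", "item_b"], weights)
```

Some scoring schemes average the endorsed weights instead of summing them; swapping `sum` for `mean` in `weighted_total` would cover that variant.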
16 Method of absolute scaling: A way to determine the difficulty level of items. Give the items to several age groups, with one age group acting as the anchor. Item difficulty is assessed by comparing each age group’s performance on each item with that of the anchor group.
17 Method of empirical keying: Based entirely on empirical findings. The test developer writes several items and gives them to a group of people known to possess the construct and a group known not to possess it. Items are selected based on how well they distinguish one group from the other.
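As a rough sketch of the selection step, the Python below keeps items whose endorsement rates differ between the two groups by at least some gap. The 0.3 cutoff, the function names, and the sample data are all assumptions made for illustration; the slides do not specify a selection rule.

```python
def endorsement_rate(responses):
    """Proportion of 1s in a list of 0/1 responses."""
    return sum(responses) / len(responses)

def select_items(criterion_group, control_group, min_gap=0.3):
    """Keep items whose endorsement rates differ by at least min_gap.

    Each argument maps an item name to that group's 0/1 responses.
    min_gap is a hypothetical cutoff, not a published standard.
    """
    kept = []
    for item in criterion_group:
        gap = (endorsement_rate(criterion_group[item])
               - endorsement_rate(control_group[item]))
        if abs(gap) >= min_gap:
            kept.append(item)
    return kept

# Hypothetical responses: q1 separates the groups, q2 does not.
has_construct = {"q1": [1, 1, 1, 0], "q2": [1, 0, 1, 0]}
lacks_construct = {"q1": [0, 0, 1, 0], "q2": [1, 0, 0, 1]}
selected = select_items(has_construct, lacks_construct)
```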
26 Scoring items: Cumulative model; class/category; ipsative; correction for guessing.
27 3. Test tryout: Should be conducted on a group that represents the ultimate test takers (those the test is intended for). Good items are reliable, are valid, and discriminate well.
28 Before item analysis, look at the variability of scores within the test. Is there a floor effect? A ceiling effect?
29 4. Item analysis: Helps determine which items should be kept, revised, or deleted.
30 Item-difficulty index: The proportion of examinees who get the item correct. Averaging across items gives a mean item difficulty.
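A minimal Python sketch of these two quantities, assuming items are scored 0 (wrong) or 1 (right) and that the score matrix has one row per examinee and one column per item. The function names and sample data are illustrative.

```python
def item_difficulty(responses):
    """Proportion of examinees who answered the item correctly (0/1 scores)."""
    return sum(responses) / len(responses)

def mean_item_difficulty(score_matrix):
    """Average the item-difficulty index over all items.

    score_matrix: list of rows, one per examinee, each a list of 0/1 scores.
    """
    n_items = len(score_matrix[0])
    per_item = [item_difficulty([row[j] for row in score_matrix])
                for j in range(n_items)]
    return sum(per_item) / n_items

# Hypothetical scores for four examinees on three items.
scores = [[1, 1, 0],
          [1, 0, 0],
          [1, 1, 1],
          [0, 0, 0]]
first_item_p = item_difficulty([row[0] for row in scores])
average_p = mean_item_difficulty(scores)
```

Note that a higher value means an easier item, since more examinees got it right.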
31 Ideal item difficulty: When using multiple-choice items, account for the probability of answering correctly by chance. Optimal item difficulty = (1 + g)/2, where g is the probability of a correct guess. The exception to choosing item difficulty around the mid-range involves tests of extreme groups.
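The formula can be read as the midpoint between chance-level success and a perfect score, that is, (1 + g)/2. The sketch below assumes g equals 1 divided by the number of equally attractive response options; the function name is hypothetical.

```python
def optimal_difficulty(n_options):
    """Midpoint between the chance success rate and 1.0.

    Assumes every option is equally likely to be picked when guessing,
    so the chance rate g is 1 / n_options.
    """
    g = 1 / n_options
    return (1 + g) / 2

true_false = optimal_difficulty(2)    # chance rate 0.50
four_choice = optimal_difficulty(4)   # chance rate 0.25
```

With two options the target difficulty is 0.75, and with four options it is 0.625, so the optimum moves toward 0.5 as guessing becomes less likely to succeed.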
32 Item endorsement: The proportion of examinees who endorsed the item.
33 Item reliability index: An indication of internal consistency. It is the product of the item SD and the correlation between the item and the total scale score. Items with low reliability can be eliminated.
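A small Python sketch of this product, assuming the population standard deviation is used for the item and a plain Pearson correlation for the item-total relationship. The helper names and data are illustrative, and both inputs need nonzero variance.

```python
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (nonzero variance)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5

def item_reliability_index(item_scores, total_scores):
    """Item SD multiplied by the item-total correlation."""
    return pstdev(item_scores) * pearson(item_scores, total_scores)

# Hypothetical data: 0/1 item scores and total test scores for four examinees.
item = [1, 1, 0, 0]
totals = [10, 8, 4, 2]
index = item_reliability_index(item, totals)
```

For a 0/1 item the SD cannot exceed 0.5, so the index is bounded by 0.5 in absolute value.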
34 Item validity index: Correlate the item with the criterion (helps identify predictively useful test items). Multiply the correlation between the item score and the criterion score by the SD of the item. The usefulness of an item also depends on its dispersion, or ability to discriminate.
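The sketch below takes the item validity index to be the item SD times the item-criterion correlation, which is the usual textbook definition. It assumes the population SD and Pearson correlation; the names and sample data are hypothetical.

```python
from statistics import mean, pstdev

def item_validity_index(item_scores, criterion_scores):
    """Item SD multiplied by the item-criterion Pearson correlation.

    Both sequences must have nonzero variance.
    """
    mi, mc = mean(item_scores), mean(criterion_scores)
    cov = sum((i - mi) * (c - mc)
              for i, c in zip(item_scores, criterion_scores))
    ss_i = sum((i - mi) ** 2 for i in item_scores)
    ss_c = sum((c - mc) ** 2 for c in criterion_scores)
    r = cov / (ss_i * ss_c) ** 0.5
    return pstdev(item_scores) * r

# Hypothetical data: 0/1 item scores and a criterion such as a supervisor rating.
item = [1, 1, 0, 0, 1]
criterion = [3.5, 3.0, 2.0, 2.5, 4.0]
validity = item_validity_index(item, criterion)
```

A positive value indicates that examinees who pass the item tend to score higher on the criterion.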
35 Item discrimination index: How well the item discriminates between high scorers and low scorers on the test. For each item, compare the performance of those in the upper vs. lower performance ranges. Formula: d = (U - L)/N, where U = number of people in the upper range who got the item right, L = number of people in the lower range who got the item right, and N = total number of people in the upper OR lower range.
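The formula d = (U - L)/N can be sketched in Python as follows. The slides do not say how large the upper and lower ranges should be; the default of 27% used here is a common convention and is an assumption, as are the function name and sample data.

```python
def discrimination_index(item_scores, total_scores, fraction=0.27):
    """d = (U - L) / N for one item scored 0/1.

    fraction: share of examinees placed in each extreme group
    (0.27 is a conventional choice, assumed here).
    """
    n_people = len(total_scores)
    group_size = max(1, round(n_people * fraction))
    # Rank examinees by total test score, lowest first.
    order = sorted(range(n_people), key=lambda i: total_scores[i])
    lower, upper = order[:group_size], order[-group_size:]
    u = sum(item_scores[i] for i in upper)  # upper-range examinees who got it right
    l = sum(item_scores[i] for i in lower)  # lower-range examinees who got it right
    return (u - l) / group_size

# Hypothetical data for eight examinees.
totals = [9, 8, 7, 3, 2, 1, 6, 4]
good_item = [1, 1, 1, 0, 0, 0, 1, 0]
d_good = discrimination_index(good_item, totals)
```

Ties in total score are broken by position in the list, which is fine for illustration but may matter in small samples.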
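36 Interpreting the IDI: Can vary from -1 to +1. A negative number = more low scorers than high scorers got the item right. A 0 indicates = the item does not discriminate between the two groups. The closer the IDI is to +1, the better the item discriminates. The IDI approach can also be used to examine the pattern of incorrect responses.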
37 Item characteristic curves: “Graphic representation of item difficulty and discrimination.” Horizontal axis = ability; vertical axis = probability of a correct response.
38 The curve plots the probability of a correct response against the examinee’s standing on the entire test. If the curve slopes upward or is S-shaped, the item is doing a good job of separating low and high scorers.
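One common way to draw an S-shaped item characteristic curve is the two-parameter logistic model. The slides do not name a model, so the choice below is an assumption, as are the parameter names `a` (discrimination, the steepness of the curve) and `b` (difficulty, the ability level at which the probability is 0.5).

```python
import math

def icc_2pl(theta, a, b):
    """Probability of a correct response at ability theta.

    Two-parameter logistic curve (assumed model): a sets the slope,
    b sets the ability level where the probability equals 0.5.
    """
    return 1 / (1 + math.exp(-a * (theta - b)))

# Compare a steep item and a flat item at low and high ability levels.
abilities = [-2, 0, 2]
steep = [icc_2pl(t, a=2.0, b=0.0) for t in abilities]
flat = [icc_2pl(t, a=0.3, b=0.0) for t in abilities]
```

The steep item's probabilities spread much further apart between low and high ability than the flat item's do, which is the sense in which a steeper curve separates low and high scorers better.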
39 Item fairness: Items should measure the same thing across groups. Items should have similar ICCs across groups. Items should have similar predictive validity across groups.
40 Speed tests: Items are easy and similar, so everyone gets them correct. The test measures response time. Traditional item analyses do not apply.
41 Qualitative item analysis: Test takers’ descriptions of the test; think-aloud administrations; expert panels.
42 5. Revising the test: Based on the information obtained from the item analysis. New items and additional testing of these items may be required.
43 Cross-validation: Once you have your revised test, you need to seek new, independent confirmation of the test’s validity. The researcher uses a new sample to determine whether the test predicts the criterion as well as it did in the original sample.
44 Validity shrinkage: Typically, with cross-validation, you will find that the test is less accurate in predicting the criterion with the new sample.
45 Co-validation: Validating two or more tests at the same time. Co-norming. Saves money. Beneficial for tests that are used together.
46 6. Publishing the test: The final step, which involves development of a test manual.
47 Production of testing materials: Testing materials that are user-friendly will be more readily accepted. The layout of the materials should allow for smooth administration.
48 Technical manual: Summarizes the technical data and references. Item analyses, scale reliabilities, validation evidence, etc. can be found here.
49 User’s manual: Provides instructions for administration, scoring, and interpretation. The Standards for Educational and Psychological Testing recommend that manuals meet several goals (p. 135). Two of the most important: 1. describe the rationale and recommended uses of the test; 2. provide data on reliability and validity.