
1 LINKING AND EQUATING OF TEST SCORES Hariharan Swaminathan University of Connecticut

2 AN EXAMPLE CONVERTING TEMPERATURE FROM ONE SCALE TO ANOTHER CELSIUS TO FAHRENHEIT: F = C * (9/5) + 32 THIS IS A LINEAR MAPPING OF ONE SCORE ONTO THAT OF ANOTHER

3 AN EXAMPLE OF EQUIVALENCE OF MEASUREMENTS MEASURING BLOOD SUGAR Equivalence of Lab Measurement and Home Measurement How do we do this? 1. Measure Blood Sugar in the Lab 2. Measure Blood Sugar Using Home Instrument 3. Plot one value against the other

4 AN EXAMPLE OF EQUIVALENCE OF MEASUREMENTS LAB = 1.021*HOME + 4.554

5 AN EXAMPLE OF EQUIVALENCE OF MEASUREMENTS COST OF LIVING COST OF GOODS IN 2015 COMPARED WITH COST IN 1975 • How do we do this? 1. Record costs of comparable goods (Market Basket) 2. Find cost ratio (Economist's procedure) or 3. Plot one value against the other

6 AN EXAMPLE OF EQUIVALENCE OF MEASUREMENTS COST 2015 = 7.35*COST 1975 + 7.175

7 The above are examples of relating two measurement scales to each other. In the first example, we realize that one scale bears a simple relation to the other. We are just changing the units of measurement. In the second, we are using two different instruments to measure the same entity, and using prediction to establish equivalence. In the third, we are trying to relate two measurements, under the assumption that the objects we measure remain the same. We are “equating” in all these cases. EQUATING

8 Multiple forms of a test are necessary in many testing situations to ensure the validity of scores. Examples: • Many large scale testing programs have several administration dates in a given year. • Students may be permitted multiple attempts on an exam. • To assess growth or change, an individual must be assessed on several occasions over time. • Many states are required to release some or all test items after administration and hence require new forms. THE CONTEXT OF EQUATING

9 In each case, the validity of the scores would be threatened by using the same test. Using the same test across administrations at different times would result in exposure of the test items and would mean that students who were tested later had an advantage. Using the same test across time for the same students would confound practice or memory effects with true growth or change. THE CONTEXT OF EQUATING

10 To avoid these problems, different test forms are constructed to the same specifications. The different test forms are intended to measure the same skills and be of similar difficulty and discrimination. Tests that measure the same construct, have the same mean, variance, reliability, and correlations with other tests are said to be “parallel”. THE CONTEXT OF EQUATING

11 If the test forms were truly parallel, equating would not be necessary. However, test forms will always differ somewhat, and it is necessary to take this into account somehow before scores on different forms can be compared. THE CONTEXT OF EQUATING

12 Successful equating will result in scores that can be validly compared even though they were obtained on different test forms. The overarching principle of equating is that it should be equitable for all test-takers: after equating, it should not matter to individuals whether they take Form X or Form Y. THE CONTEXT OF EQUATING

13 • Equating procedures are designed to produce scores that can be validly compared across individuals even when they took different versions of a test (different "test forms"). • Linking, scale aligning, or simply scaling, involves transforming the scores from different tests onto a common scale. These procedures produce scores that are expressed on a common scale, but the scores of individuals who took different tests cannot be compared in the strict sense. EQUATING AND LINKING

14 • Linking is used for establishing the relationship between two tests that are not built to the same content specifications or to have the same difficulty. • One common example is to establish a relationship between the scores obtained on the SAT and the ACT, two college entrance examinations. • The ACT and SAT are measures of the same general constructs (aptitude). Concordance tables are constructed to express the relationship between the two scores; they are, however, not exchangeable. LINKING VS. EQUATING

15 • Similarly, in India, the relationship between scores obtained on secondary school certificate examinations conducted by two boards must be established through linking. • While these tests are not exchangeable, the linking provides some degree of comparability. • If all students take a common college entrance examination, the board scores can be linked through this common examination. This linking will be a little stronger. LINKING VS. EQUATING

16 • Equating is one of the core applications of measurement theory: it is central to any testing program that uses multiple test administrations. • There are two different measurement frameworks in common use today; equating is approached differently in these frameworks. • The older framework (c. 1900s) is now referred to as Classical Test Theory (CTT). • The newer (c. 1970s) framework is Item Response Theory (IRT). MEASUREMENT FRAMEWORKS FOR EQUATING

17 • Classical Test Theory is based on number-correct (raw) test scores. • Equating under CTT involves creating a table for converting raw scores on one test to the scale of raw scores on the other test. The equated scores may then be linearly transformed to a desired reporting scale. • Equating in a classical framework lacks a strong theoretical basis. MEASUREMENT FRAMEWORKS FOR EQUATING

18 • IRT provides a strong theoretical basis for equating. • IRT provides mathematical models that relate an individual's performance on a test item to characteristics of the item and the individual's value on the underlying latent trait measured by the test. • Instead of using raw score as a measure of performance, we use the estimated trait value for each individual. • Equating is a simpler matter under IRT than CTT. MEASUREMENT FRAMEWORKS FOR EQUATING

19 • However, IRT requires strong assumptions and large sample sizes for estimating the parameters of the model. • CTT equating procedures remain in wide use in situations where the assumptions of IRT are not met, where testing populations are small, or where raw scores are the preferred scores. MEASUREMENT FRAMEWORKS FOR EQUATING

20 Within the classical test theory framework, equating can only be achieved under certain conditions. 1. The tests must measure the same construct. Scores that measure different constructs can never be equated - we could not fairly compare the mathematics score of one student with the reading score of another! REQUIREMENTS FOR CLASSICAL EQUATING

21 Within the classical framework, equating can only be achieved under certain conditions: 2. The tests should be equally reliable. If the test forms are unequally reliable, higher-performing students would benefit from the more reliable test, and lower-performing students would benefit from the less reliable test. REQUIREMENTS FOR CLASSICAL EQUATING

22 Within the classical framework, equating can only be achieved under certain conditions: 3. The conversion should be symmetric: it should result in the same equated scores regardless of whether Form Y scores are converted to the scale of Form X or Form X scores are converted to the scale of Form Y. For example, if a score of 28 on Form Y equates to a score of 26 on Form X, then a score of 26 on Form X should equate to a score of 28 on Form Y. REQUIREMENTS FOR CLASSICAL EQUATING

23 Within the classical framework, equating can only be achieved under certain conditions: 4. The equating function should be invariant across subpopulations of students at different levels of performance. The distribution of performance in the groups taking the different test forms should not affect the equating function; if it does, then the equating may not be equitable to groups of students with a different distribution of performance. REQUIREMENTS FOR CLASSICAL EQUATING

24 • Fully meeting these requirements is almost impossible in practice (particularly the requirements of equal reliability and invariance across all subpopulations). • Nevertheless, some adjustment of scores from different test forms must be made to compensate for lack of equivalence of test forms. • We keep the requirements for equating in mind to ensure that we do the best possible job. REQUIREMENTS FOR CLASSICAL EQUATING

25 • Linking of test scores can only be achieved if appropriate data are collected to establish the relationship between the score scales. • There are several possible data collection designs: 1. Single-group (SG) design 2. Equivalent-groups (EG) design 3. Non-equivalent anchor test (NEAT) design DATA COLLECTION DESIGNS FOR TEST SCORE LINKING

26 Single-Group Design: One group of test-takers takes both test forms. This is the ideal design, as we know that any difference in the distributions of scores is due to differences in difficulty and discrimination of the test forms. We then adjust the scores on one form to make the two distributions equal. DATA COLLECTION DESIGNS FOR TEST SCORE LINKING

Population   Sample   Form X   Form Y
P            1        *        *

27 Single-Group Design: Advantages: There is no confounding of test differences with group differences. Only one group of test-takers is needed. DATA COLLECTION DESIGNS FOR TEST SCORE LINKING

28 Single-Group Design: Disadvantages: Practice and fatigue effects may affect scores. These effects can be controlled by counterbalancing the order of administration of the forms, but this generally cannot be incorporated into an operational administration; it requires a special equating study. Extended testing time is required (may not be feasible). DATA COLLECTION DESIGNS FOR TEST SCORE LINKING

29 Equivalent-Groups Design: Two equivalent groups take one test form each. This design is an approximation to the single-group design. DATA COLLECTION DESIGNS FOR TEST SCORE LINKING

Population   Sample   Form X   Form Y
P            1        *
P            2                 *

30 Equivalent-Groups Design: Advantages: There is no confounding of test difficulty with group differences. DATA COLLECTION DESIGNS FOR TEST SCORE LINKING

31 Equivalent-Groups Design: Disadvantages: Large samples are required to ensure statistical equivalence. If the base form has been previously administered, it must be re-administered at the same time as the new form to ensure equivalent samples; this poses a test security risk. DATA COLLECTION DESIGNS FOR TEST SCORE LINKING

32 Non-equivalent Anchor Test (NEAT) Design: Two different groups take one test each; both groups take a common set of items (anchor test A). The anchor test controls for group differences. Individuals taking Form Y with the same anchor test score as individuals taking Form X should have the same score on Form X. DATA COLLECTION DESIGNS FOR TEST SCORE LINKING

Population   Sample   Form X   Anchor   Form Y
P            1        *        *
Q            2                 *        *

33 Non-equivalent Anchor Test (NEAT) Design: Advantages: The groups do not need to be equivalent. A new form can be equated to a previously administered form without re-administering the old form. Students take only one form of the test plus the anchor test. The design is easily incorporated into operational test administrations. DATA COLLECTION DESIGNS FOR TEST SCORE LINKING

34 Non-equivalent Anchor Test (NEAT) Design: Disadvantages: The anchor test must be of sufficient length and appropriate nature. The anchor test items must behave in the same way for the two different groups (be equally difficult). If equating to a previously administered test, the anchor test items have also been previously administered and therefore may be known to test-takers. DATA COLLECTION DESIGNS FOR TEST SCORE LINKING

35 • The NEAT design is the most commonly used in practice. • The purpose of the anchor is to determine to what extent score differences on the test forms are due to ability or proficiency differences in the two groups. • Differences in test difficulty can then be separated from group differences. • Equating should adjust scores only for differences in test difficulty. DATA COLLECTION DESIGNS FOR TEST SCORE LINKING

36 • Anchor tests may be external (anchor items not included in score) or internal (anchor items included in score). • External anchors are often administered as separately timed sections. • Internal anchor items are spread throughout the test. • Both types of anchor are susceptible to exposure; internal anchors are susceptible to context effects. DATA COLLECTION DESIGNS FOR TEST SCORE LINKING

37 • In testing programs where scored items must be released after administration, an external anchor makes sense. • For classical equating methods, the anchor test should be a "mini-test" in composition. • Longer anchors will give better results. DATA COLLECTION DESIGNS FOR TEST SCORE LINKING

38 THE EQUATING PROBLEM • Suppose we administer two test forms X and Y to a group of students and obtain the score distributions shown below: Form X: Mean = 10.4, SD = 3.9; Form Y: Mean = 8.4, SD = 4.4

39 • Form Y is harder than Form X (the average score is lower on Form Y). • On average, students scored about two points lower on Form Y than Form X. • A student with a score of 10 on Form Y would probably have obtained a higher score if he or she had taken Form X. • Why not just add 2 points to the score of everyone who took Form Y? Then the average scores would be about the same. THE EQUATING PROBLEM

40 • But suppose the two tests were made up of items with difficulties like those shown below: THE EQUATING PROBLEM

Item Difficulty   Number of Items
                  Form X   Form Y
Very hard         1        1
Hard              6        5
Medium            7        11
Easy              4        2
Very easy         2        1

41 • For low ability students, Form Y is harder because there are fewer easy questions, so they should get points added to their Form Y score. THE EQUATING PROBLEM

42 • For high ability students, Form Y is slightly easier because there are fewer hard questions, so they should have points subtracted from their Form Y score. THE EQUATING PROBLEM

43 • Adding or subtracting the same number of points for every student would not be equitable. • We need an adjustment that is different for students at different levels of ability. CLASSICAL EQUATING PROCEDURES

44 • Why not do a regression analysis to predict the scores on Form X from the scores on Form Y? CLASSICAL EQUATING PROCEDURES The prediction equation is X = 4.142 + .751Y. A student with a score of 10 on Form Y is predicted to have a Form X score of 4.142 + .751(10) = 11.65

45 • For the conversion to be symmetric, a student with a (theoretical) score of 11.65 on Form X should get a predicted score of 10 on Form Y. CLASSICAL EQUATING PROCEDURES The prediction equation in the other direction is Y = -1.473 + .944X. A student with a score of 11.65 on Form X is predicted to have a Form Y score of -1.473 + .944(11.65) = 9.52

46 • The regression procedure is not symmetric: we get different equated scores depending on which test we use as the base form. CLASSICAL EQUATING PROCEDURES
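
The asymmetry is easy to demonstrate numerically. Below is a minimal sketch in Python using synthetic paired scores (the data-generating values and 500-examinee sample are invented for illustration, not the distributions from these slides): regressing X on Y and then Y on X does not return the starting score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic paired scores on two 20-item forms (illustrative only).
ability = rng.normal(0.0, 1.0, 500)
x = np.clip(np.round(10.4 + 3.9 * ability + rng.normal(0, 1.5, 500)), 0, 20)
y = np.clip(np.round(8.4 + 4.4 * ability + rng.normal(0, 1.5, 500)), 0, 20)

bx = np.polyfit(y, x, 1)   # regression predicting X from Y
by = np.polyfit(x, y, 1)   # regression predicting Y from X

y0 = 10.0
x_pred = np.polyval(bx, y0)      # Form X score predicted from Y = 10
y_back = np.polyval(by, x_pred)  # map that prediction back to Form Y

# Unless the correlation is perfect, y_back != y0: regression is not symmetric.
print(f"Y = {y0} -> predicted X = {x_pred:.2f} -> back to Y = {y_back:.2f}")
```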

47 • These problems can be addressed in one of two ways: there are two general approaches to equating test scores using classical procedures, linear methods and equipercentile methods. • Both require scores of the same or equivalent groups on the two tests (but can also be used with non-equivalent groups after some fancy footwork). CLASSICAL EQUATING PROCEDURES

48 • Under linear equating, scores on the two forms are considered equivalent if they are the same distance from the mean of the form in standard deviation units (i.e., have the same z-score). • Hence, there is a linear relationship between the two sets of scores. LINEAR EQUATING

49 • Scores x (on Form X) and y (on Form Y) are considered equivalent if they have the same z-score: (x - μ_X)/σ_X = (y - μ_Y)/σ_Y, where μ_X and σ_X are the mean and standard deviation of scores on Form X and μ_Y and σ_Y are the mean and standard deviation of scores on Form Y. LINEAR EQUATING

50 • Thus, x = (σ_X/σ_Y)(y - μ_Y) + μ_X. • Given sample data, the transformation required for rescaling Form Y scores to the scale of Form X uses the sample means and standard deviations. • The transformation for rescaling Form Y scores to the scale of Form X is y* = (s_X/s_Y)(y - ȳ) + x̄. LINEAR EQUATING

51 • After equating, the rescaled Form Y scores have the same mean and standard deviation as the Form X scores. • Note that this is not a prediction (regression) equation; it can be inverted to obtain the transformation for rescaling Form X scores to the scale of Form Y: x* = (s_Y/s_X)(x - x̄) + ȳ. LINEAR EQUATING

52 • If we substitute the transformed Form Y score y* (now on the Form X scale) for x in the equation, y* converts back to the original y value: (s_Y/s_X)(y* - x̄) + ȳ = (s_Y/s_X)(s_X/s_Y)(y - ȳ) + ȳ = y. LINEAR EQUATING
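
As a concrete sketch, the rescaling and its inverse can be written as two small Python functions (the function names are ours; the moments are the example values reported on the slides that follow), and the symmetry requirement can be checked directly:

```python
def linear_equate(y, mean_x, sd_x, mean_y, sd_y):
    """Rescale a Form Y score to the Form X scale by matching z-scores."""
    return sd_x / sd_y * (y - mean_y) + mean_x

def linear_equate_inv(x, mean_x, sd_x, mean_y, sd_y):
    """Rescale a Form X score to the Form Y scale (the inverse mapping)."""
    return sd_y / sd_x * (x - mean_x) + mean_y

# Example moments from the linear equating example in these slides.
mx, sx, my, sy = 10.429, 3.918, 8.375, 4.394

y = 9.0
y_star = linear_equate(y, mx, sx, my, sy)            # ~10.99 on the X scale
y_back = linear_equate_inv(y_star, mx, sx, my, sy)   # recovers 9.0
print(y_star, y_back)
```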

53 • Rescaling usually results in non-integer scores; these are generally rounded to integer values. • Under linear equating, it is possible to obtain rescaled scores outside the permissible range. • Depending on the final reporting scale, these are often truncated to the lowest and highest permissible scores (arbitrarily decided). LINEAR EQUATING

54 • For the test score distributions shown earlier: Mean of Form X: 10.429, SD of Form X: 3.918; Mean of Form Y: 8.375, SD of Form Y: 4.394 LINEAR EQUATING EXAMPLE

55 LINEAR EQUATING EXAMPLE

Score on Form Y   Equated Score on Form X   Rounded Equated Score
 0                 2.96                      3
 1                 3.85                      4
 2                 4.75                      5
 3                 5.64                      6
 4                 6.53                      7
 5                 7.42                      7
 6                 8.31                      8
 7                 9.21                      9
 8                10.10                     10
 9                10.99                     11
10                11.88                     12
11                12.77                     13
12                13.67                     14
13                14.56                     15
14                15.45                     15
15                16.34                     16
16                17.23                     17
17                18.13                     18
18                19.02                     19
19                19.91                     20
20                20.80                     20

A score of 3 on Form Y is equivalent to a score of 6 on Form X. A score of 9 on Form Y is equivalent to a score of 11 on Form X. Note that we have to truncate the equated scores at 20

56 LINEAR EQUATING EXAMPLE Looking at the same conversion table, note that we have added more points to scores at the low end of the scale than at the top end
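
Using linear_equate from the sketch above, the whole conversion table can be reproduced (up to small differences in the displayed decimals), including rounding and truncation to the 0-20 reporting range; the truncation rule here is an assumption matching the slide:

```python
for score_y in range(21):
    equated = linear_equate(score_y, mx, sx, my, sy)
    reported = min(max(round(equated), 0), 20)   # round, then truncate to 0-20
    print(f"{score_y:2d}  {equated:6.2f}  {reported:2d}")
```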

57 • Linear equating matches the means and standard deviations of the two distributions but it does not change the shape of the score distribution. • A Form Y score with the same z-score as a Form X score may have a different percentile rank. EQUIPERCENTILE EQUATING

58 • Under equipercentile equating, scores on two forms are considered to be equivalent if they have the same percentile rank in the target population. • After equating, the equated scores have the same distribution as the scores on the base form. EQUIPERCENTILE EQUATING

59 • Scores on the two forms with equal percentile ranks are plotted against each other. • The curve is usually irregular; smoothing procedures are applied under the assumption that the irregularity is due to sampling error. • Smoothing can be applied to each cumulative distribution function before obtaining equivalent scores (pre-smoothing) or it can be applied to the equipercentile equivalent curve (post-smoothing). • Equating usually results in non-integer scores; these are generally rounded to integer values. EQUIPERCENTILE EQUATING
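
A minimal sketch of the table-building step, without any smoothing: it matches cumulative percentages (as in the tables that follow) and interpolates linearly between Form X score points. Operational implementations differ in the percentile-rank convention (ranks are usually defined at score midpoints) and add pre- or post-smoothing.

```python
import numpy as np

def cum_percent(freqs):
    """Cumulative percentage distribution over scores 0..k."""
    f = np.asarray(freqs, dtype=float)
    return 100.0 * np.cumsum(f) / f.sum()

def equipercentile_table(freq_y, freq_x):
    """Map each Form Y score to the Form X score with the same cumulative
    percentage, interpolating between adjacent Form X score points.
    Values outside the observed Form X range are clamped by np.interp."""
    cdf_y = cum_percent(freq_y)
    cdf_x = cum_percent(freq_x)
    scores_x = np.arange(len(freq_x), dtype=float)
    return np.interp(cdf_y, cdf_x, scores_x)
```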

60 EQUIPERCENTILE EQUATING EXAMPLE

Form X                                     Form Y
Score   % of students   Cumulative %       Score   % of students   Cumulative %
0       0.1             0.1                0       1.0             1.0
1       0.2             0.3                1       3.5             4.5
2       1.2             1.5                2       4.4             8.9
3       2.1             3.5                3       5.1             14.0
4       2.7             6.3                4       5.9             19.9
5       4.6             10.9               5       7.8             27.7
6       5.5             16.4               6       8.1             35.8
7       6.8             23.2               7       9.3             45.1
8       9.0             32.2               8       8.8             53.9
9       10.8            43.0               9       9.0             62.8
10      9.5             52.5               10      7.8             70.6
11      9.7             62.2               11      6.9             77.5
12      7.8             69.9               12      5.1             82.5
13      7.0             76.9               13      4.1             86.6
14      6.8             83.7               14      3.1             89.7
15      5.4             89.1               15      2.9             92.5
16      3.2             92.4               16      2.2             94.7
17      3.1             95.5               17      2.2             96.9
18      2.7             98.2               18      1.4             98.3
19      1.4             99.6               19      0.9             99.3
20      0.4             100.0              20      0.7             100.0

A score of 9 on Form Y has (almost) the same percentile rank as a score of 11 on Form X; a score of 9 on Form Y is equivalent to a score of 11 on Form X

61 EQUIPERCENTILE EQUATING EXAMPLE Form X: Mean = 10.9, SD = 3.5; Form Y: Mean = 8.5, SD = 4.5. Form Y is harder than Form X

62 EQUIPERCENTILE EQUATING EXAMPLE Form Y is harder than Form X

63 EQUIPERCENTILE EQUATING EXAMPLE Cumulative test score distributions: Note that the curves are not perfectly smooth

64 EQUIPERCENTILE EQUATING EXAMPLE Smooth the curves and plot the smoothed distributions together:

65 EQUIPERCENTILE EQUATING EXAMPLE

Score on Form X   Percentile rank     Score on Form Y   Percentile rank
                  after smoothing                       after smoothing
0                 0.0                 0                 1.0
1                 0.0                 1                 1.5
2                 0.1                 2                 4.7
3                 0.6                 3                 10.0
4                 2.0                 4                 17.1
5                 4.8                 5                 25.0
6                 9.2                 6                 33.3
7                 15.4                7                 41.8
8                 23.2                8                 50.1
9                 32.1                9                 57.9
10                41.9                10                64.9
11                51.9                11                71.4
12                61.5                12                77.3
13                70.4                13                82.3
14                78.5                14                86.6
15                85.5                15                90.2
16                91.4                16                93.2
17                95.9                17                95.5
18                98.8                18                97.3
19                99.9                19                98.7
20                100.0               20                99.6

What score on Form X has the same percentile rank as a score of 10 on Form Y?

66 EQUIPERCENTILE EQUATING EXAMPLE

Score on Form X   Percentile rank     Score on Form Y   Percentile rank
8                 23.2                8                 50.1
9                 32.1                9                 57.9
10                41.9                10                64.9
11                51.9                11                71.4
12                61.5                12                77.3
13                70.4                13                82.3
14                78.5                14                86.6

We interpolate between the scores of 12 and 13 to find the score on Form X that has a percentile rank of 64.9: Equated score = 12 + (64.9 - 61.5)/(70.4 - 61.5) = 12.38. We round this to a score of 12 for reporting purposes
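
The interpolation on this slide is just a weighted average of adjacent score points; in code:

```python
# Form Y score 10 has smoothed percentile rank 64.9; on Form X that rank
# falls between score 12 (rank 61.5) and score 13 (rank 70.4).
equated = 12 + (64.9 - 61.5) / (70.4 - 61.5)
print(round(equated, 2))  # 12.38 -> reported as 12 after rounding
```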

67 EQUIPERCENTILE EQUATING EXAMPLE

Score on Form Y   Equated Score on Form X   Rounded Equated Score
 0                 --                        2
 1                 3.17                      3
 2                 4.49                      4
 3                 5.62                      6
 4                 6.59                      7
 5                 7.50                      8
 6                 8.41                      8
 7                 9.31                      9
 8                10.20                     10
 9                11.12                     11
10                12.09                     12
11                13.04                     13
12                13.91                     14
13                14.72                     15
14                15.45                     15
15                16.13                     16
16                16.86                     17
17                17.35                     17
18                17.72                     18
19                19.05                     19
20                19.90                     20

A score of 3 on Form Y is equivalent to a score of 6 on Form X. A score of 14 on Form Y is equivalent to a score of 15 on Form X

68 EQUIPERCENTILE EQUATING EXAMPLE Looking at the same conversion table, note that as in linear equating, we have added more points to scores at the low end of the scale than at the top

69 • The nonlinearity in the equating relationship arises because we are trying to match the shape of the distribution of Form Y to the shape of the distribution of Form X, so we have to stretch the scale at some parts and compress it at others. EQUIPERCENTILE EQUATING

70 • Problems with equipercentile equating: • Smoothing introduces subjectivity; different smoothing procedures may produce different equating results. • Equivalents can only be produced in the observed score range for the samples used in the equating; extrapolation is required for scores outside the observed range. EQUIPERCENTILE EQUATING

71 • Linear equating assumes that the difference in difficulty between the forms is constant across the scale and that the same adjustment formula can be used everywhere. • If this assumption is met, linear and equipercentile methods will produce the same equating results. EQUIPERCENTILE VS LINEAR EQUATING

72 • Classical equating approaches with NEAT designs require the construction of a "synthetic" population that took both forms. • The anchor test is used to impute the distribution of scores for each group on the form they did not take, so that score distributions are obtained for both groups on both forms. CLASSICAL EQUATING WITH NEAT DESIGNS

73 • The synthetic population is a weighted combination of the two populations. • Equipercentile or linear equating procedures are then performed using the score distributions for the synthetic population. CLASSICAL EQUATING WITH NEAT DESIGNS

74 • The two most common linear equating approaches for NEAT designs are called the Tucker method and the Levine method. • These procedures make different assumptions about the relationship between the anchor test scores and the Form scores. • We will not explore the technical details of these procedures here. CLASSICAL EQUATING WITH NEAT DESIGNS

75 • The Levine method is generally better when the two groups have different distributions. • The Tucker method is generally better when one form is more difficult than the other. • When the groups are of similar proficiency levels and the tests are fairly similar in difficulty, the two approaches will produce similar results. COMPARISON OF TUCKER AND LEVINE EQUATING

76 • Chained linear equating is another form of linear equating that is used with the NEAT design. • Under chained linear equating, scores on Form Y are linearly equated to the anchor test scores, which are in turn linearly equated to the Form X scores. • Chained linear equating does not require strong assumptions about the relationship between the anchor test scores and the Form scores (unlike the Tucker and Levine methods). CHAINED LINEAR EQUATING
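
A sketch of the chained computation: each link is a linear function, so the chain is itself linear. The slopes and intercepts below are hypothetical placeholders, not estimates from any of these examples.

```python
def chained_linear(y, y_to_anchor, anchor_to_x):
    """Form Y score -> anchor scale -> Form X scale; each link is linear."""
    m1, k1 = y_to_anchor
    m2, k2 = anchor_to_x
    v = m1 * y + k1          # Form Y score expressed on the anchor scale
    return m2 * v + k2       # anchor-scale value expressed on the Form X scale

# Hypothetical links; composing them gives slope m1*m2, intercept m2*k1 + k2.
print(chained_linear(10, y_to_anchor=(0.48, 0.55), anchor_to_x=(1.91, 1.48)))
```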

77 • Two groups take the same 20-item tests as used in previous examples, but both take an external 10-item anchor test. • Group 1 takes Form X and the anchor test; Group 2 takes Form Y and the anchor test. • Group 2 is of higher average ability and takes the harder test. NEAT EQUATING EXAMPLE

78 NEAT EQUATING EXAMPLE Mean scores on Form X and Form Y are similar. Can we assume that the groups are of equal ability? Group 1: Mean on Form X: 10.396, s.d. on Form X: 3.985. Group 2: Mean on Form Y: 10.439, s.d. on Form Y: 4.619

79 NEAT EQUATING EXAMPLE The anchor test mean for Group 2 is higher, which indicates that Group 2 has higher average ability, which in turn implies that Form Y must have been harder Group 1: Mean on Anchor: 4.473 s.d. on Anchor: 2.181 Group 2: Mean on Anchor: 5.417 s.d. on Anchor: 2.321

80 NEAT EQUATING EXAMPLE • After constructing the synthetic population of test-takers who took both forms of the test, we see that the mean on Form Y is lower than the mean on Form X, i.e., Form Y is harder than Form X

Synthetic Population:    Tucker    Levine
Mean on Form X:          11.043    11.333
s.d. on Form X:           4.110     4.243
Mean on Form Y:           9.675     9.434
s.d. on Form Y:           4.593     4.573

81 NEAT EQUATING EXAMPLE • The equating coefficients for all methods are similar to those of the single group design

Tucker linear equating coefficients:  Slope = 0.895, Intercept = 2.385
Levine linear equating coefficients:  Slope = 0.928, Intercept = 2.582
Chained linear equating coefficients: Slope = 0.918, Intercept = 2.537
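
Applying the three sets of coefficients to every Form Y raw score reproduces the conversion table on the next slide (rounding to the nearest integer and truncating to the 0-20 range; the truncation rule is assumed from the earlier linear equating example):

```python
methods = {"Tucker": (0.895, 2.385),
           "Levine": (0.928, 2.582),
           "Chained": (0.918, 2.537)}

for y in range(21):
    row = {name: min(max(round(m * y + k), 0), 20)
           for name, (m, k) in methods.items()}
    print(y, row)
```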

82 NEAT EQUATING EXAMPLE

Score on Form Y   Rounded equated Score on Form X
                  Tucker   Levine   Chained Linear
 0                2        3        3
 1                3        4        3
 2                4        4        4
 3                5        5        5
 4                6        6        6
 5                7        7        7
 6                8        8        8
 7                9        9        9
 8                10       10       10
 9                10       11       11
10                11       12       12
11                12       13       13
12                13       14       14
13                14       15       14
14                15       16       15
15                16       17       16
16                17       17       17
17                18       18       18
18                18       19       19
19                19       20       20
20                20       20       20

83 • IRT provides a strong theoretical basis for equating. • Under IRT, the test score is replaced by an estimate of the test-taker's level of the trait being measured by the test. • The trait values do not depend on the particular items that are administered. • The only equating issue in IRT is setting a scale for reporting trait values and putting all parameter estimates on that common scale. IRT PROCEDURES FOR SCORE LINKING

84 • Under all IRT models, the scale for item and trait parameters is indeterminate, i.e., has no natural metric. • The issue is simply one of setting a scale for item and trait parameter estimates on one form and linearly transforming estimates from another form to the same scale. IRT PROCEDURES FOR SCORE LINKING

85 BASIC PRINCIPLES OF IRT • In IRT, we think of observed score as an (imperfect) indicator of a latent trait. • Item response models specify the relationship between performance on a test item and the latent trait or traits measured by the test. • Specifically, item response models specify a mathematical relationship between the probability of a given response to an item and the set of underlying traits.

86 BASIC PRINCIPLES OF IRT • The probability of the response is determined by characteristics of the item as well as the individual's trait value(s). • The most widely used models assume that the test is unidimensional, i.e., measures a single trait or construct.

87 ITEM RESPONSE FUNCTIONS FOR DICHOTOMOUS ITEMS (UNIDIMENSIONAL MODEL)

88 ADVANTAGES OF IRT • When an IRT model holds, 1. Person parameters (trait values) are invariant, i.e., they do not depend on the set of items administered; 2. Item parameters are invariant, i.e., they do not depend on the characteristics of the population. • These properties allow us to compare the trait estimates of individuals who have taken different forms of a test.

89 IRT MODELS • There are three widely used models for dichotomous responses and three widely used models for polytomous responses. • The majority of test items used on standardized assessments are dichotomously scored multiple choice items. • The models for dichotomous responses differ in the number of item characteristics that are assumed to influence test-taker performance in addition to test-taker ability or proficiency.

90 IRT MODELS • The two-parameter (2P) model assumes that responses are influenced by the difficulty and discrimination of the items. • The one-parameter (1P) model restricts the two-parameter model by assuming that item responses are influenced only by item difficulty, i.e., that all items are equally discriminating. • The three-parameter model (3P) extends the 2P model by including a parameter that allows for a non-zero probability of a correct response even at the lowest trait values (a "guessing" parameter).

91 THE 2P LOGISTIC MODEL

P(u_ij = 1 | θ_i) = exp[a_j(θ_i - b_j)] / (1 + exp[a_j(θ_i - b_j)])

u_ij is the response of individual i to item j
θ_i is the latent trait value of individual i
a_j is the discriminating power of item j
b_j is the difficulty level of item j
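
A direct translation of the 2P response function into Python (a sketch using the logistic form above, without the 1.7 scaling constant that some programs include):

```python
import numpy as np

def p_2pl(theta, a, b):
    """Probability of a correct response under the 2P logistic model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Illustrative item with a = 1.2, b = 0.5: the probability is 0.5 at theta = b.
for theta in (-2.0, 0.0, 0.5, 2.0):
    print(theta, round(float(p_2pl(theta, a=1.2, b=0.5)), 3))
```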

92 THE 2P LOGISTIC MODEL Discrimination a is proportional to the slope of the item response function at b

93 THE 1P LOGISTIC MODEL

P(u_ij = 1 | θ_i) = exp(θ_i - b_j) / (1 + exp(θ_i - b_j))

u_ij is the response of individual i to item j
θ_i is the latent trait value of individual i
b_j is the difficulty level of item j

94 THE 3P LOGISTIC MODEL

P(u_ij = 1 | θ_i) = c_j + (1 - c_j) · exp[a_j(θ_i - b_j)] / (1 + exp[a_j(θ_i - b_j)])

u_ij is the response of individual i to item j
θ_i is the latent trait value of individual i
a_j is the discriminating power of item j
b_j is the difficulty level of item j
c_j is the "guessing" parameter (lower asymptote) of item j
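
The 1P and 3P functions follow from the same building block (reusing p_2pl from the sketch above):

```python
def p_1pl(theta, b):
    """1P model: a common discrimination for all items (fixed at 1 here)."""
    return p_2pl(theta, 1.0, b)

def p_3pl(theta, a, b, c):
    """3P model: lower asymptote c gives a non-zero probability of a
    correct response even at the lowest trait values."""
    return c + (1.0 - c) * p_2pl(theta, a, b)
```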

95 • IRT requires relatively large samples for accurate estimation of the model parameters (at least 1000 for the 3P model). • Specialized computer programs are required. • The advantages of IRT hold only when the model adequately fits the data. It is essential that the fit of the model to the data be investigated before the results are used. USING IRT MODELS

96 • Under the 1P model, the scale for θ is arbitrary as long as b is expressed on the same scale: θ*_i - b*_j = θ_i - b_j, where θ*_i = θ_i + K and b*_j = b_j + K. • We can choose whatever metric we want for θ, as long as we use the same metric for item difficulty b. IRT PROCEDURES FOR SCORE LINKING

97 • Under the 2P and 3P models, the term a_j(θ_i - b_j) is invariant over linear transformations of the θ scale, as long as a and b are transformed correspondingly: a*_j(θ*_i - b*_j) = a_j(θ_i - b_j), where θ*_i = M·θ_i + K, b*_j = M·b_j + K, and a*_j = a_j/M. IRT PROCEDURES FOR SCORE LINKING
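
The invariance is easy to verify numerically (arbitrary illustrative values):

```python
theta, a, b = 0.8, 1.2, -0.3
M, K = 1.5, 0.4                  # an arbitrary linear rescaling of the scale

theta_s = M * theta + K
b_s = M * b + K
a_s = a / M

# a*(theta - b) is unchanged (up to floating-point rounding), so every
# response probability is unchanged as well.
print(a * (theta - b), a_s * (theta_s - b_s))
```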

98 • The scale for parameter estimates is set during calibration, either by setting the mean and SD of θ to be 0 and 1, respectively, or setting the mean and SD of the difficulty values to be 0 and 1. • Most software packages fix the scale on θ. • When different tests are calibrated in different groups, the scales will be set differently. • However, estimates for the same items or people will differ only by a linear transformation. IRT PROCEDURES FOR SCORE LINKING

99 • We need to rescale the parameter estimates from the different tests and for the different groups so that they are on a common scale. • Item response data must be collected using one of the data collection designs. • For the single group design, we can simply calibrate all items together; the estimates will automatically be on the same scale. • For the equivalent groups design, we assume that the groups have the same mean and SD and fix the scale on θ in each group/test. IRT PROCEDURES FOR SCORE LINKING

100 • For the NEAT design, there are two general sets of methods: Calibration methods - Concurrent Calibration - Fixed Item Parameter Calibration; Transformation methods - Mean and sigma approach (and variations) - Characteristic curve method (Stocking-Lord method) IRT PROCEDURES FOR SCORE LINKING

101 • Calibration methods automatically put all parameter estimates on a common scale. • Transformation methods require determination of the linear relationship between parameter estimates from the two forms. IRT PROCEDURES FOR SCORE LINKING

102 • For the NEAT design, we can calibrate all items and all groups together, treating the items not taken in each group as missing (indicated by 9):

00110101001000000000000110001099999999999999999999
00011110101101000110001101001199999999999999999999
01011111101000110100001110001199999999999999999999
00111111001000000000001110000099999999999999999999
99999999999999999999111011110111111111111111101111
99999999999999999999111101101111111011100111111111
99999999999999999999100000001001110000001100010000
99999999999999999999100001001011111010111111000000
99999999999999999999111100111111110111111111101111

Columns: Form X items, Anchor items, Form Y items; 9 = missing responses • This procedure is called concurrent calibration. CONCURRENT CALIBRATION
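
A sketch of how such a sparse matrix might be assembled, assuming the column layout implied above (20 unique Form X items, then 10 anchor items, then 20 unique Form Y items; the random responses are placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, missing = 4, 5, 9

g1 = rng.integers(0, 2, size=(n1, 30))   # Form X unique + anchor responses
g2 = rng.integers(0, 2, size=(n2, 30))   # anchor + Form Y unique responses

data = np.full((n1 + n2, 50), missing)
data[:n1, :30] = g1                      # columns 0-29: X unique + anchor
data[n1:, 20:] = g2                      # columns 20-49: anchor + Y unique
```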

103 • Concurrent calibration does not work well if there are large mean differences in trait levels between groups, unless group membership is explicitly taken into account. • Modern computer programs allow a group membership code and separate trait distributions in each group. • The mean and SD of the trait values in each group are allowed to differ and are estimated. CONCURRENT CALIBRATION

104 • An alternative approach is to calibrate the items for each group separately, but hold the item parameters for the common items in the second group fixed at the values obtained for the common items in the base group. • This procedure is called fixed item parameter calibration. • Fixed item parameter calibration sometimes causes estimation problems, especially when the groups are very different in trait distributions. FIXED ITEM PARAMETER CALIBRATION

105 • With this approach, we calibrate items separately in each group and determine the linear transformation needed to rescale one group to the scale of the other by comparing the parameter estimates on the anchor items. • Concurrent calibration is theoretically preferred in general; however, the sparseness of the response matrix in NEAT designs may cause estimation problems in some cases. TRANSFORMATION METHODS

106 • Calculate the mean and SD of the difficulty parameters for the anchor items in the two groups. • Determine the linear transformation that would transform the mean and SD of anchor item difficulties in Group 1 (Form X) to equal those of Group 2 (Form Y). • We know that under a linear transformation b* = M·b + K, the mean and SD of the difficulties transform as b̄* = M·b̄ + K and s* = M·s. MEAN AND SIGMA METHOD

107 • Since b̄_Y = M·b̄_X + K and s_bY = M·s_bX, therefore M = s_bY/s_bX and K = b̄_Y - M·b̄_X. MEAN AND SIGMA METHOD

108 • This transformation can be inverted to place parameters from Form Y/Group 2 on the scale of Form X/Group 1 (Symmetry requirement): b* = (b_Y - K)/M and a* = M·a_Y. • Once the equating coefficients M and K are determined, item and trait parameter estimates for the unique (non-anchor) items of Form Y are transformed to the scale of Form X. MEAN AND SIGMA METHOD

109 • Parameter estimates for the anchor items are averaged. • For the 1P model, only the additive constant K is calculated. • The trait estimates for Group 2 are transformed using the same transformation as was used for the item difficulty parameters. MEAN AND SIGMA METHOD

110 MEAN AND SIGMA METHOD EXAMPLE

Form X unique items:
Item    a      b      c
1       1.08   0.59   0.21
2       0.63   1.22   0.12
3       0.92  -0.48   0.18
4       1.37   0.69   0.22
5       0.57  -1.11   0.09
6       0.94  -0.53   0.06

Common items (left: Form X calibration; right: Form Y calibration):
Item    a      b      c        a      b      c
7       0.71   0.66   0.15     0.63   0.48   0.19
8       1.31  -0.89   0.22     1.08  -0.99   0.15
9       0.92  -0.05   0.16     0.78  -0.36   0.26
10      1.25   0.41   0.09     1.18   0.24   0.13

Form Y unique items:
Item    a      b      c
11      1.34  -0.29   0.11
12      0.48  -1.04   0.23
13      0.76   0.45   0.19
14      0.69   1.21   0.13
15      1.09  -0.57   0.08
16      0.49  -0.23   0.17

Common-item difficulties: Mean b = 0.0325, SD b = 0.682 (Form X); Mean b = -0.1575, SD b = 0.658 (Form Y)

111 MEAN AND SIGMA METHOD EXAMPLE • Calculate equating constants: M = s_bY/s_bX = 0.658/0.682 = 0.965 and K = b̄_Y - M·b̄_X = -0.1575 - 0.965(0.0325) = -0.189. • Transform Form Y parameters to the Form X scale with the inverse transformation: b* = (b_Y + 0.189)/0.965 = 1.036·b_Y + 0.196, with a rescaled correspondingly (a* = a_Y/1.036).
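
The constants can be reproduced from the anchor difficulties in the table above (a sketch; the SDs use n - 1, which matches the reported values):

```python
import numpy as np

def mean_sigma(b_anchor_x, b_anchor_y):
    """Slope and intercept for rescaling Form Y estimates to the Form X
    scale, from the anchor-item difficulty estimates in each group."""
    m = np.std(b_anchor_x, ddof=1) / np.std(b_anchor_y, ddof=1)
    k = np.mean(b_anchor_x) - m * np.mean(b_anchor_y)
    return m, k

b_x = [0.66, -0.89, -0.05, 0.41]   # anchor difficulties, Form X calibration
b_y = [0.48, -0.99, -0.36, 0.24]   # the same anchors, Form Y calibration
m, k = mean_sigma(b_x, b_y)
print(round(m, 3), round(k, 3))    # 1.036 0.196

# Apply to any Form Y estimate: b* = m*b + k, a* = a/m, theta* = m*theta + k
```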

112 • After transformation of Form Y parameters:

Form X unique items (unchanged):
Item    a      b      c
1       1.08   0.59   0.21
2       0.63   1.22   0.12
3       0.92  -0.48   0.18
4       1.37   0.69   0.22
5       0.57  -1.11   0.09
6       0.94  -0.53   0.06

Common items (left: Form X values; right: transformed Form Y values):
Item    a      b      c        a*     b*     c*
7       0.71   0.66   0.15     0.74   0.65   0.19
8       1.31  -0.89   0.22     1.36  -0.77   0.15
9       0.92  -0.05   0.16     0.95  -0.16   0.26
10      1.25   0.41   0.09     1.30   0.42   0.13

Form Y unique items (transformed):
Item    a*     b*     c*
11      1.39  -0.09   0.11
12      0.50  -0.82   0.23
13      0.79   0.62   0.19
14      0.72   1.35   0.13
15      1.13  -0.37   0.08
16      0.51  -0.04   0.17

113 • This method uses the information in both the difficulty and discrimination parameter estimates to determine the equating relationship. • The linear transformation is found that would minimize the sum of squared differences between expected scores on the common items based on the item parameter estimates for each group. • This method is considerably more complex than the mean and sigma method. CHARACTERISTIC CURVE METHOD
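
A compact sketch of the idea under the 3P model, on a fixed trait grid with unweighted least squares (operational implementations typically weight by a trait distribution and consider both directions). scipy is assumed available, and the anchor-item parameters are numpy arrays:

```python
import numpy as np
from scipy.optimize import minimize

def tcc(theta, a, b, c):
    """Test characteristic curve: expected score on the anchor items."""
    p = c[:, None] + (1 - c[:, None]) / (
        1 + np.exp(-a[:, None] * (theta[None, :] - b[:, None])))
    return p.sum(axis=0)

def stocking_lord(ax, bx, cx, ay, by, cy):
    """Slope M and intercept K minimizing the squared distance between
    anchor TCCs after rescaling the Form Y estimates to the Form X scale."""
    grid = np.linspace(-4.0, 4.0, 41)
    target = tcc(grid, ax, bx, cx)

    def loss(mk):
        m, k = mk
        return np.sum((target - tcc(grid, ay / m, m * by + k, cy)) ** 2)

    return minimize(loss, x0=[1.0, 0.0], method="Nelder-Mead").x
```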

