EPSY 8223: Test Score Equating


1 EPSY 8223: Test Score Equating
Linking Holland, P.W., & Dorans, N.J. (2006). Linking and equating. In R.L. Brennan (Ed.), Educational measurement (4th ed.). Westport, CT: American Council on Education and Praeger Publishers.

2 Linking A link is a connection made between two tests by transforming a score on one test to a score on the other test. Linking, or transforming scores, can be done in one of three ways: Predicting Scale aligning (scaling) Equating

3 Linking X to Y
Predicting Y from X → Best Prediction
Scaling X and Y → Comparable Scales
Equating X to Y → Interchangeable Scores
Source: Holland & Dorans

4 Linking Method 1: Prediction
To predict a score on one test with other information, typically including a score on another test. The prediction is E(Y | X = x, P). The error of prediction is y - E(Y | X = x, P). However, the prediction of Y from X is not the same as the prediction of X from Y.

5 Predicting 2005 Math scores from 2006 Math Scores

6 Predicting 2006 Math scores from 2005 Math Scores

7 Appropriate use of Prediction
One may use PSAT scores to predict (forecast) how students will perform on the SAT in the near future. Because of the sample-specific nature of this prediction, it may not hold in another sample from the population if that sample distribution differs from the original sample distribution.

8 Linking Method 2: Scaling
To place the scores from two different tests on a common scale through some transformation. Scaling can be done for two cases: Linking measures of different constructs Linking measures with similar constructs, but different test specifications

9 Scaling: Different Constructs/ Common Population
A common example of measures of different constructs for a common population includes battery scaling, as in the case of a large test composed of a battery of measures (reading, mathematics, language use). When the SAT-I was recentered, both the mathematics and verbal scores were given the same mean and standard deviation as a reference population from 1990.

10 Scaling: Different Constructs & Populations with Anchor Measures
Anchor scaling can be done with measures of different constructs taken by different populations (combined into a hypothetical population), provided there exists a common measure among all examinees. SAT subject-test scores are scaled using the SAT-V and SAT-M as anchor measures, enabling us to treat the subject-test scores as having comparable scales. With an anchor, other measures can be placed on the same scale using a linking function (linear or equipercentile).

11 Scaling: Similar Constructs & Reliability; Different Difficulty & Population
This case typically involves tests of the same subject that are administered to different grades or ages of individuals – often referred to as vertical scaling. The tests change in difficulty (relatively) across the different populations (grades). Typically, this scaling is done through anchor scaling, where adjacent grades share common items.

12 Same Construct, Different Reliability; Same Population
In this case, different reliabilities typically result from tests of different lengths. The classic example is referred to as calibration, where the scores of a short form of a test are put on the scale of the full form.

13 Similar Constructs, Difficulty, & Reliability; Same Population
Tests are measuring similar constructs, but each one is built to different specifications. Concordance represents an attempt to place scores from similar tests on the same scale. Many colleges accept both the ACT and SAT. A concordance table links the scores on one test to the other. The revised GRE may result in a concordance table to associate new scores with old scores.

14 Scaling on Hypothetical Population
Dissimilar Constructs + Common Population → Battery Scaling
Dissimilar Constructs + Anchor Measure → Scaling to Anchor (on a hypothetical population)
Similar Constructs → (next slide)
Source: Holland & Dorans

15 Calibration Vertical Scaling Concordance
Similar Constructs + Dissimilar Reliability + Common Population → Calibration
Similar Constructs + Similar Reliability + Dissimilar Difficulty + Population (Anchor Measure) → Vertical Scaling
Similar Constructs + Similar Reliability + Similar Difficulty + Population → Concordance
Source: Holland & Dorans

16 Linking Method 3: Equating
A direct link is created between a score on one test and a score on a different test, creating scores that are interchangeable. Tests must measure the same construct with the same difficulty and the same accuracy. Equating is the strongest form of linking. Errors in equating have caused more problems for testing companies than flawed items.

17 What makes a Linking an Equating?
Two or more tests and scoring rules, Scores on each test from one or more samples of examinees, Implicit or explicit population to which the linking will be applied, One or more methods of estimating the linking function; And the goal: create interchangeable scores.

18 Equating Requirements
Tests measure the same construct Tests have equal reliability Equating function is symmetric Equity: it is a matter of indifference to the examinee which test is to be taken Equating function has the property of population invariance

19 Equating
Common Population: Observed Score; True Score (CTT/IRT)
Anchor Test: Observed Score (Post-stratification, Chain, Levine, IRT); True Score (CTT/IRT)
Source: Holland & Dorans

20 Standards for Educational & Psychological Testing (1999)
4.10 A clear rationale and supporting evidence should be provided for any claim that scores earned on different forms of a test may be used interchangeably. In some cases, direct evidence of score equivalence may be provided. In other cases, evidence may come from a demonstration that the theoretical assumptions underlying procedures for establishing score comparability have been sufficiently satisfied.

21 Standards 4.11 When claims of form-to-form score equivalence are based on equating procedures, detailed technical information should be provided on the method by which equating functions or other linkages were established and on the accuracy of equating functions. In equating studies that rely on the statistical equivalence of examinee groups receiving different forms, methods of assuring such equivalence should be described in detail.

22 Standards 4.13 In equating studies that employ an anchor test design, the characteristics of the anchor test and its similarity to the forms being equated should be presented, including both content specifications and empirically determined relationships among test scores. If anchor items are used… the representativeness and psychometric characteristics of anchor items should be presented.

23 Standards 4.14 When score conversion or comparison procedures are used to relate scores on tests or test forms that are not closely parallel, the construction, intended interpretation, and limitations of those conversions or comparisons should be clearly described.

24 Standards 4.15 When additional test forms are created by taking a subset of the items in an existing test form or by rearranging its items and there is sound reason to believe that scores on these forms may be influenced by item context effects, evidence should be provided that there is no undue distortion of norms for the different versions or of score linkage between them.

25 Test Equating, Scaling, and Linking: Methods and Practices
M.J. Kolen & R.L. Brennan (2004)

26 General Issues Equating is a statistical process that is used to adjust scores on test forms so that scores on the forms can be used interchangeably. Equating adjusts for differences in difficulty, not in content, so that scores can be used interchangeably.

27 General Issues Raw scores are typically converted to a scale score; raw scores on a subsequent administration are equated to raw scores on an old form, then converted to scale scores using the raw-to-scale score transformation. Equating can potentially improve score reporting and interpretation for different forms (examinees) that are evaluated at the same time or over time.

28 Equating Implementation
Decide on purpose for equating Construct alternate forms (same specifications) Choose a data collection design Implement data collection Choose an operational definition of equating (the type of relation between forms) Choose statistical estimation method(s) Evaluate the equating results

29 Equating Properties: Symmetry
The equating transformation must be symmetric: the X to Y transformation function is the inverse of the Y to X transformation function. This relation can be assessed by conducting the equating in both directions, plotting the equating relations, and verifying that the plots coincide.
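The check above can be sketched in a few lines of Python. This is a minimal illustration, assuming the linear equating form introduced on a later slide; the moments (72, 10, 77, 9) are illustrative values, not data from the deck's tables.

```python
# Sketch of the symmetry check: equate X to Y, then Y back to X.
def lin(x, mu_x, s_x, mu_y, s_y):
    """Linear equating: match standardized deviation scores."""
    return (s_y / s_x) * (x - mu_x) + mu_y

x = 65.0
y = lin(x, 72, 10, 77, 9)        # X -> Y
x_back = lin(y, 77, 9, 72, 10)   # Y -> X (parameters swapped)
print(round(x_back, 6))          # recovers 65.0, so the two functions are inverses
```

A regression of Y on X would fail this round-trip check unless r = 1, which is why regression is not an equating function.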

30 Symmetry Same test specifications Content Statistical properties
This is necessary if the scores on different forms are to be considered interchangeable.

31 Equating Properties: Equity
A matter of examinee indifference Examinees with a given true score have the same distribution of converted scores

32 Equity
G*[eqY(x) | τ] = G(y | τ), for all τ, where τ is the true score
X is a score on the new form X; Y is a score on the old form Y
G is the cumulative distribution function of scores on Form Y
eqY is the equating function to convert X to the Y scale
G* is the cumulative distribution function of eqY(X) for the same population

33 Equity Examinees with a given true score (τ) will have the same observed score means, standard deviations, and distribution shape; the SEM will be the same for a given true score. Technically, Lord’s equity property holds only if the two forms are identical – eliminating the need for equating.

34 Equity Morris (1982) introduced a less restrictive equity property – First Order Equity: examinees with a given true score have the same mean converted score on the two forms. E[eqY(X) | τ] = E(Y | τ) for all τ

35 Observed Score Properties
Equipercentile Equating The cumulative distribution of equated scores on Form X is equal to the cumulative distribution of scores on Form Y G*[eqY(x)] = G(y) Mean Equating Converted scores have the same mean Linear Equating Converted scores have the same mean and SD

36 Population Invariance
The equating relation is the same regardless of the sample used in the equating. Methods based on observed score properties are not strictly group invariant; the degree of invariance depends on how carefully the alternate forms are constructed.

37 Equating Designs: Random Groups
Spiraling randomly assigns forms to examinees; large sample sizes are needed Leads to randomly equivalent groups Random Subgroup 1 Question A1 Question A2 Question A3 Question A4 Question A5 Question A6 Question A7 Question A8 Random Subgroup 2 Question B1 Question B2 Question B3 Question B4 Question B5 Question B6 Question B7 Question B8

38 Single Group Counterbalancing
Each examinee takes both forms, half in the opposite order from the others. A smaller sample size is required, but twice the administration time is needed. Differential order effects can still be problematic, which may force discarding the forms taken second and can destabilize the equating.

39 Common Item Nonequivalent Group
Different groups are administered different forms with a common item set Only one form can be administered at a time Common items may (internal) or may not (external) contribute to the total score Common item set should represent the total test Common items should be located in the same position on each form

40 Common Item Nonequivalent Group
Subgroup 1 Form A Question A1 Question A2 Question A3 Question A4 Question A5 Question A6 Question A7 Question A8 Subgroup 2 Form B Question B5 Question B6 Question B7 Question B8

41 Random Groups Test Administration Complications
Moderate; more than one form needs to be spiraled Test Development Complications None out of the ordinary Statistical Assumptions Minimal; random assignment to forms is effective

42 Single Group – Counterbalancing
Test Administration Complications Major; each examinee must take two forms and order must be counterbalanced Test Development Complications None out of the ordinary Statistical Assumptions Moderate; order effects cancel out and random assignment is effective

43 Common-Item Nonequivalent Groups
Test Administration Complications None; forms can be administered in typical manner Test Development Complications Representative common-item sets need to be designed Statistical Assumptions Stringent; common items measure same construct in both groups, groups are similar, other required statistical assumptions hold NEAT: Non-Equivalent groups with Anchor Test

44 Common-Item to an IRT Calibrated Pool
Test Administration Complications None; forms can be administered in typical manner Test Development Complications Representative common-item sets need to be designed Statistical Assumptions Stringent; same as the common-item nonequivalent group design AND the IRT model assumptions must hold

45 Observed Score Equating Method 1: Mean Equating
Form X differs in difficulty from Form Y by a constant along the score scale. Deviation scores are set equal: x − μ(X) = y − μ(Y), or mY(x) = y = x − μ(X) + μ(Y). mY(x) indicates a score x on Form X transformed to the scale of Form Y using mean equating.

46 Mean Equating Properties
mY(x) = y = x - (x) + (y) E[mY(X)] = (Y) [mY(X)] = (X)

47 Mean Equating Example Mean on Form X = 72 Mean on Form Y = 77
mY(x) = y = x - (x) + (y) mY(x) = x = x + 5

48 Observed Score Equating Method 2: Linear Equating
Allows the differences in difficulty between the two forms to vary along the score scale. Standardized deviation scores are set equal: zX = zY, that is, [x − μ(X)]/σ(X) = [y − μ(Y)]/σ(Y)

Linear Equating A linear equation of the form slope · x + intercept:
Slope = σ(Y)/σ(X) and Intercept = μ(Y) − [σ(Y)/σ(X)]μ(X)
With terms rearranged: lY(x) = y = [σ(Y)/σ(X)][x − μ(X)] + μ(Y)

50 Linear Equating Example
Mean(X) = 72 and Mean(Y) = 77 (from the mean equating example)
SD(X) = 10 and SD(Y) = 9
Slope = 9/10 = .9
Intercept = 77 − .9(72) = 12.2
lY(x) = .9x + 12.2
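A minimal Python sketch of this computation, using the slide's SDs and the means from the earlier mean equating example:

```python
def linear_equate(x, mu_x, sigma_x, mu_y, sigma_y):
    """Linear equating: (sigma_Y/sigma_X) * x + [mu_Y - (sigma_Y/sigma_X) * mu_X]."""
    slope = sigma_y / sigma_x
    intercept = mu_y - slope * mu_x
    return slope * x + intercept

# mean(X) = 72, SD(X) = 10, mean(Y) = 77, SD(Y) = 9
print(round(linear_equate(72, 72, 10, 77, 9), 4))  # 77.0 -- the mean maps to the mean
```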

51 Linear Equating Properties
E[lY(X)] = (Y) [lY(X)] = (Y)

52 Special Note: Linear Equating
The linear equating function is similar to a regression, estimating slope and intercept In regression, the ratio of SDs is multiplied by r, the correlation between X and Y. This yields a non-symmetric result, unless r = 1.

53 Observed Score Equating Method 3: Equipercentile Equating
Identify scores on Form X that have the same percentile ranks as scores on Form Y. When X and Y are continuous random variables, the equipercentile equating function is defined: eY(x) = G-1[F(x)], where G* = G. F is the cumulative distribution function of X G is the cumulative distribution function of Y G* is the cumulative distribution function of eY eY is the symmetric equating function to convert scores on Form X to scores on Form Y

54 Percentiles The percentile rank of a trait value is defined as the percentage of people in a norm group who have values less than or equal to that particular trait value. In practice we do not obtain trait values, but observed test scores. For example, a test score of 28 represents a range of trait values from 27.5 to 28.5.

55 Percentiles Because most test scores are discrete rather than continuous, the percentile rank of an integer score is the percentile rank at the midpoint of the interval that contains that score. Trait levels within the interval around the midpoint are assumed to be uniformly distributed.

56 Percentiles The percentile rank of an integer score is the percentile rank at the midpoint of the interval that contains the score. Example: Percentile rank of a score of 28 % of scores at 27 and below + ½ % of scores at 28 18% + ½(5%) = 20.5%

57 Equipercentile Equating
Equipercentile equating ensures that the transformed score distributions are identical, which is not true for raw scores. A nonlinear transformation is used to equalize all moments of the two distributions – implying a nonlinear relation between true scores and, therefore, unequal reliabilities.

58 Equipercentile Equating
When the results of equating are applied to discrete scores, the equated Form X score distribution will differ from the Form Y distribution. Smoothing methods have been developed to produce estimates of the empirical distributions and equipercentile relations that have the smoothness characteristics of the population distributions.

59 Percentile Framework
f(x) is the discrete density function for X = x
f(x) ≥ 0 for integer scores x = 0, 1, …, KX; f(x) = 0 otherwise; and Σx f(x) = 1.

F(x) is the discrete cumulative distribution function, the proportion of examinees in the population earning a score at or below x. 0 ≤ F(x) ≤ 1 for x = 0, 1, …, KX; F(x) = 0 for x < 0; and F(x) = 1 for x > KX.

61 Percentile Rank Function
x* is the integer closest to x, a noninteger value, where x* − .5 ≤ x < x* + .5
The percentile rank function for Form X is
P(x) = 100{F(x* − 1) + [x − (x* − .5)][F(x*) − F(x* − 1)]}, for −.5 ≤ x ≤ KX + .5
P(x) = 0, for x < −.5
P(x) = 100, for x ≥ KX + .5

62 Percentile Computation Example
P(x) = 100{F(x* − 1) + [x − (x* − .5)][F(x*) − F(x* − 1)]}
P(1.3) = 100{F(0) + [1.3 − (1 − .5)][F(1) − F(0)]}
Based on data from Table 2.1: F(0) = .2, F(1) = .5
P(1.3) = 100{.2 + [.8][.5 − .2]} = 100{.2 + .24} = 44
The percentile rank of a score of 1.3 is 44.
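A sketch of the percentile rank function for noninteger scores; the cumulative values F(0) = .2 and F(1) = .5 are the ones appearing in the worked example (standing in for Table 2.1).

```python
def percentile_rank(x, F):
    """P(x) = 100{F(x*-1) + [x - (x*-.5)][F(x*) - F(x*-1)]},
    where F[i] holds the cumulative proportion F(i) for integer scores 0..K."""
    K = len(F) - 1
    if x < -0.5:
        return 0.0
    if x >= K + 0.5:
        return 100.0
    xs = int(round(x))                    # x*: integer closest to x
    lo = F[xs - 1] if xs >= 1 else 0.0
    return 100 * (lo + (x - (xs - 0.5)) * (F[xs] - lo))

F = [0.2, 0.5, 1.0]                       # F(0), F(1), F(2)
print(round(percentile_rank(1.3, F), 4))  # 44.0
```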

63 Percentile Function The inverse of the percentile rank function is referred to as the percentile function, P-1. Given a percentile rank P*, this inverse function produces the corresponding score. xU is the smallest integer score with a cumulative percent [100F(x)] that is greater than P*.

64 Percentile Function
xU is the smallest integer score with a cumulative percent [100F(x)] that is greater than P*.
P-1(P*) = [P*/100 − F(xU − 1)]/[F(xU) − F(xU − 1)] + (xU − .5), for 0 ≤ P* < 100
P-1(P*) = KX + .5, for P* = 100

65 Percentile Function An alternate percentile function uses xL, the largest integer score with a cumulative percent [100F(x)] that is less than P*. If some of the f(x) are zero, then xU ≠ xL. It is then typical to use x = (xU + xL)/2
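A sketch of the percentile (inverse) function in the upper-integer form described above; inverting the earlier worked example, a percentile rank of 44 should map back to a score of 1.3.

```python
def percentile_point(p_star, F):
    """P^-1: the score whose percentile rank is p_star, using x_U, the smallest
    integer with cumulative percent 100*F(x_U) greater than p_star."""
    K = len(F) - 1
    if p_star >= 100:
        return K + 0.5
    x_u = next(i for i in range(K + 1) if 100 * F[i] > p_star)
    lo = F[x_u - 1] if x_u >= 1 else 0.0
    # linear interpolation within the score interval [x_u - .5, x_u + .5]
    return (p_star / 100 - lo) / (F[x_u] - lo) + (x_u - 0.5)

print(round(percentile_point(44, [0.2, 0.5, 1.0]), 4))  # 1.3
```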

66 Equipercentile Equating 1
y is a score on Form Y
KY is the number of items on Form Y
g(y) is the discrete density of y
G(y) is the discrete cumulative distribution of y
Q(y) is the percentile rank of y
Q-1 is the percentile function for Form Y
eY(x) = y = Q-1[P(x)], for −.5 ≤ x ≤ KX + .5

67 Equipercentile Equating 2
eY(x) = y = Q-1[P(x)] Find the percentile rank of x in the Form X distribution [P(x)]. Find the Form Y score that has the same percentile rank in the Form Y distribution Q-1.

68 Equipercentile Equating 3
eY(x) = y = Q-1[P(x)], for 0 ≤ P(x) < 100
eY(x) = KY + .5, for P(x) = 100
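The two-step recipe above can be assembled into a small self-contained sketch. The percentile rank and percentile functions are restated so the block runs on its own; the Form X and Form Y cumulative distributions are illustrative assumptions, not the book's table data.

```python
def p_rank(x, F):
    """Percentile rank P(x); F[i] is the cumulative proportion at integer score i."""
    K = len(F) - 1
    if x < -0.5:
        return 0.0
    if x >= K + 0.5:
        return 100.0
    xs = int(round(x))
    lo = F[xs - 1] if xs >= 1 else 0.0
    return 100 * (lo + (x - (xs - 0.5)) * (F[xs] - lo))

def p_inv(p, F):
    """Percentile function: score with percentile rank p (upper-integer form)."""
    K = len(F) - 1
    if p >= 100:
        return K + 0.5
    xu = next(i for i in range(K + 1) if 100 * F[i] > p)
    lo = F[xu - 1] if xu >= 1 else 0.0
    return (p / 100 - lo) / (F[xu] - lo) + (xu - 0.5)

def equipercentile(x, F_x, F_y):
    """e_Y(x) = Q^-1[P(x)]: step 1 finds P(x) on Form X, step 2 inverts on Form Y."""
    return p_inv(p_rank(x, F_x), F_y)

F_x = [0.2, 0.5, 0.8, 1.0]   # illustrative Form X cumulative distribution
F_y = [0.1, 0.4, 0.7, 1.0]   # illustrative Form Y cumulative distribution
print(round(equipercentile(2, F_x, F_y), 4))  # 2.3333
```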

69 Equipercentile Equating Example
Find the Form Y equipercentile equivalent of a Form X score of 2. The percentile rank of a score of 2 on Form X is P(2) = 60, from the table. The equated score is the Form Y score with percentile rank 60: yU is the score with the smallest G(y) that is greater than .60.

70 Equipercentile Equating Properties
Equipercentile equated scores always fall within the range of possible scores, since the function constrains results: −.5 ≤ eY(x) ≤ KY + .5. It is possible with mean and linear equating to obtain scores outside the range of possible scores. Ideally, the moments from Form X would equal those from Form Y; but due to the discrete nature of the scores, this does not typically result.

