Presented at Measured Progress

IRT Fixed Parameter Calibration and Other Approaches to Maintaining Item Parameters on a Common Ability Scale
Seonghoon Kim, PhD
Keimyung University
Presented at Measured Progress on July 10, 2008

Overview
I. Nature of IRT Ability Scale
II. Three Approaches to Maintaining Item Parameters on a Common Scale
III. Principle of Fixed Parameter Calibration (FPC)
IV. Use of Computer Programs for FPC
V. Applications of FPC for Scaling and Equating

Reference Guide
This presentation was prepared based on the articles listed below and on my recent thoughts and work on FPC.
- Kim, S. (2006a). A comparative study of IRT fixed parameter calibration methods. Journal of Educational Measurement, 43(4).
- Kim, S. (2006b). A study on IRT fixed parameter calibration methods using BILOG-MG. Journal of Educational Evaluation, 19(1).
- Kim, S., & Kolen, M. J. (2006). Robustness to format effects of IRT linking methods for mixed-format tests. Applied Measurement in Education, 19(4).
- Kim, S., & Lee, W. (2006). An extension of four IRT linking methods for mixed-format tests. Journal of Educational Measurement, 43(1).
- Kim, S., & Kolen, M. J. (2007). Effects on scale linking of different definitions of criterion functions for the IRT characteristic curve methods. Journal of Educational and Behavioral Statistics, 32(4).

I. Nature of IRT Ability Scale: Indeterminacy in IRT Modeling
Item response function (IRF) and metrics
- Two-parameter logistic (2PL) model: IRF = P(θ | a, b) = 1/[1 + exp(-Da(θ - b))]
- Suppose that θO = A θN + B.
- If aO = aN / A and bO = A bN + B, then P(θO | aO, bO) = P(θN | aN, bN).
- Therefore, the IRF is invariant under a linear transformation of the ability scale, provided that the item parameters are transformed accordingly.
- Thus, in practice, either θO or θN can be used. This is what scale indeterminacy means.
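The invariance stated above can be checked numerically. The following is a minimal sketch, assuming D = 1.7 and purely hypothetical values for A, B, θN, aN, and bN; the function names are illustrative and do not come from any IRT program.

```python
import math

D = 1.7  # logistic scaling constant (assumed here; D = 1 is also common)

def irf_2pl(theta, a, b):
    """2PL item response function P(theta | a, b)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def to_old_scale(theta_n, a_n, b_n, A, B):
    """Re-express ability and item parameters on the old scale."""
    return A * theta_n + B, a_n / A, A * b_n + B

# Hypothetical values, chosen only for illustration.
A, B = 1.2, 0.5
theta_n, a_n, b_n = 0.3, 1.1, -0.4
theta_o, a_o, b_o = to_old_scale(theta_n, a_n, b_n, A, B)

p_new = irf_2pl(theta_n, a_n, b_n)
p_old = irf_2pl(theta_o, a_o, b_o)
# True when the two probabilities agree up to floating-point error.
print(abs(p_new - p_old) < 1e-12)
```

The two probabilities agree because aO(θO - bO) = (aN/A)(A θN + B - A bN - B) = aN(θN - bN).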

I. Nature of IRT Ability Scale: “0, 1” Scaling vs. Rasch Scaling
“0, 1” scaling
- Scaling by arbitrarily assuming that the mean (M) and standard deviation (SD) of the ability distribution are equal to 0 (origin) and 1 (unit).
- Such arbitrary but “standardized” fixing is unavoidable when the M and SD are unknown.
Rasch scaling
- Setting the origin (0) of the scale at the average difficulty of all items involved, while fixing the unit at 1.
- The fixed unit is guaranteed by the Rasch model.

I. Nature of IRT Ability Scale: Need for a Fixed Common Ability Scale
A fixed common scale should be used across test administrations for several reasons:
- To check the invariance property of item parameters
- To achieve comparability between item parameters from different administrations
- To develop an item pool
- To conduct IRT equating

I. Nature of IRT Ability Scale: Need for a Fixed Common Ability Scale
Developing a common ability scale requires all new scales to be linked to the fixed old scale θO.
[Figure: the new scales θN1, θN2, and θN3 are each linked to the fixed old scale θO.]

I. Nature of IRT Ability Scale: Factors for Development of a Common Scale
Development of a fixed common scale is subject to
- Data collection design for IRT scaling and equating test forms: random groups design vs. common-item nonequivalent groups design
- Scaling convention: “0, 1” scaling vs. Rasch scaling
- Item parameter estimation method: marginal maximum likelihood (MML) estimation vs. joint maximum likelihood (JML) estimation

The Context Assumed in This Presentation
- Data collection design for IRT scaling and equating test forms
  - Common-item nonequivalent groups (CING) design
  - Anchor items (i.e., common items) link two test forms
- Scaling convention
  - “0, 1” scaling
  - Group dependent
  - In a random groups design, two “0, 1” scales from alternative forms may be considered equivalent.
- Marginal maximum likelihood (MML) estimation
  - Estimation of item parameters
  - Estimation of the underlying ability distribution: quadrature weights are estimated at quadrature points.

Data Structure Illustration for the CING Design
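[Figure: persons-by-items data layout for the CING design. Rows: Old Form (Group 1) and New Form (Group 2). Columns (items): old form unique items, taken by the old group (1) only; common (anchor) items, taken by both the old and new groups; new form unique items, taken by the new group (2) only.]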

II. Three Approaches to Maintaining a Common Scale
- Separate calibration by form and linking
  - Estimate transformation coefficients A and B using two sets of item parameter estimates for the anchor items
  - Use A and B to transform new form item parameter estimates into those on the old scale
- Fixed parameter calibration (FPC)
  - Holding the old form anchor item parameters fixed and estimating the new form non-anchor items
- Concurrent calibration (aka multiple-group estimation)
  - Combining new and old form data and estimating all item parameters together with the underlying ability distributions, with the old group designated as the reference-scale group
  - Will not be addressed in detail in this presentation

II. Maintaining the Old Scale: Separate Calibration by Form and Linking
“0, 1” scales from two test forms
- Old form scale: θO (reference)
- New form scale: θN (arbitrary)
Scheme of linking two “0, 1” scales: θO = A θN + B
[Figure: the θN axis (arbitrary origin and unit, with the points -1 and 1 marked) is mapped onto the θO axis (fixed origin and unit); A and B are marked on the θO axis.]

II. Maintaining the Old Scale: Separate Calibration by Form and Linking
Linking ability scales is completed by placing all item parameters from separate calibrations onto the fixed old scale.
- In the case of the 2PL model, given A and B, the aN and bN parameters from a new scale are transformed into a* = aN / A and b* = A bN + B.
- In practice, A and B are estimated with item parameter estimates from the old and new scales, using one of the following:
  - Mean-Sigma Method (Marco, 1977)
  - Mean-Mean Method (Loyd & Hoover, 1980)
  - Haebara Method (Haebara, 1980)
  - Stocking-Lord Method (Stocking & Lord, 1983)
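The two moment methods listed above have simple closed forms, sketched here under their usual textbook definitions: mean-sigma takes A as the ratio of the SDs of the anchor b estimates, mean-mean takes A as the ratio of the mean a estimates, and both set B = mean(bO) - A mean(bN). The anchor values below are hypothetical and constructed so that the true coefficients are A = 1.25 and B = 0.5; the function names are illustrative only.

```python
from statistics import mean, pstdev

def mean_sigma(b_old, b_new):
    """Mean-sigma coefficients from anchor difficulty estimates."""
    A = pstdev(b_old) / pstdev(b_new)
    B = mean(b_old) - A * mean(b_new)
    return A, B

def mean_mean(a_old, a_new, b_old, b_new):
    """Mean-mean coefficients from anchor discrimination and difficulty estimates."""
    A = mean(a_new) / mean(a_old)
    B = mean(b_old) - A * mean(b_new)
    return A, B

def to_old_scale(a_new, b_new, A, B):
    """Apply a* = aN / A and b* = A bN + B to every item."""
    return [a / A for a in a_new], [A * b + B for b in b_new]

# Hypothetical anchor estimates on the new scale.
a_new = [0.8, 1.0, 1.2, 1.5]
b_new = [-1.0, -0.2, 0.4, 1.1]
# Error-free old-scale values built with A = 1.25, B = 0.5 (illustration only).
a_old, b_old = to_old_scale(a_new, b_new, 1.25, 0.5)

A_ms, B_ms = mean_sigma(b_old, b_new)
A_mm, B_mm = mean_mean(a_old, a_new, b_old, b_new)
print(round(A_ms, 6), round(B_ms, 6), round(A_mm, 6), round(B_mm, 6))
```

With real estimates the two methods generally give different coefficients, because estimation error affects the a and b moments differently.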

II. Maintaining the Old Scale: Comparative Performance
Suppose that a characteristic curve (Haebara or Stocking-Lord) method is employed as the linking method for the “separate calibration and linking” approach.
- The performance of the three alternative approaches to maintaining the old scale differs depending on whether the new form items are common or not (Hanson & Béguin, 2002; Kim, 2006b; Kim & Kolen, in process).
- For the common items, concurrent calibration would perform best, due mainly to the larger sample size (new group + old group) available for them than for the non-common items.
- For the non-common items, the three approaches would perform almost equally.
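For readers unfamiliar with the characteristic curve methods, the sketch below shows one simple form of the two criterion functions for the 2PL model: Haebara sums squared differences between item characteristic curves, and Stocking-Lord sums squared differences between test characteristic curves. It makes several assumptions that are not from the slides: the criterion is evaluated on the old scale only, over equally weighted θ points, and it is minimized by a coarse grid search where real programs use a numerical optimizer. All item values are hypothetical.

```python
import math

D = 1.7

def p2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def haebara(A, B, old, new, thetas):
    """Sum over theta points and anchor items of squared IRF differences."""
    total = 0.0
    for t in thetas:
        for (ao, bo), (an, bn) in zip(old, new):
            total += (p2pl(t, ao, bo) - p2pl(t, an / A, A * bn + B)) ** 2
    return total

def stocking_lord(A, B, old, new, thetas):
    """Sum over theta points of squared test characteristic curve differences."""
    total = 0.0
    for t in thetas:
        tcc_old = sum(p2pl(t, ao, bo) for ao, bo in old)
        tcc_new = sum(p2pl(t, an / A, A * bn + B) for an, bn in new)
        total += (tcc_old - tcc_new) ** 2
    return total

# Hypothetical anchor parameters; old-scale values built with A = 1.25, B = 0.5.
new = [(0.8, -1.0), (1.0, -0.2), (1.2, 0.4), (1.5, 1.1)]
old = [(an / 1.25, 1.25 * bn + 0.5) for an, bn in new]
thetas = [-4.0 + 0.5 * k for k in range(17)]

# Coarse grid search, used here only to keep the sketch dependency-free.
grid = [(i / 100, j / 100) for i in range(50, 201, 5) for j in range(-100, 101, 5)]
best_hb = min(grid, key=lambda ab: haebara(ab[0], ab[1], old, new, thetas))
best_sl = min(grid, key=lambda ab: stocking_lord(ab[0], ab[1], old, new, thetas))
print(best_hb, best_sl)
```

Kim and Kolen (2007) examine how alternative definitions of these criterion functions affect linking, so this form should be read as one option among several.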

II. Maintaining the Old Scale: When Is FPC Most Appropriate?
When using the “stable” old form anchor item parameters to obtain or diagnose the parameters of new form non-anchor items on the fixed old scale.
Note
- Placing the parameters of new form non-anchor items on the old scale is the focus.
- Updating the old form item parameters is not a concern.
- The old form anchor items are assumed to have stable parameter estimates because a large sample was used to obtain them.

III. Principle of FPC: Basics
- Why: to place the parameters of new form non-anchor items onto the fixed old scale
- How: holding the old form anchor item parameters fixed and estimating the new form non-anchor items
- Critical process
  - Estimating the underlying distribution of ability for the new form on the fixed old scale, so that the new item parameters are properly expressed on the old scale.
  - Under IRT modeling, the underlying distribution can be estimated using both the new form data and the fixed anchor item parameters.

III. Principle of FPC: Schematic Illustration of Updating Priors and Underlying Distributions of Ability
[Figure: with the anchor item parameters a1O, b1O, a2O, b2O, … held fixed, EM iterations start from the 1st initial prior. The 1st estimated ability distribution becomes the 2nd initial prior, the 2nd estimated ability distribution becomes the 3rd initial prior, and so on until the final estimated ability distribution is reached. The outcome is the estimated new item parameters (a1N, b1N, …, bJN) on the θO scale.]

III. Principle of FPC: Numerical Expression
Multiple Prior Weights Updating and Multiple EM Cycles (MWU-MEM)
- Likelihood function for estimating new form non-anchor item parameters (iteration s, quadrature point k, person i, data y, parameters Δ) [formula shown on the original slide]
- Closed-form formula for estimating the quadrature weights of the underlying ability distribution from the new form data [formula shown on the original slide]
Refer to Kim (2006a) for numerical details.
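As a rough guide to what these two expressions look like, the block below writes out the standard MML-EM quantities in the slide's notation (iteration s, quadrature point k, person i, data y, parameters Δ). This is a sketch of the generic Bock-Aitkin-type formulation, not a transcription of Kim (2006a): the symbols π_k (quadrature weight), K (number of quadrature points), N (number of new-group examinees), and L (likelihood of a response pattern) are assumptions introduced here, and the published notation may differ.

```latex
% E-step: posterior probability of quadrature point k for person i,
% computed with the current parameters and the current prior weights
P\!\left(\theta_k \mid \mathbf{y}_i, \Delta^{(s)}, \boldsymbol{\pi}^{(s)}\right)
  = \frac{L\!\left(\mathbf{y}_i \mid \theta_k, \Delta^{(s)}\right)\pi_k^{(s)}}
         {\sum_{k'=1}^{K} L\!\left(\mathbf{y}_i \mid \theta_{k'}, \Delta^{(s)}\right)\pi_{k'}^{(s)}}

% M-step for the non-anchor items: maximize the expected log-likelihood,
% with the anchor item parameters held at their fixed old-scale values
Q\!\left(\Delta \mid \Delta^{(s)}\right)
  = \sum_{i=1}^{N}\sum_{k=1}^{K}
    P\!\left(\theta_k \mid \mathbf{y}_i, \Delta^{(s)}, \boldsymbol{\pi}^{(s)}\right)
    \log L\!\left(\mathbf{y}_i \mid \theta_k, \Delta\right)

% Closed-form update of the quadrature weights (prior for the next EM cycle)
\pi_k^{(s+1)}
  = \frac{1}{N}\sum_{i=1}^{N}
    P\!\left(\theta_k \mid \mathbf{y}_i, \Delta^{(s)}, \boldsymbol{\pi}^{(s)}\right)
```

The third expression is the step that gives the method its name: the weights are re-estimated after every EM cycle rather than held at their starting values.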

III. Principle of FPC: Summary of Key Points
- The values of the fixed anchor item parameters are expressed on the fixed old scale, so the origin and unit of the ability scale for the new form data have already been set. That is, we do not need to use “0, 1” scaling for the new form data.
- New form non-anchor item parameters should be estimated using the new form underlying distribution that is properly recovered on the fixed old scale.
- As with ability estimates, the underlying distribution can be estimated using the new form data and the fixed anchor item parameters.
- Fixing the anchor item parameters gradually pulls the underlying distribution onto the old scale. Accordingly, the new form item parameters are also pulled onto the old scale.
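The "pulling" described above can be illustrated with a small simulation. The sketch below covers only the weight-updating half of the procedure: anchor item parameters are held fixed, and the quadrature weights are re-estimated over repeated EM cycles from an N(0, 1)-shaped starting prior. It omits the M-step for the non-anchor items, uses the 2PL model for brevity, and relies on hypothetical anchor parameters and simulated responses; it is not the algorithm of any particular program.

```python
import math
import random

D = 1.7  # logistic scaling constant

def p2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def update_weights(responses, anchors, points, n_cycles=50):
    """Re-estimate quadrature weights with anchor item parameters held fixed."""
    # Start from an N(0, 1)-shaped prior on the quadrature points.
    w = [math.exp(-0.5 * t * t) for t in points]
    total = sum(w)
    w = [x / total for x in w]
    # Anchor parameters never change, so each pattern likelihood is computed once.
    probs = [[p2pl(t, a, b) for a, b in anchors] for t in points]
    like = [[math.prod(p if u else 1.0 - p for u, p in zip(y, row))
             for row in probs] for y in responses]
    n = len(responses)
    for _ in range(n_cycles):
        new_w = [0.0] * len(points)
        for row in like:
            joint = [l * wk for l, wk in zip(row, w)]
            denom = sum(joint)
            for k, j in enumerate(joint):
                new_w[k] += j / denom  # posterior weight of point k for this person
        w = [x / n for x in new_w]     # averaged posterior becomes the next prior
    return w

# Hypothetical setup: 20 fixed anchor items, 1,000 examinees drawn from N(1, 1).
rng = random.Random(0)
anchors = [(rng.uniform(0.8, 1.6), rng.uniform(-1.0, 2.5)) for _ in range(20)]
points = [-4.0 + 8.0 * k / 30 for k in range(31)]
responses = []
for _ in range(1000):
    theta = rng.gauss(1.0, 1.0)
    responses.append([1 if rng.random() < p2pl(theta, a, b) else 0
                      for a, b in anchors])

weights = update_weights(responses, anchors, points)
est_mean = sum(w * t for w, t in zip(weights, points))
est_sd = math.sqrt(sum(w * (t - est_mean) ** 2 for w, t in zip(weights, points)))
print(round(est_mean, 2), round(est_sd, 2))
```

Because the anchor parameters define the old scale, the estimated mean and SD should move away from the (0, 1) starting values toward those of the generating distribution as the cycles proceed.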

III. Principle of FPC: Concerns about Unstable Estimates of Anchor Item Parameters
- Unstable estimates of the fixed item parameters might adversely affect the performance of FPC.
- However, Kim (2006a) showed that, in calibrating non-anchor items, FPC is robust to sampling errors in the fixed item parameter estimates.
- This seems to be because the new form data work together with the fixed item parameters in “revealing” the old scale.
- In other words, as long as the sample size of the new group is large enough, unstable estimates of the fixed item parameters would not greatly affect the proper estimation of either the underlying distribution for the new group or the non-anchor item parameters.

III. Principle of FPC: Two Alternatives to the MWU-MEM Method
- Some computer programs, such as BILOG-MG, do not update the prior quadrature weights during EM cycles when conducting FPC.
- The resulting posterior (quadrature) weights would not properly represent the underlying ability distribution for the new form data.
- Two ad hoc methods can be used to obtain good estimates of the quadrature weights for the underlying distribution:
  - Simple Transformation Prior Update (STPU) Method
  - Iterative-Run Prior Update (IRPU) Method

Simple Transformation Prior Update (STPU) Method
- Uses A and B from a linking method to update the prior ability distribution by transforming the posterior distribution from the regular, separate calibration of the new form.
- FPC is then conducted with the updated prior ability distribution.
Iterative-Run Prior Update (IRPU) Method
- Uses iteratively updated prior ability distributions through multiple FPC runs of BILOG-MG.
- The posterior distribution estimated in one calibration run is used as the prior distribution in the next run, until the difference between the two distributions becomes negligible.
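The STPU transformation itself is a one-line rescaling. The sketch below assumes that only the quadrature points are moved by θ* = Aθ + B while the posterior weights from the “0, 1” calibration are carried over unchanged (apart from renormalization). The A and B values and the N(0, 1)-shaped weights are hypothetical stand-ins for the output of a real linking step and a real calibration run.

```python
import math

def stpu_prior(points, posterior_weights, A, B):
    """Move quadrature points to the old scale; keep the weights as they are."""
    total = sum(posterior_weights)
    new_points = [A * t + B for t in points]
    new_weights = [w / total for w in posterior_weights]  # renormalize only
    return new_points, new_weights

def moments(points, weights):
    m = sum(w * t for t, w in zip(points, weights))
    sd = math.sqrt(sum(w * (t - m) ** 2 for t, w in zip(points, weights)))
    return m, sd

# Stand-in for the posterior from a "0, 1" calibration: 31 points on [-4, 4].
points = [-4.0 + 8.0 * k / 30 for k in range(31)]
posterior = [math.exp(-0.5 * t * t) for t in points]

A, B = 1.04, 1.0  # hypothetical linking coefficients
new_points, new_weights = stpu_prior(points, posterior, A, B)

old_mean, old_sd = moments(points, [w / sum(posterior) for w in posterior])
new_mean, new_sd = moments(new_points, new_weights)
print(round(new_mean, 3), round(new_sd, 3))
```

The rescaled distribution has mean A × (old mean) + B and SD A × (old SD), which is how the prior ends up expressed on the old scale before the FPC run.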

- Kim (2006b) shows that the two ad hoc methods for updating the prior ability distribution work very well.
- In recovering the parameters of non-anchor items, the two methods perform about as well as the Stocking-Lord linking method and concurrent calibration.
- In practice, the STPU method may be preferred because of its simplicity.
- The IRPU method works like the MWU-MEM method, except that it requires multiple runs of FPC. Thus, theoretically, the IRPU method may be more acceptable than the STPU method.
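The IRPU procedure amounts to a driver loop around repeated calibration runs. The sketch below assumes a stopping rule based on the change in the mean and SD between successive runs; the tolerance, the function names, and especially `stand_in_run` are hypothetical. In practice each call would be a full BILOG-MG FPC run whose posterior weights are read back and supplied as the next prior; here a toy stand-in is used only so that the loop can be executed.

```python
import math

def moments(points, weights):
    m = sum(w * t for t, w in zip(points, weights))
    sd = math.sqrt(sum(w * (t - m) ** 2 for t, w in zip(points, weights)))
    return m, sd

def irpu(run_fpc, points, prior, tol=0.001, max_runs=50):
    """Feed each run's posterior weights back in as the next run's prior."""
    history = [moments(points, prior)]
    for _ in range(max_runs):
        posterior = run_fpc(points, prior)
        history.append(moments(points, posterior))
        (m0, s0), (m1, s1) = history[-2], history[-1]
        prior = posterior
        if abs(m1 - m0) < tol and abs(s1 - s0) < tol:
            break  # mean and SD no longer change beyond the limit
    return prior, history

def normal_weights(points, mean, sd):
    raw = [math.exp(-0.5 * ((t - mean) / sd) ** 2) for t in points]
    total = sum(raw)
    return [w / total for w in raw]

points = [-4.0 + 8.0 * k / 30 for k in range(31)]
target = normal_weights(points, 1.0, 1.0)

def stand_in_run(points, prior):
    # Hypothetical stand-in for one FPC run: moves the prior halfway
    # toward a target distribution. A real run would call the program.
    return [0.5 * p + 0.5 * t for p, t in zip(prior, target)]

final, history = irpu(stand_in_run, points, normal_weights(points, 0.0, 1.0))
for run, (m, sd) in enumerate(history):
    print(run, round(m, 3), round(sd, 3))
```

Keeping the quadrature points fixed and updating only the weights mirrors how the second IRPU command file in the BILOG-MG illustration is set up.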

III. Principle of FPC: Caveats against Using “Constrained” Estimation for FPC
- One might think that imposing strong Bayesian priors on the fixed item parameters and freeing the non-anchor item parameters would function similarly to FPC.
- A rationale for such constrained estimation can be found in, for example, the BILOG manual (Mislevy & Bock, 1990).
- In theory, it sounds reasonable.
- However, my experience suggests that using strong priors to fix the anchor item parameters tends to distort the non-fixed item parameters.

III. Principle of FPC: Caveats against Using “Constrained” Estimation for FPC
- Note that in constrained estimation the anchor item parameters are still estimated (although they are almost fixed), while in FPC they are excluded from the list of parameters to be estimated.
- Without a facility to update the prior weights of the ability distribution, both the underlying distribution and the non-anchor item parameters would be distorted.

IV. Use of Computer Programs for FPC
BILOG-MG 7.0 (Zimowski et al., 2003)
- The “FIX” option does not function properly because the prior weights are not updated during EM cycles (Kim, 2006a).
- The STPU or IRPU method can be used.
PARSCALE 4.1 (Muraki & Bock, 2003)
- For FPC to work properly, the “POSTERIOR” option should be used (Kim, 2006a).
- Without the “POSTERIOR” option, the STPU or IRPU method can be used.

IV. Use of Computer Programs for FPC: Illustration of FPC with BILOG-MG
Data
- 3,000 examinees for the new form data
- The data were obtained by simulating examinees from a N(1, 1) distribution, against the old group with a N(0, 1) distribution.
- 25-item multiple-choice (MC) test
FPC
- First 20 items fixed (item parameters are ready for use)
- Last 5 items freed
- The three-parameter logistic (3PL) model is used for item analyses.
- Comparison of the Default, STPU, and IRPU FPC methods
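A data set with this structure can be generated in a few lines. The sketch below simulates 3,000 examinees from N(1, 1) responding to 25 3PL items and writes nothing to disk. The item parameters are drawn at random purely for illustration and are not the values used in the presentation; the seed and parameter ranges are likewise arbitrary assumptions.

```python
import math
import random

D = 1.7  # logistic scaling constant

def p3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def simulate(n_examinees, items, mean, sd, rng):
    """Draw abilities from N(mean, sd^2) and generate 0/1 item responses."""
    data = []
    for _ in range(n_examinees):
        theta = rng.gauss(mean, sd)
        data.append([1 if rng.random() < p3pl(theta, a, b, c) else 0
                     for a, b, c in items])
    return data

rng = random.Random(2008)
# Hypothetical (a, b, c) values for 25 MC items on the old scale.
items = [(rng.uniform(0.7, 1.6), rng.uniform(-2.0, 2.0), 0.2) for _ in range(25)]
responses = simulate(3000, items, 1.0, 1.0, rng)

proportion_correct = sum(map(sum, responses)) / (3000 * 25)
print(len(responses), len(responses[0]), round(proportion_correct, 3))
```

In the FPC run, the first 20 columns would play the role of the anchor items whose parameters are fixed, and the last 5 columns the freed items.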

Command File (to Use the Default FPC Facility)
Default FPC with BILOG-MG
The examinee group (2) was sampled from N(1,1)
>COMMENT Fixed-parameter calibration
>GLOBAL DFNAME='New.txt', PRNAME='Sample.PRM', NPARM=3, SAVE;
>SAVE PAR='itempar';
>LENGTH NITEMS=25;
>INPUT NTOT=25, SAMPLE=3000, NALT=5, NID=4;
>ITEMS INUM=(1(1)25), INAMES=(O01(1)O20, P01, P02, P03, P04, P05);
>TEST TNAME=G2_FIX, INUM=(1(1)25), FIX=(1(0)20, 0(0)5);
(4A1, T1, 25A1)
>CALIB NQPT=31, CYCLE=100, CRIT=0.001, NEWTON=1, NOADJUST;
>SCORE NOPRINT;

Data File (New.txt)
[Figure: listing of the data file, with the item responses for the anchor items marked.]

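Fixed Parameter File (Sample.PRM)
[Figure: listing of the file. The first entry, 20, is the number of fixed items; the columns that follow are labeled a, b, and c. The parameter values are not reproduced here.]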

Command File for the STPU Method (Before Transformation)
Single Group “0, 1” Scaling,
Although the examinee group was sampled from N(1,1).
>COMMENT STPU FPC before Transformation of Ability Points
>GLOBAL DFNAME='New.txt', NPARM=3, SAVE;
>SAVE PAR='sampleSim01.PAR';
>LENGTH NITEMS=25;
>INPUT NTOT=25, SAMPLE=3000, NALT=5, NID=4;
>ITEMS INUM=(1(1)25), INAMES=(O01(1)O20, P01, P02, P03, P04, P05);
>TEST TNAME=NO_FIX;
(4A1, T1, 25A1)
>CALIB NQPT=31, CYCLE=100, CRIT=0.001, NEWTON=0, IDIST=0;
>SCORE NOPRINT;

Posterior Distribution from “0, 1” Scaling for the STPU Method
QUADRATURE POINTS, POSTERIOR WEIGHTS, MEAN AND S.D.:
[BILOG-MG output listing the 31 quadrature points and their posterior weights, followed by the mean and S.D. The numeric values are not reproduced here.]

Command File for the STPU Method (After Transformation)
STPU FPC with Transformed Prior Points
The examinee group was sampled from N(1,1).
[Omitted: the same as the commands for the before-transformation “0, 1” calibration]
>TEST TNAME=G2_FIX, INUM=(1(1)25), FIX=(1(0)20, 0(0)5);
(4A1, T1, 25A1)
>CALIB NQPT=31, CYCLE=100, CRIT=0.001, NEWTON=0, IDIST=1, NOADJUST;
>QUAD POINTS=(…), WEIGHTS=(…);
>SCORE NOPRINT;
Notes
- POINTS holds the 31 quadrature points rescaled by θ* = Aθ + B. The values of A and B and most of the points are not legible here; the last point is 5.2329E+000.
- WEIGHTS holds the posterior weights from the “0, 1” scaling, not transformed. Most of the values are not legible here; the last weight is 3.2120E-005.

2nd Command File for the IRPU Method
IRPU FPC with Updated Prior Weights
The examinee group was sampled from N(1,1).
[Omitted: the same as the commands for the default FPC run]
>TEST TNAME=G2_FIX, INUM=(1(1)25), FIX=(1(0)20, 0(0)5);
(4A1, T1, 25A1)
>CALIB NQPT=31, CYCLE=100, CRIT=0.001, NEWTON=0, IDIST=1, NOADJUST;
>QUAD POINTS=(…), WEIGHTS=(…);
>SCORE NOPRINT;
Notes
- POINTS holds the 31 fixed quadrature points, from -4.0 to 4.0.
- WEIGHTS holds the updated weights, that is, the posterior weights from the 1st run of IRPU FPC. Most of the values are not legible here; the last weight is 1.5790E-004.

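History of Updated Posterior Distributions by the IRPU Method
[Table: columns Iter#, Mean, and Std. Dev.; the first row is the posterior from the default FPC run. The numeric values are not reproduced here.]
Iterations stopped because the M and SD no longer changed beyond the limit.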

FPC Estimates of Non-Anchor Item Parameters on the Fixed Old Scale
[Table: a, b, and c estimates for each non-anchor item under four methods: Mean/Sigma, Default FPC, STPU FPC, and IRPU FPC. The numeric values are not reproduced here.]

FPC Estimates of Mean and SD of the Underlying Distribution on the Fixed Old Scale
[Table: Mean and Std. Dev. by method (Default FPC, STPU FPC, IRPU FPC), with an “Under-estimation” annotation. The numeric values are not reproduced here.]
Mean-Sigma: B = (value not legible), A = 1.041
Note. The new group examinees were from a N(1,1) distribution that was expressed on the fixed old scale.

IV. Use of Computer Programs for FPC: Illustration of FPC with PARSCALE
Data
- 3,000 examinees for the new form data
- The data were obtained by simulating examinees from a N(0.5, 1.2^2) distribution, against the old group with a N(0, 1) distribution.
- A mixed-format test of 15 MC items and 2 five-category constructed-response (CR) items
FPC
- First 10 MC items fixed (item parameters are ready for use)
- Last 5 MC and 2 CR items freed
- The 3PL model for MC items and the generalized partial credit (GPC) model for CR items
- Comparison of the STPU and MWU-MEM methods

Command File (MWU-MEM FPC)
MWU-MEM FPC with PARSCALE
The examinee group was sampled from N(0.5, 1.2^2)
>COMMENT 10 common items fixed and 2 CR items calibration
>FILE DFNAME='new.txt', IFNAME='MC10FIX.IFN', SAVE;
>SAVE PARM='MC10FIX';
>INPUT NTOT=17, TAKE=3000, NID=5, NTEST=1, LENGTH=17;
(5A1, T1, 17A1)
>TEST TNAME=I10FIX, ITEMS=(1(1)45), NBLOCK=17;
>BLOCK BNAME=FIXED, NITEMS=1, NCAT=2, ORI=(0, 1), MOD=(1, 2), REP=10, SKIP;
>BLOCK BNAME=FREEMC, NITEMS=1, NCAT=2, ORI=(0, 1), MOD=(1, 2), GPARM=0.2, GUESS=(2, EST), REP=5;
>BLOCK BNAME=FREED, NITEMS=1, NCAT=5, ORI=(0,1,2,3,4), MOD=(1,2,3,4,5), REP=2;
>CALIB NQPT=41, PAR, LOG, SCALE=1.7, CYCLE=200, NEWTON=0, FREE=(NOADJUST, NOADJUST), ESTORDER, SPRI, GPRI, POSTERIOR;
>SCORE NOSCORE;

Data File (New.txt)
[Figure: listing of the data file, with the item responses for the anchor items and for the CR items marked.]

Command File to Prepare IFNAME File (MC10FIX.IFN)
MWU-MEM FPC with PARSCALE
No Fix, “0, 1” Scaling
>COMMENT 10 common items fixed and 2 CR items calibration
>FILE DFNAME='new.txt', SAVE;
>SAVE PARM='MC10FIX';
>INPUT NTOT=17, TAKE=3000, NID=5, NTEST=1, LENGTH=17;
(5A1, T1, 17A1)
>TEST TNAME=I10FIX, ITEMS=(1(1)45), NBLOCK=17;
>BLOCK BNAME=FIXED, NITEMS=1, NCAT=2, ORI=(0, 1), MOD=(1, 2), GPARM=0.2, GUESS=(2, EST), REP=10;
>BLOCK BNAME=FREEMC, NITEMS=1, NCAT=2, ORI=(0, 1), MOD=(1, 2), GPARM=0.2, GUESS=(2, EST), REP=5;
>BLOCK BNAME=FREED, NITEMS=1, NCAT=5, ORI=(0,1,2,3,4), MOD=(1,2,3,4,5), REP=2;
>CALIB NQPT=41, PAR, LOG, SCALE=1.7, CYCLE=200, NEWTON=0, FREE=(NOADJUST, NOADJUST), ESTORDER, SPRI, GPRI, POSTERIOR;
>SCORE NOSCORE;
Notes
- No IFNAME keyword on the >FILE command
- No SKIP keyword on the FIXED block

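Item Parameter Output File from “0, 1” Scaling
[Figure: listing of the saved PARSCALE item parameter file, with title lines “MWU-MEM FPC with PARSCALE” and “No Fix, “0, 1” Scaling”, test name I10FIX, GROUP 01, and the FIXED blocks through the FREED blocks (intermediate blocks omitted). The parameter values are not reproduced here.]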

Modified Item Parameter File (MC10FIX.IFN)
[Figure: the same item parameter file after editing. For the 10 fixed items, the estimated a, b, and c values are replaced with the fixed a, b, and c values. The parameter values are not reproduced here.]

Command File for the STPU Method (After Transformation)
STPU FPC with Transformed Prior Points
The examinee group was sampled from N(1,1).
[Omitted: the same as the commands for MWU-MEM FPC]
>CALIB NQPT=31, PAR, LOG, SCALE=1.7, CYCLE=200, NEWTON=0, FREE=(NOADJUST, NOADJUST), ESTORDER, SPRI, GPRI, DIST=4, QPREAD;
>QUADP POINTS=(…), WEIGHTS=(…);
>SCORE NOSCORE;
Notes
- POINTS holds the quadrature points rescaled by θ* = Aθ + B, with A = 1.38 and B = 0.24. Most of the values are not legible here; the last point is 5.7761E+000.
- WEIGHTS holds the posterior weights from the “0, 1” scaling, not transformed. Most of the values are not legible here; the last weight is 4.1380E-005.

FPC Estimates of Non-Anchor Item Parameters on the Fixed Old Scale
[Table: item parameter estimates for the non-anchor MC and CR items under the STPU method and the MWU-MEM method. The numeric values are not reproduced here.]

FPC Estimates of Mean and SD of the Underlying Distribution on the Fixed Old Scale
[Table: Mean and Std. Dev. by method (STPU FPC, MWU-MEM FPC), with “Over-estimation” and “Under-estimation” annotations. The numeric values are not reproduced here.]
Mean-Sigma: B = (value not legible), A = 1.384
Note. The new group examinees were from a N(0.5, 1.2^2) distribution that was expressed on the fixed old scale.

V. Applications of FPC for Scaling and Equating
- Online calibration in computerized adaptive testing (CAT)
- Calibration of pretest items on the fixed operational scale in regular, non-CAT administration
- In a mixed-format test, separate calibration of CR items from MC items, to minimize the effects of bad CR items on MC item calibration
- Equating test forms in the CING design

V. Applications of FPC: Online Calibration in CAT
- In CAT, different sets of operational items are adaptively administered to examinees, with pretest items “seeded” in a certain common block of examinee groups.
- Because the operational items were already calibrated, their parameters are known in CAT.
- Thus, FPC may be the best way to calibrate and diagnose the pretest items on the scale of the operational items, without affecting the operational item parameters.

V. Applications of FPC: Calibration of Pretest Items on the Fixed Operational Scale
To develop test forms, pretest items are often administered together with operational items to examinees. However, it would be wise to calibrate operational items separately from pretest items, because the operational item parameters could be contaminated by bad pretest items. In this case, the ability distribution that is estimated using only the operational items can be reasonably used as the prior ability distribution for FPC with the pretest items, while the operational item parameters are used to fix the operational items in the FPC.

V. Applications of FPC: FPC with Different Formats of Items
A mixed-format test contains different types of items; for instance, some are MC items and others are CR items. Simultaneous calibration with both types of items can be conducted, assuming that a dominant factor underlies examinees’ responses to items. However, practitioners may want to calibrate MC items separately from CR items, because calibration with bad CR items might adversely affect the estimation of MC item parameters. In this case, MC items are first calibrated and then CR items are calibrated while fixing the MC item parameters.

V. Applications of FPC: Equating Test Forms in the CING Design
- Test equating using IRT requires all item parameters to be placed on a common scale (which is usually the old form scale).
- Once all item and ability parameters are placed on a common scale, IRT true score or observed score equating is conducted.
- Thus, FPC can be effectively used for placing all item parameters on the fixed old scale.
- The anchor, of course, consists of the items common to the new and old forms.
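Once both forms are on the common scale, IRT true score equating reduces to inverting one test characteristic curve and evaluating the other. The sketch below assumes the 3PL model, solves for θ by bisection (equating programs often use Newton-Raphson instead), and uses hypothetical item parameters for two three-item forms. True scores at or below the sum of the c parameters have no corresponding θ and are rejected here.

```python
import math

D = 1.7  # logistic scaling constant

def p3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: expected number-correct score at theta."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)

def theta_for_true_score(score, items, lo=-10.0, hi=10.0):
    """Invert the (monotone increasing) TCC by bisection."""
    if not tcc(lo, items) < score < tcc(hi, items):
        raise ValueError("true score outside the range reachable on this form")
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if tcc(mid, items) < score:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def equate_true_score(score_new, new_items, old_items):
    """Old form true score equivalent of a new form true score."""
    theta = theta_for_true_score(score_new, new_items)
    return tcc(theta, old_items)

# Hypothetical (a, b, c) parameters, all expressed on the fixed old scale.
new_form = [(1.0, -0.5, 0.2), (1.2, 0.0, 0.2), (0.8, 0.7, 0.2)]
old_form = [(1.1, -0.3, 0.2), (0.9, 0.2, 0.2), (1.3, 0.5, 0.2)]

for s in (1.0, 1.5, 2.0, 2.5):
    print(s, round(equate_true_score(s, new_form, old_form), 3))
```

The result is meaningful only if the new form parameters really are on the old scale, which is exactly what FPC or one of the other two approaches is meant to guarantee.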

EXPLORE FPC END Thank You
