Presented at Measured Progress

Presentation transcript:

IRT Fixed Parameter Calibration and Other Approaches to Maintaining Item Parameters on a Common Ability Scale
Seonghoon Kim, PhD
Keimyung University
Email: seonghoonkim@empal.com
Presented at Measured Progress on July 10, 2008

Overview
I. Nature of IRT Ability Scale
II. Three Approaches to Maintaining Item Parameters on a Common Scale
III. Principle of Fixed Parameter Calibration (FPC)
IV. Use of Computer Programs for FPC
V. Applications of FPC for Scaling and Equating

Reference Guide
This presentation was prepared based on my articles:
Kim, S. (2006a). A comparative study of IRT fixed parameter calibration methods. Journal of Educational Measurement, 43 (4), 355-381.
Kim, S. (2006b). A study on IRT fixed parameter calibration methods using BILOG-MG. Journal of Educational Evaluation, 19 (1), 323-342.
Kim, S., & Kolen, M. J. (2006). Robustness to format effects of IRT linking methods for mixed-format tests. Applied Measurement in Education, 19 (4), 357-381.
Kim, S., & Lee, W. (2006). An extension of four IRT linking methods for mixed-format tests. Journal of Educational Measurement, 43 (1), 53-76.
Kim, S., & Kolen, M. J. (2007). Effects on scale linking of different definitions of criterion functions for the IRT characteristic curve methods. Journal of Educational and Behavioral Statistics, 32 (4), 371-397.
and on my recent thoughts and work on FPC.

I. Nature of IRT Ability Scale
Indeterminacy in IRT modeling
Item response function (IRF) and metrics
  Two-parameter logistic (2PL) model: IRF = P(θ | a, b) = 1/[1 + exp(-Da(θ - b))]
  Suppose that θO = A θN + B.
  If aO = aN/A and bO = A bN + B, then P(θO | aO, bO) = P(θN | aN, bN).
Therefore, the IRF is invariant under a linear transformation of θ, provided the item parameters are transformed accordingly.
Thus, in practice, either θO or θN can be used; this is what scale indeterminacy means.
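The invariance stated on this slide can be checked numerically. The following is a minimal Python sketch, not part of the original presentation; the function names and the value D = 1.7 are illustrative assumptions.

```python
import math

D = 1.7  # scaling constant; 1.7 is assumed here for illustration


def irf_2pl(theta, a, b, d=D):
    """2PL item response function P(theta | a, b)."""
    return 1.0 / (1.0 + math.exp(-d * a * (theta - b)))


def to_old_scale(theta_n, a_n, b_n, A, B):
    """Apply theta_O = A*theta_N + B, a_O = a_N/A, b_O = A*b_N + B."""
    return A * theta_n + B, a_n / A, A * b_n + B


# Hypothetical values on the new scale and hypothetical linking coefficients.
theta_n, a_n, b_n = 0.3, 1.2, -0.5
A, B = 1.04, 1.03

p_new = irf_2pl(theta_n, a_n, b_n)
p_old = irf_2pl(*to_old_scale(theta_n, a_n, b_n, A, B))
print(abs(p_new - p_old) < 1e-12)  # the two probabilities agree
```

The agreement follows because aO(θO - bO) = (aN/A)(A θN + B - A bN - B) = aN(θN - bN).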

I. Nature of IRT Ability Scale
“0, 1” Scaling vs. Rasch Scaling
“0, 1” scaling
  Scaling by arbitrarily assuming that the mean (M) and standard deviation (SD) of the ability distribution are equal to 0 (origin) and 1 (unit).
  Such arbitrary but “standardized” fixing is unavoidable when the M and SD are unknown.
Rasch scaling
  Setting the origin (0) of the scale at the average difficulty of all items involved, while fixing the unit at 1.
  The fixed unit is guaranteed by the Rasch modeling.

I. Nature of IRT Ability Scale
Need for a Fixed Common Ability Scale
A fixed common scale should be used across test administrations for several reasons:
  To check the invariance property of item parameters
  To achieve comparability between item parameters from different administrations
  To develop an item pool
  To conduct IRT equating

I. Nature of IRT Ability Scale
Need for a Fixed Common Ability Scale
Developing a common ability scale requires all new scales to be linked to the fixed old scale θO.
[Figure: new scales θN1, θN2, and θN3, each linked to the old scale θO]

I. Nature of IRT Ability Scale
Factors for Development of a Common Scale
Development of a fixed common scale is subject to:
Data collection design for IRT scaling and equating test forms
  Random groups design vs. Common-item nonequivalent groups design
Scaling convention
  “0, 1” scaling vs. Rasch scaling
Item parameter estimation method
  Marginal maximum likelihood (MML) estimation vs. Joint maximum likelihood (JML) estimation

The Context Assumed in This Presentation
Data collection design for IRT scaling and equating test forms
  Common-item nonequivalent groups (CING) design
  Anchor items (i.e., common items) link two test forms
Scaling convention
  “0, 1” scaling
  Group dependent
  In a random groups design, two “0, 1” scales from alternative forms may be considered equivalent.
Marginal Maximum Likelihood (MML) Estimation
  Estimation of item parameters
  Estimation of the underlying ability distribution
  Quadrature weights are estimated at quadrature points.

Data Structure Illustration for the CING Design
[Figure: examinee-by-item data layout. The old group (1) takes the old form, made up of the items unique to the old form plus the common (anchor) items. The new group (2) takes the new form, made up of the common (anchor) items plus the items unique to the new form.]

II. Three Approaches to Maintaining a Common Scale
Separate calibration by form and linking
  Estimate transformation coefficients A and B using two sets of item parameter estimates for the anchor items
  Use A and B to transform new form item parameter estimates into those on the old scale
Fixed parameter calibration (FPC)
  Holding the old form anchor item parameters fixed and estimating the new form non-anchor item parameters
Concurrent calibration (aka multiple-group estimation)
  Combining new and old form data and estimating both the item parameters and the underlying ability distributions, with the old group being designated as the reference-scale group
  Will not be addressed in detail in this presentation

II. Maintaining the Old Scale
Separate Calibration by Form and Linking
“0, 1” scales from two test forms
  Old form scale: θO (reference)
  New form scale: θN (arbitrary)
Scheme of linking two “0, 1” scales: θO = A θN + B
[Figure: the θN axis (arbitrary origin and unit) mapped onto the θO axis (fixed origin and unit), with B as the shift in origin and A as the change in unit]

II. Maintaining the Old Scale
Separate Calibration by Form and Linking
Linking ability scales is completed by placing all item parameters from separate calibrations onto the fixed old scale.
In the case of the 2PL model, given A and B, the aN and bN parameters from a new scale are transformed into a* = aN/A and b* = A bN + B.
In practice, A and B are estimated with item parameter estimates from the old and new scales.
  Mean-Sigma Method (Marco, 1977)
  Mean-Mean Method (Loyd & Hoover, 1980)
  Haebara Method (Haebara, 1980)
  Stocking-Lord Method (Stocking & Lord, 1983)
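As an illustration of the two moment-based methods listed above, here is a small Python sketch. It uses the usual textbook definitions (mean-sigma: A is the ratio of the SDs of the anchor b estimates; mean-mean: A is the ratio of the mean anchor a estimates), which are assumed rather than taken from the slide. The function names and the anchor values are hypothetical.

```python
from statistics import mean, pstdev


def mean_sigma(b_old, b_new):
    """Mean-sigma coefficients from anchor-item difficulty estimates."""
    A = pstdev(b_old) / pstdev(b_new)
    B = mean(b_old) - A * mean(b_new)
    return A, B


def mean_mean(a_old, a_new, b_old, b_new):
    """Mean-mean coefficients from anchor-item a and b estimates."""
    A = mean(a_new) / mean(a_old)
    B = mean(b_old) - A * mean(b_new)
    return A, B


def to_old_scale(a_new, b_new, A, B):
    """Transform new-scale estimates: a* = a_N / A, b* = A * b_N + B."""
    return [a / A for a in a_new], [A * b + B for b in b_new]


# Hypothetical anchor estimates on the new scale, and error-free
# counterparts on the old scale generated with A = 1.04, B = 1.03.
a_new = [0.8, 1.0, 1.2]
b_new = [-1.0, 0.0, 1.5]
a_old = [a / 1.04 for a in a_new]
b_old = [1.04 * b + 1.03 for b in b_new]

print(mean_sigma(b_old, b_new))
print(mean_mean(a_old, a_new, b_old, b_new))
```

With real estimates the two methods generally give slightly different A and B because of estimation error. The characteristic curve methods (Haebara, Stocking-Lord) instead minimize a criterion function and are not sketched here.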

II. Maintaining the Old Scale
Comparative Performance
Suppose that a characteristic curve (Haebara or Stocking-Lord) method is employed as the linking method for the “separate calibration and linking” approach.
The relative performance of the three approaches to maintaining the old scale differs depending on whether the new form items are common items or not (Hanson & Béguin, 2002; Kim, 2006b; Kim & Kolen, in process).
  For the common items, concurrent calibration would perform best, due mainly to the larger sample size (new group + old group) available for these items than for the non-common items.
  For the non-common items, the three approaches would perform almost equally.


II. Maintaining the Old Scale
When is FPC most appropriate?
When using the “stable” old form anchor item parameters to obtain or diagnose the parameters of new form non-anchor items on the fixed old scale
Note
  Placing the parameters of new form non-anchor items on the old scale is the focus.
  Updating the old form item parameters is not a concern at all.
  The old form anchor items are assumed to have stable parameter estimates because a large sample was used to obtain them.

III. Principle of FPC
Basics
Why
  To place the parameters of new form non-anchor items onto the fixed old scale
How
  Holding the old form anchor item parameters fixed and estimating the new form non-anchor item parameters
Critical Process
  Estimating the underlying distribution of ability for the new form on the fixed old scale so that the new item parameters may be properly expressed on the old scale.
  Under IRT modeling, the underlying distribution can be estimated using both the new form data and the fixed anchor item parameters.

III. Principle of FPC
Schematic Illustration of Updating Priors and Underlying Distributions of Ability
[Figure: with the anchor item parameters a1O, b1O, a2O, b2O, … held fixed, EM iterations start from the 1st initial prior. The 1st estimated ability distribution becomes the 2nd initial prior, the 2nd estimated ability distribution becomes the 3rd initial prior, and so on until the final estimated ability distribution. The result is the estimated new item parameters (a1N, b1N, …, bJN) on the θO scale.]

III. Principle of FPC
Numerical Expression: Multiple Prior Weights Updating and Multiple EM Cycles (MWU-MEM)
Likelihood Function for Estimating New Form Non-Anchor Item Parameters (iteration s, quadrature point k, person i, data y, parameters Δ)
Closed-Form Formula for Estimating Quadrature Weights of the Underlying Ability Distribution from the New Form Data
Refer to Kim (2006a) for numerical details.
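The formulas on this slide are not reproduced in the transcript. As a rough illustration of what a closed-form quadrature weight update looks like, the sketch below uses the standard EM update for a discrete latent distribution (each new weight is the average posterior probability of its quadrature point across examinees). This is an assumption and may differ in detail from Kim (2006a). It also simplifies by using only fixed 2PL items, whereas in FPC the freed items are re-estimated between weight updates; all names and simulated values are hypothetical.

```python
import numpy as np

D = 1.7  # scaling constant, assumed for illustration


def p_2pl(theta, a, b):
    """2PL probabilities, shape (len(theta), n_items)."""
    return 1.0 / (1.0 + np.exp(-D * a[None, :] * (theta[:, None] - b[None, :])))


def update_weights(y, a_fix, b_fix, points, weights, n_cycles=100):
    """Repeated EM updates of quadrature weights, item parameters held fixed."""
    p = p_2pl(points, a_fix, b_fix)                        # (K, J)
    loglik = y @ np.log(p).T + (1 - y) @ np.log(1 - p).T   # (N, K)
    loglik -= loglik.max(axis=1, keepdims=True)            # numerical stability
    like = np.exp(loglik)
    w = np.asarray(weights, dtype=float)
    for _ in range(n_cycles):
        post = like * w                                    # prior times likelihood
        post /= post.sum(axis=1, keepdims=True)            # posterior per examinee
        w = post.mean(axis=0)                              # updated weights
    return w


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a_fix = rng.uniform(0.5, 1.2, size=20)      # hypothetical fixed anchors
    b_fix = np.linspace(-2.0, 2.0, 20)
    theta = rng.normal(1.0, 1.0, size=3000)     # new group off the N(0, 1) origin
    y = (rng.random((3000, 20)) < p_2pl(theta, a_fix, b_fix)).astype(float)

    points = np.linspace(-4.0, 4.0, 31)
    prior = np.exp(-0.5 * points**2)
    prior /= prior.sum()                        # N(0, 1)-shaped starting prior

    w = update_weights(y, a_fix, b_fix, points, prior)
    est_mean = w @ points
    est_sd = np.sqrt(w @ (points - est_mean) ** 2)
    print(est_mean, est_sd)
```

Because the item parameters already carry the origin and unit of the old scale, the updated weights drift away from the N(0, 1)-shaped starting prior toward the new group's distribution on that scale.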

III. Principle of FPC
Summary of Key Points
The values of the fixed anchor item parameters are expressed on the fixed old scale, so the origin and unit of the ability scale for the new form data have already been set.
  That is, we do not need to use “0, 1” scaling for the new form data.
New form non-anchor item parameters should be estimated using the new form underlying distribution that is properly recovered on the fixed old scale.
  As with ability estimates, the underlying distribution can be estimated using the new form data and the fixed anchor item parameters.
Fixing the anchor item parameters gradually pulls the underlying distribution onto the old scale.
  Accordingly, the new form item parameters are also pulled onto the old scale.

III. Principle of FPC
Concerns about the Unstable Estimates of Anchor Item Parameters
Unstable estimates of the fixed item parameters might adversely affect the performance of FPC.
However, Kim (2006a) showed that FPC is robust to sampling errors of the fixed item parameter estimates in calibrating non-anchor items.
This seems to be because the new form data work together with the fixed item parameters in “revealing” the old scale.
In other words, as long as the sample size of the new group is large enough, unstable estimates of the fixed item parameters would not greatly affect the proper estimation of either the underlying distribution for the new group or the non-anchor item parameters.

III. Principle of FPC
Two Alternatives to the MWU-MEM Method
Some computer programs, such as BILOG-MG, do not update the prior quadrature weights during EM cycles when conducting FPC.
  The resulting posterior (quadrature) weights would not properly represent the underlying ability distribution for the new form data.
Two ad hoc methods can be used to obtain good estimates of the quadrature weights for the underlying distribution.
  Simple Transformation Prior Update (STPU) Method
  Iterative-Run Prior Update (IRPU) Method

III. Principle of FPC
Two Alternatives to the MWU-MEM Method
Simple Transformation Prior Update (STPU) Method
  Uses A and B from a linking method to simply update the prior ability distribution by transforming the posterior distribution from the regular, separate calibration with the new form.
  Then, conduct FPC with the updated prior ability distribution.
Iterative-Run Prior Update (IRPU) Method
  Uses iteratively updated prior ability distributions through multiple FPC runs of BILOG-MG.
  An estimated posterior distribution in a calibration run is used as a prior distribution in the next calibration until the sequential procedure minimizes the difference between the two distributions.

III. Principle of FPC
Two Alternatives to the MWU-MEM Method
Kim (2006b) shows that the two ad hoc methods for updating the prior ability distribution work very well.
  In recovering the parameters of non-anchor items, the two methods perform almost as well as the Stocking-Lord linking method and concurrent calibration.
In practice, the STPU method may be preferred due to its simplicity.
The IRPU method has the same feature as the MWU-MEM method, except that it requires multiple runs of FPC.
  Thus, theoretically, the IRPU method may be more acceptable than the STPU method.

III. Principle of FPC
Caveats against Using “Constrained” Estimation for FPC
One might think that imposing strong Bayesian priors on the fixed item parameters and freeing the non-anchor item parameters would function similarly to FPC.
A rationale for such constrained estimation can be found in, for example, the BILOG (Mislevy & Bock, 1990) manual.
In theory, it sounds reasonable.
However, my experience suggests that using strong priors to fix the anchor item parameters tends to distort the non-fixed item parameters.

III. Principle of FPC Caveats against Using “Constrained” Estimation for FPC Note that in constrained estimation the anchor item parameters are to be estimated (although almost fixed), while in FPC they are excluded from the parameter list to be estimated. Without a facility to update ability prior weights, both the underlying distribution and non-anchor item parameters would be distorted.

IV. Use of Computer Programs for FPC
BILOG-MG 7.0 (Zimowski et al., 2003)
  The “FIX” option does not function properly because the prior weights are not updated during EM cycles (Kim, 2006a).
  The STPU or IRPU method can be used.
PARSCALE 4.1 (Muraki & Bock, 2003)
  For FPC to work properly, the “POSTERIOR” option should be used (Kim, 2006a).
  Without the “POSTERIOR” option, the STPU or IRPU method can be used.

IV. Use of Computer Programs for FPC
Illustration of FPC with BILOG-MG
Data
  3,000 examinees for the new form data
  The data were obtained by simulating examinees from a N(1, 1) distribution, against the old group's N(0, 1) distribution.
  25-item multiple-choice (MC) test
FPC
  First 20 items fixed (item parameters are ready for use)
  Last 5 items freed
The three-parameter logistic (3PL) model is used for item analyses.
Comparison of Default, STPU, and IRPU FPC methods
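The presentation does not say how the responses were generated. One plausible way to simulate data of this kind is sketched below in Python, assuming the standard 3PL form P = c + (1 - c)/[1 + exp(-Da(θ - b))] with D = 1.7. The function names, the seed, and the choice of D are assumptions; the three example items reuse the first three rows of the fixed parameter file shown later in this presentation.

```python
import numpy as np

D = 1.7  # scaling constant, assumed for illustration


def p_3pl(theta, a, b, c):
    """3PL probabilities, shape (n_examinees, n_items)."""
    logistic = 1.0 / (1.0 + np.exp(-D * a * (theta[:, None] - b)))
    return c + (1.0 - c) * logistic


def simulate_responses(a, b, c, n=3000, mean=1.0, sd=1.0, seed=0):
    """Draw abilities from N(mean, sd^2) and score 0/1 responses."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(mean, sd, size=n)
    p = p_3pl(theta, a, b, c)
    y = (rng.random(p.shape) < p).astype(int)
    return y, theta


# First three items of the fixed parameter file (a, b, c).
a = np.array([0.48877, 0.78980, 0.86113])
b = np.array([-1.76191, -1.51222, -1.46012])
c = np.array([0.18850, 0.18301, 0.17266])

y, theta = simulate_responses(a, b, c)
print(y[:5])           # first five simulated response patterns
print(y.mean(axis=0))  # proportion correct per item
```

Writing each row of `y` as a string of digits after a four-character ID gives a file in the layout that the `(4A1, T1, 25A1)` format statement appears to expect.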

IV. Use of Computer Programs for FPC
Illustration of FPC with BILOG-MG
Command File (to Use the Default FPC Facility)

Default FPC with BILOG-MG
The examinee group (2) was sampled from N(1,1)
>COMMENT Fixed-parameter calibration
>GLOBAL DFNAME='New.txt', PRNAME='Sample.PRM', NPARM=3, SAVE;
>SAVE PAR='itempar';
>LENGTH NITEMS=25;
>INPUT NTOT=25, SAMPLE=3000, NALT=5, NID=4;
>ITEMS INUM=(1(1)25), INAMES=(O01(1)O20, P01, P02, P03, P04, P05);
>TEST TNAME=G2_FIX, INUM=(1(1)25), FIX=(1(0)20, 0(0)5);
(4A1, T1, 25A1)
>CALIB NQPT=31, CYCLE=100, CRIT=0.001, NEWTON=1, NOADJUST;
>SCORE NOPRINT;

IV. Use of Computer Programs for FPC
Illustration of FPC with BILOG-MG
Data File (New.txt)

1111111111111111110111111
1111110100111111110011100
1111101000111111010111111
1111110111111111111111111
1111111110111111011011111
1111111110111011101011111
0111100100000100001001111
0110011110111111010011111
1111111111111111111111111
0111101111111011110011111
1111111111111110111111111
1111010111111111111011111
1111111110011111100011110
. . . . . . . .

The first 20 columns are the item responses for the anchor items.

IV. Use of Computer Programs for FPC
Illustration of FPC with BILOG-MG
Fixed Parameter File (Sample.PRM)
The first line gives the number of fixed items; each following line gives the item number and its a, b, and c parameters.

20
01  0.48877  -1.76191  0.18850
02  0.78980  -1.51222  0.18301
03  0.86113  -1.46012  0.17266
04  0.59502  -1.07553  0.20835
05  0.81096  -0.79854  0.20981
06  0.84988  -0.62070  0.12481
07  0.59386  -0.30609  0.17302
08  0.79144  -0.07422  0.23463
09  0.51684   0.48596  0.20394
10  0.90287   1.19854  0.16761
11  0.50175  -2.00058  0.21263
12  0.81267  -1.53418  0.15649
13  1.16172  -1.22405  0.13872
14  0.52306  -1.01148  0.18519
15  0.74785  -0.84378  0.20893
16  0.77883  -0.68332  0.19013
17  0.88805  -0.41610  0.18126
18  0.90752   0.08592  0.17534
19  0.62818   0.65946  0.26229
20  0.85275   1.82052  0.13813

IV. Use of Computer Programs for FPC
Illustration of FPC with BILOG-MG
Command File for the STPU Method (Before Transformation)

Single Group “0, 1” Scaling,
Although the examinee group was sampled from N(1,1).
>COMMENT STPU FPC before Transformation of Ability Points
>GLOBAL DFNAME='New.txt', NPARM=3, SAVE;
>SAVE PAR='sampleSim01.PAR';
>LENGTH NITEMS=25;
>INPUT NTOT=25, SAMPLE=3000, NALT=5, NID=4;
>ITEMS INUM=(1(1)25), INAMES=(O01(1)O20, P01, P02, P03, P04, P05);
>TEST TNAME=NO_FIX;
(4A1, T1, 25A1)
>CALIB NQPT=31, CYCLE=100, CRIT=0.001, NEWTON=0, IDIST=0;
>SCORE NOPRINT;

IV. Use of Computer Programs for FPC
Illustration of FPC with BILOG-MG
Posterior Distribution from “0, 1” Scaling for the STPU Method

QUADRATURE POINTS, POSTERIOR WEIGHTS, MEAN AND S.D.:
                    1            2            3            4            5
POINT        -0.4036E+01  -0.3767E+01  -0.3498E+01  -0.3229E+01  -0.2960E+01
POSTERIOR     0.2163E-04   0.7268E-04   0.2169E-03   0.5802E-03   0.1392E-02
                    6            7            8            9           10
POINT        -0.2691E+01  -0.2422E+01  -0.2153E+01  -0.1884E+01  -0.1615E+01
POSTERIOR     0.3030E-02   0.6054E-02   0.1104E-01   0.1842E-01   0.2878E-01
                   11           12           13           14           15
POINT        -0.1346E+01  -0.1076E+01  -0.8074E+00  -0.5384E+00  -0.2693E+00
POSTERIOR     0.4281E-01   0.5985E-01   0.7752E-01   0.9294E-01   0.1036E+00
                   16           17           18           19           20
POINT        -0.2361E-03   0.2688E+00   0.5379E+00   0.8069E+00   0.1076E+01
POSTERIOR     0.1074E+00   0.1034E+00   0.9265E-01   0.7725E-01   0.6001E-01
                   21           22           23           24           25
POINT         0.1345E+01   0.1614E+01   0.1883E+01   0.2152E+01   0.2421E+01
POSTERIOR     0.4343E-01   0.2927E-01   0.1837E-01   0.1073E-01   0.5841E-02
                   26           27           28           29           30
POINT         0.2690E+01   0.2959E+01   0.3228E+01   0.3498E+01   0.3767E+01
POSTERIOR     0.2957E-02   0.1399E-02   0.6105E-03   0.2514E-03   0.9631E-04
                   31
POINT         0.4036E+01
POSTERIOR     0.3212E-04
MEAN          0.00000
S.D.          1.00000

IV. Use of Computer Programs for FPC
Illustration of FPC with BILOG-MG
Command File for the STPU Method (After Transformation)

STPU FPC with Transformed Prior Points
The examinee group was sampled from N(1,1).
Omitted (The same as the commands for before-transformation “0, 1” calibration)
>TEST TNAME=G2_FIX, INUM=(1(1)25), FIX=(1(0)20, 0(0)5);
(4A1, T1, 25A1)
>CALIB NQPT=31, CYCLE=100, CRIT=0.001, NEWTON=0, IDIST=1, NOADJUST;
>QUAD POINTS=(
  -3.1663E+000 -2.8864E+000 -2.6065E+000 -2.3266E+000 -2.0467E+000
  -1.7668E+000 -1.4869E+000 -1.2070E+000 -9.2710E-001 -6.4720E-001
  -3.6730E-001 -8.6352E-002 1.9314E-001 4.7304E-001 7.5305E-001
  1.0330E+000 1.3130E+000 1.5930E+000 1.8729E+000 2.1529E+000
  2.4328E+000 2.7127E+000 2.9926E+000 3.2725E+000 3.5524E+000
  3.8323E+000 4.1122E+000 4.3921E+000 4.6731E+000 4.9530E+000
  5.2329E+000),
  WEIGHTS=(
  2.1630E-005 7.2680E-005 2.1690E-004 5.8020E-004 1.3920E-003
  3.0300E-003 6.0540E-003 1.1040E-002 1.8420E-002 2.8780E-002
  4.2810E-002 5.9850E-002 7.7520E-002 9.2940E-002 1.0360E-001
  1.0740E-001 1.0340E-001 9.2650E-002 7.7250E-002 6.0010E-002
  4.3430E-002 2.9270E-002 1.8370E-002 1.0730E-002 5.8410E-003
  2.9570E-003 1.3990E-003 6.1050E-004 2.5140E-004 9.6310E-005
  3.2120E-005);
>SCORE NOPRINT;

POINTS: rescaled by θ* = Aθ + B, with A = 1.040535 and B = 1.033264.
WEIGHTS: from “0, 1” scaling (not transformed).
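The rescaling of the quadrature points in the STPU method is a plain linear transformation and can be sketched in a few lines of Python. The helper names below are hypothetical; A and B are the values quoted on this slide, and the two sample points are the first and last quadrature points from the “0, 1” posterior listing.

```python
import math


def rescale_points(points, A, B):
    """Move quadrature points to the old scale: theta* = A*theta + B."""
    return [A * p + B for p in points]


def moments(points, weights):
    """Mean and SD of a discrete distribution on quadrature points."""
    total = sum(weights)
    mean = sum(p * w for p, w in zip(points, weights)) / total
    var = sum(w * (p - mean) ** 2 for p, w in zip(points, weights)) / total
    return mean, math.sqrt(var)


A, B = 1.040535, 1.033264

# First and last points of the "0, 1" posterior listing.
first, last = rescale_points([-4.036, 4.036], A, B)
print(round(first, 4), round(last, 4))  # compare with the first and last POINTS entries
```

Because the weights are left untransformed, a posterior with mean 0 and SD 1 on the “0, 1” scale becomes a prior with mean B and SD A on the old scale, which `moments` can confirm for any set of points and weights.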

IV. Use of Computer Programs for FPC
Illustration of FPC with BILOG-MG
2nd Command File for the IRPU Method

IRPU FPC with Updated Prior Weights
The examinee group was sampled from N(1,1).
Omitted (The same as the commands for the default FPC run)
>TEST TNAME=G2_FIX, INUM=(1(1)25), FIX=(1(0)20, 0(0)5);
(4A1, T1, 25A1)
>CALIB NQPT=31, CYCLE=100, CRIT=0.001, NEWTON=0, IDIST=1, NOADJUST;
>QUAD POINTS=(
  -4.0000E+000 -3.7330E+000 -3.4670E+000 -3.2000E+000 -2.9330E+000
  -2.6670E+000 -2.4000E+000 -2.1330E+000 -1.8670E+000 -1.6000E+000
  -1.3330E+000 -1.0670E+000 -8.0000E-001 -5.3330E-001 -2.6670E-001
  -7.7720E-016 2.6670E-001 5.3330E-001 8.0000E-001 1.0670E+000
  1.3330E+000 1.6000E+000 1.8670E+000 2.1330E+000 2.4000E+000
  2.6670E+000 2.9330E+000 3.2000E+000 3.4670E+000 3.7330E+000
  4.0000E+000),
  WEIGHTS=(
  8.8370E-007 3.0840E-006 1.0040E-005 3.1720E-005 9.4690E-005
  2.5560E-004 6.3580E-004 1.4490E-003 3.0500E-003 6.0110E-003
  1.1060E-002 1.8890E-002 3.0200E-002 4.5590E-002 6.4400E-002
  8.4190E-002 1.0160E-001 1.1300E-001 1.1550E-001 1.0830E-001
  9.2970E-002 7.3160E-002 5.2690E-002 3.4660E-002 2.0800E-002
  1.1400E-002 5.7180E-003 2.6290E-003 1.1160E-003 4.3390E-004
  1.5790E-004);
>SCORE NOPRINT;

POINTS: fixed points (-4.0 to 4.0).
WEIGHTS: updated weights (= posterior weights from the 1st run of IRPU FPC).

IV. Use of Computer Programs for FPC
Illustration of FPC with BILOG-MG
History of Updated Posterior Distributions by the IRPU Method

Iter#   Mean    Std. Dev.
 0      0.000   1.000
 1      0.699   0.923    (from default FPC)
 2      0.876   0.921
 3      0.933   0.932
 4      0.954   0.943
 5      0.963   0.951
 6      0.967   0.956
 7      0.969   0.960
 8      0.971   0.963
 9      0.972   0.965
10      0.973   0.966
11      0.973   0.967
12      0.974   0.968

Iterations stopped because the M and SD did not change by more than the 0.001 limit.
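The run-to-run logic of the IRPU method can be sketched as a simple loop in Python. In the sketch below, `run_fpc` is a hypothetical stand-in for one complete FPC run of the calibration program (for example, writing a command file with the current prior weights, running the program, and reading back the posterior weights); it is not a real API. The stopping rule, a change of at most 0.001 in both the mean and the SD, follows the note on the slide.

```python
import math


def moments(points, weights):
    """Mean and SD of a discrete distribution on quadrature points."""
    total = sum(weights)
    mean = sum(p * w for p, w in zip(points, weights)) / total
    var = sum(w * (p - mean) ** 2 for p, w in zip(points, weights)) / total
    return mean, math.sqrt(var)


def irpu(run_fpc, points, prior, tol=0.001, max_runs=50):
    """Feed each run's posterior weights back in as the next run's prior.

    run_fpc(points, weights) -> posterior weights; hypothetical callable
    standing in for one FPC run of the calibration program.
    """
    weights = list(prior)
    history = [moments(points, weights)]
    for _ in range(max_runs):
        weights = list(run_fpc(points, weights))
        history.append(moments(points, weights))
        (m0, s0), (m1, s1) = history[-2], history[-1]
        if abs(m1 - m0) <= tol and abs(s1 - s0) <= tol:
            break
    return weights, history
```

A call would look like `weights, history = irpu(my_run, points, prior)`, where `my_run` wraps the actual program. The returned `history` holds the same kind of (mean, SD) sequence as the table above.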

IV. Use of Computer Programs for FPC
Illustration of FPC with BILOG-MG
FPC Estimates of Non-Anchor Item Parameters on the Fixed Old Scale

        Mean/Sigma                 Default FPC
Item    a       b        c         a       b        c
21      0.591   -1.947   0.212     0.650   -1.994   0.214
22      0.831   -1.643   0.222     0.922   -1.699   0.230
23      1.027   -1.781   0.196     1.128   -1.850   0.198
24      0.566   -0.988   0.213     0.635   -1.089   0.220
25      0.605   -0.727   0.206     0.681   -0.847   0.216

        STPU FPC                   IRPU FPC
Item    a       b        c         a       b        c
21      0.605   -1.909   0.210     0.624   -1.844   0.208
22      0.863   -1.587   0.222     0.887   -1.542   0.217
23      1.065   -1.723   0.196     1.100   -1.663   0.195
24      0.575   -0.991   0.209     0.594   -0.952   0.207
25      0.614   -0.729   0.205     0.637   -0.689   0.206

IV. Use of Computer Programs for FPC
Illustration of FPC with BILOG-MG
FPC Estimates of Mean and SD of the Underlying Distribution on the Fixed Old Scale

Method         Mean        Std. Dev.
Default FPC    0.699       0.923       (under-estimation)
STPU FPC       1.003       1.018
IRPU FPC       0.974       0.968
Mean-Sigma     B = 1.033   A = 1.041

Note. The new group examinees were from a N(1,1) distribution that was expressed on the fixed old scale.

IV. Use of Computer Programs for FPC
Illustration of FPC with PARSCALE
Data
  3,000 examinees for the new form data
  The data were obtained by simulating examinees from a N(0.5, 1.2^2) distribution, against the old group's N(0, 1) distribution.
  A mixed-format test of 15 MC items and 2 five-category constructed-response (CR) items
FPC
  First 10 MC items fixed (item parameters are ready for use)
  Last 5 MC and 2 CR items freed
The 3PL model is used for MC items and the generalized partial credit (GPC) model for CR items.
Comparison of STPU and MWU-MEM methods

IV. Use of Computer Programs for FPC
Illustration of FPC with PARSCALE
Command File (MWU-MEM FPC)

MWU-MEM FPC with PARSCALE
The examinee group was sampled from N(0.5, 1.2^2)
>COMMENT 10 common items fixed and 2 CR items calibration
>FILE DFNAME='new.txt', IFNAME='MC10FIX.IFN', SAVE;
>SAVE PARM='MC10FIX';
>INPUT NTOT=17, TAKE=3000, NID=5, NTEST=1, LENGTH=17;
(5A1, T1, 17A1)
>TEST TNAME=I10FIX, ITEMS=(1(1)45), NBLOCK=17;
>BLOCK BNAME=FIXED, NITEMS=1, NCAT=2, ORI=(0, 1), MOD=(1, 2), REP=10, SKIP;
>BLOCK BNAME=FREEMC, NITEMS=1, NCAT=2, ORI=(0, 1), MOD=(1, 2), GPARM=0.2, GUESS=(2, EST), REP=5;
>BLOCK BNAME=FREED, NITEMS=1, NCAT=5, ORI=(0,1,2,3,4), MOD=(1,2,3,4,5), REP=2;
>CALIB NQPT=41, PAR, LOG, SCALE=1.7, CYCLE=200, NEWTON=0, FREE=(NOADJUST, NOADJUST), ESTORDER, SPRI, GPRI, POSTERIOR;
>SCORE NOSCORE;

IV. Use of Computer Programs for FPC
Illustration of FPC with PARSCALE
Data File (New.txt)

11111101111111132
11111111111111144
11111011001111032
11111111101111134
11111100011111031
11110110000101113
11010100011111111
01111101001111144
00000101000100001
00011000001100100
11111101101111122
. . . . . . . . .

The first 10 columns are the item responses for the anchor items; the last 2 columns are the item responses for the CR items.

IV. Use of Computer Programs for FPC
Illustration of FPC with PARSCALE
Command File to Prepare IFNAME File (MC10FIX.IFN)

MWU-MEM FPC with PARSCALE
No Fix, “0, 1” Scaling
>COMMENT 10 common items fixed and 2 CR items calibration
>FILE DFNAME='new.txt', SAVE;
>SAVE PARM='MC10FIX';
>INPUT NTOT=17, TAKE=3000, NID=5, NTEST=1, LENGTH=17;
(5A1, T1, 17A1)
>TEST TNAME=I10FIX, ITEMS=(1(1)45), NBLOCK=17;
>BLOCK BNAME=FIXED, NITEMS=1, NCAT=2, ORI=(0, 1), MOD=(1, 2), GPARM=0.2, GUESS=(2, EST), REP=10;
>BLOCK BNAME=FREEMC, NITEMS=1, NCAT=2, ORI=(0, 1), MOD=(1, 2), GPARM=0.2, GUESS=(2, EST), REP=5;
>BLOCK BNAME=FREED, NITEMS=1, NCAT=5, ORI=(0,1,2,3,4), MOD=(1,2,3,4,5), REP=2;
>CALIB NQPT=41, PAR, LOG, SCALE=1.7, CYCLE=200, NEWTON=0, FREE=(NOADJUST, NOADJUST), ESTORDER, SPRI, GPRI, POSTERIOR;
>SCORE NOSCORE;

Note: this run uses no IFNAME keyword in >FILE and no SKIP keyword in the FIXED block.

IV. Use of Computer Programs for FPC
Illustration of FPC with PARSCALE
Item Parameter Output File from “0, 1” Scaling

MWU-MEM FPC with PARSCALE
No Fix, “0, 1” Scaling
I10FIX 17 17 7 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
GROUP 01
FIXED 20001 0.94308 0.07058 -1.12375 0.14908 0.26134 0.07792 0.00000 0.00000
BLOCK 20002 0.98019 0.06877 -0.93880 0.12173 0.21813 0.06540
BLOCK 20003 1.18582 0.07723 -0.72689 0.08253 0.19030 0.04856
(Omitted)
FREED 50016 1.16556 0.03437 -0.14845 0.01309 0.00000 0.00000 0.00000 1.25729 0.29044 -0.33537 -1.21236 0.00000 0.04262 0.03157 0.02902 0.03037
BLOCK 50017 1.42147 0.04095 -0.19171 0.01178 0.00000 0.00000 0.00000 1.29058 0.38858 -0.50917 -1.16999 0.00000 0.03895 0.02653 0.02434 0.02606

IV. Use of Computer Programs for FPC
Illustration of FPC with PARSCALE
Modified Item Parameter File (MC10FIX.IFN)

MWU-MEM FPC with PARSCALE
No Fix, “0, 1” Scaling
I10FIX 17 17 7 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
GROUP 01
FIXED 20001 0.69300 0.00000 -1.50000 0.00000 0.12500 0.00000 0.00000 0.00000
BLOCK 20002 0.78600 0.00000 -1.00000 0.00000 0.18500 0.00000
BLOCK 20003 0.89700 0.00000 -0.60000 0.00000 0.23300 0.00000
(Omitted)
FREED 50016 1.16556 0.03437 -0.14845 0.01309 0.00000 0.00000 0.00000 1.25729 0.29044 -0.33537 -1.21236 0.00000 0.04262 0.03157 0.02902 0.03037
BLOCK 50017 1.42147 0.04095 -0.19171 0.01178 0.00000 0.00000 0.00000 1.29058 0.38858 -0.50917 -1.16999 0.00000 0.03895 0.02653 0.02434 0.02606

Replacing for the 10 fixed items: the a, b, and c values are replaced with the fixed a, b, and c.

IV. Use of Computer Programs for FPC
Illustration of FPC with PARSCALE
Command File for the STPU Method (After Transformation)

STPU FPC with Transformed Prior Points
The examinee group was sampled from N(1,1).
Omitted (The same as the commands for MWU-MEM)
>CALIB NQPT=31, PAR, LOG, SCALE=1.7, CYCLE=200, NEWTON=0, FREE=(NOADJUST, NOADJUST), ESTORDER, SPRI, GPRI, DIST=4, QPREAD;
>QUADP POINTS=(
  -5.2976E+000 -4.9280E+000 -4.5598E+000 -4.1902E+000 -3.8206E+000
  -3.4524E+000 -3.0828E+000 -2.7132E+000 -2.3450E+000 -1.9754E+000
  -1.6059E+000 -1.2377E+000 -8.6808E-001 -4.9891E-001 -1.2988E-001
  2.3929E-001 6.0846E-001 9.7749E-001 1.3467E+000 1.7162E+000
  2.0844E+000 2.4540E+000 2.8236E+000 3.1918E+000 3.5614E+000
  3.9310E+000 4.2992E+000 4.6688E+000 5.0384E+000 5.4066E+000
  5.7761E+000),
  WEIGHTS=(
  1.2430E-005 3.4290E-005 8.7330E-005 2.0480E-004 4.4420E-004
  9.1150E-004 1.8720E-003 4.1960E-003 1.0550E-002 2.5160E-002
  4.2780E-002 5.0290E-002 5.8510E-002 8.6110E-002 9.9290E-002
  8.6880E-002 9.7990E-002 1.0840E-001 9.0140E-002 7.7730E-002
  6.4860E-002 4.3230E-002 2.4440E-002 1.3010E-002 6.7400E-003
  3.3710E-003 1.6010E-003 7.1410E-004 2.9710E-004 1.1500E-004
  4.1380E-005);
>SCORE NOSCORE;

POINTS: rescaled by θ* = Aθ + B, with A = 1.38 and B = 0.24.
WEIGHTS: from “0, 1” scaling (not transformed).

IV. Use of Computer Programs for FPC
Illustration of FPC with PARSCALE
FPC Estimates of Non-Anchor Item Parameters on the Fixed Old Scale

STPU Method
Item   a       b        c       c2       c3       c4      c5
11     0.741   -1.361   0.194
12     0.767   -0.995   0.238
13     0.741   -0.906   0.185
14     0.942   -0.442   0.140
15     1.181   -0.113   0.234
16     0.920    0.025           -1.569   -0.343   0.449   1.562
17     1.120   -0.031           -1.667   -0.522   0.615   1.452

MWU-MEM Method
Item   a       b        c       c2       c3       c4      c5
12     0.768   -0.994   0.238
13     0.741   -0.908   0.184
14     0.942   -0.444   0.139
15     1.180   -0.113   0.234
16     0.921    0.025           -1.568   -0.342   0.450   1.561
17     1.120   -0.030           -1.666   -0.522   0.615   1.454

IV. Use of Computer Programs for FPC
Illustration of FPC with BILOG-MG
FPC Estimates of Mean and SD of the Underlying Distribution on the Fixed Old Scale

Method        Mean        Std. Dev.
STPU FPC      0.460       1.242
MWU-MEM FPC   0.456       1.227
Mean-Sigma    B = 0.239   A = 1.384

Note. The new group examinees were from a N(0.5,1.22) distribution that was expressed on the fixed old scale.
Annotations: Over-estimation; Under-estimation

V. Applications of FPC for Scaling and Equating
Online Calibration in Computerized Adaptive Testing (CAT)
Calibration of Pretest Items on the Fixed Operational Scale in Regular, Non-CAT Administration
In a Mixed-Format Test, Separate Calibration of CR Items from MC Items To Minimize Effects of Bad CR Items on MC Item Calibration
Equating Test Forms in the CING Design

V. Applications of FPC
Online Calibration in CAT
In CAT, different sets of operational items are adaptively administered to examinees, with pretest items “seeded” in a common block for certain examinee groups. Because the operational items have already been calibrated, their parameters are known. Thus, FPC may be the best way to calibrate and diagnose the pretest items on the scale of the operational items without affecting the operational item parameters.

V. Applications of FPC
Calibration of Pretest Items on the Fixed Operational Scale
To develop test forms, pretest items are often administered together with operational items to examinees. However, it would be wise to calibrate operational items separately from pretest items, because the operational item parameters could be contaminated by bad pretest items. In this case, the ability distribution that is estimated using only the operational items can be reasonably used as the prior ability distribution for FPC with the pretest items, while the operational item parameters are used to fix the operational items in the FPC.
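The idea of calibrating a pretest item while the operational items and the prior stay fixed can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm implemented in PARSCALE or BILOG-MG: it uses a 2PL model, simulated data, a single pretest item, a prior that is assumed to be already known on a fixed grid, and a coarse grid search in the M-step. All item parameters, sample sizes, and function names are hypothetical.

```python
import math
import random

D = 1.7  # scaling constant, as in SCALE=1.7 in the slides' command file

def p2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def bernoulli_like(u, p):
    """Likelihood of a scored response u (0 or 1) given success probability p."""
    return p if u == 1 else 1.0 - p

def fixed_likelihoods(responses, fixed_items, grid):
    """Likelihood of each examinee's operational responses at each grid point."""
    out = []
    for resp in responses:
        row = []
        for t in grid:
            like = 1.0
            for u, (a, b) in zip(resp, fixed_items):
                like *= bernoulli_like(u, p2pl(t, a, b))
            row.append(like)
        out.append(row)
    return out

def calibrate_pretest(fixed_like, pretest_resp, grid, prior, cycles=10):
    """EM for one pretest item; operational items and the prior are never updated."""
    a, b = 1.0, 0.0  # starting values
    a_cands = [k / 10 for k in range(2, 31)]
    b_cands = [k / 10 for k in range(-30, 31)]
    for _ in range(cycles):
        # E-step: expected examinee counts (n) and correct counts (r) per grid point.
        n = [0.0] * len(grid)
        r = [0.0] * len(grid)
        for row, u in zip(fixed_like, pretest_resp):
            post = [w * f * bernoulli_like(u, p2pl(t, a, b))
                    for t, w, f in zip(grid, prior, row)]
            total = sum(post)
            for k, x in enumerate(post):
                n[k] += x / total
                r[k] += u * x / total
        # M-step: coarse grid search over candidate (a, b) values.
        best = None
        for ca in a_cands:
            for cb in b_cands:
                ll = 0.0
                for t, nk, rk in zip(grid, n, r):
                    p = min(max(p2pl(t, ca, cb), 1e-12), 1.0 - 1e-12)
                    ll += rk * math.log(p) + (nk - rk) * math.log(1.0 - p)
                if best is None or ll > best[0]:
                    best = (ll, ca, cb)
        a, b = best[1], best[2]
    return a, b

# Simulated example (all values hypothetical).
random.seed(1)
grid = [-4.0 + 0.4 * k for k in range(21)]
dens = [math.exp(-0.5 * ((t - 0.3) / 1.1) ** 2) for t in grid]
prior = [d / sum(dens) for d in dens]  # stands in for the prior estimated from operational items
fixed_items = [(0.8 + 0.1 * (k % 4), -1.5 + 0.3 * k) for k in range(10)]
true_a, true_b = 1.0, 0.3
thetas = [random.gauss(0.3, 1.1) for _ in range(1500)]
responses = [[int(random.random() < p2pl(t, a, b)) for a, b in fixed_items] for t in thetas]
pretest_resp = [int(random.random() < p2pl(t, true_a, true_b)) for t in thetas]

fixed_like = fixed_likelihoods(responses, fixed_items, grid)
est_a, est_b = calibrate_pretest(fixed_like, pretest_resp, grid, prior)
print(f"pretest item estimate: a={est_a:.1f}, b={est_b:.1f}")
```

Because the operational-item likelihoods are computed once and never re-estimated, bad pretest items cannot change the operational parameters; only the pretest item moves across EM cycles.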

V. Applications of FPC
FPC with Different Formats of Items
A mixed-format test contains different types of items; for instance, some are MC items and others are CR items. Simultaneous calibration with both types of items can be conducted, assuming that a dominant factor underlies examinees’ responses to items. However, practitioners may want to calibrate MC items separately from CR items, because calibration with bad CR items might adversely affect the estimation of MC item parameters. In this case, MC items are first calibrated and then CR items are calibrated while fixing the MC item parameters.

V. Applications of FPC
Equating Test Forms in the CING Design
Test equating using IRT requires all item parameters to be placed on a common scale (which is usually the old form scale). Once all item and ability parameters are placed on a common scale, IRT true score or observed score equating is conducted. Thus, FPC can be effectively used for placing all item parameters on the fixed old scale. The anchor, of course, consists of the items common to the new and old forms.
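Once both forms are on the common scale, IRT true score equating can be sketched in a few lines. The sketch below assumes a 3PL model and hypothetical item parameters for both forms (they are not from the slides): it inverts the new form's test characteristic curve (TCC) by bisection and evaluates the old form's TCC at the resulting θ.

```python
import math

D = 1.7  # scaling constant

def p3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: expected number-correct score at theta."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)

def theta_for_true_score(score, items, lo=-10.0, hi=10.0, iters=80):
    """Invert the TCC by bisection (the TCC is increasing in theta)."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if tcc(mid, items) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def true_score_equate(score, new_items, old_items):
    """Old-form true score equivalent of a new-form true score."""
    floor = sum(c for _, _, c in new_items)  # true scores at or below this are not attainable
    if not floor < score < len(new_items):
        raise ValueError("score outside the range of attainable true scores")
    return tcc(theta_for_true_score(score, new_items), old_items)

# Hypothetical (a, b, c) parameters, both forms already on the old scale.
old_items = [(0.8, -1.0, 0.2), (1.0, 0.0, 0.2), (1.2, 1.0, 0.2)]
new_items = [(a, b + 0.5, c) for a, b, c in old_items]  # new form is harder

equated = true_score_equate(2.0, new_items, old_items)
print(f"new-form true score 2.0 corresponds to old-form true score {equated:.3f}")
```

In this hypothetical setup the new form is harder than the old form, so a given new-form true score maps to a higher old-form true score.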

EXPLORE FPC END Thank You