Item Analysis: Classical and Beyond SCROLLA Symposium Measurement Theory and Item Analysis Heriot-Watt University, 12th February 2003.

Similar presentations
Implications and Extensions of Rasch Measurement.

INTRODUCTION TO ITEM RESPONSE THEORY Malcolm Rosier Survey Design and Analysis Services Pty Ltd web: Copyright © 2000.
Advanced Methods and Analysis for the Learning and Social Sciences PSY505 Spring term, 2012 January 23, 2012.
Item Response Theory in the Secondary Classroom: What Rasch Modeling Can Reveal About Teachers, Students, and Tests. T. Jared Robinson tjaredrobinson.com.
Item Response Theory in a Multi-level Framework Saralyn Miller Meg Oliphint EDU 7309.
Brief introduction on Logistic Regression
How Should We Assess the Fit of Rasch-Type Models? Approximating the Power of Goodness-of-fit Statistics in Categorical Data Analysis Alberto Maydeu-Olivares.
Item Response Theory in Health Measurement
Item Analysis: A Crash Course Lou Ann Cooper, PhD Master Educator Fellowship Program January 10, 2008.
Some terminology When the relation between variables are expressed in this manner, we call the relevant equation(s) mathematical models The intercept and.
Introduction to Item Response Theory
AN OVERVIEW OF THE FAMILY OF RASCH MODELS Elena Kardanova
Models for Measuring. What do the models have in common? They are all cases of a general model. How are people responding? What are your intentions in.
1 BINARY CHOICE MODELS: LOGIT ANALYSIS The linear probability model may make the nonsense predictions that an event will occur with probability greater.
Visual Recognition Tutorial
Common Factor Analysis “World View” of PC vs. CF Choosing between PC and CF PAF -- most common kind of CF Communality & Communality Estimation Common Factor.
Estimation from Samples Find a likely range of values for a population parameter (e.g. average, %) Find a likely range of values for a population parameter.
Point and Confidence Interval Estimation of a Population Proportion, p
Item Response Theory. Shortcomings of Classical True Score Model Sample dependence Limitation to the specific test situation. Dependence on the parallel.
Chapter 9 Flashcards. measurement method that uses uniform procedures to collect, score, interpret, and report numerical results; usually has norms and.
Why Scale -- 1 Summarising data –Allows description of developing competence Construct validation –Dealing with many items rotated test forms –check how.
BINARY CHOICE MODELS: LOGIT ANALYSIS
Item Analysis: Classical and Beyond SCROLLA Symposium Measurement Theory and Item Analysis Modified for EPE/EDP 711 by Kelly Bradley on January 8, 2013.
Identification of Misfit Item Using IRT Models Dr Muhammad Naveed Khalid.
Item Response Theory. What’s wrong with the old approach? Classical test theory –Sample dependent –Parallel test form issue Comparing examinee scores.
1 Item Analysis - Outline 1. Types of test items A. Selected response items B. Constructed response items 2. Parts of test items 3. Guidelines for writing.
Introduction to plausible values National Research Coordinators Meeting Madrid, February 2010.
From Last week.
Determining Sample Size
Modern Test Theory Item Response Theory (IRT). Limitations of classical test theory An examinee’s ability is defined in terms of a particular test The.
Estimation of Statistical Parameters
The ABC’s of Pattern Scoring Dr. Cornelia Orr. Slide 2 Vocabulary Measurement – Psychometrics is a type of measurement Classical test theory Item Response.
Section Copyright © 2014, 2012, 2010 Pearson Education, Inc. Lecture Slides Elementary Statistics Twelfth Edition and the Triola Statistics Series.
Estimating parameters in a statistical model Likelihood and Maximum likelihood estimation Bayesian point estimates Maximum a posteriori point.
Dealing with Omitted and Not- Reached Items in Competence Tests: Evaluating Approaches Accounting for Missing Responses in Item Response Theory Models.
Counseling Research: Quantitative, Qualitative, and Mixed Methods, 1e © 2010 Pearson Education, Inc. All rights reserved. Basic Statistical Concepts Sang.
6. Evaluation of measuring tools: validity Psychometrics. 2012/13. Group A (English)
Multiple Perspectives on CAT for K-12 Assessments: Possibilities and Realities Alan Nicewander Pacific Metrics 1.
Lecture 16 Section 8.1 Objectives: Testing Statistical Hypotheses − Stating hypotheses statements − Type I and II errors − Conducting a hypothesis test.
Validity and Item Analysis Chapter 4.  Concerns what instrument measures and how well it does so  Not something instrument “has” or “does not have”
The ABC’s of Pattern Scoring
Estimation. The Model Probability The Model for N Items — 1 The vector probability takes this form if we assume independence.
Item Factor Analysis Item Response Theory Beaujean Chapter 6.
Measurement MANA 4328 Dr. Jeanne Michalski
Experimental Research Methods in Language Learning Chapter 12 Reliability and Reliability Analysis.
NATIONAL CONFERENCE ON STUDENT ASSESSMENT JUNE 22, 2011 ORLANDO, FL.
©2011 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Reliability performance on language tests is also affected by factors other than communicative language ability. (1) test method facets They are systematic.
Latent regression models. Where does the probability come from? Why isn’t the model deterministic. Each item tests something unique – We are interested.
Higher National Certificate in Engineering Unit 36 –Lesson 4 – Parameters used to Describe the Normal Distribution.
Reliability a measure is reliable if it gives the same information every time it is used. reliability is assessed by a number – typically a correlation.
The Design of Statistical Specifications for a Test Mark D. Reckase Michigan State University.
2. Main Test Theories: The Classical Test Theory (CTT) Psychometrics. 2011/12. Group A (English)
More on regression Petter Mostad More on indicator variables If an independent variable is an indicator variable, cases where it is 1 will.
Item Response Theory Dan Mungas, Ph.D. Department of Neurology University of California, Davis.
Two Approaches to Estimation of Classification Accuracy Rate Under Item Response Theory Quinn N. Lathrop and Ying Cheng Assistant Professor Ph.D., University.
Hypothesis Testing and Statistical Significance
Lesson 2 Main Test Theories: The Classical Test Theory (CTT)
IRT Equating Kolen & Brennan, 2004 & 2014 EPSY
Adopting The Item Response Theory in Operations Management Research
Data Analysis and Standard Setting
Classical Test Theory Margaret Wu.
Item Analysis: Classical and Beyond
Mohamed Dirir, Norma Sinclair, and Erin Strauts
Psy 425 Tests & Measurements
Presentation transcript:

Item Analysis: Classical and Beyond SCROLLA Symposium Measurement Theory and Item Analysis Heriot-Watt University, 12th February 2003.

Why is item analysis relevant? Item analysis provides a way of measuring the quality of questions: how appropriate they were for the candidates, and how well they measured the candidates' ability. It also allows items to be re-used in different tests with prior knowledge of how they are going to perform.

What kinds of item analysis are there? Item analysis falls into two broad families: Classical analysis, and Latent Trait Models. The latent trait family comprises the Rasch model and Item Response Theory, with IRT models of increasing complexity (IRT1, IRT2, IRT3, IRT4).

Classical Analysis Classical analysis is the easiest and most widely used form of analysis. The statistics can be computed by generic statistical packages (or at a push by hand) and need no specialist software. Analysis is performed on the test as a whole rather than on the item: although item statistics can be generated, they apply only to that group of students on that collection of items.

Classical Analysis Assumptions Classical test analysis assumes that any test score is composed of a "true" value plus random error. Crucially, it assumes that this error is normally distributed, uncorrelated with the true score, and has a mean of zero:

$$x_{obs} = x_{true} + G(0, \sigma_{err})$$
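As an illustration only (the numbers are invented, not from the slides), the assumption can be simulated directly; the noise term plays the role of $G(0, \sigma_{err})$:

```python
import numpy as np

rng = np.random.default_rng(0)

true_scores = rng.normal(50, 10, size=1000)  # hypothetical "true" scores
error = rng.normal(0, 5, size=1000)          # G(0, sigma_err): mean-zero noise
observed = true_scores + error               # x_obs = x_true + error

# By construction the error has mean near 0 and is uncorrelated
# with the true score.
print(error.mean(), np.corrcoef(true_scores, error)[0, 1])
```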

Classical Analysis Statistics Difficulty (item-level statistic). Discrimination (item-level statistic). Reliability (test-level statistic).

Classical Analysis Difficulty The difficulty of a (1 mark) question in classical analysis is simply the proportion of people who answered the question incorrectly. For multiple-mark questions, it is the average mark expressed as a proportion. Given on a scale of 0-1: the higher the proportion, the greater the difficulty.
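A minimal sketch of the computation, using a small invented 0/1 response matrix (rows are candidates, columns are items, ordered easy to hard):

```python
import numpy as np

# Hypothetical data: 5 candidates x 4 one-mark items, scored 0/1.
responses = np.array([[1, 1, 1, 1],
                      [1, 1, 1, 0],
                      [1, 1, 0, 0],
                      [1, 0, 0, 0],
                      [0, 0, 0, 0]])

# Difficulty as defined on the slide: the proportion answering
# incorrectly, so values near 1 indicate hard items. (Many texts use
# the complement, the proportion correct, called the item "facility".)
difficulty = 1 - responses.mean(axis=0)
print(difficulty)  # [0.2, 0.4, 0.6, 0.8]
```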

Classical Analysis Discrimination The discrimination of an item is the (Pearson) correlation, across candidates, between the item mark and the total test mark. Being a correlation it can vary from -1 to +1, with higher values indicating (desirable) high discrimination.
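Continuing the sketch above (reusing the `responses` matrix), the item-total correlation for each item:

```python
# Discrimination: Pearson correlation, across candidates, between each
# item's mark and the total test mark.
total = responses.sum(axis=1)
discrimination = np.array([np.corrcoef(responses[:, j], total)[0, 1]
                           for j in range(responses.shape[1])])
print(discrimination)
```

A common refinement excludes the item itself from the total before correlating, which removes the item's spurious correlation with itself.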

Classical Analysis Reliability Reliability is a measure of how well the test "holds together". For practical reasons, internal consistency estimates are the easiest to obtain; these indicate the extent to which each item correlates with every other item. This is measured on a scale of 0-1: the greater the number, the higher the reliability.
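The slide does not name a specific coefficient; the most common internal consistency estimate is Cronbach's alpha, sketched here on the same invented response matrix:

```python
# Cronbach's alpha: k/(k-1) * (1 - sum(item variances) / variance(total)).
k = responses.shape[1]
item_vars = responses.var(axis=0, ddof=1)
total_var = responses.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(alpha)  # 0.8 for the Guttman-like data above
```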

Classical Analysis vs Latent Trait Models Classical analysis has the test (not the item) as its basis. Although the statistics generated are often generalised to similar students taking a similar test, they only really apply to those students taking that test. Latent trait models aim to look beyond that, at the underlying traits which produce the test performance. They are measured at item level and provide sample-free measurement.

Latent Trait Models Latent trait models have been around since the 1940s, but were not widely used until the 1960s. Although theoretically possible, it is practically unfeasible to use them without specialist software. They aim to measure the underlying ability (or trait) which produces the test performance, rather than measuring performance per se. This makes them sample-free: because the statistics are not dependent on the test situation which generated them, they can be used more flexibly.

Rasch vs Item Response Theory Mathematically, Rasch is identical to the most basic IRT model (IRT1); however, there are some important differences which make it a more viable proposition for practical testing. In Rasch the model is superior: data which do not fit the model are discarded. Rasch does not permit abilities to be estimated for extreme items and persons. Rasch eschews the use of Bayesian priors to assist parameter estimation.
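For reference (a standard result, not shown on the slide), the Rasch/IRT1 model gives the probability that person i answers item g correctly as a logistic function of the gap between ability and difficulty; the generalised model on the next slide reduces to this when $a_g = 1$ and $c_g = 0$:

$$P(X_{ig} = 1 \mid \theta_i) = \frac{e^{\theta_i - b_g}}{1 + e^{\theta_i - b_g}}$$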

IRT - the generalised model The three-parameter model gives the probability of a candidate of ability θ correctly answering question g as:

$$P_g(\theta) = c_g + \frac{1 - c_g}{1 + e^{-a_g(\theta - b_g)}}$$

where a_g = gradient of the ICC at the point θ = b_g (item discrimination); b_g = the ability level at which that gradient is maximised (item difficulty); c_g = the probability of candidates with very low ability correctly answering question g.
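A minimal sketch of this model in code (the function name is ours, and dichotomous items are assumed):

```python
import numpy as np

def p_correct(theta, a, b, c):
    """Generalised (three-parameter) IRT model: probability that a
    candidate of ability theta answers an item (a, b, c) correctly."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))
```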

IRT - Item Characteristic Curves An ICC is a plot of the probability of a correct response against candidate ability: the higher the ability, the higher the chance of responding correctly. On the curve, c is the lower asymptote (intercept), a sets the gradient at the steepest point, and b is the ability at which that maximum gradient occurs.
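Using the p_correct sketch above, an ICC can be traced for a hypothetical item (parameter values invented for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(-4, 4, 200)
# Invented parameters: moderate discrimination, average difficulty,
# guessing floor of 0.2 (roughly a five-option multiple choice item).
plt.plot(theta, p_correct(theta, a=1.2, b=0.0, c=0.2))
plt.xlabel("Ability (theta)")
plt.ylabel("P(correct answer)")
plt.title("Item characteristic curve")
plt.show()
```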

IRT - About the Parameters Difficulty Although there is no "correct" difficulty for any one item, it is clearly desirable that the difficulty of the test is centred around the average ability of the candidates. The higher the "b" parameter, the more difficult the question; note that b is inversely related to the probability of the question being answered correctly.

IRT - About the Parameters Discrimination In IRT (unlike Rasch), maximal discrimination is sought; thus the higher the "a" parameter, the more desirable the question. Note, however, that differences in the discrimination of questions can lead to differences in the relative difficulties of questions across the ability range, as their ICCs can cross.

IRT - About the Parameters Guessing A high "c" parameter suggests that candidates with very little ability may choose the correct answer. This parameter is rarely valid outwith multiple choice testing, and its value should not vary excessively from the reciprocal of the number of choices (e.g. around 0.25 for a four-option item).

IRT - Parameter Estimation Before being used (in an item bank or for measurement), items must first be calibrated: that is, their parameters must be estimated. There are two main procedures - Joint Maximum Likelihood (JML) and Marginal Maximum Likelihood (MML). JML is most common for IRT1 and IRT2, while MML is used more frequently for IRT3. Bayesian estimation and estimated bounds may be imposed on the data to avoid one parameter degrading, or high-discrimination items being over-valued.
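As a rough illustration of the joint approach for the simplest (IRT1) case, with invented data: JML treats the person abilities and item difficulties as one parameter vector and maximises the likelihood over both at once. Real calibration software adds safeguards this sketch omits.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical 0/1 response matrix: 5 candidates x 4 items, chosen so
# that no person or item has an all-correct or all-wrong pattern
# (extreme patterns have no finite JML estimate, as the slides note).
X = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 1]])
n_people, n_items = X.shape

def neg_log_likelihood(params):
    theta = params[:n_people]            # person abilities
    b = params[n_people:]                # item difficulties
    logit = theta[:, None] - b[None, :]  # IRT1: a = 1, c = 0
    p = 1.0 / (1.0 + np.exp(-logit))
    return -np.sum(X * np.log(p) + (1 - X) * np.log(1 - p))

result = minimize(neg_log_likelihood, np.zeros(n_people + n_items))
theta_hat, b_hat = result.x[:n_people], result.x[n_people:]

# The scale location is arbitrary (adding a constant to every theta and
# b leaves the likelihood unchanged), so anchor mean difficulty at zero.
shift = b_hat.mean()
theta_hat, b_hat = theta_hat - shift, b_hat - shift
```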

Resources - Classical Analysis Software: standard statistical packages (Excel; SPSS; SAS); ITEMAN. Reading: Matlock-Hetzel (1997), Basic Concepts in Item and Test Analysis.

Resources - IRT Software: BILOG; Xcalibre. Reading: Lord (1980), Applications of Item Response Theory to Practical Testing Problems; Baker, Frank (2001), The Basics of Item Response Theory.