
Functional Methods for Testing Data

The data. Each person a = 1,…,N responds to each item i = 1,…,n and makes a binary response u_ai, where 0 indicates “wrong” and 1 indicates “right”. We want to estimate P_ai, the probability of person a getting item i right.
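
As a concrete picture of the data layout (a small, made-up example, not taken from the slides), the responses form an N by n binary matrix:

    import numpy as np

    # Toy data: N = 5 people, n = 4 items; U[a, i] = 1 if person a answered item i correctly.
    U = np.array([
        [1, 0, 1, 1],
        [0, 0, 1, 0],
        [1, 1, 1, 1],
        [0, 1, 0, 1],
        [1, 0, 0, 1],
    ])
    N, n = U.shape
    # The goal is an N by n matrix of estimated probabilities P, with P[a, i] = Prob(person a gets item i right).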

The model. The response space is an n-dimensional unit hypercube, and the data vectors (u_a1,…,u_an) sit at its corners. The vectors of correct-response probabilities (P_a1,…,P_an) fall along a smooth curve in response space, the response manifold. This manifold is, in principle, identifiable from the data, and is therefore not a latent trait.

Item response functions. We can define a smooth charting function that maps each point on the response manifold to a corresponding real number θ; arc length is one example. In this way we establish a metric defining positions on the manifold. P_i(θ) is the success probability for item i among all those at position θ. It is a smooth function of θ: the item response function for item i.

The response manifold for three test items (3, 4, and 29) from an introductory psychology test. The curve indicates the possible values of P_ai = P_i(θ); the circles correspond to 11 fixed values of θ.

What does “smooth” mean? If θ has a standard normal distribution, then experience indicates that usually: the function P_i(θ) is monotonic; it has slopes near 0 for extreme θ values; the lower asymptote is positive and the upper asymptote is one; and there is only one inflection point.

The three-parameter logistic item response function is smooth
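
For reference, the three-parameter logistic (3PL) item response function mentioned here has a standard closed form; the following is a minimal sketch with illustrative parameter values (the slide itself only shows a plot):

    import numpy as np

    def irf_3pl(theta, a, b, c):
        """Three-parameter logistic IRF: a = discrimination, b = difficulty,
        c = lower asymptote (guessing); the upper asymptote is 1."""
        return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

    theta = np.linspace(-3.0, 3.0, 201)
    P = irf_3pl(theta, a=1.5, b=0.0, c=0.2)
    # P is monotone, has slopes near 0 at the extremes, a lower asymptote of 0.2,
    # an upper asymptote of 1, and a single inflection point, as described above.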

The challenges. Let the item response functions take whatever shapes are supported by the data, but control their smoothness in this sense. Constrain the function values to lie within [0,1]. We also want a smooth derivative, for the item information function.

The log-odds transformation deals with the [0,1] constraint. We actually estimate W_i(θ) = log[P_i(θ)/Q_i(θ)], where Q_i(θ) = 1 − P_i(θ). “Smooth” in terms of W_i(θ) means linear behavior for extreme θ, with a small slope on the left and a larger positive slope on the right.
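
A minimal sketch of the log-odds transformation and its inverse (the function names are mine, not from the slides); mapping any real-valued W back gives a probability strictly inside (0, 1), which is how the constraint is handled:

    import numpy as np

    def log_odds(P):
        """W = log(P / Q), with Q = 1 - P; requires 0 < P < 1."""
        return np.log(P / (1.0 - P))

    def inv_log_odds(W):
        """Back-transform an unconstrained W to a probability in (0, 1)."""
        return 1.0 / (1.0 + np.exp(-W))

    P = np.array([0.21, 0.50, 0.93])
    W = log_odds(P)
    assert np.allclose(inv_log_odds(W), P)  # the round trip recovers the probabilities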

The log-odds transformation of a three-parameter logistic item response function

A three-dimensional model for a smooth log-odds function. The function log(e^θ + 1) has the desired behavior at the extremes. The other two terms add vertical shift and tilt as required.
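
The model's formula was displayed as an image and is not reproduced in this transcript; given the description (a vertical shift, a tilt, and the term log(e^θ + 1) for the extremes), a natural reconstruction is

\[
W(\theta) = c_1 + c_2\,\theta + c_3 \log\!\left(e^{\theta} + 1\right),
\]

which behaves like a line with slope c_2 for very negative θ and slope c_2 + c_3 for very positive θ.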

Comparing the 3D model and the 3PL log-odds functions

B-spline expansions for W(θ). In fact, we estimate the log-odds functions by expanding each W(θ) in terms of a set of K B-spline basis functions, while smoothing these expansions towards the simpler three-dimensional models.
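
A rough sketch of such an expansion using SciPy's B-spline tools; the knot layout, the θ range, and the coefficient values below are placeholders for illustration, not those of the original analysis:

    import numpy as np
    from scipy.interpolate import BSpline

    degree = 3                                   # cubic B-splines
    K = 13                                       # number of basis functions, as in the examples below
    breaks = np.linspace(-3.0, 3.0, K - degree + 1)
    knots = np.r_[[breaks[0]] * degree, breaks, [breaks[-1]] * degree]

    def basis_matrix(theta):
        """Evaluate all K B-spline basis functions at the points in theta."""
        theta = np.atleast_1d(theta)
        B = np.empty((theta.size, K))
        for k in range(K):
            coef = np.zeros(K)
            coef[k] = 1.0
            B[:, k] = BSpline(knots, coef, degree)(theta)
        return B

    c = np.random.randn(K)                       # placeholder coefficients; estimated from data in practice
    theta = np.linspace(-3.0, 3.0, 101)
    W = basis_matrix(theta) @ c                  # log-odds function W(theta) = sum_k c_k B_k(theta)
    P = 1.0 / (1.0 + np.exp(-W))                 # corresponding item response function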

Fitting the data. We use maximum marginal likelihood estimation, applying the EM algorithm to maximize the marginal likelihood of the binary responses, where g(θ) is a prior density on θ, often taken to be the standard normal. Maximization is with respect to the nK coefficients defining the B-spline expansions of the log-odds functions.
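
The likelihood formula was an image on the slide; the standard marginal likelihood for binary responses under this setup, which is presumably what was shown, is

\[
L = \prod_{a=1}^{N} \int \prod_{i=1}^{n} P_i(\theta)^{u_{ai}}\,\bigl[1 - P_i(\theta)\bigr]^{1 - u_{ai}}\, g(\theta)\, d\theta .
\]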

What about smoothness? We have defined smoothness here in a less orthodox fashion; it is not defined only in terms of the second derivative. Instead, we define it in terms of the size of LW(θ), the result of applying a linear differential operator L to W(θ), as described below.

How did you come up with this? If W(θ) conforms exactly to the three-dimensional smooth model, then LW(θ) = 0. In other words, every function in the three-dimensional model satisfies the differential equation LW(θ) = 0.
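
The slide's two formulas are not reproduced in the transcript. Under the reconstruction of the three-dimensional model given earlier, one order-3 operator that annihilates all three basis functions 1, θ, and log(e^θ + 1) is

\[
L W(\theta) = D^{3} W(\theta) + \frac{e^{\theta} - 1}{e^{\theta} + 1}\, D^{2} W(\theta),
\]

because D^3 annihilates 1 and θ, while for u(θ) = log(e^θ + 1), writing σ(θ) = e^θ/(e^θ + 1), we have D^2 u = σ(1 − σ) and D^3 u = σ(1 − σ)(1 − 2σ), so that D^3 u + (2σ − 1) D^2 u = 0 and 2σ − 1 = (e^θ − 1)/(e^θ + 1).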

Our strategy is to define a low-dimensional family of prototype functions that capture what we mean by “smooth.” Then we represent this family by a linear differential equation. This differential equation defines a measure of “roughness”, which we penalize. The more we penalize this kind of roughness, the more we force the fitted functions to be smooth.

In general, if we begin with a linear model of dimension m, we can find a linear differential equation of order m such that all versions of this model satisfy
D^m W(θ) = b_0(θ) W(θ) + b_1(θ) DW(θ) + … + b_{m−1}(θ) D^{m−1} W(θ)
for some choice of coefficient functions b_j(θ). We change the equation to a roughness penalty by converting it to operator form:
LW(θ) = b_0(θ) W(θ) + b_1(θ) DW(θ) + … + b_{m−1}(θ) D^{m−1} W(θ) + D^m W(θ) = 0.

The roughness penalty, the integrated squared value of LW(θ), measures the departure of W(θ) from this smooth model.

Roughness-penalized log marginal likelihood. Consequently, we actually maximize the log marginal likelihood minus λ times the total roughness of the n log-odds functions. Smoothing parameter λ controls the amount of smoothness in the W(θ)'s; the larger it is, the more these will look like the three-dimensional versions.
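
The penalized criterion itself is not reproduced in the transcript; combining the marginal likelihood above with the roughness penalty, it presumably takes the form

\[
F = \sum_{a=1}^{N} \log \int \prod_{i=1}^{n} P_i(\theta)^{u_{ai}}\,\bigl[1 - P_i(\theta)\bigr]^{1 - u_{ai}}\, g(\theta)\, d\theta
\;-\; \lambda \sum_{i=1}^{n} \int \bigl[L W_i(\theta)\bigr]^{2}\, d\theta .
\]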

Some examples. Here are three estimates of the item response functions for items 3, 4, 29, and 96 of an introductory psychology test, one for each of the smoothing levels shown on the following slides. The test had 100 items and was given to 379 students. Each function W(θ) is defined by an expansion in terms of 13 B-spline basis functions.

λ=0

λ=0.01

λ=50

What does θ mean? We have fallen into the habit of calling θ a “latent trait score”. Actually, it is the value of a function that is chosen more or less arbitrarily to map position along the response manifold. The assumption of a standard normal distribution is pure convention; we can choose otherwise.

What charting functions would be more useful? Three choices of charting function are especially interesting, and none are “latent” in any sense. Each leads to interesting diagnostic statistics and graphics.

The arc length charting. Arc length s measures the Euclidean distance traveled along the manifold from its origin at θ_0 to a given position θ.
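
The arc length formula was shown as an image; the standard Euclidean arc length along the response manifold, which is presumably what appeared, is

\[
s(\theta) = \int_{\theta_0}^{\theta} \sqrt{\sum_{i=1}^{n} \bigl[D P_i(t)\bigr]^{2}}\; dt .
\]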

Item discrimination in the arc length metric. One useful property of arc length is that the squared slopes of the item response functions, taken with respect to s, sum to one at every position on the manifold. Each squared item discrimination is therefore a proportion of the total test discrimination, and so has a familiar frame of reference.

Expected score charting. Assuming that the expected score, the sum of the item response functions P_i(θ), is monotonically related to θ (there aren't too many items like 96), it provides a metric that is familiar to users and easy for them to interpret. Expected score is already used extensively as a basis for assessing differential item functioning (DIF).

ACT Math test for males and females. Three items from a 60-item math test, with around 2000 examinees. The male and female response manifolds differ.

Differential item functioning for an ACT Math test item

Total change charting. The total change in probability of success provides a measure that is closely related to arc length.

Some general lessons. Fitting functional models to non-functional data is relatively straightforward. But we do need to transform constrained functions into unconstrained versions. We can define smoothness or roughness in customized ways that capture the default or baseline behavior of our estimated functions.

“Latent trait models” aren't really latent at all. They express the idea of a one-dimensional subspace for modeling the data. Differential geometry gives us the appropriate mathematical tools. There is room for creativity in choosing charting functions.

Looking ahead. There is an intimate connection between designer roughness penalties and the estimation of differential equations from data. We will use discrete data to estimate a differential equation that describes the data.

References. More technical details on fitting test data with functional models are in Rossi, N., Wang, X. and Ramsay, J. O. (2002). Nonparametric item response function estimates with the EM algorithm. Journal of Educational and Behavioral Statistics, 27.