
Hierarchical Linear Modeling for Detecting Cheating and Aberrance. Statistical Detection of Potential Test Fraud, May 2012, Lawrence, KS. William Skorupski, University of Kansas; Karla Egan, CTB/McGraw-Hill.

Purpose of the Study “Cheating” as a paradigm for psychometric research has focused on individuals. Our purpose is to identify groups of cheaters, based on the premise that teachers and administrators may be motivated to inappropriately influence students’ scores.

Background. Importance of cheating detection. Cheating as a classroom-, school-, or even district-wide phenomenon. Results of many large-scale educational assessments are tied to incentives, e.g., merit-based pay, accountability, AYP targets from NCLB. Teachers may be tempted to "teach to the test," provide inappropriate materials, or alter students' answer sheets.

Previous Study. Skorupski & Egan (2011) demonstrated a Bayesian hierarchical modeling approach for group-level aberrance (real data), cross-validated against external reports of impropriety. Detection rates were reasonable, but results were difficult to verify.

Findings Relatively large aberrance for a few schools at certain Time points suggested that this approach may be useful for flagging potentially cheating schools. The present simulation study was planned to evaluate detection power.

Two “Non-Aberrant” Schools

Two Flagged Schools

Goals of the Study. Evaluate the robustness of the Bayesian HLM approach for detecting group-level cheating through Monte Carlo simulation. Develop heuristics for flagging known "cheaters" from the analysis.

Cheating & Aberrance. Certain kinds of aberrance may be evidence of cheating: answer copying, model-data misfit. In our analysis: unusually high group performance at a given time, given the marginal group and time effects (i.e., a large positive interaction effect).

Important Note. No cheating/aberrance detection method can "prove" cheating; it can merely flag unusual individuals or groups for further review. Our goal is to demonstrate detection of known group-level cheating with adequate power while maintaining an acceptable Type I error rate.

Methods – Data Simulation. Data were created to emulate a vertically scaled SWA: 3 linked administrations, with means increasing by 0.5 between each (Time means t = 0, 0.5, 1). 60 groups, with within-group N(g) ranging from 10 to 260 (total N = 4,650).

51 of 60 group means at Time 1 were drawn from μ(g) ~ N(0, 1). The remaining 3 × 3 = 9 groups crossed N(g) = 10, 60, 110 with μ(g) = -1, 0, 1. These 9 groups (3 at each Time, so 5% of group-by-time cells overall) will be the "cheaters."

Simulate Individual Scores. Errors ε ~ MVN(0, R): 0 is a vector of zeros, R is a correlation matrix with off-diagonal elements = 0.77 (based on the real-data study). Each individual score Y_igt was created by taking ε_igt and adding its respective Time and Group means. At this point, all scores are "non-aberrant"; main effects alone account for differences.

Simulate "Cheating." For cheating groups, an additional interaction effect is added to Y_igt: 3 groups at each Time, for μ(g) = -1, 0, or 1 and N(g) = 10, 60, or 110. A Group-by-Time (60 × 3) matrix of effects GT is defined: GT = 0 → no cheating, GT > 0 → cheating. GT = 1 for simulated cheaters (i.e., the Group mean is +1 above the main effects).
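The simulation design above can be sketched in NumPy. This is a minimal illustration, not the study's actual code: the particular group sizes drawn and the choice of which groups cheat are placeholder assumptions (the study fixed the 9 cheating groups at specific sizes and means).

```python
import numpy as np

rng = np.random.default_rng(0)

# Design from the slides: 60 groups over 3 linked time points,
# time means 0, 0.5, 1, and group means drawn from N(0, 1).
G, T = 60, 3
time_means = np.array([0.0, 0.5, 1.0])
n_g = rng.integers(10, 261, size=G)          # placeholder sizes in 10..260
mu_g = rng.normal(0.0, 1.0, size=G)          # group main effects

# Group-by-Time matrix of cheating effects: GT = 1 for the 9 simulated
# cheaters (3 groups at each time point), 0 otherwise. Which groups cheat
# is a placeholder choice here.
GT = np.zeros((G, T))
cheat_groups = [0, 1, 2, 20, 21, 22, 40, 41, 42]
for k, g in enumerate(cheat_groups):
    GT[g, k // 3] = 1.0

# Within-person errors are MVN(0, R) with off-diagonal correlations 0.77.
R = np.full((T, T), 0.77)
np.fill_diagonal(R, 1.0)

rows = []
for g in range(G):
    eps = rng.multivariate_normal(np.zeros(T), R, size=n_g[g])
    rows.append(mu_g[g] + time_means + GT[g] + eps)  # main effects + interaction
Y = np.vstack(rows)                                  # (total N) x 3 score matrix
```

For non-cheating groups GT[g] is all zeros, so their scores reflect main effects plus correlated error only, exactly as described on the previous slide.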

Cheating patterns: Time 1 cheating, Time 2 cheating, or Time 3 cheating. Each of these 3 patterns was crossed with 3 group sizes, N = 10, 60, 110.

Notes on Simulation. Forms must be linked over Time. In this analysis, scale scores were simulated directly (treating scores as measured without error); in practice, item response data would first be obtained and linked on a vertical scale. Examinees are nested within groups, and Time points are nested within individuals.

[Diagram: nesting structure. Linked Time points (1, 2, 3) with scores Y_ig1, Y_ig2, Y_ig3 nested within individuals (Person 1g, …, Person N(g)g), nested within groups (1, …, G).]

Methods – Analysis. Hierarchical Growth Model. Scale scores for individuals (i) within groups (g) over time (t): Y_igt = β0 + β1g + β2t + β3gt + ε_igt, where ε_igt ~ N(0, σ²). Fully Bayesian estimation (MCMC) using WinBUGS (Lunn et al., 2000). 50 replications.
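The study fits the full HLM in WinBUGS; as a stand-in illustration of the MCMC machinery, here is a minimal random-walk Metropolis sampler for a single interaction effect. The toy data, known residual variance, diffuse N(0, 10²) prior, and proposal scale are all simplifying assumptions, not the study's actual model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: one group-by-time cell whose true interaction effect is 1.0
y = rng.normal(loc=1.0, scale=1.0, size=60)

def log_post(b):
    # Log posterior up to a constant: N(b, 1) likelihood, N(0, 10^2) prior
    return -0.5 * np.sum((y - b) ** 2) - 0.5 * b ** 2 / 100.0

draws = np.empty(30000)                     # 30,000 iterations, as in the study
b = 0.0
for s in range(draws.size):
    prop = b + rng.normal(scale=0.3)        # random-walk proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(b):
        b = prop                            # accept; otherwise keep current b
    draws[s] = b

post = draws[25000:]                        # discard burn-in of 25,000
```

The retained draws `post` approximate the posterior of the interaction effect; summaries of such draws are what the Outcomes slides below work with.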

Baseline Model Only Time- and Group-level effects are estimated as differences in intercepts (plus interaction term) With real data, other models could also incorporate covariates (SES, etc.) at any level of the model

Outcomes. The parameter estimates β3gt (Group-by-Time interactions) are used to infer aberrant group performance at a given Time. β1g (the main effect for Group) could also be used to detect systematic aberrance. Outcomes: Delta (Δ) values for parameter estimates, plus the "Posterior Probability of Cheating" (PPoC).

Outcomes. PPoC = the proportion of posterior draws (samples from the posterior in the MCMC output) above zero; criterion for flagging: PPoC ≥ 0.75. Δ = standardized effect size for the interaction; the previous study found Δ ≥ 0.5 a reasonable criterion.

Cross-validation. Any Group-by-Time interaction effect with Δ ≥ 0.5 and PPoC ≥ 0.75 was flagged as aberrant (i.e., potentially cheating). Over replications, correctly identified groups contributed to the Power calculation; false positive flags contributed to the Type I error rate.
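Across replications, power and the Type I error rate are just proportions of flags among true cheaters and non-cheaters, respectively. A toy illustration with hypothetical flag results (5 group-by-time cells, 3 replications; not the study's data):

```python
import numpy as np

# Hypothetical truth and flags; True entries in `truth` mark simulated cheaters.
truth = np.array([1, 1, 0, 0, 0], dtype=bool)
flags = np.array([
    [1, 1, 0, 0, 0],     # replication 1: both cheaters caught, no false flags
    [1, 0, 0, 1, 0],     # replication 2: one miss, one false positive
    [0, 1, 0, 0, 0],     # replication 3: one miss
], dtype=bool)

power = flags[:, truth].mean()    # proportion of true cheaters flagged
type1 = flags[:, ~truth].mean()   # proportion of non-cheaters flagged
```

Here power = 4/6 and Type I = 1/9; the study reports these rates marginally and separately by Time point.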

Results. MCMC: 2 chains, 30,000 iterations each, burn-in = 25,000. Very good convergence of solutions. Main effects for Time and Group were well recovered. Detection power was very good at Times 2 & 3, but quite low for Time 1. Acceptable Type I error rate.

Flag Criteria: Δ ≥ .5, PPoC ≥ .75. Marginal Power = .59, Type I = .04.


Flag Criteria: Δ ≥ .5, PPoC ≥ .75. Time 1 Power = .07, Type I = .04.

Flag Criteria: Δ ≥ .5, PPoC ≥ .75. Time 2 Power = .71, Type I = .04.

Flag Criteria: Δ ≥ .5, PPoC ≥ .75. Time 3 Power = 1, Type I = .05.

Discussion. Overall power is quite good, though very poor at Time 1; the Type I error rate is acceptable. These are encouraging results, and more simulations and replications are planned, with additional conditions varying effect sizes, sample sizes, non-linear trends, etc.

How might this method be used in practice? Flagged groups may be compared to the overall growth trajectory to infer aberrance of performance. Flagged groups must then be investigated further: unusual performance could be caused by cheating, or it could indicate something exemplary! Commend or condemn?

Thanks!