1 Empirical Methods to Evaluate the Instructional Sensitivity of Accountability Tests
Stephen C. Court
Presented at the Association of Educational Assessment - Europe, 10th Annual Conference, "Innovation in Assessment to Meet Changing Needs", 5-7 November 2009, Valletta, Malta
2 Basic Assumption of Accountability Systems
Student test scores accurately reflect instructional quality:
Higher scores = greater learning due to higher-quality teaching
Lower scores = less learning due to lower-quality teaching
In short, it is assumed that accountability tests are instructionally sensitive.
3 Reality
The assumption rarely holds. Most accountability tests are not sensitive to instruction because they simply were not constructed to be instructionally sensitive.
The tests are built to the same general "Army Alpha" specifications, originally designed during the First World War to differentiate between officer candidates and enlisted personnel.
4 Consequences of Instructional Insensitivity
In principle:
Lack of fairness
Lack of trustworthy evidence to support validity arguments
In practice:
Bad policy
Bad evaluation
Bad things happen in the classroom
5 The Situation in Kansas - SES
SES disparities between districts
6 The Situation in Kansas - Test Scores
Disparities in state assessment scores and proficiency rates
7 The Situation in Kansas
Can the instruction in high-poverty districts really be so much worse than the instruction in low-poverty districts?
Or are construct-irrelevant factors (such as SES) masking the effects of instruction?
8 The basic question: What methods can be employed to evaluate the instructional sensitivity of accountability tests?
9 Definition: Instructional Sensitivity
"the degree to which students' performances on a test…accurately reflect the quality of instruction provided specifically to promote students' mastery of the knowledge and skills being assessed." (Popham, 2008)
10 Two-pronged Approach
At last year's AEA conference in Hissar, Popham (2008) advocated a two-pronged approach to evaluating instructional sensitivity:
Judgmental strategies
Empirical studies
11 Empirical Study
Following the guidance of Popham (2007), three Kansas school districts conducted an empirical study of the Kansas assessments.
12 Description of the Kansas Study
Teachers were invited to complete a brief online rating form. Participation was voluntary.
Each teacher identified the 3-4 indicators (curricular aims) he or she had taught best during the school year.
Student results were matched to responding teachers.
13 Study Participants
575 teachers responded:
320 teachers (grades 3-5 reading and math)
129 reading teachers (grades 6-8)
126 math teachers (grades 6-8)
14,000 students
14 A Gold Standard
Typically, test scores are used to confirm teacher perceptions…as if the test scores are infallible and the teachers are always suspect.
In fact, for the first 40 years of inquiry into instructional sensitivity, teacher perceptions were never even part of the mix. Instructional sensitivity studies always contrasted two sets of scores (e.g., pre-test/post-test, not-taught/taught).
Asking teachers to identify their best-taught indicators has changed the instructional sensitivity issue both conceptually and operationally.
15 Old and New Models of Instructional Sensitivity
Old model: A = Non-Learning, B = Learning, C = Slip, D = Maintain
New model: A = True Fail, B = False Pass, C = False Fail, D = True Pass
16 Kansas Study: Propensity Score Matching
Propensity scores were generated from logistic regression: several demographic and prior-performance characteristics were regressed on overall proficiency rate.
The resulting probabilities were used to match "Not-Best-Taught" with "Best-Taught" students using the "nearest neighbor" method.
Purpose: to form quasi-"random equivalent groups" of similar size for each content area, grade level, and indicator configuration.
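The "nearest neighbor" step can be sketched in a few lines. This is a minimal illustration only: it assumes propensity scores have already been estimated by the logistic regression, and the scores, group sizes, and greedy pairing rule below are my own illustrative assumptions, not details from the Kansas study.

```python
# Sketch of nearest-neighbor matching on precomputed propensity scores.
# All numbers are illustrative, not data from the Kansas study.

def nearest_neighbor_match(treated, pool):
    """Greedily pair each treated score with the closest still-unmatched
    score from the comparison pool; returns (treated, matched) pairs."""
    available = sorted(pool)
    pairs = []
    for score in treated:
        # pick the unmatched comparison score with the smallest distance
        best = min(available, key=lambda p: abs(p - score))
        available.remove(best)          # matching without replacement
        pairs.append((score, best))
    return pairs

best_taught = [0.62, 0.44, 0.80]              # propensity scores, "best-taught"
others = [0.40, 0.59, 0.77, 0.85, 0.12]       # comparison pool

for t, m in nearest_neighbor_match(best_taught, others):
    print(f"treated {t:.2f} matched to comparison {m:.2f}")
```

Matching without replacement, as here, keeps the two groups the same size, which fits the slide's goal of "random equivalent groups of similar size".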
17 Basic Contrast
The basic contrast involved "best-taught" versus "not-best-taught" students.
For example, Grade 3 Reading, Indicator 1:
160 teachers responded; 30 identified Indicator 1 as one of their best-taught indicators.
From the pool of other teachers and their students, propensity score matching was used to form an equivalent group (given average class size, 750 students from 30 teachers).
18 Initial Analysis Scheme
Conduct independent t-tests with:
mean indicator score as the dependent variable
best-taught versus other students as the independent variable
19 Initial Analysis Scheme
Initial logic:
If best-taught students outperform other students, the indicator is sensitive to instruction.
If mean differences are small or in the wrong direction, the indicator is insensitive to instruction.
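As a sketch of this initial scheme: a library routine such as SciPy's `ttest_ind` would be the usual choice, but the Welch t statistic can be computed directly from the two groups' means and variances. The indicator scores below are made up for illustration.

```python
import math
from statistics import mean, stdev

def welch_t(x, y):
    """Welch's t statistic for two independent samples
    (mean indicator score: best-taught vs. other students)."""
    nx, ny = len(x), len(y)
    vx, vy = stdev(x) ** 2, stdev(y) ** 2
    return (mean(x) - mean(y)) / math.sqrt(vx / nx + vy / ny)

# Illustrative mean indicator scores (proportion correct) per student
best_taught = [0.78, 0.85, 0.70, 0.90, 0.81]
others = [0.66, 0.72, 0.59, 0.75, 0.68]

t = welch_t(best_taught, others)
print(f"t = {t:.2f}")   # a positive t favors the best-taught group
```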
20 Problem
But significant performance differences between best-taught and other students do not necessarily represent significant differences in instructional sensitivity.
Instead, instructional sensitivity is about whether the indicator accurately distinguishes effective from ineffective instruction, without confounding from any form of construct-irrelevant easiness or difficulty.
21 Basic Concept
In its simplest form, Popham's definition of instructional sensitivity can be depicted as a 2x2 contingency table.
23 Basic Concepts
Mean(Least effective) = B/(A+B); Mean(Most effective) = D/(C+D)
But
Mean(Least effective) = False Pass/(True Fail + False Pass)
makes no sense at all.
In fact, it returns to treating the outcome as infallible and the teacher perceptions as suspect: if the pass rates for the two groups are statistically similar, then the degree of difference between least and most effective must be questioned.
24 Conceptually Correct
Rather than comparing means, we instead need to look at the combined proportion of true fails and true passes. That is,
(A + D) / (A + B + C + D)
which can be shortened to
(A + D) / N
25 (A + D) / N: Index 1
Ranges from 0 to 1 (completely insensitive to totally sensitive)
In practice: values < .50 are worse than random guessing
26 Totally Sensitive
(A + D)/N = (50 + 50)/100 = 1.0
A totally sensitive test would cluster students into cells A or D.
27 Totally Insensitive
(A + D)/N = (0 + 0)/100 = 0.0
A totally insensitive test clusters students into cells B and C.
28 Useless
(A + D)/N = (25 + 25)/100 = 0.50
0.50 = mere chance; values < 0.50 are worse than chance.
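The three cases above reduce to one small function; the cell counts are the ones from the slides.

```python
def sensitivity_index(A, B, C, D):
    """Index 1 = (A + D) / N, where A = true fails and D = true passes."""
    return (A + D) / (A + B + C + D)

print(sensitivity_index(50, 0, 0, 50))    # totally sensitive
print(sensitivity_index(0, 50, 50, 0))    # totally insensitive
print(sensitivity_index(25, 25, 25, 25))  # mere chance
```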
29 Index 1 Equivalents
Index 1 is conceptually equivalent to:
the Mann-Whitney U statistic
the Wilcoxon statistic
transposing Cell A and Cell B, then running a t-test
the Area Under the Curve (AUC) in Receiver Operating Characteristic (ROC) curve analysis
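The Mann-Whitney/AUC equivalence can be illustrated directly: computing AUC as the proportion of (best-taught, other) student pairs in which the best-taught outcome is higher, with ties counting half (this is U divided by the number of pairs), gives the same value as (A + D)/N when the two groups are equal-sized and the outcome is binary pass/fail. The toy data below are my own, not Kansas results.

```python
def auc_pairwise(pos, neg):
    """AUC as P(pos > neg) + 0.5 * P(tie), i.e. Mann-Whitney U / (n_pos * n_neg)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# pass (1) / fail (0) outcomes for two equal-sized matched groups
best_taught = [1, 1, 1, 0]   # D = 3 true passes, C = 1 false fail
others = [0, 0, 1, 0]        # A = 3 true fails,  B = 1 false pass

auc = auc_pairwise(best_taught, others)
index1 = (3 + 3) / 8         # (A + D) / N
print(auc, index1)           # both 0.75
```

With unequal group sizes the two quantities can diverge, which is one argument for the propensity matching step that equalizes the groups first.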
30 ROC Curve Analysis
Has rarely been used in the domain of educational research.
More commonly used in:
medicine and radiology
data mining (information retrieval)
artificial intelligence (machine learning)
ROC curves were first introduced during WWII, in response to the challenge of accurately identifying enemy planes on radar screens.
31 AUC Context
ROC curve analysis, especially the AUC, is particularly useful for several reasons:
Easily computed
Easily interpreted
Decomposable into sensitivity and specificity:
Sensitivity = D / (C + D)
Specificity = A / (A + B)
Easily graphed as Sensitivity versus (1 - Specificity)
Readily expandable to polytomous situations:
multiple test items in a subscale
multiple subscales in a test
multiple groups being tested
32 Basic Interpretation (Descriptive)
Easy to compute: (A + D)/N
Easy to interpret, on an A-F scale:
excellent (A)
good (B)
fair (C)
poor (D)
fail (F)
Less than .50 is worse than guessing!
33 Basic Interpretation
Most statistical software packages (e.g., SAS, SPSS) include a ROC procedure.
The area-under-the-curve table displays:
estimates of the area,
the standard error of the area,
confidence limits for the area,
and the p-value of a hypothesis test.
34 ROC Hypothesis Test
The null hypothesis: true AUC = .50.
So the use of ROC curve analysis in this context would support rigorous psychometric inquiry into instructional sensitivity.
Yet the A, B, C, D, F system could be reported in ways that even the least experienced reporters or policy-makers can readily understand.
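The slides name only the null hypothesis; one concrete way to test it (an assumption on my part, since the specific procedure is not stated here) is the Hanley-McNeil (1982) approximation to the standard error of the AUC, with z = (AUC − .50) / SE referred to the normal distribution. The AUC and group sizes below are illustrative.

```python
import math

def hanley_mcneil_se(auc, n1, n2):
    """Approximate standard error of an AUC estimate
    (Hanley & McNeil, 1982); n1 and n2 are the two group sizes."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc * auc / (1 + auc)
    var = (auc * (1 - auc)
           + (n1 - 1) * (q1 - auc * auc)
           + (n2 - 1) * (q2 - auc * auc)) / (n1 * n2)
    return math.sqrt(var)

auc, n1, n2 = 0.72, 750, 750          # illustrative values, not study results
se = hanley_mcneil_se(auc, n1, n2)
z = (auc - 0.50) / se                 # test of H0: true AUC = .50
print(f"SE = {se:.4f}, z = {z:.1f}")
```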
35 Area Under Curve (AUC) - Graphed
Curve 1 = .50: pure chance, no better than a random guess
Curve 4 = 1.0: totally sensitive, completely accurate discrimination between effective and less-effective instruction
Curve 3 is better than Curve 2
36 ROC Curve Interpretation
Greater AUC values indicate greater separation between distributions:
e.g., most effective versus less effective
best-taught versus not-best-taught
1.0 = complete separation, that is, total sensitivity
37 ROC Curve Interpretation
AUC values close to .50 indicate no separation between distributions.
AUC = .50 indicates:
complete overlap
no difference
might as well guess
38 Procedural Review
Step 1: Cross-tabulate fail/pass status with teacher identification of best-taught indicators.
Step 2: (Optional) Use logistic regression and propensity score matching to create randomly equivalent groups, or as close as you can get.
Step 3: Use (A+D)/N or formal ROC curve analysis to evaluate instructional sensitivity at the smallest grain size possible, preferably at the wrong/right level of individual items.
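Steps 1 and 3 can be sketched end-to-end for a single indicator. The records below are invented for illustration; the cell labels follow the true fail / false pass / false fail / true pass scheme from the slides.

```python
from collections import Counter

# Step 1: cross-tabulate pass/fail status with best-taught status.
# Each record is (best_taught, passed); counts here are illustrative.
records = ([(True, True)] * 40 + [(True, False)] * 10
           + [(False, True)] * 15 + [(False, False)] * 35)

cells = Counter(records)
A = cells[(False, False)]   # true fail
B = cells[(False, True)]    # false pass
C = cells[(True, False)]    # false fail
D = cells[(True, True)]     # true pass

# Step 3: evaluate instructional sensitivity with (A + D) / N.
index1 = (A + D) / (A + B + C + D)
print(f"(A+D)/N = {index1:.2f}")
```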
39 In Closing
The assumption that accountability tests are sensitive to instruction rarely holds.
Inferences drawn from test scores about school quality and teaching effectiveness must be validated before action is taken.
The empirical approaches presented here should prove helpful in determining whether the inference that a test is instructionally sensitive is indeed warranted.
40 Presenter’s email address: firstname.lastname@example.org Questions, comments, or suggestions are welcome