The reliability of educational assessments Dylan Wiliam www.dylanwiliam.net Ofqual Annual Lecture, Coventry: 7 May 2009
The argument: The public understanding of the reliability of assessments is weak. Contributory factors are: the need of humans for certainty (and beliefs that its absence is chaos); the inherent unreliability of all measurements, educational and otherwise; the use in education of tools derived from individual differences psychology; an emphasis on scores, rather than how they are used; political assumptions about the educability of the public, combined with a desire to use assessment outcomes as drivers of reform. Those who produce, and those who mandate the use of, assessments must take responsibility for informed use of assessment outcomes.
Dealing with uncertainty in society: People like certainty… Hilbert (1900): “In mathematics, there is no ignorabimus”. He was wrong, and it was unsettling (Kline, 1980). …and to attribute blame… Deaths of children in care (e.g., “Baby P.”). …although there are some cases where uncertainty is accepted: “It is better and more satisfactory to acquit a thousand guilty persons than to put a single innocent one to death” (Maimonides); “It is better that ten guilty persons escape than that one innocent suffer” (Blackstone).
The very first high-stakes assessment… “Then Jephthah gathered together all the men of Gilead, and fought with Ephraim: and the men of Gilead smote Ephraim, because they said, Ye Gileadites are fugitives of Ephraim among the Ephraimites, and among the Manassites. And the Gileadites took the passages of Jordan before the Ephraimites: and it was so, that when those Ephraimites which were escaped said, Let me go over; that the men of Gilead said unto him, Art thou an Ephraimite? If he said, Nay; Then said they unto him, Say now Shibboleth: and he said Sibboleth: for he could not frame to pronounce it right. Then they took him, and slew him at the passages of Jordan: and there fell at that time of the Ephraimites forty and two thousand.” (Judges 12, 4-6, King James version)
Reliability: Hansen (1993) distinguishes between literal and representational assessments. There are no literal assessments: all assessments are representational, and all assessments involve generalization. Reliability is a measure of the stability of assessment outcomes under changes in (or the ability to generalize across) things that (we think) shouldn’t make a difference, such as: marker/rater; occasion*; item selection*. (* UK excepted)
Uncertainty in assessing English Starch & Elliott (1912)
Uncertainty in assessing mathematics Starch & Elliott (1913)
Measures of reliability: In classical test theory, reliability is defined as a kind of “signal-to-noise” ratio (in fact a signal to signal-plus-noise ratio). Reliability is increased by decreasing the noise or, more easily, by increasing the signal; hence the need for discrimination. The legacy of individual differences psychology: a focus on discrimination between individuals. In education, more appropriate ways of estimating reliability exist: discriminating between those who have and have not been taught; discriminating between those who have and have not been taught well.
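The “signal to signal-plus-noise” description can be written out as below. The notation is the conventional one from classical test theory and is an assumption here, not taken from the slide: X is the observed score, T the true score, E the error, and the sigma-squared terms are their variances.

```latex
X = T + E, \qquad
\rho_{XX'} \;=\; \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2} \;=\; \frac{\sigma_T^2}{\sigma_X^2}
```

Read this way, the ratio rises either when the error variance shrinks or when the true-score variance grows, which is why tests built on this definition reward items that spread candidates out.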
Test length and reliability
Factor by which a test must be lengthened to raise its reliability (columns: from; rows: to)
To \ From   0.70   0.75   0.80   0.85   0.90   0.95
0.70         1.0
0.75         1.3    1.0
0.80         1.7    1.3    1.0
0.85         2.4    1.9    1.4    1.0
0.90         3.9    3.0    2.3    1.6    1.0
0.95         8.1    6.3    4.8    3.4    2.1    1.0
Just about the only way to increase the reliability of a test is to make it longer, or narrower (which amounts to the same thing).
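The slide does not say how the lengthening factors were obtained. A minimal sketch, assuming they come from the Spearman-Brown prophecy formula (which reproduces table entries such as 3.9 for going from 0.70 to 0.90); the function names are illustrative only.

```python
def lengthening_factor(r_from: float, r_to: float) -> float:
    """Factor by which a test must be lengthened to move from
    reliability r_from to r_to (Spearman-Brown, solved for length)."""
    if not (0.0 < r_from < 1.0 and 0.0 < r_to < 1.0):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    return (r_to * (1.0 - r_from)) / (r_from * (1.0 - r_to))


def lengthened_reliability(r: float, n: float) -> float:
    """Reliability of a test made n times as long (Spearman-Brown)."""
    return n * r / (1.0 + (n - 1.0) * r)


# Example: a test with reliability 0.70 needs to be almost four times
# as long to reach 0.90.
factor = lengthening_factor(0.70, 0.90)
print(f"lengthening factor: {factor:.2f}")
```

The formula assumes the added material is parallel to the existing items, so in practice the factors are best treated as rough guides.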
Reliability is not what we really want: Take a test which is known to have a reliability of around 0.90 for a particular group of students. Administer the test to the group of students and score it. Give each student a random script rather than their own. Record the scores assigned to each student. What is the reliability of the scores assigned in this way? A. 0.10; B. 0.30; C. 0.50; D. 0.70; E. 0.90
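The slide leaves the answer open. The simulation below is one way to explore the thought experiment, under assumptions that are mine and not the lecture's: normally distributed true scores, a split-half estimate of reliability computed from the scripts themselves, and a "validity" check defined as the correlation between the score a student is given and that student's true score.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def split_half_reliability(scripts):
    """Correlate the two half-scores, then step up with Spearman-Brown."""
    r = pearson([s[0] for s in scripts], [s[1] for s in scripts])
    return 2 * r / (1 + r)


def simulate(n_students=20000, reliability=0.90, seed=1):
    rng = random.Random(seed)
    true_sd = math.sqrt(reliability)
    half_error_sd = math.sqrt((1 - reliability) / 2)
    true = [rng.gauss(0, true_sd) for _ in range(n_students)]
    # Each script holds two half-test scores for the student who wrote it.
    own = [(t / 2 + rng.gauss(0, half_error_sd),
            t / 2 + rng.gauss(0, half_error_sd)) for t in true]
    # Hand every student a randomly chosen script instead of their own.
    assigned = own[:]
    rng.shuffle(assigned)
    return {
        "reliability_own": split_half_reliability(own),
        "reliability_random": split_half_reliability(assigned),
        "validity_own": pearson(true, [a + b for a, b in own]),
        "validity_random": pearson(true, [a + b for a, b in assigned]),
    }


for name, value in simulate().items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the script-based reliability estimate is unaffected by who receives which script, while the link between a student's assigned score and their true score disappears. A different operational definition of reliability (for example, re-drawing random scripts on a second occasion) would give a different figure.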
Reliability v consistency: Classical measures of reliability are meaningful only for groups and are designed for continuous measures. Marks versus grades: scores suffer from spurious accuracy; grades suffer from spurious precision. Classification consistency: a more technically appropriate measure of the reliability of assessment; closer to the intuitive meaning of reliability.
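Classification consistency can be illustrated with a small simulation: how often would a candidate receive the same grade on two parallel sittings of a test? The sketch below is illustrative only. The normal score model, the grade boundaries, and the function names are assumptions, and the printed figures are not estimates for any real examination.

```python
import bisect
import math
import random


def grade(score, cuts):
    """Index of the grade band a score falls in, given sorted cut scores."""
    return bisect.bisect_right(cuts, score)


def classification_consistency(reliability, cuts, n=20000, seed=0):
    """Proportion of simulated candidates given the same grade on two
    parallel sittings of a test with the stated reliability."""
    rng = random.Random(seed)
    true_sd = math.sqrt(reliability)
    error_sd = math.sqrt(1 - reliability)
    same = 0
    for _ in range(n):
        t = rng.gauss(0, true_sd)
        first = t + rng.gauss(0, error_sd)
        second = t + rng.gauss(0, error_sd)
        if grade(first, cuts) == grade(second, cuts):
            same += 1
    return same / n


# Hypothetical boundaries giving four grade bands on a standardised scale.
cuts = [-1.0, 0.0, 1.0]
for rel in (0.80, 0.90, 0.95):
    share = classification_consistency(rel, cuts)
    print(f"reliability {rel:.2f}: same grade twice for {share:.0%}")
```

Unlike a reliability coefficient, the result is a proportion of candidates, which depends on the number and placement of grade boundaries as well as on the reliability of the underlying scores.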
Uncertainty in assessment at A-level Classification consistency at A-level Please, D. N. (1971) “Estimation of the proportion of examination candidates who are wrongly graded”. Br. J. math. statist. Psychol. 24: 230-238.
Here’s the table that got me into trouble… Classification consistency of National Curriculum Assessment in England
AERA, APA, NCME Standards (4th ed., 1999). Standard 2.1: For each total score, subscore or combination of scores that is to be interpreted, estimates of relevant reliabilities and standard errors of measurement or test information functions should be reported (p. 31). Standard 2.2: The standard error of measurement, both overall and conditional (if relevant) should be reported both in raw score or original scale units and in units of each derived score recommended for use in test interpretation (p. 31, my emphasis). Standard 2.3: When test interpretation emphasizes differences between two observed scores of an individual, or two averages of a group, reliability data, including standard errors, should be provided for such differences (p. 32).
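To make the quantities named in these standards concrete, here is a minimal sketch using the classical-test-theory formula for the standard error of measurement, SEM = SD × sqrt(1 − reliability), and the usual rule for the standard error of a difference between two independent scores. The function names and the example numbers are hypothetical and do not come from the Standards.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical SEM in the same units as the score scale."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def score_band(observed: float, sd: float, reliability: float, z: float = 1.96):
    """Approximate band of +/- z SEMs around an observed score."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - z * sem, observed + z * sem


def se_of_difference(sem_a: float, sem_b: float) -> float:
    """Standard error of the difference between two independent scores."""
    return math.hypot(sem_a, sem_b)


# Hypothetical scale: standard deviation 15, reliability 0.91.
sem = standard_error_of_measurement(15.0, 0.91)
low, high = score_band(100.0, 15.0, 0.91)
print(f"SEM = {sem:.1f}; band for a score of 100: {low:.1f} to {high:.1f}")
```

A band of this kind is expressed in score units; reporting error in grade units, as Standard 2.2 asks for derived scores, needs a further step such as a classification-consistency estimate.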
Conclusion: It is simply unethical to produce or to mandate the use of assessments without taking steps to ensure that those likely to use the outcomes do so in an informed way. Error bounds should be routinely estimated, and reported in terms of the units used for reporting (e.g., grades and aggregate measures). Government and its agencies should actively promote public understanding of: the limitations of assessments, both in terms of reliability and other aspects of validity; appropriate interpretations of assessment outcomes, for individuals and groups.