Reading Assessment: Still Time for a Change


1 Reading Assessment: Still Time for a Change
P. David Pearson, UC Berkeley Professor and Former Dean
•Kia ora. Thanks to the IRA, and especially to Pat Edwards, for inviting me (the third time I have had the chance to address a world congress as a plenary speaker), a testament to longevity.
•Thanks to Heather Bell and her colleagues for the delightful venue, program, and organization.
•The best word on the slide is Former.

2 Why did I pick such a boring topic?
I’m a professor! Who needs fun? The consequences are too grave. I have a perverse standard of fun. I’m a professor, and I think boring topics are good for people! In general, I don’t like to see people have a good time. Whether the task is exciting, engaging, or deadly dull, the consequences of not getting assessment right are too grave to allow us to postpone the task any longer. We must, at all costs, stop the irreparable harm that otherwise gets done to students, parents, teachers, schools, and curriculum. And it could be fun, in a perverse way, if we could turn this assessment problem on its ear, get it right, and make it an ally of teaching and learning rather than an enemy. One thing you should all know at the outset is that there are 87 slides in this presentation, so for those of you who are compulsive notetakers, this is a real challenge. Those who want to relax can download the slides instead.

3 Valencia and Pearson (1987) Reading Assessment: Time for a Change
Valencia and Pearson (1987), Reading Assessment: Time for a Change, in The Reading Teacher. A set of contrasts between cognitively oriented views of reading and prevailing practices in assessing reading circa 1986.

New views of the reading process tell us that . . . / Yet when we assess reading comprehension, we . . .

1. Prior knowledge is an important determinant of reading comprehension. / Mask any relationship between prior knowledge and reading comprehension by using lots of short passages on lots of topics.
2. A complete story or text has structural and topical integrity. / Use short texts that seldom approximate the structural and topical integrity of an authentic text.
3. Inference is an essential part of the process of comprehending units as small as sentences. / Rely on literal comprehension test items.
4. The diversity in prior knowledge across individuals, as well as the varied causal relations in human experience, invites many possible inferences to fit a text or question. / Use multiple-choice items with only one correct answer, even when many of the responses might, under certain conditions, be plausible.
5. The ability to vary reading strategies to fit the text and the situation is one hallmark of an expert reader. / Seldom assess how and when students vary the strategies they use during normal reading, studying, or when the going gets tough.
6. The ability to synthesize information from various parts of the text and different texts is a hallmark of an expert reader. / Rarely go beyond finding the main idea of a paragraph or passage.
7. The ability to ask good questions of text, as well as to answer them, is a hallmark of an expert reader. / Seldom ask students to create or select questions about a selection they may have just read.
8. All aspects of a reader’s experience, including habits that arise from school and home, influence reading comprehension. / Rarely view information on reading habits and attitudes as being as important as information about performance.
9. Reading involves the orchestration of many skills that complement one another in a variety of ways. / Use tests that fragment reading into isolated skills and report performance on each.
10. Skilled readers are fluent; their word identification is sufficiently automatic to allow most cognitive resources to be used for comprehension. / Rarely consider fluency as an index of skilled reading.
11. Learning from text involves the restructuring, application, and flexible use of knowledge in new situations. / Often ask readers to respond to the text’s declarative knowledge rather than to apply it to near and far transfer tasks.

In the 1987 article, Sheila Valencia and I tried to convince the reading field, and at least America’s policy makers, that it was time for a change in the nature, development, and format of assessments if we were to rely on them as the friend, not the enemy, of teaching and learning. We noted 11 discrepancies between what we knew to be true about reading and what we knew to be the state of the art in reading assessment. Hence the title, Reading Assessment: Time for a Change. Today I will wonder with you whether it is still time for a change. Short answer: yes. Now let me tell you why.

4 New views of the reading process tell us that . . .
Yet when we assess reading comprehension, we . . .

Prior knowledge is an important determinant of reading comprehension. / Mask any relationship between prior knowledge and reading comprehension by using lots of short passages on lots of topics. (Side effect: we privilege those with the highest general verbal ability.)
A complete story or text has structural and topical integrity. / Use short texts that seldom approximate the structural and topical integrity of an authentic text. (Snippets, or "textoids.")
Inference is essential for comprehending units as small as sentences. / Rely on literal comprehension test items.

5 New views of the reading process tell us that . . .
Yet when we assess reading comprehension, we . . .

The diversity in prior knowledge across individuals, as well as the varied causal relations in human experience, invites many possible inferences to fit a text or question. / Use multiple-choice items with only one correct answer, even when many of the responses might, under certain conditions, be plausible.
The ability to synthesize information from various parts of the text and different texts is a hallmark of an expert reader. / Rarely go beyond finding the main idea of a paragraph or passage.
The ability to vary reading strategies to fit the text and the situation is one hallmark of an expert reader. / Seldom assess how and when students vary the strategies they use during normal reading, studying, or when the going gets tough.

6 What is thinking? "You do it in your head, without a pencil." (Alexandra, age 4) "You shouldn't do it in the dark. It's too scary." (Thomas, age 5) Speaking of being metacognitive and strategic, one of the things we used to do in those days was to interview a lot of kids. And when we did, we asked them questions like: What is reading? What is thinking? In doing the research for this talk, I ran across some old overheads (remember those?) and dumped them into a PowerPoint for the first time in their natural lives. Here are kids' responses to the question: What is thinking?

7 What is Thinking? "Thinking is when you're doing math and getting the answers right." (Sissy, age 5) And in response: "NO! You do the thinking when you DON'T know the answer." (Alex, age 5)

8 What is Thinking? "It's very, very easy. The way you do it is just close your eyes and look inside your head." (Robert, age 4)

9 What is Thinking? "You think before you cross the street!"
"What do you think about?" "You think about what you would look like smashed up!" (Leon, age 5)

10 What is Thinking? "You have to think in swimming class." "About what?"
"About don't drink the water because maybe someone peed in it…and don't drown!"

11 New views of the reading process tell us that . . .
Yet when we assess reading comprehension, we . . .

The ability to ask good questions of text, as well as to answer them, is a hallmark of an expert reader. / Seldom ask students to create or select questions about a selection they may have just read.
All aspects of a reader's experience, including habits that arise from school and home, influence reading comprehension. / Rarely view information on reading habits and attitudes as being as important as information about performance.
Reading involves the orchestration of many skills that complement one another in a variety of ways. / Use tests that fragment reading into isolated skills and report performance on each.

We did some work on number 1. Habits are important outcomes. See Scott Paris's important work on constrained and unconstrained skills.

12 New views of the reading process tell us that . . .
Yet when we assess reading comprehension, we . . .

Skilled readers are fluent; their word identification is sufficiently automatic to allow most cognitive resources to be used for comprehension. / Rarely consider fluency as an index of skilled reading.
Learning from text involves the restructuring, application, and flexible use of knowledge in new situations. / Often ask readers to respond to the text's declarative knowledge rather than to apply it to near and far transfer tasks.

13 Why Did We Take This Stance?
We need a little mini-history of assessment to understand our motives.

14 The Scene in the US in the 1970s and early 1980s
Behavioral objectives
Mastery learning
Criterion-referenced assessments
Curriculum-embedded assessments
Minimal competency tests: New Jersey
Statewide assessments: Michigan & Minnesota

15 Historical relationships between instruction and assessment
Skill 1: Teach → Assess → Conclude. Skill 2: Teach → Assess → Conclude. Bloom's notion of mastery learning was that if you could just be sufficiently transparent and explicit about the nature of the task and the criterion for demonstrating mastery of it, a lot more people would be able to demonstrate mastery. Bloom argued that we usually fix the instruction and allow performance outcomes to vary; he wanted us to fix the outcome and allow instruction to vary. That got perverted into all these bits and pieces of low-level skills, not the big stuff like comprehension or composition. The 1970s skills-management mentality: teach a skill, assess it for mastery, reteach it if necessary, and then go on to the next skill. Foundation: Benjamin Bloom's ideas of mastery learning.

16 The 1970s, cont. Skill 1: Teach → Assess → Conclude. Skill 2: Teach → Assess → Conclude. Skill 3: Teach → Assess → Conclude. Skill 4: Teach → Assess → Conclude. Skill 5: Teach → Assess → Conclude. Skill 6: Teach → Assess → Conclude. And we taught each of these skills until we had covered the entire curriculum for a grade level. (1972, White Bear Lake, Minnesota.)

17 Dangers in the Mismatch we Saw in 1987
False sense of security.
Instructionally insensitive to progress on new curricula.
Accountability will do us in and force us to teach to the tests and all the bits and pieces.
We'll feel good about teaching to specific skill tests when what we need are tests that challenge students to think.
Tests will be insensitive to progress on the higher-order thinking agenda implied by the new curriculum.
As accountability increases, we'll see more teaching to the test rather than teaching to our highest ideals.

18 Pearson’s First Law of Assessment
The finer the grain size at which we monitor a process like reading and writing, the greater the likelihood that we will end up teaching and testing bits and pieces rather than global processes like comprehension and composition. As an aside, this is one of the things we learned in that period but could never manage to make stick.

19 The ideal The best possible assessment
Teachers observe and interact with students as they read authentic texts for genuine purposes.
They evaluate the way in which the students construct meaning.
They intervene to provide support or suggestions when the students appear to have difficulty.
Given such a view, the best possible assessment of reading would seem to occur when teachers observe and interact with students as they read authentic texts for genuine purposes. As teachers interact with students, they evaluate the way in which the students construct meaning, intervening to provide support or suggestions when the students appear to have difficulty.

20 Pearson’s Second Law of Assessment
An assessment tool is valued to the degree that it can approximate the good judgment of a professional teacher! So anything that falls short of the ideal should be evaluated according to how close it comes to that ideal. A multiple-choice test that correlates highly with an informal approach is to be valued more highly than one that does not. And we should be explicitly mindful of the shortcomings of all surrogates for the "real thing."

21 A new conceptualization of the goal
[Table: features (accuracy, fluency, word meaning, comprehension, critique, perform) crossed with levels of decision-making (beyond school, school, classroom, individual); cells name tools such as IRIs, unit tests, norm-referenced tests, discussion, response, and essay.] What Sheila and I proposed in this article and some others was this: educators should select some aspects of reading that are worth monitoring and then decide how to monitor them, what tools to use, at each level in a system. And those levels vary from an individual student to an entire district, school authority, state or province, or even nation. Notice that I did not include norm-referenced tests in every row. Why? Because I think there are some aspects of reading that can never be assessed by anything short of direct performance. The "a time for every purpose under heaven" principle.

22 A 1987 Agenda for the Future. Another way to look at these issues is to imagine that the assessment system has many clients, and each client has decisions to make and questions to answer. Our job as assessment system designers is to help each client make critical decisions in as valid a manner as possible, with the least possible harm done to any individual or aggregation in the system.

23 Pearson’s Third Law of Assessment
When we ask an assessment to serve a purpose for which it was not designed, it is likely to crumble under the pressure, leading to invalid decisions and detrimental consequences. A time to every purpose under Heaven. This is exactly what we do when we milk the scores on a standardized test, looking for diagnostic value. A test might be perfectly well suited to monitoring progress or evaluating programs. That does not mean it will help us figure out what to do next for an individual child. Time to stop making silk purses out of sows' ears. Nor, by the way, should we try to make sows' ears out of silk purses.

24 Early 1990s in the USA Standards based reform
State initiatives
The IASA model
Trading flexibility for accountability
Move from being accountable for the means and leaving the ends up for grabs (the doctor or lawyer model) TO being accountable for the ends and leaving the means up for grabs (the carpenter or product model)
Just a promissory note: when NCLB came into being eight years later, this bargain of flexibility for accountability disappeared. So let's watch for how that happened.

25 Mid 1990s Developments. Assessment got situated within the standards movement that took off around the globe. Content standards: What should students know and be able to do? Performance standards: What counts as evidence of meeting the content standards? Opportunity-to-learn standards: Quid pro quo? What do we have to provide to kids and teachers so they can achieve the content and performance standards? Somehow these got left behind.

26 Standards-Based Reform The Initial Theory of Action
[Diagram: Assessment → Clear Expectations; Accountability → Motivation; both → Higher Student Learning.] We began our work with the same set of assumptions about standards-based reform that undergirded the IASA of 1994. The theory of action, to use our chair Dick Elmore's favorite term, was that if you put in place a standards-based accountability system (comprised of standards, assessments, and the accountability requirement), that will be sufficient to drive the reform engine. The standards determine the content, the assessments make the expectations clear to all, and the accountability system provides the motivation to improve. The final ingredient, which is a critical assumption in this classic standards-based reform model, is flexibility; that is, in return for being accountable, schools and teachers will be granted wide latitude in the processes, strategies, and methods they use to improve student learning. But the studies we reviewed and the experiences of our committee members suggested that this model does not necessarily achieve the goal of higher student learning. Too often, for example, a probationary or reconstituted school threatened with takeover or severe penalties will focus on improving scores rather than changing instruction. We also found evidence that assumptions in this model did not correspond to reality, namely the assumption that teachers would develop improved practices if they had both the freedom and the motivation to do so. Changes in practice, we found, seldom occurred without intentional and arduous effort on the part of school leaders. (À la Tucker and Resnick in the early 1990s.)

27 Expanded Theory of Action
[Diagram: Standards → Assessment (clear expectations) and Accountability (motivation) → Instruction and Professional Development → Higher Student Learning.] So we expanded our theory of action to match what the research we reviewed and the experiences we shared told us. In our expanded theory of action, two key elements are inserted between the clear expectations provided by assessments and the motivation provided by accountability on the one side and student learning on the other. Those two elements are instruction and professional development. The implication here is that standards, assessment, and accountability are not enough: standards have to be explicitly and deliberately transformed into instructional practices, and professional development is the pathway to improved instruction. Only then, our work told us, would student learning improve in the way the theory predicts it should. (À la Elmore and Resnick in the late 1990s.)

28 The Golden Years of the 90s?
A flying start in the late 1980s and early 1990s
International activity in Europe, Down Under, North America
Developmental rubrics
Performance tasks
New Standards
CLAS
Portfolios of various sorts: storage bins; showcase (best work); compliance (Walden, NYC)
Increased use of constructed-response items in NRTs

29 Late 1980s/early 1990s: Portfolios Performance Assessments Make Assessment Look Like Instruction
[Diagram: activities, from which we draw conclusions on standards 1–n.] In the late 1980s, building on all the good work on performance assessment, portfolio assessment, developmental rubrics, and the like that had begun a decade or two earlier in New Zealand and Australia and in pockets in Europe and North America, we began to experiment with these forms of assessment in the US. The key to the whole system was to tighten the link between instruction and assessment by making assessment look more like instruction rather than the other way round. Some in the movement took the point of view that as long as we were going to teach to the tests, we might as well have tests worth teaching to. That proved a fatal flaw in the movement because, as I will point out later, it is high stakes, not necessarily the format of the test, that is the evil that lurks in the heart of assessment. But for a few years, from roughly 1991 or '92 through '96 or '97, at least in the US, we experienced a proliferation of alternative assessments. We engage in instructional activities, from which we collect evidence, which permits us to draw conclusions about student growth or accomplishment on several dimensions (standards) of interest.

30 The complexity of modern assessment practices: one to many
[Diagram: Activity X → Standards 1–5.] Any given activity may offer evidence for many standards, e.g., responding to a story. This was very exciting because it meant that you could use artifacts from your classroom (student work) as evidence that students had mastered important standards.

31 The complexity of performance assessment practices: many to one
[Diagram: Activities 1–5 → Standard X.] For any given standard, there are many activities from which we could gather relevant evidence about growth and accomplishment, e.g., reads fluently. This, by the way, is the real meaning of curriculum-embedded assessment: instruction as an occasion for assessment.

32 The complexity of portfolio assessment practices, many to many
[Diagram: Activities 1–5 ↔ Standards 1–5.] Any given artifact/activity can provide evidence for many standards, and any given standard can be indexed by many different artifacts/activities. By the way, it is this complexity that, among other things, probably accounts for the demise of this family of alternative approaches. A sketch of the bookkeeping involved follows.
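To make that many-to-many bookkeeping concrete, here is a minimal sketch (mine, not the talk's; the activity names, standards, and mapping are all hypothetical) of how a portfolio system might record which artifacts speak to which standards and then invert the map to read each standard off against its evidence:

```python
# Hypothetical many-to-many map from classroom artifacts to standards.
from collections import defaultdict

evidence = [
    ("story_response_essay", {"comprehension", "composition", "response_to_literature"}),
    ("oral_reading_record",  {"accuracy", "fluency"}),
    ("research_report",      {"comprehension", "composition", "synthesis"}),
]

# Invert the mapping: for each standard, which activities provide evidence?
by_standard = defaultdict(list)
for activity, standards in evidence:
    for standard in standards:
        by_standard[standard].append(activity)

for standard, activities in sorted(by_standard.items()):
    print(f"{standard}: evidence from {', '.join(activities)}")
```

One activity (the research report) feeds three standards, and one standard (comprehension) draws on two activities, which is exactly the double-entry bookkeeping burden the slide describes.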

33 The perils of performance assessment: or maybe those multiple-choice assessments aren't so bad after all… "Thunder is a rich source of loudness." "Nitrogen is not found in Ireland because it is not found in a free state."

34 The perils of performance assessment
"Water is composed of two gins, Oxygin and Hydrogin. Oxygin is pure gin. Hydrogin is gin and water.” "The tides are a fight between the Earth and moon. All water tends towards the moon, because there is no water in the moon, and nature abhors a vacuum. I forget where the sun joins in this fight."

35 The perils of performance assessment
"Germinate: To become a naturalized German." "Vacumm: A large, empty space where the pope lives.” Momentum is something you give a person when they go away.

36 The perils of performance assessment
The cause of perfume disappearing is evaporation. Evaporation gets blamed for a lot of things people forget to put the top on. Mushrooms always grow in damp places which is why they look like umbrellas. Genetics explains why you look like your father, and if you don't, why you should.

37 The perils of performance assessment
"When you breath, you inspire. When you do not breath, you expire."

38 Post 1996: The Demise of Performance Assessment
A definite retreat from performance-based assessment as a wide-scale tool:
Psychometric issues
Cost issues
Labor issues
Political issues
Why the demise of performance assessment? Generalizability. Cost. Labor (BUT PD). California: Open Mind.

39 The Remains… Still alive inside classrooms and schools
Still alive inside classrooms and schools, living a fugitive life. Hybrid assessments based on the NAEP model: multiple-choice, short answer, extended response. The persistence of standards-based reform.

40 No Child Left Behind Accountability in Spades
Every-grade-level reporting
Census assessment rather than sampling (everybody takes the same test)
Disaggregated reporting by income, exceptionality, language, ethnicity
A full-employment-for-psychometricians law

41 NCLB, continued Assessments for varied purposes
Placement
Progress monitoring
Diagnosis
Outcomes/program evaluation
Scientifically based curriculum, too
This may seem like progress, but it can explode on you. The curriculum: fix both the ends (with assessments) and the means (curriculum and monitoring devices to promote fidelity). Remember the deal: trading flexibility on the curriculum side for accountability on the outcomes side. Guess what: in 2002, the policy makers reneged on the deal and fixed both. Where is the professional prerogative there?

42 There is good reason to worry about disaggregation
[Chart: achievement distributions, from low to high, for School 1 and School 2.]

43 Disaggregation and masking
Height of bar = average achievement; width = number of students. Simpson's Paradox? [Chart: in School 1, subgroup A is the large group and subgroup B the small one; in School 2, subgroup B is the large group and subgroup A the small one; achievement runs from low to high.] A worked example follows.
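To see how aggregation can mask subgroup results, here is a minimal worked example (the enrollment counts and mean scores are invented for illustration; they are not from the slide): both subgroups score higher in School 2, yet School 1 posts the higher overall average simply because it enrolls far more of the higher-scoring group.

```python
# Hypothetical data: (number of students, mean score) per subgroup per school.
schools = {
    "School 1": {"A": (90, 62.0), "B": (10, 40.0)},
    "School 2": {"A": (10, 65.0), "B": (90, 45.0)},
}

for school, groups in schools.items():
    total_n = sum(n for n, _ in groups.values())
    overall = sum(n * mean for n, mean in groups.values()) / total_n
    detail = ", ".join(f"{g}: n={n}, mean={m}" for g, (n, m) in groups.items())
    print(f"{school}: overall mean {overall:.1f}  ({detail})")

# Subgroup A does better in School 2 (65 > 62) and so does subgroup B (45 > 40),
# yet School 1's aggregate (59.8) beats School 2's (47.0): Simpson's paradox.
```

The aggregate comparison reverses both subgroup comparisons, which is exactly why an un-disaggregated report can mislead.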

44 Disaggregation: Damned if we do and damned if we don’t
Don't report: we render certain groups invisible. Do report: we blame the victim (they are the group that did not meet the standard).

45 Pearson’s Fourth Law of Assessment
Disaggregation is the right approach to reporting results. Just be careful where the accountability falls.

46 Pearson’s Fourth Law: A Corollary
Accountability, in general, falls to the lowest level of reporting in the system. If it is reported at the state or provincial level, states or provinces fail. If at the district or authority level, districts and authorities fail. If at the school level, schools fail. If at the classroom level, teachers fail. If at the subgroup level, subgroups fail. If at the student level, students fail. Everybody's failing it, failing it, failing it, everybody's failing it.

47 Assessment can be the friend or the enemy of teaching and learning
The curious case of DIBELS… How DIBELS and other benchmark assessments can wreak havoc on the best-laid curricular plans. The Dark Side.

48 A word about benchmark assessments…
The world is filled with assessments that provide useful information… but are not worth teaching to. They are good thermometers or dipsticks, not good curriculum.

49 The ultimate assessment dilemma…
What do we do with all of these timed tests of fine-grained skills:
Words correct per minute
Words recalled per minute
Letter sounds named per minute
Phonemes identified per minute
Scott Paris: constrained versus unconstrained skills. Pearson: mastery skills versus growth constructs.

50 Why they are so seductive
They mirror at least some of the components of the NRP report. They correlate with lots of other assessments that have the look and feel of real reading. And they take advantage of the well-documented finding that speed metrics are almost always correlated with ability, especially verbal ability. Example: alphabet knowledge. 90% of the kids might be 90% accurate, but they will still be normally distributed in terms of letter names per minute (LNPM).

51 How to get a high correlation between a mastered skill and something else
[Plot: letter name fluency (LNPM) vs. letter name accuracy.] The wider the distribution of scores, the greater the likelihood of obtaining a high correlation. A simulation of this point follows.
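A small simulation can make the statistical point concrete. Everything below is invented for illustration (the effect sizes, the ceiling, the noise levels are mine, not from any real data set): when accuracy piles up against the 100% ceiling, its variance collapses and its correlation with any other measure attenuates, while the speeded metric keeps its spread and therefore its correlation with the underlying ability.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
ability = rng.normal(0.0, 1.0, n)  # latent verbal ability (hypothetical)

# Letter-name accuracy: a mastered skill, so most children sit at the 100% ceiling.
accuracy = np.clip(101 + 3 * ability + rng.normal(0, 2, n), 0, 100)

# Letter names per minute: no ceiling, so scores stay widely spread.
lnpm = 40 + 12 * ability + rng.normal(0, 6, n)

# Any other reading measure driven by the same underlying ability.
other = 50 + 10 * ability + rng.normal(0, 8, n)

print(f"r(accuracy, other) = {np.corrcoef(accuracy, other)[0, 1]:.2f}")  # attenuated by the ceiling
print(f"r(LNPM, other)     = {np.corrcoef(lnpm, other)[0, 1]:.2f}")      # stays high
```

Both measures track the same latent ability; only the restricted range of the mastered skill drives its correlation down, which is why the speeded version looks so much more "predictive."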

52 Face validity problem: What virtue is there in doing things faster?
Naming letters, sounds, words, ideas. What would you do differently if you knew that Susie was faster than Ted at naming X, Y, or Z? For a paper I did for the new Handbook of Reading Disability Research, I had occasion to go back to a lot of the work done trying to understand the skill infrastructure of kids classified as reading disabled. Curious thing: all kinds of speed metrics (naming letters, naming pictures, naming words) turn out to be entirely predicted by age. So older kids, even older kids with problems, can do lots of things faster than younger kids reading at the same reading level. I guess that means that if we want WCPM to go up, we should just wait a year or two.

53 Why I fear the use of these tests

54 They meet only one of the tests of validity: criterion-related validity
They correlate with other measures given at the same time (concurrent validity) and predict scores on other reading assessments (predictive validity).

55 Fail the test of curricular or face validity
They do not, on the face of it, look like what we are teaching…especially the speeded part. Unless, of course, we change instruction to match the test.

56 Really fail the test of consequential validity
Weekly timed-trials instruction confuses means and ends; proxies don't make good goals. We want kids to read faster, sure, but because they are getting better at all aspects of reading and language performance, not because we practiced timed trials three times a day.

57 The Achilles Heel: Consequential Validity
[Cycle: give DIBELS and a comprehension test → use the results to craft instruction → give DIBELS and the comprehension test again.] DIBELS does not, by the way, claim to be diagnostic. It is supposedly a progress-monitoring test, meant to tell you how kids are doing on the road to somewhere. But that is not how it gets used. The emperor has no clothes.

58 The bottom line on so many of these tests
Pearson's Third Law again. New bumper sticker: Never send a test out to do a curriculum's job! The world is filled with tests that provide useful and convenient proxies for the real thing. It is true that WCPM is a good proxy for comprehension; this, by the way, is another application of Pearson's Third Law. But the minute that indirect indicator morphs into a curricular goal, it becomes a monster. I want kids to read faster and more fluently, sure, because we taught everything well in a balanced curriculum, not because we had all the kids practice timed trials five days a week all year long.

59 The dark side of alignment: the transfer problem
I agree about the importance of curriculum-based assessment and situated learning, BUT… We do expect what you learn in one context to assist you in others. In our heart of hearts we do NOT believe that kids learn ONLY what you teach, or that only what is tested is what should get learned (and taught). Note our strong faith in the idea of application. I agree with a lot of what Bill had to say about curriculum-based assessment. In fact, most of what I suggested as part of an assessment system would qualify as curriculum-based assessment, except for the big-picture assessments; those are a little different, I think. In today's educational discussions, we hear a lot about the notion of situated cognition and situated learning. The key point in this concept is that we have to stop treating learning as a set of abstract principles or constructs that rise above the specifics of each learning situation and guide us whenever we encounter a new instance of the same phenomenon. When you learn a skill, a process, or a fact, granted, you learn it in a specific context. But even if we endorse the notion of situated learning (that context matters), we do expect that what you learn in one situation will serve your similar needs in other contexts. There is still a hint of transfer left in our thinking. I think our big-picture assessments have to require students to export what they have learned in one or more contexts to a new context. Otherwise, I do not see how we can call it authentic assessment. So, how do we test for transfer?

60 How do we test for transfer?
A continuum of cognitive distance. An example: learn about the structure of texts/knowledge about insect societies (bees, ants, termites). New passages: paper wasps, a human society, a biome. How far will the learning travel? Our problem today: THIS IDEA OF TRANSFER IS NOT EVEN ON OUR CURRENT RADAR SCREEN! And it ought to be! When I grew up academically, transfer of learning was regarded as the gold standard; hence tests of transfer are also the gold standard of assessment. What you really want is that proximal-to-distal, near-to-far continuum of assessments.

61 Domain representation
If we teach to the standards and the assessments, will we guarantee that all important aspects of the curriculum are covered? Linn and Shepard study: improvements on a narrow assessment do not transfer to other assessments. Shepard et al.: in high-stakes districts, high performance on consequential assessments comes at a price... First, it makes sense to base our instruction on a set of standards only if we can assume that the set of standards we have developed provides a complete representation of the curriculum domain in question. Lacking some guarantee that all important aspects of the curriculum are covered, it would be foolhardy to limit what we do in classrooms to a set of standards. Just as surely, we do not want to develop a laundry list of skills.

62 Linn and Shepard’s work...
[Chart: scores plotted by year; legend: new standardized test vs. old standardized test.] So even with the best of intentions, there can be a kind of covert, insidious teaching to the test that goes on.

63 Shepard et al. work. ST = consequential standardized assessment; AA = more authentic assessment of the same skill domain. [Chart: ST and AA results for low-stakes schools and high-stakes schools.] Note the consequences of high stakes on alternative assessments.

64 Key Concept (Haladyna). Test score pollution: a rise or fall in a score on a test without an accompanying rise or fall in the cognitive or affective outcome allegedly measured by the test.

65 Aligning everything to the standards: A model worth rejecting
[Diagram: assessment drives instruction.] This model is likely to shape the instruction too narrowly and lead to test score pollution.

66 A better way of thinking about the link between standards, instruction and assessment
Standards: how we operationalize our values about teaching and learning. They guide the development of both instruction and assessment: teaching and learning activities on one side, assessment activities on the other. By the way, in a piece that I did with Monica Yoo and Terry Underwood examining the secondary standards, curriculum, and assessments in Massachusetts, California, and Texas, we discovered that it was not the standards or even the curricula that drove teachers and schools nuts; it was the tests. The tests did NOT measure the standards, particularly the higher-order ones, well or even at all. And they certainly did not do justice to the curriculum. This relationship can operate at the regional or local level. The logic of lots of good reform projects!

67 Pearson’s Fifth Law of Assessment
Alignment is a double-edged sword. If there must be alignment, lead with the instruction and let the assessment follow. If the assessments are aligned to the instruction, things will work out. If instruction is aligned to the assessment, pollution will occur.

68 Pearson's Sixth Law. High stakes will corrupt any assessment, no matter how virtuous or pure in intent. It's the stakes that drive us to madness and distraction. Teaching to the test. Packing to the portfolio.

69 Corollary to Pearson’s Fifth and Sixth Laws
The worst possible combination is high stakes and low challenge. Why? It drags us all to the bottom of our pool of aspirations.

70 So how did we do in responding to the challenges from Valencia & Pearson?
Issue (grade): comment.
Prior knowledge (D): choice of passages.
Authentic text (B+): things are lots better on lots of comprehension assessments.
Inference (B): depends on the test.
Diversity in knowledge means diversity in response: constructed response and multiple correct answers, or graded answers.
Flexible use of strategies (C): hard to assess, easy to coach; I'd abandon it except for diagnostic interviews.
Synthesizing information is paramount: still too much emphasis on details.

71 So how did we do in responding to the challenges from Valencia & Pearson?
Asking questions as an index of comprehension (D): no progress except in informal classroom assessment.
Measuring habits, attitudes, and dispositions (C): some reasonable things out there, but no teeth.
Orchestrating many skills: too many mastery skills, not enough growth skills.
Fluency: made a fetish out of it.
Transfer and application: limited to a few situations.
Overall grade: lots of work to do.

72 Where should we be headed?
So, what makes sense for a district or school? Develop an educational improvement system

73 Elements of an Educational Improvement System
Standards, yes.
Assessments, yes:
Outcome assessments for program evaluation
Benchmark assessments for monitoring individual progress
"Closer look" diagnostic assessments for determining individual student emphases
Reporting system, yes, as long as we are prepared to live with the dilemmas of disaggregation.
Alignment, but of a different sort.

74 Outcome assessments (slides available at www.scienceandliteracy.org)
They drop in out of the sky: curriculum independent, not directly linked to the curriculum. They assess reading in its most global aspects, as growth constructs, NOT mastery constructs. They could be some sort of standardized assessment, as long as they are not taught to.

75 A plan for early reading benchmark assessments
Every so often, give four benchmark assessments. Still trying to figure out how to work in vocabulary.

76 Benchmarks for Intermediate and Secondary
Comprehend / Deconstruct (what do authors do, and why?) / Compose:
Narratives: response to literature / author's craft / creative writing.
Information genres: summaries, charts, key ideas / genre (form follows function) / writing from sources to convey ideas.

77 Closer Look Assessments
There is no sin in examining the infrastructure of reading. We really do need to know which of those pieces kids have and have not mastered. The question is what to do about them: teach to and practice the weak bits; rely on strengths to bootstrap the weaknesses; or just read more "just right" material. Do the weak skills get better if we bootstrap them to the strengths or just do more orchestrated enactments of the whole process, i.e., just plain reading? I'd do all three.

78 The flaw in teaching to weaknesses
The Basic Skills Conspiracy of Good Intentions: first you gotta get the words right and the facts straight before you can do the what-ifs and I-wonder-whats. Some kids spend their entire school careers getting ready for, but never doing, the what-ifs and I-wonder-whats.

79 Monitoring Conditions of Instruction
Collect data on curriculum and instructional practices. We need clear data on the enacted curriculum and instructional practices in order to link them as precisely as possible to achievement. Use the data for program improvement and to design professional development. This is often overlooked, but at our own peril. The best work in this area is Barbara Taylor's at the University of Minnesota. Look in particular for evidence of these curricular and instructional practices: higher-order thinking, deep knowledge, substantive conversation, and connection to the world beyond the classroom. Use data to design new instructional and staff development programs, with professional development tied to standards. One other point: in our cases of effective educational improvement systems, one of the common characteristics is internally developed systems for monitoring student progress, both within and across grades.

80 Return to the hard work on assessment
Encouraged by recent funding of new-century assessments. Some good could come out of our Reading for Understanding assessment grants in the US. Possibilities in the Australian work: NAPLAN?? Tests that take the high road (tests worth teaching to): a focus on making and monitoring meaning; on the role of reading in knowledge building and the acquisition of disciplinary knowledge; on critical reasoning and problem solving; and on representation of self. Assessment is something you do to and for yourself because it helps you outgrow your current self. The unfinished business from the 1990s.

81 Where Could we Be Headed: A Near Term Research Agenda
The development of more trustworthy, more useful curriculum-based assessments:
Expanding the logic of the informal reading inventory
Getting comprehension assessment right
Computerized assessments (yes, but no time today)

82 Expanding the logic of the IRI
Benchmark books model, à la Reading Recovery. Indices of: the level of text one can read independently; accuracy (including error patterns); fluency; comprehension. Not one, not two, not three, but many, many conceptually and psychometrically comparable passages at every level of text challenge.

83 Comprehension Assessment
Our models for external assessment, modeled after some of the better wide-scale assessments, are OK. We desperately need a school/classroom tool that does for comprehension what running records/benchmark books have done for oral reading accuracy and fluency

84 Disciplinary Grounding
We're much better off if we ground our comprehension assessments in the inquiry and knowledge traditions of the disciplines rather than in generic passages divorced from any discipline.

85 Pearson’s (bet on a) Seventh Law of Assessment
Comprehension assessment begins and ends within the knowledge traditions and inquiry processes of each discipline

86 Pearson's (bet on a) Corollary to the Seventh Law
Summative (big, external) assessments of reading comprehension will be better if they begin as formative (smaller, internal) assessments of reading comprehension within the knowledge traditions and inquiry processes of each discipline. In other words, figure out how to assess comprehension in a way that respects the disciplinary bases of science, history, mathematics, and literature, and then we can develop a good general test of comprehension by sampling from those really well-grounded formative assessments.

87 My bottom line Tests that are
Instructionally sensitive. Psychometrically sound. Trustworthy. No decision of consequence should be based upon a single indicator. Tests are a means to an end. We desperately need instructionally sensitive assessments with first-rate psychometric characteristics so that we can build trustworthy internal systems for monitoring student progress. No decision of consequence about any individual, school, district, or other aggregation should be based upon a single indicator of anything. Tests are a means to an end: their value is measured by the degree to which they allow us to make good decisions and provide good instruction. They are not ends in themselves. They are NOT curriculum.

88 To reduce it to a single idea
Six, maybe seven laws. Two, maybe three corollaries. But only one thing truly worth remembering: Never send a test out to do a curriculum's job! Thanks for spending your valuable time with me today. It is your most precious gift, and one ought not to waste it. Take care; I hope the rest of your day, your conference, your stay in New Zealand, your school year, and your life are filled with satisfying work, students who value reading, and a teaching life that promotes your students' opportunity to become literate citizens of our global community.

89 Coda in Stuart McNaughton’s Spirit
A new bumper sticker with a tinge of optimism: Tests in support of teaching and learning. And with that cheery if implausible thought, I'll truly say thank you. Kia ora.


91 Computerized Assessment
With advances in voice recognition, we are close to being able to teach computers to recognize and score students' oral responses. Applications: listen to oral reading of benchmark passages and conduct a first-level diagnosis (thus eliminating a key barrier, time, to more widespread use of this important diagnostic tool). Mention the new Reading for Understanding grants. A sketch of the scoring step follows.
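As a rough illustration of what that first-level diagnosis might compute, here is a minimal sketch (mine, not any actual system's): it assumes some recognizer has already turned the child's oral reading into a transcript string, and it uses a naive word-by-word match rather than the real alignment an ASR pipeline would need.

```python
# Score a child's oral reading of a benchmark passage against the target text.
# The transcript is assumed to come from a speech recognizer; here it is a string.
def score_oral_reading(passage: str, transcript: str, seconds: float):
    target = passage.lower().split()
    spoken = transcript.lower().split()
    # Count words read correctly, in order (naive positional match, no alignment).
    correct = sum(1 for t, s in zip(target, spoken) if t == s)
    accuracy = correct / len(target)
    wcpm = correct / (seconds / 60)  # words correct per minute
    return accuracy, wcpm

passage = "the little dog ran down the hill to find his bone"
transcript = "the little dog ran down the hill to find this bone"
accuracy, wcpm = score_oral_reading(passage, transcript, seconds=12.0)
print(f"accuracy = {accuracy:.0%}, WCPM = {wcpm:.0f}")  # accuracy = 91%, WCPM = 50
```

A real system would need edit-distance alignment to handle insertions, omissions, and repetitions, which is exactly the error-pattern information a diagnostician would want.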

92 Computerized Assessment in Early Literacy
More applications of voice recognition: phonemic awareness tasks, word reading tasks, phonics tests (both real words and synthetic words). Comprehension assessment is still a way down the road because of the interpretive problem: the computer has to both listen to and understand the response. BARLA: Bay Area Reading and Listening Assessment.

