Uneducated Guesses: Three examples of how mistreating missing data yields misguided educational policy Howard Wainer National Board of Medical Examiners An Invited Talk Given to the Institute of Education Science in the Graduate School of Education of the University of Pennsylvania February 13, 2012
“In general we look for a new law by the following process. First we guess it. Then we compute the consequences of the guess to see what would be implied if this law that we guessed is right. Then we compare the result of the computation to nature, with experiment or experience, compare it directly with observation, to see if it works. If it disagrees with experiment it is wrong. In that simple statement is the key to science. It does not make any difference how beautiful your guess is. It does not make any difference how smart you are, who made the guess, or what his name is - if it disagrees with experiment it is wrong. That is all there is to it.” Richard P. Feynman (1964)
Outline
I. Introduction – Mistreating missing data can have a huge effect
A. Lombard’s most dangerous profession
B. Getting younger in Princeton’s cemetery
C. Wald’s model for armoring planes
II. Case 1. What happens if the SAT is made optional: Bowdoin College as an example.
III. Case 2. Allowing choice on exams
A. Some history – especially 1921 English
B. The mystery of 1968 AP Chemistry
C. Women suffer in 1988 US History
D. The only unambiguous solution to missing data
E. Indiana Jones and a wonderful workaround
1. 1989 Chemistry as proof of concept.
IV. Case 3. Using student test scores to evaluate teachers: Value-Added Models
A. VAM and missing scores – Gaming the system by using missing data imputations.
B. VAM and counterfactuals – How would Freddy have done if he hadn’t had Ms. Jones?
V. Conclusions
I will illustrate my talk today with three principal examples:
1. A September 2008 report published by the National Association for College Admission Counseling in which one of the principal recommendations was for colleges and universities to reconsider requiring the SAT or the ACT for applicants.
2. Increasingly often, ‘standardized’ exams provide a set of possible questions and allow the examinee to pick which ones to answer.
3. “Race to the Top” provides funds to states that amend their educational system in specific ways. But all must somehow use the change in student test scores to evaluate teachers.
In all three of these, the issue of missing data looms large. The issue of missing data is too often assumed to be a small technical one that is not likely to have any serious effect, even by people who ought to know better. How we understand and treat missing data can have an enormous effect on the conclusions we draw.
MD3. Bullet holes and a model for missing data (from Abraham Wald)
Example 1. National Association for College Admission Counseling’s September 2008 report on admissions testing. On September 22, 2008, the New York Times carried the first of three articles about a report, commissioned by the National Association for College Admission Counseling, that was critical of the current, widely used, college admissions exams, the SAT and the ACT. The commission was chaired by William R. Fitzsimmons, the dean of admissions and financial aid at Harvard. The report was reasonably wide-ranging and drew many conclusions while offering alternatives. Although well-meaning, many of the suggestions only make sense if you say them very fast.
Among their conclusions were:
1. Schools should consider making their admissions “SAT optional,” that is, allowing applicants to submit their SAT/ACT scores if they wish, but not requiring them. The commission cites the success that pioneering schools with this policy have had in the past as proof of concept.
2. Schools should consider eliminating the SAT/ACT altogether and substituting achievement tests instead. They cite the unfair effect of coaching as the motivation for this. They were not naïve enough to suggest that, because there is no coaching for achievement tests now, none would appear if those tests became high stakes; rather, they argued that such coaching would be directly related to schooling and hence more beneficial to education than coaching that focuses on test-taking skills.
3. The use of the PSAT with a rigid qualification cut-score for such scholarship programs as the Merit Scholarships should be halted immediately.
Recommendation 1. Make SAT optional: It is useful to examine those schools that have instituted “SAT Optional” policies and see whether the admissions process has been hampered in those schools. The first reasonably competitive school to institute such a policy was Bowdoin College, in 1969. Bowdoin is a small, highly competitive, liberal arts college in Brunswick, Maine. A shade under 400 students a year elect to matriculate at Bowdoin, and roughly a quarter of them choose not to submit SAT scores. The following table summarizes the entering classes at Bowdoin and five other institutions whose entering freshman classes had approximately the same average SAT score. At the other five institutions the students who didn’t submit SAT scores used ACT scores instead.
Table 1 : Six Colleges/Universities with similar observed mean SAT scores for the entering class of 1999.
To know how Bowdoin’s SAT policy is working we need to know two things:
1. How did the students who didn’t submit SAT scores do at Bowdoin in comparison to those who did submit them?
2. Would the non-submitters’ performance at Bowdoin have been better predicted by their SAT scores, had the admissions office had access to them?
The first question is easily answered by looking at their first year grades at Bowdoin.
But would their SAT scores have provided information missing from the other submitted information? This would depend on why these students chose not to submit their scores. Some possibilities are:
1. If I don’t need to submit them, why bother to take them?
2. I took them, and did really well, but so what?
3. I took them, but did worse than the typical student who was accepted by Bowdoin in the past. Submitting them wouldn’t help my cause.
Although we may have some opinions on the likelihood of each of these options, under typical circumstances we have no data to help us decide, for these students did not submit their SAT scores.
However, all of these students actually took the SAT, and through a special data-gathering effort at the Educational Testing Service we found that the students who didn’t submit their scores behaved sensibly. They realized that their lower-than-average scores would not help their cause at Bowdoin, and hence chose not to submit them. Here is the distribution of SAT scores for those who submitted them as well as for those who did not.
As it turns out, the SAT scores for the students who did not submit them would have accurately predicted their lower performance at Bowdoin. In fact, the correlation between grades and SAT scores was higher for those who didn’t submit them (0.9) than for those who did (0.8).
So not having this information does not improve the academic performance of Bowdoin’s entering class – on the contrary, it diminishes it. Why would a school opt for such a policy? Why is less information preferred to more?
There are surely many answers to this, but one is seen in an augmented version of the earlier Table 1: we see that if all of the students in Bowdoin’s entering class had had their SAT scores included, the average SAT at Bowdoin would shrink from 1323 to 1288, and instead of being second among these six schools Bowdoin would have been tied for next to last.
Since mean SAT scores are a key component in school rankings, a school can game those rankings by allowing its lowest-scoring students to be left out of the average. I believe that Bowdoin’s adoption of this policy pre-dates US News & World Report’s rankings, so that was unlikely to have been their motivation, but I cannot say the same thing for schools that have chosen such a policy more recently.
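The size of the incentive can be back-calculated from the two means just mentioned. A brief sketch (the implied non-submitter mean is my arithmetic, not a figure from the talk or the report):

```python
# Back-of-the-envelope check of how excluding non-submitters inflates a
# reported mean SAT. The 1323 / 1288 figures and the roughly 25%
# non-submitter share come from the talk; the rest is simple arithmetic.
p_nonsubmit = 0.25      # share of the class not submitting scores
reported_mean = 1323    # mean over submitters only
full_mean = 1288        # mean had everyone's scores been included

# full_mean = (1 - p) * reported_mean + p * nonsubmitter_mean,
# solved for the implied mean among the non-submitters:
nonsubmitter_mean = (full_mean - (1 - p_nonsubmit) * reported_mean) / p_nonsubmit
print(round(nonsubmitter_mean))  # 1183
```

A 35-point bump in the reported mean, bought entirely by leaving roughly 100 students out of the denominator.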
Is inferring such a nefarious goal just the paranoid ravings of an aging cynic? Or are colleges actively engaged in trying to game college rankings?
Some evidence:
1. The January 31, 2012 NY Times reported that Richard C. Vos, VP and dean of admissions of Claremont McKenna College, has for the past six years been adding points to the mean SAT scores that the school reported to USN&WR.
2. The February 1, 2012 NY Times reported that Iona College “has lied for years about test scores, graduation rates, freshman retention, student-faculty ratio, acceptance rates and alumni giving.”
3. “Baylor University paid admitted students to retake the SATs in hopes of increasing scores.” This seems like an inefficient approach – easier, cheaper, and surer to use Claremont’s approach and just falsify them.
Case 2. Allowing choice on exams If you allow choice, you will regret it; if you don't allow choice, you will regret it; whether you allow choice or not, you will regret both. (Søren Kierkegaard, 1986, p. 24)
It is common practice to allow choice on exams. Why? If a test is made up of multiple-choice questions, answering any one of them takes very little time, and so there can be lots of them. If we ask essay questions, or other kinds of big problems, it is impractical to ask more than a few of them, and so some students may be disadvantaged by the specific topics selected.
So we offer a choice: “Answer 2 of the following 6.” Is this a good idea? Historically, such an approach was most common almost a century ago, but its popularity rapidly declined. It is currently enjoying a resurgence.
Number of possible test forms generated by examinee choice patterns in College Entrance Exams

Year    Chemistry    Physics      English    German
1905            5         48          164         1
1909           18        108           60         1
1913            8         14       47,260         1
1917          252      1,620    1,587,600         1
1921          252        216    2,960,100         1
1925          126         56           48         6
1929           20         56           90         1
1933           20         10           24         1
1937           15          2            1         1
1941            1          1            1         1
How did they arrive at the unlikely number of test forms for the 1921 English exam?
Section I - Answer 1 of 3 questions; 3 forms.
Section II - Answer 5 of 26 questions; 65,780 forms.
Section III - Answer 1 of 15 questions; 15 forms.
3 × 65,780 × 15 = 2,960,100. Voila!
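The count is easy to verify mechanically; a quick sketch using Python's built-in combinatorics:

```python
# Recomputing the 1921 English figure: the number of self-assembled test
# "forms" is the product of each section's number of possible choice patterns.
from math import comb, prod

forms = [comb(3, 1),    # Section I:   answer 1 of 3  -> 3 forms
         comb(26, 5),   # Section II:  answer 5 of 26 -> 65,780 forms
         comb(15, 1)]   # Section III: answer 1 of 15 -> 15 forms

print(forms)        # [3, 65780, 15]
print(prod(forms))  # 2960100
```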
Are choice items of equal difficulty? Average scores on the 1968 AP Chemistry exam suggest not. While the two groups of examinees (those who chose Problem 4 and those who chose Problem 5) had about the same scores on the common multiple-choice (MC) section (11.7 vs. 11.2 out of a possible 25), their scores on the chosen problem were very different (8.2 vs. 2.7 on a 10-point scale).
There are several possible conclusions to be drawn from this; four among them are:
1. Problem 5 is a good deal more difficult than Problem 4.
2. Small differences in performance on the multiple-choice section translate into much larger differences on the free response questions.
3. The proficiency required to do the two problems is not strongly related to that required to do well on the multiple-choice section.
4. Item 5 is selected by those who are less likely to do well on it.
The only unambiguous data on choice and difficulty: Xiang-bo Wang and his colleagues repeatedly presented examinees with a choice of two items, but then required them to answer both.
The proportion of students getting each item correct shown conditional on which item they preferred to answer
The conclusion drawn from many results like this is that as examinees’ ability increases they tend to choose more wisely – they know enough to be able to determine which choices are likely to be the least difficult. As ability declines, choice becomes closer and closer to random. On average, lower-ability students, when given choice, are more likely to choose more difficult items than their competitors at the higher end of the proficiency scale. Thus allowing choice will tend to exacerbate group differences.
How can we allow choice? Adjust for differential difficulty after administering items to random samples of examinees - equate (but that makes the examinee’s job more difficult). And, if we are successful, it renders choice unnecessary. OR
What if we make the choice part of the test? But choose wisely, for while the true Grail will bring you life, the false Grail will take it from you. –Grail Knight in Indiana Jones and the Last Crusade, 1989
The alternative to trying to make all examinee-selected options within a choice question of equal difficulty is to consider the entire set of questions with choices as a single item. Thus the choice is part of the item. If you make a poor choice and select an especially difficult option to respond to, that is considered in exactly the same way as if you wrote a poor answer.
Under what circumstances is this a plausible and fair approach?
1. We must believe that choosing wisely uses the same knowledge and skills that are required for answering the question.
2. The choice must be made by the examinees and not by their teachers.
If we agree to adopt this strategy a remarkable result ensues! Let us consider data from Section D of the 1989 Advanced Placement Examination in Chemistry. Section D has five problems (Problems 1, 2, 3, 4 and 5) of which the examinee must answer just three. ETS calculates the reliability of Section D as 0.60.
Scores of examinees as a function of the problems they chose
Suppose we think of Section D as a single ‘item’ with an examinee falling into one of ten possible categories, and the estimated score of each examinee is the mean score of everyone in their category. How reliable is this one item test?
After doing the appropriate calculation we discover that the reliability of this ‘choice item’ is 0.15. While 0.15 is less than 0.60, it is also larger than zero, and it is easier to obtain. We don’t have to score the examinees’ answers, we just note which problems they chose. In fact, they don’t even have to answer them – just indicate which three they would answer, if they were forced to.
Of course, with a reliability of only 0.15, this is not much of a test. But suppose we had two such items, each with a reliability of 0.15? This new test would have a reliability of 0.26. And, to get to the end of the story, if we had eight such ‘items’ it would have a reliability of 0.60, the same as the current form. Such a test would be easier on examinees and much cheaper for the testing company. A win-win.
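These figures are consistent with the standard Spearman-Brown prophecy formula for a test assembled from parallel parts; a sketch:

```python
# Spearman-Brown prophecy formula: the reliability of a test assembled
# from k parallel parts, each with reliability r.
def spearman_brown(r: float, k: int) -> float:
    return k * r / (1 + (k - 1) * r)

print(round(spearman_brown(0.15, 2), 2))  # 0.26
print(round(spearman_brown(0.15, 8), 2))  # 0.59 -- essentially the 0.60 of the current form
```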
This is what I like best about science, with only a small investment in fact, we can garner such huge dividends in conjecture.
Case 3. Using student test scores to evaluate teachers. “Some professors are justly renowned for their bravura performances as Grand Expositor on the podium, Agent Provocateur in the preceptorial, or Kindly Old Mentor in the corridors. These familiar roles in the standard faculty repertoire, however, should not be mistaken for teaching, except as they are validated by the transformation of the minds and persons of the intended audience.” “Good teachers evaluate themselves with a pitiless gaze and measure their successes not by their virtuosity as performers but by their contribution to the transformation of students.” (Marvin Bressler, 1991)
Value Added Models (VAMs)

y_i1 = μ_1 + τ_1 + ε_i1    (1)
y_i2 = μ_2 + τ_1 + τ_2 + ε_i2    (2)

where μ_t is the overall mean in year t, τ_t is the teacher effect in year t, and ε_it is student i’s error in year t. Hence the change, the value-added, is simply the difference between the scores from year 1 to year 2, or

y_i2 - y_i1 = (μ_2 - μ_1) + τ_2 + (ε_i2 - ε_i1)    (3)
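A minimal simulation of the gain-score model above (every parameter value below is invented for illustration): the mean gain over many students recovers the year-to-year shift plus the year-2 teacher effect, while any single student's gain is dominated by the error terms.

```python
# Simulating the gain-score value-added model; the mu/tau/sigma values
# are arbitrary choices for illustration, not estimates from real data.
import random

random.seed(0)
mu1, mu2 = 50.0, 55.0    # year means
tau1, tau2 = 2.0, 3.0    # teacher effects in years 1 and 2
sigma = 5.0              # error standard deviation
n = 100_000              # simulated students

gains = []
for _ in range(n):
    y1 = mu1 + tau1 + random.gauss(0, sigma)
    y2 = mu2 + tau1 + tau2 + random.gauss(0, sigma)
    gains.append(y2 - y1)

mean_gain = sum(gains) / n
print(round(mean_gain, 1))  # close to (mu2 - mu1) + tau2 = 8.0
```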
“The child in me was delighted. The adult was skeptical.” Saul Bellow, 1977. “I was impressed, not because it did it well, but that it could do it at all.” Samuel Johnson after watching a dog walk on its hind legs.
There are many challenges to be overcome before such models are ready for widespread use. Principal among them are:
(i) psychometric issues in both the construction and scoring of tests that allow comparisons over large ranges and across different subjects;
(ii) statistical issues dealing with stability of estimates and biases introduced by missing data;
(iii) epistemological issues associated with drawing causal conclusions without the need for heroic assumptions.
Today I will focus on only two missing data issues:
(i) When, in the ordinary course of school examinations, a student’s score is missing for either the pre-test, the post-test, or both.
(ii) The counterfactual data that are always missing: how the student would have performed had she had a different teacher.
There are two approaches to the first missing data problem currently being used with VAMs. The first is to only use those students with complete data and then to assume that the estimates of the teacher effects thus computed are OK (e.g. assume missing data are missing-at-random). If Abraham Wald had used this assumption he would’ve arrived at exactly the opposite conclusion -- add more armor where there were holes.
The more common method is to impute a score for those missing one, based on the mean of those who have them (conditioned on some covariates). This does not change the marginal means (but the higher moments are wrong), and it can be successfully gamed. Field trips! (Send your weakest students on a field trip on test day, and the mean-based scores imputed for them will be higher than the scores they would actually have earned.)
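A two-line demonstration of why mean imputation leaves the mean alone but distorts the higher moments (the score values are invented):

```python
# Mean imputation: fill each missing score with the observed mean, then
# compare the moments of the "completed" data with the observed data.
import statistics

observed = [40, 45, 50, 55, 60, 65, 70]   # hypothetical observed scores
n_missing = 3                             # students with no score

m = statistics.mean(observed)
completed = observed + [m] * n_missing    # mean imputation

print(statistics.mean(completed) == m)    # True: the mean is unchanged
print(statistics.pstdev(completed) < statistics.pstdev(observed))  # True: variance shrinks
```

Piling imputed cases on the mean shrinks the spread, which is exactly what makes the method gameable: a missing weak student is silently upgraded to average.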
A sub-problem of missing scores is when students choose not to answer some items. This happens frequently when a student’s performance on a test has no direct impact on the student (e.g. NAEP or new teacher evaluation exams like those just adopted in NY). If the test has no immediate impact on students (and for HS seniors, even if it does) they tend not to try very hard.
For evidence look at non-response rates in NAEP: Non-response increases with student age (younger students try harder). Non-response varies with item type (multiple choice items are answered much more frequently than constructed response/essay type questions). Non-response varies with ethnicity (Asian and Jewish students are less likely to omit). Non-response varies with location (students in South Dakota answer more often than those in California and Hawaii). Non-response rates can run as high as 70% for 11th grade essay items in Hawaii.
Imagine the scenario – students are told that they may leave when they are done, and that the test doesn't count toward their grade. Then they are asked to write an essay or two. You can see that if the surf’s up they are unlikely to hang around long.
This issue came up in California some years ago in the "Cash for CAP" program. One HS senior class asked the principal for a share of the money that would come to the school, if they did well, to subsidize their prom (in Ojai, I believe). The principal rejected this and indicated that the money had been slated for improved computing resources. Most of the senior class handed in blank essays coupled with randomly selected options on the multiple-choice items, and left early.
And finally, the biggest challenge, causal inference. VAM is not interested in descriptive statements like: “Freddy gained 10 points when he was in Ms. Smith’s class.” No, the goal is to make causal statements like: “Freddy gained 10 points because he was in Ms. Smith’s class.”
To understand the challenge of causal inference we first need some epistemology: “counterfactual conditional” is a term that refers to any expression of the general form: “If A were the case, then B would be the case.” This is the conditional part. The counterfactual part is that A must be false or untrue in the world.
Some examples:
1. “If kangaroos had no tails, they would topple over.”
2. “If an hour ago I had taken two aspirins instead of just a glass of water, my headache would now be gone.”
And, perhaps the most obnoxious counterfactuals are those of the form: 3. “If I were you, I would....”
Hume’s famous discussion of causation: “we may define a cause to be an object followed by another, and where all the objects, similar to the first, are followed by objects similar to the second, and, where, if the first object had not been, the second would never have existed.”
Let us return to student testing. Suppose that we find that a student’s test performance changes from a score of X to a score of Y after some educational intervention. We might then be tempted to attribute the pretest-posttest change, Y – X, to the intervening educational experience — i.e., to use the gain score as a measure of the improvement due to the intervention. This is the essence of VAM.
There are many other possible explanations of the gain, Y – X. Some of the more obvious are:
i. simple maturation (e.g. Freddy grew 5 inches when he was in Ms. Smith’s class);
ii. other educational experiences occurring during the relevant time period; and
iii. differences in either the tests or the testing conditions at pre- and post-tests.
From Hume we see that what is important is what the value of Y would have been had the student not had the educational experiences that the intervention entailed. Call this score value Y*. Thus enter counterfactuals.
Y* is not directly observed for the student, i.e., she did have the educational intervention of interest, so asking what her post-test score would have been had she not had it is asking for information collected under conditions that are contrary to fact. Hence, it is not the difference Y – X that is of causal interest, but the difference Y – Y*, and the gain score has causal significance only if X can serve as a substitute for the counterfactual Y*.
Conclusions
1. Missing data are an unavoidable complication.
2. Ignoring them (assuming missing-at-random) doesn’t often lead to a happy outcome.
3. The best solution, if possible, involves a special data-gathering effort (e.g. Bowdoin’s SAT scores or examinees’ performance on the items they did not choose to answer). This may not be practical on a large scale – so we must pay careful attention to those complete data sets when we have them.
Conclusions (2)
4. When gathering the missing data is not possible (e.g. the counterfactual performance of students had they had a different teacher) we must use all the tools at our disposal (randomization if we can, the various techniques of good observational studies when we can’t).
5. And ALWAYS remember that we must be modest in our claims when the uncertainty induced by what we did not observe is of the same order of magnitude as the phenomena suggested by what we did observe.
We must remember the wisdom of Sir Josiah Charles Stamp (1880-1941): “The government [is] extremely fond of amassing great quantities of statistics. These are raised to the nth degree, the cube roots are extracted, and the results are arranged into elaborate and impressive displays. What must be kept ever in mind, however, is that in every case, the figures are first put down by a village watchman, and he puts down anything he damn well pleases.”
"For a successful technology, reality must take precedence over public relations, for nature cannot be fooled." Richard P. Feynman
In their search for the Holy Grail, both Walter Donovan and Indiana Jones arrived at the Canyon of the Crescent Moon with great anticipation. But after all of the other challenges had been met, the last test involved choice. The unfortunate Mr. Donovan chose first, and in the words of the Grail Knight, “He chose poorly.” The consequences were severe.
Recommendation 2. Using achievement tests instead. Driving the Commission’s recommendations was the notion that the differential availability of commercial coaching made admissions testing unfair. They recognized that the 100-point gain (on the 1200-point SAT scale) that coaching schools often tout as a typical outcome was hype, and agreed with estimates from more neutral sources that a gain of about 20 points was more likely. But they deemed even 20 points too many. The Commission pointed out that there is no widespread coaching for achievement tests, but agreed that should the admissions option shift to achievement tests, the coaching would likely follow. This would be no fairer to those applicants who could not afford extra coaching, but at least the coaching would cover material more germane to the subject matter and less related to test-taking strategies.
One can argue with the logic of this – that a test that is less subject-oriented and related more to the estimation of a general aptitude might have greater generality. And that a test that is less related to specific subject matter might be fairer to those students whose schools have more limited resources for teaching a broad range of courses. I find these arguments persuasive, but I have no data at hand to support them. So instead I will take a different, albeit more technical, tack – the psychometric reality associated with replacing general aptitude tests with achievement tests makes the kinds of comparisons that schools need among different candidates impossible.
When all students take the same tests we can compare their scores on the same basis. The SAT and ACT were constructed specifically to be suitable for a wide range of curricula. SAT-Math is based on mathematics no more advanced than 8th grade. Contrast this with what would be the case with achievement tests. There would need to be a range of tests, and students would choose a subset of them that best displayed both the coursework they have had and the areas they felt they were best in. Some might take chemistry, others physics; some French, others music. The current system has students typically taking three achievement tests (SAT-II). How can such very different tests be scored so that the outcomes on different tests can be compared?
Do you know more French than I know physics? Was Mozart a better composer than Einstein was a physicist? How can admissions officers make sensible decisions through incomparable scores?
How are SAT-II exams scored currently? Or, more specifically, how were they scored for the decades up until I left the employ of ETS nine years ago – I don’t know if they have changed anything in the interim. They were all scored on the familiar 200-800 scale, but similar scores on two different tests are only vaguely comparable. How could they be comparable? What is currently done is that tests in mathematics and science are roughly equated using the SAT-Math, the aptitude test that everyone takes, as an equating link. In the same way, tests in the humanities and social sciences are equated using the SAT-Verbal. This is not a great solution, but it is the best that can be done in a very difficult situation. Comparing history with physics is too imprecise to be worth doing for anything but the roughest purposes.
One obvious approach would be to norm-reference each test, so that someone who scores at the average for all those who take a particular test gets a 500, someone a standard deviation higher gets a 600, etc. This would work if the people who take each test were, in some sense, of equal ability. But that is not only unlikely, it is empirically false. The average student taking the French achievement test could starve to death on the Boulevard Raspail, whereas the average person who takes the Hebrew achievement test, if dropped onto the streets of Tel Aviv in the middle of the night, would do fine. Happily, the latter students also do much better on the SAT-Verbal test, and so the equating helps. This is not true for the Spanish test, where a substantial portion of those taking it come from Spanish-speaking homes.
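The norm-referencing just described amounts to a linear z-score rescaling; a sketch with hypothetical raw scores:

```python
# Norm-referencing as described above: the mean of a test's takers maps to
# 500 and each standard deviation to 100 points. Raw scores are invented.
import statistics

def norm_reference(scores):
    m = statistics.mean(scores)
    s = statistics.pstdev(scores)
    return [round(500 + 100 * (x - m) / s) for x in scores]

raw = [12, 15, 18, 21, 24]       # hypothetical raw scores on one test
print(norm_reference(raw))       # [359, 429, 500, 571, 641]
```

The rescaling is done within each test's taker population, which is exactly the problem: a 600 in French and a 600 in Hebrew are each "one standard deviation above" very different groups.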
Substituting achievement tests is not a feasible option unless admissions officers are prepared to have subject matter quotas. Too inflexible for the modern world I reckon.
Recommendation 3. Halt the use of a cut-score on the PSAT to qualify for Merit Scholarships One of the principal goals of the Merit Scholarship program is to distribute a limited amount of money to highly deserving students without regard to their sex, ethnicity, or geographic location. This is done by first using a very cheap and wide ranging screening test. The PSAT is dirt-cheap and is taken by about 1.5 million students annually. The Commission objected to a rigid cut-off on the screening test. They believed that if the cut-off was, say, at a score of 50, we could not say that someone who scored 49 was different enough to warrant excluding them from further consideration. They suggested replacing the PSAT with a more thorough and accurate set of measures for initial screening.
The problem with a hard and fast cut score is one that has plagued testing for more than a century. The Indian Civil Service system, on which the American Civil Service system is based, found a clever way around it. The passing mark to qualify for a civil service position was 20. But if you received a 19 you were given one ‘honor point’ and qualified. If you scored 18 you were given two honor points, and again qualified. If you scored 17, you were given three honor points, and you qualified. But if you scored 16 you did not qualify, for you were four points away.
I don’t know exactly what the logic was behind this system, but I might guess that experience had shown that anyone scoring below 17 was sufficiently unlikely to be successful in obtaining a position that it was foolish to include them in the competition. But having a sharp break at 16 might have been thought too abrupt, and so the method of honor points was concocted.
How does this compare with the Merit Scholarship program? The initial screening selects 15,000 (top 1%) from the original pool. These 15,000 are then screened much more carefully using both the SAT and ancillary information to select down to the 1,500 winners (the top 10% of the 15,000 semi-finalists).
Once this process is viewed as a whole, several things become obvious:
1. Since the winners are in the top 0.1% of the population, it is dead certain that they are all enormously talented individuals.
2. There will surely be many worthy individuals who were missed, but that is inevitable if there is only money for 1,500 winners.
3. Expanding the initial semifinal pool by even a few points will expand the pool of semi-finalists enormously (the normal curve grows exponentially), and those given the equivalent of some PSAT “honor points” are extraordinarily unlikely to win anyway, given the strength of the competition.
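Point 3 can be sketched with the normal upper tail. Assuming roughly normal scores (the mean, SD, and cut values below are invented round numbers, not actual PSAT parameters), lowering the cut by a few points balloons the semifinalist pool:

```python
# Upper-tail normal probabilities show how fast the semifinalist pool grows
# as the cut score drops. Mean 50, SD 10, and the cut values are assumptions.
from math import erf, sqrt

def frac_above(cut, mean=50.0, sd=10.0):
    """P(score > cut) under a normal(mean, sd) model."""
    z = (cut - mean) / sd
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

takers = 1_500_000
for cut in (73, 72, 71, 70):
    print(cut, round(takers * frac_above(cut)))
# Dropping the cut from 73 to 70 roughly doubles the pool.
```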
What about making the screening a more rigorous process – rather than just using the PSAT scores? Such a screening must be more expensive, and to employ it as widely would, I suspect, use up much more of the available resources, leaving little or nothing for the actual scholarships. The irony is that a system like that proposed by the Commission would either have to be much more limited in its initial reach, or it would have to content itself with giving out many fewer scholarships. Of course, one could argue that more money should be raised to do a better job in initial screening. I would argue that if more money were available, the same method of allocation should be continued, and the extra funds used to give out either more scholarships or bigger ones.
This completes more of the reasoning behind my initial conclusion that some of the recommendations of the Commission only made sense if you said them fast. I have tried to slow things down a bit.