Presentation on theme: "G.O. Wesolowsky Statistical Detection of Cheating on Multiple Choice Exams: Software, Implementation, and Controversy George O. Wesolowsky Professor Emeritus."— Presentation transcript:
Slide 1: Statistical Detection of Cheating on Multiple Choice Exams: Software, Implementation, and Controversy
George O. Wesolowsky, Professor Emeritus of Management Science
DeGroote School of Business, McMaster University, Hamilton, Ontario
Essentials:
- Bubbling in or selecting choices
- Unauthorized communication, one way or two way. Commonly known as copying, which is cheating.
Slide 2: Outline of this Presentation
- Introduction: Cheating on multiple choice tests
- How I got into this
- Outline of statistical detection methodology
- Practical capabilities of SCheck
- Common attitudes to detection and prevention
- Recommendations
Slide 3: Ideal* Writing Conditions
* I have seen more than 30% cheating under such conditions.
Slide 4: Less than Ideal Writing Conditions
Slide 5: Prevalence of MC Tests and Exams
- 30%? of marks in UG classes given through MC
- At McGill (20000 undergrads) in the Fall Semester of 2002:
  - Finals: 83 courses, 15072 students
  - Midterms: 70+ courses, [...] students
Slide 6: How They Do It: Copying Sampler
- Peeking
- Passing (papers or whole exams)
- Signaling. One invigilator told me a student was twitching so much all over his face and hand that he thought at first it was a seizure.
- Clandestine electronic communication
Figure: cartoon of a toilet communications base, from the web site of a Scandinavian company selling electronic countermeasures.
Slide 7: How They Do It: Types of Cheating not Resulting in Similar Responses
- I am an impostor
- Usually not vulnerable to statistical detection
Slide 8: A guide to cheating during tests and examinations
From Wikibooks, the open-content textbooks collection
Contents
1 Preamble
  1.1 A few definitions to consider
  1.2 The rewards/dangers of cheating
    1.2.1 Rationales of cheating
    1.2.2 Rationales of prosecuting cheaters
    1.2.3 Possible Penalties
2 General notes
3 Techniques
  3.1 Copying from a person
    3.1.1 Application of codes
  3.2 Copying from a pre-written source
    3.2.1 Directly from textbook/notes
    3.2.2 Cheat Sheet
  3.3 Precautions
  3.4 Copying from a planted source
  3.5 Locating Cheating Material on the Web
  3.6 Test previewing
Slide 9: Some Statistics plagiarized text (CAI)
CAI Research Conducted By Don McCabe (Released In June, 2005) is typical of many studies:
As part of CAI’s Assessment Project, almost 50,000 undergraduates on more than 60 campuses have participated in a nationwide survey of academic integrity since the fall of [...]. The results were disturbing, provocative, and challenging.
On most campuses, 70% of students admitted to some cheating. Close to one-quarter of the participating students admitted to serious test cheating in the past year and half admitted to one or more instances of serious cheating on written assignments.
Faculty are reluctant to take action against suspected cheaters. In Assessment Project surveys involving almost 10,000 faculty, 44% of those who were aware of student cheating in their course in the last three years have never reported a student for cheating to the appropriate campus authority. Students suggest that cheating is higher in courses where it is well known that faculty members are likely to ignore cheating.
Slide 10: One Method of Cheating Detection
My favorite method.
Slide 11: Questionable Statistical Detection
It is not infrequent that instructors, when confronted by a suspected cheating situation, invent their own methodology on the spot. This is usually what I call ‘outlier methodology’. The basis is some way of using the number of wrong answers that two students have in common.
It could be simply a count of such 'wrong matches', a proportion, a run length, a ratio with other counts, or a multivariate plot of such variables. The idea is to look for outliers and attribute them to cheating.
Slide 12: Example
Bonnie and Clyde engaged in suspicious behavior. A comparison of responses revealed:
“Bonnie and Clyde are surprisingly similar; 23 matches out of 23 wrong.”
.....C......B C BD D..DB.A..ABD.B..A..CB...B ABAC..A..C
Legend: C = both chose C, which is wrong; . = correct
Slide 13: But then: The instructor wrote a program:
“My script just returns any match that has a high percentage of matching errors (and sufficient errors to convince you that some thing's up!)”
“Holmes and Watson are surprisingly similar; 8 matches out of 11 wrong.”
D......D B.....D.A BBA BD.B.....CD..BCC....BB...DC..CA..D.EC.....D......AD...B.DDAB...B..A..A.AA....CDB.AA.C......BDDBE.B..
Intuitive override: “I had found by chance (Bonnie and Clyde), but what about the rest? It's very unlikely that Holmes and Watson were cheating, but I think it's likely that the others were.”
This instructor then concluded that statistical detection is not really reliable. Bad statistical detection often discredits the good.
Slide 14: Aside: A Better “quickie” Index
A better but not good simple index is the Harpp-Hogan index, which is the number of wrong matches divided by the number of differences. One is supposed to be suspicious when it is > 1. For Holmes and Watson this works out to 8/32 = 0.25, well below that threshold.
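The index described on this slide can be sketched in a few lines of Python. This is only an illustration of the definition given above (wrong matches divided by differences); the function name `harpp_hogan` and the answer strings are made up for the example and are not taken from SCheck or from any real exam.

```python
def harpp_hogan(answers_a, answers_b, key):
    """Return (wrong_matches, differences, index) for two answer strings.

    wrong_matches: questions where both students gave the same wrong answer
    differences:   questions where the two students answered differently
    index:         wrong_matches / differences (infinite if no differences)
    """
    if not (len(answers_a) == len(answers_b) == len(key)):
        raise ValueError("answer strings and key must have equal length")
    wrong_matches = sum(
        1 for a, b, k in zip(answers_a, answers_b, key) if a == b and a != k
    )
    differences = sum(1 for a, b in zip(answers_a, answers_b) if a != b)
    index = wrong_matches / differences if differences else float("inf")
    return wrong_matches, differences, index


# Hypothetical 10-question exam (made-up data)
key = "ABCDABCDAB"
student1 = "ABCCABDDBB"
student2 = "ABCCACDDAA"
print(harpp_hogan(student1, student2, key))
```

With the slide's own rule of thumb, a pair would be flagged only when the third value exceeds 1.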
Slide 15: Problems with “Simple” Indices or combinations thereof
- The value of the indices can depend in an unknown way on class size, number of questions, number of choices, etc.
- They use very little information. Capability of students and difficulty of questions are often not incorporated.
- The risk of “false accusations” is not predictable.
- Many combinations of indices and plots are possible, and they may point in different directions.
Slide 16: How I Got Into This
- Request from an administrator: Two students were suspected in another course; how many exactly similar answers did they have in my course?
- Probability tree diagrams
- Checked the literature
Wesolowsky G.O. (2000) "Detecting Excessive Similarity in Answers on Multiple Choice Exams", Journal of Applied Statistics, Vol. 27.
Slide 17: Probability tree for question i
Figure: probability tree for question i. Branch labels: pji and pki (probability correct), 1 - pji and 1 - pki (probability wrong), and w1i, w2i, w3i, w4i (conditional probability of each wrong answer). Branches on which both students give the same answer are marked "match".
- Probability of a match by students j and k on question i = sum of match probabilities
- A matching answer occurs on a question if both students get the same right answer or the same wrong answer
- What is needed are the probabilities of a correct answer for each student, and the conditional probability of a wrong answer
- Sum the product of the probabilities along each of the “match” branches
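Summing the products along the "match" branches can be written out as a short Python sketch. It assumes, as the tree labels suggest, that the two students answer independently and share the same conditional wrong-answer probabilities on a question; the function name `match_probability` and the numbers are illustrative only.

```python
def match_probability(p_j, p_k, wrong_shares):
    """Probability that students j and k give the same answer on one question.

    p_j, p_k:     probabilities that each student answers correctly
    wrong_shares: conditional probabilities of each wrong choice, given that
                  the answer is wrong (assumed common to both students)
    """
    if abs(sum(wrong_shares) - 1.0) > 1e-9:
        raise ValueError("wrong_shares must sum to 1")
    # Branch 1: both students pick the right answer.
    both_right = p_j * p_k
    # Remaining branches: both are wrong AND pick the same wrong choice.
    same_wrong = (1 - p_j) * (1 - p_k) * sum(w * w for w in wrong_shares)
    return both_right + same_wrong


# Made-up values: a five-choice question with four wrong choices
print(match_probability(0.8, 0.6, [0.5, 0.3, 0.1, 0.1]))
```

A very "popular" wrong answer (one large share) raises the match probability, which is why identical wrong answers on such a question are weaker evidence than identical rare wrong answers.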
Slide 18: Assumptions
- The probability that a student gets an answer right depends on the ability of the student and the difficulty of the question
- The probability of a match on wrong questions depends on the ‘popularity’ of wrong answers
- Independencies as implicit in the diagram
Slide 19: But how do we estimate wli and pji?
Slide 20: pji depends on two things
The ability of the student and the difficulty of the question.
Figure: curves for an above average student and a below average student, plotted against the proportion of the class that answered correctly on question i (axis ticks at 1).
Slide 21: Finding aj
cj = proportion of questions answered correctly by student j
1) aj is the student ability index
2) The function is borrowed from location theory.
3) For each student, the aj estimate is obtained by making sure that the modeled proportion correct is equal to the proportion correct actually obtained.
Find aj by solving [...]
Slide 22: P value for each pair of students = probability of the observed number of matches or more
Figure: Question 1, ..., Question j, ..., Question q, each marked M (match).
1) If the probability of a match on each question were the same, this p-value could be found by the binomial distribution. A Bernoulli process always has the same probability of success.
2) Here, the probability of a success (match on a question) is different for every question.
3) The probability distribution is called the compound binomial distribution, because the probability of a match is different on each question.
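The tail probability described on this slide can be computed exactly with a small dynamic program. The sketch below is an illustration under the slide's independence assumption, not SCheck's own calculation (the SCheck output reports an 'equivalent' z instead); the function name `tail_probability` and the input values are made up.

```python
def tail_probability(match_probs, observed):
    """P(number of matches >= observed) when question i produces a match with
    probability match_probs[i], independently of the other questions."""
    dist = [1.0]  # dist[m] = probability of exactly m matches so far
    for p in match_probs:
        nxt = [0.0] * (len(dist) + 1)
        for m, prob in enumerate(dist):
            nxt[m] += prob * (1 - p)   # no match on this question
            nxt[m + 1] += prob * p     # match on this question
        dist = nxt
    return sum(dist[max(observed, 0):])


# With equal probabilities this reduces to the ordinary binomial tail.
print(tail_probability([0.5, 0.5, 0.5], 2))
# Unequal probabilities: the compound binomial case from the slide.
print(tail_probability([0.9, 0.4, 0.25, 0.6], 4))
```

The loop costs on the order of q squared operations for q questions, which is negligible for exams of a few hundred questions.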
Slide 23: Example of SCheck Output
** pair = ** Harpp-Hogan stat = #wr.mat/#diff = 19.00
##################################################################
Zb = 'equivalent' z from the BVP model
Significance of Zb on a pre-selected pair = 1.5E-15
Significance bound (Bonferroni) on program selected pairs = 1.3E-11
#matches = 33 | 34 (mu,s)=( , )
prop. right for 2 = prop. right for 78 = 0.412
Quest. range = [ ] #students = 132
.d.abccd.e .e.abedb.. ...da..b.. ea.e
.d.abccdee .e.abedb.. ...da..b.. ea.e
estimated match probabilities:
Slide 24: Data Dredging
- The number of student pairs examined is n(n-1)/2. For 693 students this is 239,778 pairs.
- An oversight by many statistical detection methods. Consider a standardized normally distributed index of similarity, Z.
- It might seem that a Z of 4 would be very unusual. But a dotplot shows otherwise. With so many pairs, rare single draw occurrences become common.
- The level of suspiciousness has to be raised.
Figure: dotplot of Z values, with a region labeled "suspicious".
Slide 25: “Unusual” Z’s Depend on Class Size
Class size   No. of pairs   P(Zmax >3)   P(Zmax >4)   P(Zmax >5)   P(Zmax >6)
2            1              ...          3.167E-5     2.8665E-7    9.8659E-10
100          4950           ...          ...          ...          4.8836E-6
400          79800          ...          ...          ...          ...E-5
1000         499500         ...          ...          ...          ...
5000         ...            ...          ...          ...          ...
Analogy. Suppose the Z’s are independent.
The probability of seeing a Z > 5 for a single pair (a class of 2) is about 3 in 10 million.
But in a class of 5000, the probability is .97.
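The figures quoted on this slide can be reproduced with a short Python sketch. It assumes, as the slide's analogy does, that the pairwise Z values are independent standard normal draws; the function names are illustrative and are not part of SCheck.

```python
import math


def upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def n_pairs(class_size):
    """Number of distinct student pairs, n(n-1)/2."""
    return class_size * (class_size - 1) // 2


def prob_max_exceeds(class_size, z):
    """P(at least one of the n(n-1)/2 independent Z's exceeds z).

    Computes 1 - (1 - p)**pairs via log1p/expm1 so that very small
    tail probabilities are not lost to rounding.
    """
    pairs = n_pairs(class_size)
    return -math.expm1(pairs * math.log1p(-upper_tail(z)))


print(n_pairs(693))                         # pairs in a class of 693
print(f"{upper_tail(5):.4e}")               # single-pair chance of Z > 5
print(f"{prob_max_exceeds(5000, 5):.2f}")   # same event in a class of 5000
```

The same functions give the surviving table entries, for example the P(Zmax > 6) value for a class of 100.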
Slide 26: Multiply the P-value by n(n-1)/2
** pair = ** Harpp-Hogan stat = #wr.mat/#diff = 19.00
##################################################################
Zb = 'equivalent' z from the BVP model
Significance of Zb on a pre-selected pair = 1.5E-15
Significance bound (Bonferroni) on program selected pairs = 1.3E-11
#matches = 33 | 34 (mu,s)=( , )
prop. right for 2 = prop. right for 78 = 0.412
Quest. range = [ ] #students = 132
.d.abccd.e .e.abedb.. ...da..b.. ea.e
.d.abccdee .e.abedb.. ...da..b.. ea.e
estimated match probabilities:
A similarity this unusual will occur at most 1.3 times, on the average, per 100 billion classes.
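The Bonferroni line in this output is consistent with multiplying the pre-selected significance by the number of pairs in the class. A minimal Python check, using only the numbers shown on the slide (the function name `bonferroni_bound` is made up for the illustration):

```python
def bonferroni_bound(p_value, n_students):
    """Upper bound on the chance that any of the n(n-1)/2 pairs in a class
    reaches this significance by chance: p times the number of pairs."""
    pairs = n_students * (n_students - 1) // 2
    return min(1.0, p_value * pairs)


# Values from the SCheck output above: p = 1.5E-15, 132 students (8646 pairs)
print(f"{bonferroni_bound(1.5e-15, 132):.1e}")
```

The bound is capped at 1 because it is an upper limit on a probability, not an exact value.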
Slide 27: Important!
The significance (probability that a similarity that high will occur for an innocent pair) is different for a pair that is pre-selected by, say, suspicious behavior, from that of a pair that was selected purely by the program. In other words, the former case does not need as high a level of similarity evidence. SCheck, therefore, allows pre-selected pairs to be forced into the analysis.
Slide 28: Features of SCheck
Developed from experience with large scale testing, research into cheating psychology, tribunal cases, different data formats, etc.
- Up to [...] students
- Up to 200 questions
- Up to 27 choices, numbers or letters
- True or false or multiple choice in any combination
- Select a contiguous block of questions
- Option for pre-selected student pairs
- Option for similarity scores for all students
- Options for removing student identification from input and output files
Since 1998, quite a few features were added. Important ones are:
- New: Two Type I methods for setting cutoffs
- Adjustment for “speed tests”
- Interactive or stored option choice
- Batch processing of multiple files
- Optional Excel grades output
- Files with all components necessary for verification of calculations
- Diagnostic graph
- Optional fine tuning (T parameter)
- Compact and intuitive question diagnostics
- Utility programs (format translators)
Slide 36: This forces suspect pairs into the output for analysis.
Slide 37: 1) The students pre-chosen don’t have to be adjacent
Slide 38: This means that a false positive should occur in fewer than 1 in 100 classes or runs.
Slide 39: A marginal improvement in the model
Slide 40:
- Vertical red line indicates similarity cutoff. Position depends on class size.
- Straightness = normality
- Slope indicates stdev of Z’s
- Innocent class is symmetrical within the lines
- Typical class with no identified cheaters
- One possibly suspicious outlier, but outliers can happen by chance.
Slide 42: Forced pairs in NAM file
This pair was forced in as an illustration. The program would not have selected it otherwise. It has a negative Z and is hence not suspicious. Students are identified.
Slide 43: Forced pairs in OUT file
Students are not identified, but there is more information.
Slide 44: Diagnostics on questions
1) The program gives all the information necessary to manually check the model.
2) Also has tables to check if answer keys are correct, questions are not flawed, etc.
3) This is not item analysis but I think more useful.
Slide 45: It’s Cheating time
Finally, what if there were dark deeds done during the test?
Slide 46: Detected Pairs
Summary of significances of identified pairs
pair      Z     A Priori Signif.   Bonferroni Signif.
2, ...    ...   ...E...            ...E-11
2, ...    ...   ...E...            ...E-10
36, ...   ...   ...E...            ...E-6
36, ...   ...   ...E...            ...E-3
36, ...   ...   ...E...            ...E-4
60, ...   ...   ...E...            ...E-3
69, ...   ...   ...E...            ...E-7
69, ...   ...   ...E...            ...E-4
70, ...   ...   ...E...            ...E-7
78, ...   ...   ...E...            ...E-9
All pairs were found in adjacent seating.
But some individuals have similarity links to more than one person!
Slide 47:
Figure: diagram of similarity links among the detected students, labeled "teamwork". 132 students.
These students were all in adjacent seating and proximity copying was possible.
Slide 48: Have you ever seen anything like it?
(Contact at a testing agency)
When I saw something like this previously, the question was a comment: “There is something wrong with your program”
Slide 49:
Pair      Z
16, 39    6.453
16, 41    5.453
16, 42    6.090
First student of each listed pair (partners and Z values not preserved): 16, 16, 16, 16, 16, 16, 16, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 41, 41, 41, 41, 41, 41, 41, 41, 42, 42, 42, 42, 42, 42, 42, 42, 43, 43, 43, 44, 44, 44, 44, 46, 46, 46, 46, 46, 49, 50, 50, 50, 57, 64, 65, 82, 82, 82, 86, 87, 87, 91, 91, 91, 93, 93, 100, 109, 113, 113, 113, 113, 117, 118, 120, 120, 120, 120, 122, 122, 124
Students seem to have a lot of partners, to whom they are linked with high similarities.
Slide 50: Really enthusiastic teamwork!
Figure: diagram of similarity links among student numbers. 35 suspects, 79 pairs, 200 students.
1) How did “the firm” manage this?
2) I asked and was told this was sensitive and confidential, so I will never know.
3) The student numbers have adjacency tendencies, so it could be proximity copying, as opposed to electronics, but how?
Slide 51: A NEW OPTION
SCheck Version 8a7. #### 03-05-2009 11:21:34
Bonferroni program selected significance bound is 0.01
_____________________________________________________________
** pair = ** Harpp-Hogan stat = #wr.mat/#diff = 2.12
##################################################################
Zb = 'equivalent' z from the BVP model
Significance of Zb on a pre-selected pair = 9.4E-8
Approximate significance of program selected pair = 3.7E-5
Signif. bound (Bonferroni) on program selected pairs = 5.6E-3
#matches = 42 | 50 (mu,s)=( , )
prop. right for 48 = prop. right for 188 = 0.580
Quest. range = [ ] NRT = #students = 348
.b.b...aa. a.ba..baa. ..cb..b.c. .c.e.aad.e .....b.a.b
.b.b...aa. a.ba..baa. ..cbb.b.c. .c...d..ce .....b...a
estimated match probabilities:
Slide 52: History of Applications
- Early studies for economics department
- For individual instructors
- Large assessment organizations
- National education departments
“This is the only instance where separate examination venues have shown up and I have used your analysis on probably about 30,000 candidates.”
Slide 53: Dallas Morning News Study on Cheating on the TAKS test (June 2007), an Application of SCheck
DMN: “The test scores of more than 50,000 students show evidence of cheating. Some of those students were the innocent victims of others copying their answers. But experts say most were likely either deliberately copying answers or had their answer sheets doctored by school staff.”
TEA response: “Officials at the Texas Education Agency have consistently argued that statistical analysis can't prove cheating and that they must rely on other forms of evidence – like getting teachers to confess to misbehavior – in their investigations. TEA decided not to use data drawn from student answer sheets – even with evidence of widespread copying in a classroom.”
Slide 54: Common Objections: We studied together, we are from a similar background, we are twins, etc.
- Some direct studies have been done.
- Note that a huge number of pairs is being looked at.
- If these objections were valid, we would expect a large number of highly similar pairs that couldn’t have cheated.
- We would also expect the high similarity rate to continue after prevention is implemented.
Slide 55: Both expectations have been proven false in thousands of data sets.
- Prior to electronic cheating methods, no very high similarities lacking the opportunity to cheat (adjacent seating) were found.
- Multiple versions of exams cause a drastic decrease in the cheating rate.
Slide 56: Pitfalls in interpretation
- High marks make detection difficult*
- Non-responses can invalidate the model
- Cheating must be substantial for detection
- Too much cheating can invalidate the model
- Hierarchical questions violate assumptions
- Data sets may be too small
* In the extreme, if one student copies from another with a perfect test, there is no way statistical detection can distinguish this case from the case of two brilliant students with perfect papers. SCheck knows this.
Slide 57: Effective Prevention Measures
- Multiple versions
- Randomized or assigned seating
- Seat spacing
- Invigilation
- Electronic counter-measures
- “Education on integrity” & ethics indoctrination
Slide 58: December 2002 exams, McGill University. Proper prevention measures are in place.
Figure labels: finals, midterms, 217
Slide 60: Some Controversial Personal Opinions
Multiple choice tests provide a substantial component of grades in undergraduate courses. Statistical detection of copying or collusion on such tests has proven to be quite successful, and has in turn demonstrated that simple and non-intrusive prevention measures are very effective. Why then is this subject conspicuously absent in most discussions on cheating prevention strategies?
Slide 61: The most common general attitudes of student leaders, instructors, and administrators towards cheating are remarkably similar
See no Evil, Hear no Evil, Speak no Evil
Genuine consensus
Slide 62: Comments
- A “look the other way” strategy is actually the best one if the goal is to protect the integrity reputation of a university.
- Why? Media assume that the cheating problem is proportional to the number of reported cases or the amount of discussion about cheating. Keeping quiet, therefore, creates the impression that there is no problem.
- A university that tries to do something about cheating often gets a reputation for being infested with cheating. (Otherwise, why would they talk about it?)
- No good deed goes unpunished.
Slide 63: Instructor Attitudes
Instructor: Not on my tests you won’t!
Slide 64: #1
- Believe in justice?
- Feel personal betrayal?
Not very many of these, and so their effect on the system is relatively minor.
Slide 66: Instructor: I’m not a Policeman or Prison Guard, I’m an Educator and Researcher
Translation: I am not going to charge any students with cheating or take any special prevention measures.
This sounds idealistic and has the added benefit of saving a lot of work and avoiding unpleasant confrontations.
1) Find many more of these.
2) Variations, such as I’m too busy with research.
Slide 67: Comments
- Implicit assumption: The university’s main function is to provide an education and grades are merely an unimportant side-issue.
- I disagree: The main product of a university is grades and degrees and diplomas. Without these, even if it continued to give a good education, the university would be out of business overnight.
- On the other hand, degrees without education (diploma mills) are a growing phenomenon.
- The assumption of a degree is that a required level of academic achievement has been met. Grades and degrees without this level of achievement are defective.
- By giving degrees based on defective grades, the university is giving the public a defective product.
Slide 68: Postscript: Education versus Grades
By JANE ARMSTRONG
Friday, January 27, 2006 Posted at 5:20 AM EST
From Friday's Globe and Mail
“Professor David Weale called it a "January clearance" -- and clear out they did. Dismayed by his crowded classroom, the history teacher at the University of Prince Edward Island offered his students a deal some couldn't resist: Drop this Christianity class and you'll get a B minus.”
Rule of 20?
“The offer, also dubbed the "Weale deal" worked. The next week, about 20 of the 95 students were gone. So too is Prof. Weale after shocked administrators caught wind of the unorthodox academic transaction.”
Slide 69: Addendum
“Vice-president Gary Bradshaw said the school had no choice but to suspend Prof. Weale while a disciplinary probe begins. Offering students a credit without doing the work "strikes at the very heart of the academic principles," Mr. Bradshaw said.”
The next week, he sweetened the offer, saying he'd give students who left a mark of 68. Students mulled it over during the break and negotiated the deal up to 70, which is a B minus. Departing students were required to send Prof. Weale an [...] and pay the $450 for the course.
Slide 70: Student Leaders and Instructors
Slide 71: Student Leaders and Instructors: A Matter of Trust*
Student Leader: “If you take prevention measures you show you don’t trust us and that cheating is expected. This will only increase cheating. Anyway, cheating is very rare.”
Translation: My constituency would feel threatened.
Instructor: “Cheating prevention and detection poisons the atmosphere and destroys the vital rapport and atmosphere of trust I have with my students”
Translation: My teaching evaluations would go down.
* paraphrased
Slide 73: Administration: The Best Solution is ‘Ethics Reengineering’
“E-tegrity board members, …, have developed a range of new initiatives to integrate integrity into the college’s culture.”
“… suggest a more broadly focused approach that creates an educational community valuing academic integrity and focusing on the moral and ethical development of students”
Sounds good and plays well in the media. Often arises out of bad publicity on cheating incidents.
1) Will it work if it is seriously implemented? McCabe versus the economists.
2) Will it ever be implemented beyond the talk level? Web pages, articles in school newsletters, “integrity officers”, inviting McCabe. Usual outcomes: a) effort fades away b) majority of students and instructors are not involved.
3) Contrast this with prevention measures on MC tests, which have been shown to virtually eliminate cheating.
Notes:
- The re-engineering initiatives usually crop up after some bad publicity, local or national.
- Economists: Cheating depends on opportunity payoffs and risks. As Reagan would say, “there they go again.”
- Psychologists: linking cheating to psychopathic tendencies. One study uses my software.
- In the case of business schools it could be because of misbehavior in corporations.
- The great majority of students are unaffected.
Slide 74: Comments on Honor Codes
Slide 75: Honor Codes*
Traditional Honor Codes
- Pledge and ceremonies
- Proctoring of tests not allowed
- Students in charge of tribunals
- Students obliged to report infractions (rarely enforced)
Modified Honor Codes
- Proctoring allowed
* McCabe
Slide 76: Quote on Honor codes
In this same study that found over 75 percent of students admitting to cheating, McCabe saw that only 57 percent of students cheated at schools with honor codes. Chronic cheating also seems to reduce with honor codes. At schools without codes, 1 in 5 students admits to cheating more than twice. Only 1 in 16 students admits to the same offense at schools with honor codes.
"I think it's a question of making your students understand that academic integrity is important to the school," McCabe said. "Just the fact that it's being discussed" can heighten student's awareness and reduce cheating, he concluded.
Administrator friendly comment.
Slide 77: Do Honor Codes Work? Comments
- Only? 57%
- Response bias: ‘Honor’ rhetoric leads to fewer admissions of cheating? Fear of Draconian penalties?
- Control for other variables: types of exams, subject matter, etc.
- Low response rates → non-response bias
“it is clear the response rate is below desired levels, averaging between 10% and 15% and varying from as little as 5% to 10% on some large campuses to over 50% on a limited number of small, residential campuses”
Slide 78: Advantages of Honor Codes
- Media friendly
- Reduce the number of reported cases of dishonesty
Disadvantages of Honor Codes?
Slide 81: Recommendations*
- Make statistical testing (even without identifying the students involved) a non-optional part of the scanning report.
- Institute mandatory or strongly “encouraged” prevention measures: multiple versions, assigned seating, electronic counter-measures, etc.
- As a last and least important step, use statistical testing to support charges of academic dishonesty. In other words, use it this way only after the cheating is cleaned up.
Observation: Statistical evidence is very unlikely to be sufficient by itself. Other evidence, such as a proctor’s observations, will be required to make charges stick.
* Plan is plagiarized from McGill