
Published by Araceli Mellin; modified over 2 years ago

2
Demystifying Data Forensics: An Overview of the Logic Underlying Cheating Detection Techniques
Jim Wollack, Ph.D.
Associate Professor, Educational Psychology
Director, Testing and Evaluation Services
Director, UW Center for Placement Testing

3
Is Cheating Really a Problem? Answer Copying
- Example with MBE
- Tends not to steal the headlines
- Most common type of cheating without premeditation: ≈ 20% of undergraduates copy annually
- Moderately serious
- One's ability to help oneself depends on the ability of the source(s), visual acuity, and line of sight

4
Is Cheating Really a Problem? Item Preknowledge
- 2002: GRE braindump site
- 2010: FSBPT group email accounts
- IT certification test prep sites
  - Certexperts.com: 60,000 items from 60 testing programs (2006)
  - Testking.com: Microsoft exams (2006)
  - Scoretop.com: GMAT (2008)
- Huge problem: upwards of 85% of examinees
- Single most common type of cheating among undergrads
- Extremely serious

5
Is Cheating Really a Problem? Illegal Coaching and Test Tampering
- 2003: TAKS in Dallas
- Numerous public schools (Atlanta, Washington, Philly, Dallas, LA, etc.)
- Surprisingly frequent: 2-4% of educators
- Extremely serious

6
Is Cheating Really a Problem? Proxy Testing
- SAT/ACT proxies
- Very infrequent
- Extremely serious
- Led to changes in the registration process

7
Combatting Cheating: Data Forensics
- Statistical approaches to identify examinees whose scores are of questionable validity
- Utilized by almost all major testing programs
- Can be used as a trigger and/or to corroborate suspicion
- Not a substitute for an investigation
  - False positives are possible
  - Legitimate reasons may also exist for statistical irregularities

8
Types of Data Forensics Addressed
- Answer copying
- Preknowledge
- Other forms of collusion
- Illegal coaching
- Test tampering
- Proxy testing

9
The Creation of a Forensic Tool
- Tailor the method to the specific type of cheating
- Approaches should focus on specific, statistically observable elements of cheating
  - Observable means that it is evident in the data record
  - Using cellphones, testing with a fake ID, and talking during the test are NOT observable to the statistician
  - Answers to specific test questions, test scores, testing history, etc. ARE observable to the statistician
- Ask: what strange patterns would I expect to see if someone were engaged in the cheating behavior of interest that I would not expect to see otherwise?

10
Answer Copying
- What are observable characteristics that we'd expect to see if one examinee copies from another?
- How persuasive is that evidence?
- How can we convert that observable into a statistic that is likely to be high for cheaters and low for non-cheaters?
- What issues might be associated with that statistic?

11
Answer Copying
- Observable: large number of identical responses
- The number of matches between examinees should vary with the abilities of the copier (C) and source (S)

12
Answer Copying
- Observable: large number of identical responses
- The number of matches between examinees should vary with:
  - Abilities of C and S
  - Number of questions
  - Number of item alternatives
  - Difficulty of questions
  - Attractiveness of alternatives
- The most common approach is to standardize the number of matches

13
Standardization
- Conversion of raw data to a scale that makes direct comparisons possible
- The approach involves two steps:
  1. Evaluating each data point against its expected value
     - The expected value is the value that, on average, we would expect to see under these exact circumstances if there really were no cheating
     - Evaluation against the expected value tells us whether the number of matches is more or less than we'd expect of this C from this S
  2. Dividing this difference by a measure of the expected variability (the standard error)

14
Standardization
[Figure: standard normal curve with tail areas marked: 50% = 1 in 2; 15.9% = 1 in 6; 2.3% = 1 in 44; 0.14% = 1 in 740]

Index value   Likelihood
3.09          1 in 1,000
3.72          1 in 10,000
4.27          1 in 100,000
5.20          1 in 10,000,000
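Each index value converts to a likelihood via the upper tail of the standard normal distribution. A minimal standard-library Python sketch of that conversion:

```python
import math

def upper_tail(z):
    """One-tailed probability that a standard normal variate exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Convert index values to "1 in N" likelihoods
for z in (3.09, 3.72, 4.27, 5.20):
    print(f"index {z:.2f}: about 1 in {1 / upper_tail(z):,.0f}")
```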

15
How Does One Find the Expected Number of Matches?
Empirically based:
- Construct a large dataset of pairs of examinees who could not have copied
- Condition the dataset: divide the data into smaller, homogeneous groups, e.g., by
  - Test scores for one or both examinees
  - Sum or product of the examinees' test scores
  - Longest string of consecutive matches
- Compute the average number of matches across all examinee pairs within the group into which the C-S pair of interest falls
- St.dev(#Match) is the standard deviation across those same values
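A minimal sketch of the empirical approach. The pair records and the score-band width of 10 are hypothetical illustrations, not the presenter's data:

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical records: (score of C, score of S, number of matching answers)
# for pairs of examinees who could not have copied from one another
null_pairs = [(24, 71, 18), (27, 70, 23), (23, 76, 20),
              (26, 74, 21), (25, 77, 19),
              (55, 56, 31), (52, 58, 29), (57, 54, 33),
              (53, 59, 30), (56, 55, 32)]

# Condition the dataset: group pairs by the sum of their two test scores
groups = defaultdict(list)
for c, s, m in null_pairs:
    groups[(c + s) // 10].append(m)

def match_index(score_c, score_s, n_matches):
    """Standardize observed matches against non-copying pairs in the same band."""
    band = groups[(score_c + score_s) // 10]
    return (n_matches - mean(band)) / stdev(band)
```

A large positive index says the pair of interest matched far more often than comparable pairs who could not have copied.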

16
How Does One Find the Expected Number of Matches?
Model-based:
- Use a statistical model (e.g., the Nominal Response Model) to estimate the probability of examinees selecting each item choice

17
Example Probability Function
[Figure: Nominal Response Model curves plotting Prob(selecting alternative k) against the ability of the alleged copier (low to high)]


19
Example Probability Function
[Figure: the same probability function with the four alternative curves labeled A, C, B, D]

20
Example Probability Function
[Figure: the probability function curves (A, C, B, D) annotated with the alleged copier's and alleged source's responses]

21
Example Probability Function
[Figure: the probability function with the source's selected answer (A) marked; at the alleged copier's ability, Prob(Match) = .12]

22
How Does One Find the Expected Number of Matches?
Model-based: estimate the probability of C selecting each item choice.

Item   Copier   Source   P(A)   P(B)   P(C)   P(D)
1      B        A        0.12   0.47   0.23   0.18
2      C        C        0.02   0.17   0.57   0.24
3      B        B        0.45   0.34   0.13   0.08
4      D        A        0.06   0.68   0.08   0.18
5      D        D        0.10   0.02   0.84   0.04

- Find the probability of C selecting S's answer on each item
- Sum these probabilities across items: 0.12 + 0.57 + 0.34 + 0.06 + 0.04 = 1.13
- How unusual is it to observe 3 answer matches given that the expected number is 1.13?
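This arithmetic can be sketched in a few lines of standard-library Python; the choice probabilities and source answers come from the worked example:

```python
import math

# Per-item probabilities of the alleged copier (C) selecting each alternative,
# and the alleged source's (S's) actual answers
p_choices = [
    {"A": 0.12, "B": 0.47, "C": 0.23, "D": 0.18},
    {"A": 0.02, "B": 0.17, "C": 0.57, "D": 0.24},
    {"A": 0.45, "B": 0.34, "C": 0.13, "D": 0.08},
    {"A": 0.06, "B": 0.68, "C": 0.08, "D": 0.18},
    {"A": 0.10, "B": 0.02, "C": 0.84, "D": 0.04},
]
source_answers = ["A", "C", "B", "A", "D"]

# Probability of C independently selecting S's answer on each item
p_match = [p[s] for p, s in zip(p_choices, source_answers)]
expected = sum(p_match)    # 0.12 + 0.57 + 0.34 + 0.06 + 0.04 = 1.13

# Standardize the observed 3 matches against the expected value
se = math.sqrt(sum(p * (1 - p) for p in p_match))
index = (3 - expected) / se    # ~2.28 (the slide reports 2.29)
```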

23
Unusual or Not?

Item   Copier   Source   P(A)   P(B)   P(C)   P(D)
1      B        A        0.12   0.47   0.23   0.18
2      C        C        0.02   0.17   0.57   0.24
3      B        B        0.45   0.34   0.13   0.08
4      D        A        0.06   0.68   0.08   0.18
5      D        D        0.10   0.02   0.84   0.04

Expected matches: 0.12 + 0.57 + 0.34 + 0.06 + 0.04 = 1.13; observed: 3 answer matches.

24
How Unusual is 2.29?
[Figure: standard normal curve with the index value 2.29 marked; upper-tail area ≈ 1.1%, or about 1 in 91]

25
Statistical Detection of Answer Copying
- Answer copying detection indexes provide a probability statement about the likelihood of C's and S's responses having been produced independently
- They quantify how unusual the observed similarity is
- Very small probabilities are quite compelling
- From a recent case:
  - Examinees C and S completed scrambled forms
  - C scored 26 of 100; S scored 76 of 100
  - They matched answers on 66 of 100 items
  - Index value: 9.89

26
Probability of C’s answers being independently of S? 1 in 45,000,000,000,000,000,000,000 How Unusual is 9.89?

27
Probability of C’s answers being independently of S? 1 in 45,000,000,000,000,000,000,000 How big is 45,000,000,000,000,000,000,000 # people ever born108,000,000,000 # stars in galaxy:400,000,000,000 earth’s age (in seconds)150,000,000,000,000,000 # grains of sand:7,500 000 000 000 000 000 Evidence doesn’t need to be this overwhelming to be useful. Depending on other evidence, statistical evidence in order of 1 in 1,000 or 1 in 10,000 may be adequate. How Unusual is 9.89?

28
Other Observables for Answer Copying
- Scrambled forms
  - Help with the copying detection index
  - Also make it possible to look at the likelihood of C's score under an alternate test key
- Different success rates on common items and unique items
  - Can find the expected score over both sets and ask whether changes are in keeping with expectations

29
Other Types of Cheating
- The trick is finding or deriving an observable that is predictive of the issue
- What data patterns might we expect to observe to identify candidates with preknowledge?

30
Preknowledge
- Premise: examinees who have studied live items should do much better on those items than on unfamiliar items
- Challenge: we often do not know which items are compromised
- We can design the test with preknowledge detection in mind

31
Internal Verification Test
- Test-within-a-test: embed a set of new items in the test
- Observable: change in performance across the sets of secure and operational items
- Use the score on operational items to predict the score on the new items

32
Example Probability Function
[Figure: the probability function curves (A, C, B, D) with the examinee's ability marked; at that ability, the selection probabilities are:]

Alternative   Prob
A             .66
B             .06
C             .25
D             .03

33
Expected Number Correct

VT item #   Prob(k_C)   Prob × (1 − Prob)
1           0.66        0.224
2           0.89        0.098
3           0.51        0.250
4           0.84        0.134
5           0.79        0.166
6           0.63        0.233
7           0.67        0.221
8           0.59        0.242
9           0.76        0.182
Sum         6.34        1.751
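The two column sums are all the detection statistic needs; a sketch using the table's probabilities:

```python
import math

# Model-based probabilities of answering each verification-test item correctly
p_correct = [0.66, 0.89, 0.51, 0.84, 0.79, 0.63, 0.67, 0.59, 0.76]

expected = sum(p_correct)                              # 6.34
sd = math.sqrt(sum(p * (1 - p) for p in p_correct))    # sqrt(1.751) ~ 1.32

def vt_index(n_correct):
    """Standardized VT score; large negative values suggest the VT score
    fell far short of what the operational score predicted."""
    return (n_correct - expected) / sd
```

For example, `vt_index(0)` is about −4.79 and `vt_index(9)` about 2.01, matching the likelihood table that follows.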

34
Likelihood
Check the extremity: LOW scores are suggestive of preknowledge.

VT # Correct   Index Value   Prob (1 in)
0              −4.79         1,200,000
1              −4.04         36,700
2              −3.28         1,900
3              −2.52         172
4              −1.77         26
5              −1.01         6
6              −0.26         3
7              0.50          1.4
8              1.25          1.1
9              2.01          1.0

35
Internal Verification Test
Works well if:
- Items are highly compromised
- The VT is long enough

36
Preknowledge
- Premise: examinees view braindump/test prep sites as "ultimate authorities" and will trust their materials unconditionally
- Observable: examinees who have studied INCORRECTLY POSTED live items should do much worse on those items than on unfamiliar items

37
Trojan Horse Items
- The testing program releases some of its actual items to known braindumps
  - Released items are very easy
  - Items are posted verbatim with an incorrect key marked
- The items appear on the exam but do not count toward the score
- Use the score on operational items to predict the score on the new items

38
Expected Number Correct
Ideally suited for cases with very high compromise rates:
- Scores on the operational test are high
- Items are easy
- High probabilities of success on the items

VT item #   Prob(k_C)   Prob × (1 − Prob)
1           0.98        0.020
2           0.93        0.065
3           0.94        0.056
4           0.90        0.090
5           0.85        0.128
6           0.91        0.082
Sum         5.51        0.441

39
Likelihood
LOW scores are suggestive of preknowledge.

VT # Correct   Index Value   Prob (1 in)
0              −8.30         19.5 quadrillion
1              −6.80         185 billion
2              −5.29         16.2 million
3              −3.78         12,900
4              −2.28         87
5              −0.77         5
6              0.74          1.3

40
Trojan Horse Items
- If the test is very highly compromised, the methodology works well to detect the biggest and dumbest offenders
- Can catch the biggest offenders with as few as 5-6 items
- Ethics dilemma: should testing companies really be exposing illegitimate information on their program? If they know where people are going to access stolen content, wouldn't bringing the site down be the right thing to do?

41
Preknowledge
- Premise: examinees will be able to answer quickly any questions for which they have preknowledge
- Observable: response times for compromised items should be shorter than for secure items

42
Ways to Use RT Data
- If you have a VT, you can compare a person's standardized RT on the VT with their standardized RT on the operational test
  - Essentially the same as asking whether the percentile rank of the person's RT is markedly different
- Can plot RTs to look for anomalies

43
Response Time
[Figure: response-time plot with items re-ordered from shortest to longest average RT]

44
Ways to Use RT Data
- If you have a VT, you can compare a person's standardized RT on the VT with their standardized RT on the operational test
  - Essentially the same as asking whether the percentile rank of the person's RT is markedly different
- Can plot RTs to look for anomalies
- Possible flags:
  - Within-person RT standard deviation
  - Finished the test too quickly (more than 4 SEs below the mean)
  - RT < 20 seconds for too many questions
  - Really long RTs could signal item harvesting
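A sketch of the middle two flags. The thresholds beyond those stated on the slide (the cutoff of five fast items, the cohort statistics in the usage example) are illustrative assumptions:

```python
def rt_flags(item_times, cohort_mean, cohort_se,
             fast_item_sec=20, max_fast_items=5):
    """Screen one examinee's response times (seconds) for the flags above."""
    flags = []
    if sum(item_times) < cohort_mean - 4 * cohort_se:
        flags.append("finished test too quickly (> 4 SEs below cohort mean)")
    if sum(t < fast_item_sec for t in item_times) > max_fast_items:
        flags.append("too many responses under 20 seconds")
    return flags
```

An examinee who answers 40 items in 15 seconds each, against a hypothetical cohort mean of 3,600 seconds (SE 400), trips both flags.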

45
Preknowledge
- Observable: unusual similarity among examinees with access to the same materials
- Similarity index: conceptually very similar to the answer copying index, except the expected value and standard deviation are computed differently

46
Example Probability Function
[Figure (repeated from the answer-copying example): the probability function with the source's selected answer (A) marked; Prob(Match) = .12]

47
Example Probability Function
[Figure: probability function curves for two examinees, with each examinee's selection probabilities shown]

Choice   Exam. 1   Exam. 2   Product
A        .12       .66       .079
B        .47       .06       .028
C        .23       .25       .058
D        .18       .03       .005
Sum                          .170

48
Example Probability Function
[Figure: the same curves evaluated at a different ability for examinee 1]

Choice   Exam. 1   Exam. 2   Product
A        .49       .66       .323
B        .13       .06       .008
C        .30       .25       .075
D        .08       .03       .002
Sum                          .408

49
Example Probability Function
[Figure: the curves (with B and D nearly overlapping) evaluated for two examinees]

Choice   Exam. 1   Exam. 2   Product
A        .76       .87       .661
B        .02       .01       .000
C        .20       .11       .022
D        .02       .01       .000
Sum                          .683

50
Similarity
- Find P(Match) for all items and sum across items
- Find P(Match) × [1 − P(Match)] and sum across items

Item   Ex. 1   Ex. 2   P(Match)   P(Match) × [1 − P(Match)]
1      B       A       0.33       0.221
2      C       C       0.53       0.249
3      B       B       0.37       0.233
4      D       A       0.28       0.202
5      D       D       0.35       0.228
Sum                    1.86       1.132

Square root of 1.132 = 1.064.
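A sketch tying the pieces together: each item's P(Match) is the sum over alternatives of the product of the two examinees' selection probabilities, and the per-item values from the table are then standardized exactly as in the copying index. The pair matching on all 5 items is a hypothetical:

```python
import math

def p_match(p1, p2):
    """Probability two examinees independently pick the same alternative."""
    return sum(p1[k] * p2[k] for k in p1)

# The single-item worked example: product-sum of two probability vectors
p1 = {"A": 0.12, "B": 0.47, "C": 0.23, "D": 0.18}
p2 = {"A": 0.66, "B": 0.06, "C": 0.25, "D": 0.03}
# p_match(p1, p2) is about 0.170

# Per-item match probabilities from the table above
match_probs = [0.33, 0.53, 0.37, 0.28, 0.35]
expected = sum(match_probs)                              # 1.86
se = math.sqrt(sum(p * (1 - p) for p in match_probs))    # ~1.064

def similarity_index(observed_matches):
    return (observed_matches - expected) / se
```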

51
Detecting Preknowledge with Similarity Indexes
- Compute the index between all possible pairs
- Identify all pairs with probability below some criterion
- Use a clustering method to unite linked examinees
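The "unite linked examinees" step is a connected-components problem; a minimal union-find sketch over hypothetical flagged pairs:

```python
def cluster(flagged_pairs):
    """Merge examinees connected by any flagged pair into groups."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in flagged_pairs:
        parent[find(a)] = find(b)
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

# Hypothetical flagged pairs: two clusters emerge, {1, 2, 3} and {7, 8}
clusters = cluster([(1, 2), (2, 3), (7, 8)])
```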

52
[Figure: network diagram of 13 examinees, giving 78 total pairs]


54
Challenges with Detecting Collusion
- Requires lots of investigative work: what connects these individuals?
- Results of investigation:
  - Compelling evidence that leads to questioning the validity of a score can result in score cancelation, with an automatic, free retest
  - It may not be possible to impose more severe sanctions absent a clear picture of how anomalous patterns emerged and how patterns are connected

55
Preknowledge/Tampering
- Premise: examinees who have previously been unsuccessful will be more likely to attempt to acquire preknowledge
- Observable: repeat examinees with preknowledge will see large gains in their scores, perhaps over a short interval

56
Gain Scores
- Most standardized tests are highly reliable, so scores on independent retests will be comparable
- Scores for repeat candidates that increase by too much (e.g., 3 SEs or more) are viewed as suspicious
- There are lots of legitimate reasons scores could increase:
  - Acquisition of new material
  - Examinee's physical/mental health during the initial test
  - Poor time management
  - Alignment error on a paper-and-pencil exam
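A sketch of the gain-score screen, with one hedged assumption the slide does not state: the standard error of a gain is taken as sqrt(2) times the test's SEM, as for a difference of two independent scores:

```python
import math

def suspicious_gain(first_score, retest_score, sem, threshold=3.0):
    """Flag a retest gain exceeding `threshold` standard errors.
    Assumes SE(gain) = sem * sqrt(2) for two independent administrations."""
    se_gain = sem * math.sqrt(2)
    return (retest_score - first_score) / se_gain > threshold
```

With an SEM of 3, a 20-point gain (about 4.7 SEs) is flagged while a 5-point gain is not.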

57
Test Tampering
Premises/Observables:
- Tampering will only occur on items for which the examinee gave an incorrect response
  - Disproportionate number of wrong-to-right (WTR) erasures
  - Unusually high number-correct score on erased items
- Tampering will only occur for lower-ability students and will be done for a handful of items only
  - Will produce peculiar item response patterns, with correct answers to several hard questions and incorrect answers to several easy questions (person fit)
- Tampered students' scores will regress once they no longer have that teacher/school/district
  - Unusually high positive gain score followed by unusually negative gain score
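The first observable reduces to comparing a student's wrong-to-right erasure count against a baseline rate; a binomial-style sketch (the baseline rate is a hypothetical input, in practice estimated from the population of answer sheets):

```python
import math

def wtr_index(n_erasures, n_wtr, baseline_wtr_rate):
    """Standardized wrong-to-right erasure count for one answer sheet."""
    expected = n_erasures * baseline_wtr_rate
    sd = math.sqrt(n_erasures * baseline_wtr_rate * (1 - baseline_wtr_rate))
    return (n_wtr - expected) / sd
```

A sheet with 18 WTR erasures out of 20, against a baseline WTR rate of 1/3, yields an index above 5.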

58
Index Profiles

Index               Profile 1           Profile 2          Profile 3
Verification Test   FLAG                NO                 —
Gain Score          Big                 Small              Big
Similarity          Big                 Small              Moderate
Resp. Latency       FLAG                NO                 —
Erasures            NO                  YES (Mod)          YES (BIG)
Diagnosis           Item Preknowledge   Illegal Coaching   Test Tampering

59
Validating Data Forensic Techniques
- There are many different statistical approaches to detecting different kinds of cheating
- Just because it is possible to create a statistic does not mean that it will work well
- It is very important to conduct research on these methods to see how they work in practice

60
Validating Data Forensic Techniques
- The most popular approach is to see how well methods identify known cases of cheating
- Problem: the amount and severity of the cheating are not carefully controlled
  - We don't know who else cheated but wasn't caught
  - We don't know how extreme the cheating had to be for the cheaters to have been caught
  - It is impossible to get an estimate of the false positive or true positive rate
- The BEST approach is to simulate different types/amounts of cheating and see how well methods identify them
  - Requires accurate assumptions about what cheating looks like

61
Future of Data Forensics
- Data forensics is an emerging but promising field
  - There is much that we don't yet know about how well methods work
  - Much improvement is necessary to detect low-to-moderate amounts of cheating
- Forensic tools are imperfect, and investigations are expensive and time-consuming
- The best resource for protecting score integrity is PREVENTION:
  - Securing/limiting access to materials before and after the exam
  - Seating charts
  - Vigilant proctoring
  - Careful checking-in of examinees/authentication
  - Careful collection of exam materials
  - Strict adherence to administrative guidelines
  - Reporting anything suspicious

62
Thank You
Jim Wollack
University of Wisconsin - Madison
jwollack@wisc.edu
608-262-0675

Similar presentations


Chapter 14 Inferential Data Analysis
