# Local outlier detection in data forensics: a data mining approach to flag unusual schools

Mayuko Simon, Data Recognition Corporation, May 2012


## Statistical methods for data forensics

- Univariate distributional techniques: e.g., average wrong-to-right (WR) erasures.
- Multivariate techniques, such as simple regression: e.g., the 2011 Reading score is predicted by the 2010 Reading score. A school is flagged if its observed dependent variable differs significantly from the model's prediction.

In either case, a school is flagged when it is an outlier compared to ALL other schools: a **global** outlier.
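The global flagging idea above can be sketched as a regression-residual check. The data, coefficients, and the 3-sigma cutoff below are illustrative assumptions, not the presentation's actual model or thresholds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic school-level scale scores (hypothetical data, for illustration only).
score_2010 = rng.normal(500, 50, size=200)
score_2011 = 0.9 * score_2010 + 60 + rng.normal(0, 10, size=200)
score_2011[0] += 80  # one school with an implausibly large gain

# Simple regression: 2011 score predicted by 2010 score.
X = np.column_stack([np.ones_like(score_2010), score_2010])
beta, *_ = np.linalg.lstsq(X, score_2011, rcond=None)
resid = score_2011 - X @ beta

# Flag schools whose standardized residual is extreme relative to ALL schools.
z = (resid - resid.mean()) / resid.std(ddof=1)
flagged = np.flatnonzero(np.abs(z) > 3.0)
print(flagged)  # the manipulated school 0 should stand out globally
```

A school whose gain is large relative to every other school is caught this way; a school that is extreme only relative to similar schools is not, which motivates the local approach below.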

## What if a school is suspicious but not extreme?

Schools with suspicious behavior may not display sufficient extremity to be outliers in comparison to all schools. Nevertheless, it is reasonable to expect their scores to be higher than those of their peers, that is, schools that are very similar in many relevant respects: a **local** outlier.

## Local vs. global outliers

Traditional statistical data forensic techniques lack the ability to detect local outliers:
- Regression will miss the blue rectangle.
- A univariate approach (e.g., using only variable a) will miss both the blue rectangle and the red triangle.
- Cluster analysis is not designed for outlier detection.

## The goal of RegLOD

We introduce a regression-based local outlier detection algorithm: RegLOD. The goal is to find schools that are very similar to their peers in most respects (in terms of most independent variables) but differ significantly in the current year's score (the dependent variable).

## Assumptions of RegLOD

When most independent variables are very similar, we expect the dependent variable to be similar as well. This assumption is very reasonable and frequently exploited: it is the principle on which regression trees and nearest-neighbor regression are built (Hastie, 2009).

## Data and variables

Data:
- A large-scale standardized state assessment test.

Variables:
- School-level Math and Reading scale scores in 2010 and 2011.
- School-level Math and Reading cohort scale scores in 2010 (for grade 4, the scale score when the students were in grade 3).
- School-level average wrong-to-right erasures in 2010 and 2011.

## Data and variables (continued)

Scale scores were transformed into logits for a better sense of school level during the analysis.

Dependent variable:
- 2011 Reading or 2011 Math.

Five independent variables:
- 2011 Math or Reading, 2010 Math and Reading, and 2010 cohort Math and Reading.
- Erasure counts were not used in the algorithm.
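The presentation does not give the exact logit transform; a minimal sketch, assuming each scale score is first mapped onto (0, 1) within hypothetical score bounds, might look like this:

```python
import math

def to_logit(score, lo, hi, eps=1e-6):
    """Map a scale score onto (0, 1) within assumed score-range bounds,
    then take the logit. The bounds lo/hi are hypothetical; the actual
    transform used in the analysis is not specified in the slides."""
    p = (score - lo) / (hi - lo)
    p = min(max(p, eps), 1 - eps)  # keep away from 0/1 so log() is defined
    return math.log(p / (1 - p))

# A score at the midpoint of the range maps to logit 0.
print(to_logit(600, 300, 900))
```

The transform is monotone, so it preserves the ordering of schools while stretching the tails of the score distribution.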

## RegLOD algorithm overview

1. Select a set of independent variables.
2. Find local weights.
3. Make a peer group for each school.
4. Obtain empirical p-values and flag schools when the criteria are met.

## RegLOD example: Grade 4 Reading. Step 1: select a set of independent variables

- DV: 2011 Reading (G4)
- IVs: 2011 Math (G4), 2010 Reading (G4), 2010 Math (G4), 2010 Cohort Reading (G3), 2010 Cohort Math (G3)
- R² = 0.99
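Step 1 amounts to fitting a multiple regression of the DV on the candidate IVs and checking the fit. The sketch below uses synthetic data with coefficients loosely based on the weights reported later in the slides; the data themselves are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Hypothetical logit-scale IVs: 2011 Math, 2010 Reading/Math, 2010 cohort Reading/Math.
IV = rng.normal(0, 1, size=(n, 5))

# Synthetic DV (2011 Reading), strongly determined by the IVs plus small noise,
# so the fitted R-squared comes out near the 0.99 reported on the slide.
true_beta = np.array([0.36, 0.23, -0.12, 0.23, -0.05])
dv = IV @ true_beta + rng.normal(0, 0.05, size=n)

# Ordinary least squares with an intercept.
X = np.column_stack([np.ones(n), IV])
beta, *_ = np.linalg.lstsq(X, dv, rcond=None)
pred = X @ beta
r2 = 1 - np.sum((dv - pred) ** 2) / np.sum((dv - dv.mean()) ** 2)
print(round(r2, 2))
```

A near-1 R² confirms that the IVs jointly determine the DV closely, which is exactly the assumption RegLOD relies on when comparing a school against peers with similar IV profiles.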

## RegLOD example: Grade 4 Reading. Step 2: determine the local weights

| Subject | Grade | P1: 2011 Logit M | P2: 2010 Logit R | P3: 2010 Logit M | P4: 2010 Cohort Logit R | P5: 2010 Cohort Logit M |
|---|---|---|---|---|---|---|
| R | 4 | 0.35818 | 0.23228 | -0.12216 | 0.23214 | -0.05164 |

## RegLOD example: Grade 4 Reading. Step 3: select peer schools

Compute pair-wise distances using the weights. Select as peers all schools within +/- 0.03 (Dist value) of a school.

| School i | School j | Distance/Similarity |
|---|---|---|
| 1 | 2 | 0.051504 |
| 1 | 3 | 0.144772 |
| 1 | 4 | -0.03006 |
| 1 | 5 | 0.21838 |
| 1 | 6 | -0.03232 |
| 1 | 7 | 0.084222 |
| 1 | 8 | 0.097555 |
| ... | ... | ... |
| 1600 | 1603 | 0.293038 |
| 1601 | 1602 | 2.342625 |
| 1601 | 1603 | 0.050075 |
| 1602 | 1603 | 1.848821 |
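The slides do not spell out the distance formula, but since the tabulated values are signed, one plausible reading is a weighted difference of the two schools' IV profiles using the Step 2 weights. The school data below are synthetic; only the weights come from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50

# Hypothetical logit-scale IV profiles for n schools (5 variables, as in the analysis).
X = rng.normal(0, 1, size=(n, 5))

# Local weights P1..P5 from the Grade 4 Reading slide.
w = np.array([0.35818, 0.23228, -0.12216, 0.23214, -0.05164])

def dist(i, j):
    """Signed weighted difference between two schools' IV profiles
    (an assumed reading of the slide's 'Distance/Similarity')."""
    return float(np.dot(w, X[i] - X[j]))

def peers(i, band=0.03):
    """Peer group of school i: all schools within +/- band, including i itself."""
    return [j for j in range(n) if abs(dist(i, j)) <= band]

print(len(peers(0)))
```

Note the quantity is antisymmetric (dist(i, j) = -dist(j, i)), which is consistent with the negative entries in the table above.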

## RegLOD example: Grade 4 Reading. Step 4: obtain empirical p-values

- Bootstrap the 2011 Reading grade 4 scores of the peer schools.
- Obtain an empirical p-value for each bootstrap replication and average them.
- Flag a school if its empirical p-value is 0.05 or less.
- Also flag a school if it has 10 or fewer peer schools.
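Step 4 can be sketched as follows. The peer scores are synthetic, and the tail definition (proportion of resampled peers scoring at least as high as the school) is an assumption about how the empirical p-value is computed; the 0.05 and 10-peer criteria are the slide's.

```python
import random
import statistics

random.seed(3)

# Hypothetical 2011 Reading scores of a school's peer group (school itself included).
peer_scores = [random.gauss(0.0, 0.3) for _ in range(40)]
school_score = 1.2  # well above the peer distribution

def empirical_p(score, peers, n_boot=1000):
    """Average, over bootstrap replications, of the proportion of
    resampled peer scores that are at least as high as the school's score."""
    ps = []
    for _ in range(n_boot):
        sample = [random.choice(peers) for _ in peers]
        ps.append(sum(s >= score for s in sample) / len(sample))
    return statistics.mean(ps)

p = empirical_p(school_score, peer_scores)
flag = p <= 0.05 or len(peer_scores) <= 10  # the slide's two flagging criteria
print(round(p, 3), flag)
```

Averaging over replications smooths the p-value estimate when the peer group is small, which matters because small peer groups are themselves treated as a warning sign.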

## How many schools were flagged? Local weights (Dist = 0.03)

| Subject | Grade | Total school N | Flagged school N | Proportion flagged (%) |
|---|---|---|---|---|
| R | 4 | 1603 | 49 | 3.06 |
| R | 5 | 1493 | 37 | 2.48 |
| R | 6 | 1056 | 38 | 3.60 |
| R | 7 | 836 | 20 | 2.39 |
| R | 8 | 836 | 22 | 2.63 |
| M | 4 | 1603 | 33 | 2.06 |
| M | 5 | 1493 | 63 | 4.22 |
| M | 6 | 1056 | 36 | 3.41 |
| M | 7 | 836 | 36 | 4.31 |
| M | 8 | 836 | 37 | 4.43 |

## Comparison to the results of other statistical methods for data forensics

- SS: scale score analysis; e.g., the 2011 scale score is predicted by the 2010 scale score.
- PL: performance level analysis; e.g., the proportion proficient or above in 2011 is predicted by the 2010 proportion proficient or above.
- Reg: regression analysis using two subjects; e.g., 2011 Reading is predicted by 2011 Mathematics.
- Rasch: use of Rasch residuals.
- WR: wrong-to-right erasure count.
- SSCo: scale score analysis using cohort students; e.g., the 2011 scale score is predicted by the 2010 scale score of the same cohort of students.
- StdRes: standardized residual of a multiple regression using the same variables as the RegLOD analysis.

## Grade 4 Reading: comparison of local and global outlier detection

RegLOD columns: peer N and p-value. Remaining columns are the statistical methods (global outlier detection).

| Num | Peer N | P-value | SS | PL | Reg | Rasch | WR | SSCo | StdRes |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | - | 3.3 | 1.1 | 9.9 | 0.0 | 4.9 | 11.4 | 3.12 |
| 2 | 9 | - | 9.3 | 4.6 | 1.4 | 1.0 | 0.2 | 13.8 | 1.09 |
| 3 | 93 | 0.000 | 6.4 | 3.5 | 1.1 | 0.0 | | 6.7 | 3.02 |
| 4 | 57 | 0.000 | 4.6 | 4.0 | 8.1 | 0.0 | 2.8 | 17.4 | 2.41 |
| 5 | 6 | - | 0.9 | 2.6 | 8.8 | 0.0 | | 7.1 | 0.64 |
| 6 | 3 | - | 3.9 | 2.2 | 0.0 | | 20.8 | 16.8 | 1.83 |
| 7 | 424 | 0.005 | 5.2 | 1.0 | 11.0 | 1.6 | 0.0 | 8.4 | 3.91 |
| 8 | 312 | 0.006 | 1.5 | 1.1 | 9.7 | 0.3 | 2.4 | 7.2 | 3.61 |
| 9 | 380 | 0.008 | 9.1 | 4.6 | 11.8 | 0.1 | 4.4 | 5.0 | 4.19 |
| 10 | 119 | 0.008 | 2.4 | 2.2 | 8.7 | 0.6 | 0.0 | 1.1 | 3.49 |
| 11 | 435 | 0.009 | 2.8 | 1.1 | 8.0 | 0.0 | | 5.0 | 3.34 |
| 12 | 147 | 0.014 | 6.4 | 4.5 | 4.9 | 0.0 | 4.0 | 3.2 | 4.14 |
| 13 | 112 | 0.018 | 0.1 | 1.1 | 16.9 | 0.0 | 2.2 | 6.8 | 1.68 |
| 14 | 418 | 0.019 | 4.2 | 3.1 | 14.3 | 0.0 | | 4.7 | 2.23 |
| 15 | 52 | 0.019 | 1.1 | 0.4 | 8.6 | 0.0 | 2.5 | 3.0 | 2.77 |

## A school with 10 or fewer peers

E.g., school number 1 in the table:
- Its 2011 wrong-to-right erasure count was at the 96th percentile, which is rather high.
- RegLOD found only three peer schools, including the school itself, indicating this school is an outlier.
- Large increase in percentile (26th to 95th), indicating a suspicious increase in score.
- There is reasonable evidence that this school needs further scrutiny.

Percentiles:

| DV: 2011 Reading | P1: 2011 Math | P2: 2010 Reading | P3: 2010 Math | P4: 2010 Cohort Reading | P5: 2010 Cohort Math | 2010 WR | 2011 WR |
|---|---|---|---|---|---|---|---|
| 95 | 52 | 100 | 93 | 26 | 20 | 60 | 96 |

## A school with many peer schools

School number 3 in the table:
- RegLOD found 93 peer schools, including the school itself.
- Large increase in percentile (23rd to 76th).
- Since there are many peers, we can plot the variables against the peer schools.

Percentiles:

| DV: 2011 Reading | P1: 2011 Math | P2: 2010 Reading | P3: 2010 Math | P4: 2010 Cohort Reading | P5: 2010 Cohort Math | 2010 WR | 2011 WR |
|---|---|---|---|---|---|---|---|
| 76 | 75 | 22 | 67 | 23 | 41 | 35 | 2 |

## Comparison to peers for the IVs

2010 Reading and 2010 cohort Reading are both around the 20th percentile. The school is within the peer distribution.

## Comparison to peers on the 2011 Reading score (DV)

This school was at the 23rd percentile on 2010 cohort Reading but at the 76th percentile on 2011 Reading. The school is an outlier among its peers on 2011 Reading.

## A lower-achieving school with many peer schools

School number 12 in the table:
- RegLOD found 147 peer schools, including the school itself.
- Moderate increase in percentile (13th to 42nd).
- The 2010 cohort Math score at the 48th percentile seems a little strange given that 2011 Math is at the 14th percentile.
- The erasure count was at the 96th percentile, which is rather high.
- We can take a look at the histograms.

Percentiles:

| DV: 2011 Reading | P1: 2011 Math | P2: 2010 Reading | P3: 2010 Math | P4: 2010 Cohort Reading | P5: 2010 Cohort Math | 2010 WR | 2011 WR |
|---|---|---|---|---|---|---|---|
| 42 | 14 | 12 | 27 | 13 | 48 | 35 | 96 |

## Comparison to peers for the IVs

2010 cohort Math is in the right tail because of its oddly high percentile. Other than that, the school is well within the peer distribution.

## Comparison to peers on the 2011 Reading score (DV)

This school had only a moderate increase in percentile, but since it is an outlier compared to its peers on 2011 Reading, it is a local outlier.

## Did all flagged schools exhibit suspicious behavior?

- 12 schools were potentially flagged incorrectly (extremely high or low achievement).
- The majority of the flagged schools exhibited suspicious behavior.
- Some schools flagged by RegLOD were also flagged by other statistical methods: these were local and also global outliers.
- Other schools were flagged by RegLOD but not by any other statistical method: these were local outliers but not global outliers.

## Conclusion

RegLOD has shown great promise in data forensics and is a valuable addition to our data forensic tools. Its applicability is not limited to cheating detection in educational testing: its robust, model-based design (the concept of dependent and independent variables in data mining) and its ability to adapt make it applicable to a wide range of outlier detection problems. We continue to study its capabilities and to extend and apply it to other contexts and tasks.

Thank you!
