A Methodology for Finding Bad Data


1 A Methodology for Finding Bad Data
Jaime Miranda (1), Richard Weber (1), Derek Partridge (2)
(1) Departamento de Ingeniería Industrial, Universidad de Chile
(2) Department of Computer Science, University of Exeter, UK

2 Outline
Data Mining – Introduction
KDD Process: Knowledge Discovery in Databases
Methodology for finding and correcting bad data
Application of proposed methodology
Conclusions and future work

3 Process of knowledge discovery in databases (KDD)
Selection → selected data → Pre-processing → pre-processed data → Transformation → transformed data → Data Mining → patterns → Interpretation / Evaluation

4 Preprocessing
Missing value (no value)
Value out of range (impossible value). Example: Age = 250
“Bad data” (possible, but strange). Example: Age = 112, or Age = 81 and student

5 Identification of data problems
Missing value (no value): sure
Value out of range (impossible value): sure
“Bad data” (possible, but strange): unsure
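The distinction between "sure" and "unsure" problems can be sketched in a few lines of Python. The function name `classify_age` and the bounds 0 and 130 are illustrative assumptions, not part of the presentation. The point is that a range check catches Age = 250 with certainty but lets Age = 112 through, which is why “bad data” needs a separate method.

```python
from typing import Optional


def classify_age(age: Optional[float], low: float = 0, high: float = 130) -> str:
    """Label a raw age value; the bounds are assumed for illustration."""
    if age is None:
        return "missing"        # no value: detected with certainty
    if not (low <= age <= high):
        return "out_of_range"   # impossible value: detected with certainty
    return "in_range"           # may still be "bad data" (possible, but strange)
```

For example, `classify_age(250)` is flagged as out of range, while `classify_age(112)` passes the check even though it is a strange value.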

6 Missing values, example:

7 Treatment of missing values
Do not use; flag the field with missing data
Fill in the missing value (mean value, imputation algorithms)
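A minimal sketch of these two treatments, assuming missing values are represented as `None` in a Python list. The helper name `impute_with_mean` is hypothetical. It keeps a flag per field and fills gaps with the mean of the observed values; more elaborate imputation algorithms are not shown.

```python
from statistics import mean


def impute_with_mean(values):
    """Return (filled values, missing flags) for a column with None gaps."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)                      # mean of the observed values only
    flags = [v is None for v in values]        # True where the field was missing
    filled = [fill if v is None else v for v in values]
    return filled, flags
```

Keeping the flags alongside the filled column lets later steps either skip the imputed fields or treat them as ordinary values.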

8 Value out of range, example:

9 “Bad data”, example:

10 Proposed generic methodology to find and correct “bad data” 1 of 2 (“replace all”)
Step 1: Identify candidates for “bad data”
Step 2: Develop regression model with “good data”
Step 3: Replace all “bad data”
STOP

11 Proposed generic methodology to find and correct “bad data” 2 of 2 (“replace iteratively”)
Step 1: Identify candidates for “bad data”
Step 2: Develop regression model with “good data”
Step 3: “Bad data” remaining?
No: STOP
Yes: replace only the “worst data” of the remaining set of “bad data”, then repeat
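The “replace iteratively” variant can be sketched as the loop below. Several choices here are assumptions rather than the authors' method: candidates are flagged by distance to the column mean in standard deviations, a one-predictor least-squares line stands in for the regression model (the presentation uses a neural network), candidates are re-identified and the model is refitted in every round, and a value that has been replaced once is not replaced again. All function and parameter names are hypothetical.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept (needs 2+ distinct xs)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def replace_iteratively(xs, ys, threshold=2.0, max_rounds=100):
    """Replace the worst flagged value of ys per round until none remain."""
    ys = [float(y) for y in ys]
    replaced = []
    for _ in range(max_rounds):
        mu, sigma = mean(ys), pstdev(ys)
        if sigma == 0:
            break
        # Candidates for "bad data": far from the mean, not yet replaced.
        bad = [i for i, y in enumerate(ys)
               if abs(y - mu) / sigma > threshold and i not in replaced]
        if not bad:
            break                               # no "bad data" remaining: STOP
        good = [i for i in range(len(ys)) if i not in bad]
        slope, intercept = fit_line([xs[i] for i in good],
                                    [ys[i] for i in good])
        worst = max(bad, key=lambda i: abs(ys[i] - mu))
        ys[worst] = slope * xs[worst] + intercept   # replace only the worst
        replaced.append(worst)
    return ys, replaced
```

Under the same assumptions, the “replace all” variant would run the body once and overwrite every index in `bad` instead of only `worst`.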

12 Identify candidates for “bad data”
Analyse each column independently and identify “deviation” from the “norm”, e.g.:
Deviation from mean value
Expert opinion
Combination of the two (filtering for expert judgement)
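One way to read "deviation from mean value" is as a distance to the column mean measured in standard deviations. The sketch below makes that assumption explicit. The threshold of 2.0 and the name `deviation_candidates` are illustrative, since the presentation does not state the exact rule used.

```python
from statistics import mean, pstdev


def deviation_candidates(column, threshold=2.0):
    """Indices whose value lies more than `threshold` std devs from the mean."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        return []               # constant column: nothing deviates
    return [i for i, v in enumerate(column) if abs(v - mu) / sigma > threshold]
```

The returned indices are only candidates. In the combined approach they could be passed on to an expert for judgement rather than replaced automatically.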

13 Develop regression model with “good data”
A_m = F(A_1, …, A_{m-1}), i.e. predict the “bad” attribute value from all the other (good) attribute values
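As a rough illustration of fitting F on “good data” and using it to predict a suspect value, the sketch below uses ordinary least squares with a single predictor attribute. This is a deliberate simplification: the formula above uses all remaining attributes, and the presentation's own model is a neural network. The names `fit_line` and `predict` are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by least squares on the good rows."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx           # requires at least two distinct x values
    return slope, my - slope * mx


def predict(model, x):
    """Predicted value of the suspect attribute for predictor value x."""
    slope, intercept = model
    return slope * x + intercept
```

The model is trained only on rows whose target value was not flagged, and `predict` then supplies the replacement for a flagged row.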

14 Example for proposed methodology: Customer segmentation

15 Clustering
[Figure: clusters]

16 Customer segmentation with clustering

17 Centers of 6 segments
Total database: 200,000 customers; take a subset of 320 customers for experimentation

18 Experiment
Take a subset of 320 customers
Change the value of attribute “Income” for 20 customers (10 values below the minimum (0) and 10 values above the maximum (5,000))
Apply the proposed methodology
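The experimental setup can be reproduced in outline as follows. The presentation does not say how far outside the range the injected values lie or how the 20 customers were chosen, so the random offsets (1 to 1,000), the seed, and the name `corrupt_income` are assumptions made only for this sketch.

```python
import random


def corrupt_income(incomes, n_low=10, n_high=10, lo=0.0, hi=5000.0, seed=0):
    """Return a corrupted copy of incomes and the sorted corrupted indices."""
    rng = random.Random(seed)                   # fixed seed for repeatability
    corrupted = list(incomes)
    chosen = rng.sample(range(len(incomes)), n_low + n_high)
    for i in chosen[:n_low]:
        corrupted[i] = lo - rng.uniform(1, 1000)    # below the minimum
    for i in chosen[n_low:]:
        corrupted[i] = hi + rng.uniform(1, 1000)    # above the maximum
    return corrupted, sorted(chosen)
```

Because the corrupted indices are returned, the result of the identification step can be scored against them afterwards.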

19 Step 1: Identify candidates for “bad data”
Identify “deviation” for attribute Income (here: deviation from mean value)
Result: 18 of the 20 “strange values” were identified

20 Step 2: Regression model used: neural network (MLP)
A_m = F(A_1, …, A_{m-1})

21 Neural networks
[Figure: natural neuron vs. artificial neuron; connections with weights feed a summation (Σ)]

22 Neural networks (Multilayer Perceptron)
[Figure: multilayer perceptron with input layer (A_1, …, A_{m-1}), hidden layer, and output layer (A_m)]

23 Results (“replace all”)

24 Evaluation of Results

25 Results (“replace iteratively vs. replace all”)

26 Characteristics of proposed methodology
Identifies candidates for “bad data” per attribute (column), without looking at other attributes
Uses no background knowledge regarding attributes (e.g. that income cannot be negative)
Each step offers opportunities for different methods (here: deviation detection using distance to mean, regression model by neural network)

27 Future work
Apply to larger data sets
Try different techniques for identifying “candidates for bad data”, e.g. by looking at other attributes
Implementation in Matlab

