Evaluating data quality issues from an industrial data set Gernot Liebchen Bheki Twala Mark Stephens Martin Shepperd Michelle Cartwright

What is it all about?
Motivations
Dataset – the origin & quality issues
Noise & cleaning methods
The Experiment
Issues & conclusion
Future Work

Motivations
A previous investigation compared 3 noise handling methods (robust algorithms [pruning], filtering, polishing)
Predictive accuracy was highest with polishing, followed by pruning, and only then by filtering
But suspicions about the results were raised (at EASE)

Suspicions about the previous investigation
The dataset contained missing values which were imputed (artificially created) during the building of the model (a decision tree)
Polishing alters the data (what impact can that have?)
The methods were evaluated using the predictions of another decision tree -> Can the findings be supported by a metrics specialist?

Why do we bother?
Good quality data is important for good quality predictions and assessments
How can we hope for good quality results if the quality of the input data is not good?
The data is used for a variety of different purposes – esp. analysis and estimation support

The Dataset
A large dataset provided by EDS
The original dataset contains more than cases with 22 attributes
Contains information about software projects carried out since the beginning of the 1990s
Some attributes are purely administrative (e.g. Project Name, Project ID), and will not have any impact on software productivity

Suspicions
The data might contain noise; this was confirmed by preliminary analysis, which also indicated the existence of outliers

How could it occur? (in the case of this dataset)
Input errors (some teams might be more meticulous than others) / the person approving the data might not be as meticulous
Misunderstood standards
The input tool might not provide range checking (or only limited checking)
"Service Excellence" dashboard in headquarters
Local management pressure

Suspicious Data Example
Start Date: 01/08/ /06/2002
Finish Date: 24/02/ /02/2004
Name: *******Rel 24 - *******Rel 24
FP Count:
Effort:
Country: IRELAND-UK
Industry Sector: Government-Government
Project Type: Enhance.-Enhance.
Etc.
But there were also examples with extremely high/low FP counts per hour (1 FP for hours; or 52 FP in 4 hours; 1746 FP in 468 hours)
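Implausible productivity rates like those above can be screened for automatically. A minimal sketch, assuming purely illustrative FP-per-hour bounds (the study relied on a metrics specialist's judgement, not fixed thresholds like these):

```python
# Hypothetical sanity check for implausible productivity rates.
# The min/max FP-per-hour bounds are illustrative assumptions,
# not values from the study.

def flag_unrealistic(cases, min_rate=0.01, max_rate=2.0):
    """Return indices of cases whose FP-per-hour rate is implausible.

    cases: list of (fp_count, effort_hours) tuples.
    """
    flagged = []
    for i, (fp, hours) in enumerate(cases):
        if hours <= 0:
            flagged.append(i)  # zero/negative effort is itself suspect
            continue
        rate = fp / hours
        if rate < min_rate or rate > max_rate:
            flagged.append(i)
    return flagged

# 52 FP in 4 hours (13 FP/h) and 1746 FP in 468 hours (~3.7 FP/h)
# both fall outside the assumed plausible range.
print(flag_unrealistic([(52, 4), (1746, 468), (300, 2500)]))  # → [0, 1]
```

Such a filter only catches rate outliers; whether a flagged case is a genuine error still needs domain knowledge, as the slides below note.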

What imperfections could occur?
Noise – random errors
Outliers – exceptional "true" cases
Missing data
From now on, noise and outliers will both be called noise, because both are unwanted

Noise detection can be
Distance based (e.g. visualisation methods; Cook's, Mahalanobis and Euclidean distance; distance clustering)
Distribution based (e.g. neural networks, forward search algorithms and robust tree modelling)
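As a toy illustration of the distance-based family, a sketch that flags points whose Euclidean distance from the data centroid is far above the mean distance (the cut-off factor is an assumption; the study itself used tree-based detection, not this rule):

```python
import math

# Minimal distance-based noise detection: flag points whose Euclidean
# distance from the centroid exceeds `factor` times the mean distance.

def distance_outliers(points, factor=2.0):
    n = len(points)
    dims = len(points[0])
    centroid = [sum(p[d] for p in points) / n for d in range(dims)]
    dists = [math.dist(p, centroid) for p in points]
    mean_d = sum(dists) / n
    return [i for i, d in enumerate(dists) if d > factor * mean_d]

points = [(1, 1), (2, 1), (1, 2), (2, 2), (40, 40)]
print(distance_outliers(points))  # → [4], the far-away point
```

Mahalanobis distance would additionally account for attribute scale and correlation, at the cost of estimating and inverting a covariance matrix.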

What to do with noise?
First, detection (we used decision trees – usually a pattern detection tool in data mining – here used to categorise the data in a training set, with cases tested in a test set)
3 basic options for cleaning: polishing, filtering, pruning

Polishing/Filtering/Pruning
Polishing – identifying the noise and correcting it
Filtering – identifying the noise and eliminating it
Pruning – avoiding overfitting (trying to ignore leverage effects): the instances which lead to overfitting can be seen as noise and are taken out
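The filtering idea can be sketched as: fit a classifier to the data and treat training instances it misclassifies as noise candidates to remove. Here a leave-one-out 1-nearest-neighbour rule stands in for the decision tree used in the study; the data and labels are invented for illustration:

```python
import math

# Filtering sketch: instances misclassified by a simple classifier
# (leave-one-out 1-NN, standing in for a decision tree) are treated
# as noise and removed from the cleaned set.

def filter_noise(X, y):
    noisy = []
    for i, xi in enumerate(X):
        # nearest neighbour among all *other* instances
        j = min((k for k in range(len(X)) if k != i),
                key=lambda k: math.dist(xi, X[k]))
        if y[j] != y[i]:
            noisy.append(i)
    cleaned = [(x, label)
               for k, (x, label) in enumerate(zip(X, y)) if k not in noisy]
    return noisy, cleaned

X = [(0, 0), (0, 0.5), (0.5, 0), (5, 5), (5, 6), (1, 1)]
y = ["low", "low", "low", "high", "high", "high"]  # last label looks wrong
noisy, cleaned = filter_noise(X, y)
print(noisy)  # → [5]: the apparently mislabelled case is flagged
```

Pruning, by contrast, never removes instances explicitly; it simplifies the fitted tree so that such cases stop influencing the model.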

What did we do? & How did we do it?
Compared the results of filtering and pruning and discussed the implications of pruning
Reduced the dataset to eliminate cases with missing values (avoiding missing value imputation)
Produced lists of "noisy" instances and their polished counterparts
Passed them on to Mark (as the metrics specialist)

Results
Filtering produced a list of 226 cases from 436 (36% in the noise list / 21% in the cleaned set)
Pruning produced a list of 191 from 436 (33% in the noise list / 25% in the cleaned set)
Both lists were inspected, and both contain a large number of possible true cases as well as unrealistic cases (productivity)

Results 2
By just inspecting historical data it was not possible to judge which method performed better
The decision tree as a noise detector does not detect unrealistic instances, only outliers in the dataset; this can only be overcome with domain knowledge

So what about polishing?
Polishing does not necessarily alter size or effort, so we are still left with unrealistic instances
It makes them fit the regression model
Is this acceptable from the point of view of the data owner? It depends on the application of the results
What if unrealistic cases impact on the model?
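The mechanism being questioned here can be sketched as follows: instead of deleting a suspect case, polishing replaces its suspect value with what a model fitted to the remaining data would predict. A one-variable least-squares fit of effort on FP count stands in for the study's model; the data values are invented:

```python
# Polishing sketch: the suspect effort value is overwritten with the
# prediction of a regression fitted to the other cases. Simple linear
# least squares stands in for the model used in the study.

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def polish(data, noisy_index):
    """data: list of (fp_count, effort) pairs; overwrite one effort value."""
    rest = [d for i, d in enumerate(data) if i != noisy_index]
    slope, intercept = fit_line([fp for fp, _ in rest],
                                [e for _, e in rest])
    fp, _ = data[noisy_index]
    polished = list(data)
    polished[noisy_index] = (fp, slope * fp + intercept)
    return polished

data = [(100, 500), (200, 1000), (300, 1500), (250, 20)]  # last effort suspect
print(polish(data, 3))  # last case becomes (250, 1250.0)
```

This makes the slide's concern concrete: the polished case now agrees with the model by construction, whether or not 1250 hours is what the project actually cost.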

Issues/Conclusions
In order to build the models we had to categorise the dependent variable into 3 categories
BUT these categories appeared too coarse for our evaluation of the predictions
If we know there are unrealistic cases, we should take them out before applying the cleaning methods (to avoid including these cases in the building of the model)

Where to go from here?
Rerun the experiment without "unrealistic cases"
Simulate a dataset from a known model, induce noise and missing values, and evaluate the methods with knowledge of the real underlying model

What was it all about?
Motivations
Dataset – the origin & quality issues
Noise & cleaning methods
The Experiment
Issues & conclusion
Future Work

Any Questions?