1 Gernot.Liebchen@Brunel.ac.uk Evaluating data quality issues from an industrial data set Gernot Liebchen Bheki Twala Mark Stephens Martin Shepperd Michelle Cartwright

2 Gernot.Liebchen@Brunel.ac.uk What is it all about? Motivations Dataset – the origin & quality issues Noise & cleaning methods The Experiment Issues & conclusion Future Work

3 Gernot.Liebchen@Brunel.ac.uk Motivations A previous investigation compared 3 noise handling methods (robust algorithms [pruning], filtering, polishing). Predictive accuracy was highest with polishing, followed by pruning, and only then by filtering. But suspicions were raised (at EASE).

4 Gernot.Liebchen@Brunel.ac.uk Suspicions about previous investigation The dataset contained missing values which were imputed (artificially created) during the build of the model (decision tree). Polishing alters the data (what impact can that have?). The methods were evaluated using the predictions of another decision tree -> Can the findings be supported by a metrics specialist?

5 Gernot.Liebchen@Brunel.ac.uk Why do we bother? Good quality data is important for good quality predictions and assessments. How can we hope for good quality results if the quality of the input data is not good? The data is used for a variety of different purposes - esp. analysis and estimation support.

6 Gernot.Liebchen@Brunel.ac.uk The Dataset We were given a large dataset provided by EDS. The original dataset contains more than 10,000 cases with 22 attributes. It contains information about software projects carried out since the beginning of the 1990s. Some attributes are purely administrative (e.g. Project Name, Project ID) and will not have any impact on software productivity.

7 Gernot.Liebchen@Brunel.ac.uk Suspicions The data might contain noise; this was confirmed by a preliminary analysis of the data, which also indicated the existence of outliers.

8 Gernot.Liebchen@Brunel.ac.uk How could it occur? (in the case of the dataset) Input errors (some teams might be more meticulous than others), and the person approving the data might not be as meticulous. Misunderstood standards. The input tool might not provide range checking (or only limited checking). The “Service Excellence” dashboard at headquarters. Local management pressure.

9 Gernot.Liebchen@Brunel.ac.uk Suspicious Data Example Two near-identical records:
Start Date: 01/08/2002 - 01/06/2002
Finish Date: 24/02/2004 - 09/02/2004
Name: *******Rel 24 - *******Rel 24
FP Count: 1522 - 1522
Effort: 38182.75 - 33461.5
Country: IRELAND - UK
Industry Sector: Government - Government
Project Type: Enhance. - Enhance.
Etc.
But there were also examples with extremely high/low FP counts per hour (1 FP for 6916.25 hours; 52 FP in 4 hours; 1746 FP in 468 hours).
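
Such implausible productivity figures can be surfaced with a very simple screening pass. Below is a minimal sketch (not the project's actual tooling) assuming hypothetical column names fp_count and effort_hours and purely illustrative thresholds.

```python
import pandas as pd

# Hypothetical records, including the extreme cases quoted on the slide.
projects = pd.DataFrame({
    "fp_count":     [1,       52,  1746,  1522],
    "effort_hours": [6916.25, 4.0, 468.0, 38182.75],
})

# Function points delivered per hour of effort; extreme values in either
# direction point to data-entry problems rather than genuine productivity.
projects["fp_per_hour"] = projects["fp_count"] / projects["effort_hours"]

# Flag anything outside a broad plausibility band (thresholds are illustrative).
suspicious = projects[(projects["fp_per_hour"] < 0.001) |
                      (projects["fp_per_hour"] > 2.0)]
print(suspicious)
```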

10 Gernot.Liebchen@Brunel.ac.uk What imperfections could occur? Noise – random errors. Outliers – exceptional “true” cases. Missing data. From now on, noise and outliers will both be called noise because both are unwanted.

11 Gernot.Liebchen@Brunel.ac.uk Noise Detection can be Distance based (e.g. visualisation methods; Cook's, Mahalanobis and Euclidean distance; distance clustering) or Distribution based (e.g. neural networks, forward search algorithms and robust tree modelling)
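
As an illustration of the distance-based family listed here, the following is a minimal Mahalanobis-distance screen; it is not the approach used in the study, and the chi-square cut-off, the toy size/effort data, and the function name are assumptions of this sketch.

```python
import numpy as np
from scipy import stats

def mahalanobis_outliers(X, alpha=0.01):
    """Flag rows whose squared Mahalanobis distance from the sample mean
    exceeds the chi-square critical value for the given significance level."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)   # squared distances
    return d2 > stats.chi2.ppf(1 - alpha, df=X.shape[1])

# Toy size/effort pairs plus one gross outlier appended at the end;
# expect the appended row to be the one flagged.
rng = np.random.default_rng(0)
inliers = rng.normal([100.0, 800.0], [10.0, 80.0], size=(30, 2))
X = np.vstack([inliers, [[52.0, 4.0]]])
print(mahalanobis_outliers(X))
```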

12 Gernot.Liebchen@Brunel.ac.uk What to do with noise? First, detection (we used decision trees - usually a pattern detection tool in data mining - here used to categorise the data in a training set, with the cases then tested on a test set). Then there are 3 basic options for cleaning: polishing, filtering, pruning.
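
A minimal sketch of the general detection idea, using scikit-learn rather than the authors' implementation: instances that a cross-validated decision tree cannot classify correctly are treated as candidate noise. The function name and the choice of 10-fold cross-validation are assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

def flag_candidate_noise(X, y, folds=10):
    """Return indices of instances misclassified by a decision tree under
    cross-validation; these are the candidates handed to the cleaning step."""
    tree = DecisionTreeClassifier(random_state=0)
    predicted = cross_val_predict(tree, X, y, cv=folds)
    return np.flatnonzero(predicted != np.asarray(y))
```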

13 Gernot.Liebchen@Brunel.ac.uk Polishing/Filtering/Pruning Polishing – identifying the noise and correcting it Filtering – identifying the noise and eliminating it Pruning – avoiding overfitting (trying to ignore the leverage effects) – the instances which lead to overfitting can be seen as noise and are taken out
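
To make the three options concrete, here is a minimal sketch built on the detector above. Treating the class label as the value to polish, and using a tree fitted on the retained rows as the polishing model, are assumptions of this sketch rather than the paper's exact procedure.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def filter_noise(X, y, noisy_idx):
    """Filtering: identify the noise and eliminate it (drop the flagged rows)."""
    keep = np.setdiff1d(np.arange(len(y)), noisy_idx)
    return X[keep], y[keep]

def polish_noise(X, y, noisy_idx):
    """Polishing: identify the noise and correct it. Here the flagged labels
    are replaced by predictions from a model fitted on the retained rows."""
    X_keep, y_keep = filter_noise(X, y, noisy_idx)
    model = DecisionTreeClassifier(random_state=0).fit(X_keep, y_keep)
    y_polished = np.array(y, copy=True)
    y_polished[noisy_idx] = model.predict(X[noisy_idx])
    return X, y_polished

# Pruning changes the model rather than the data: a tree that is not allowed
# to overfit, e.g. DecisionTreeClassifier(min_samples_leaf=5) or one pruned
# via cost-complexity pruning (ccp_alpha), simply ignores the leverage of
# isolated noisy instances instead of removing or correcting them.
```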

14 Gernot.Liebchen@Brunel.ac.uk What did we do? & How did we do it? Compared the results of filtering and pruning and discussed the implications of pruning. Reduced the dataset to eliminate cases with missing values (to avoid missing value imputation). Produced lists of “noisy” instances and their polished counterparts. Passed them on to Mark (as the metrics specialist).

15 Gernot.Liebchen@Brunel.ac.uk Results Filtering produced a list of 226 cases from 436 (36% in the noise list / 21% in the cleaned set). Pruning produced a list of 191 from 436 (33% in the noise list / 25% in the cleaned set). Both lists were inspected and both contain a large number of possible true cases as well as unrealistic cases (in terms of productivity).

16 Gernot.Liebchen@Brunel.ac.uk Results 2 By just inspecting historical data it was not possible to judge which method performed better. The decision tree as a noise detector does not detect unrealistic instances but outliers in the dataset, which can only be overcome with domain knowledge.

17 Gernot.Liebchen@Brunel.ac.uk So what about polishing? Polishing does not necessarily alter size or effort, so we are still left with unrealistic instances; it just makes them fit the regression model. Is this acceptable from the point of view of the data owner? It depends on the application of the results. And what if the unrealistic cases impact the model?

18 Gernot.Liebchen@Brunel.ac.uk Issues/Conclusions In order to build the models we had to categorise the dependent variable into 3 categories (one cut point at 2985.5). BUT these categories appeared too coarse for our evaluation of the predictions. If we know there are unrealistic cases, we should really take them out before we apply the cleaning methods (to avoid the inclusion of these cases in the building of the model).
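
As an illustration of that categorisation step, this is a minimal sketch using pandas.cut. The 2985.5 boundary comes from the slide; the lower boundary, the category labels and the example effort values are placeholders.

```python
import pandas as pd

effort = pd.Series([120.0, 950.0, 3300.0, 48.0, 2985.5])

low_cut = 500.0      # hypothetical lower boundary
high_cut = 2985.5    # cut point mentioned on the slide

# Three coarse effort categories; with only three bins, very different
# projects inevitably land in the same class, which is the coarseness
# problem noted above.
effort_class = pd.cut(effort,
                      bins=[-float("inf"), low_cut, high_cut, float("inf")],
                      labels=["low", "medium", "high"])
print(effort_class)
```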

19 Gernot.Liebchen@Brunel.ac.uk Where to go from here? Rerun the experiment without the “unrealistic cases”. Simulate a dataset from a known model, induce noise and missing values, and evaluate the methods with the knowledge of what the real underlying model is.

20 Gernot.Liebchen@Brunel.ac.uk What was it all about? Motivations Dataset – the origin & quality issues Noise & Cleaning methods The Experiment Issues & conclusion Future Work

21 Gernot.Liebchen@Brunel.ac.uk Any Questions?

