G54DMT – Data Mining Techniques and Applications Dr. Jaume Bacardit


1 G54DMT – Data Mining Techniques and Applications http://www.cs.nott.ac.uk/~jqb/G54DMT Dr. Jaume Bacardit jqb@cs.nott.ac.uk Topic 2: Data Preprocessing Lecture 4: Handling missing values Structure of the lecture taken from http://sci2s.ugr.es/MVDM/index.php

2 Outline of the lecture What is a missing value and why is it important? Strategies for missing values handling Imputation techniques

3 What is a missing value? A missing value (MV) is an empty cell in the table that represents a dataset [Figure: dataset table labelled by instances and attributes, with a "?" marking the missing cell]

4 Missing values in the ARFF format
@relation labor
@attribute 'duration' real
@attribute 'wage-increase-first-year' real
@attribute 'wage-increase-second-year' real
@attribute 'wage-increase-third-year' real
@attribute 'cost-of-living-adjustment' {'none','tcf','tc'}
@attribute 'working-hours' real
@attribute 'pension' {'none','ret_allw','empl_contr'}
@attribute 'standby-pay' real
@attribute 'shift-differential' real
@attribute 'education-allowance' {'yes','no'}
@attribute 'statutory-holidays' real
@attribute 'vacation' {'below_average','average','generous'}
@attribute 'longterm-disability-assistance' {'yes','no'}
@attribute 'contribution-to-dental-plan' {'none','half','full'}
@attribute 'bereavement-assistance' {'yes','no'}
@attribute 'contribution-to-health-plan' {'none','half','full'}
@attribute 'class' {'bad','good'}
@data
1,5,?,?,?,40,?,?,2,?,11,'average',?,?,'yes',?,'good'
2,4.5,5.8,?,?,35,'ret_allw',?,?,'yes',11,'below_average',?,'full',?,'full','good'
?,?,?,?,?,38,'empl_contr',?,5,?,11,'generous','yes','half','yes','half','good'
3,3.7,4,5,'tc',?,?,?,?,'yes',?,?,?,?,'yes',?,'good'
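
To make the "?" convention concrete, here is a minimal Python sketch that reads the four data rows of the labor example and counts the missing values per attribute. The helper names `parse_row` and `missing_per_column` are made up for this illustration, and the naive comma split assumes no quoted value contains a comma; a real ARFF reader would be more careful.

```python
# Sketch only: the four @data rows from the labor example, as raw text.
rows = [
    "1,5,?,?,?,40,?,?,2,?,11,'average',?,?,'yes',?,'good'",
    "2,4.5,5.8,?,?,35,'ret_allw',?,?,'yes',11,'below_average',?,'full',?,'full','good'",
    "?,?,?,?,?,38,'empl_contr',?,5,?,11,'generous','yes','half','yes','half','good'",
    "3,3.7,4,5,'tc',?,?,?,?,'yes',?,?,?,?,'yes',?,'good'",
]


def parse_row(line):
    """Split one data line; '?' becomes None, other fields stay as strings."""
    fields = []
    for field in line.split(","):  # assumes no commas inside quoted values
        field = field.strip()
        fields.append(None if field == "?" else field.strip("'"))
    return fields


def missing_per_column(table):
    """Number of missing cells in each attribute (column)."""
    return [sum(1 for row in table if row[j] is None)
            for j in range(len(table[0]))]


table = [parse_row(r) for r in rows]
counts = missing_per_column(table)
print(counts)
```

In these four rows the eighth attribute ('standby-pay') is missing everywhere, while the class attribute is always present.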

5 Why do missing values exist? Faulty equipment Incorrect measurements Missing cells in manual data entry – Very frequent in questionnaires for medical scenarios – Actually having a low rate of missing values may be suspicious (Barnard and Meng, 99) Censored/anonymous data

6 Why are missing values important? Three reasons – Loss of efficiency: fewer patterns extracted from the data, or statistically weaker conclusions – Complications in handling and analysing the data: methods are in general not prepared to handle missing values – Bias resulting from differences between missing and complete data: data mining methods generate different models (Barnard and Meng, 99)

7 Characterisation of missing values (Little & Rubin, 87) Three categories of missing values – Missing completely at random (MCAR), when the probability of an example having a missing value for an attribute does not depend on either the observed data or the missing data – Missing at random (MAR), when that probability depends on the observed data, but not on the missing data – Not missing at random (NMAR), when that probability depends on the missing values themselves. Whether a handling method is suitable depends on the type of missing value
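
The three categories differ only in what the chance of a value going missing is allowed to depend on. The sketch below illustrates this with a toy two-attribute record. The attributes `age` (always observed) and `income` (may go missing), the thresholds and the probabilities are all invented for illustration and are not from the lecture.

```python
import random


def missing_probability(mechanism, age, income):
    """Chance that `income` is missing under each mechanism (toy numbers)."""
    if mechanism == "MCAR":
        return 0.2                              # ignores both attributes
    if mechanism == "MAR":
        return 0.4 if age > 50 else 0.1         # depends on observed data only
    if mechanism == "NMAR":
        return 0.4 if income > 30000 else 0.1   # depends on the hidden value
    raise ValueError("unknown mechanism: " + mechanism)


def mask_income(records, mechanism, seed=0):
    """Return (age, income) pairs with income replaced by None at random."""
    rng = random.Random(seed)
    masked = []
    for age, income in records:
        p = missing_probability(mechanism, age, income)
        masked.append((age, None if rng.random() < p else income))
    return masked


print(mask_income([(25, 20000), (60, 45000), (40, 90000)], "MAR"))
```

Under NMAR the missingness pattern carries information about the very values that were lost, which is why a model built from the observed data alone cannot recover them.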

8 Strategies for missing values handling (Farhangfar et al., 08) Discarding examples with missing values – Simplest approach. Allows the use of unmodified data mining methods – Only practical if there are few examples with missing values. Otherwise, it can introduce bias Convert the missing values into a new value – Use a special value for it – Add an attribute that indicates whether the value is missing or not – Greatly increases the difficulty of the data mining process Imputation methods – Assign a value to the missing one, based on the rest of the dataset – Use the unmodified data mining methods
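
The first two strategies are simple enough to sketch directly. The code below assumes a dataset stored as a list of rows with `None` marking a missing cell. The function names and the placeholder value are hypothetical choices for this example.

```python
def discard_incomplete(table):
    """Strategy 1: keep only the rows without any missing value."""
    return [row for row in table if None not in row]


def add_missing_indicator(table, column, placeholder):
    """Strategy 2: fill `column` with a special value and append a 0/1 flag."""
    result = []
    for row in table:
        is_missing = row[column] is None
        new_row = list(row)
        if is_missing:
            new_row[column] = placeholder
        new_row.append(1 if is_missing else 0)  # the extra indicator attribute
        result.append(new_row)
    return result


data = [[1, 5.0], [2, None], [3, 4.0]]
print(discard_incomplete(data))
print(add_missing_indicator(data, 1, -1.0))
```

The second function shows where the added difficulty comes from: every attribute treated this way adds one more column for the learner to handle.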

9 Imputation methods As they extract a model from the dataset to perform the imputation, they are suitable for MCAR and, to a lesser extent, MAR types of missing values Not suitable for NMAR type of missing data – It would be necessary in this case to go back to the source of the data to obtain more information In the next slides several imputation methods will be described. All references to the original papers presenting them are available at http://sci2s.ugr.es/MVDM/index.php

10 Do Not Impute (DNI) Simply use the default MV policy of the data mining method Only suitable if such a policy exists Example for rule learning – Attributes with missing values would be considered as irrelevant
Function match (instance x, rule r)
  Foreach attribute att in the domain
    If att is present in x and (x.att < r.att.lower or x.att > r.att.upper)
      Return false
    EndIf
  EndFor
  Return true
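
A possible Python rendering of the matching function, as a sketch only. It assumes that a rule is a dictionary mapping attribute names to hypothetical `(lower, upper)` intervals and that an instance stores `None` for a missing value. It loops over the attributes constrained by the rule rather than the whole domain, which gives the same result when unconstrained attributes accept any value.

```python
def match(instance, rule):
    """Return True if the instance satisfies every interval of the rule.

    A missing attribute (None or absent) is treated as irrelevant: it can
    never cause the match to fail.
    """
    for att, (lower, upper) in rule.items():
        value = instance.get(att)
        if value is not None and (value < lower or value > upper):
            return False
    return True


rule = {"working-hours": (35, 40)}
print(match({"working-hours": 38}, rule))    # inside the interval
print(match({"working-hours": None}, rule))  # missing, so ignored
print(match({"working-hours": 45}, rule))    # outside the interval
```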

11 Most Common (MC) value If the missing value is continuous – Replace it with the mean value of the attribute for the dataset If the missing value is discrete – Replace it with the most frequent value of the attribute for the dataset Simple and fast to compute Assumes that each attribute presents a normal distribution [Figure: "?" cell replaced by "Ave"]
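
A minimal sketch of the MC policy for a single attribute, assuming the column is a Python list with `None` for missing cells and that the caller says whether the attribute is continuous. The function name is made up for this example.

```python
from collections import Counter


def most_common_impute(column, continuous):
    """Fill None entries with the mean (continuous) or the mode (discrete)."""
    observed = [v for v in column if v is not None]
    if continuous:
        fill = sum(observed) / len(observed)
    else:
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in column]


print(most_common_impute([40, None, 38, 36], continuous=True))
print(most_common_impute(["yes", None, "yes", "no"], continuous=False))
```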

12 Concept Most Common (CMC) value Refinement of the MC policy The MV is replaced with the mean/most frequent value computed from the instances belonging to the same class Assumes that the distribution for an attribute of all instances from the same class is normal [Figure: "?" cell replaced by "Ave"]
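
The class-conditional refinement can be sketched in the same way. The code assumes one attribute column plus a parallel list of class labels, and it assumes every class has at least one observed value for the attribute (otherwise the lookup fails). Names are hypothetical.

```python
from collections import Counter, defaultdict


def class_conditional_impute(column, labels, continuous):
    """Fill None entries using only the instances of the same class."""
    by_class = defaultdict(list)
    for value, label in zip(column, labels):
        if value is not None:
            by_class[label].append(value)

    fills = {}
    for label, values in by_class.items():
        if continuous:
            fills[label] = sum(values) / len(values)
        else:
            fills[label] = Counter(values).most_common(1)[0][0]

    return [fills[label] if value is None else value
            for value, label in zip(column, labels)]


column = [10, None, 30, None, 50]
labels = ["good", "good", "bad", "bad", "bad"]
print(class_conditional_impute(column, labels, continuous=True))
```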

13 Imputation with k-Nearest Neighbour (KNNI) k-NN machine learning algorithm – Given an unlabeled new instance – Select the k instances from the training set most similar to the new instance What is similar? E.g. the Euclidean distance – Predict the majority class from these k instances k-NN for MV imputation – Select the k nearest neighbours – Replace the MV with the most frequent/mean value from these k instances
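
A small sketch of KNNI for a continuous attribute. Several details are assumptions of this example rather than part of the lecture: the distance is Euclidean over the attributes observed in both rows, and only rows that have a value for the target attribute can act as neighbours.

```python
import math


def distance(a, b, skip):
    """Euclidean distance over attributes observed in both rows, except `skip`."""
    total = 0.0
    for j, (x, y) in enumerate(zip(a, b)):
        if j != skip and x is not None and y is not None:
            total += (x - y) ** 2
    return math.sqrt(total)


def knn_impute(table, row_index, column, k):
    """Mean of `column` over the k nearest rows that have it observed."""
    target = table[row_index]
    donors = [row for i, row in enumerate(table)
              if i != row_index and row[column] is not None]
    donors.sort(key=lambda row: distance(target, row, column))
    nearest = donors[:k]
    return sum(row[column] for row in nearest) / len(nearest)


table = [
    [1.0, 1.0, 10.0],
    [1.1, 0.9, 12.0],
    [5.0, 5.0, 50.0],
    [1.0, 1.0, None],   # the record to impute
]
print(knn_impute(table, row_index=3, column=2, k=2))
```

For a discrete attribute the final mean would be replaced by the most frequent value among the neighbours.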

14 Weighted imputation with K-Nearest Neighbour (WKNNI) Refinement of KNNI – Select the k neighbours as before – MV generation is now performed through a weighted average of the values for the missing attribute from these k neighbours – The closest of the k neighbours carry more weight
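
The slide does not say how the weights are computed. The sketch below assumes inverse-distance weights, one common choice, with a small `eps` to avoid division by zero. The neighbour selection and distance follow the same assumptions as a plain k-NN imputation over the attributes observed in both rows.

```python
import math


def distance(a, b, skip):
    """Euclidean distance over attributes observed in both rows, except `skip`."""
    total = 0.0
    for j, (x, y) in enumerate(zip(a, b)):
        if j != skip and x is not None and y is not None:
            total += (x - y) ** 2
    return math.sqrt(total)


def weighted_knn_impute(table, row_index, column, k, eps=1e-9):
    """Inverse-distance weighted mean over the k nearest donors (assumed weighting)."""
    target = table[row_index]
    donors = [row for i, row in enumerate(table)
              if i != row_index and row[column] is not None]
    donors.sort(key=lambda row: distance(target, row, column))
    nearest = donors[:k]
    weights = [1.0 / (distance(target, row, column) + eps) for row in nearest]
    weighted_sum = sum(w * row[column] for w, row in zip(weights, nearest))
    return weighted_sum / sum(weights)


table = [
    [0.0, 10.0],
    [3.0, 40.0],
    [1.0, None],   # distance 1 to the first row, 2 to the second
]
print(weighted_knn_impute(table, row_index=2, column=1, k=2))
```

With these numbers the nearer row counts twice as much as the farther one, so the result is pulled below the unweighted mean of 25.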

15 K-means Clustering Imputation (KMI) Clustering: automatic aggregation of instances in groups K-means: Given a dataset it identifies k (predefined parameter) groups (clusters) of similar instances. For each cluster it computes the centroid – Artificial representative of the cluster – Mean/most frequent value of instances in a cluster MV imputation – Identify the cluster to which the instance with the MV belongs – Take the value of the centroid for the missing attribute
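
A compact sketch of KMI for continuous attributes. It makes several simplifying assumptions that are not in the lecture: the clusters are fitted on the complete rows only, the initial centroids are simply the first k complete rows, and an incomplete row is assigned to a cluster using only its observed attributes.

```python
def sq_dist(a, b):
    """Squared Euclidean distance over the attributes observed in both rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b)
               if x is not None and y is not None)


def kmeans(rows, k, iterations=20):
    """Plain k-means on complete rows; initial centroids are the first k rows."""
    centroids = [list(row) for row in rows[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for row in rows:
            nearest = min(range(k), key=lambda c: sq_dist(row, centroids[c]))
            clusters[nearest].append(row)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def kmeans_impute(table, k):
    """Replace each None with the matching coordinate of the nearest centroid."""
    complete = [row for row in table if None not in row]
    centroids = kmeans(complete, k)
    imputed = []
    for row in table:
        centroid = min(centroids, key=lambda cen: sq_dist(row, cen))
        imputed.append([c if v is None else v for v, c in zip(row, centroid)])
    return imputed


table = [[1.0, 1.0], [9.0, 9.0], [1.2, 0.8], [8.8, 9.2], [1.1, None], [None, 9.1]]
for row in kmeans_impute(table, k=2):
    print(row)
```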

16 Imputation with Fuzzy K-means Clustering (FKMI) Fuzzy logic: Reasoning framework that explicitly takes into account uncertainty In fuzzy k-means each instance does not simply belong to a cluster or not; it has a membership degree to each cluster Missing values are computed as a weighted sum of all centroids, using the membership function of each cluster as the weight
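
The weighted sum can be sketched as follows, assuming the centroids have already been fitted. The membership formula used here is the standard fuzzy c-means one with fuzzifier `m`; the slide does not state which formula is intended, so treat it as an assumption of this example.

```python
import math


def memberships(row, centroids, m=2.0):
    """Fuzzy c-means membership of `row` in each cluster (observed attributes only)."""
    dists = [math.sqrt(sum((x - c) ** 2 for x, c in zip(row, centroid)
                           if x is not None))
             for centroid in centroids]
    if any(d == 0.0 for d in dists):
        # The row sits exactly on one or more centroids: share membership among them.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_c / d_j) ** power for d_j in dists) for d_c in dists]


def fuzzy_impute(row, column, centroids, m=2.0):
    """Membership-weighted sum of the centroids' values for `column`."""
    weights = memberships(row, centroids, m)
    return sum(w * centroid[column] for w, centroid in zip(weights, centroids))


centroids = [[0.0, 10.0], [4.0, 30.0]]
row = [1.0, None]                         # distance 1 and 3 to the centroids
print(memberships(row, centroids))        # roughly 0.9 and 0.1
print(fuzzy_impute(row, 1, centroids))    # roughly 0.9 * 10 + 0.1 * 30
```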

17 Support Vector Machines Imputation (SVMI) SVMs (Vapnik, 1995) are state-of-the-art statistical learning methods for classification and regression For each attribute with missing values in the dataset – Select the instances that do not have a missing value for the attribute – Train an SVM to predict the attribute from the class labels – Impute the missing values using the SVM

18 Event Covering (EC) Method based on information theory – The branch of mathematics dealing with the efficient and accurate storage, transmission, and representation of information (Mathworld, Wolfram Research) Requires discrete variables. It includes its own discretisation procedure for the continuous attributes EC assesses the degree of interdependency of each pair of variables, and identifies the significant dependencies MV imputation is computed from the variables that are interdependent with the MV attribute, taking into account the values for these

19 Expectation-Maximization (EM) The EM method is a statistical procedure that adjusts the parameters of a mathematical model of data with missing observations Contribution from the statistics world It iterates between two steps – Expectation: Generate imputations of the missing data from the model – Maximization: Use both the imputed and the known data to refine the parameters of the model The procedure stops when the model remains unchanged from one iteration to the next

20 Expectation-Maximization (EM) What is the model? – Mean and covariance of the data How is the imputation done? – For each record with missing values, a linear regression process is used to estimate the missing values from the known values and the model If the number of variables is larger than the number of records, a special regularised version of the algorithm is required
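
Under the usual assumption that the data follow a multivariate normal distribution, the regression step above amounts to taking the conditional mean of the missing entries given the observed ones. The symbols below are introduced for this illustration and do not appear in the lecture: a record is split into its observed part x_o and missing part x_m, and μ and S are the current estimates of the mean and covariance, partitioned the same way.

```latex
% E-step imputation for one record x = (x_o, x_m)
% \mu = (\mu_o, \mu_m): current mean estimate
% S_{oo}, S_{mo}: blocks of the current covariance estimate
\hat{x}_m = \mu_m + S_{mo}\, S_{oo}^{-1}\, (x_o - \mu_o)
```

The maximisation step then recomputes μ and S from the completed records, and the two steps repeat until the estimates stop changing. When there are more variables than records, S_oo can be singular, which is one reason a regularised variant is needed.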

21 Singular Value Decomposition Imputation (SVDI) Singular Value Decomposition (SVD) is a linear algebra procedure that, given an m×n matrix M, computes M = UΣV*, where U is an m×m unitary matrix, Σ is an m×n diagonal matrix and V* is an n×n unitary matrix The k rows from V* with the highest corresponding values in Σ are used to fit a linear regression model for the record with missing values Imputation is done from a linear combination of the top rows of V* using the coefficients of the regression SVD can only be performed on a complete matrix, so it is necessary to bootstrap this method with another imputation algorithm

22 Other methods Bayesian Principal Component Analysis (BPCA), Local Least Squares Imputation (LLSI), Event Covering (EC), Support Vector Machine Imputation (SVMI), Expectation Maximisation Imputation (EMI), Singular Value Decomposition Imputation (SVDI) Details and references of these methods in http://sci2s.ugr.es/MVDM/index.php

23 And what is the global picture? In terms of attributes – Methods that treat each attribute separately – Methods that take decisions from the whole record – Methods that consider a subset of attributes In terms of instances – Imputation based on the complete instance set – Imputation based on a subset of similar records Methods that decompose the dataset and take decisions in a different space

24 Which method is the best? The literature is full of comparisons of methods – Impact of imputation of missing values on classification error for discrete data – A Study on the Use of Imputation Methods for Experimentation with Radial Basis Function Network Classifiers Handling Missing Attribute Values: The good synergy between RBFs and EventCovering method – Missing value estimation for DNA microarray gene expression data: Local least squares imputation

25 Resources List of web sites dedicated to missing values – http://sci2s.ugr.es/MVDM/index.php#four Bibliography on missing values – http://sci2s.ugr.es/MVDM/biblio.php Implementations of the methods described in this lecture are available in the KEEL package

26 Questions?

