# Missing values problem in Data Mining

Missing values problem in Data Mining
Jelena Stojanovic 03/20/2014

Outline
- Missing data problem
  - Missing values in attributes
  - Missing values in target variable
  - Missingness mechanisms
- Approaches to missing values
  - Eliminate data objects
  - Estimate missing values
  - Handling the missing value during analysis
- Experimental analysis
- Conclusion

Missing data problem
Real datasets suffer from many serious data quality problems: incomplete, redundant, inconsistent and noisy data all reduce the performance of data mining algorithms. Missing data are data values that should be present in the data set but, for various reasons, are absent. They are a very common anomaly in large data sets and an issue in almost every real dataset. The causes vary: the high cost of measuring some variables, failure of sensors, reluctance of respondents to answer certain questions, or an ill-designed questionnaire.

Missing values in datasets
The missing data problem arises when values for one or more variables are missing from recorded observations.

Missing values in attributes (independent variables)

Missing labels

Missingness mechanism
Missing data randomness can be divided into three classes:

1. Missing completely at random (MCAR). This is the highest level of randomness. It occurs when the probability of an instance (case) having a missing value for an attribute does not depend on either the known values or the missing data. At this level of randomness, any missing data treatment method can be applied without risk of introducing bias into the data.
2. Missing at random (MAR). The probability of an instance having a missing value for an attribute may depend on the known values, but not on the value of the missing data itself.
3. Missing not at random (MNAR, also written NMAR for "not missing at random"). The probability of an instance having a missing value for an attribute could depend on the value of that attribute.
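The three classes can also be written compactly as conditions on the missingness distribution. The notation here is an assumption introduced for illustration and does not appear in the slides: R is the missingness indicator, and the data X are split into an observed part X_obs and a missing part X_mis.

```latex
\begin{align*}
\text{MCAR:}\quad & P(R \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}) = P(R) \\
\text{MAR:}\quad  & P(R \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}) = P(R \mid X_{\mathrm{obs}}) \\
\text{MNAR:}\quad & P(R \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}) \neq P(R \mid X_{\mathrm{obs}})
  \quad \text{(missingness still depends on } X_{\mathrm{mis}}\text{)}
\end{align*}
```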

Missing Completely at Random - MCAR
The term "Missing Completely at Random" refers to data where the missingness mechanism does not depend on the variable of interest, or on any other variable observed in the dataset. Equivalently, the probability of an instance (case) having a missing value for an attribute does not depend on either the known values or the missing data. This is the highest level of randomness, and any missing data treatment method can be applied without risk of introducing bias into the data.

An example is respondents deciding whether to reveal their income level based on a coin flip. This type of missing data is very rarely found in practice; when it does occur, the simplest method is to ignore the cases with missing values.

MCAR (continued)
Estimate E(X) from partially observed data:

- Partially observed data: X* = [0, 1, m, m, 1, 1, m, 0, 0, m, …], where m marks a missing value. E(X) = ?
- True data: X = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1, …], with E(X) = 0.5
- Missingness indicator: Rx = [0, 0, 1, 1, 0, 0, 1, 0, 0, 1, …]
- If MCAR: the observed entries of X* give E(X) = 3/6 = 0.5
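The estimate above can be reproduced with a few lines of Python, and a small simulation suggests why the same shortcut fails once missingness depends on the value itself. This is a minimal sketch: the sample size, the missingness probabilities and the helper name `observed_mean` are assumptions chosen for illustration, not part of the presentation.

```python
import random

# The slide's example: None plays the role of "m" (missing).
x_star = [0, 1, None, None, 1, 1, None, 0, 0, None]
observed = [v for v in x_star if v is not None]
slide_estimate = sum(observed) / len(observed)  # 3 ones out of 6 observed values


def observed_mean(values, missing_prob):
    """Mean of the values that survive a missingness mechanism.

    missing_prob(v) is the probability that value v is recorded as missing.
    """
    kept = [v for v in values if random.random() >= missing_prob(v)]
    return sum(kept) / len(kept)


random.seed(0)
# Hypothetical binary variable with E(X) = 0.5.
x = [random.randint(0, 1) for _ in range(100_000)]
true_mean = sum(x) / len(x)

# MCAR: every value is missing with the same probability.
mcar_mean = observed_mean(x, lambda v: 0.4)

# MNAR: ones go missing far more often than zeros.
mnar_mean = observed_mean(x, lambda v: 0.7 if v == 1 else 0.1)

print(slide_estimate, true_mean, mcar_mean, mnar_mean)
```

Under the MCAR mechanism the observed mean stays close to the true mean, while under the assumed MNAR mechanism it is pulled toward zero, because ones are under-represented among the observed values.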

Missing At Random - MAR
Data are missing at random when the probability of an instance having a missing value for an attribute may depend on the known values, but not on the value of the missing data itself. Put differently, missingness must not depend on the value of Yi after controlling for another variable. Missingness can only be explained by variables that are fully observed; partially observed variables cannot be responsible for missingness in others, which is an unrealistic assumption in many cases.

Example: if women in the population are more likely not to reveal their age, the percentage of missing age values will be higher among female individuals.

Missing Not At Random - MNAR
Data are MNAR when they are neither MCAR nor MAR: the missingness mechanism depends on the actual value of the missing data, or on another partially observed variable. In other words, the probability of an instance having a missing value for an attribute could depend on the value of that attribute. The same class is also called NMAR (not missing at random).

Handling MNAR data is a very difficult task. The only way to obtain an estimate of the parameters is to model the missingness: we need to write a model for the missing data and then integrate it into a more complex model for estimating the missing values. This is easier said than done.

Missing data consequences
They can significantly bias the outcome of research studies. Response profiles of non-respondents and respondents can be significantly different from each other. Performing the analysis using only complete cases and ignoring the cases with missing values can reduce the sample size thereby substantially reducing estimation efficiency. Many of the algorithms and statistical techniques are generally tailored to draw inferences from complete datasets. It may be difficult or even inappropriate to apply these algorithms and statistical techniques on incomplete datasets.

Handling missing values
In general, methods to handle missing values belong either to sequential methods (preprocessing methods) or to parallel methods (methods in which missing attribute values are taken into account during the main process of acquiring knowledge).

Existing approaches:
- Eliminate data objects or attributes
- Estimate missing values
- Handle the missing values during analysis

Eliminate data objects
Complete case analysis: eliminate objects with missing values (listwise deletion).
- It is a simple and effective strategy.
- However, even partially specified objects contain some information, which is lost.
- If many objects have missing values, reliable analysis can be difficult or impossible.
- Unless data are missing completely at random, listwise deletion can bias the outcome.

Eliminating data attributes
Eliminate attributes that have missing values.
- This must be done carefully: these attributes may be critical for the analysis.
- A milder option is to delete only the instances and/or attributes with high levels of missing data.
- Before deleting any attribute, it is necessary to evaluate its relevance to the analysis. Relevant attributes should be kept even if they have a high degree of missing values.
- Both methods, complete case analysis and discarding instances and/or attributes, should be applied only if the missing data are MCAR, because missing data that are not MCAR have non-random elements that can bias the results.

Estimate Missing Values
Most common / mean value: missing data can sometimes be estimated reliably using the values of the remaining cases or attributes:
- replacing a missing attribute value by the most common value of that attribute,
- replacing a missing attribute value by the mean, for numerical attributes.
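Both replacements can be sketched in plain Python. This is an illustrative sketch only: the function name `impute_column`, the use of `None` to mark a missing value, and the toy columns are assumptions, not something taken from the presentation.

```python
from collections import Counter
from statistics import mean


def impute_column(values, numeric):
    """Fill None entries with the mean (numeric) or the most common value."""
    known = [v for v in values if v is not None]
    if numeric:
        fill = mean(known)
    else:
        fill = Counter(known).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


# Hypothetical columns with missing entries.
ages = [20, None, 40, 30, None]
colors = ["red", "blue", None, "red"]

imputed_ages = impute_column(ages, numeric=True)
imputed_colors = impute_column(colors, numeric=False)
print(imputed_ages)
print(imputed_colors)
```

Every missing entry in a column receives the same value, so this kind of imputation shrinks the variance of the attribute; that is one reason more elaborate imputation methods are considered.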

Imputation
Imputation is a term that denotes a procedure that replaces the missing values in a data set by some plausible values. In most cases, the attributes of a data set are not independent of each other, so missing values can be determined by identifying relationships among attributes. Options include:
- assigning to a missing attribute value the corresponding value taken from the closest case,
- replacing a missing attribute value by a new value, computed from a new data set in which the original attribute is considered as the decision.

Machine learning algorithms are commonly used for this strategy:
- Unstructured (decision trees, Naive Bayes, k-nearest neighbours…)
- Structured (Hidden Markov Models, Conditional Random Fields, Structured SVM…)

Some of these methods are more accurate but more computationally expensive, so different situations require different solutions. One advantage of imputation is that the missing data treatment is independent of the learning algorithm used, which allows the user to select the most suitable imputation method for each situation.

Imputation- nearest neighbor
The k-nearest neighbour (k-NN) algorithm can be used to estimate and substitute missing data. The main benefits of this approach are:

(i) k-NN can predict both qualitative attributes (the most frequent value among the k nearest neighbours) and quantitative attributes (the mean among the k nearest neighbours);
(ii) there is no need to create a predictive model for each attribute with missing data. In fact, k-NN does not create explicit models (like a decision tree or a set of rules), since the data set itself is used as a "lazy" model. Thus, the algorithm can easily be adapted to work with any attribute as the class, just by modifying the attributes considered in the distance metric. This approach can also easily treat examples with multiple missing values.

The main drawback is that whenever k-NN looks for the most similar instances, it searches through the whole data set. This limitation can be critical for KDD, since one of the main objectives of this research area is the analysis of large databases. Several works in the literature aim to solve this limitation; one method is to create a reduced training set for k-NN composed only of prototypical examples.
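A minimal sketch of k-NN imputation for a single attribute is shown below. It makes simplifying assumptions that are not in the presentation: missing values are marked with `None`, only one column has gaps, all other columns are numeric and fully observed, and Euclidean distance is used. The name `knn_impute` is hypothetical.

```python
import math
from collections import Counter


def knn_impute(rows, target, k, categorical=False):
    """Fill None in column `target` using the k nearest complete rows.

    Distance is Euclidean over the remaining columns, which this sketch
    assumes to be numeric and fully observed.
    """
    others = [j for j in range(len(rows[0])) if j != target]
    donors = [r for r in rows if r[target] is not None]
    result = []
    for r in rows:
        if r[target] is not None:
            result.append(list(r))
            continue
        # Lazy model: every incomplete row triggers a scan of all donors.
        nearest = sorted(
            donors,
            key=lambda d: math.dist([r[j] for j in others],
                                    [d[j] for j in others]),
        )[:k]
        votes = [d[target] for d in nearest]
        if categorical:
            fill = Counter(votes).most_common(1)[0][0]  # most frequent value
        else:
            fill = sum(votes) / len(votes)              # mean of neighbours
        filled = list(r)
        filled[target] = fill
        result.append(filled)
    return result


# Hypothetical data: the last row is missing its third attribute.
data = [
    [1.0, 1.0, 10.0],
    [1.1, 0.9, 12.0],
    [5.0, 5.0, 50.0],
    [5.2, 4.8, 54.0],
    [1.05, 1.0, None],
]
completed = knn_impute(data, target=2, k=2)
print(completed[4])
```

The full scan over `donors` for every incomplete row is exactly the cost that the drawback above refers to.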

Handling the Missing Value During Analysis
Missing attribute values are taken into account during the main process of acquiring knowledge. Some examples:
- Clustering: similarity between objects is calculated using only the attributes that do not have missing values. The similarity is then only an approximation, but unless the total number of attributes is small or the number of missing values is high, the degree of inaccuracy may not matter much.
- C4.5: while the decision tree is generated, cases with missing attribute values are split into fractions, and these fractions are added to the new case subsets.
- CART: introduced the method of surrogate splits to handle missing attribute values.
- Rule-based induction algorithms: in a modification of the LEM2 (Learning from Examples Module, version 2) rule induction algorithm, rules are induced from the original data set, with missing attribute values considered to be "do not care" conditions or lost values.
- Statistics: pairwise deletion is used to evaluate statistical parameters from the available information. To compute the covariance of variables X and Y, all cases or observations in which both X and Y are observed are used, regardless of whether other variables in the dataset have missing values.
- CRFs: the effect of missing-label instances on the labeled data is marginalized out, thus utilizing the information of all observations and preserving the observed graph structure.
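The pairwise deletion rule for the covariance can be sketched as follows. The helper name `pairwise_cov`, the `None` marker for missing values and the toy variables are assumptions made for illustration.

```python
def pairwise_cov(x, y):
    """Sample covariance over all cases where both x and y are observed."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in pairs) / (n - 1)


# Hypothetical variables; z has gaps that are irrelevant to cov(x, y).
x = [1.0, 2.0, None, 4.0, 5.0]
y = [2.0, 4.0, 6.0, None, 10.0]
z = [None, 1.0, 1.0, 1.0, None]

cov_xy = pairwise_cov(x, y)  # uses cases 0, 1 and 4

# For comparison, listwise deletion keeps only fully observed cases.
complete_cases = [
    row for row in zip(x, y, z) if all(v is not None for v in row)
]
print(cov_xy, len(complete_cases))
```

In this toy data, pairwise deletion still has three cases for the covariance of `x` and `y`, whereas listwise deletion would leave a single complete case.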

Internal missing data strategy used by C4.5
C4.5 uses a probabilistic approach to handle missing data. It can handle missing values in any attribute, except the class attribute, for both training and test sets.

Given a training set T, C4.5 finds a suitable test, based on a single attribute, that has one or more mutually exclusive outcomes O1, O2, ..., On. T is partitioned into subsets T1, T2, ..., Tn (a multiple split), where Ti contains all the instances in T that satisfy the test with outcome Oi. The same algorithm is applied to each subset Ti until a stopping criterion is met.

C4.5 uses the information gain ratio as the evaluation measure to choose a good test to partition the instances. If there are missing values in an attribute X, C4.5 uses the subset with all known values of X to calculate the information gain. Once a test based on an attribute X is chosen, C4.5 uses a probabilistic approach to partition the instances with missing values in X.

Internal missing data strategy used by C4.5
When an instance in T with a known value is assigned to a subset Ti, the probability of that instance belonging to Ti is 1 and the probability of it belonging to any other subset is 0. When the value is not known, only a weaker probabilistic statement can be made. C4.5 therefore associates with each instance in Ti a weight representing the probability of that instance belonging to Ti:
- If the instance has a known value and satisfies the test with outcome Oi, it is assigned to Ti with weight 1.
- If the instance has an unknown value, it is assigned to all partitions, with a different weight for each one. The weight for partition Ti is the probability that the instance belongs to Ti. This probability is estimated as the sum of the weights of the instances in T known to satisfy the test with outcome Oi, divided by the sum of the weights of the cases in T with known values of attribute X.
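The weighting scheme can be sketched as a single partitioning step. This is a simplified illustration, not Quinlan's implementation: the data layout (a dictionary of attributes plus a weight), the function name `split_with_weights` and the toy attribute `outlook` are assumptions. The sketch also multiplies the estimated probability by the weight the instance already carries, which is an assumption about how fractions are passed down the tree.

```python
from collections import defaultdict


def split_with_weights(instances, attr):
    """Partition weighted instances on `attr` in the style described above.

    Each instance is a pair (attributes, weight). A known value sends the
    instance to one subset with its weight unchanged; a missing value
    (None) sends it to every subset with a fractional weight.
    """
    # Sum of weights of instances known to satisfy each outcome.
    known_weight = defaultdict(float)
    for attrs, w in instances:
        if attrs[attr] is not None:
            known_weight[attrs[attr]] += w
    total_known = sum(known_weight.values())

    subsets = defaultdict(list)
    for attrs, w in instances:
        if attrs[attr] is not None:
            subsets[attrs[attr]].append((attrs, w))
        else:
            for outcome, kw in known_weight.items():
                # Estimated probability of this outcome: kw / total_known.
                subsets[outcome].append((attrs, w * kw / total_known))
    return dict(subsets)


# Hypothetical training set: three "sunny", one "rain", one unknown.
training = [
    ({"outlook": "sunny"}, 1.0),
    ({"outlook": "sunny"}, 1.0),
    ({"outlook": "sunny"}, 1.0),
    ({"outlook": "rain"}, 1.0),
    ({"outlook": None}, 1.0),
]
parts = split_with_weights(training, "outlook")
sunny_weight = sum(w for _, w in parts["sunny"])
rain_weight = sum(w for _, w in parts["rain"])
print(sunny_weight, rain_weight)
```

The instance with the unknown value contributes weight 3/4 to the "sunny" subset and 1/4 to the "rain" subset, so the total weight of the training set is preserved.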

Experimental Analysis*
Using error rates estimated by cross-validation, the study* compares the performance of:
- the k-nearest neighbour algorithm as an imputation method (10-NNI),
- the mean or mode imputation method,
- the internal algorithms used by C4.5 and CN2 to learn with missing data.

Missing values were artificially implanted, in different rates and attributes (more than 50%). The data sets come from UCI [10]: Bupa, Cmc, Pima and Breast.

*G. Batista and M.C. Monard, "An Analysis of Four Missing Data Treatment Methods for Supervised Learning," Applied Artificial Intelligence, vol. 17, pp , 2003

The main objective of the experiments conducted in this work is to evaluate the efficiency of the k-nearest neighbour algorithm as an imputation method to treat missing data, comparing its performance with the performance obtained by the internal algorithms used by C4.5 and CN2 to learn with missing data, and by the mean or mode imputation method. The four methods were analysed by inserting different percentages of missing data into different attributes of four data sets, showing promising results.

The 10-NNI method provides very good results, even for training sets with a large amount of missing data. The Breast data set provided valuable insight into the limitations of the missing data treatment methods. The first decision to be taken is whether the attribute should be treated at all. The existence of other attributes with similar information (high correlation) or similar predictive power can make the missing data imputation useless, or even harmful. Imputation can be harmful because even the most advanced imputation method is only able to approximate the actual (missing) value. The predicted values are usually more well-behaved, since they conform with the values of other attributes.

In the experiments carried out, the induced models became simpler as more attributes with missing values were inserted and as the amount of missing data increased. Missing data imputation should therefore be applied carefully, at the risk of oversimplifying the problem under study.

Comparative results for the Breast data set
Although missing data imputation with k-nearest neighbour can provide good results, there are occasions when its use should be avoided. This is illustrated by the Breast data set, which was chosen because its attributes are strongly correlated with each other. These correlations cause an interesting situation: on the one hand, the k-nearest neighbour can predict the missing values with precision; on the other hand, the inducer can decide not to use the treated attribute, replacing it by another attribute with high correlation. The results for the Breast data set are shown in Figure 4, where it can be seen that 10-NNI does not outperform the other missing data treatment methods. This scenario is interesting because 10-NNI was able to predict the missing data with higher precision than the mean or mode imputation. As the missing values were artificially implanted into the data, the mean square error (MSE) between the predicted values and the actual ones can be measured. These errors are presented in Table 3.
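The evaluation idea, hiding known values and measuring how far the imputed values fall from them, can be sketched for mean imputation as follows. This does not reproduce the study's protocol or its numbers: the function name `mean_imputation_mse`, the toy data, the hiding rate and the seed are all assumptions.

```python
import random


def mean_imputation_mse(values, rate, seed=0):
    """Hide a fraction `rate` of the values, impute them with the mean of
    the remaining ones, and return the MSE against the hidden truth."""
    rng = random.Random(seed)
    n_hidden = max(1, int(rate * len(values)))
    hidden = set(rng.sample(range(len(values)), n_hidden))
    known = [v for i, v in enumerate(values) if i not in hidden]
    fill = sum(known) / len(known)
    return sum((values[i] - fill) ** 2 for i in hidden) / n_hidden


# Hypothetical attributes: one constant, one spread out.
constant = [5.0] * 20
varied = [float(i) for i in range(20)]

mse_constant = mean_imputation_mse(constant, rate=0.3)
mse_varied = mean_imputation_mse(varied, rate=0.3)
print(mse_constant, mse_varied)
```

The same harness could be pointed at any other imputation method by swapping the line that computes `fill`, which would allow an MSE comparison between methods on artificially hidden values.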

Comparative results for the Bupa data set
Considering the results shown in Figure 1, it can be observed that, for the Bupa data set, the performance of 10-NNI is superior to the performance of the C4.5 and CN2 internal algorithms and of the mean imputation. Furthermore, the C4.5 internal algorithm is competitive with 10-NNI only when missing values were inserted into attributes 2, 4 and 5. The mean or mode imputation obtained good results for the CN2 inducer when missing values were inserted into attributes 2, 4 and 5.

Comparative results for the Cmc data set
Figure 2 shows the comparative results for the Cmc data set. The performance of 10-NNI is in most cases superior to the performance obtained without missing data treatment, for both C4.5 and CN2. The performance of 10-NNI is also superior or, in a few cases, competitive with the performance of the mean or mode imputation method. In fact, the mean or mode imputation method is competitive with 10-NNI only when missing values were inserted into attributes 0 and 3, and into attributes 0, 1 and 3, using CN2 as the inducer.

Comparative results for the Pima data set
Figure 3 shows the comparative results for the Pima data set. In this data set, the 10-NNI method shows a slightly superior performance compared with C4.5 without missing data treatment, and a superior performance compared with CN2 without missing data treatment. In addition, 10-NNI is superior to the mean or mode imputation when missing data were inserted into attribute 1, for both inducers.

Conclusion
- Missing data is a huge data quality problem.
- There is a vast variety of causes of missingness.
- In general, there is no best, universal method of handling missing values.
- Different types of missingness mechanism (MCAR, MAR, MNAR) and different datasets require different approaches to dealing with missing values.

Reasons for missing values
- Information is not collected (e.g., people decline to give their age and weight).
- Attributes may not be applicable to all cases (e.g., annual income is not applicable to children).

Thank you for your attention! Questions?

Homework problem: 1. List the types of missingness mechanisms. For each of them, state one approach you think is appropriate and briefly explain why.

Eliminate data objects or attributes
Eliminate objects with missing values (listwise deletion):
- It is a simple and effective strategy.
- However, even partially specified objects contain some information.
- If many objects have missing values, reliable analysis can be difficult or impossible.
- Unless data are missing completely at random, listwise deletion can bias the outcome.

Eliminate attributes that have missing values:
- This must be done carefully: these attributes may be critical for the analysis.

Listwise deletion and pairwise deletion are used in approximately 96% of studies in the social and behavioral sciences. Listwise deletion refers to a simple method in which cases with missing values are deleted. Pairwise deletion (or "available case" analysis) is a deletion method used for estimating pairwise relations among variables. For example, to compute the covariance of variables X and Y, all cases or observations in which both X and Y are observed are used, regardless of whether other variables in the dataset have missing values.

Estimate Missing Values
Missing data can sometimes be estimated reliably using the values of the remaining cases or attributes:
- replacing a missing attribute value by the most common value of that attribute,
- replacing a missing attribute value by the mean, for numerical attributes,
- assigning all possible values to the missing attribute value,
- assigning to a missing attribute value the corresponding value taken from the closest case,
- replacing a missing attribute value by a new value, computed from a new data set in which the original attribute is considered as the decision (imputation).

Machine learning algorithms are commonly used for this strategy:
- Unstructured (decision trees, Naive Bayes, k-nearest neighbours…)
- Structured (Hidden Markov Models, Conditional Random Fields, Structured SVM…)

Some of these methods are more accurate but more computationally expensive, so different situations require different solutions.
