Published by Tanya Keel. Modified about 1 year ago.

1
Replacing Missing Values
Jukka Parviainen
Tik Special Course in Information Technology

2
Agenda
- Motivation
- Objectives
- Meaning for the conclusions
- Origin of missing values (MV)
- Detection of missing values
- Replacing missing values
- Examples

3
References
- Pyle: Data Preparation for Data Mining, chapter 8
- Hair, Anderson, Tatham, Black: Multivariate Data Analysis
- Bishop: Neural Networks for Pattern Recognition

4
…missing values?

5
Motivation
- There are always MVs in a real data set
- MVs may have an impact on modeling; in fact, they can destroy it!
- MVs also contain information!
- Hint for the modeler: Avoid-Detect-Replace-Understand

6
“Definitions”
- Missing value: not captured in the data set because of errors in data entry, transmission, ...
- Empty value: no value exists in the population
- Outlier, out-of-range value

7
Objectives
- Replacement is controlled and understood by the modeler
- “Least harm”: introduce no “new” information into the data set
- Statistical estimation of MVs is not the primary issue; data mining (DM) is
- KISS: speed and simplicity
- PIE-I/O: training + testing + execution

8
Origin and Detection
- Missing data process
- Degree of randomness
  - nonrandom
  - missing at random
  - missing completely at random
- Detecting missing value patterns (MVP)
  - number of MVs in each variable/case
  - compare MVP to complete sets
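
A minimal sketch of the detection step, not taken from the slides: it counts MVs per variable and per case and tallies which missing value patterns occur. The toy data, the use of None as the missing marker, and all function names are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy data: each dict is one case, None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "city": "Espoo"},
    {"age": None, "income": 48000, "city": None},
    {"age": 41, "income": None, "city": "Helsinki"},
    {"age": 29, "income": 39000, "city": "Turku"},
]

def missing_per_variable(rows):
    """Number of missing values in each variable (column)."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (1 if value is None else 0)
    return counts

def missing_per_case(rows):
    """Number of missing values in each case (row)."""
    return [sum(1 for v in row.values() if v is None) for row in rows]

def missing_patterns(rows):
    """How often each pattern (tuple of missing variable names) occurs."""
    return Counter(
        tuple(sorted(name for name, v in row.items() if v is None))
        for row in rows
    )

print(missing_per_variable(rows))
print(missing_per_case(rows))
print(missing_patterns(rows))
```

The empty pattern () corresponds to the complete cases, so its count shows how much data would survive if only complete cases were used.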

9
Replacing missing values
- Randomness of MVs?
- Methods
  - Use the complete data
  - Delete variable(s)/case(s)
  - Imputation methods...
  - Model based (ML, Bayes)
  - Use robust models
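
The two deletion options in the list can be sketched as follows. This is an illustration under assumptions, not the presenter's procedure: the toy data, the function names, and the 0.5 threshold for dropping a variable are all hypothetical.

```python
# Hypothetical toy data: None marks a missing value.
rows = [
    {"temp": 1510, "speed": None, "grade": "A"},
    {"temp": None, "speed": None, "grade": "B"},
    {"temp": 1495, "speed": None, "grade": "A"},
    {"temp": 1502, "speed": 3.2, "grade": "C"},
]

def complete_cases(rows):
    """Delete cases: keep only rows without any missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]

def drop_sparse_variables(rows, max_missing_share=0.5):
    """Delete variables whose share of missing values exceeds the threshold.

    The default threshold of 0.5 is an arbitrary assumption.
    """
    n = len(rows)
    keep = [
        name for name in rows[0]
        if sum(1 for row in rows if row[name] is None) / n <= max_missing_share
    ]
    return [{name: row[name] for name in keep} for row in rows]

reduced = drop_sparse_variables(rows)
print(len(complete_cases(rows)))     # complete cases before dropping a variable
print(len(complete_cases(reduced)))  # complete cases after dropping "speed"
```

In this toy set, deleting the one mostly empty variable leaves more complete cases than deleting cases alone, which is the kind of trade-off the modeler has to weigh.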

10
Imputation methods
- Process of estimating MVs based on valid values of other variables/cases
- Techniques:
  - using distribution characteristics from all available valid values
  - replacing: case substitution, mean substitution, cold deck imputation, regression imputation
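
Two of the replacement techniques named above, mean substitution and regression imputation, can be sketched in a few lines. This is a minimal illustration under assumptions: a single numeric predictor, ordinary least squares, None as the missing marker, and hypothetical function names and data.

```python
from statistics import mean

def mean_substitute(values):
    """Replace each missing value with the mean of the valid values."""
    valid = [v for v in values if v is not None]
    fill = mean(valid)
    return [fill if v is None else v for v in values]

def regression_impute(x, y):
    """Fill missing y values from x with a least-squares line.

    The line is fitted only on cases where both x and y are present.
    A missing y whose x is also missing is left as None.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    mx = mean([a for a, _ in pairs])
    my = mean([b for _, b in pairs])
    slope = (sum((a - mx) * (b - my) for a, b in pairs)
             / sum((a - mx) ** 2 for a, _ in pairs))
    intercept = my - slope * mx
    return [
        intercept + slope * a if b is None and a is not None else b
        for a, b in zip(x, y)
    ]

# Hypothetical data for illustration only.
print(mean_substitute([2.0, None, 4.0, 6.0]))
print(regression_impute([1, 2, 3, 4], [2, 4, None, 8]))
```

Mean substitution leaves the mean of the variable unchanged but shrinks its variance, while regression imputation strengthens the apparent relationship between the variables, so neither is free of side effects with respect to the "least harm" objective.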

11
Examples
- Polls, questionnaires
  - planning is more than essential
  - human factors!
  - small amounts of data
- Data from a steel plant
  - information system
  - errors, default values
  - lots of data

12
Questions
- Do software applications help with or hide the effect of missing values? (SPSS Clementine)
- What about the execution/prediction phase of the DM process?
- What to do with alpha variables?
