Coping with Missing Data for Active Learning Automation of Biological Research
What is Missing?
In active learning the category label is missing, and we can query an oracle, mindful of cost.
What else can be missing?
– Features: we may not have enough for prediction
– Feature combinations: beyond those the classifier is able to generate automatically (e.g. XOR, ratios)
– Values of features: not all instances have values for all their features
– Feature relevance: some features are noisy or irrelevant
– Feature redundancy: e.g. high feature covariance
Reducing the Feature Space
Feature selection
– Subsample features using IG, MI, … (well studied, e.g. Yang & Pedersen ICML 1997)
– Wrapper methods (inefficient but accurate, less studied)
Feature projection (to lower dimensions)
– LDA, SVD, LSI (slow, well studied, e.g. Falluchi et al 2009)
– Kernel functions on feature sub-spaces
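As an illustration of filter-style feature selection, here is a minimal sketch that ranks discrete features by information gain (IG) and keeps the top k. The function names (`entropy`, `information_gain`, `select_top_k`) and the toy data are hypothetical, chosen only for this example; they are not taken from the cited papers.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - sum_v P(X=v) * H(Y | X=v) for a discrete feature X."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def select_top_k(columns, labels, k):
    """Return the names of the k features with the highest information gain."""
    ranked = sorted(
        columns,
        key=lambda name: information_gain(columns[name], labels),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy data: x1 determines the label, x2 carries no information.
columns = {
    "x1": ["a", "a", "b", "b"],
    "x2": ["p", "q", "p", "q"],
}
labels = ["+", "+", "-", "-"]
top = select_top_k(columns, labels, k=1)
```

On this toy data the informative feature `x1` is selected and the noise feature `x2` is dropped. A wrapper method would instead retrain the classifier on candidate feature subsets, which is why it tends to be more accurate but much slower.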
Missing Feature Values
Active learning of features
– Not as extensively studied as active instance learning (see Saar-Tsechansky et al, 2007)
– Determines which feature values to seek for given instances, or which features across the board
– Can be combined with active instance learning
But what if there is no oracle?
– Impossible to get feature values
– Too costly or too time consuming
– Do we ignore instances with missing features?
Missing Data
Feature: X1 X2 X3 X4 X5 X6 Y
Inst 11.53YRG
Inst 21.4?NRB031.1?--
Inst 3??N?1.2?+
Inst YRG880.1?+
Inst YRB65?2.2--
Inst 62.0?Y?
Inst 7?2.8YRB
Inst YRG ?
Inst 9?3.3NRG ?
How to Cope with Missing Features
ML training assumes feature completeness
– Filter out features that are mostly missing
– Filter out instances with missing features
– Impute values for missing features
– Radically change ML algorithms
When do we do each of the above?
– With lots of data and few missing features…
– With sparse training data and few missing…
– With sparse data and mostly missing features…
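The two filtering options above can be sketched as follows. This is a minimal illustration, assuming missing values are encoded as `None` and the data is a list of equal-length rows. The helper names and the 0.5 threshold are hypothetical choices, not recommendations from the slides.

```python
MISSING = None  # assumption: missing values are encoded as None

def drop_sparse_features(rows, max_missing_fraction=0.5):
    """Drop columns whose fraction of missing values exceeds the threshold.

    Returns the filtered rows and the indices of the columns that were kept.
    """
    n_rows = len(rows)
    n_cols = len(rows[0])
    keep = [
        j for j in range(n_cols)
        if sum(row[j] is MISSING for row in rows) / n_rows <= max_missing_fraction
    ]
    return [[row[j] for j in keep] for row in rows], keep

def drop_incomplete_instances(rows):
    """Keep only rows with no missing values (complete-case analysis)."""
    return [row for row in rows if all(v is not MISSING for v in row)]

# Hypothetical toy data: column 1 is mostly missing, row 2 lacks column 0.
rows = [
    [1.5, None, "Y"],
    [1.4, None, "N"],
    [None, None, "N"],
    [2.0, 3.1, "Y"],
]
filtered, kept = drop_sparse_features(rows, max_missing_fraction=0.5)
complete = drop_incomplete_instances(filtered)
```

Dropping the mostly-missing column first keeps three of the four instances. Dropping incomplete instances first would have left only one, which is why the order matters when training data is sparse.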
Missing Feature Imputation
How do we estimate missing feature values?
– Infer the mean value across all instances
– Infer the mean value in the instance's neighborhood
– Apply a classifier with the other features as input and the missing feature value as y (label)
How do we know if it makes a difference?
– Sensitivity analysis (extrema, perturbations)
– Train without the instances that have missing features vs. with those instances after imputing the missing values
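A minimal sketch of the first two estimators, assuming numeric features, missing values encoded as `None`, and all other columns fully observed. The function names, the Euclidean distance, and k = 2 are hypothetical choices for illustration only.

```python
import math

def mean_impute(rows, col):
    """Replace None in column `col` with the mean of the observed values."""
    observed = [r[col] for r in rows if r[col] is not None]
    mean = sum(observed) / len(observed)
    return [
        r[:col] + [mean if r[col] is None else r[col]] + r[col + 1:]
        for r in rows
    ]

def knn_impute(rows, col, k=2):
    """Replace None in column `col` with the mean over the k nearest rows
    (Euclidean distance on the other columns) that have the value observed."""
    donors = [r for r in rows if r[col] is not None]

    def dist(a, b):
        return math.sqrt(
            sum((a[j] - b[j]) ** 2 for j in range(len(a)) if j != col)
        )

    out = []
    for r in rows:
        if r[col] is None:
            nearest = sorted(donors, key=lambda d: dist(r, d))[:k]
            value = sum(d[col] for d in nearest) / len(nearest)
            out.append(r[:col] + [value] + r[col + 1:])
        else:
            out.append(list(r))
    return out

# Hypothetical toy data: two clusters; the last row is missing column 1.
rows = [
    [1.0, 10.0],
    [1.1, 12.0],
    [5.0, 50.0],
    [5.2, 54.0],
    [1.05, None],
]
global_fill = mean_impute(rows, col=1)[4][1]
local_fill = knn_impute(rows, col=1, k=2)[4][1]
```

Here the global mean fills in 31.5, a value that matches neither cluster, while the neighborhood mean fills in 11.0. Comparing downstream results under the two fills is one simple form of the sensitivity analysis mentioned above.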
More on Missing Values
Missing Completely at Random (MCAR)
– It is generally impossible to prove MCAR or MAR
Missing at Random (MAR)
– Statisticians assume MAR as default
Missing values that depend on observables
– Imputation via classification/regression
Missing values that depend on unobservables
Missing depending on the value itself
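To make the distinction concrete, the following sketch simulates both mechanisms on synthetic data. Everything here is an assumption made for illustration: the linear model y = 2x + noise, the drop probabilities, and the helper names. It compares the mean of the observed y values with the true mean, then applies regression imputation from the observed x.

```python
import random

random.seed(0)
n = 5000
x = [random.random() for _ in range(n)]
y = [2.0 * xi + random.gauss(0.0, 0.1) for xi in x]
true_mean = sum(y) / n

# MCAR: each y is dropped with the same probability, regardless of any value.
mcar_mask = [random.random() < 0.5 for _ in range(n)]
# Missingness depends only on the observed x: large x means y is usually lost.
mar_mask = [random.random() < (0.9 if xi > 0.5 else 0.1) for xi in x]

def observed_mean(values, missing):
    """Mean over the entries that are not flagged as missing."""
    kept = [v for v, m in zip(values, missing) if not m]
    return sum(kept) / len(kept)

def regression_impute(x, y, missing):
    """Fit y = a + b*x on observed pairs, then fill in the missing y values."""
    obs = [(xi, yi) for xi, yi, m in zip(x, y, missing) if not m]
    mx = sum(p[0] for p in obs) / len(obs)
    my = sum(p[1] for p in obs) / len(obs)
    b = sum((p[0] - mx) * (p[1] - my) for p in obs) / sum(
        (p[0] - mx) ** 2 for p in obs
    )
    a = my - b * mx
    return [a + b * xi if m else yi for xi, yi, m in zip(x, y, missing)]

mcar_bias = observed_mean(y, mcar_mask) - true_mean
mar_bias = observed_mean(y, mar_mask) - true_mean
imputed_bias = sum(regression_impute(x, y, mar_mask)) / n - true_mean
```

Under MCAR the observed mean stays close to the true mean. When missingness depends on x, the observed mean is pulled well below it, and regression on the observable x removes most of that bias. No comparable correction is available when missingness depends on unobservables or on the missing value itself.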
Imputation – Example [From: Fan 2008]
How to impute the missing SCL for patient #5?
– Sample mean: ( )/4 = 1.7
– By age: ( )/2 = 2.2
– By sex: 1.1
– By education: 1.3
– By race: ( )/3 = 1.9
– By ADL: ( )/2 = 1.2
Who is/are in the same “slice” with #5?
ID Age Sex Edu. Race SCL ADL Pain Comorb.
170F16W
F16W
M12B
F21W
M21W243
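The slice-based estimates above can be sketched as a group mean: average the observed values over the records that share one attribute with the target record. The records below are hypothetical and are not the values from the original slide, and `slice_mean` is an illustrative helper name.

```python
def slice_mean(records, target, attribute, field):
    """Mean of `field` over the other records that share `attribute` with
    `target` and have `field` observed; None if the slice is empty."""
    values = [
        r[field]
        for r in records
        if r is not target
        and r[field] is not None
        and r[attribute] == target[attribute]
    ]
    return sum(values) / len(values) if values else None

# Hypothetical records; NOT the values from the original slide.
patients = [
    {"id": 1, "sex": "F", "race": "W", "scl": 2.0},
    {"id": 2, "sex": "F", "race": "W", "scl": 1.0},
    {"id": 3, "sex": "M", "race": "B", "scl": 3.0},
    {"id": 4, "sex": "F", "race": "B", "scl": 2.0},
    {"id": 5, "sex": "M", "race": "W", "scl": None},
]
target = patients[4]

observed = [p["scl"] for p in patients if p["scl"] is not None]
sample_mean = sum(observed) / len(observed)
by_sex = slice_mean(patients, target, "sex", "scl")
by_race = slice_mean(patients, target, "race", "scl")
```

Each choice of slice gives a different imputed value for the same patient, so the choice of conditioning attribute is itself a modeling decision.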
Further Reading
Saar-Tsechansky & Provost /fulltext.pdf
Yang, Y., Pedersen, J.O. A Comparative Study on Feature Selection in Text Categorization. ICML 1997, pp
Gelman chapter: df
Applications in biomed: Lavori, P., R. Dawson and D. Shera (1995) “A Multiple Imputation Strategy for Clinical Trials with Truncation of Patient Data.” Statistics in Medicine 14: