Coping with Missing Data for Active Learning 02-750 Automation of Biological Research


1 Coping with Missing Data for Active Learning Automation of Biological Research

2 What is Missing?
In active learning the category label is missing, and we can query an oracle, mindful of cost.
What else can be missing?
– Features: we may not have enough for prediction
– Feature combinations: beyond those the classifier is able to generate automatically (e.g. XOR, ratios)
– Values of features: not all instances have values for all their features
– Feature relevance: some features are noisy or irrelevant
– Feature redundancy: e.g. high feature covariance

3 Reducing the Feature Space
Feature selection
– Subsample features using information gain (IG), mutual information (MI), …: well studied, e.g. Yang & Pedersen ICML 1997
– Wrapper methods: inefficient but accurate, less studied
Feature projection (to lower dimensions)
– LDA, SVD, LSI: slow, well studied, e.g. Falluchi et al 2009
– Kernel functions on feature sub-spaces
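As an illustration of the filter-style selection mentioned on this slide, the following is a minimal sketch of ranking discrete features by information gain. The function names, the toy columns `x1`/`x2`, and the labels are hypothetical and are not taken from the slides or from Yang & Pedersen.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - sum_v P(X=v) * H(Y | X=v) for a discrete feature X."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def select_top_k(columns, labels, k):
    """Rank features (dict: name -> list of values) by IG and keep the k best."""
    ranked = sorted(
        columns,
        key=lambda name: information_gain(columns[name], labels),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy data: x1 determines the label, x2 carries no information.
columns = {
    "x1": ["a", "a", "b", "b"],
    "x2": ["p", "q", "p", "q"],
}
labels = ["+", "+", "-", "-"]
top = select_top_k(columns, labels, k=1)
```

Because each feature is scored on its own, this kind of filter cannot detect feature combinations such as XOR; that is one reason wrapper methods can be more accurate despite their cost.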

4 Missing Feature Values
Active learning of features
– Not as extensively studied as active instance learning (see Saar-Tsechansky et al, 2007)
– Determines which feature values to seek for given instances, or which features across the board
– Can be combined with active instance learning
But what if there is no oracle?
– Impossible to get feature values
– Too costly or too time consuming
– Do we ignore instances with missing features?

5 Missing Data
Feature: X1 X2 X3 X4 X5 X6 Y
Inst 11.53YRG
Inst 21.4?NRB031.1?--
Inst 3??N?1.2?+
Inst YRG880.1?+
Inst YRB65?2.2--
Inst 62.0?Y?
Inst 7?2.8YRB
Inst YRG ?
Inst 9?3.3NRG ?

6 How to Cope with Missing Features
ML training assumes feature completeness
– Filter out features that are mostly missing
– Filter out instances with missing features
– Impute values for missing features
– Radically change ML algorithms
When do we do each of the above?
– With lots of data and few missing features…
– With sparse training data and few missing…
– With sparse data and mostly missing features…
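The two filtering options on this slide can be sketched as follows. This is only an illustration: the `None` marker for a missing value, the 0.5 threshold, and the toy rows are assumptions, not values from the slides.

```python
MISSING = None  # assumed marker for a missing feature value


def missing_fraction(rows, j):
    """Fraction of instances whose j-th feature is missing."""
    return sum(1 for r in rows if r[j] is MISSING) / len(rows)


def drop_sparse_features(rows, max_missing=0.5):
    """Keep only the columns whose missing fraction is at most max_missing."""
    n_cols = len(rows[0])
    keep = [j for j in range(n_cols) if missing_fraction(rows, j) <= max_missing]
    return [[r[j] for j in keep] for r in rows], keep


def drop_incomplete_instances(rows):
    """Keep only the instances with no missing feature values."""
    return [r for r in rows if all(v is not MISSING for v in r)]


# Hypothetical data: the second feature is missing in 3 of 4 instances.
rows = [
    [1.5, None, "Y"],
    [1.4, None, "N"],
    [None, None, "N"],
    [2.0, 3.1, "Y"],
]
filtered, kept = drop_sparse_features(rows, max_missing=0.5)  # kept == [0, 2]
complete = drop_incomplete_instances(filtered)                # 3 instances survive
```

On this toy data, dropping incomplete instances first would leave a single row, while dropping the mostly missing feature first leaves three. With sparse training data, the order of the two filters matters.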

7 Missing Feature Imputation
How do we estimate missing feature values?
– Infer the mean value across all instances
– Infer the mean value in the neighborhood
– Apply a classifier with the other features as input and the missing feature value as y (label)
How do we know if it makes a difference?
– Sensitivity analysis (extrema, perturbations)
– Train without the instances with missing features vs. with instances whose missing features have imputed values
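The first two estimators on this slide (global mean and neighborhood mean) can be sketched as below. The helper names, the choice of a root-mean-square distance over shared observed features, and the toy rows are all assumptions made for illustration; the sketch also assumes at least one instance has the target feature observed.

```python
import math


def mean_impute(rows, j):
    """Replace missing values (None) in column j with the column mean."""
    observed = [r[j] for r in rows if r[j] is not None]
    mean = sum(observed) / len(observed)
    return [r[:j] + [mean if r[j] is None else r[j]] + r[j + 1:] for r in rows]


def knn_impute(rows, j, k=2):
    """Replace missing values in column j with the mean of column j over the
    k nearest donor rows; distance uses features observed in both rows."""

    def distance(a, b):
        shared = [
            (x, y)
            for i, (x, y) in enumerate(zip(a, b))
            if i != j and x is not None and y is not None
        ]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    donors = [r for r in rows if r[j] is not None]
    out = []
    for r in rows:
        if r[j] is None:
            nearest = sorted(donors, key=lambda d: distance(r, d))[:k]
            value = sum(d[j] for d in nearest) / len(nearest)
            r = r[:j] + [value] + r[j + 1:]
        out.append(r)
    return out


# Hypothetical data with two clusters; the last row is missing feature 1.
rows = [
    [1.0, 10.0],
    [1.1, 12.0],
    [5.0, 50.0],
    [5.2, 54.0],
    [1.05, None],
]
global_fill = mean_impute(rows, 1)[-1][1]     # 31.5, the mean over all donors
local_fill = knn_impute(rows, 1, k=2)[-1][1]  # 11.0, the mean of the two nearest
```

The gap between the two fills is a small instance of the sensitivity question raised on the slide: if the downstream classifier changes its predictions between the two imputed versions, the choice of imputation matters.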

8 More on Missing Values
Missing Completely at Random (MCAR)
– It is generally impossible to prove MCAR or MAR
Missing at Random (MAR)
– Statisticians assume MAR as default
Missing values that depend on observables
– Imputation via classification/regression
Missing values that depend on unobservables
Missing depending on the value itself
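To make the distinction between these mechanisms concrete, here is a small simulation sketch. The variables `age` and `score`, the thresholds, and the deterministic drop rules are hypothetical simplifications (real MAR missingness is usually probabilistic); they only illustrate what the missingness is allowed to depend on.

```python
import random


def mcar(values, p, rng):
    """Missing completely at random: each value is dropped with probability p,
    independent of any variable."""
    return [None if rng.random() < p else v for v in values]


def mar(values, observed, drop_when):
    """Missingness depends only on another, fully observed variable."""
    return [None if drop_when(o) else v for v, o in zip(values, observed)]


def mnar(values, drop_when):
    """Missingness depends on the (unseen) value itself."""
    return [None if drop_when(v) else v for v in values]


def observed_mean(values):
    """Mean over the values that are not missing."""
    seen = [v for v in values if v is not None]
    return sum(seen) / len(seen)


# Hypothetical data: score tends to grow with age.
age = [25, 30, 35, 60, 65, 70]
score = [1.0, 1.2, 1.1, 2.0, 2.2, 2.1]

rng = random.Random(0)
score_mcar = mcar(score, p=0.3, rng=rng)
score_mar = mar(score, age, drop_when=lambda a: a >= 60)
score_mnar = mnar(score, drop_when=lambda s: s > 2.0)

true_mean = observed_mean(score)      # 1.6
mar_mean = observed_mean(score_mar)   # 1.1: biased unless we condition on age
mnar_mean = observed_mean(score_mnar) # 1.325: biased, and age alone cannot fix it
```

In the observables case, a regression of `score` on `age` has something to work with, which is why imputation via classification or regression is paired with that case on the slide. When missingness depends on the value itself, the observed data alone do not reveal the bias.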

9 Imputation – Example [From: Fan 2008]
How to impute the missing SCL for patient #5?
– Sample mean: ( )/4 = 1.7
– By age: ( )/2 = 2.2
– By sex: 1.1
– By education: 1.3
– By race: ( )/3 = 1.9
– By ADL: ( )/2 = 1.2
Who is/are in the same “slice” with #5?
ID Age Sex Edu. Race SCL ADL Pain Comorb.
170F16W F16W M12B F21W M21W243
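The "slice" idea on this slide (averaging the target over the records that share one attribute value with the incomplete record) can be sketched as follows. The records below are hypothetical and do not reproduce the table from Fan 2008; the function name `slice_mean` is likewise an assumption.

```python
def slice_mean(records, target, key, value):
    """Mean of `target` over records whose `key` equals `value` and whose
    target is observed; returns None if the slice has no observed values."""
    vals = [
        r[target]
        for r in records
        if r[key] == value and r[target] is not None
    ]
    return sum(vals) / len(vals) if vals else None


# Hypothetical records; patient 5 is missing the target "scl".
records = [
    {"id": 1, "sex": "F", "race": "W", "scl": 1.0},
    {"id": 2, "sex": "F", "race": "W", "scl": 2.0},
    {"id": 3, "sex": "M", "race": "B", "scl": 3.0},
    {"id": 4, "sex": "F", "race": "B", "scl": 2.0},
    {"id": 5, "sex": "M", "race": "W", "scl": None},
]
patient = records[4]

observed = [r["scl"] for r in records if r["scl"] is not None]
overall = sum(observed) / len(observed)                         # 2.0
by_sex = slice_mean(records, "scl", "sex", patient["sex"])      # 3.0 (patient 3 only)
by_race = slice_mean(records, "scl", "race", patient["race"])   # 1.5 (patients 1 and 2)
```

As on the slide, each slice yields a different imputed value, and narrow slices rest on very few donors, so the choice of slicing attribute is itself a modeling decision.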

10 Further Reading
– Saar-Tsechansky & Provost 8723/fulltext.pdf
– Yang, Y., Pedersen J.O. A Comparative Study on Feature Selection in Text Categorization. ICML 1997, pp
– Gelman chapter: df
– Applications in biomed: Lavori, P., R. Dawson and D. Shera (1995) “A Multiple Imputation Strategy for Clinical Trials with Truncation of Patient Data.” Statistics in Medicine 14:

