Wei Fan, IBM T.J.Watson Research


Sample Selection Bias – Covariate Shift: Problems, Solutions, and Applications Wei Fan, IBM T.J.Watson Research Masashi Sugiyama, Tokyo Institute of Technology Updated PPT is available: http://www.weifan.info/tutorial.htm

Overview of Sample Selection Bias Problem

A Toy Example Two classes: red and green red: f2>f1 green: f2<=f1
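The toy problem can be sketched in a few lines (Python; the uniform sampling over the unit square is an assumption for illustration — the slides do not specify the input distribution):

```python
import random

# Toy problem: two classes in the (f1, f2) plane.
# "red" when f2 > f1, "green" when f2 <= f1.
random.seed(0)

def toy_sample(n):
    """Draw n labeled points uniformly from the unit square."""
    data = []
    for _ in range(n):
        f1, f2 = random.random(), random.random()
        data.append((f1, f2, "red" if f2 > f1 else "green"))
    return data

points = toy_sample(1000)
```

The decision boundary is the diagonal f2 = f1; any bias in where the points are sampled changes how well a learner can recover it.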

Unbiased and Biased Samples Not so-biased sampling Biased sampling

Effect on Learning Some techniques are more sensitive to bias than others. One important question: how to reduce the effect of sample selection bias? (Example accuracies, unbiased vs. biased training: 96.405% vs. 92.7%; 96.9% vs. 95.9%; 97.1% vs. 92.1%.)

Ubiquitous Loan approval, drug screening, weather forecasting, ad campaigns, fraud detection, user profiling, biomedical informatics, intrusion detection, insurance, etc. Normally, banks only have data on their own customers. “Late payment, default” models are computed using their own data. New customers may not completely follow the same distribution.

The Yale Face Database B Face Recognition Sample selection bias: training samples are taken inside a research lab, where there are few women. Test samples: in the real world, the men-women ratio is almost 50-50.

Brain-Computer Interface (BCI) Control computers by EEG signals: Input: EEG signals Output: Left or Right Figure provided by Fraunhofer FIRST, Berlin, Germany

Training Imagine left/right-hand movement following the letter on the screen. Movie provided by Fraunhofer FIRST, Berlin, Germany

Testing: Playing Games “Brain-Pong” Movie provided by Fraunhofer FIRST, Berlin, Germany

Non-Stationarity in EEG Features Different mental conditions (attention, sleepiness etc.) between training and test phases may change the EEG signals. Bandpower differences between training and test phases Features extracted from brain activity during training and test phases Figures provided by Fraunhofer FIRST, Berlin, Germany

Robot Control by Reinforcement Learning Let the robot learn how to autonomously move without explicit supervision. Khepera Robot

Rewards Robot moves autonomously = goes forward without hitting wall Give robot rewards: Go forward: Positive reward Hit wall: Negative reward Goal: Learn the control policy that maximizes future rewards

Example After learning:

Policy Iteration and Covariate Shift Updating the policy corresponds to changing the input distribution! Evaluate control policy Improve control policy

Different Types of Sample Selection Bias

Bias as Distribution Think of “sampling an example (x,y) into the training data” as an event denoted by random variable s s=1: example (x,y) is sampled into the training data s=0: example (x,y) is not sampled. Think of bias as a conditional probability of “s=1” dependent on x and y P(s=1|x,y) : the probability for (x,y) to be sampled into the training data, conditional on the example’s feature vector x and class label y.

Categorization (Zadrozny’04, Fan et al’05, Fan and Davidson’07) No Sample Selection Bias P(s=1|x,y) = P(s=1) Feature Bias/Covariate Shift P(s=1|x,y) = P(s=1|x) Class Bias P(s=1|x,y) = P(s=1|y) Complete Bias: no further reduction
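A minimal simulation of feature bias, assuming a made-up concept (y = 1 when x > 0.5) and a made-up acceptance probability P(s=1|x) = 1 − x:

```python
import random

# Feature bias / covariate shift: the sampling indicator s depends on x
# only, i.e. P(s=1|x,y) = P(s=1|x). Here P(s=1|x) = 1 - x, so examples
# with large x rarely enter the training set; P(y|x) itself is untouched.
random.seed(1)

def label(x):                  # true concept, independent of the sampling
    return int(x > 0.5)

population = [random.random() for _ in range(50_000)]
biased_train = [(x, label(x)) for x in population
                if random.random() < 1.0 - x]    # accept with P(s=1|x)

frac_pos_pop = sum(map(label, population)) / len(population)
frac_pos_train = sum(y for _, y in biased_train) / len(biased_train)
# Positives (x > 0.5) are under-represented in the biased training set,
# even though the conditional P(y|x) is exactly the same as before.
```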

Bias for a Training Set How P(s=1|x,y) is computed: practically, for a given training set D, P(s=1|x,y) = 1 if (x,y) is sampled into D, and P(s=1|x,y) = 0 otherwise. Alternatively, consider datasets of the same size that can be sampled “exhaustively” from the universe of examples.

Are Realistic Datasets Biased? Most datasets are biased, since it is unlikely to sample each and every feature vector. For most problems, there is at least feature bias: P(s=1|x,y) = P(s=1|x)

Effect on Learning Learning algorithms estimate the “true conditional probability”: true probability P(y|x), such as P(fraud|x); estimated probability P(y|x,M), where M is the model built; conditional probability in the biased data P(y|x,s=1). Key issue: P(y|x,s=1) = P(y|x)?

Bias Resolutions

Heckman’s Two-Step Approach Estimate one’s donation amount if one does donate. An accurate estimate cannot be obtained by a regression using only data from donors. First step: a probit model to estimate the probability to donate. Second step: a regression model to estimate the donation, correcting the expected error under a Gaussian assumption.

Covariate Shift or Feature Bias Covariate shift: the input distribution changes, but the functional relation remains unchanged. (There is no chance for generalization, however, if training and test samples have nothing in common.)

Example of Covariate Shift (Weak) extrapolation: Predict output values outside training region Training samples Test samples

Covariate Shift Adaptation To illustrate the effect of covariate shift, let’s focus on linear extrapolation Training samples Test samples True function Learned function

Generalization Error = Bias + Variance : expectation over noise

Model Specification A model is said to be correctly specified if it can represent the true function. In practice, our model may not be correct. Therefore, we need a theory for misspecified models!

Ordinary Least-Squares (OLS) If model is correct: OLS minimizes bias asymptotically If model is misspecified: OLS does not minimize bias even asymptotically. We want to reduce bias!

Law of Large Numbers The sample average converges to the population mean. We want to estimate the expectation over test input points using only training input points.

Key Trick: Importance-Weighted Average Importance: Ratio of test and training input densities Importance-weighted average: (cf. importance sampling)
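The trick can be written directly; here both input densities are assumed to be known one-dimensional Gaussians (purely for illustration), so the importance ratio is exact:

```python
import math
import random

# Estimate E_test[f(x)] from training samples only, by weighting each
# sample with the importance w(x) = p_test(x) / p_train(x).
random.seed(0)

def gauss_pdf(x, mu, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu_train, mu_test = 0.0, 1.0
f = lambda x: x                          # target: the test-distribution mean

xs = [random.gauss(mu_train, 1.0) for _ in range(100_000)]
plain = sum(f(x) for x in xs) / len(xs)                      # ~ E_train[f]
weighted = sum(f(x) * gauss_pdf(x, mu_test) / gauss_pdf(x, mu_train)
               for x in xs) / len(xs)                        # ~ E_test[f]
```

The plain average converges to the training mean (0), while the importance-weighted average converges to the test mean (1), despite using training samples only — exactly the importance-sampling identity the slide refers to.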

Importance-Weighted LS (Shimodaira, JSPI2000) (the training input density is assumed strictly positive) Even for misspecified models, IWLS minimizes the bias asymptotically. We need to estimate the importance in practice.

Use of Unlabeled Samples: Importance Estimation Assumption: we have training inputs and test inputs. Naïve approach: estimate the training and test densities separately, and take the ratio of the density estimates. This does not work well, since density estimation is hard in high dimensions.

Vapnik’s Principle When solving a problem of interest, one should not solve a more difficult problem as an intermediate step (e.g., support vector machines). Knowing the two densities implies knowing their ratio, but not vice versa: directly estimating the ratio is easier than estimating the densities!

Modeling the Importance Function Use a linear importance model; the test density is then approximated by the importance model times the training density. Idea: learn the parameters so that the model well approximates the true importance.

Kullback-Leibler Divergence The KL divergence from the true test density to its approximation decomposes into a constant term (independent of the model) and a relevant term to be maximized.

Learning the Importance Function Thus, maximize the relevant term (the objective function). Since the approximated test density must be a density, the importance-weighted training density must be non-negative and integrate to one (the constraint).

KLIEP (Kullback-Leibler Importance Estimation Procedure) (Sugiyama et al., NIPS2007) Convexity: unique global solution is available Sparse solution: prediction is fast!
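A heavily simplified sketch of the KLIEP idea (plain projected gradient ascent instead of the paper's procedure; the kernel width, learning rate, and number of centres are made-up choices):

```python
import numpy as np

# Model the importance w(x) as a non-negative mixture of Gaussian kernels
# centred on test points; maximise the average log-importance over the
# test sample; renormalise so the importance-weighted training density
# averages to one (the KLIEP constraint).
rng = np.random.default_rng(0)
x_tr = rng.normal(0.0, 1.0, 300)        # training inputs ~ p_train
x_te = rng.normal(0.5, 1.0, 300)        # test inputs     ~ p_test

centers = x_te[:50]                     # kernel centres on test points
kern = lambda x: np.exp(-(x[:, None] - centers[None, :]) ** 2 / 2.0)
K_te, K_tr = kern(x_te), kern(x_tr)

alpha = np.ones(len(centers))
for _ in range(2000):
    w_te = K_te @ alpha
    alpha += 1e-4 * (K_te.T @ (1.0 / w_te))   # gradient of sum(log w_te)
    alpha = np.clip(alpha, 0.0, None)         # non-negativity
    alpha /= np.mean(K_tr @ alpha)            # constraint: mean weight = 1

w_tr = K_tr @ alpha                     # estimated importance weights
```

The learned weights average to one over the training sample and grow towards the test mean (x = 0.5), mimicking the true ratio p_test/p_train; the actual KLIEP objective is convex in alpha, which is what guarantees the unique global solution mentioned above.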

Examples

Experiments: Setup Input distributions: standard Gaussian with Training: mean (0,0,…,0) Test: mean (1,0,…,0) Kernel density estimation (KDE): Separately estimate training and test input densities. Gaussian kernel width is chosen by likelihood cross-validation. KLIEP Gaussian kernel width is chosen by likelihood cross-validation

Experimental Results (normalized MSE vs. input dimensionality) KDE: error increases as the dimension grows. KLIEP: error remains small even for large dimensions.

Ensemble Methods (Fan and Davidson’07) Average the estimated class probabilities weighted by the model posterior (integration over model space). Averaging removes model uncertainty.

How to Use Them Estimate the “joint probability” P(x,y) instead of just the conditional probability, i.e., P(x,y) = P(y|x)P(x). This makes no difference with a single model, but it does with multiple models.

Examples of How This Works P1(+|x) = 0.8 and P2(+|x) = 0.4 P1(-|x) = 0.2 and P2(-|x) = 0.6 With model averaging, P(+|x) = (0.8 + 0.4) / 2 = 0.6 and P(-|x) = (0.2 + 0.6) / 2 = 0.4, so the prediction will be +.

But if the two models’ P(x) estimates are 0.05 and 0.4, then P(+,x) = (0.8 × 0.05 + 0.4 × 0.4) / 2 = 0.10 and P(-,x) = (0.2 × 0.05 + 0.6 × 0.4) / 2 = 0.125. Recall that with plain model averaging P(+|x) = 0.6 and P(-|x) = 0.4, so the prediction is +; now the prediction will be – instead. Key idea: unlabeled examples can be used as “weights” to re-weight the models.
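The flip can be checked with the slide's numbers:

```python
# Two models' conditional estimates and their P(x) estimates (from the slides).
p1_pos, p2_pos = 0.8, 0.4      # P1(+|x), P2(+|x)
p1_x, p2_x = 0.05, 0.4         # P1(x),  P2(x)

# Plain averaging of conditionals:
avg_pos = (p1_pos + p2_pos) / 2                    # 0.6 -> predict "+"

# Joint-probability averaging, P(x, y) = P(y|x) P(x):
joint_pos = (p1_pos * p1_x + p2_pos * p2_x) / 2            # (0.04 + 0.16)/2 = 0.10
joint_neg = ((1 - p1_pos) * p1_x + (1 - p2_pos) * p2_x) / 2  # (0.01 + 0.24)/2 = 0.125
# joint_neg > joint_pos: the prediction flips to "-".
```

The second model's conditional estimate dominates because its P(x) estimate (0.4) gives it eight times the weight of the first model at this x.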

Structure Discovery (Ren et al’08) Discover the structure of the original dataset, then re-balance it structurally to obtain a corrected dataset.

Active Learning The quality of learned functions depends on the training input location. Goal: optimize the training input location. (Figure: good vs. poor input location, target vs. learned function.)

Challenges The generalization error is unknown and needs to be estimated. In experiment design, we do not yet have the training output values, so we cannot use, e.g., cross-validation, which requires them. Only the training input positions can be used in generalization error estimation!

Agnostic Setup (Fedorov 1972; Cohn et al., JAIR1996) The model is not correct in practice. Then OLS is not consistent, and the standard “experiment design” method does not work!

Bias Reduction by Importance-Weighted LS (IWLS) (Wiens JSPI2001; Kanamori & Shimodaira JSPI2003; Sugiyama JMLR2006) The use of IWLS mitigates the problem of inconsistency in the agnostic setup. The importance is known in the active learning setup, since the training input distribution is designed by us!

Model Selection and Testing

Model Selection Choice of models is crucial: We want to determine the model so that generalization error is minimized: Polynomial of order 1 Polynomial of order 2 Polynomial of order 3

Generalization Error Estimation The generalization error is not accessible, since the target function is unknown. Instead, we use a generalization error estimate (plotted against model complexity).

Cross-Validation Divide the training samples into k groups. Train a learning machine with k-1 groups and validate the trained machine on the rest. Repeat this for all combinations and output the mean validation error. CV is almost unbiased without covariate shift, but it is heavily biased under covariate shift!

Importance-Weighted CV (IWCV) (Zadrozny ICML2004; Sugiyama et al., JMLR2007) When testing the classifier in the CV process, we also importance-weight the test error. IWCV gives almost unbiased estimates of the generalization error even under covariate shift.
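A sketch of IWCV, assuming the importance weights are already known (or estimated, e.g., by KLIEP); folds are contiguous (no shuffling) for simplicity:

```python
import numpy as np

# Importance-weighted cross-validation: ordinary k-fold CV, except each
# held-out error is multiplied by the importance w(x) of that point.
def iwcv_error(x, y, w, fit, predict, k=5):
    folds = np.array_split(np.arange(len(x)), k)
    errs = []
    for f in folds:
        tr = np.setdiff1d(np.arange(len(x)), f)
        model = fit(x[tr], y[tr])
        mistakes = (predict(model, x[f]) != y[f]).astype(float)
        errs.append(float(np.mean(w[f] * mistakes)))  # weighted 0/1 loss
    return float(np.mean(errs))

# Tiny check with a fixed-threshold "classifier": it errs on x in (0.5, 0.6].
x = np.linspace(0.0, 1.0, 200)
y = x > 0.6
fit = lambda xs, ys: 0.5                      # ignores data: threshold 0.5
predict = lambda m, xs: xs > m

uniform = iwcv_error(x, y, np.ones_like(x), fit, predict)
upweighted = iwcv_error(x, y, 1 + 9 * ((x > 0.5) & (x <= 0.6)), fit, predict)
```

With uniform weights this reduces to ordinary CV; upweighting the region where test inputs concentrate raises the estimated error exactly when the classifier fails there, which is what makes the estimate nearly unbiased under covariate shift.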

Example of IWCV IWCV gives better estimates of generalization error. Model selection by IWCV outperforms CV!

ReverseTesting (Fan and Davidson’06) Train models MA and MB on the training data with algorithms A and B, and use each to label the test data, giving labeled sets DA and DB. Train MAA and MBA (algorithms A and B on DA) and MAB and MBB (algorithms A and B on DB). Estimate the performance of MA and MB based on the order of MAA, MAB, MBA and MBB evaluated on the labeled training data.

Rule If A’s labeled test data can construct more accurate models for both algorithms A and B, evaluated on the labeled training data, then A is expected to be more accurate: if MAA > MAB and MBA > MBB, then choose A. Similarly, if MAA < MAB and MBA < MBB, then choose B. Otherwise, undecided.
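The decision rule itself is tiny; here m_xy denotes the accuracy (on the labeled training data) of algorithm X trained on test data labeled by model MY:

```python
# ReverseTesting selection rule only; the surrounding procedure that
# produces the four accuracies is described on the previous slide.
def choose_algorithm(m_aa, m_ab, m_ba, m_bb):
    """Return 'A', 'B', or None (undecided)."""
    if m_aa > m_ab and m_ba > m_bb:
        return "A"            # A's labels build better models for both
    if m_aa < m_ab and m_ba < m_bb:
        return "B"            # B's labels build better models for both
    return None               # mixed evidence: undecided
```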

Why CV won’t work? Sparse Region

Examples

Ozone Day Prediction (Zhang et al’06) Daily summary maps of two datasets from Texas Commission on Environmental Quality (TCEQ)

Challenges as a Data Mining Problem Rather skewed and relatively sparse distribution: 2500+ examples over 7 years (1998-2004); 72 continuous features with missing values. Large instance space: if the features were binary and uncorrelated, 2^72 is an astronomical number. 2% and 5% true positive ozone days for the 1-hour and 8-hour peaks respectively.

A large number of irrelevant features: only about 10 out of the 72 features are verified to be relevant; there is no information on the relevancy of the other 62. For a stochastic problem with irrelevant features Xir, where X = (Xr, Xir), P(Y|X) = P(Y|Xr) only if the data is exhaustive. Irrelevant features may introduce overfitting and change the probability distribution represented in the data: P(Y = “ozone day” | Xr, Xir) → 1, P(Y = “normal day” | Xr, Xir) → 0.

“Feature sample selection bias”: given 7 years of data and 72 continuous features, it is hard to find many days in the training data that are very similar to a day in the future; the training and testing distributions differ. This raises two closely-related challenges: how to train an accurate model, and how to effectively use the model to predict the future under a different and yet unknown distribution.

Reliable probability estimation under irrelevant features Recall that due to irrelevant features, P(Y = “ozone day” | Xr, Xir) → 1 and P(Y = “normal day” | Xr, Xir) → 0. Solution: construct multiple models and average their predictions. P(“ozone” | xr): true probability. P(“ozone” | Xr, Xir, θ): probability estimated by model θ. MSE_SingleModel: difference between “true” and “estimated”. MSE_Average: difference between “true” and the “average of many models”. One can formally show that MSE_Average ≤ MSE_SingleModel.
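The inequality MSE_Average ≤ MSE_SingleModel can be checked numerically; it is an instance of Jensen's inequality, so it holds for any collection of estimates (the true probability 0.7 and the random estimates below are made-up values):

```python
import random

# Numeric check: the squared error of the averaged estimate never exceeds
# the average of the individual squared errors (Jensen's inequality).
random.seed(0)
p_true = 0.7                                      # true P("ozone" | x_r)
estimates = [random.random() for _ in range(25)]  # 25 models' estimates

avg = sum(estimates) / len(estimates)
mse_average = (avg - p_true) ** 2
mse_single = sum((p - p_true) ** 2 for p in estimates) / len(estimates)
# mse_average <= mse_single holds for any choice of p_true and estimates.
```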

Prediction with feature sample selection bias: a CV-based procedure for decision threshold selection. Run 10-fold CV on the training set with the chosen algorithm; for each fold, record the estimated probability P(y = “ozone day” | x, θ) together with the true label (e.g., 7/1/98 0.1316 Normal; 7/2/98 0.6245 Ozone; 7/3/98 0.5944 Ozone). Concatenate the ten “probability-true label” files, sort by probability, and select the decision threshold VE from the resulting precision-recall plot.

Addressing Data Mining Challenges Prediction with feature sample selection bias: future prediction based on the selected decision threshold. Train θ on the whole training set; classify a future day as an “ozone day” if P(Y = “ozone day” | X, θ) ≥ VE.

Results

KDD/Netflix Cup’07 Task 1 (Liu and Kou’07)

Task 1: who rated what in 2006. Given a list of 100,000 pairs of users and movies, predict for each pair the probability that the user rated the movie in 2006. Result: they were the close runner-up (No. 3 out of 39 teams). Challenges: a huge amount of data, so how to sample the data so that learning algorithms can be applied is critical; complex affecting factors, such as decreasing interest in old movies and the growing tendency of Netflix users to watch (and review) more movies.

NETFLIX data generation process (timeline figure, 1998-2006): periods with no user or movie arrival, user arrival, and movie arrival; ~17K movies; training data end in 2005; Task 1 predicts ratings in 2006; the qualifier dataset contains ~3M pairs.

Task 1: Effective Sampling Strategies Sample the movie-user pairs for “existing” users and “existing” movies, taking 2004-2005 as the training set and 4Q 2005 as the development set. The probability of picking a movie is proportional to the number of ratings that movie has received; the same strategy is used for users.
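A sketch of popularity-proportional sampling (the movie names and rating counts are made up):

```python
import random
from collections import Counter

# Pick movies with probability proportional to their rating counts, as in
# the sampling strategy described above.
random.seed(0)
rating_counts = {"MovieA": 1100, "MovieB": 1000, "MovieC": 700, "MovieD": 50}
movies = list(rating_counts)
weights = [rating_counts[m] for m in movies]

sample = random.choices(movies, weights=weights, k=10_000)
freq = Counter(sample)
# Heavily rated movies dominate the sample: MovieA appears far more often
# than the rarely rated MovieD.
```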

Learning Algorithms Single classifiers: logistic regression, ridge regression, decision trees, support vector machines. Naïve ensemble: combining sub-classifiers built on different types of features with pre-set weights. Ensemble classifiers: combining sub-classifiers with weights learned from the development set.

Brain-Computer Interface (BCI) Control computers by brain signals: Input: EEG signals Output: Left or Right

BCI Results A table of per-subject, per-trial classification errors without and with covariate shift adaptation, together with the KL divergence from the training to the test input distributions (e.g., subject 1, trial 1: 9.3% without vs. 10.0% with, KL 0.76). When KL is large, covariate shift adaptation tends to improve accuracy; when KL is small, there is no difference.

Robot Control by Reinforcement Learning Swing-up inverted pendulum: Swing-up the pole by controlling the car. Reward:

Results Covariate shift adaptation vs. existing methods (a) and (b).

Demo: Proposed Method

Wafer Alignment in Semiconductor Exposure Apparatus Recent silicon wafers have a layer structure. Circuit patterns are exposed multiple times. Exact alignment of wafers is very important.

Markers on Wafer Wafer alignment process: measure the marker locations printed on the wafer, then shift and rotate the wafer to minimize the gap. For speeding up, reducing the number of markers to measure is very important: an active learning problem!

Non-linear Alignment Model When the gap is only shift and rotation, a linear model is exact. However, non-linear factors exist, e.g., warp, biased characteristics of the measurement apparatus, and different temperature conditions. Exactly modeling the non-linear factors is very difficult in practice: an agnostic setup!

Experimental Results (Sugiyama & Nakajima ECML-PKDD2008) 20 markers (out of 38) are chosen by experiment design methods; the gaps of all markers are predicted; repeated for 220 different wafers. Mean (standard deviation) of the gap prediction error: IWLS-based 2.27 (1.08), significantly better by the 5% Wilcoxon test; OLS-based 2.37 (1.15) and “outer” heuristic 2.36 (1.15), worse than the baseline; passive 2.32 (1.11). IWLS-based active learning works very well!

Conclusions

Book on Dataset Shift Quiñonero-Candela, Sugiyama, Schwaighofer & Lawrence (Eds.), Dataset Shift in Machine Learning, MIT Press, Cambridge, 2008.