1 Sample Selection Bias – Covariate Shift: Problems, Solutions, and Applications
Wei Fan, IBM T.J. Watson Research
Masashi Sugiyama, Tokyo Institute of Technology
Updated PPT is available: http://www.weifan.info/tutorial.htm
3 A Toy Example
Two classes: red and green.
red: f2 > f1
green: f2 <= f1
4 Unbiased and Biased Samples
Not-so-biased sampling vs. biased sampling.
5 Effect on Learning
Some techniques are more sensitive to bias than others. One important question: how to reduce the effect of sample selection bias?
Accuracy with unbiased vs. biased training data:
Unbiased % vs. Biased 92.7%
Unbiased 96.9% vs. Biased 95.9%
Unbiased 97.1% vs. Biased 92.1%
6 Ubiquitous
Loan approval, drug screening, weather forecasting, ad campaigns, fraud detection, user profiling, biomedical informatics, intrusion detection, insurance, etc.
Normally, banks only have data on their own customers. "Late payment, default" models are computed using their own data, but new customers may not completely follow the same distribution.
7 The Yale Face Database B
Face recognition. Sample selection bias: training samples are taken inside a research lab, where there are few women; test samples come from the real world, where the men-to-women ratio is close to 1:1.
8 Brain-Computer Interface (BCI)
Control computers by EEG signals. Input: EEG signals. Output: Left or Right.
Figure provided by Fraunhofer FIRST, Berlin, Germany
9 Training
Imagine left/right-hand movement following the letter on the screen.
Movie provided by Fraunhofer FIRST, Berlin, Germany
10 Testing: Playing Games
"Brain-Pong". Movie provided by Fraunhofer FIRST, Berlin, Germany
11 Non-Stationarity in EEG Features
Different mental conditions (attention, sleepiness, etc.) between the training and test phases may change the EEG signals: bandpower differences between training and test phases; features extracted from brain activity during training and test phases.
Figures provided by Fraunhofer FIRST, Berlin, Germany
12 Robot Control by Reinforcement Learning
Let the robot learn how to move autonomously, without explicit supervision.
Khepera Robot
13 Rewards
Robot moves autonomously = goes forward without hitting the wall.
Give the robot rewards: go forward, positive reward; hit the wall, negative reward.
Goal: learn the control policy that maximizes future rewards.
17 Bias as Distribution
Think of "sampling an example (x,y) into the training data" as an event denoted by a random variable s:
s=1: example (x,y) is sampled into the training data.
s=0: example (x,y) is not sampled.
Think of bias as a conditional probability of "s=1" dependent on x and y:
P(s=1|x,y): the probability for (x,y) to be sampled into the training data, conditional on the example's feature vector x and class label y.
18 Categorization (Zadrozny'04, Fan et al.'05, Fan and Davidson'07)
No sample selection bias: P(s=1|x,y) = P(s=1)
Feature bias / covariate shift: P(s=1|x,y) = P(s=1|x)
Class bias: P(s=1|x,y) = P(s=1|y)
Complete bias: no further reduction.
19 Bias for a Training Set
How P(s=1|x,y) is computed. Practically, for a given training set D:
P(s=1|x,y) = 1 if (x,y) is sampled into D;
P(s=1|x,y) = 0 otherwise.
Alternatively, consider that a set of the same size as D could be sampled "exhaustively" from the universe of examples.
20 Realistic Datasets are Biased
Most datasets are biased: it is unlikely to sample each and every feature vector. For most problems, there is at least feature bias: P(s=1|x,y) = P(s=1|x).
21 Effect on Learning
Learning algorithms estimate the "true conditional probability":
True probability: P(y|x), such as P(fraud|x).
Estimated probability: P(y|x,M), where M is the model built.
Conditional probability in the biased data: P(y|x,s=1).
Key issue: does P(y|x,s=1) = P(y|x)?
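Under feature bias the answer is yes: when s depends only on x, conditioning on s=1 leaves P(y|x) unchanged. A minimal numeric check (all probabilities below are made up for illustration):

```python
# Feature bias: P(s=1|x,y) = P(s=1|x), i.e. selection ignores the label y.
# Then P(y|x, s=1) = P(y|x). Verify with made-up numbers.
P_x = {0: 0.5, 1: 0.5}              # marginal over a binary feature
P_y1_given_x = {0: 0.2, 1: 0.8}     # true conditional P(y=1|x)
P_s_given_x = {0: 0.9, 1: 0.1}      # selection depends on x only

for x in (0, 1):
    for y in (0, 1):
        p_y = P_y1_given_x[x] if y == 1 else 1 - P_y1_given_x[x]
        joint = P_x[x] * p_y * P_s_given_x[x]          # P(x, y, s=1)
        p_y_given_x_s = joint / (P_x[x] * P_s_given_x[x])  # P(y|x, s=1)
        assert abs(p_y_given_x_s - p_y) < 1e-12
print("feature bias leaves P(y|x) intact")
```

With class or complete bias this identity breaks, which is why those cases are harder to correct.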
23 Heckman's Two-Step Approach
Estimate one's donation amount if one does donate. An accurate estimate cannot be obtained by a regression using only data from donors.
First step: a probit model to estimate the probability to donate.
Second step: a regression model to estimate the donation, correcting the expected error under a Gaussian assumption.
24 Covariate Shift or Feature Bias
However, there is no chance for generalization if training and test samples have nothing in common.
Covariate shift: the input distribution changes, but the functional relation remains unchanged.
25 Example of Covariate Shift
(Weak) extrapolation: predict output values outside the training region.
Training samples vs. test samples.
26 Covariate Shift Adaptation
To illustrate the effect of covariate shift, let's focus on linear extrapolation.
Training samples, test samples, true function, learned function.
28 Model Specification
A model is said to be correctly specified if it can represent the learning target function. In practice, our model may not be correct. Therefore, we need a theory for misspecified models!
29 Ordinary Least-Squares (OLS)
If the model is correct: OLS minimizes the bias asymptotically.
If the model is misspecified: OLS does not minimize the bias even asymptotically. We want to reduce the bias!
30 Law of Large Numbers
The sample average converges to the population mean. We want to estimate the expectation over test input points using only training input points.
31 Key Trick: Importance-Weighted Average
Importance: the ratio of test and training input densities, w(x) = p_te(x) / p_tr(x).
Importance-weighted average (cf. importance sampling).
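A minimal sketch of this trick (the Gaussians and the target function f(x) = x are our illustrative choices): with training inputs from N(0,1) and test inputs from N(0.5,1), the plain training average misses the test-distribution mean, while the importance-weighted average recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)
x_tr = rng.normal(0.0, 1.0, 200_000)       # training inputs ~ p_tr = N(0,1)

# importance w(x) = p_te(x)/p_tr(x) for p_te = N(0.5,1), p_tr = N(0,1);
# the normalizing constants cancel, leaving a simple exponential ratio
w = np.exp(-(x_tr - 0.5) ** 2 / 2 + x_tr ** 2 / 2)

f = x_tr                                   # estimate E_te[f(x)] with f(x) = x
plain_avg = f.mean()                       # converges to the training mean, 0
iw_avg = (w * f).sum() / w.sum()           # self-normalized importance weighting
# iw_avg converges to the test-distribution mean, 0.5
```

Self-normalizing (dividing by the sum of weights rather than the sample size) is a common stabilization when the importance is only known up to a constant.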
32 Importance-Weighted LS (Shimodaira, JSPI2000)
The importance is assumed strictly positive. Even for misspecified models, IWLS minimizes the bias asymptotically. In practice, we need to estimate the importance.
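A sketch of IWLS on a deliberately misspecified model (the quadratic truth, the noise level, and the two Gaussians are our illustrative choices; the importance ratio is taken as known analytically):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x_tr = rng.normal(0.0, 1.0, n)                 # p_tr = N(0, 1)
y_tr = x_tr ** 2 + rng.normal(0, 0.1, n)       # true function is quadratic
x_te = rng.normal(1.5, 0.5, n)                 # p_te = N(1.5, 0.25)
y_te = x_te ** 2 + rng.normal(0, 0.1, n)

def gauss(x, mu, s):
    return np.exp(-(x - mu) ** 2 / (2 * s * s)) / (s * np.sqrt(2 * np.pi))

w = gauss(x_tr, 1.5, 0.5) / gauss(x_tr, 0.0, 1.0)   # importance p_te/p_tr

X = np.column_stack([np.ones(n), x_tr])             # misspecified linear model
theta_ols = np.linalg.lstsq(X, y_tr, rcond=None)[0]
# weighted least squares: theta = (X^T W X)^{-1} X^T W y
Xw = X * w[:, None]
theta_iwls = np.linalg.solve(Xw.T @ X, Xw.T @ y_tr)

X_te = np.column_stack([np.ones(n), x_te])
mse_ols = np.mean((X_te @ theta_ols - y_te) ** 2)
mse_iwls = np.mean((X_te @ theta_iwls - y_te) ** 2)
# IWLS fits the line where the test inputs live, so mse_iwls << mse_ols
```

The weighting pulls the least-squares fit toward the region where test inputs are dense, which is exactly what matters under covariate shift.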
33 Use of Unlabeled Samples: Importance Estimation
Assumption: we have training inputs and test inputs.
Naïve approach: estimate p_tr(x) and p_te(x) separately, and take the ratio of the density estimates. This does not work well, since density estimation is hard in high dimensions.
34 Vapnik's Principle
When solving a problem, more difficult problems shouldn't be solved as intermediate steps (e.g., support vector machines).
Knowing the densities implies knowing the ratio, but not vice versa: directly estimating the ratio is easier than estimating the densities!
35 Modeling the Importance Function
Use a linear importance model: w(x) = Σ_l α_l φ_l(x), with basis functions φ_l.
The test density is then approximated by p̂_te(x) = w(x) p_tr(x).
Idea: learn the parameters α so that p̂_te(x) well approximates p_te(x).
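A rough sketch in the spirit of KLIEP (this is our simplified gradient version; the kernel width, step size, number of centers, and the two Gaussians are arbitrary illustrative choices): maximize the average log importance on test inputs, keeping the coefficients non-negative and the weighted training density normalized.

```python
import numpy as np

rng = np.random.default_rng(0)
x_tr = rng.normal(0.0, 1.0, 500)     # training inputs ~ N(0, 1)
x_te = rng.normal(1.0, 1.0, 500)     # test inputs ~ N(1, 1)

centers = x_te[:50]                  # Gaussian kernels centered at test points
sigma = 1.0
def phi(x):
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

Phi_te, Phi_tr = phi(x_te), phi(x_tr)

alpha = np.ones(len(centers))
for _ in range(500):
    w_te = Phi_te @ alpha
    grad = (Phi_te / w_te[:, None]).mean(axis=0)   # gradient of mean log w
    alpha += 1e-2 * grad                           # ascent step
    alpha = np.maximum(alpha, 0.0)                 # non-negativity
    alpha /= (Phi_tr @ alpha).mean()               # enforce mean_tr w = 1

def w_hat(x):
    return phi(np.atleast_1d(x)) @ alpha
# w_hat should be larger where test inputs are denser than training inputs
```

Only input samples are needed, so unlabeled test data suffices; no density is ever estimated explicitly.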
40 Experiments: Setup
Input distributions: standard Gaussians with training mean (0,0,…,0) and test mean (1,0,…,0).
Kernel density estimation (KDE): separately estimate the training and test input densities; the Gaussian kernel width is chosen by likelihood cross-validation.
KLIEP: the Gaussian kernel width is chosen by likelihood cross-validation.
41 Experimental Results
Normalized MSE vs. input dimension:
KDE: the error increases as the dimension grows.
KLIEP: the error remains small even for large dimensions.
42 Ensemble Methods (Fan and Davidson'07)
Averaging of estimated class probabilities weighted by the posterior: integration over model space of the class probability with posterior weighting. Removes model uncertainty by averaging.
43 How to Use Them
Estimate the "joint probability" P(x,y) instead of just the conditional probability: P(x,y) = P(y|x)P(x).
This makes no difference when using one model, but it does with multiple models.
44 Examples of How This Works
P1(+|x) = 0.8 and P2(+|x) = 0.4; P1(-|x) = 0.2 and P2(-|x) = 0.6.
With model averaging: P(+|x) = (0.8 + 0.4) / 2 = 0.6 and P(-|x) = (0.2 + 0.6) / 2 = 0.4. The prediction will be +.
45 But suppose the two P(x) models give probabilities 0.05 and 0.4 for this x. Recall that with model averaging, P(+|x) = 0.6 and P(-|x) = 0.4, so the prediction is +. Weighting the models by P(x), however, the prediction becomes – instead of +.
Key idea: unlabeled examples can be used as "weights" to re-weight the models.
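The arithmetic on these two slides can be sketched as follows (using the P(x) values 0.05 and 0.4 quoted above):

```python
# Two models' class-probability estimates at the same x
p1_plus, p2_plus = 0.8, 0.4

# Plain model averaging ignores P(x)
avg_plus = (p1_plus + p2_plus) / 2           # 0.6 -> predict "+"

# Joint-probability weighting: weight each model by its P(x) estimate
px1, px2 = 0.05, 0.4
w1, w2 = px1 / (px1 + px2), px2 / (px1 + px2)
weighted_plus = w1 * p1_plus + w2 * p2_plus  # 0.2 / 0.45 ~ 0.444 -> predict "-"
```

Model 2 claims far more mass at this x, so its lower P(+|x) dominates and flips the decision.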
47 Active Learning
The quality of the learned function depends on the training input locations. Goal: optimize the training input locations.
Good input locations vs. poor input locations; target function vs. learned function.
48 Challenges
The generalization error is unknown and needs to be estimated. In experiment design, we do not have training output values yet, so we cannot use, e.g., cross-validation, which requires them. Only the training input positions can be used in generalization error estimation!
49 Agnostic Setup (Fedorov 1972; Cohn et al., JAIR1996)
The model is not correct in practice. Then OLS is not consistent, and the standard "experiment design" method does not work!
50 Bias Reduction by Importance-Weighted LS (IWLS) (Wiens JSPI2001; Kanamori & Shimodaira JSPI2003; Sugiyama JMLR2006)
The use of IWLS mitigates the inconsistency problem under the agnostic setup. The importance is known in the active learning setup, since the training input distribution is designed by us!
52 Model Selection
The choice of model is crucial: we want to determine the model so that the generalization error is minimized.
Polynomial of order 1, order 2, or order 3?
53 Generalization Error Estimation
The generalization error is not accessible, since the target function is unknown. Instead, we use a generalization error estimate as a function of model complexity.
54 Cross-Validation
Divide the training samples into k groups. Train a learning machine with k-1 groups, and validate the trained machine on the remaining group. Repeat this for all combinations and output the mean validation error.
CV is almost unbiased without covariate shift, but it is heavily biased under covariate shift!
55 Importance-Weighted CV (IWCV) (Zadrozny ICML2004; Sugiyama et al., JMLR2007)
When testing the classifier in the CV process, we also importance-weight the test error. IWCV gives almost unbiased estimates of the generalization error even under covariate shift.
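A sketch of IWCV for a misspecified linear model (the distributions, noise level, and fold count are our illustrative choices, and the importance is taken as known): ordinary k-fold CV estimates the error on the training distribution, while IWCV re-weights each held-out squared error by w(x) and tracks the test-distribution error.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 2000, 5
x = rng.normal(0.0, 1.0, n)                 # training inputs ~ N(0, 1)
y = x ** 2 + rng.normal(0, 0.1, n)          # quadratic truth, linear model

def gauss(t, mu, s):
    return np.exp(-(t - mu) ** 2 / (2 * s * s)) / (s * np.sqrt(2 * np.pi))

w = gauss(x, 1.0, 0.5) / gauss(x, 0.0, 1.0) # known importance, p_te = N(1, 0.25)

def fit(xs, ys):                            # plain OLS line
    X = np.column_stack([np.ones(len(xs)), xs])
    return np.linalg.lstsq(X, ys, rcond=None)[0]

folds = np.arange(n) % k
cv_err, iwcv_err = 0.0, 0.0
for j in range(k):
    tr, va = folds != j, folds == j
    a, b = fit(x[tr], y[tr])
    sq = (a + b * x[va] - y[va]) ** 2
    cv_err += sq.mean() / k                 # ordinary CV
    iwcv_err += (w[va] * sq).mean() / k     # importance-weighted CV

# true generalization error under the test distribution, for comparison
x_new = rng.normal(1.0, 0.5, 50_000)
a, b = fit(x, y)
true_err = np.mean((a + b * x_new - (x_new ** 2 + rng.normal(0, 0.1, 50_000))) ** 2)
```

The IWCV estimate lands much closer to the true test-distribution error than ordinary CV does, which is why model selection by IWCV works under covariate shift.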
56 Example of IWCV IWCV gives better estimates of generalization error. Model selection by IWCV outperforms CV!
57 Reserve Testing (Fan and Davidson'06)
Train models MA and MB with algorithms A and B; use each to label the test data (DA and DB); re-train both algorithms on each labeled test set to obtain MAA, MAB, MBA, and MBB, evaluated on the labeled training data. Estimate the performance of MA and MB based on the order of MAA, MAB, MBA and MBB.
58 Rule
If "A's labeled test data" can construct "more accurate models" for both algorithms A and B, evaluated on the labeled training data, then A is expected to be more accurate:
If MAA > MAB and MBA > MBB, choose A.
Similarly, if MAA < MAB and MBA < MBB, choose B.
Otherwise, undecided.
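The decision rule can be written directly as a small helper (a minimal sketch; the argument order follows the slide):

```python
def reserve_testing_choice(m_aa, m_ab, m_ba, m_bb):
    """Pick between algorithms A and B from the four reserve-testing
    accuracies: m_xy = algorithm X re-trained on test data labeled by Y's
    model, evaluated on the labeled training data."""
    if m_aa > m_ab and m_ba > m_bb:   # A's labels win for both learners
        return "A"
    if m_aa < m_ab and m_ba < m_bb:   # B's labels win for both learners
        return "B"
    return "undecided"                # mixed evidence
```

For example, `reserve_testing_choice(0.9, 0.7, 0.8, 0.6)` returns "A", while mixed orderings return "undecided".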
61 Ozone Day Prediction (Zhang et al.'06)
Daily summary maps of two datasets from the Texas Commission on Environmental Quality (TCEQ).
62 Challenges as a Data Mining Problem
Rather skewed and relatively sparse distribution: 2500+ examples over 7 years; 72 continuous features with missing values.
Large instance space: if the features were binary and uncorrelated, 2^72 is an astronomical number.
2% and 5% true positive ozone days for the 1-hour and 8-hour peaks, respectively.
63 A Large Number of Irrelevant Features
Only about 10 out of the 72 features are verified to be relevant; there is no information on the relevancy of the other 62 features.
For a stochastic problem with irrelevant features Xir, where X = (Xr, Xir), P(Y|X) = P(Y|Xr) only if the data is exhaustive. The irrelevant features may introduce overfitting and change the probability distribution represented in the data:
P(Y = "ozone day" | Xr, Xir) → 1
P(Y = "normal day" | Xr, Xir) → 0
64 Training Distribution
"Feature sample selection bias": given 7 years of data and 72 continuous features, it is hard to find many days in the training data that are very similar to a day in the future.
Given these, two closely related challenges: how to train an accurate model, and how to effectively use a model to predict the future under a different and yet unknown distribution.
65 Reliable Probability Estimation under Irrelevant Features
Recall that due to irrelevant features, P(Y = "ozone day" | Xr, Xir) → 1 and P(Y = "normal day" | Xr, Xir) → 0.
Solution: construct multiple models and average their predictions.
P("ozone" | xr): the true probability. P("ozone" | Xr, Xir, θ): the probability estimated by model θ.
MSE_SingleModel: the difference between "true" and "estimated". MSE_Average: the difference between "true" and the "average of many models". One can formally show that MSE_Average ≤ MSE_SingleModel.
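The inequality MSE_Average ≤ MSE_SingleModel (taking MSE_SingleModel as the average single-model error) follows from the convexity of squared error; a quick numeric check with made-up model outputs:

```python
import random

random.seed(0)
true_p = 0.3                                   # true P("ozone" | x_r), made up
# estimates from many models, scattered by irrelevant features
models = [min(1.0, max(0.0, random.gauss(true_p, 0.2))) for _ in range(100)]

mse_single = sum((p - true_p) ** 2 for p in models) / len(models)
avg = sum(models) / len(models)
mse_average = (avg - true_p) ** 2
assert mse_average <= mse_single               # Jensen's inequality
```

Averaging cancels the model-to-model scatter introduced by the irrelevant features, keeping only the shared bias.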
66 Prediction with Feature Sample Selection Bias
A CV-based procedure for decision threshold selection: run 10-fold CV of the algorithm on the training set; for each fold, record the estimated probability P(y = "ozone day" | x, θ) together with the true label; concatenate the per-fold "probability - true label" files; plot precision and recall as a function of the decision threshold and select the threshold VE.
67 Addressing Data Mining Challenges
Prediction with feature sample selection bias: future prediction is based on the selected decision threshold. Train θ on the whole training set; for classification on future days, if P(Y = "ozone day" | X, θ) ≥ VE, predict "ozone day".
70 Task 1: Who Rated What in 2006
Given a list of 100,000 pairs of users and movies, predict for each pair the probability that the user rated the movie in 2006.
Result: close runner-up, No. 3 out of 39 teams.
Challenges: a huge amount of data, so how to sample the data so that any learning algorithm can be applied is critical; complex affecting factors, such as decreasing interest in old movies and the growing tendency of Netflix users to watch (review) more movies.
71 NETFLIX Data Generation Process
Diagram: user arrivals, movie arrivals (17K movies), and rating events over time; the Task 1 and Task 2 training data and the 3M-pair qualifier dataset are drawn from this process.
72 Task 1: Effective Sampling Strategies
Sample the movie-user pairs for "existing" users and "existing" movies from 2004-2005 as the training set, and 4Q 2005 as the development set. The probability of picking a movie was proportional to the number of ratings that movie received; the same strategy was used for users.
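Popularity-proportional sampling can be sketched with the standard library (the movie IDs and rating counts below are made up):

```python
import random

random.seed(7)
# ratings count per movie (made-up data); pick probability proportional to count
movies = {"movie_a": 9000, "movie_b": 900, "movie_c": 90}
ids = list(movies)
counts = [movies[m] for m in ids]

draws = random.choices(ids, weights=counts, k=10_000)   # weighted sampling
freq = {m: draws.count(m) for m in ids}
# heavily rated movies dominate the sample, mirroring the rating process
```

The same weighted draw over users, paired with the movie draw, yields candidate movie-user pairs.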
73 Learning Algorithms
Single classifiers: logistic regression, ridge regression, decision tree, support vector machines.
Naïve ensemble: combining sub-classifiers built on different types of features with pre-set weights.
Ensemble classifiers: combining sub-classifiers with weights learned from the development set.
74 Brain-Computer Interface (BCI)
Control computers by brain signals. Input: EEG signals. Output: Left or Right.
75 BCI Results
Table: per-subject, per-trial error rates with no adaptation vs. with adaptation, together with the KL divergence from the training to the test input distributions.
When the KL divergence is large, covariate shift adaptation tends to improve accuracy; when it is small, there is no difference.
76 Robot Control by Reinforcement Learning
Swing-up inverted pendulum: swing the pole up by controlling the cart. Reward:
79 Wafer Alignment in Semiconductor Exposure Apparatus
Recent silicon wafers have a layer structure: circuit patterns are exposed multiple times, so exact alignment of the wafers is very important.
80 Active Learning Problem!
Wafer alignment process: measure the locations of markers printed on the wafer, then shift and rotate the wafer to minimize the gap. To speed this up, reducing the number of markers to measure is very important: an active learning problem!
81 Non-linear Alignment Model
When the gap is only shift and rotation, a linear model is exact. However, non-linear factors exist, e.g., warp, biased characteristics of the measurement apparatus, and different temperature conditions. Exactly modeling the non-linear factors is very difficult in practice: an agnostic setup!
82 Experimental Results (Sugiyama & Nakajima ECML-PKDD2008)
Mean squared error of wafer position estimation. 20 markers (out of 38) are chosen by the experiment design methods; the gaps of all markers are predicted; repeated for 220 different wafers. Mean (standard deviation) of the gap prediction error:
IWLS-based: 2.27 (1.08)
OLS-based: 2.37 (1.15)
"Outer" heuristic: 2.36 (1.15)
Passive: 2.32 (1.11)
Red: significantly better by the 5% Wilcoxon test. Blue: worse than the baseline passive method. IWLS-based active learning works very well!