1
CSE 446: Ensemble Learning Winter 2012 Daniel Weld Slides adapted from Tom Dietterich, Luke Zettlemoyer, Carlos Guestrin, Nick Kushmerick, Padraig Cunningham
2
Ensembles of Classifiers
Traditional approach: use one classifier. Can one do better?
Approaches:
– Cross-validated committees
– Bagging
– Boosting
– Stacking
3
Voting
4
Ensembles of Classifiers
Assume an ensemble of 21 classifiers whose errors are independent (say 30% error each), combined by majority vote.
The probability that the majority is wrong = the area under the binomial distribution where 11 or more classifiers are in error.
With individual error 0.3, that area is 0.026 — an order of magnitude improvement!
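A quick way to reproduce the 0.026 figure: under the independence assumption the number of erring classifiers is Binomial(21, 0.3), and the majority is wrong when 11 or more err. A minimal sketch in plain Python (standard library only):

```python
from math import comb

n, p = 21, 0.3  # 21 independent classifiers, each wrong 30% of the time

# P(majority wrong) = upper tail of Binomial(n, p) at 11 or more errors
majority_wrong = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(11, n + 1))
print(f"P(majority wrong) = {majority_wrong:.4f}")  # ~0.026
```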
5
Constructing Ensembles: Cross-Validated Committees
Partition the examples into k disjoint equivalence classes.
Create k training sets:
– Each set is the union of all equivalence classes except one (holdout)
– So each set has (k-1)/k of the original training data
Now train a classifier on each set.
6
Ensemble Construction II: Bagging
Generate k sets of training examples.
For each set, draw m examples randomly with replacement from the original set of m examples.
Each training set thus contains about 63.2% of the original examples (plus duplicates).
Now train a classifier on each set.
Intuition: sampling helps the algorithm become more robust to noise/outliers in the data.
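The 63.2% figure comes from the fact that a given example misses a single draw with probability 1 − 1/m, so it misses the whole replicate with probability (1 − 1/m)^m ≈ 1/e ≈ 0.368. A minimal sketch checking this empirically (standard library only; the size m is an arbitrary illustrative choice):

```python
import random

m = 10000                      # size of the original training set (illustrative)
original = list(range(m))

# One bootstrap replicate: m examples drawn uniformly with replacement
replicate = [random.choice(original) for _ in range(m)]

coverage = len(set(replicate)) / m      # fraction of distinct originals included
print(f"distinct examples covered: {coverage:.3f}")   # about 0.632 = 1 - 1/e
```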
7
Ensemble Creation III: Boosting
Create the 1st weight distribution (uniform) over the training examples: {w_i^1}.
Create M classifiers iteratively. On iteration m:
– Train C_m by minimizing the weighted error Σ_i w_i^m · I(y_m(x_i) ≠ t_i), where y_m(x_i) is C_m's prediction and I is the 0/1 indicator function
– Modify the distribution: increase the weight of each example predicted incorrectly
– Assign a confidence to classifier C_m as a function of its error; prefer accurate classifiers
Combine the classifiers, weighted by confidence.
Effect: creates harder and harder learning problems — an optimized choice of examples.
8
Boosting Decision Stumps
(figure: an example stump on mushroom data — test a single attribute, Gills > x, and predict Poison on one branch, Edible on the other)
9
Bagging vs Boosting. Slide from T Dietterich
10
Ensemble Creation IV: Stacking
Train several base learners.
Next train a meta-learner:
– It learns when the base learners are right / wrong
– The meta-learner then arbitrates among them
Train using cross-validated committees:
– Meta-learner inputs = base learner predictions
– Training examples = the 'test set' from cross-validation
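A minimal stacking sketch along these lines, assuming scikit-learn is available; the dataset and the particular base/meta learners are illustrative choices, not ones prescribed by the slides:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
base_learners = [DecisionTreeClassifier(max_depth=3, random_state=0), GaussianNB()]

# Cross-validated predictions act as the 'test set' inputs for the meta-learner,
# so it learns when each base learner tends to be right or wrong.
meta_features = np.column_stack([
    cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    for clf in base_learners
])
meta_learner = LogisticRegression().fit(meta_features, y)

# At prediction time, refit the base learners on all data; the meta-learner arbitrates.
for clf in base_learners:
    clf.fit(X, y)

def stacked_predict(X_new):
    feats = np.column_stack([clf.predict_proba(X_new)[:, 1] for clf in base_learners])
    return meta_learner.predict(feats)
```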
11
Causes of Expected Error
Notation: true function f, hypothesis h, observed data.
Variance: how much h(x*) varies from one training set to another.
Bias: the average error of h(x*).
Noise: how much y* varies from f(x*).
12
Tradeoff
Variance: how much h(x*) varies from one training set to another.
Bias: the average error of h(x*).
Overfitting = too much variance. Underfitting = too much bias.
13
Interlude: Bias
Bias (statistical): the difference between an estimator's expectation and the true value of the parameter being estimated.
Inductive bias: the set of assumptions that the learner uses to predict outputs for inputs it has not encountered.
Bias (engineering): establishing a predetermined voltage at a point in a circuit to set an appropriate operating point.
y = w_0 + Σ_i w_i x_i
14
Bias-Variance Analysis in Regression
The true function is y = f(x) + ε, where ε is normally distributed with zero mean and standard deviation σ.
Given a set of training examples {(x_i, y_i)}, we fit a hypothesis h(x) = w·x + b to the data to minimize the squared error Σ_i [y_i – h(x_i)]².
Slide from T Dietterich
15
Example: 20 points, y = x + 2 sin(1.5x) + N(0, 0.2). Slide from T Dietterich
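A sketch of this setup, assuming the inputs are drawn uniformly from [0, 10] and that 0.2 is the noise standard deviation (both assumptions; the slide does not say). Repeating the linear fit over many 20-point training sets shows how the prediction at a fixed x* spreads (variance) and how its average misses f(x*) (bias):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x + 2 * np.sin(1.5 * x)          # true function from the slide

def sample_training_set(n=20, sigma=0.2):       # assumed noise std dev
    x = rng.uniform(0, 10, n)                   # assumed input range
    return x, f(x) + rng.normal(0, sigma, n)

x_star = 5.0
predictions = []
for _ in range(50):                             # 50 fits, 20 examples each
    x, y = sample_training_set()
    w, b = np.polyfit(x, y, deg=1)              # linear hypothesis h(x) = w*x + b
    predictions.append(w * x_star + b)

predictions = np.array(predictions)
print("mean prediction at x*:", predictions.mean())
print("variance at x*       :", predictions.var())
print("bias at x*           :", predictions.mean() - f(x_star))
```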
16
50 fits (20 examples each) Slide from T Dietterich
17
Bias-Variance Analysis
Now, given a new data point x* (with observed value y* = f(x*) + ε), we would like to understand the expected prediction error E[(y* – h(x*))²].
Slide from T Dietterich
18
Classical Statistical Analysis
Imagine that our particular training sample S is drawn from some population of possible training samples according to P(S).
Compute E_P[(y* – h(x*))²].
Decompose this into "bias", "variance", and "noise".
Slide from T Dietterich
19
Lemma
Let Z be a random variable with probability distribution P(Z), and let Z̄ = E_P[Z] be the average value of Z.
Lemma: E[(Z – Z̄)²] = E[Z²] – Z̄²
Proof: E[(Z – Z̄)²] = E[Z² – 2·Z·Z̄ + Z̄²] = E[Z²] – 2·E[Z]·Z̄ + Z̄² = E[Z²] – 2·Z̄² + Z̄² = E[Z²] – Z̄²
Corollary: E[Z²] = E[(Z – Z̄)²] + Z̄²
Slide from T Dietterich
20
Bias-Variance-Noise Decomposition
Writing h̄(x*) for E[h(x*)], and using E[y*] = f(x*):
E[(h(x*) – y*)²]
= E[h(x*)² – 2·h(x*)·y* + y*²]
= E[h(x*)²] – 2·E[h(x*)]·E[y*] + E[y*²]
= E[(h(x*) – h̄(x*))²] + h̄(x*)²   (lemma)
  – 2·h̄(x*)·f(x*)
  + E[(y* – f(x*))²] + f(x*)²   (lemma)
= E[(h(x*) – h̄(x*))²]   [variance]
  + (h̄(x*) – f(x*))²   [bias²]
  + E[(y* – f(x*))²]   [noise]
Slide from T Dietterich
21
Derivation (continued)
E[(h(x*) – y*)²]
= E[(h(x*) – h̄(x*))²] + (h̄(x*) – f(x*))² + E[(y* – f(x*))²]
= Var(h(x*)) + Bias(h(x*))² + E[ε²]
= Var(h(x*)) + Bias(h(x*))² + σ²
Expected prediction error = Variance + Bias² + Noise²
Slide from T Dietterich
22
Bias, Variance, and Noise
Variance: E[(h(x*) – h̄(x*))²] — how much h(x*) varies from one training set S to another.
Bias: h̄(x*) – f(x*) — the average error of h(x*).
Noise: E[(y* – f(x*))²] = E[ε²] = σ² — how much y* varies from f(x*).
Slide from T Dietterich
23
50 fits (20 examples each) Slide from T Dietterich
24
Bias Slide from T Dietterich
25
Variance Slide from T Dietterich
26
Noise Slide from T Dietterich
27
50 fits (20 examples each) Slide from T Dietterich
28
Distribution of predictions at x=2.0 Slide from T Dietterich
29
50 fits (20 examples each) Slide from T Dietterich
30
Distribution of predictions at x=5.0 Slide from T Dietterich
31
Polynomial Curve Fitting. Hypothesis space: polynomials; target function: sin(2πx).
32
1st Order Polynomial
33
3rd Order Polynomial
34
9th Order Polynomial
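A sketch of the three fits, assuming (as in the classic version of this example) 10 noisy samples of sin(2πx) on [0, 1] with noise standard deviation 0.2; the exact sample size and noise level on the slides may differ. Training error shrinks as the polynomial order grows, while held-out error blows up for the 9th-order fit:

```python
import numpy as np

rng = np.random.default_rng(1)
true_f = lambda x: np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 10)                         # assumed training set size
y_train = true_f(x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 1, 200)                         # clean test grid

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - true_f(x_test)) ** 2)
    print(f"degree {degree}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```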
35
Regularization
Penalize large coefficient values, e.g. by adding a term such as λ·Σ_i w_i² to the error being minimized.
Increasing λ trades variance for bias: less variance, more bias.
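A sketch of one common form of this penalty (ridge regression on a polynomial basis), solved via the regularized normal equations w = (ΦᵀΦ + λI)⁻¹Φᵀy; the specific penalty form and λ values are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)
true_f = lambda x: np.sin(2 * np.pi * x)
x = np.linspace(0, 1, 10)
y = true_f(x) + rng.normal(0, 0.2, x.size)

degree = 9
Phi = np.vander(x, degree + 1, increasing=True)     # polynomial basis [1, x, ..., x^9]

for lam in (1e-9, 1e-3, 1e-1, 10.0):
    # Ridge solution: larger lambda shrinks the coefficients (more bias, less variance)
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(degree + 1), Phi.T @ y)
    print(f"lambda = {lam:g}: max |coefficient| = {np.abs(w).max():.2f}")
```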
36
Regularization: fit with one setting of λ (figure)
38
Regularization: error vs. λ (figure)
39
Ensemble Methods
Combining many biased learners (e.g., decision stumps):
– Keeps variance low
– Can represent more expressive hypotheses, hence also lowers the error from bias
40
Measuring Bias and Variance In practice (unlike in theory), we have only ONE training set S. We can simulate multiple training sets by bootstrap replicates – S’ = {x | x is drawn at random w/ replacement from S} – And |S’| = |S|. Slide from T Dietterich
41
Procedure for Measuring Bias and Variance
Construct B bootstrap replicates of S (e.g., B = 200): S_1, …, S_B.
Apply the learning algorithm to each replicate S_b to obtain hypothesis h_b.
Let T_b = S \ S_b be the data points that do not appear in S_b (the out-of-bag points).
Compute the predicted value h_b(x) for each x in T_b.
Slide from T Dietterich
42
Estimating Bias and Variance (continued)
For each data point x, we will now have the observed value y and several predictions y_1, …, y_K.
Compute the average prediction h̄.
Estimate bias as (h̄ – y).
Estimate variance as Σ_k (y_k – h̄)² / (K – 1).
Assume noise is 0.
Slide from T Dietterich
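A sketch of this out-of-bag procedure, reusing the earlier regression example with a linear hypothesis as the learner (both are illustrative stand-ins; any learner and dataset would do):

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: x + 2 * np.sin(1.5 * x)
n, B = 200, 200                                    # one training set S, B replicates
x = rng.uniform(0, 10, n)
y = f(x) + rng.normal(0, 0.2, n)

oob_preds = [[] for _ in range(n)]                 # out-of-bag predictions per point
for _ in range(B):
    idx = rng.integers(0, n, n)                    # bootstrap replicate S_b
    oob = np.setdiff1d(np.arange(n), idx)          # T_b = S \ S_b
    w, b = np.polyfit(x[idx], y[idx], deg=1)       # hypothesis h_b
    for i in oob:
        oob_preds[i].append(w * x[i] + b)

bias_sq, var = [], []
for i in range(n):
    p = np.array(oob_preds[i])
    if p.size < 2:
        continue
    h_bar = p.mean()                               # average prediction
    bias_sq.append((h_bar - y[i]) ** 2)            # bias estimate (noise assumed 0)
    var.append(p.var(ddof=1))                      # sum_k (y_k - h_bar)^2 / (K - 1)

print("average bias^2  :", np.mean(bias_sq))
print("average variance:", np.mean(var))
```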
43
Approximations in this Procedure Bootstrap replicates are not real data We ignore the noise – If we have multiple data points with the same x value, then we can estimate the noise – We can also estimate noise by pooling y values from nearby x values (another use for Random Forest proximity measure?) Slide from T Dietterich
44
Bagging
Bagging constructs B bootstrap replicates and their corresponding hypotheses h_1, …, h_B.
It makes predictions according to y = Σ_b h_b(x) / B.
Hence, bagging's predictions are h̄(x).
Slide from T Dietterich
45
Estimated Bias and Variance of Bagging
If we estimate bias and variance using the same B bootstrap samples, we will have:
– Bias = (h̄ – y)   [same as before]
– Variance = Σ_k (h̄ – h̄)² / (K – 1) = 0
Hence, according to this approximate way of estimating variance, bagging removes the variance while leaving bias unchanged.
In reality, bagging only reduces variance and tends to slightly increase bias.
Slide from T Dietterich
46
Bias/Variance Heuristics Models that fit the data poorly have high bias: “inflexible models” such as linear regression, regression stumps Models that can fit the data very well have low bias but high variance: “flexible” models such as nearest neighbor regression, regression trees This suggests that bagging of a flexible model can reduce the variance while benefiting from the low bias Slide from T Dietterich
47
BAGGing = Bootstrap AGGregation (Breiman, 1996)
for i = 1, 2, …, K:
– T_i ← randomly select M training instances with replacement
– h_i ← learn(T_i)   [ID3, NB, kNN, neural net, …]
Now combine the h_i together with uniform voting (w_i = 1/K for all i).
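A runnable version of this loop, assuming scikit-learn and using a full decision tree (one of the listed options) as the base learner on a synthetic dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)
K, M = 25, len(X)                        # K rounds, M instances per bootstrap sample

hypotheses = []
for _ in range(K):
    idx = rng.integers(0, len(X), M)     # T_i: M instances drawn with replacement
    hypotheses.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Uniform voting (w_i = 1/K): predict the majority class over the K hypotheses
votes = np.stack([h.predict(X) for h in hypotheses])
bagged = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the bagged vote:", (bagged == y).mean())
```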
48
49
Decision tree learning algorithm; very similar to ID3
50
shades of blue/red indicate strength of vote for particular classification
52
Fighting the bias-variance tradeoff
Simple (a.k.a. weak) learners are good:
– e.g., naïve Bayes, logistic regression, decision stumps (or shallow decision trees)
– Low variance; they don't usually overfit
Simple (a.k.a. weak) learners are bad:
– High bias; they can't solve hard learning problems
Can we make weak learners always good?
– No! But often yes…
53
Boosting [Schapire, 1989]
Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote.
On each iteration t:
– Weight each training example by how incorrectly it was classified
– Learn a hypothesis h_t
– Learn a strength α_t for this hypothesis
Final classifier: a weighted vote of the hypotheses, H(x) = sign(Σ_t α_t h_t(x)).
Practically useful and theoretically interesting.
54
time = 0. Blue/red = class; size of dot = weight. Weak learner = decision stump: a horizontal or vertical line.
55
55 time = 1 this hypothesis has 15% error and so does this ensemble, since the ensemble contains just this one hypothesis
56
56 time = 2
57
57 time = 3
58
58 time = 13
59
59 time = 100
60
60 time = 300 overfitting!!
61
Learning from weighted data
Consider a weighted dataset where D(i) is the weight of the i-th training example (x_i, y_i).
Interpretations:
– The i-th training example counts as if it occurred D(i) times
– If I were to "resample" the data, I would get more samples of "heavier" data points
Now always do weighted calculations. For example, for the Naïve Bayes MLE, redefine Count(Y = y) to be the weighted count Σ_j D(j)·1[y_j = y].
Setting D(j) = 1 (or any constant value) for all j recreates the unweighted case.
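A tiny sketch of the weighted count; the labels and weights here are made-up illustrative values:

```python
def weighted_count(labels, weights, target):
    # Weighted Count(Y = target) = sum of D(i) over examples whose label == target
    return sum(w for label, w in zip(labels, weights) if label == target)

labels  = ["+", "+", "-", "+", "-"]
weights = [0.1, 0.4, 0.2, 0.2, 0.1]            # D(i): one weight per example

print(weighted_count(labels, weights, "+"))    # weighted count: 0.7
print(weighted_count(labels, [1] * 5, "+"))    # D(i) = 1 recovers the plain count: 3
```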
62
How? Many possibilities — we will see one shortly.
Why? Reweight the data: examples i that are misclassified get higher weights.
– y_i·h_t(x_i) > 0 means h_t is correct on example i; y_i·h_t(x_i) < 0 means h_t is wrong
– If h_t is correct and α_t > 0, then D_{t+1}(i) < D_t(i)
– If h_t is wrong and α_t > 0, then D_{t+1}(i) > D_t(i)
Final result: a linear sum of "base" or "weak" classifier outputs.
63
ε_t: the error of h_t, weighted by D_t; 0 ≤ ε_t ≤ 1.
α_t as a function of ε_t:
– No errors (ε_t = 0): α_t = ∞
– All errors (ε_t = 1): α_t = −∞
– Random (ε_t = 0.5): α_t = 0
(plot of α_t versus ε_t)
64
What α_t to choose for hypothesis h_t?
Idea: choose α_t to minimize a bound on the training error. [Schapire, 1989]
65
What α_t to choose for hypothesis h_t?
Idea: choose α_t to minimize a bound on the training error:
(1/m) Σ_i 1[H(x_i) ≠ y_i] ≤ Π_t Z_t,
where Z_t = Σ_i D_t(i)·exp(−α_t y_i h_t(x_i)) is the normalizer of the weight update.
This inequality isn't obvious! It can be shown with algebra (telescoping sums).
If we minimize Π_t Z_t, we minimize our training error!
We can tighten this bound greedily, by choosing α_t and h_t on each iteration to minimize Z_t.
h_t is estimated as a black box, but can we solve for α_t? [Schapire, 1989]
66
Summary: choose α_t to minimize the error bound
We can squeeze this bound by choosing α_t on each iteration to minimize Z_t.
For boolean Y: differentiate, set equal to 0 — there is a closed-form solution [Freund & Schapire '97]:
α_t = ½ ln((1 − ε_t) / ε_t)
[Schapire, 1989]
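Putting the last few slides together, a minimal AdaBoost-style sketch: ε_t is the D_t-weighted error, α_t uses the closed form above, and misclassified examples are up-weighted. It assumes scikit-learn, with depth-1 trees standing in for decision stumps and a synthetic dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=300, n_features=10, random_state=0)
y = 2 * y01 - 1                              # labels in {-1, +1}

n, T = len(X), 50
D = np.full(n, 1.0 / n)                      # D_1: uniform weights
stumps, alphas = [], []

for t in range(T):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
    pred = stump.predict(X)
    eps = D[pred != y].sum()                             # epsilon_t: weighted error
    alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))    # closed-form alpha_t
    D = D * np.exp(-alpha * y * pred)                    # up-weight mistakes
    D = D / D.sum()                                      # renormalize (Z_t)
    stumps.append(stump)
    alphas.append(alpha)

# Final classifier: sign of the alpha-weighted vote of the weak hypotheses
F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training error of the ensemble:", np.mean(np.sign(F) != y))
```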
67
Strong vs. weak classifiers
If each classifier is (at least slightly) better than random, i.e. ε_t < 0.5, another bound on the training error is exp(−2 Σ_t (½ − ε_t)²).
What does this imply about the training error?
– It will reach zero!
– And it will get there exponentially fast!
Is it hard to achieve better-than-random training error?
68
Boosting results – digit recognition [Schapire, 1989]
Boosting seems to be robust to overfitting: the test error can decrease even after the training error reaches zero!
(plot of training and test error)
69
Boosting generalization error bound [Freund & Schapire, 1996]
Constants:
– T: number of boosting rounds. Higher T ⇒ looser bound. What does this imply?
– d: VC dimension of the weak learner; measures the complexity of the classifier. Higher d ⇒ bigger hypothesis space ⇒ looser bound.
– m: number of training examples. More data ⇒ tighter bound.
70
Boosting generalization error bound [Freund & Schapire, 1996]
Constants:
– T: number of boosting rounds. Higher T ⇒ looser bound. What does this imply?
– d: VC dimension of the weak learner; measures the complexity of the classifier. Higher d ⇒ bigger hypothesis space ⇒ looser bound.
– m: number of training examples. More data ⇒ tighter bound.
Theory does not match practice:
– Robust to overfitting
– Test set error decreases even after training error is zero
Need better analysis tools — we'll come back to this later in the quarter.
71
Boosting: Experimental Results [Freund & Schapire, 1996]
Comparison of C4.5, boosted C4.5, and boosted decision stumps (depth-1 trees) on 27 benchmark datasets.
(scatter plots of error)
72
(figure: training and test error)
73
Logistic Regression as Minimizing Loss
Logistic regression assumes P(Y = 1 | x) = 1 / (1 + exp(−w·x))
and tries to maximize the data likelihood, which for Y = {−1, +1} is Π_i 1 / (1 + exp(−y_i w·x_i)).
This is equivalent to minimizing the log loss Σ_i ln(1 + exp(−y_i w·x_i)).
74
Boosting and Logistic Regression
Logistic regression is equivalent to minimizing the log loss Σ_i ln(1 + exp(−y_i f(x_i))).
Boosting minimizes a similar loss function, the exponential loss Σ_i exp(−y_i f(x_i)).
Both are smooth approximations of the 0/1 loss!
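A small numeric comparison of the three losses as functions of the margin y·f(x) (the margin values are arbitrary illustrative points):

```python
import numpy as np

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])    # y * f(x)

zero_one = (margins <= 0).astype(float)            # 0/1 loss
log_loss = np.log(1 + np.exp(-margins))            # loss minimized by logistic regression
exp_loss = np.exp(-margins)                        # loss minimized by boosting

for m, z, l, e in zip(margins, zero_one, log_loss, exp_loss):
    print(f"margin {m:+.1f}:  0/1 = {z:.0f}   log = {l:.3f}   exp = {e:.3f}")
```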
75
Logistic regression and Boosting
Logistic regression:
– Minimizes the log loss
– Defines f(x) = w_0 + Σ_j w_j x_j, where the features x_j are predefined
– Jointly optimizes the parameters w_0, w_1, …, w_n via gradient ascent
Boosting:
– Minimizes the exponential loss
– Defines f(x) = Σ_t α_t h_t(x), where each h_t(x) is defined dynamically to fit the data
– The weights α_t are learned incrementally (a new one for each training pass)
76
What you need to know about Boosting
Combine weak classifiers to get a very strong classifier:
– Weak classifier: slightly better than random on training data
– Resulting strong classifier: can get zero training error
The AdaBoost algorithm.
Boosting vs. Logistic Regression:
– Both are linear models; boosting "learns" its features
– Similar loss functions
– Single optimization (LR) vs. incrementally improving classification (Boosting)
Most popular application of Boosting: boosted decision stumps — very simple to implement, very effective classifier.