
1 CSE 446: Ensemble Learning Winter 2012 Daniel Weld Slides adapted from Tom Dietterich, Luke Zettlemoyer, Carlos Guestrin, Nick Kushmerick, Padraig Cunningham

2 Ensembles of Classifiers. Traditional approach: use one classifier. Can one do better? Approaches: cross-validated committees, bagging, boosting, stacking.

3 Voting

4 Ensembles of Classifiers. Assume the classifiers' errors are independent, each classifier errs with probability 0.3, and we combine an ensemble of 21 classifiers by majority vote. The majority is wrong only when 11 or more classifiers err; that probability is the area under the binomial distribution (over the number of classifiers in error) for ≥ 11 errors, about 0.026, an order of magnitude better than a single classifier. (Figure: binomial distribution of the number of classifiers in error.) A numerical check of this figure follows below.
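A quick sanity check of the 0.026 figure (an illustrative Python sketch, not part of the slides): sum the binomial tail for 11 or more errors out of 21 classifiers that each err with probability 0.3.

```python
from math import comb

n, p = 21, 0.3
# P(majority wrong) = P(11 or more of the 21 independent classifiers err)
p_majority_wrong = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(11, n + 1))
print(f"P(>= 11 of {n} wrong) = {p_majority_wrong:.4f}")  # about 0.026
```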

5 Constructing Ensembles: Cross-Validated Committees. Partition the examples into k disjoint equivalence classes, as in cross-validation holdout. Create k training sets, each the union of all classes except one, so each set contains (k-1)/k of the original training data. Train a classifier on each set. A sketch of the split construction follows.
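A minimal sketch of the split construction (function and variable names are mine, not from the slides): partition the example indices into k folds and give each committee member everything except one fold.

```python
import numpy as np

def cv_committee_splits(n_examples, k, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_examples), k)
    # Training set i is the union of every fold except fold i,
    # i.e. (k-1)/k of the original data.
    return [np.concatenate([f for j, f in enumerate(folds) if j != i]) for i in range(k)]

train_sets = cv_committee_splits(n_examples=100, k=5)
print([len(s) for s in train_sets])  # each member sees 80 of the 100 examples
```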

6 Ensemble Construction II: Bagging. Generate k training sets; for each, draw m examples randomly with replacement from the original set of m examples. Each training set therefore covers about 63.2% of the original examples (plus duplicates). Train a classifier on each set. Intuition: resampling makes the combined learner more robust to noise and outliers in the data. The 63.2% figure is checked in the sketch below.
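The 63.2% figure comes from 1 − (1 − 1/m)^m → 1 − 1/e as m grows; a small numerical check (the sample size here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000
sample = rng.integers(0, m, size=m)              # m draws with replacement
frac_distinct = np.unique(sample).size / m
print(f"distinct originals in a bootstrap sample: {frac_distinct:.3f}")  # ~0.632
```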

7 Ensemble Creation III: Boosting. Create the first weight distribution over training examples uniformly: {w_i^1}. Create M classifiers iteratively; on iteration m, train C_m by minimizing the weighted error Σ_i w_i^m I(y_m(x_i) ≠ t_i), where I(·) is the indicator function (0 or 1). Then modify the distribution, increasing the probability of each example that was predicted incorrectly, and assign classifier C_m a confidence that is a function of its error, preferring accurate classifiers. Finally, combine the classifiers. Boosting creates harder and harder learning problems, an optimized choice of examples.

8 Boosting Decision Stumps. (Figure: a decision stump on a mushroom dataset, testing "Gills > X" and predicting Poison or Edible.)

9 Bagging vs Boosting. Slide from T Dietterich

10 Ensemble Creation IV: Stacking. Train several base learners, then train a meta-learner that learns when the base learners are right or wrong and arbitrates among them. Train using cross-validated committees: the meta-learner's inputs are the base learners' predictions, and its training examples are the held-out 'test set' from cross-validation.

11 Causes of Expected Error. Notation: true function f, hypothesis h, observed data. Variance: how much h(x*) varies from one training set to another. Bias: the average error of h(x*). Noise: how much y* varies from f(x*).

12 Tradeoff Variance: – How much h(x*) varies from one training set to another Bias: – Describes the average error of h(x*). Overfitting – too much variance Underfitting – too much bias

13 Interlude: Bias. Bias (statistics): the difference between an estimator's expectation and the true value of the parameter being estimated. Inductive bias: the set of assumptions that the learner uses to predict outputs for inputs it has not encountered. Bias (engineering): establishing a predetermined voltage at a point in a circuit to set an appropriate operating point; compare the bias (intercept) term w_0 in the linear model y = w_0 + Σ_i w_i x_i.

14 Bias-Variance Analysis in Regression. The true function is y = f(x) + ε, where ε is normally distributed with zero mean and standard deviation σ. Given a set of training examples {(x_i, y_i)}, we fit a hypothesis h(x) = w · x + b to the data by minimizing the squared error Σ_i [y_i − h(x_i)]². Slide from T Dietterich

15 Example: 20 points, y = x + 2 sin(1.5x) + N(0, 0.2). Slide from T Dietterich. (Reproduced in the sketch below.)
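A small sketch reproducing this toy dataset and the straight-line fit; the x-range and the reading of N(0, 0.2) as a standard deviation are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=20)
y = x + 2 * np.sin(1.5 * x) + rng.normal(0, 0.2, size=20)

# Least-squares fit of h(x) = w*x + b to the 20 noisy points
w, b = np.polyfit(x, y, deg=1)
print(f"h(x) = {w:.2f} * x + {b:.2f}")
```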

16 50 fits (20 examples each) Slide from T Dietterich

17 Bias-Variance Analysis. Now, given a new data point x* (with observed value y* = f(x*) + ε), we would like to understand the expected prediction error E[(y* − h(x*))²]. Slide from T Dietterich

18 Classical Statistical Analysis. Imagine that our particular training sample S is drawn from some population of possible training samples according to P(S). Compute E_P[(y* − h(x*))²] and decompose it into "bias", "variance", and "noise". Slide from T Dietterich

19 Lemma. Let Z be a random variable with probability distribution P(Z), and let Z̄ = E_P[Z] be the average value of Z.
Lemma: E[(Z − Z̄)²] = E[Z²] − Z̄².
Proof: E[(Z − Z̄)²] = E[Z² − 2ZZ̄ + Z̄²] = E[Z²] − 2E[Z]Z̄ + Z̄² = E[Z²] − 2Z̄² + Z̄² = E[Z²] − Z̄².
Corollary: E[Z²] = E[(Z − Z̄)²] + Z̄².
Slide from T Dietterich

20 Bias-Variance-Noise Decomposition. Write h̄(x*) = E[h(x*)]. Then
E[(h(x*) − y*)²] = E[h(x*)² − 2 h(x*) y* + y*²]
= E[h(x*)²] − 2 E[h(x*)] E[y*] + E[y*²]
= E[(h(x*) − h̄(x*))²] + h̄(x*)² (lemma) − 2 h̄(x*) f(x*) + E[(y* − f(x*))²] + f(x*)² (lemma)
= E[(h(x*) − h̄(x*))²] [variance] + (h̄(x*) − f(x*))² [bias²] + E[(y* − f(x*))²] [noise]
Slide from T Dietterich

21 Derivation (continued).
E[(h(x*) − y*)²] = E[(h(x*) − h̄(x*))²] + (h̄(x*) − f(x*))² + E[(y* − f(x*))²]
= Var(h(x*)) + Bias(h(x*))² + E[ε²]
= Var(h(x*)) + Bias(h(x*))² + σ²
Expected prediction error = Variance + Bias² + Noise². Slide from T Dietterich

22 Bias, Variance, and Noise. Variance: E[(h(x*) − h̄(x*))²], how much h(x*) varies from one training set S to another. Bias: [h̄(x*) − f(x*)], the average error of h(x*). Noise: E[(y* − f(x*))²] = E[ε²] = σ², how much y* varies from f(x*). Slide from T Dietterich. The Monte Carlo sketch below checks this decomposition numerically.
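A Monte Carlo sketch of the decomposition at a single query point x*, assuming the slides' toy problem y = x + 2 sin(1.5x) + noise (the number of trials, the x-range, and x* = 5 are illustrative choices): repeatedly draw a training set, fit a line, predict at x*, then compare the measured expected error with Variance + Bias² + σ².

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x + 2 * np.sin(1.5 * x)           # true function
sigma, x_star, trials = 0.2, 5.0, 2000

preds = np.empty(trials)
for t in range(trials):
    x = rng.uniform(0, 10, size=20)             # a fresh training set of 20 points
    y = f(x) + rng.normal(0, sigma, size=20)
    w, b = np.polyfit(x, y, deg=1)              # fit h(x) = w*x + b
    preds[t] = w * x_star + b                   # h(x*) for this training set

variance = preds.var()
bias2 = (preds.mean() - f(x_star)) ** 2
y_star = f(x_star) + rng.normal(0, sigma, size=trials)   # fresh noisy observations at x*
expected_err = np.mean((y_star - preds) ** 2)

print(f"Var={variance:.3f}  Bias^2={bias2:.3f}  sigma^2={sigma**2:.3f}")
print(f"sum={variance + bias2 + sigma**2:.3f}  vs  E[(y*-h(x*))^2]={expected_err:.3f}")
```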

23 50 fits (20 examples each) Slide from T Dietterich

24 Bias Slide from T Dietterich

25 Variance Slide from T Dietterich

26 Noise Slide from T Dietterich

27 50 fits (20 examples each) Slide from T Dietterich

28 Distribution of predictions at x=2.0 Slide from T Dietterich

29 50 fits (20 examples each) Slide from T Dietterich

30 Distribution of predictions at x=5.0 Slide from T Dietterich

31 Polynomial Curve Fitting. Hypothesis space: polynomials; the data come from sin(2πx).

32 1st-Order Polynomial

33 3rd-Order Polynomial

34 9th-Order Polynomial

35 Regularization. Penalize large coefficient values, e.g. by adding a term λ‖w‖² to the training error. Increasing λ reduces variance at the cost of higher bias. A ridge-regression sketch follows.
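A hedged ridge-regression sketch of the idea (the dataset size, noise level, and λ values are illustrative, not from the slides): fit a 9th-order polynomial to noisy sin(2πx) data with the λ‖w‖² penalty, and watch the coefficient magnitudes shrink as λ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)
Phi = np.vander(x, N=10, increasing=True)        # features 1, x, ..., x^9

for lam in (1e-9, 1e-3, 1.0):
    # Closed-form ridge solution: w = (Phi^T Phi + lam I)^{-1} Phi^T t
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)
    print(f"lambda = {lam:g}   max |w_j| = {np.abs(w).max():.1f}")
```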

36 Regularization:

37

38 Regularization: (figure comparing results for different values of the regularization parameter).

39 Ensemble Methods. Combining many biased learners (e.g. decision stumps) keeps variance low, yet the combination can represent more expressive hypotheses, so it also lowers the error from bias.

40 Measuring Bias and Variance. In practice (unlike in theory), we have only ONE training set S. We can simulate multiple training sets by bootstrap replicates: S' = {x | x is drawn at random with replacement from S}, with |S'| = |S|. Slide from T Dietterich

41 Procedure for Measuring Bias and Variance. Construct B bootstrap replicates of S (e.g., B = 200): S_1, …, S_B. Apply the learning algorithm to each replicate S_b to obtain hypothesis h_b. Let T_b = S \ S_b be the data points that do not appear in S_b (the out-of-bag points). Compute the predicted value h_b(x) for each x in T_b. Slide from T Dietterich

42 Estimating Bias and Variance (continued). For each data point x, we now have the observed value y and several predictions y_1, …, y_K. Compute the average prediction h̄. Estimate the bias as (h̄ − y) and the variance as Σ_k (y_k − h̄)² / (K − 1), and assume the noise is 0. Slide from T Dietterich. The sketch below implements this procedure.
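A sketch of this out-of-bag procedure in Python; the base learner (a straight-line fit) and the dataset are illustrative choices, not prescribed by the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = x + 2 * np.sin(1.5 * x) + rng.normal(0, 0.2, size=50)

B, n = 200, len(x)
preds = [[] for _ in range(n)]                   # out-of-bag predictions per point
for _ in range(B):
    idx = rng.integers(0, n, size=n)             # bootstrap replicate S_b
    w, b = np.polyfit(x[idx], y[idx], deg=1)     # h_b = learn(S_b)
    for i in np.setdiff1d(np.arange(n), idx):    # T_b = S \ S_b (out-of-bag points)
        preds[i].append(w * x[i] + b)

bias, var = [], []
for i, p in enumerate(preds):
    if len(p) > 1:
        h_bar = np.mean(p)                       # average prediction at x_i
        bias.append(h_bar - y[i])                # estimated bias (noise assumed 0)
        var.append(np.var(p, ddof=1))            # sum_k (y_k - h_bar)^2 / (K - 1)

print(f"mean bias^2 = {np.mean(np.square(bias)):.3f}   mean variance = {np.mean(var):.3f}")
```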

43 Approximations in this Procedure Bootstrap replicates are not real data We ignore the noise – If we have multiple data points with the same x value, then we can estimate the noise – We can also estimate noise by pooling y values from nearby x values (another use for Random Forest proximity measure?) Slide from T Dietterich

44 Bagging. Bagging constructs B bootstrap replicates and their corresponding hypotheses h_1, …, h_B. It makes predictions according to y = Σ_b h_b(x) / B; hence, bagging's prediction is h̄(x). Slide from T Dietterich

45 Estimated Bias and Variance of Bagging. If we estimate bias and variance using the same B bootstrap samples, we get Bias = (h̄ − y), the same as before, and Variance = Σ_k (h̄ − h̄)² / (K − 1) = 0. Hence, according to this approximate way of estimating variance, bagging removes the variance while leaving bias unchanged. In reality, bagging only reduces variance and tends to slightly increase bias. Slide from T Dietterich

46 Bias/Variance Heuristics Models that fit the data poorly have high bias: “inflexible models” such as linear regression, regression stumps Models that can fit the data very well have low bias but high variance: “flexible” models such as nearest neighbor regression, regression trees This suggests that bagging of a flexible model can reduce the variance while benefiting from the low bias Slide from T Dietterich

47 BAGGing = Bootstrap AGGregation (Breiman, 1996). For i = 1, 2, …, K: T_i ← randomly select M training instances with replacement; h_i ← learn(T_i) [ID3, naive Bayes, kNN, a neural net, …]. Now combine the h_i with uniform voting (w_i = 1/K for all i). A runnable sketch follows.
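A runnable sketch of this loop, assuming scikit-learn is available for the base learner; the synthetic dataset, K, and the tree settings are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # toy labels

K, M = 25, len(X)
trees = []
for _ in range(K):
    idx = rng.integers(0, M, size=M)             # M instances drawn with replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))   # h_i = learn(T_i)

def bagged_predict(Xq):
    votes = np.mean([t.predict(Xq) for t in trees], axis=0)      # uniform weights 1/K
    return (votes >= 0.5).astype(int)

print("training accuracy:", (bagged_predict(X) == y).mean())
```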

48

49 Decision tree learning algorithm; very similar to ID3.

50 Shades of blue/red indicate the strength of the vote for a particular classification.

51

52 Fighting the bias-variance tradeoff Simple (a.k.a. weak) learners are good – e.g., naïve Bayes, logistic regression, decision stumps (or shallow decision trees) – Low variance, don’t usually overfit Simple (a.k.a. weak) learners are bad – High bias, can’t solve hard learning problems Can we make weak learners always good??? – No!!! – But often yes…

53 Boosting. Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote. On each iteration t: weight each training example by how incorrectly it was classified, learn a hypothesis h_t, and assign it a strength α_t. Final classifier: a weighted vote, H(x) = sign(Σ_t α_t h_t(x)). Practically useful and theoretically interesting. [Schapire, 1989]

54 time = 0. Blue/red = class; size of dot = weight. Weak learner = decision stump: a horizontal or vertical line.

55 time = 1. This hypothesis has 15% error, and so does the ensemble, since the ensemble contains just this one hypothesis.

56 time = 2

57 time = 3

58 time = 13

59 time = 100

60 time = 300. Overfitting!

61 Learning from weighted data. Consider a weighted dataset where D(i) is the weight of the i-th training example (x_i, y_i). Interpretations: the i-th training example counts as if it occurred D(i) times; if I were to "resample" the data, I would get more samples of "heavier" data points. Now always do weighted calculations: e.g., for the MLE in naive Bayes, redefine Count(Y = y) to be the weighted count Σ_i D(i) · 1[y_i = y]. Setting D(j) = 1 (or any constant value) for all j recreates the unweighted case. A tiny sketch follows.
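A tiny sketch of the weighted count; the labels and weights here are made up for illustration.

```python
import numpy as np

y = np.array([1, 1, -1, 1, -1])                  # labels of five training examples
D = np.array([0.1, 0.4, 0.2, 0.1, 0.2])          # weights D(i), summing to 1

count_pos = D[y == 1].sum()                      # weighted Count(Y = +1)
count_neg = D[y == -1].sum()                     # weighted Count(Y = -1)
print(count_pos, count_neg)                      # D(i) = 1 for all i recovers raw counts
```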

62 How? Many possibilities; we will see one shortly. Why? Reweight the data: examples i that are misclassified will get higher weights. The update multiplies D_t(i) by exp(−α_t y_i h_t(x_i)) and renormalizes. Note that y_i h_t(x_i) > 0 when h_t is correct on example i and y_i h_t(x_i) < 0 when it is wrong, so with α_t > 0 we get D_{t+1}(i) < D_t(i) when h_t is correct and D_{t+1}(i) > D_t(i) when h_t is wrong. Final result: a linear (weighted) sum of the "base" or "weak" classifier outputs.

63 ε_t: the error of h_t, weighted by D_t, with 0 ≤ ε_t ≤ 1. α_t = ½ ln((1 − ε_t)/ε_t). No errors: ε_t = 0 gives α_t = ∞. All errors: ε_t = 1 gives α_t = −∞. Random: ε_t = 0.5 gives α_t = 0. (Figure: α_t plotted as a function of ε_t.)

64 What α_t to choose for hypothesis h_t? Idea: choose α_t to minimize a bound on the training error: (1/m) Σ_i 1[H(x_i) ≠ y_i] ≤ Π_t Z_t, where Z_t is the weight normalizer from round t. [Schapire, 1989]

65 What α_t to choose for hypothesis h_t? Idea: choose α_t to minimize a bound on the training error: (1/m) Σ_i 1[H(x_i) ≠ y_i] ≤ Π_t Z_t, where Z_t = Σ_i D_t(i) exp(−α_t y_i h_t(x_i)). (This equality isn't obvious; it can be shown with algebra, via telescoping sums.) If we minimize Π_t Z_t, we minimize our training error. We can tighten this bound greedily by choosing α_t and h_t on each iteration to minimize Z_t. h_t is estimated as a black box, but can we solve for α_t? [Schapire, 1989]

66 Summary: choose α_t to minimize the error bound. We can squeeze this bound by choosing α_t on each iteration to minimize Z_t. For boolean Y, differentiate and set equal to 0; there is a closed-form solution [Freund & Schapire '97]: α_t = ½ ln((1 − ε_t)/ε_t). [Schapire, 1989] A minimal AdaBoost sketch using this α_t follows.
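A minimal AdaBoost sketch with axis-aligned decision stumps, using this closed-form α_t; the stump search, the toy dataset, and T are my own illustrative choices, not the course's code. Labels are in {−1, +1}.

```python
import numpy as np

def stump_predict(stump, X):
    j, thr, s = stump                           # feature index, threshold, sign
    return s * np.where(X[:, j] > thr, 1, -1)

def best_stump(X, y, D):
    """Exhaustively pick the (feature, threshold, sign) with lowest weighted error."""
    best_err, best = np.inf, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (1, -1):
                err = D[stump_predict((j, thr, s), X) != y].sum()
                if err < best_err:
                    best_err, best = err, (j, thr, s)
    return best_err, best

def adaboost(X, y, T=50):
    n = len(y)
    D = np.full(n, 1.0 / n)                     # initial weights: uniform
    ensemble = []
    for _ in range(T):
        eps, stump = best_stump(X, y, D)        # weighted error of the best weak learner
        eps = np.clip(eps, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - eps) / eps)   # closed-form alpha_t
        pred = stump_predict(stump, X)
        D *= np.exp(-alpha * y * pred)          # up-weight mistakes, down-weight correct
        D /= D.sum()                            # renormalize (divide by Z_t)
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    return np.sign(sum(a * stump_predict(s, X) for a, s in ensemble))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)   # a boundary no single stump fits
model = adaboost(X, y, T=50)
print("training error:", (predict(model, X) != y).mean())
```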

67 Strong, weak classifiers. If each classifier is (at least slightly) better than random, i.e. ε_t < 0.5, another bound on the training error holds: it is at most exp(−2 Σ_t (1/2 − ε_t)²). What does this imply about the training error? It will reach zero, and it will get there exponentially fast. Is it hard to achieve better than random training error?

68 Boosting results: digit recognition [Schapire, 1989]. Boosting seems to be robust to overfitting; the test error can decrease even after the training error is zero. (Figure: training and test error curves.)

69 Boosting generalization error bound [Freund & Schapire, 1996]. Constants: T, the number of boosting rounds (higher T gives a looser bound; what does this imply?); d, the VC dimension of the weak learner, which measures the complexity of the classifier (higher d means a bigger hypothesis space and a looser bound); m, the number of training examples (more data gives a tighter bound).

70 Boosting generalization error bound [Freund & Schapire, 1996]. Same constants as above: higher T loosens the bound, higher d (VC dimension of the weak learner) loosens the bound, and more training examples m tighten it. Theory does not match practice: boosting is robust to overfitting, and the test set error decreases even after the training error is zero. We need better analysis tools; we'll come back to this later in the quarter.

71 Boosting: Experimental Results [Freund & Schapire, 1996]. Comparison of C4.5, boosting C4.5, and boosting decision stumps (depth-1 trees) on 27 benchmark datasets. (Figure: error comparison.)

72 (Figure: training and test error curves.)

73 Logistic Regression as Minimizing Loss. Logistic regression assumes P(Y = 1 | x, w) = 1 / (1 + exp(−(w_0 + Σ_j w_j x_j))) and tries to maximize the data likelihood; for Y = {−1, +1} this is equivalent to minimizing the log loss Σ_i ln(1 + exp(−y_i (w_0 + Σ_j w_j x_j))).

74 Boosting and Logistic Regression. Logistic regression is equivalent to minimizing the log loss Σ_i ln(1 + exp(−y_i f(x_i))); boosting minimizes the similar exponential loss Σ_i exp(−y_i f(x_i)). Both are smooth approximations of the 0/1 loss. The sketch below tabulates the two losses against the margin.
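A small sketch tabulating the two surrogate losses against the margin y·f(x), next to the 0/1 loss; the margin grid is arbitrary.

```python
import numpy as np

margin = np.linspace(-2, 2, 9)                   # y * f(x)
zero_one = (margin <= 0).astype(float)           # 0/1 loss
log_loss = np.log(1 + np.exp(-margin))           # logistic regression's loss
exp_loss = np.exp(-margin)                       # boosting's exponential loss

for m, z, l, e in zip(margin, zero_one, log_loss, exp_loss):
    print(f"margin={m:+.1f}  0/1={z:.0f}  log={l:.2f}  exp={e:.2f}")
```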

75 Logistic regression and Boosting. Logistic regression: minimize the log loss with f(x) = w_0 + Σ_j w_j x_j, where the features x_j are predefined; jointly optimize the parameters w_0, w_1, …, w_n via gradient ascent. Boosting: minimize the exponential loss with f(x) = Σ_t α_t h_t(x), where each h_t(x) is defined dynamically to fit the data; the weights α_t are learned incrementally (a new one for each training pass).

76 What you need to know about Boosting. Combine weak classifiers to get a very strong classifier: a weak classifier is only slightly better than random on the training data, yet the resulting strong classifier can reach zero training error. The AdaBoost algorithm. Boosting vs. logistic regression: both are linear models, but boosting "learns" its features; they have similar loss functions; logistic regression performs a single joint optimization, while boosting incrementally improves the classification. The most popular application of boosting: boosted decision stumps, very simple to implement and a very effective classifier.

