
1 Ensemble Learning, Boosting, and Bagging: Scaling up Decision Trees (with thanks to William Cohen of CMU, Michael Malohlava of 0xdata, and Manish Amde of Origami Logic)

2 Decision tree learning

3 A decision tree

4 A regression tree: each leaf holds the playing times of the training examples that reach it (e.g., Play = 45m, 45m, 60m, 40m) and predicts their average (e.g., Play ≈ 48m).

5 Most decision tree learning algorithms
1. Given dataset D:
   – return leaf(y) if all examples are in the same class y, or if some other stopping condition holds
   – pick the best split, on the best attribute a:
     a < θ or a ≥ θ
     a or not(a)
     a = c1 or a = c2 or …
     a in {c1, …, ck} or not
   – split the data into D1, D2, …, Dk and recursively build trees for each subset
2. “Prune” the tree
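A minimal sketch of this recursive procedure in Python (not from the slides): it uses binary thresholds on numeric attributes and size-weighted entropy as the split score. The names (build_tree, best_split), the toy data, and the max_depth stopping rule are illustrative choices.

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    n = len(labels)
    return 0.0 - sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels):
    # Try every attribute a and every threshold theta; return the (attribute, theta)
    # pair whose partition has the lowest size-weighted entropy.
    best = None
    for a in range(len(rows[0])):
        for theta in sorted(set(r[a] for r in rows)):
            left = [y for r, y in zip(rows, labels) if r[a] < theta]
            right = [y for r, y in zip(rows, labels) if r[a] >= theta]
            if not left or not right:
                continue
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, a, theta)
    return best

def build_tree(rows, labels, depth=0, max_depth=5):
    # Return leaf(y) if all examples are in the same class y, or a stopping condition holds
    if len(set(labels)) == 1 or depth == max_depth:
        return ("leaf", Counter(labels).most_common(1)[0][0])
    split = best_split(rows, labels)
    if split is None:
        return ("leaf", Counter(labels).most_common(1)[0][0])
    _, a, theta = split
    # Split the data into D1, D2 and recursively build trees for each subset
    D1 = [(r, y) for r, y in zip(rows, labels) if r[a] < theta]
    D2 = [(r, y) for r, y in zip(rows, labels) if r[a] >= theta]
    return ("node", a, theta,
            build_tree([r for r, _ in D1], [y for _, y in D1], depth + 1, max_depth),
            build_tree([r for r, _ in D2], [y for _, y in D2], depth + 1, max_depth))

# Tiny usage example: "play outside?" from (temperature, humidity)
rows = [(30, 80), (22, 40), (25, 90), (18, 35)]
labels = ["no", "yes", "no", "yes"]
print(build_tree(rows, labels))
```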

6 Tree planting
– Training sample of points covering the area [0, 3] x [0, 3]
– Two possible colors of points
– The model should predict the color of a new point

7 Decision Tree

8 How to grow a decision tree?
– Split the rows in a given node into two sets with respect to an impurity measure
– Popular splitting criterion: try to lower the entropy of the y labels on the resulting partition
   – i.e., prefer splits whose partitions contain few distinct labels, or have very skewed label distributions
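As a concrete check of this criterion (with hypothetical label counts): a split whose partitions have skewed label distributions scores a lower weighted entropy than one whose partitions mirror the parent.

```python
import math

def entropy(counts):
    # Shannon entropy of a label distribution given as raw counts
    n = sum(counts)
    return 0.0 - sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def weighted_entropy(partitions):
    # Size-weighted average entropy of the partitions produced by a split
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * entropy(p) for p in partitions)

# Parent node: 10 "play" vs 10 "don't play" examples
print(entropy([10, 10]))                   # 1.0 bit
# Split A: nearly pure partitions, so low weighted entropy (preferred)
print(weighted_entropy([[9, 1], [1, 9]]))  # about 0.47
# Split B: each partition mirrors the parent, so no entropy reduction
print(weighted_entropy([[5, 5], [5, 5]]))  # 1.0
```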

9 When to stop growing a tree?
1. Build a full tree, OR
2. Apply a stopping criterion
   – Tree depth
   – Minimum number of points on a leaf

10 How to assign a leaf value? The leaf value is
– the color, if the leaf contains only one point
– the majority color, or the color distribution, if the leaf contains multiple points

11 Trained decision tree: the tree covers the entire area with axis-aligned rectangles, each predicting a point color.

12 Decision tree scoring: the model can predict a new point’s color based on its coordinates.

13 Decision trees: plus and minus
– Simple and fast to learn
– Arguably easy to understand (if compact)
– Very fast to use: often you don’t even need to compute all attribute values
– Can find interactions between variables (play if it’s cool and sunny or …) and hence non-linear decision boundaries
– Don’t need to worry about how numeric values are scaled

14 Decision trees: plus and minus
– Hard to prove things about
– Not well-suited to probabilistic extensions
– Don’t (typically) improve over linear classifiers when you have lots of features
– Sometimes fail badly on problems that linear classifiers perform well on

16 Another view of a decision tree

17 A decision tree on the iris data, built from splits such as Sepal_length < 5.7 and Sepal_width > 2.8.

18 Another view of a decision tree

19 Another picture…

20 Problem of overfitting: the tree perfectly represents the training data (100% classification accuracy on the training data), but it has also learned the noise.

21 How to handle overfitting?
– Pre-pruning: stopping criterion
– Post-pruning
– Random Forests! Randomize tree building and combine the trees together

22 “Bagging”
– Create bootstrap samples of the training data
– Build independent trees on these samples

23 “Bagging”

24 Each tree sees only a sample of the training data and captures only a part of the information
– Build multiple “weak” trees which vote together to give the final prediction, as sketched below
– Voting based on majority or a weighted average
– Ensemble learning! Boosting!
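A minimal sketch of the bag-and-vote idea just described, using scikit-learn's DecisionTreeClassifier as the base tree; the synthetic dataset and hyperparameters are illustrative, not from the slides.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(50):
    # Bootstrap sample: draw n rows with replacement from the training data
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(max_depth=5).fit(X[idx], y[idx]))

# Each "weak" tree votes; the ensemble predicts the majority class
votes = np.stack([t.predict(X) for t in trees])     # shape: (n_trees, n_samples)
majority = (votes.mean(axis=0) >= 0.5).astype(int)  # majority vote for 0/1 labels
print("ensemble accuracy on the training set:", (majority == y).mean())
```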

25 Validation
– Each tree is built over a sample of the training points
– The remaining points are called “out-of-bag” (OOB)
– The OOB error is a good approximation of the generalization error
   – Almost identical to N-fold cross-validation
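In practice this estimate comes almost for free from bagged-tree implementations; for example, a sketch with scikit-learn's random forest, where oob_score=True scores each point using only the trees that did not see it (dataset and settings are illustrative).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Each point is scored only by the trees whose bootstrap sample did NOT contain it,
# giving an out-of-bag estimate of the generalization error without a held-out set.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)
print("OOB accuracy estimate:", forest.oob_score_)
```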

28 Generally, bagged decision trees eventually outperform a linear classifier if the data is large enough and clean enough.

29 Scaling up decision tree algorithms
1. Given dataset D:
   – return leaf(y) if all examples are in the same class y
   – pick the best split, on the best attribute a:
     a < θ or a ≥ θ
     a or not(a)
     a = c1 or a = c2 or …
     a in {c1, …, ck} or not
   – split the data into D1, D2, …, Dk and recursively build trees for each subset
2. “Prune” the tree

30 Scaling up decision tree algorithms (same outline as slide 29): easy cases!

31 Scaling up decision tree algorithms (same outline as slide 29). Numeric attribute:
– sort the examples by a, retaining the label y
– scan through once and update the histogram of y | a < θ at each candidate point θ
– pick the threshold θ with the best entropy score
– O(n log n) due to the sort, but repeated for each attribute (sketched below)
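A sketch of that sort-and-scan pass for a single numeric attribute, with illustrative names (best_threshold): sort by a once, then sweep left to right while maintaining running label counts, so every candidate threshold is scored in one pass.

```python
import math
from collections import Counter

def entropy(counts):
    # Shannon entropy of a Counter of label counts
    n = sum(counts.values())
    return 0.0 - sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)

def best_threshold(a_values, labels):
    # Sort the examples by attribute a, retaining the label y (O(n log n))
    pairs = sorted(zip(a_values, labels))
    left, right = Counter(), Counter(label for _, label in pairs)
    n = len(pairs)
    best = None
    # Scan through once, moving one example at a time from the right side to the left,
    # so the class histograms of y | a < theta and y | a >= theta stay up to date.
    for i in range(n - 1):
        left[pairs[i][1]] += 1
        right[pairs[i][1]] -= 1
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # a threshold can only separate distinct attribute values
        theta = (pairs[i][0] + pairs[i + 1][0]) / 2
        score = ((i + 1) * entropy(left) + (n - i - 1) * entropy(right)) / n
        if best is None or score < best[0]:
            best = (score, theta)
    return best  # (weighted entropy, threshold theta)

print(best_threshold([1.0, 2.0, 3.0, 4.0, 5.0], ["a", "a", "b", "b", "b"]))
# (0.0, 2.5): the perfect split between the "a"s and the "b"s
```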

32 Scaling up decision tree algorithms (same outline as slide 29). Numeric attribute, alternatively:
– fix a set of possible split points θ in advance
– scan through once and compute the histogram of the y’s
– O(n) per attribute (sketched below)
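A sketch of the pre-binned variant: with thresholds fixed in advance, one O(n) pass fills a per-bin label histogram, from which any candidate split can then be scored. The helper name and the toy data are illustrative.

```python
from collections import Counter, defaultdict

def binned_label_histograms(a_values, labels, thresholds):
    # One O(n) pass: count the labels falling into each bin defined by the
    # pre-chosen thresholds; any candidate split can then be scored from
    # cumulative sums of these per-bin histograms.
    hist = defaultdict(Counter)
    for v, y in zip(a_values, labels):
        bin_id = sum(v >= t for t in thresholds)  # index of the bin this value lands in
        hist[bin_id][y] += 1
    return hist

a = [0.3, 1.2, 2.7, 0.9, 2.1, 1.8]
y = ["red", "red", "blue", "red", "blue", "blue"]
thresholds = [1.0, 2.0]  # fixed in advance, e.g. from quantiles of a
print(dict(binned_label_histograms(a, y, thresholds)))
# bin 0 (a < 1.0): 2 red; bin 1 (1.0 <= a < 2.0): 1 red, 1 blue; bin 2 (a >= 2.0): 2 blue
```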

33 Scaling up decision tree algorithms (same outline as slide 29). Subset splits:
– expensive but useful
– there is a similar sorting trick that works for the regression case

34 Scaling up decision tree algorithms (same outline as slide 29). Points to ponder:
– different subtrees are distinct tasks
– once the data is in memory, this algorithm is fast
   – each example appears only once in each level of the tree
   – depth of the tree is usually O(log n)
– as you move down the tree, the datasets get smaller

36 Scaling up decision tree algorithms (same outline as slide 29). The classifier is sequential and so is the learning algorithm: it’s really hard to see how you can learn the lower levels without learning the upper ones first!

37 Scaling up decision tree algorithms (same outline as slide 29). Bottleneck points:
– what’s expensive is picking the attributes, especially at the top levels
– also, moving the data around in a distributed setting

38 Boosting (AdaBoost)
– Assign a weight to every training example; start with uniform weights
– Train a weak learner on the weighted data, then check which examples it gets right or wrong
– If the weak learner gets an example WRONG: upweight that example
– If the weak learner gets an example RIGHT: downweight it
– Repeat, and combine the weak learners with a vote weighted by each learner’s accuracy (a minimal sketch follows)
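A minimal AdaBoost sketch along these lines, using one-split scikit-learn trees (decision stumps) as the weak learners; the dataset, the number of rounds, and the small numerical guards are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=1000, n_features=10, random_state=0)
y = 2 * y01 - 1                 # AdaBoost is cleanest with labels in {-1, +1}

n = len(X)
w = np.full(n, 1.0 / n)         # start with uniform example weights
stumps, alphas = [], []

for _ in range(50):
    stump = DecisionTreeClassifier(max_depth=1)  # a "weak" one-split tree
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y].sum()                     # weighted training error
    alpha = 0.5 * np.log((1 - err + 1e-10) / (err + 1e-10))  # learner weight: higher if more accurate
    # Upweight the examples this weak learner got WRONG, downweight the ones it got RIGHT
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: a vote of all weak learners, each weighted by its alpha
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("boosted training accuracy:", (np.sign(scores) == y).mean())
```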

39 Decision Trees in Spark MLlib
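For reference, a sketch of training a tree through the RDD-based MLlib API in PySpark, following the standard DecisionTree.trainClassifier usage; the data path is a placeholder and the parameter values are illustrative rather than the settings used in the talk.

```python
from pyspark import SparkContext
from pyspark.mllib.tree import DecisionTree
from pyspark.mllib.util import MLUtils

sc = SparkContext(appName="mllib-decision-tree")

# Load a LIBSVM-format dataset into an RDD of LabeledPoint (the path is a placeholder)
data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
train, test = data.randomSplit([0.7, 0.3], seed=42)

# maxBins fixes the candidate split points per feature in advance (the O(n) histogram
# strategy from slide 32); maxDepth acts as the pre-pruning stopping criterion.
model = DecisionTree.trainClassifier(
    train,
    numClasses=2,
    categoricalFeaturesInfo={},   # all features treated as continuous here
    impurity="gini",
    maxDepth=5,
    maxBins=32,
)

predictions = model.predict(test.map(lambda p: p.features))
labels_and_preds = test.map(lambda p: p.label).zip(predictions)
test_error = labels_and_preds.filter(lambda lp: lp[0] != lp[1]).count() / float(test.count())
print("Test error:", test_error)
```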

41 Scaling results
– Synthetic dataset (unspecified)
– 10 to 50 million instances
– 10 to 50 features
– 2 to 16 machines
– 700 MB to 18 GB datasets

42 Large-scale experiment: 500M instances, 20 features, 90 GB dataset

44 The End is coming
– Student lecture tomorrow
– Final lecture Friday: overview of distributed / big data frameworks
– Thanksgiving break
– PROJECT TALKS!
– Final project write-ups due Dec. 9
   – 6-10 pages (I stop reading after pg. 10)
   – NIPS format (see course website)

