Machine Learning & Data Mining, CS/CNS/EE 155
Lecture 9: Decision Trees, Bagging & Random Forests
Announcements
– Homework 2 was harder than anticipated. This will not happen again, and grading will be lenient.
– Homework 3 is released tonight, due two weeks later (much easier than HW2).
– Homework 1 will be graded soon.
Announcements: Kaggle Mini-Project Released
– https://kaggle.com/join/csee155 (also linked from the course website)
– Dataset of financial reports to the SEC; predict whether a financial report was filed during the Great Recession.
– Training set of ~5000 examples with ~500 features; submit predictions on the test set.
– Competition ends in ~3 weeks; the report is due 2 days after the competition ends.
– Grading: 80% on cross validation & model selection, 20% on accuracy. If you do everything properly, you should get at least 90%.
– Don't panic! With proper planning, it shouldn't take that much time.
– Recitation tomorrow (short Kaggle & decision tree tutorial).
This Week
Back to standard classification & regression (no more structured models), with a focus on achieving the highest possible accuracy:
– Decision Trees
– Bagging
– Random Forests
– Boosting
– Ensemble Selection
Decision Trees
(Binary) Decision Tree
Don't overthink this, it is literally what it looks like.
Training data (x = Age, Male?; y = Height > 55"):

Person | Age | Male? | Height > 55"
Alice  | 14  | 0     | 1
Bob    | 10  | 1     | 1
Carol  | 13  | 0     | 1
Dave   |  8  | 1     | 0
Erin   | 11  | 0     | 0
Frank  |  9  | 1     | 1
Gena   | 10  | 0     | 0

Example tree: the root asks Male?. If yes, ask Age > 8? (yes → 1, no → 0); if no, ask Age > 11? (yes → 1, no → 0).
(Binary) Decision Tree
A decision tree has a root node, internal nodes, and leaf nodes.
– Every internal node has a binary query function q(x).
– Every leaf node has a prediction, e.g., 0 or 1.
Prediction starts at the root node and recursively calls the query function: a positive response goes to the left child, a negative response to the right child; repeat until a leaf node is reached.
Example input: Alice (Gender: Female, Age: 14) → prediction: Height > 55".
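To make the prediction procedure concrete, here is a minimal sketch (my own illustration, not code from the lecture; the `Node` class and field names are assumptions) that hardcodes the height example tree and walks an input from the root to a leaf:

```python
# Minimal sketch: a binary decision tree as nested nodes.
# Internal nodes hold a query q(x); leaf nodes hold a prediction.

class Node:
    def __init__(self, query=None, yes=None, no=None, prediction=None):
        self.query = query            # function x -> bool (None for leaf nodes)
        self.yes, self.no = yes, no   # children for positive / negative responses
        self.prediction = prediction  # label stored at leaf nodes

def predict(node, x):
    """Start at the root and recursively follow query answers down to a leaf."""
    if node.query is None:            # reached a leaf node
        return node.prediction
    return predict(node.yes if node.query(x) else node.no, x)

# The example tree from the slides, with x = (age, is_male).
tree = Node(
    query=lambda x: x[1] == 1,                      # Male?
    yes=Node(query=lambda x: x[0] > 8,              # Age > 8?
             yes=Node(prediction=1), no=Node(prediction=0)),
    no=Node(query=lambda x: x[0] > 11,              # Age > 11?
            yes=Node(prediction=1), no=Node(prediction=0)),
)

print(predict(tree, (14, 0)))  # Alice: age 14, female -> predicts 1 (height > 55")
```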
Queries
– A decision tree is defined by its tree of queries.
– A binary query q(x) maps features to 0 or 1.
– Basic form: q(x) = 1[x_d > c], e.g., 1[x_3 > 5], 1[x_1 > 0], 1[x_55 > 1.2].
– This yields an axis-aligned partitioning of the input space.
Basic Decision Tree Function Class
– A "piecewise static" (piecewise constant) function class: all possible partitionings of the feature space, with a static prediction in each partition.
– Partitions are axis-aligned, i.e., no diagonals (extensions next week).
Decision Trees vs Linear Models
Decision trees are NON-LINEAR models! Example: no linear model can achieve 0 error on this data, but a simple decision tree can (e.g., a root split on x_1 > 0 followed by a split on x_2 > 0).
Decision Trees vs Linear Models
Decision trees are NON-LINEAR models! Another example: no linear model can achieve 0 error, but a simple decision tree can (e.g., a root split on x_1 > 0 followed by a split on x_1 > 1).
Decision Trees vs Linear Models
Decision trees are AXIS-ALIGNED! They cannot easily model diagonal boundaries. Example: a simple linear SVM can easily find the max-margin diagonal boundary, while a decision tree requires a complex axis-aligned partitioning, much of it wasted boundary.
More Extreme Example
The decision tree wastes most of its model capacity on useless boundaries (the figure depicts only the useful boundaries).
Decision Trees vs Linear Models
Decision trees are often more accurate! Non-linearity is often more important than axis alignment:
– Just use many axis-aligned boundaries to approximate diagonal boundaries.
– (It's OK to waste model capacity.)
Catch: this requires sufficient training data, which will become clear later in the lecture.
Real Decision Trees
Real decision trees can get much larger! (Image source: http://www.biomedcentral.com/1471-2105/10/116)
Decision Tree Training
– Every node corresponds to a partition (subset) of the training set S.
– Every layer is a complete partitioning of S.
– The children of a node are a complete partitioning of the parent's subset.
(Example: the same training table as above, with a tree whose root asks Male?, followed by Age > 9? on one branch and Age > 11? on the other.)
Thought Experiment
What if the tree has just one node (i.e., just the root node)?
– No queries.
– A single prediction for all of the data (here, the prediction is 1).
The simplest decision tree is just a single root node (which is also a leaf node). It corresponds to the entire training set and makes a single prediction: the majority class in the training set.
Thought Experiment, Continued
What if the tree has 2 levels (i.e., the root node plus 2 children)?
– A single query (which one?).
– Two predictions.
– How many possible queries are there?
#Possible queries = #possible splits = D*N, where D = #features and N = #training points. How do we choose the "best" query?
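To see where the D*N count comes from, the following sketch (my own, not from the lecture) enumerates one candidate threshold per (feature, training point) pair on the height data:

```python
import numpy as np

def candidate_splits(X):
    """Enumerate axis-aligned queries 1[x_d > c]: one threshold per feature per point."""
    N, D = X.shape
    return [(d, X[n, d]) for d in range(D) for n in range(N)]  # D*N candidates

# (age, male) features for Alice, Bob, Carol, Dave, Erin, Frank, Gena
X = np.array([[14, 0], [10, 1], [13, 0], [8, 1], [11, 0], [9, 1], [10, 0]])
print(len(candidate_splits(X)))  # 2 features * 7 points = 14 candidate splits
```

(In practice duplicate thresholds can be deduplicated, but D*N is the count the slide refers to.)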
Impurity
Define an impurity function L(S'), e.g., the 0/1 loss: the classification error of the best single prediction on S'.
Example split: L(S) = 1, and the two children have L(S_1) = 0 and L(S_2) = 1, so the impurity reduction is 0. There is no benefit from this split!
A different split: L(S) = 1, L(S_1) = 0, and L(S_2) = 0, so the impurity reduction is 1. Choose the split with the largest impurity reduction!
Impurity = Loss Function
Training goal: find a decision tree with low impurity. The total impurity over the leaf nodes is the training loss:
L(tree) = Σ_{S'} L(S'),
where S' iterates over the leaf nodes, the union of the S' equals S (the leaf nodes form a partitioning of S), and L(S') is the classification error on S'.
Problems with 0/1 Loss
What split best reduces impurity? In this example, all partitionings give the same impurity reduction: L(S) = 1, and every candidate split yields L(S_1) = 0 and L(S_2) = 1.
Problems with 0/1 Loss
– The 0/1 loss is discontinuous.
– A good partitioning may not improve the 0/1 loss immediately, even though it leads to an accurate model after a subsequent split (e.g., L(S) = 1 before the first split, then L = 0 once the split into S_1, S_2, S_3 is complete).
Surrogate Impurity Measures
We want a more continuous impurity measure. First try, the Bernoulli variance:
L(S') = |S'| * p_{S'} * (1 - p_{S'}),
where p_{S'} is the fraction of S' that are positive examples.
– p_{S'} = 1/2: L(S') = |S'| * 1/4 (worst purity).
– p_{S'} = 1 or p_{S'} = 0: L(S') = 0 (perfect purity).
(The plot of L(S') vs p_{S'} assumes |S'| = 1.)
Bernoulli Variance as Impurity
What split best reduces impurity? With Bernoulli variance the candidates are no longer tied: L(S) = 5/6; one split gives L(S_1) = 0 and L(S_2) = 1/2 (best!), the other gives L(S_1) = 0 and L(S_2) = 3/4. (p_{S'} = fraction of S' that are positive examples.)
Interpretation of Bernoulli Variance
Treat each partition as a distribution over y: within a partition, y is Bernoulli distributed with expected value p_{S'}. The goal is a partitioning in which each y has low variance. (Same example as before: L(S) = 5/6, and the split with L(S_1) = 0, L(S_2) = 1/2 is best.)
Other Impurity Measures
– Entropy: L(S') = -|S'| * [ p_{S'} log p_{S'} + (1 - p_{S'}) log(1 - p_{S'}) ], defining 0*log(0) = 0. Impurity reduction under entropy is also known as information gain; this is the most popular measure.
– Gini index: L(S') = |S'| * [ 1 - p_{S'}^2 - (1 - p_{S'})^2 ].
Most good impurity measures look qualitatively the same!
See also: http://www.ise.bgu.ac.il/faculty/liorr/hbchap9.pdf (terminology is slightly different).
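As a quick reference, here are the three impurity measures as small Python functions (my own sketch; the function names are not from the lecture), checked against the L(S) = 12/7 value that appears in the next slides:

```python
import numpy as np

def bernoulli_variance(y):
    """L(S') = |S'| * p * (1 - p), with p the fraction of positive labels in S'."""
    p = np.mean(y)
    return len(y) * p * (1 - p)

def entropy_impurity(y):
    """L(S') = -|S'| * [p log p + (1 - p) log(1 - p)], defining 0*log(0) = 0."""
    p = np.mean(y)
    return -len(y) * sum(q * np.log2(q) for q in (p, 1 - p) if q > 0)

def gini_impurity(y):
    """L(S') = |S'| * (1 - p^2 - (1 - p)^2)."""
    p = np.mean(y)
    return len(y) * (1 - p**2 - (1 - p)**2)

y = np.array([1, 1, 1, 0, 0, 1, 0])     # Height > 55" labels from the example table
print(bernoulli_variance(y))            # 7 * (4/7) * (3/7) = 12/7 ≈ 1.714
print(entropy_impurity(y), gini_impurity(y))
```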
Top-Down Training
Define an impurity measure L(S'), e.g., Bernoulli variance. (See TreeGrowing, Fig. 9.2, in http://www.ise.bgu.ac.il/faculty/liorr/hbchap9.pdf.)
Loop: choose the split with the greatest impurity reduction (over all leaf nodes); repeat until a stopping condition holds.
Step 1: the tree is a single leaf predicting 1 on all of S; total impurity L(S) = 12/7.
Step 2: split the root on Male?. The male leaf has L(S') = 2/3 and the female leaf has L(S') = 1, for a total impurity of L(S) = 5/3.
Step 3: loop over all leaves and find the best split.
Try splitting the male leaf on Age > 8?: both new leaves would have L(S') = 0, the female leaf still has L(S') = 1, so the total impurity would be L(S) = 1.
Try splitting the female leaf on Age > 11? instead: both new leaves would have L(S') = 0, the male leaf still has L(S') = 2/3, so the total impurity would be L(S) = 2/3. This split gives the larger impurity reduction, so it is applied.
Step 4: split the male leaf on Age > 8?. All leaves are now pure and the total impurity is L(S) = 0. The final tree: root Male?; if yes, Age > 8? (yes → 1, no → 0); if no, Age > 11? (yes → 1, no → 0).
Properties of Top-Down Training
– Every intermediate step is a decision tree: you can stop at any time and have a model.
– It is a greedy algorithm: it doesn't backtrack and cannot reconsider different higher-level splits.
When to Stop?
If we kept going, we could learn a tree with zero training error, but such a tree is probably overfitting to the training set. How do we stop training the tree earlier, i.e., how do we regularize? (Which tree has better test error?)
Stopping Conditions (Regularizers)
– Minimum size: do not split if the resulting children are smaller than a minimum size. This is the most common stopping condition.
– Maximum depth: do not split if the resulting children are beyond some maximum depth of the tree.
– Maximum #nodes: do not split if the tree already has the maximum number of allowable nodes.
– Minimum reduction in impurity: do not split if the resulting children do not reduce impurity by at least δ%.
See also Section 5 in http://www.ise.bgu.ac.il/faculty/liorr/hbchap9.pdf.
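These stopping conditions correspond roughly to the regularization parameters exposed by off-the-shelf decision tree implementations; for example, a hedged sketch using scikit-learn (exact parameter semantics differ slightly, e.g., the impurity threshold there is an absolute decrease rather than a percentage):

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    criterion="entropy",         # impurity measure (information gain)
    min_samples_leaf=5,          # minimum size: no leaf smaller than 5 examples
    max_depth=10,                # maximum depth
    max_leaf_nodes=64,           # maximum number of (leaf) nodes
    min_impurity_decrease=1e-3,  # minimum reduction in impurity (absolute, not δ%)
)
# clf.fit(X_train, y_train)      # X_train, y_train assumed to exist elsewhere
```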
Pseudocode for Training
See TreeGrowing (Fig. 9.2) in http://www.ise.bgu.ac.il/faculty/liorr/hbchap9.pdf. The stopping condition is a minimum leaf node size N_min, and each split is selected from the candidate query set Q.
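Below is a compact sketch of this training loop (my own code, not the TreeGrowing pseudocode itself): it uses Bernoulli variance as the impurity, a minimum leaf size N_min as the stopping condition, and splits each node greedily and recursively rather than keeping a global queue of leaves, which applies the same split-selection rule:

```python
import numpy as np

def impurity(y):                                # Bernoulli variance: |S'| * p * (1 - p)
    p = np.mean(y)
    return len(y) * p * (1 - p)

def grow_tree(X, y, n_min=1):
    """Greedy top-down training: take the split with the greatest impurity reduction."""
    best = None
    for d in range(X.shape[1]):                 # every feature ...
        for c in np.unique(X[:, d]):            # ... and every observed threshold
            mask = X[:, d] > c
            if mask.sum() < n_min or (~mask).sum() < n_min:
                continue                        # stopping condition: minimum leaf size
            gain = impurity(y) - impurity(y[mask]) - impurity(y[~mask])
            if best is None or gain > best[0]:
                best = (gain, d, c, mask)
    if best is None or best[0] <= 0:            # no useful split -> make a leaf
        return {"prediction": int(np.mean(y) >= 0.5)}
    gain, d, c, mask = best                     # internal node with query 1[x_d > c]
    return {"feature": d, "threshold": c,
            "yes": grow_tree(X[mask], y[mask], n_min),
            "no": grow_tree(X[~mask], y[~mask], n_min)}
```

On the 7-person toy table this reaches zero impurity, though the particular splits it picks first may differ from the walk-through above.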
Classification vs Regression

Classification                                     | Regression
Labels are {0,1}                                   | Labels are real-valued
Predict the majority class in the leaf node        | Predict the mean of the labels in the leaf node
Piecewise constant function class                  | Piecewise constant function class
Goal: minimize 0/1 loss                            | Goal: minimize squared loss
Impurity based on fraction of positives/negatives  | Impurity = squared loss
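In code, switching from classification to regression only changes the leaf prediction and the impurity; a hedged sketch (my own helper names):

```python
import numpy as np

def leaf_prediction(y, task="classification"):
    # Classification: majority class in the leaf.  Regression: mean of the labels.
    return int(np.mean(y) >= 0.5) if task == "classification" else float(np.mean(y))

def squared_loss_impurity(y):
    # Regression impurity: squared loss of the best constant prediction (the mean).
    return float(np.sum((y - np.mean(y)) ** 2))
```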
Recap: Decision Tree Training
– Train top-down: iteratively split an existing leaf node into 2 leaf nodes.
– Minimize impurity (= training loss), e.g., entropy.
– Stop when a stopping condition (= regularization) holds, e.g., minimum node size.
– Finding the optimal tree is intractable (e.g., the tree satisfying the minimum leaf sizes with the lowest impurity).
Recap: Decision Trees
– Piecewise constant model class: non-linear, with axis-aligned partitions of the feature space.
– Trained to minimize the impurity of the training data in the leaf partitions, via top-down greedy training.
– Often more accurate than linear models, if there is enough training data.
Bagging (Bootstrap Aggregation)
Outline
– Recap: bias/variance tradeoff.
– Bagging: a method for minimizing variance; not specific to decision trees.
– Random Forests: an extension of bagging that is specific to decision trees.
Test Error
– "True" distribution P(x,y), unknown to us.
– Train h_S(x) ≈ y using the training data S, sampled from P(x,y).
– Test error: L_P(h) = E_{(x,y)~P(x,y)}[ L(h(x), y) ].
– Overfitting: test error >> training error.
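Since P(x,y) is unknown, in practice the test error is estimated on held-out data; a small sketch (not from the lecture) with synthetic data standing in for samples from P(x,y):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)                     # synthetic stand-in for P(x,y)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
h = DecisionTreeClassifier().fit(X_tr, y_tr)       # unregularized tree
print("training error:", 1 - h.score(X_tr, y_tr))  # near 0
print("test error    :", 1 - h.score(X_te, y_te))  # noticeably larger -> overfitting
```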
Test error: L_P(h) = E_{(x,y)~P(x,y)}[ L(h(x), y) ], where h is trained on the training set S, itself a small sample (Alice, Bob, Carol, Dave, Erin, Frank, Gena) drawn from the much larger true distribution P(x,y).
Bias-Variance Decomposition
For squared error, with the "average prediction on x" defined as H(x) = E_S[h_S(x)]:
E_S[(h_S(x) - y)^2] = E_S[(h_S(x) - H(x))^2] + (H(x) - y)^2,
where the first term is the variance term and the second is the bias term.
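The decomposition can be checked numerically by resampling training sets from a known synthetic distribution and looking at the spread of h_S(x) at a fixed test point; a sketch (my own, with an arbitrary choice of P(x,y)):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def sample_S(n=50):                                   # synthetic P(x,y): y = sin(3x) + noise
    x = rng.uniform(-1, 1, size=(n, 1))
    return x, np.sin(3 * x[:, 0]) + 0.3 * rng.normal(size=n)

x0, f_x0 = np.array([[0.5]]), np.sin(3 * 0.5)         # test point and its noiseless target
preds = np.array([DecisionTreeRegressor().fit(*sample_S()).predict(x0)[0]
                  for _ in range(200)])               # h_S(x0) over many training sets S
print("variance:", preds.var())                       # E_S[(h_S(x0) - H(x0))^2]
print("bias^2  :", (preds.mean() - f_x0) ** 2)        # (H(x0) - target)^2
```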
Example
(Figure: a distribution P(x,y), shown as samples of y plotted against x.)
(Figure: h_S(x) fit with a linear model.)
(Figure: h_S(x) fit with a quadratic model.)
(Figure: h_S(x) fit with a cubic model.)
Bias-Variance Trade-off
(Figure: the bias and variance of each of the three fits.)
Overfitting vs Underfitting
– High variance implies overfitting: the model class is unstable; variance increases with model complexity and decreases with more training data.
– High bias implies underfitting: even with no variance, the model class has high error; bias decreases with model complexity and is independent of the training set size.
Decision trees are low-bias, high-variance models (unless you regularize a lot, but then they are often worse than linear models). They are highly non-linear and can easily overfit; different training samples can lead to very different trees.
Bagging
Goal: reduce variance.
Ideal setting: many training sets S', each sampled independently from P(x,y). Train a model on each S' and average their predictions.
Expected error on a single (x,y), with Z = h_S(x) - y and ž = E_S[Z]:
E_S[(h_S(x) - y)^2] = E_S[(Z - ž)^2] + ž^2   (variance + bias).
Averaging over independent training sets reduces the variance linearly and leaves the bias unchanged.
("Bagging Predictors" [Leo Breiman, 1994], http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf)
Bagging
In practice we cannot draw fresh training sets from P(x,y), so we resample S' from S with replacement ("bootstrapping"), train a model on each S', and average their predictions. Bagging = Bootstrap Aggregation. The variance now reduces sub-linearly (because the S' are correlated), and the bias often increases slightly.
Recap: Bagging for Decision Trees
Given a training set S:
– Generate many bootstrap samples S', sampled with replacement from S, with |S'| = |S|.
– Train a minimally regularized DT on each S' (high variance, low bias).
– Final predictor: the average of all the DTs; averaging reduces variance.
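A minimal bagging sketch matching this recipe (my own code; scikit-learn also ships a ready-made BaggingClassifier):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, n_trees=100, seed=0):
    """Train minimally regularized trees on bootstrap samples S' with |S'| = |S|."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))   # sample with replacement from S
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    """Average the individual tree predictions (majority vote for 0/1 labels)."""
    votes = np.mean([t.predict(X) for t in trees], axis=0)
    return (votes >= 0.5).astype(int)
```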
"An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants", Eric Bauer & Ron Kohavi, Machine Learning 36, 105-139 (1999), http://ai.stanford.edu/~ronnyk/vote.pdf.
(Figure: bias and variance of a single DT vs. a bagged DT; lower is better, and bagging mainly reduces the variance.)
Why Bagging Works
Define the ideal aggregation predictor h_A(x) = E_{S'~P}[ h_{S'}(x) ], where each S' is drawn from the true distribution P and h_{S'} is a decision tree trained on S'. We first compare the error of h_A(x) vs. h_S(x), and then show how to adapt the comparison to bagging.
("Bagging Predictors" [Leo Breiman, 1994], http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf)
Analysis of the Ideal Aggregate Predictor (Squared Loss)
For a single (x,y), the expected loss of a decision tree h_S trained on S is
E_S[(y - h_S(x))^2] = y^2 - 2y E_S[h_S(x)] + E_S[h_S(x)^2]   (linearity of expectation)
                    ≥ y^2 - 2y h_A(x) + h_A(x)^2             (E[Z^2] ≥ E[Z]^2 with Z = h_{S'}(x), and the definition of h_A)
                    = (y - h_A(x))^2,
which is the loss of h_A.
Key Insight
– The ideal aggregate predictor h_A(x) is guaranteed to be at least as good as h_S(x), and the improvement is large if h_S(x) is "unstable" (high variance), i.e., if E_S[h_S(x)^2] is much larger than E_S[h_S(x)]^2.
– The bagging predictor h_B(x) improves on h_S(x) if h_B(x) is much more stable than h_S(x). However, h_B(x) can sometimes be more unstable than h_S(x), and the bias of h_B(x) can be worse than that of h_S(x).
Random Forests
Random Forests
Goal: reduce variance further.
– Bagging can only do so much: resampling the training data asymptotes.
– Random Forests sample both data and features: sample S', train a DT on it, and at each node sample a subset of the features to consider for splitting. Then average the predictions. Feature sampling further de-correlates the trees.
("Random Forests – Random Features" [Leo Breiman, 1997], http://oz.berkeley.edu/~breiman/random-forests.pdf)
Top-Down Random Forest Training
Loop: sample T random splits at each leaf; choose the split with the greatest impurity reduction; repeat until the stopping condition.
Step 1: the tree is a single leaf predicting 1 on the bootstrap sample S'.
Step 2: randomly decide to look only at Age (not Gender); the chosen split is Age > 9?.
Step 3 (try): at the next leaf, randomly decide to look only at Gender; try the split Male?.
Step 3 (try): alternatively, randomly decide to look only at Age; try the split Age > 8?.
Recap: Random Forests
An extension of bagging that also samples features:
– Generate a bootstrap sample S' from S.
– Train a DT top-down on S'; at each node, sample a subset of the features for splitting. A subset of the splits can also be sampled.
– Average the predictions of all the DTs.
("Random Forests – Random Features" [Leo Breiman, 1997], http://oz.berkeley.edu/~breiman/random-forests.pdf)
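For completeness, an off-the-shelf random forest configured along these lines (a hedged sketch; the X_train/y_train arrays are hypothetical). In the from-scratch grow_tree sketch earlier, the corresponding change would be to restrict the feature loop at each node to a random subset of the features:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,      # number of bootstrapped trees to average
    max_features="sqrt",   # random feature subset considered at each node
    bootstrap=True,        # each tree sees an S' sampled with replacement from S
)
# rf.fit(X_train, y_train)
# rf.predict(X_test)
```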
Average performance over many datasets: Random Forests perform the best ("An Empirical Evaluation of Supervised Learning in High Dimensions", Caruana, Karampatziakis & Yessenalina, ICML 2008).
Next Lecture
– Boosting: a method for reducing bias.
– Ensemble Selection: a very general method for combining classifiers, and a multiple-time winner of ML competitions.
Recitation tomorrow: a brief tutorial of Kaggle and of the scikit-learn decision tree package (although many decision tree packages are available online).