Machine Learning & Data Mining CS/CNS/EE 155 Lecture 9: Decision Trees, Bagging & Random Forests


Announcements

– Homework 2 was harder than anticipated: this will not happen again, and grading will be lenient.
– Homework 3 released tonight: due 2 weeks later (much easier than HW2).
– Homework 1 will be graded soon.

Announcements

Kaggle Mini-Project Released (also linked to from the course website):
– Dataset of financial reports to the SEC.
– Predict whether a financial report was filed during the Great Recession.
– Training set of ~5000 examples, with ~500 features.
– Submit predictions on the test set.
– Competition ends in ~3 weeks; report due 2 days after the competition ends.
– Grading: 80% on cross validation & model selection, 20% on accuracy.
– If you do everything properly, you should get at least 90%.
– Don’t panic! With proper planning, it shouldn’t take that much time.
– Recitation tomorrow (short Kaggle & Decision Tree tutorial).

This Week

Back to standard classification & regression – no more structured models. Focus on achieving the highest possible accuracy:
– Decision Trees
– Bagging
– Random Forests
– Boosting
– Ensemble Selection

Decision Trees

(Binary) Decision Tree

Training data, with x = (Age, Male?) and y = (Height > 55”):

Person | Age | Male? | Height > 55”
Alice  | 14  | 0     | 1
Bob    | 10  | 1     | 1
Carol  | 13  | 0     | 1
Dave   | 8   | 1     | 0
Erin   | 11  | 0     | 0
Frank  | 9   | 1     | 1
Gena   | 10  | 0     | 0

[Figure: a tree whose root asks “Male?”; the Yes branch asks “Age>8?”, the No branch asks “Age>11?”, with 0/1 predictions at the leaves.]
Don’t overthink this, it is literally what it looks like.

(Binary) Decision Tree

[Figure: the same tree, with the root node “Male?”, internal nodes “Age>8?” (Yes branch) and “Age>11?” (No branch), and leaf nodes.]

Every internal node has a binary query function q(x). Every leaf node has a prediction, e.g., 0 or 1.
Prediction starts at the root node and recursively calls the query function:
– Positive response → go to the left child.
– Negative response → go to the right child.
– Repeat until a leaf node is reached.
Example input: Alice (Gender: Female, Age: 14). Prediction: Height > 55” = 1.
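
Below is a minimal Python sketch of this prediction procedure. The Node class, its field names, and the particular leaf labels are illustrative (the leaf values follow my reading of the trained tree shown later in the lecture), not code from the course.

# Minimal sketch of the prediction procedure described above.
# A node either holds a prediction (leaf) or a query plus two children.

class Node:
    def __init__(self, query=None, left=None, right=None, prediction=None):
        self.query = query            # q(x): returns True/False, e.g. lambda x: x["age"] > 8
        self.left = left              # child followed on a positive response
        self.right = right            # child followed on a negative response
        self.prediction = prediction  # set only for leaf nodes

    def predict(self, x):
        if self.prediction is not None:   # leaf node: return the stored prediction
            return self.prediction
        if self.query(x):                  # positive response -> left child
            return self.left.predict(x)
        return self.right.predict(x)       # negative response -> right child

# The tree from the slide: root asks Male?, then Age>8? (males) or Age>11? (females).
tree = Node(
    query=lambda x: x["male"] == 1,
    left=Node(query=lambda x: x["age"] > 8,
              left=Node(prediction=1), right=Node(prediction=0)),
    right=Node(query=lambda x: x["age"] > 11,
               left=Node(prediction=1), right=Node(prediction=0)),
)

alice = {"age": 14, "male": 0}
print(tree.predict(alice))  # -> 1 (predicted Height > 55")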

Queries

A decision tree is defined by a tree of queries. A binary query q(x) maps features to 0 or 1.
Basic form: q(x) = 1[x_d > c], e.g.:
– 1[x_3 > 5]
– 1[x_1 > 0]
– 1[x_55 > 1.2]
This gives an axis-aligned partitioning of the input space.

Basic Decision Tree Function Class

“Piecewise constant” function class:
– All possible axis-aligned partitionings of the feature space.
– Each partition has a constant prediction.
Partitions are axis-aligned – e.g., no diagonal boundaries (extensions next week).

Decision Trees vs Linear Models

Decision Trees are NON-LINEAR models!
Example: [figure] no linear model can achieve 0 error, but a simple decision tree (splitting on x_1 > 0 and then on x_2) can achieve 0 error.

Decision Trees vs Linear Models

Decision Trees are AXIS-ALIGNED!
– They cannot easily model diagonal boundaries.
Example: [figure] a simple linear SVM can easily find the max-margin diagonal boundary, while a decision tree requires a complex axis-aligned partitioning with much wasted boundary.

More Extreme Example

[Figure: a diagonal decision boundary, depicting only the useful boundaries.] The decision tree wastes most of its model capacity on useless boundaries.

Decision Trees vs Linear Models

Decision Trees are often more accurate! Non-linearity is often more important:
– Just use many axis-aligned boundaries to approximate diagonal boundaries.
– (It’s OK to waste model capacity.)
Catch: this requires sufficient training data – will become clear later in the lecture.

Real Decision Trees

[Figure: a real decision tree; image source omitted in the transcript.] Real decision trees can get much larger!

Decision Tree Training

[Figure: the training set S (the same table, x = (Age, Male?), y = (Height > 55”)) and a tree with root “Male?” and children “Age>9?” (Yes) and “Age>11?” (No).]

– Every node = a partition/subset of S.
– Every layer = a complete partitioning of S.
– Children = a complete partitioning of their parent.

Thought Experiment

What if the tree is just one node?
– (I.e., just the root node.)
– No queries.
– A single prediction for all data (here, the majority label 1).

The simplest decision tree is just a single root node (which is also a leaf node). It corresponds to the entire training set and makes a single prediction: the majority class in the training set.

Thought Experiment Continued

What if the tree has 2 levels?
– (I.e., root node + 2 children.)
– A single query (which one?).
– 2 predictions.
– How many possible queries are there?

#Possible Queries = #Possible Splits = D*N
(D = #features, N = #training points)
How to choose the “best” query?

Impurity

Define an impurity function L(S’), e.g., 0/1 loss: the classification error of the best single prediction on S’.
Example: [figure] a split of S into S_1, S_2 with L(S) = 1, L(S_1) = 0, L(S_2) = 1, so the impurity reduction is 0 – no benefit from this split!


Impurity

Define an impurity function L(S’), e.g., 0/1 loss: the classification error of the best single prediction on S’.
Example: [figure] a different split with L(S) = 1, L(S_1) = 0, L(S_2) = 0, so the impurity reduction is 1. Choose the split with the largest impurity reduction!

Impurity = Loss Function

Training goal: find a decision tree with low impurity. The total impurity over the leaf nodes is the training loss:
L(S) = Σ_{S’ ∈ leaves} L(S’)
where S’ iterates over the leaf nodes, the union of the S’ is S (the leaf nodes form a partitioning of S), and L(S’) is the classification error on S’.

Problems with 0/1 Loss

What split best reduces impurity? [Figure: two candidate splits of S, each with L(S) = 1, L(S_1) = 0, L(S_2) = 1.] All partitionings give the same impurity reduction!

Problems with 0/1 Loss

0/1 loss is discontinuous. A good partitioning may not improve 0/1 loss immediately, even if it leads to an accurate model after a subsequent split. [Figure: a split with L(S) = 1 that only reaches L(S) = 0 after splitting again into S_1, S_2, S_3.]

Surrogate Impurity Measures

We want a more continuous impurity measure. First try: Bernoulli variance:
L(S’) = |S’| · p_{S’} (1 – p_{S’}),   where p_{S’} = fraction of S’ that are positive examples.
[Figure: plot of L(S’) vs. p_{S’}, assuming |S’| = 1.]
– p_{S’} = 1/2: L(S’) = |S’|·1/4 (worst purity).
– p_{S’} = 1 or p_{S’} = 0: L(S’) = 0 (perfect purity).

Bernoulli Variance as Impurity

What split best reduces impurity? [Figure: L(S) = 5/6; one split gives L(S_1) = 0, L(S_2) = 1/2, the other gives L(S_1) = 0, L(S_2) = 3/4; the split with the larger impurity reduction (lower total leaf impurity) is best.]
(p_{S’} = fraction of S’ that are positive examples.)

Interpretation of Bernoulli Variance

Treat each partition as a distribution over y:
– Within S’, y is Bernoulli distributed with expected value p_{S’}.
– Goal: a partitioning where each y has low variance.
[Same figure as the previous slide.]

Other Impurity Measures

Entropy: L(S’) = –|S’| [ p_{S’} log p_{S’} + (1 – p_{S’}) log(1 – p_{S’}) ]   (define 0·log(0) = 0).
– The reduction in entropy is also known as Information Gain (aka entropy impurity reduction). Most popular.
Gini index: 1 – Σ_k p_k², which for binary labels equals 2 p_{S’}(1 – p_{S’}).
See also: (reference omitted in the transcript; its terminology is slightly different.)
[Figure: plots of L(S’) vs. p_{S’} for each measure.] Most good impurity measures look qualitatively the same!
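
To make the three impurity measures concrete, here is a small, self-contained Python sketch that computes the Bernoulli variance, entropy, and Gini impurity of a leaf’s binary labels. The function names and the |S’| scaling convention are illustrative, not mandated by the lecture.

import math

# Impurity of a leaf S' holding binary labels. Each function returns |S'| times
# a per-example impurity, so summing over leaves gives the total impurity L(S).

def bernoulli_variance(labels):
    n = len(labels)
    p = sum(labels) / n                     # fraction of positives in S'
    return n * p * (1 - p)

def entropy(labels):
    n = len(labels)
    p = sum(labels) / n
    def h(q):                               # define 0*log(0) = 0
        return 0.0 if q in (0.0, 1.0) else -q * math.log2(q)
    return n * (h(p) + h(1 - p))

def gini(labels):
    n = len(labels)
    p = sum(labels) / n
    return n * (1 - p**2 - (1 - p)**2)      # = 2*p*(1-p) per example for binary labels

leaf = [1, 1, 1, 1, 0, 0, 0]                # the 7-person training set: p = 4/7
print(bernoulli_variance(leaf))             # 12/7 ≈ 1.714, matching the next slide
print(entropy(leaf), gini(leaf))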

Top-Down Training

Define an impurity measure L(S’), e.g., Bernoulli variance.
Loop: choose the split with the greatest impurity reduction (over all leaf nodes). Repeat until a stopping condition is met.
(See TreeGrowing (Fig 9.2) in the cited reference.)
Step 1: the tree is a single root node containing all of S; L(S) = 12/7.

Top-Down Training

Step 2: split the root on Male?. Leaf impurities: L(S’) = 2/3 for the male leaf and L(S’) = 1 for the female leaf; total L(S) = 5/3.

Top-Down Training

Step 3: loop over all leaves (L(S’) = 2/3 and L(S’) = 1) and find the best split.

Top-Down Training

Step 3 (candidate): try splitting the male leaf on Age>8?. Both new leaves have L(S’) = 0, so the total would be L(S) = 1 (the female leaf still contributes 1).

Top-Down Training

Step 3 (candidate): try splitting the female leaf on Age>11?. Both new leaves have L(S’) = 0, so the total would be L(S) = 2/3 (the male leaf still contributes 2/3) – the better candidate.

Top-Down Training

Step 3: apply the better candidate, Age>11? on the female leaf; L(S) = 2/3.
Step 4: split the male leaf on Age>8?; every leaf is now pure, so L(S) = 0.
The final tree asks Male? at the root, then Age>8? for the males and Age>11? for the females.

Properties of Top-Down Training

Every intermediate step is a decision tree:
– You can stop at any time and have a model.
Greedy algorithm:
– Doesn’t backtrack.
– Cannot reconsider different higher-level splits.

When to Stop?

If we keep going, we can learn a tree with zero training error.
– But such a tree is probably overfitting to the training set.
How do we stop training the tree earlier? I.e., how do we regularize?
[Figure: two trees of different sizes.] Which one has better test error?

Stopping Conditions (Regularizers)

– Minimum size: do not split if the resulting children are smaller than a minimum size. (Most common stopping condition.)
– Maximum depth: do not split if the resulting children would be beyond some maximum depth of the tree.
– Maximum #nodes: do not split if the tree already has the maximum number of allowable nodes.
– Minimum reduction in impurity: do not split if the resulting children do not reduce impurity by at least δ%.
(See also Section 5 of the cited reference.)

Pseudocode for Training

[Figure: the TreeGrowing pseudocode (Fig 9.2) from the cited reference.] The stopping condition is a minimum leaf node size N_min; at each step, the best split is selected from the candidate set Q.
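
The pseudocode itself lives in the cited reference; below is a minimal, self-contained Python sketch of the same idea: greedy top-down training with axis-aligned threshold splits and a minimum-leaf-size stopping condition (n_min). It is a toy illustration under those assumptions, not the TreeGrowing code, and the greedy tree it finds may differ from the illustrative tree on the earlier slides.

import numpy as np

# Greedy top-down training sketch: at each leaf, enumerate all D*N threshold
# splits, pick the one with the greatest impurity reduction, and stop when a
# child would be smaller than n_min (the minimum-leaf-size regularizer).

def impurity(y):                          # Bernoulli variance: |S'| * p * (1 - p)
    p = y.mean()
    return len(y) * p * (1 - p)

def best_split(X, y, n_min):
    best = None                           # (impurity_reduction, feature d, threshold c)
    for d in range(X.shape[1]):
        for c in np.unique(X[:, d]):
            left = X[:, d] > c
            if left.sum() < n_min or (~left).sum() < n_min:
                continue
            reduction = impurity(y) - impurity(y[left]) - impurity(y[~left])
            if best is None or reduction > best[0]:
                best = (reduction, d, c)
    return best

def grow_tree(X, y, n_min=1):
    split = best_split(X, y, n_min)
    if split is None or split[0] <= 0:    # stop: no feasible or useful split left
        return {"prediction": int(round(y.mean()))}   # majority class in this leaf
    _, d, c = split
    left = X[:, d] > c
    return {"feature": d, "threshold": c,
            "left": grow_tree(X[left], y[left], n_min),
            "right": grow_tree(X[~left], y[~left], n_min)}

# Toy data: x = (Age, Male?), y = (Height > 55"), as in the running example.
X = np.array([[14, 0], [10, 1], [13, 0], [8, 1], [11, 0], [9, 1], [10, 0]])
y = np.array([1, 1, 1, 0, 0, 1, 0])
print(grow_tree(X, y, n_min=1))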

Classification vs Regression

Classification                                        | Regression
Labels are {0,1}                                      | Labels are real-valued
Predict the majority class in the leaf node           | Predict the mean of the labels in the leaf node
Piecewise constant function class                     | Piecewise constant function class
Goal: minimize 0/1 loss                               | Goal: minimize squared loss
Impurity based on fraction of positives vs negatives  | Impurity = squared loss
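
For the regression column, a leaf predicts the mean of its labels and its impurity is the squared loss around that mean. A minimal sketch (the names and example values are illustrative):

import numpy as np

# Regression-tree leaf, per the table above: predict the mean label in the
# leaf, and measure impurity as the squared loss around that mean.

def regression_leaf_prediction(y):
    return np.mean(y)

def regression_impurity(y):
    return np.sum((y - np.mean(y)) ** 2)    # squared loss of the best constant prediction

heights = np.array([52.0, 57.0, 60.0, 48.0])
print(regression_leaf_prediction(heights))  # 54.25
print(regression_impurity(heights))         # sum of squared deviations from 54.25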

Recap: Decision Tree Training

Train top-down:
– Iteratively split an existing leaf node into 2 leaf nodes.
Minimize impurity (= training loss):
– E.g., entropy.
Until a stopping condition (= regularization):
– E.g., minimum node size.
Finding the optimal tree is intractable:
– E.g., the tree satisfying minimum leaf sizes with the lowest impurity.

Recap: Decision Trees

Piecewise constant model class:
– Non-linear!
– Axis-aligned partitions of the feature space.
Train to minimize the impurity of the training data in the leaf partitions:
– Top-down greedy training.
Often more accurate than linear models:
– If there is enough training data.

Bagging (Bootstrap Aggregation)

Outline

Recap: bias/variance tradeoff.
Bagging:
– A method for minimizing variance.
– Not specific to decision trees.
Random Forests:
– An extension of bagging.
– Specific to decision trees.


Test Error

“True” distribution P(x,y) – unknown to us.
Train: h_S(x) ≈ y, using training data S sampled from P(x,y).
Test error: L_P(h_S) = E_{(x,y)~P(x,y)}[ L(h_S(x), y) ].
Overfitting: test error >> training error.

Test Error

[Figure: the training set S (the same 7-person table) drawn from a much larger true distribution P(x,y) of people.]
Test error: L_P(h) = E_{(x,y)~P(x,y)}[ L(h(x), y) ].

Bias-Variance Decomposition

For squared error, on a single (x,y):
E_S[ (h_S(x) – y)^2 ] = E_S[ (h_S(x) – H(x))^2 ] + (H(x) – y)^2
where H(x) = E_S[h_S(x)] is the “average prediction on x”, the first term is the variance term, and the second term is the bias term.
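
The following slides illustrate this with linear, quadratic, and cubic fits to a toy distribution. As a hedged, self-contained illustration of the decomposition (not the lecture’s actual figures or data), the sketch below draws many training sets from a synthetic P(x,y), fits polynomials of each degree with numpy, and estimates the variance term and squared-bias term at one test point.

import numpy as np

# Monte Carlo illustration of the decomposition above (toy setup, not the
# lecture's data): draw many training sets S from a synthetic P(x,y), fit a
# polynomial h_S, and estimate variance and bias^2 at a fixed test point.

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)                  # assumed "true" function
x_test, y_test = 1.0, f(1.0)                 # a single (x, y) as in the decomposition

def sample_training_set(n=20, noise=0.3):
    x = rng.uniform(-2, 2, n)
    y = f(x) + noise * rng.normal(size=n)
    return x, y

for degree in (1, 2, 3):                     # linear, quadratic, cubic (as on the next slides)
    preds = []
    for _ in range(500):                     # many independent training sets S
        x, y = sample_training_set()
        coeffs = np.polyfit(x, y, degree)    # h_S = least-squares polynomial fit
        preds.append(np.polyval(coeffs, x_test))
    preds = np.array(preds)
    variance = preds.var()                   # E_S[(h_S(x) - E_S[h_S(x)])^2]
    bias_sq = (preds.mean() - y_test) ** 2   # (E_S[h_S(x)] - y)^2
    print(degree, round(variance, 4), round(bias_sq, 4))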

Example

[Figures: a toy distribution P(x,y) with x on the horizontal axis and y on the vertical axis, followed by linear, quadratic, and cubic fits h_S(x) across different training samples S.]

Bias-Variance Trade-off

[Figure: the bias and variance of the linear, quadratic, and cubic fits above.]

Overfitting vs Underfitting

High variance implies overfitting:
– Model class is unstable.
– Variance increases with model complexity.
– Variance reduces with more training data.
High bias implies underfitting:
– Even with no variance, the model class has high error.
– Bias decreases with model complexity.
– Bias is independent of training data size.

Decision Trees are low-bias, high-variance models (unless you regularize a lot… but then they are often worse than linear models). They are highly non-linear and can easily overfit; different training samples can lead to very different trees.

Bagging

Goal: reduce variance.
Ideal setting: many training sets S’, each sampled independently from P(x,y):
– Train a model using each S’.
– Average their predictions.
Expected error on a single (x,y): E_S[(h_S(x) – y)^2] = E_S[(Z – ž)^2] + ž^2, where Z = h_S(x) – y and ž = E_S[Z] (variance term + bias term).
With independently sampled training sets, the variance reduces linearly and the bias is unchanged.
“Bagging Predictors” [Leo Breiman, 1994]

Bagging

Goal: reduce variance.
In practice: resample S’ from S with replacement (“bootstrapping”):
– Train a model using each S’.
– Average their predictions.
Because the S’ are correlated, the variance reduces sub-linearly, and the bias often increases slightly.
Bagging = Bootstrap Aggregation.
“Bagging Predictors” [Leo Breiman, 1994]

Recap: Bagging for DTs

Given: training set S.
Bagging: generate many bootstrap samples S’:
– Sampled with replacement from S, with |S’| = |S|.
– Train a minimally regularized DT on each S’ (high variance, low bias).
Final predictor: the average of all DTs.
– Averaging reduces variance.
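
A minimal sketch of this recipe, assuming scikit-learn is available (the lecture points to its decision tree package at the end); the function names and hyperparameters are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Bagging sketch: draw bootstrap samples S' (|S'| = |S|, sampled with
# replacement), train a minimally regularized tree on each, average predictions.

def train_bagged_trees(X, y, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    trees, n = [], len(y)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)   # bootstrap sample with replacement
        tree = DecisionTreeClassifier()     # fully grown tree: low bias, high variance
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_bagged(trees, X):
    # Majority vote: average the per-tree 0/1 predictions and threshold at 0.5.
    votes = np.mean([t.predict(X) for t in trees], axis=0)
    return (votes >= 0.5).astype(int)

# Usage (illustrative):
# trees = train_bagged_trees(X_train, y_train)
# y_hat = predict_bagged(trees, X_test)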

[Figure: the bias and variance of a single DT vs. a bagged DT (lower is better), from “An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants”, Eric Bauer & Ron Kohavi, Machine Learning 36, 105–139 (1999).]

Why Bagging Works

Define the ideal aggregation predictor h_A(x) = E_{S’}[ h_{S’}(x) ], where each S’ is drawn from the true distribution P.
We will first compare the error of h_A(x) vs. h_S(x) (a decision tree trained on S), then show how to adapt the comparison to bagging.
“Bagging Predictors” [Leo Breiman, 1994]

Analysis of Ideal Aggregate Predictor (Squared Loss)

[Figure: a derivation comparing the expected loss of h_S on a single (x,y) to the loss of h_A, with step labels: linearity of expectation; E[Z^2] ≥ E[Z]^2 (Z = h_{S’}(x)); definition of h_A; loss of h_A.]
“Bagging Predictors” [Leo Breiman, 1994]
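
The equations on this slide did not survive the transcript; the LaTeX below is a reconstruction of the derivation that the step labels describe (squared loss, a fixed (x,y), S’ drawn from P), written from those labels and standard arguments rather than copied from the slide.

\begin{aligned}
\mathbb{E}_{S'}\!\left[(y - h_{S'}(x))^{2}\right]
  &= y^{2} - 2y\,\mathbb{E}_{S'}[h_{S'}(x)] + \mathbb{E}_{S'}[h_{S'}(x)^{2}]
      && \text{linearity of expectation} \\
  &\ge y^{2} - 2y\,\mathbb{E}_{S'}[h_{S'}(x)] + \left(\mathbb{E}_{S'}[h_{S'}(x)]\right)^{2}
      && \mathbb{E}[Z^{2}] \ge \mathbb{E}[Z]^{2},\ Z = h_{S'}(x) \\
  &= y^{2} - 2y\,h_{A}(x) + h_{A}(x)^{2}
      && \text{definition of } h_{A} \\
  &= (y - h_{A}(x))^{2}
      && \text{loss of } h_{A}.
\end{aligned}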

Key Insight

Ideal aggregate predictor h_A(x):
– Guaranteed to be at least as good as h_S(x).
– Large improvement if h_S(x) is “unstable” (high variance).
Bagging predictor h_B(x):
– Improves if h_B(x) is much more stable than h_S(x).
– But h_B(x) can sometimes be more unstable than h_S(x), and the bias of h_B(x) can be worse than that of h_S(x).
“Bagging Predictors” [Leo Breiman, 1994]

Random Forests

Random Forests

Goal: reduce variance.
– Bagging can only do so much: resampling the training data asymptotes.
Random Forests: sample data & features!
– Sample S’.
– Train a DT on S’; at each node, sample a subset of features to consider for splitting.
– Average predictions.
Sampling features further de-correlates the trees.
“Random Forests – Random Features” [Leo Breiman, 1997]

Top-Down Random Forest Training

Loop: sample T random splits at each leaf; choose the split with the greatest impurity reduction. Repeat until a stopping condition is met.
Step 1: start from the root node containing the bootstrap sample S’.
“Random Forests – Random Features” [Leo Breiman, 1997]

Top-Down Random Forest Training

Step 2: at the root, randomly decide to only look at Age (not gender); the chosen split is Age>9?.
“Random Forests – Random Features” [Leo Breiman, 1997]

Top-Down Random Forest Training

Step 3 (candidate): at one leaf, randomly decide to only look at gender; try splitting on Male?.
“Random Forests – Random Features” [Leo Breiman, 1997]

Top-Down Random Forest Training

Step 3 (candidate): at another leaf, randomly decide to only look at age; try splitting on Age>8?.
“Random Forests – Random Features” [Leo Breiman, 1997]

Recap: Random Forests

Extension of bagging to sampling features:
– Generate a bootstrap sample S’ from S.
– Train a DT top-down on S’; at each node, sample a subset of features for splitting.
– Can also sample a subset of splits as well.
– Average the predictions of all DTs.
“Random Forests – Random Features” [Leo Breiman, 1997]
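
A minimal usage sketch with scikit-learn’s RandomForestClassifier, which follows this recipe (bootstrap samples plus a random feature subset considered at each split); the hyperparameter values below are illustrative, not prescribed by the lecture.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Random forest sketch: n_estimators bootstrapped trees, each considering only
# a random subset of features (max_features) at every split.
forest = RandomForestClassifier(
    n_estimators=200,        # number of bagged trees
    max_features="sqrt",     # size of the random feature subset tried at each node
    min_samples_leaf=1,      # minimal regularization: low bias, high variance trees
    random_state=0,
)

# Usage (illustrative): X, y are training features/labels, e.g. from the
# Kaggle mini-project data.
# scores = cross_val_score(forest, X, y, cv=5)
# forest.fit(X, y); y_hat = forest.predict(X_test)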

Random Forests

[Figure: average performance over many datasets (further toward “better” is better), from “An Empirical Evaluation of Supervised Learning in High Dimensions”, Caruana, Karampatziakis & Yessenalina, ICML 2008.] Random Forests perform the best.

Next Lecture

Boosting:
– A method for reducing bias.
Ensemble Selection:
– A very general method for combining classifiers.
– Multiple-time winner of ML competitions.
Recitation tomorrow:
– Brief tutorial of Kaggle.
– Brief tutorial of the decision tree scikit-learn package (although many decision tree packages are available online).