STT : Intro. to Statistical Learning

1 STT592-002: Intro. to Statistical Learning
Decision Trees, Chapter 08 (part 01)
Disclaimer: This PPT is modified based on IOM 530: Intro. to Statistical Learning. "Some of the figures in this presentation are taken from 'An Introduction to Statistical Learning, with applications in R' (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani"

2 Outline
The Basics of Decision Trees
Regression Trees
Classification Trees
Pruning Trees
Trees vs. Linear Models
Advantages and Disadvantages of Trees

3 Regression Trees

4 Regression Trees
One way to make predictions in a regression problem is to divide the predictor space (i.e. all possible values for X1, X2, …, Xp) into distinct regions, say R1, R2, …, Rk. Then for every X that falls in a particular region (say Rj) we make the same prediction.
Eg: Suppose we have two regions R1 and R2 with predictions hat(Y1) = 10 and hat(Y2) = 20. Then for any value of X in R1 we would predict 10; otherwise, if X is in R2, we would predict 20.
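A minimal sketch of this two-region rule in R. The cutoff of 5 and the function name predict_region are assumptions made for illustration; only the predictions 10 and 20 come from the slide.

```r
# Hypothetical partition of one predictor: R1 = {x < cutoff}, R2 = {x >= cutoff}.
# The cutoff value is an assumption; the slide only gives the predictions.
predict_region <- function(x, cutoff = 5, yhat = c(R1 = 10, R2 = 20)) {
  region <- ifelse(x < cutoff, "R1", "R2")  # which region each x falls in
  unname(yhat[region])                      # same prediction for the whole region
}

predict_region(c(1, 4.9, 5, 8))
```

Every x in the same region gets the same prediction, which is what makes the fitted function piecewise constant.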

5 Splitting the X Variables
Generally we create partitions by iteratively splitting one of the X variables into two regions.
First split on X1=t1.

6 Splitting the X Variable
1. First split on X1=t1
2. If X1<t1, split on X2=t2
3. If X1>t1, split on X1=t3

7 Splitting the X Variable
1. First split on X1=t1
2. If X1<t1, split on X2=t2
3. If X1>t1, split on X1=t3
4. If X1>t3, split on X2=t4

8 Splitting the X Variable
In creating partitions this way, we can always represent them using a tree structure. This provides a very simple way to explain the model to a non-expert, e.g. your boss!
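The sequence of splits on the previous slides can be sketched as nested if/else statements, which is exactly the tree structure. The threshold values and the region labels R1 to R5 are hypothetical (the slides only name t1 to t4), and points that fall exactly on a threshold are sent to the right branch as an arbitrary choice.

```r
# Assumed thresholds for illustration only; requires t1 < t3.
assign_region <- function(x1, x2, t1 = 0.4, t2 = 0.5, t3 = 0.7, t4 = 0.3) {
  if (x1 < t1) {
    # left of the first split: split again on X2 at t2
    if (x2 < t2) "R1" else "R2"
  } else if (x1 < t3) {
    # between t1 and t3 on X1: no further split
    "R3"
  } else {
    # right of t3 on X1: split on X2 at t4
    if (x2 < t4) "R4" else "R5"
  }
}

assign_region(0.2, 0.1)
assign_region(0.5, 0.9)
assign_region(0.9, 0.8)
```

Each path through the if/else chain corresponds to one root-to-leaf path in the tree.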

9 Three Elements in tree construction
Construction of a tree involves three elements:
1. The selection of the splits.
2. The decision when to declare a node terminal or to continue splitting it (internal node).
3. The assignment of a prediction to each terminal node.

10 Example: Baseball Players’ Salaries
The predicted Salary is the number in each leaf node: the mean of the response for the observations that fall there.
Note that Salary is measured in 1000s and log-transformed.
Eg: for a player who has played for more than 4.5 years and had fewer hits last year than the Hits cutoff in the tree, the predicted (log) salary is the number in the corresponding leaf.
Another way of visualizing the decision tree…

11 Some Natural Questions
1. Where to split? i.e. how do we decide on what regions R1, R2, …, Rk to use, or equivalently, what tree structure should we use?
2. What values should we use for the predictions hat(Y1), hat(Y2), …, hat(Yk)?

12 1. Where to Split?  Simulation
Consider splitting into two regions, Xj>s and Xj<s, for all possible values of s and j. Choose the s and j that result in the lowest MSE (or SSE) on the training data.
Q: Now let’s try to split at the first point and find SSE.
    set.seed(1)
    x = sample(1:10, 5)
    y = sample(1:10, 5)
    plot(x, y, col="red", pch=15)
    cbind(x, y)
    # responses in order of increasing x, hard-coded as on the slide
    A = c(1, 9, 10, 6, 5)
    SSE = NULL
    for (i in 1:4) {
      # split after the i-th point: SSE of the left part plus SSE of the right part
      out = sum((A[1:i] - mean(A[1:i]))^2) + sum((A[(i+1):5] - mean(A[(i+1):5]))^2)
      SSE = c(SSE, out)
    }
    print(SSE)
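The loop above can be wrapped into a small function that searches one predictor for its best split. This is a sketch, not the slide's code: the name best_split, the midpoint rule for the cut value, and the stand-in x positions 1:5 are assumptions, and it assumes distinct x values.

```r
# Exhaustive search for the SSE-minimizing split on a single predictor.
best_split <- function(x, y) {
  ord <- order(x)
  x <- x[ord]
  y <- y[ord]
  n <- length(y)
  sse <- function(v) sum((v - mean(v))^2)
  # candidate i puts the first i sorted points in the left region
  totals <- sapply(seq_len(n - 1), function(i) sse(y[1:i]) + sse(y[(i + 1):n]))
  i <- which.min(totals)
  list(cut = (x[i] + x[i + 1]) / 2,  # midpoint between neighbouring x values
       sse = totals[i],
       all_sse = totals)
}

# y values from the slide's vector A; x = 1:5 are placeholder positions
res <- best_split(x = 1:5, y = c(1, 9, 10, 6, 5))
res$all_sse
res$cut
```

For these five responses the smallest SSE is obtained by splitting after the first point. Growing a full tree repeats this search over every predictor and every current region.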

13 Where to Split?
The optimal first split is on X1 at point t1.
Repeat the process for the next best split, except that we must also consider whether to split the first region or the second region. Again the criterion is smallest MSE.
The optimal second split is in the left region, on X2 at point t2.
This continues until our regions have too few observations to continue, e.g. until all regions have 5 or fewer points.

14 2. What values should we use for hat(Y1), hat(Y2), …, hat(Yk)?
Simple! For region Rj, the best prediction is simply the average of all the responses from our training data that fell in region Rj.
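A short sketch of computing the region predictions as region means. The toy data and the cutoff at x = 5 are made up for illustration.

```r
# Toy training data (assumed): one predictor x and a response y
x <- c(1, 2, 3, 6, 7, 8)
y <- c(9, 11, 10, 19, 21, 20)

# Hypothetical partition: R1 = {x < 5}, R2 = {x >= 5}
region <- ifelse(x < 5, "R1", "R2")

# Prediction for each region = mean training response in that region
region_means <- tapply(y, region, mean)
region_means
```

The mean is the constant that minimizes the sum of squared errors within a region, which is why it pairs naturally with the SSE splitting criterion.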

15 Classification Trees

16 Growing a Classification Tree
A classification tree is very similar to a regression tree, except that we try to make a prediction for a categorical rather than a continuous Y.
For each region (or node), we predict the most common category among the training data within that region, by simple majority vote.
    set.seed(3)
    x = sample(1:10, 10, replace = FALSE)
    y = sample(1:10, 10, replace = FALSE)
    # one class label (color) per point; only 10 points are plotted
    COL = sample(c("blue", "red"), 10, replace = TRUE)
    plot(x, y, col = COL, pch = 15)
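The majority vote within one region can be sketched as follows. The label vector is made up for illustration.

```r
# Class labels of the training points that fall in one region (assumed data)
labels <- c("red", "blue", "blue", "red", "blue")

counts <- table(labels)                       # observations per class
majority <- names(counts)[which.max(counts)]  # most common class
majority
```

Note that which.max returns the first maximum, so a tie is broken in favour of the class that comes first alphabetically in this sketch.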

17 Growing a Classification Tree
The tree is grown (i.e. splits are chosen) in exactly the same way as with a regression tree, except that minimizing MSE/SSE no longer makes sense. There are several possible criteria to use, such as the “gini index” and “cross-entropy”, but the easiest one to think about is to minimize the error rate.
With hat(p_mk) the proportion of training observations in region m that belong to class k:
Classification error rate: E_m = 1 - max_k hat(p_mk)
Gini index: G_m = sum_k hat(p_mk)*(1 - hat(p_mk))
Cross-entropy: D_m = -sum_k hat(p_mk)*log(hat(p_mk))

18 Eg1: Classification error rate
The classification error rate is simply the fraction of the training observations in that region that do not belong to the most common class.
m: m-th region; k: k-th class
Eg: 2 regions (m=1, 2) and 2 classes (k=1, 2)
m=1: Region 1 (R1); m=2: Region 2 (R2)
k=1: RED class; k=2: BLUE class
hat(p11) = prop of obs of RED in region 1 = 1/3
hat(p12) = prop of obs of BLUE in region 1 = 2/3
hat(p21) = prop of obs of RED in region 2 = 4/7
hat(p22) = prop of obs of BLUE in region 2 = 3/7
E1 = 1-max(1/3, 2/3) = 1/3
E2 = 1-max(4/7, 3/7) = 3/7
Total Error = 1/3+3/7 = 16/21
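The same arithmetic in R, as a sketch. The class counts (1 RED and 2 BLUE in region 1, 4 RED and 3 BLUE in region 2) are an assumption: they are the smallest counts consistent with the proportions on the slide.

```r
# Assumed class counts per region, chosen to match the slide's proportions
counts <- rbind(R1 = c(RED = 1, BLUE = 2),
                R2 = c(RED = 4, BLUE = 3))

p_hat <- prop.table(counts, margin = 1)  # hat(p_mk): class proportions per region
err <- 1 - apply(p_hat, 1, max)          # E_m = 1 - max_k hat(p_mk)

total_unweighted <- sum(err)             # 1/3 + 3/7, as summed on the slide

# Alternative summary: weight each region by its share of the observations,
# which gives the overall fraction of misclassified training points
n_m <- rowSums(counts)
total_weighted <- sum(n_m / sum(n_m) * err)

err
total_unweighted
total_weighted
```

The unweighted sum is 16/21. The weighted version equals 4/10 under the assumed counts, i.e. 4 of the 10 training points are misclassified.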

19 Eg1: Classification error rate
The classification error rate is simply the fraction of the training observations in that region that do not belong to the most common class.
m: m-th region; k: k-th class
Eg: 2 regions (m=1, 2) and 2 classes (k=1, 2)
m=1: Region 1 (R1); m=2: Region 2 (R2)
k=1: RED class; k=2: BLUE class

20 Gini Index and Cross-entropy
The classification error rate is simply the fraction of the training observations in that region that do not belong to the most common class.

21 Eg2: Growing a Classification Tree: simulation
m: m-th region; k: k-th class
Eg: 2 regions (m=1, 2) and 2 classes (k=1, 2)
m=1: Region 1 (R1); m=2: Region 2 (R2)
k=1: RED class; k=2: BLUE class
Gini Index:
For m=1 (Region 1): G1 = (1/3)*(2/3) [RED] + (2/3)*(1/3) [BLUE] = 4/9
For m=2 (Region 2): G2 = (4/7)*(3/7) [RED] + (3/7)*(4/7) [BLUE] = 24/49
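A small helper for the Gini index, applied to the two regions of this example. The function name gini is an assumption; the proportions are the ones used in the classification error example.

```r
# Gini index of a node from its vector of class proportions
gini <- function(p) sum(p * (1 - p))

G1 <- gini(c(RED = 1/3, BLUE = 2/3))  # region 1
G2 <- gini(c(RED = 4/7, BLUE = 3/7))  # region 2
c(G1 = G1, G2 = G2)
```

G1 works out to 4/9 and G2 to 24/49, so region 1 is slightly purer than region 2 by this measure.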

22 Eg3: Growing a Classification Tree: simulation
Node impurity: a small value indicates that a node contains predominantly observations from a single class.
Both the Gini index and cross-entropy measure node impurity. Both take a small value if the m-th node is pure.

23 Eg3: Growing a Classification Tree: simulation
Eg1: hat(p_mk) = 0.9
Eg2: hat(p_mk) = 0.5
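A sketch comparing the two impurity measures for a nearly pure node and a maximally mixed node. It assumes a two-class problem, so hat(p_mk) = 0.9 means proportions (0.9, 0.1) and hat(p_mk) = 0.5 means (0.5, 0.5); the function names are assumptions.

```r
gini <- function(p) sum(p * (1 - p))

entropy <- function(p) {
  p <- p[p > 0]       # drop empty classes so that 0 * log(0) is treated as 0
  -sum(p * log(p))    # natural log
}

nearly_pure <- c(0.9, 0.1)
mixed <- c(0.5, 0.5)

rbind(gini = c(nearly_pure = gini(nearly_pure), mixed = gini(mixed)),
      entropy = c(nearly_pure = entropy(nearly_pure), mixed = entropy(mixed)))
```

Both measures are smaller for the nearly pure node (Gini 0.18 versus 0.5) and both reach 0 for a node that contains a single class.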

24 Eg3: Growing a Classification Tree: simulation
For each candidate split, the pair of per-region values and their sum:
(0, 3/7) sum=3/7
(1/2, 3/6) sum=1
(1/3, 2/5) sum=11/15
(1/2, 1/2) sum=1
(2/5, 1/3) sum=11/15
(3/7, 0) sum=3/7

25 Example: Orange Juice Preference
Training Error Rate = 14.75%
Test Error Rate = 23.6%

26 Can have more complex questions
Decision Tree: assume each object x is represented by a 2-dim vector (x1, x2).
[Figure: a tree whose root asks "x1 < 0.5?", with second-level questions "x2 < 0.3?" and "x2 < 0.7?" on yes/no branches and four leaves labeled Class 1, Class 2, Class 2, Class 1; next to it, the (x1, x2) plane with boundaries at x1 = 0.5, x2 = 0.3 and x2 = 0.7.]
The questions in training: number of branches, branching criteria, termination criteria, base hypothesis.
Can have more complex questions. RF: decision tree with bagging. Can use multiple variables simultaneously.

27 Tree Pruning

28 Improving Tree Accuracy
A large tree (i.e. one with many terminal nodes) may tend to overfit the training data, in a similar way to neural networks without weight decay.
Generally, we can improve accuracy by “pruning” the tree, i.e. cutting off some of the terminal nodes.
How do we know how far back to prune the tree? We use cross validation to see which tree has the lowest error rate.
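A hedged sketch of growing a large tree and pruning it back by cross validation. The slides do not say which package was used; rpart and the built-in mtcars data are assumptions chosen here because both ship with standard R installations, and n_leaves is a hypothetical helper.

```r
library(rpart)

set.seed(1)  # the cross-validation folds are random

# Grow a deliberately large regression tree (cp = 0 disables early stopping)
full <- rpart(mpg ~ ., data = mtcars, method = "anova",
              control = rpart.control(cp = 0, minsplit = 5, xval = 10))

# cptable holds one row per candidate subtree, with its cross-validated error
cp_table <- full$cptable
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Cut the tree back to the subtree with the lowest cross-validated error
pruned <- prune(full, cp = best_cp)

n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")
c(full = n_leaves(full), pruned = n_leaves(pruned))
```

The pruned tree never has more terminal nodes than the full tree; how many it keeps depends on the data and on the random folds.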

29 Cost complexity pruning (also known as weakest link pruning)
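For reference, a sketch of the usual cost complexity criterion, written in the notation of the ISLR text these slides draw on. The symbols are not defined elsewhere in this transcript: T_0 is the large unpruned tree, |T| is the number of terminal nodes of a subtree T, R_m is the region of the m-th terminal node with mean response \hat{y}_{R_m}, and \alpha \ge 0 is a tuning parameter. For each value of \alpha we look for the subtree that minimizes

```latex
\min_{T \subseteq T_0} \;
  \sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m}
    \left( y_i - \hat{y}_{R_m} \right)^2
  \; + \; \alpha \, |T|
```

With \alpha = 0 the minimizer is T_0 itself; as \alpha grows, each extra terminal node costs more, so the minimizing subtree gets smaller. Cross validation is then used to choose \alpha.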

30 Cost complexity pruning (also known as weakest link pruning)

31 Example: Baseball Players’ Salaries
The minimum cross validation error occurs at a tree size of 3 (# of terminal nodes).

32 Example: Baseball Players’ Salaries

33 Example: Baseball Players’ Salaries
Cross validation indicated that the minimum MSE is attained when the tree size is three (i.e. the number of leaf nodes is 3).

34 Example: Orange Juice Preference
Pruned tree: CV error rate = 22.5%
Full tree: training error rate = 14.75%
Full tree: test error rate = 23.6%

35 Trees vs. Linear models

36 Trees vs. Linear Models
Which model is better?
If the relationship between the predictors and the response is linear, then classical linear models such as linear regression would outperform regression trees.
On the other hand, if the relationship between the predictors and the response is non-linear, then decision trees would outperform classical approaches.
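A small simulation of the first case, as a sketch under stated assumptions: the data-generating model y = 2x + noise, the sample sizes, and the use of rpart for the tree are all choices made here for illustration, not taken from the slides.

```r
library(rpart)

set.seed(1)
n <- 400
x <- runif(n, 0, 10)
y <- 2 * x + rnorm(n, sd = 0.5)   # truly linear relationship (assumed)
dat <- data.frame(x = x, y = y)

train <- dat[1:200, ]
test <- dat[201:400, ]

fit_lm <- lm(y ~ x, data = train)
fit_tree <- rpart(y ~ x, data = train, method = "anova")

# Test-set mean squared error of a fitted model
mse <- function(fit) mean((test$y - predict(fit, newdata = test))^2)
c(linear = mse(fit_lm), tree = mse(fit_tree))
```

When the truth is linear, the tree has to approximate a straight line with a step function, so its test MSE comes out larger than that of the linear model. Replacing the linear signal with a strongly non-linear one would be the natural way to explore the second case.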

37 Trees vs. Linear Model: Classification Example
Top row: the true decision boundary is linear
Left: linear model (good)
Right: decision tree
Bottom row: the true decision boundary is non-linear
Left: linear model
Right: decision tree (good)

38 Advantages and disadvantages of trees

39 Pros and Cons of Decision Trees
Pros:
Trees are very easy to explain to people (probably even easier than linear regression).
Trees can be plotted graphically, and are easily interpreted even by a non-expert.
They work fine on both classification and regression problems.
Cons:
Trees don’t have the same prediction accuracy as some of the more complicated approaches that we examine in this course.
Trees can be very non-robust: small changes in the data can cause a large change in the final estimated tree.

