
1 Machine Learning: Decision Trees. Some of these slides are courtesy of R. Mooney, UT Austin, and E. Keogh, UC Riverside.

2 Decision Tree. Decision trees are powerful and popular tools for classification and prediction. A decision tree represents rules that can be understood by humans and used in knowledge systems such as databases.

3 Key Requirements. Attribute-value description: each object or case must be expressible in terms of a fixed collection of properties or attributes (e.g., hot, mild, cold). Predefined classes (target values): the target function has discrete output values (boolean or multiclass).

4 Decision Tree Classifier (Ross Quinlan). [Figure: a scatter plot of insects by abdomen length vs. antenna length (both axes 1-10), with the corresponding tree: Abdomen Length > 7.1? yes → Katydid; no → Antenna Length > 6.0? yes → Katydid, no → Grasshopper.]

5 Decision trees predate computers. [Figure: a dichotomous insect-identification key with tests such as "Antennae shorter than body?", "3 tarsi?", and "Foretibia has ears?", and leaves including Cricket, Katydids, and Camel Cricket.]

6 [Figure: the PlayTennis decision tree. The root attribute "outlook" has values sunny / overcast / rain; the sunny branch tests "humidity" (high → no, normal → yes), overcast leads directly to yes, and the rain branch tests "wind" (strong → no, weak → yes). The diagram labels an attribute node, an attribute value, and a decision leaf.] How do we choose the root? How do we proceed? When do we stop?

7 Definition. A decision tree is a classifier in the form of a tree structure:
– Decision node: specifies a test on a single attribute
– Leaf node: indicates the value of the target attribute
– Arc/edge: one outcome of the split on an attribute
– Path: a conjunction of tests that leads to the final decision
Decision trees classify instances or examples by starting at the root of the tree and moving down through it until a leaf node is reached.

8 Decision Tree Classification
A decision tree is:
– A flow-chart-like tree structure
– Internal nodes denote a test on an attribute
– Branches represent an outcome of the test
– Leaf nodes represent class labels or class distributions
Decision tree generation consists of two phases:
– Tree construction: at the start, all the training examples are at the root; partition the examples recursively based on selected attributes
– Tree pruning: identify and remove branches that reflect noise or outliers
Use of a decision tree: classifying an unknown sample
– Test the attribute values of the sample against the decision tree

9 Decision Tree Representation. Each internal node tests an attribute; each branch corresponds to an attribute value; each leaf node assigns a classification. How would we represent:
– ∧, ∨, XOR
– (A ∧ B) ∨ (C ∧ ¬D ∧ E)
– M of N
[Figure: the PlayTennis decision tree again (outlook / humidity / wind, as on slide 6).]

10 How do we construct the decision tree?
Basic algorithm (a greedy algorithm):
– The tree is constructed in a top-down, recursive, divide-and-conquer manner
– At the start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they can be discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning:
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
– There are no samples left

11 Top-Down Decision Tree Induction
Main loop:
1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant of the node
4. Sort training examples to the leaf nodes
5. If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes

12 [Figure: the PlayTennis decision tree again, with attribute, attribute value, and decision labels (as on slide 6).]

13 Decision Tree Induction Pseudocode (R. Mooney, UT Austin)
DTree(examples, features) returns a tree:
  If all examples are in one category, return a leaf node with that category label.
  Else if the set of features is empty, return a leaf node with the category label that is the most common in examples.
  Else pick a feature F and create a node R for it.
    For each possible value v_i of F:
      Let examples_i be the subset of examples that have value v_i for F.
      Add an outgoing edge E to node R labeled with the value v_i.
      If examples_i is empty, then attach a leaf node to edge E labeled with the category that is the most common in examples;
      else call DTree(examples_i, features – {F}) and attach the resulting tree as the subtree under edge E.
  Return the subtree rooted at R.
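A minimal runnable Python sketch of the pseudocode above (the names dtree, pick_feature, and most_common_class, and the dict-based example format, are my own illustrative choices, not from the slides; a real ID3 would choose features by information gain, covered on later slides):

from collections import Counter

def most_common_class(examples):
    # Majority class label among the examples.
    return Counter(e["class"] for e in examples).most_common(1)[0][0]

def pick_feature(examples, features):
    # Placeholder: ID3 would pick the feature with the highest information gain.
    return next(iter(features))

def dtree(examples, features):
    # examples: list of dicts mapping feature names (plus "class") to values.
    # features: a set of feature names still available for splitting.
    # Returns either a class label (leaf) or a pair (feature, {value: subtree}).
    classes = {e["class"] for e in examples}
    if len(classes) == 1:                 # all examples are in one category
        return classes.pop()
    if not features:                      # no features left: majority vote
        return most_common_class(examples)
    f = pick_feature(examples, features)
    branches = {}
    for v in {e[f] for e in examples}:    # values of F with no examples would get a
        subset = [e for e in examples if e[f] == v]   # majority-class leaf in the pseudocode
        branches[v] = dtree(subset, features - {f})
    return (f, branches)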

14 Picking a Good Split Feature. The goal is to have the resulting tree be as small as possible, per Occam's razor. Finding a minimal decision tree (in nodes, leaves, or depth) is an NP-hard optimization problem. The top-down divide-and-conquer method does a greedy search for a simple tree but is not guaranteed to find the smallest.
– General lesson in ML: "Greed is good."
We want to pick a feature that creates subsets of examples that are relatively "pure" in a single class, so they are "closer" to being leaf nodes. There is a variety of heuristics for picking a good test; a popular one is based on information gain, which originated with the ID3 system of Quinlan (1979). (R. Mooney, UT Austin)

15 Random splits: the tree can grow huge. Such trees are hard to understand, and larger trees are typically less accurate than smaller trees.

16 Principled Criterion. Select the attribute to test at each node by choosing the most useful attribute for classifying examples. How? Information gain:
– measures how well a given attribute separates the training examples according to their target classification
– is used to select among the candidate attributes at each step while growing the tree

17 Information Theory. Think of playing "20 questions": I am thinking of an integer between 1 and 1,000; what is it? What is the first question you would ask, and why? Entropy measures how much more information you need before you can identify the integer. Initially there are 1,000 possible values, which we assume are equally likely. What is the maximum number of questions you need to ask? (Each yes/no question can at best halve the remaining possibilities, so ceil(log2 1000) = 10 questions suffice.)

18 Information Gain as a Splitting Criterion. Select the attribute with the highest information gain (information gain is the expected reduction in entropy). Assume there are two classes, P and N.
– Let the set of examples S contain p elements of class P and n elements of class N.
– The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as
  I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n)),
  where 0 log(0) is defined as 0.

19 Entropy (R. Mooney, UT Austin). The entropy (disorder, impurity) of a set of examples S, relative to a binary classification, is
  Entropy(S) = -p1 log2(p1) - p0 log2(p0),
where p1 is the fraction of positive examples in S and p0 is the fraction of negatives. If all examples are in one category, the entropy is zero (we define 0 log(0) = 0). If the examples are equally mixed (p1 = p0 = 0.5), the entropy is at its maximum of 1. Entropy can be viewed as the number of bits required on average to encode the class of an example in S when data compression (e.g., Huffman coding) is used to give shorter codes to more likely cases. For multi-class problems with c categories, entropy generalizes to
  Entropy(S) = -sum over i = 1..c of p_i log2(p_i).
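A small Python helper matching this definition (the function name and the list-of-counts interface are my own):

import math

def entropy(class_counts):
    # Entropy of a set given the number of examples in each class,
    # using the convention 0 * log2(0) = 0 (zero counts are skipped).
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

print(entropy([5, 5]))    # equally mixed binary set -> 1.0
print(entropy([10, 0]))   # pure set -> 0 (may print as -0.0)
print(entropy([4, 5]))    # the 4F/5M example used on later slides -> about 0.9911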

20 Entropy Plot for Binary Classification. [Figure: entropy of a two-class problem as a function of the proportion of one of the two classes.] The entropy is 0 if the outcome is certain. The entropy is at its maximum if we have no knowledge of the system (i.e., every outcome is equally likely).

21 Information Gain. Information gain is the expected reduction in entropy caused by partitioning the examples according to an attribute:
  Gain(S, A) = Entropy(S) - sum over v in Values(A) of (|S_v| / |S|) Entropy(S_v).
It is the number of bits saved when encoding the target value of an arbitrary member of S, by knowing the value of attribute A.

22 Information Gain in Decision Tree Induction. Assume that, using attribute A, the current set is partitioned into some number of child sets. The encoding information that would be gained by branching on A is Gain(S, A) as defined on the previous slide. Note: entropy is at its minimum (zero) if the collection of objects is completely uniform (all examples in one class).
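A hedged Python sketch of Gain(S, A), using the same dict-based example format as the earlier induction sketch (function names are mine, not from the slides):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, label_key="class"):
    # Gain(S, A) = Entropy(S) - sum over values v of |S_v|/|S| * Entropy(S_v)
    labels = [e[label_key] for e in examples]
    by_value = {}
    for e in examples:
        by_value.setdefault(e[attribute], []).append(e[label_key])
    remainder = sum(len(sub) / len(examples) * entropy(sub)
                    for sub in by_value.values())
    return entropy(labels) - remainder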

23 Continuous Attributes? Each non-leaf node is a test, and its edges partition the attribute's values into subsets (easy for a discrete attribute). For a continuous attribute:
– Partition the continuous values of attribute A into a discrete set of intervals, or
– Create a new boolean attribute A_c by looking for a threshold c, i.e., a test of the form A <= c.
How do we choose c?
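One common way to choose c, sketched in Python: consider midpoints between consecutive sorted values where the class label changes and keep the candidate with the highest information gain (a C4.5-style heuristic; the function names and interface are my own, not from the slides):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    # Returns (gain, c) for the best boolean test "attribute <= c".
    pairs = sorted(zip(values, labels))
    all_labels = [l for _, l in pairs]
    best_gain, best_c = 0.0, None
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 == l2 or v1 == v2:
            continue                      # only class boundaries give useful cut points
        c = (v1 + v2) / 2
        left = [l for v, l in pairs if v <= c]
        right = [l for v, l in pairs if v > c]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = entropy(all_labels) - remainder
        if gain > best_gain:
            best_gain, best_c = gain, c
    return best_gain, best_c

# Weight column and classes from the table on the next slide (Comic excluded):
print(best_threshold([250, 150, 90, 78, 20, 170, 160, 180, 200], list("MFMFFMFMM")))
# -> gain about 0.59 at c = 165.0; the slides use 160, which induces the same split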

24 [Table: the training data]
Person   Hair Length  Weight  Age  Class
Homer    0"           250     36   M
Marge    10"          150     34   F
Bart     2"           90      10   M
Lisa     6"           78      8    F
Maggie   4"           20      1    F
Abe      1"           170     70   M
Selma    8"           160     41   F
Otto     10"          180     38   M
Krusty   6"           200     45   M
Comic    8"           290     38   ?

25 Let us try splitting on Hair Length (whether it is <= 5).
Hair Length <= 5?
Parent:     Entropy(4F, 5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911
yes branch: Entropy(1F, 3M) = -(1/4) log2(1/4) - (3/4) log2(3/4) = 0.8113
no branch:  Entropy(3F, 2M) = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.9710
Gain(Hair Length <= 5) = 0.9911 - (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911

26 Let us try splitting on Weight (whether it is <= 160).
Weight <= 160?
Parent:     Entropy(4F, 5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911
yes branch: Entropy(4F, 1M) = -(4/5) log2(4/5) - (1/5) log2(1/5) = 0.7219
no branch:  Entropy(0F, 4M) = -(0/4) log2(0/4) - (4/4) log2(4/4) = 0
Gain(Weight <= 160) = 0.9911 - (5/9 * 0.7219 + 4/9 * 0) = 0.5900

27 Let us try splitting on Age (whether it is <= 40).
Age <= 40?
Parent:     Entropy(4F, 5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911
yes branch: Entropy(3F, 3M) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1
no branch:  Entropy(1F, 2M) = -(1/3) log2(1/3) - (2/3) log2(2/3) = 0.9183
Gain(Age <= 40) = 0.9911 - (6/9 * 1 + 3/9 * 0.9183) = 0.0183
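The three gains above can be checked with a few lines of Python (the class counts are read off the table on slide 24; the function and variable names are mine):

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# (female, male) counts in each branch; Comic is excluded because his class is unknown.
parent = (4, 5)
splits = {
    "Hair Length <= 5": [(1, 3), (3, 2)],   # yes branch, no branch
    "Weight <= 160":    [(4, 1), (0, 4)],
    "Age <= 40":        [(3, 3), (1, 2)],
}

n = sum(parent)
for name, branches in splits.items():
    remainder = sum(sum(b) / n * entropy(b) for b in branches)
    print(f"Gain({name}) = {entropy(parent) - remainder:.4f}")
# Expected output, matching the slides: 0.0911, 0.5900, 0.0183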

28 Of the three features we had, Weight was best. But while people who weigh over 160 are perfectly classified (as males), the under-160 people are not perfectly classified, so we RECURSE! This time we find that we can split on Hair Length, and we are done. [Figure: the resulting tree: Weight <= 160? no → Male; yes → Hair Length <= 2? with its two leaves.] ID3 uses information gain to select the best attribute at each step in growing the tree.

29 We don't need to keep the data around, just the test conditions. [Figure: the final tree: Weight <= 160? no → Male; yes → Hair Length <= 2? yes → Male, no → Female.] How would these people be classified?

30 It is trivial to convert decision trees to rules.
[Figure: the same tree: Weight <= 160? no → Male; yes → Hair Length <= 2? yes → Male, no → Female.]
Rules to classify Males/Females:
If Weight greater than 160, classify as Male
Elseif Hair Length less than or equal to 2, classify as Male
Else classify as Female
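The rule set translates directly into code; a tiny hedged example (the argument names are mine):

def classify(weight, hair_length):
    if weight > 160:
        return "Male"
    elif hair_length <= 2:
        return "Male"
    else:
        return "Female"

# The unlabeled "Comic" row from the table on slide 24 (hair 8", weight 290):
print(classify(weight=290, hair_length=8))   # -> Male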

31 Computational Complexity. The worst case builds a complete tree in which every path tests every feature. Assume n examples and m features. At each level i of the tree, we must examine the remaining m - i features for each instance at that level to calculate information gains, so the total worst-case work is on the order of n*m^2 (summing roughly n*(m - i) work over the m levels). However, the learned tree is rarely complete (the number of leaves is << n). In practice, complexity is linear in both the number of features (m) and the number of training examples (n). [Figure: levels F_1 through F_m, with a maximum of n examples spread across all nodes at each of the m levels.]

32 ID3 (Iterative Dichotomiser 3). Recursively construct a decision tree, each time choosing the attribute that best classifies the training samples, until all training samples are correctly classified or all attributes have been used.

33 ID3 pseudo-code:
Input: a data set S
Output: a decision tree
If all the instances have the same value for the target attribute, then return a decision tree that is simply this value (not really a tree, more of a stump).
Else:
1. Compute the Gain values for all attributes, select the attribute with the highest value, and create a node for that attribute.
2. Make a branch from this node for every value of the attribute.
3. Assign all possible values of the attribute to the branches.
4. Follow each branch by partitioning the dataset to only the instances where the branch's value is present, and then go back to 1.

34 Hypothesis Space Search. ID3 is an inductive learning method.
– Its hypothesis space of decision trees is a complete space of finite discrete-valued functions, relative to the available attributes.
– It searches a space of hypotheses for one that fits the training examples.
– It performs batch learning that processes all training instances at once, rather than incremental learning that updates a hypothesis after each example.
– It performs hill-climbing (greedy search), so it may only find a locally optimal solution. It is guaranteed to find a tree consistent with any conflict-free training set (i.e., one where identical feature vectors are always assigned the same class), but not necessarily the simplest such tree.
– It finds a single discrete hypothesis, so there is no way to provide confidences or create useful queries.

35 ID3 vs. Candidate-Elimination
ID3:
– Maintains only a single current hypothesis
– Does not have the ability to determine how many alternative decision trees are consistent with the training data; converges to a locally optimal solution (not globally optimal)
– Uses all training examples at each step
– Can be easily extended to handle noisy data
– Searches a complete hypothesis space, but searches it incompletely
Candidate-Elimination:
– Maintains the set of all hypotheses consistent with the training examples
– Is based on individual examples
– Searches through an incomplete hypothesis space, but searches this incomplete space completely

36 Bias in Decision Tree Learning. Definition: learning bias is the tendency of an algorithm to find one class of solution in H in preference to another.
ID3's inductive bias: shorter trees are preferred over longer trees, and trees that place high-information-gain attributes close to the root are preferred over those that do not. It may miss other, better trees that:
– Are larger
– Require a non-greedy split at the root
Note: this type of bias is different from the type of bias used by Candidate-Elimination. The inductive bias of ID3 follows from its search strategy (a preference or search bias), whereas the inductive bias of the Candidate-Elimination algorithm follows from the definition of its hypothesis space (a restriction or language bias).

37 [Figure: a spurious tree splitting on "Wears green?" into Male/Female leaves.] The worked examples we have seen were performed on small datasets. However, with small datasets there is a great danger of overfitting the data. When you have few data points, there are many possible splitting rules that perfectly classify the data but will not generalize to future datasets. For example, the rule "Wears green?" perfectly classifies the data; so does "Mother's name is Jacqueline?"; so does "Has blue shoes"…

38 Avoiding Overfitting in Classification. The generated tree may overfit the training data:
– Too many branches, some of which may reflect anomalies due to noise or outliers
– The result is poor accuracy on unseen samples
Two approaches to avoid overfitting:
– Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold (it is difficult to choose an appropriate threshold)
– Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees; use a set of data different from the training data to decide which is the "best pruned tree"

39 DTs in practice... Growing to purity is bad (overfitting).

40 A Famous Problem: R. A. Fisher's Iris Dataset. 3 classes, 50 examples of each class. The task is to classify Iris plants into one of 3 varieties using the Petal Length and Petal Width. [Figure: photographs and a scatter plot of the three varieties: Iris Setosa, Iris Versicolor, and Iris Virginica.]

41 [Figure: scatter plot of Setosa, Versicolor, and Virginica with two separating lines.] We can generalize the piecewise linear classifier to N classes by fitting N-1 lines. In this case we first learned the line to (perfectly) discriminate between Setosa and Virginica/Versicolor, then we learned a line to approximately discriminate between Virginica and Versicolor.
If petal width > 3.272 – (0.325 * petal length) then class = Virginica
Elseif petal width…

42 DTs in practice... Growing to purity is bad (overfitting). [Figure: the iris data plotted with x1 = petal length and x2 = sepal width.]

43 DTs in practice... Growing to purity is bad (overfitting). [Figure: the iris data again (x1 = petal length, x2 = sepal width), now with a deeper tree's partition.]

44 DTs in practice... Growing to purity is bad (overfitting). Two remedies:
– Terminate growth early
– Grow to purity, then prune back

45 DTs in practice... Growing to purity is bad (overfitting). [Figure: the iris data (x1 = petal length, x2 = sepal width) with a leaf that is not statistically supportable; the fix is to remove the split and merge the leaves.]
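A hedged scikit-learn sketch of this point, assuming sklearn is acceptable here: a tree grown to purity ends up with leaves backed by very few examples, while requiring a minimum number of samples per leaf (or post-pruning) removes splits that are not statistically supportable. The parameter values are illustrative, not from the slides:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, 2:4]                                  # petal length and petal width (cf. slide 40)

full = DecisionTreeClassifier(random_state=0).fit(X, y)           # grown to purity
pruned = DecisionTreeClassifier(min_samples_leaf=5,
                                random_state=0).fit(X, y)         # tiny leaves disallowed

print(full.get_n_leaves(), pruned.get_n_leaves())   # the constrained tree is much smaller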

46 Overfitting. Learning a tree that classifies the training data perfectly may not lead to the tree with the best generalization to unseen data.
– There may be noise in the training data that the tree is erroneously fitting.
– The algorithm may be making poor decisions towards the leaves of the tree that are based on very little data and may not reflect reliable trends.
A hypothesis h is said to overfit the training data if there exists another hypothesis h' such that h has less error than h' on the training data but greater error on independent test data. [Figure: accuracy on training data vs. on test data, plotted against hypothesis complexity / size of the tree (number of nodes).]

47 Overfitting Example. Testing Ohm's Law: V = IR (i.e., I = (1/R)V). Experimentally measure 10 points and fit a curve to the resulting data. [Figure: current (I) vs. voltage (V) with a wiggly curve passing through all 10 points.] A perfect fit to the training data is obtained with a 9th-degree polynomial (one can fit n points exactly with a degree-(n-1) polynomial). "Ohm was wrong, we have found a more accurate function!"

48 Overfitting Example. [Figure: current (I) vs. voltage (V) with a straight-line fit.] Better generalization is obtained with a linear function that fits the training data less accurately.
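A small NumPy sketch of this contrast (the data are simulated with a made-up true resistance of 2 ohms and made-up noise, not Ohm's actual measurements):

import numpy as np

rng = np.random.default_rng(0)
V = np.linspace(1, 10, 10)
I = V / 2.0 + rng.normal(scale=0.2, size=V.size)     # I = V/R plus measurement noise

line = np.polyfit(V, I, deg=1)                       # linear fit
poly9 = np.polyfit(V, I, deg=9)                      # interpolates all 10 training points

V_test = np.linspace(1, 10, 100)
I_true = V_test / 2.0
line_err = np.mean((np.polyval(line, V_test) - I_true) ** 2)
poly9_err = np.mean((np.polyval(poly9, V_test) - I_true) ** 2)
print(line_err, poly9_err)   # the flexible polynomial typically shows the larger test error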

49 [Figure: the PlayTennis decision tree again (as on slide 6), together with a new example (Sunny, Hot, Normal, Strong) labeled No.] The tree requires further reinforcement.

50 How to avoid overfitting?
1. Stop growing the tree before it reaches the point where it perfectly classifies the training data (such estimation is difficult).
2. Allow the tree to overfit the data, and then post-prune the tree (this is the approach typically used).
In both cases, the question is how to determine the correct size of the final tree.

51 Overfitting Prevention (Pruning) Methods. Two basic approaches for decision trees:
– Prepruning: stop growing the tree at some point during top-down construction when there is no longer sufficient data to make reliable decisions.
– Postpruning: grow the full tree, then remove subtrees that do not have sufficient evidence. Label the leaf resulting from pruning with the majority class of the remaining data, or with a class probability distribution.
Methods for determining which subtrees to prune:
– Cross-validation: reserve some training data as a hold-out set (validation set, tuning set) to evaluate the utility of subtrees.
– Statistical test: use a statistical test on the training data to determine whether any observed regularity can be dismissed as likely due to random chance.
– Minimum description length (MDL): determine whether the additional complexity of the hypothesis is less than the cost of just explicitly remembering any exceptions that result from pruning.

52 Reduced Error Pruning. A post-pruning, cross-validation approach.
Partition the training data into "grow" and "validation" sets.
Build a complete tree from the "grow" data.
Until accuracy on the validation set decreases, do:
  For each non-leaf node n in the tree:
    Temporarily prune the subtree below n and replace it with a leaf labeled with the current majority class at that node.
    Measure and record the accuracy of the pruned tree on the validation set.
  Permanently prune the node that results in the greatest increase in accuracy on the validation set.
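A hedged Python sketch of this procedure on a minimal tree representation (the Node class, the dict-based example format, and all names are mine, not from the slides; it stops when no single pruning step improves validation accuracy, a close but not identical reading of the loop above):

from dataclasses import dataclass, field

@dataclass
class Node:
    label: str = None                    # set on leaf nodes
    feature: str = None                  # set on internal nodes
    children: dict = field(default_factory=dict)   # attribute value -> child Node
    majority: str = None                 # majority class of the grow data reaching this node

def classify(node, example):
    if node.label is not None:
        return node.label
    child = node.children.get(example.get(node.feature))
    return classify(child, example) if child is not None else node.majority

def accuracy(tree, data):
    return sum(classify(tree, e) == e["class"] for e in data) / len(data)

def internal_nodes(node):
    if node.label is None:
        yield node
        for child in node.children.values():
            yield from internal_nodes(child)

def reduced_error_prune(tree, validation):
    while True:
        best_node, best_acc = None, accuracy(tree, validation)
        for node in internal_nodes(tree):
            saved = (node.feature, node.children, node.label)
            node.feature, node.children, node.label = None, {}, node.majority  # temporary prune
            acc = accuracy(tree, validation)
            node.feature, node.children, node.label = saved                    # restore
            if acc > best_acc:
                best_node, best_acc = node, acc
        if best_node is None:            # no pruning step helps: stop
            return tree
        best_node.feature, best_node.children, best_node.label = None, {}, best_node.majority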

53 Issues with Reduced Error Pruning. The problem with this approach is that it potentially "wastes" training data on the validation set. The severity of this problem depends on where we are on the learning curve. [Figure: learning curve of test accuracy vs. number of training examples.]

54 Rule Post-Pruning (C4.5). Convert the decision tree into an equivalent set of rules. Prune (generalize) each rule by removing any preconditions whose removal improves the rule's estimated accuracy. Sort the pruned rules by their estimated accuracy, and apply them in this order when classifying new samples.

55 Which of the "Simple Problems" can be solved by a decision tree? [Figure: scatter plots of the earlier "simple problems", axes 1-10.] 1) Deep bushy tree. 2) Useless. 3) Deep bushy tree. The decision tree has a hard time with correlated attributes.

56 Cross-Validating without Losing Training Data. If the algorithm is modified to grow trees breadth-first rather than depth-first, we can stop growing after reaching any specified tree complexity. First, run several trials of reduced-error pruning using different random splits of the grow and validation sets. Record the complexity of the pruned tree learned in each trial, and let C be the average pruned-tree complexity. Then grow a final tree breadth-first from all the training data, but stop when the complexity reaches C. A similar cross-validation approach can be used to set arbitrary algorithm parameters in general.

57 Advantages/Disadvantages of Decision Trees
Advantages:
– Easy to understand (doctors love them!)
– Easy to generate rules
Disadvantages:
– May suffer from overfitting
– Classify by rectangular partitioning (so they do not handle correlated features very well)
– Can be quite large, so pruning is necessary
– Do not handle streaming data easily

58 Additional Decision Tree Issues
– Better splitting criteria: information gain prefers features with many values (see the gain-ratio sketch below)
– Continuous features
– Predicting a real-valued function (regression trees)
– Missing feature values
– Features with costs
– Misclassification costs
– Incremental learning (ID4, ID5)
– Mining large databases that do not fit in main memory
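For the first issue, a common fix (not detailed on the slide) is C4.5's gain ratio, which divides information gain by the attribute's split information so that many-valued attributes are penalized; a hedged sketch, with names and the dict-based example format my own:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(examples, attribute, label_key="class"):
    labels = [e[label_key] for e in examples]
    groups = {}
    for e in examples:
        groups.setdefault(e[attribute], []).append(e[label_key])
    n = len(examples)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy([e[attribute] for e in examples])   # penalizes many small branches
    return gain / split_info if split_info > 0 else 0.0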

