
1 Lecture Slides for Introduction to Machine Learning 2e. ETHEM ALPAYDIN, © The MIT Press, 2010. alpaydin@boun.edu.tr, http://www.cmpe.boun.edu.tr/~ethem/i2ml2e

2

3 Tree Uses Nodes, and Leaves

4 Decision Tree Representation
Internal decision nodes test the values of attributes:
- Univariate test: uses a single attribute x_i of the training example x. Numeric x_i: binary split, x_i > w_m. Discrete x_i: n-way split over its n possible values.
- Multivariate test: uses all attributes of the sample x = (x_1, x_2, ..., x_n).
Branches correspond to the outcomes of a test.
Leaves correspond to a prediction:
- For classification: a class label, or class proportions (probabilities).
- For regression: a numeric value; the average of the responses r, or a local fit.
Learning: build a decision tree consistent with the training set.
Greedy learning: find the best split recursively.
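To make the representation concrete, here is a minimal sketch of a univariate tree in Python (class and function names are my own, not from the slides):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Union

@dataclass
class Leaf:
    prediction: Any                        # class label (classification) or numeric value (regression)

@dataclass
class Node:
    attribute: str                         # attribute tested at this decision node
    threshold: Optional[float] = None      # set for a numeric binary split (x[attribute] > threshold)
    children: Dict[Any, Union["Node", Leaf]] = field(default_factory=dict)  # test outcome -> subtree

def predict(tree: Union[Node, Leaf], x: Dict[str, Any]) -> Any:
    """Follow test outcomes from the root down to a leaf and return its prediction."""
    while isinstance(tree, Node):
        if tree.threshold is not None:     # univariate numeric test: binary split
            outcome = x[tree.attribute] > tree.threshold
        else:                              # univariate discrete test: n-way split
            outcome = x[tree.attribute]
        tree = tree.children[outcome]
    return tree.prediction
```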

5 Ex: Decision Tree for PlayTennis Data
What is the prediction for the training example (Sunny, Mild, High, Strong)?
Each training example falls into exactly one leaf.
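As an illustration, the tree on this slide can be written with the hypothetical Node/Leaf/predict sketch above, assuming the standard PlayTennis tree (Outlook at the root, Humidity tested under Sunny, Wind tested under Rain):

```python
# Assumes the standard PlayTennis tree and the Node/Leaf/predict sketch from above.
play_tennis = Node("Outlook", children={
    "Sunny":    Node("Humidity", children={"High": Leaf("No"), "Normal": Leaf("Yes")}),
    "Overcast": Leaf("Yes"),
    "Rain":     Node("Wind", children={"Strong": Leaf("No"), "Weak": Leaf("Yes")}),
})

x = {"Outlook": "Sunny", "Temperature": "Mild", "Humidity": "High", "Wind": "Strong"}
print(predict(play_tennis, x))   # Sunny -> Humidity = High -> "No"
```

The example is routed down exactly one path and therefore reaches exactly one leaf.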

6 Ex: Decision Tree for Wisconsin Cancer Data
1. For each sample, test the corresponding attribute at each node
2. Follow the appropriate branch according to the outcome of the test
3. At a leaf, return the majority class, or the class probabilities, as the prediction for the sample

7 DTs as Logical Representations
Decision Tree = set of if-then rules: each rule is a path from the root to a leaf

8 Representation Power and Efficiency of DTs
Any Boolean function can be represented by a DT
Any function f: A_1 × A_2 × ... × A_n → C can be represented by a DT
Number of tests is between O(n) and O(2^n), where n = number of attributes
DT construction algorithms:
1. We want an efficient DT learning algorithm that generates the simplest consistent tree
2. Produce comprehensible results
3. Often the first method to try on a new data set

9 Learning DTs and Complexity
Many DTs are consistent with the same training set D
The simplest DTs are preferred over complex DTs:
- Simplest = smallest: fewest nodes
- Simplest = takes the fewest bits to encode, least space, least memory
Finding the simplest DT consistent with D is NP-hard:
- Hypothesis space H = space of all possible trees, which is exponential in size. We could enumerate all possible trees, keep the consistent ones, and return the smallest; but this is a combinatorial problem and raises overfitting issues
- Solutions: heuristic search methods, or limit H to a subset of simple trees (limit the size, or the number of tests)
Learning is greedy: find the best split recursively (Breiman et al., 1984; Quinlan, 1986, 1993)

10 Learning a Decision Tree (Univariate)
Given a set N of training examples:
- Find the best test attribute A to split N on
- Split N according to each outcome of the test
- Recurse on each subset of N
Stop when no more splits are possible, or when the tree generalizes well
The full algorithm is given next…

11 Top-Down Induction of Decision Trees
Given a training set T with classes C_1, ..., C_k:
For classification:
1. If all elements of T are of the same class C_i: the DT for T is a leaf with class label C_i
2. Else: select the best test attribute A_T to split T on
3. Partition T into m subsets T_1, ..., T_m according to the m outcomes of the test; the DT for T is the test node A_T with its m children
4. Recurse on each of T_1, ..., T_m
For regression: same as classification, except
1. The label of a leaf is a mean value or a value obtained by linear regression
2. The definition of the best test attribute is different

12 Which is the Best Test?
The test should provide information about the class label
Suppose we have 40 examples (30+ and 10-), and we consider two test attributes that give the splits shown below
Intuitively, we want a test attribute that separates the training set as well as possible
If each leaf were pure, the test attribute would provide maximal information about the label at the leaf
What is purity, and how do we measure it?

13 Information and Uncertainty
Examples:
1. You are about to observe the outcome of a fair die roll
2. You are about to observe the outcome of a biased coin flip
3. I am about to tell you my name
You know a priori: I have a fair die, a biased coin, your name, …
I am sending you a message, that is, the outcome of one of these
If the message you receive is totally expected with 100% certainty, then the information content of the message is 0: nothing surprising, and very boring
In each situation, you have a different amount of uncertainty about what outcome/message you will observe/receive
Information = reduction in uncertainty; it is relative to what you know a priori, is related to surprise, and depends on context

14 Information = Reduction of Uncertainty

15 Entropy
Suppose we have a transmitter S that conveys the results of a random experiment with m possible discrete outcomes s_1, ..., s_m, with probability distribution P = (p_1, ..., p_m)
The information content of observing outcome s_i is I(s_i) = -log2 p_i
Entropy is the average amount of information when observing S, that is, the expected information content of the probability distribution of S: H(S) = Σ_i p_i I(s_i) = -Σ_i p_i log2 p_i
This holds provided p_i ≠ 0; for p_i = 0, the term p_i log2 p_i is taken as 0 by convention
Note: the entropy depends only on the probability distribution of S, not on the outcomes themselves, so we can write I(p_i) instead of I(s_i), and H(P) instead of H(S)
Entropy = the uncertainty the observer has a priori, the average number of bits needed to communicate/encode a message, etc.

16 Entropy and Coding Theory
H(P) = -Σ_i p_i log2 p_i is the average number of bits needed to communicate a symbol
Communicating a message from an alphabet of 4 equally probable symbols requires a 2-bit encoding for each symbol
Conveying an outcome that is certain takes 0 bits
Suppose now p_0 = 0.5, p_1 = 0.25, p_2 = p_3 = 0.125. What is the best encoding for each symbol? z_0 = 0, z_1 = 10, z_2 = 110, z_3 = 111, i.e. 1, 2, 3, and 3 bits
What is the average length of a message over time?
Shannon entropy: there are codes that communicate the symbols with efficiency arbitrarily close to H(P) bits/symbol, and no code can achieve an average length below H(P) bits/symbol
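As a quick check of the numbers above (a sketch; the code lengths are those of z_0 ... z_3 given on the slide):

```python
import math

p = [0.5, 0.25, 0.125, 0.125]     # p0 .. p3 from the slide
code_len = [1, 2, 3, 3]           # lengths of z0 = 0, z1 = 10, z2 = 110, z3 = 111

entropy = -sum(pi * math.log2(pi) for pi in p)
avg_len = sum(pi * li for pi, li in zip(p, code_len))

print(entropy, avg_len)           # both 1.75 bits/symbol: this code meets the entropy bound
```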

17 Shannon Entropy as Measure of Information
For any distribution P, H(P) is the optimal number of binary questions required on average to determine an outcome drawn from P.

18 Properties of Entropy
The further P is from uniform, the lower the entropy
So we can use entropy as a measure of impurity of a set D
We want to select a test attribute that reduces the entropy of D to the maximum extent

19 Entropy Applied to Binary Classification
Consider a training set D with two classes C+ and C-, and let:
- p+ be the proportion of positive examples in D
- p- be the proportion of negative examples in D
Entropy measures the impurity of D based on the empirical probabilities of the two classes, and is the expected number of bits needed to encode the class (+ or -) of a randomly drawn example:
H(D) = -(p+ log2 p+ + p- log2 p-)
So we can use the entropy H(D) to measure the purity of D!
D has maximum purity when p+ (equivalently p-) is 0 or 1
D has maximal impurity when the class distribution is uniform, that is, when p+ = p- = 0.5
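A small illustration of this formula (not from the slides):

```python
import math

def binary_entropy(p_pos):
    """H(D) = -(p+ log2 p+ + p- log2 p-), with a 0 log2 0 term taken as 0."""
    return -sum(p * math.log2(p) for p in (p_pos, 1.0 - p_pos) if p > 0)

print(binary_entropy(0.5))    # 1.0    -> maximal impurity (uniform class distribution)
print(binary_entropy(0.75))   # ~0.811 (e.g. 30+ / 10-, as in the earlier 40-example split)
print(binary_entropy(1.0))    # 0.0    -> pure set
```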

20 Classification Trees (ID3, CART, C4.5)
For node m, N_m instances reach m, and N^i_m of them belong to class C_i, so the estimated class probability at m is p^i_m = N^i_m / N_m
Node m is pure if p^i_m is 0 or 1 for every class i
Measure of impurity is entropy: I_m = -Σ_i p^i_m log2 p^i_m

21 Best Split
If node m is pure, generate a leaf and stop; otherwise split and continue recursively
Impurity after the split: N_mj of the N_m instances take branch j, and N^i_mj of them belong to class C_i, giving p^i_mj = N^i_mj / N_mj and
I'_m = -Σ_j (N_mj / N_m) Σ_i p^i_mj log2 p^i_mj
Find the variable and split that minimize impurity (among all variables, and among all split positions for numeric variables)

22 Conditional Entropy and Information Gain
The conditional entropy of a training set D given an attribute A is:
H(D | A) = Σ_{v ∈ Values(A)} P(A = v) H(D | A = v) = Σ_{v ∈ Values(A)} P_v H(D_v)
where
- Values(A) = set of all possible values of attribute A
- P_v = P(A = v) = proportion of examples for which A = v
- H(D_v) = H(D | A = v) = entropy of the subset of examples for which A = v
Information gain: the expected reduction in the entropy of D given A
IG(D, A) = H(D) - H(D | A)
This is what we need for constructing a DT: select the test attribute A that gives the highest information gain
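A minimal sketch of these quantities in Python (function names are my own; examples are assumed to be (attribute-dict, class-label) pairs):

```python
import math
from collections import Counter

def entropy(labels):
    """H(D) over the empirical class distribution of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(examples, attr):
    """H(D | A) = sum over values v of P(A = v) * H(D_v)."""
    n = len(examples)
    by_value = {}
    for x, y in examples:
        by_value.setdefault(x[attr], []).append(y)
    return sum((len(ys) / n) * entropy(ys) for ys in by_value.values())

def information_gain(examples, attr):
    """IG(D, A) = H(D) - H(D | A)."""
    return entropy([y for _, y in examples]) - conditional_entropy(examples, attr)
```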

23 Which Attribute is the Best Classifier?

24 Selecting the Best Test Attribute
Choose Outlook as the top test attribute, because:
IG(D, Outlook) = 0.246
IG(D, Humidity) = 0.151
IG(D, Wind) = 0.048
IG(D, Temperature) = 0.029
Then repeat for each child of Outlook
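For example, IG(D, Outlook) = 0.246 can be checked by hand, assuming the standard 14-example PlayTennis data (9 Yes / 5 No overall; Sunny: 2 Yes / 3 No, Overcast: 4 / 0, Rain: 3 / 2):
H(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940
H(D | Outlook) = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) ≈ 0.694
IG(D, Outlook) = 0.940 - 0.694 ≈ 0.246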

25 Which Attribute is the Best Classifier?

26 Decision Tree Learning Algorithm: ID3
Given a training set T with classes C_1, ..., C_k, there are 3 cases:
1. If all elements of T are of the same class C_i: the DT for T is a leaf with label C_i
2. If T is empty: the DT for T is a leaf with label equal to the majority class in the parent of this node
3. If T contains elements of different classes:
   A. Select the test attribute A_T with the highest information gain
   B. Partition T into m subsets T_1, ..., T_m, one for each outcome of the test on A_T
   C. The DT for T is a decision node labeled A_T with one branch for each outcome of the test
   D. Repeat from step 1 for each subset T_j
For regression: same as above, but the label of a leaf is a mean value or a value obtained by regression, and the impurity measure is different.
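A compact sketch of this recursion for discrete attributes (Python; it reuses the hypothetical Node/Leaf classes and information_gain function from the earlier sketches, and is not the original implementation):

```python
from collections import Counter

def id3(examples, attributes, parent_majority=None):
    """examples: list of (attribute-dict, class-label) pairs; attributes: names still usable."""
    if not examples:                                   # case 2: empty -> majority class of the parent
        return Leaf(parent_majority)
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                          # case 1: all examples have the same class
        return Leaf(labels[0])
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:                                 # no tests left: fall back to the majority class
        return Leaf(majority)
    # case 3: split on the attribute with the highest information gain
    best = max(attributes, key=lambda a: information_gain(examples, a))
    node = Node(best)
    remaining = [a for a in attributes if a != best]
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        node.children[value] = id3(subset, remaining, parent_majority=majority)
    return node
```

For instance, id3(train, ["Outlook", "Temperature", "Humidity", "Wind"]) would rebuild a PlayTennis-style tree from a list of such examples.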

27

28 Hypothesis Space of ID3
Space H = set of all possible DTs
Search heuristic based on information gain
Simple-to-complex hill-climbing search through H

29 Hypothesis Space Search by ID3
Space H is complete: the target function is surely in H
Outputs a single consistent hypothesis:
- Cannot query the teacher
- Cannot consider alternative DTs
Does not backtrack:
- May get stuck in a local optimum
Statistically based search choices:
- Uses all examples at each step, robust to noisy data
- Less sensitive to errors in individual examples
Inductive bias: preference for shorter DTs, and for those with high IG at the root

30 Attributes with Many Values
Problem: information gain tends to select attributes with many values first
Example: an attribute such as Date has too many possible values
Solution: use the information gain ratio instead of information gain
Information gain ratio: the proportion of information generated by the split that is useful for classification:
GainRatio(S, A) = IG(S, A) / SplitInformation(S, A)
Split information: measures the potential information generated by partitioning S into c subsets, i.e. the entropy of S with respect to the values of A:
SplitInformation(S, A) = -Σ_{i=1..c} (|S_i| / |S|) log2(|S_i| / |S|)
where S_1, ..., S_c are the subsets resulting from partitioning S by the c-valued attribute A
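A short sketch of these two quantities in code (Python; information_gain is the hypothetical function from the earlier sketch):

```python
import math
from collections import Counter

def split_information(examples, attr):
    """Entropy of D with respect to the values of attribute A (not the class labels)."""
    n = len(examples)
    counts = Counter(x[attr] for x, _ in examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(examples, attr):
    """GainRatio(S, A) = IG(S, A) / SplitInformation(S, A)."""
    si = split_information(examples, attr)
    return information_gain(examples, attr) / si if si > 0 else 0.0
```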

31 Attributes with Real Values
Create a discrete attribute to test a continuous attribute: simply take the midpoints between consecutive values
Example: Temperature = 85, or 54, … Which is the better split threshold: Temp > 54 or Temp > 85?
1. Sort the examples by the value of the real attribute to test, and obtain all candidate thresholds; e.g., Temp > 54 and Temp > 85 are candidate discrete attributes
2. Compute the information gain for each candidate discrete attribute
3. Choose the best discrete attribute, replace the continuous attribute with it, and use it like any other attribute when learning the decision tree
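A sketch of the threshold search (Python; entropy is the hypothetical function from the earlier sketch):

```python
def best_threshold(examples, attr):
    """Candidate thresholds are midpoints between consecutive sorted values of attr;
    return the one whose binary split (attr > t) gives the highest information gain."""
    values = sorted({x[attr] for x, _ in examples})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    labels = [y for _, y in examples]
    n = len(labels)

    def gain(t):
        above = [y for x, y in examples if x[attr] > t]
        below = [y for x, y in examples if x[attr] <= t]
        return entropy(labels) - ((len(above) / n) * entropy(above)
                                  + (len(below) / n) * entropy(below))

    return max(candidates, key=gain) if candidates else None
```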

32 Attributes with Costs

33 Regression Trees
Error at node m (squared error around the node's estimate g_m, the average of the responses r^t of the instances reaching m):
E_m = (1/N_m) Σ_{x^t reaching m} (r^t - g_m)^2
Error after splitting, with an estimate g_mj (the branch mean or a local fit) in each branch j:
E'_m = (1/N_m) Σ_j Σ_{x^t in branch j} (r^t - g_mj)^2
Choose the split that minimizes E'_m
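A minimal sketch of this regression split criterion (Python; function names are my own, and examples are (attribute-dict, numeric-response) pairs):

```python
def sse(targets):
    """Sum of squared errors around the mean prediction g_m for a node or branch."""
    if not targets:
        return 0.0
    g = sum(targets) / len(targets)
    return sum((r - g) ** 2 for r in targets)

def regression_split_error(examples, attr):
    """Average squared error E'_m after an n-way split on a discrete attribute;
    the branch means play the role of g_mj."""
    by_value = {}
    for x, r in examples:
        by_value.setdefault(x[attr], []).append(r)
    return sum(sse(rs) for rs in by_value.values()) / len(examples)
```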

34 Model Selection in Trees

35 Pruning Trees
Remove subtrees for better generalization (decrease variance)
Prepruning: early stopping
Postpruning: grow the whole tree, then prune subtrees which overfit on the pruning set
Prepruning is faster; postpruning is more accurate (requires a separate pruning set)
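A sketch of one common postpruning scheme, reduced-error pruning on a held-out pruning set (Python; it reuses the hypothetical Node/Leaf/predict from the earlier sketches, handles discrete splits only, and is one concrete choice rather than the specific procedure on the slide):

```python
from collections import Counter

def _leaf_labels(tree):
    """All leaf predictions under a (sub)tree."""
    if isinstance(tree, Leaf):
        return [tree.prediction]
    return [lab for child in tree.children.values() for lab in _leaf_labels(child)]

def prune(tree, pruning_set):
    """Bottom-up: replace a subtree by a majority-class leaf whenever that does not
    increase the number of errors on the pruning examples that reach it.
    (Pruning examples with attribute values unseen in the tree are not handled here.)"""
    if isinstance(tree, Leaf) or not pruning_set:
        return tree
    for value in list(tree.children):
        subset = [(x, y) for x, y in pruning_set if x.get(tree.attribute) == value]
        tree.children[value] = prune(tree.children[value], subset)
    subtree_errors = sum(predict(tree, x) != y for x, y in pruning_set)
    majority = Counter(_leaf_labels(tree)).most_common(1)[0][0]
    leaf_errors = sum(y != majority for _, y in pruning_set)
    return Leaf(majority) if leaf_errors <= subtree_errors else tree
```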

36 Rule Extraction from Trees
C4.5Rules (Quinlan, 1993)

37 Learning Rules
Rule induction is similar to tree induction, but tree induction is breadth-first while rule induction is depth-first: one rule at a time
A rule set contains rules; each rule is a conjunction of terms
A rule covers an example if all terms of the rule evaluate to true for that example
Sequential covering: generate rules one at a time until all positive examples are covered
IREP (Fürnkranz and Widmer, 1994), Ripper (Cohen, 1995)
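A sketch of the outer loop of sequential covering (Python; learn_one_rule and covers are hypothetical callables, and IREP/Ripper differ in how each rule is grown, pruned, and when learning stops):

```python
def sequential_covering(examples, learn_one_rule, covers, positive_label="+"):
    """Learn rules one at a time; remove the examples covered by each new rule
    before learning the next one, until no positive examples remain."""
    rules, remaining = [], list(examples)
    while any(y == positive_label for _, y in remaining):
        rule = learn_one_rule(remaining)       # grow a conjunction of terms on what is left
        if rule is None:                       # no acceptable rule could be found
            break
        rules.append(rule)
        remaining = [(x, y) for x, y in remaining if not covers(rule, x)]
    return rules
```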

38

39

40 Multivariate Trees

