1 Decision Trees
Greg Grudic (Notes borrowed from Thomas G. Dietterich and Tom Mitchell) [Edited by J. Wiebe]

2 Outline
Decision Tree Representations
– ID3 and C4.5 learning algorithms (Quinlan 1986)
– CART learning algorithm (Breiman et al. 1984)
Entropy, Information Gain
Overfitting

3 Training Data Example: Goal is to predict when this player will play tennis.



8 Learning Algorithm for Decision Trees

9 Choosing the Best Attribute
– Many different frameworks for choosing the BEST attribute have been proposed!
– We will look at entropy gain.
The figure counts the + and – examples before and after a split; A1 and A2 are “attributes” (i.e., features or inputs).

10 Entropy

11 Entropy is like a measure of purity…

12 Entropy
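
As a concrete illustration (a minimal sketch, not code from the lecture), the entropy of a Boolean-labelled set can be computed directly from the label counts:

import math

def entropy(labels):
    # Entropy, in bits, of a list of class labels, e.g. ['+', '+', '-']
    total = len(labels)
    ent = 0.0
    for label in set(labels):
        p = labels.count(label) / total
        ent -= p * math.log(p, 2)
    return ent

# 9 positive and 5 negative examples (the PlayTennis data) -> about 0.940 bits
print(entropy(['+'] * 9 + ['-'] * 5))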

13 Entropy & Bits
You are watching a stream of independent random samples of X.
X has 4 possible values: P(X=A)=1/4, P(X=B)=1/4, P(X=C)=1/4, P(X=D)=1/4.
You get a string of symbols: ACBABBCDADDC…
To transmit the data over a binary link you can encode each symbol with 2 bits (A=00, B=01, C=10, D=11).
You need 2 bits per symbol.

14 Fewer Bits – Example
Now someone tells you the probabilities are not equal: P(X=A)=1/2, P(X=B)=1/4, P(X=C)=1/8, P(X=D)=1/8.
Now it is possible to find a coding that needs only 1.75 bits per symbol on average.
>>> from math import log2
>>> -log2(0.5)
1.0
>>> -log2(0.25)
2.0
>>> -log2(1/8)
3.0
>>> 1 * 0.5 + 2 * 0.25 + 3 * (1/8) + 3 * (1/8)
1.75
Use more bits for the less probable symbols; we expect those to appear less often.
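
The 1.75-bit average is no accident: it equals the entropy of this distribution, H(X) = Σ −p·log2(p). A quick check, in the same interactive style as above (not on the original slide):

>>> from math import log2
>>> probs = [1/2, 1/4, 1/8, 1/8]
>>> sum(-p * log2(p) for p in probs)
1.75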

15 Reality
Of course, we can’t use partial bits, so the specific numbers are theoretical only.
A common encoding method: Huffman coding (from a 1951 class project at MIT!).
In 1951 David A. Huffman and his classmates in an electrical engineering graduate course on information theory were given the choice of a term paper or a final exam. For the term paper, Huffman’s professor, Robert M. Fano, had assigned what at first appeared to be a simple problem. Students were asked to find the most efficient method of representing numbers, letters or other symbols using a binary code. Besides being a nimble intellectual exercise, finding such a code would enable information to be compressed for transmission over a computer network or for storage in a computer’s memory.
Huffman worked on the problem for months, developing a number of approaches, but none that he could prove to be the most efficient. Finally, he despaired of ever reaching a solution and decided to start studying for the final. Just as he was throwing his notes in the garbage, the solution came to him. “It was the most singular moment of my life,” Huffman says. “There was the absolute lightning of sudden realization.”
Huffman says he might never have tried his hand at the problem—much less solved it at the age of 25—if he had known that Fano, his professor, and Claude E. Shannon, the creator of information theory, had struggled with it. “It was my luck to be there at the right time and also not have my professor discourage me by telling me that other good people had struggled with this problem,” he says.
An optimal encoding exists whose expected code length per symbol lies between H(X) and H(X) + 1.

16 A Simple Example
Suppose we have a 10-symbol message built from 5 distinct symbols, e.g. [ ►♣♣♠☻►♣☼►☻ ].
How can we code this message in 0/1 so the coded message has minimum length (for transmission or saving)?
With a fixed-length code, 5 distinct symbols need at least 3 bits each (2 bits only cover 4 symbols).
For this simple encoding, the length of the coded message is 10 * 3 = 30 bits.

17 A Simple Example – cont.
Intuition: symbols that are more frequent should get shorter codes; and since the code lengths differ, there must be a way of telling the codes apart (no code may be a prefix of another).
With a Huffman code, the length of the encoded message ►♣♣♠☻►♣☼►☻ is 3*2 + 3*2 + 2*2 + 3 + 3 = 22 bits.
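
A quick check of this count (an illustration, not part of the slides; the per-symbol code lengths below are one valid Huffman assignment for the frequencies 3, 3, 2, 1, 1):

from collections import Counter

message = "►♣♣♠☻►♣☼►☻"
code_len = {'►': 2, '♣': 2, '☻': 2, '♠': 3, '☼': 3}  # assumed Huffman code lengths
counts = Counter(message)
print(sum(counts[s] * code_len[s] for s in counts))  # 22 bits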

18 Huffman Coding
We won’t cover the algorithm here (perhaps you covered it in a systems course?).
This was just to give you an idea: information theory comes up in many (all?) areas of CS.

19 Information Gain
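
As an illustration of what information gain computes (a self-contained sketch, not code from the lecture): it is the parent set's entropy minus the size-weighted entropy of the children after a split.

import math

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(l) / total) * math.log(labels.count(l) / total, 2)
                for l in set(labels))

def information_gain(parent, children):
    # Gain = entropy(parent) - sum over children of |child|/|parent| * entropy(child)
    total = len(parent)
    return entropy(parent) - sum(len(c) / total * entropy(c) for c in children)

# PlayTennis: splitting 9+/5- on Wind (weak: 6+/2-, strong: 3+/3-) gives a gain of about 0.048
print(information_gain(['+'] * 9 + ['-'] * 5,
                       [['+'] * 6 + ['-'] * 2, ['+'] * 3 + ['-'] * 3]))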

20 Training Example

21 Selecting the Next Attribute


23 Non-Boolean Features
Features with multiple discrete values
– Multi-way splits
Real-valued features
– Use thresholds
Regression
– Segaran uses the variance from the mean as the score:
    mean = sum(data) / len(data)
    return sum([(d - mean) ** 2 for d in data]) / len(data)
– Idea: a high variance means the numbers are widely dispersed; a low variance means the numbers are close together. We’ll look at how this is used in his code later (see the sketch below).
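
A minimal sketch of how variance can serve as the split score for regression trees (an illustration of the idea above, not Segaran's actual code):

def variance(data):
    # Variance of a list of numeric target values; 0 means all values are identical
    if not data:
        return 0.0
    mean = sum(data) / len(data)
    return sum((d - mean) ** 2 for d in data) / len(data)

# A split is good if it reduces the variance in the branches:
parent = [10, 12, 30, 32]
left, right = [10, 12], [30, 32]
drop = variance(parent) - (len(left) / len(parent) * variance(left)
                           + len(right) / len(parent) * variance(right))
print(drop)  # 100.0: the split cleanly separates the low values from the high values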

24 Hypothesis Space Search
You do not get the globally optimal tree!
– The space of possible trees is exponential, so the algorithm searches it greedily, one attribute at a time.

25 Overfitting

26 Overfitting in Decision Trees

27 Development Data is Used to Control Overfitting
Prune the tree to reduce error on a validation set.
Segaran's pruning:
– Start at the leaves.
– Create a combined data set from sibling leaves; suppose there are two, called tb and fb.
– delta = entropy(tb + fb) - (entropy(tb) + entropy(fb)) / 2
– If delta < mingain (a parameter): merge the branches.
– Then return up one level of recursion and consider further merging of branches.
Note: Segaran just uses the training data. (See the sketch below.)
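
A minimal sketch of the merge test just described (an illustration only, not Segaran's code; it assumes each leaf keeps the list of class labels that reached it):

import math

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(l) / total) * math.log(labels.count(l) / total, 2)
                for l in set(labels))

def should_merge(tb_labels, fb_labels, mingain=0.1):
    # Merge sibling leaves when keeping them separate gains less than mingain entropy
    delta = entropy(tb_labels + fb_labels) - (entropy(tb_labels) + entropy(fb_labels)) / 2
    return delta < mingain

# This split does not separate the classes at all, so delta is 0 and the leaves merge
print(should_merge(['+', '-'], ['+', '-']))  # True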

28 Segaran’s Trees
Same ideas, but a different structure.
Each node corresponds to a test:
– Attr_i == Value_j? Yes / No.
In lecture: examples of his trees and how they are built (a sketch of the node structure follows below).
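
A minimal sketch of such a node and of classification with it (field names are modelled on the description above and are assumptions, not necessarily Segaran's exact class):

class DecisionNode:
    # A node that asks: does row[col] equal value?  Yes -> tb branch, No -> fb branch
    def __init__(self, col=-1, value=None, results=None, tb=None, fb=None):
        self.col = col          # index of the attribute being tested
        self.value = value      # value the attribute is compared against
        self.results = results  # dict of outcome counts (only set at leaf nodes)
        self.tb = tb            # subtree for rows where the test is true
        self.fb = fb            # subtree for rows where the test is false

def classify(row, node):
    # Follow the yes/no tests down to a leaf and return its outcome counts
    if node.results is not None:
        return node.results
    branch = node.tb if row[node.col] == node.value else node.fb
    return classify(row, branch)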

