Machine Learning III Decision Tree Induction CSE 473.

1 Machine Learning III Decision Tree Induction CSE 473

2 © Daniel S. Weld 2 Machine Learning Outline
Machine learning:
  √ Function approximation
  √ Bias
Supervised learning:
  √ Classifiers & concept learning
  √ Version spaces (restriction bias)
  Decision-tree induction (preference bias)
Overfitting
Ensembles of classifiers

3 © Daniel S. Weld 3 Two Strategies for ML
Restriction bias: use prior knowledge to specify a restricted hypothesis space (e.g., the version-space algorithm over conjunctions).
Preference bias: use a broad hypothesis space, but impose an ordering on the hypotheses (e.g., decision trees).

4 © Daniel S. Weld 4 Decision Trees
A convenient representation:
  Developed with learning in mind
  Deterministic
Expressive:
  Equivalent to propositional DNF
  Handles discrete and continuous parameters
A simple learning algorithm:
  Handles noise well
  Can be classified as constructive (builds the DT by adding nodes), eager, and batch (though incremental versions exist)

5 © Daniel S. Weld 5 Concept Learning
E.g., learn the concept "edible mushroom"; the target function has two values, T or F.
Represent concepts as decision trees.
Use hill-climbing search through the space of decision trees:
  Start with a simple concept
  Refine it into a more complex concept as needed

6 © Daniel S. Weld 6 Experience: "Good day for tennis"
Day  Outlook   Temp  Humid   Wind    PlayTennis?
d1   sunny     hot   high    weak    no
d2   sunny     hot   high    strong  no
d3   overcast  hot   high    weak    yes
d4   rain      mild  high    weak    yes
d5   rain      cool  normal  weak    yes
d6   rain      cool  normal  strong  yes
d7   overcast  cool  normal  strong  yes
d8   sunny     mild  high    weak    no
d9   sunny     cool  normal  weak    yes
d10  rain      mild  normal  weak    yes
d11  sunny     mild  normal  strong  yes
d12  overcast  mild  high    strong  yes
d13  overcast  hot   normal  weak    yes
d14  rain      mild  high    strong  no

7 © Daniel S. Weld 7 Decision Tree Representation
Good day for tennis? The tree tests Outlook at the root: Overcast → Yes; Sunny → test Humidity (Normal → Yes, High → No); Rain → test Wind (Weak → Yes, Strong → No).
Leaves = classification
Arcs = choice of value for parent attribute
A decision tree is equivalent to logic in disjunctive normal form:
G-Day ≡ (Sunny ∧ Normal) ∨ Overcast ∨ (Rain ∧ Weak)
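To make the equivalence concrete, here is a small Python sketch (mine, not from the slides) of the same tree written as nested conditionals next to its DNF form; the function names and lowercase string values are assumptions.

```python
def good_day_tree(outlook, humidity, wind):
    """The slide's decision tree written as nested conditionals."""
    if outlook == "overcast":
        return True
    if outlook == "sunny":
        return humidity == "normal"
    if outlook == "rain":
        return wind == "weak"
    return False

def good_day_dnf(outlook, humidity, wind):
    """The equivalent propositional DNF: (Sunny ∧ Normal) ∨ Overcast ∨ (Rain ∧ Weak)."""
    return ((outlook == "sunny" and humidity == "normal")
            or outlook == "overcast"
            or (outlook == "rain" and wind == "weak"))
```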

11 © Daniel S. Weld 11 DT Learning as Search
Nodes: decision trees
Operators: tree refinement (sprouting the tree)
Initial node: the smallest tree possible, a single leaf
Heuristic: information gain
Goal: the best tree possible (???)

12 © Daniel S. Weld 12 What is the Simplest Tree?
(Training data: the 14 "good day for tennis" examples from slide 6.)
The simplest tree is a single leaf predicting the majority class, "yes".
How good is it? [10+, 4−] means it is correct on 10 examples and incorrect on 4.

13 © Daniel S. Weld 13 Successors
The successors of the single-leaf "Yes" tree replace that leaf with a test on one attribute: Outlook, Temp, Humid, or Wind.
Which attribute should we use to split?

14 © Daniel S. Weld 14 To be decided:
How to choose the best attribute? Information gain, based on entropy (disorder).
When to stop growing the tree?

15 © Daniel S. Weld 15 Disorder is bad; homogeneity is good.
(Figure: example splits ranging from bad to better to good.)

16 © Daniel S. Weld 16 Entropy
(Figure: entropy plotted against the proportion P of positive examples; it is 0 at P = 0 and P = 1 and reaches its maximum of 1.0 at P = 0.5.)

17 © Daniel S. Weld 17 Entropy (disorder) is bad; homogeneity is good.
Let S be a set of examples.
Entropy(S) = −P log2(P) − N log2(N)
where P is the proportion of positive examples, N is the proportion of negative examples, and 0 log2 0 is taken to be 0.
Example: S has 10 positive and 4 negative examples.
Entropy([10+, 4−]) = −(10/14) log2(10/14) − (4/14) log2(4/14) = 0.863
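As a concrete check of the formula above, here is a small Python sketch (mine, not course code); the function name and interface are assumptions.

```python
import math

def entropy(pos, neg):
    """Entropy of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                 # 0 * log2(0) is taken to be 0
            result -= p * math.log2(p)
    return result

print(entropy(10, 4))   # ≈ 0.863, the [10+, 4-] example from the slide
```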

18 © Daniel S. Weld 18 Information Gain
The expected reduction in entropy resulting from splitting on an attribute A:
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)
where Entropy(S) = −P log2(P) − N log2(N)
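A sketch of the gain formula in Python (mine, not from the slides); it assumes examples are dicts mapping attribute names to values, with class labels in a parallel list.

```python
import math
from collections import Counter

def entropy_of_labels(labels):
    """Entropy of a list of class labels (e.g. 'yes'/'no')."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Gain(S, A): Entropy(S) minus the weighted entropy of each subset S_v."""
    total = len(labels)
    gain = entropy_of_labels(labels)
    for v in {ex[attribute] for ex in examples}:
        subset = [lab for ex, lab in zip(examples, labels) if ex[attribute] == v]
        gain -= (len(subset) / total) * entropy_of_labels(subset)
    return gain
```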

19 © Daniel S. Weld 19 Gain of Splitting on Wind
Day  Wind    Tennis?
d1   weak    no
d2   strong  no
d3   weak    yes
d4   weak    yes
d5   weak    yes
d6   strong  yes
d7   strong  yes
d8   weak    no
d9   weak    yes
d10  weak    yes
d11  strong  yes
d12  strong  yes
d13  weak    yes
d14  strong  no
Values(Wind) = {weak, strong}
S = [10+, 4−], S_weak = [6+, 2−], S_strong = [3+, 3−]
Gain(S, Wind) = Entropy(S) − Σ_{v ∈ {weak, strong}} (|S_v| / |S|) Entropy(S_v)
             = 0.863 − (8/14)(0.811) − (6/14)(1.00) = −0.029

20 © Daniel S. Weld 20 Gain of Split on Humidity
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)
where Entropy(S) = −P log2(P) − N log2(N)
(Training data: the 14 examples from slide 6; Humid takes the values high and normal.)

21 © Daniel S. Weld 21 Is…
Entropy([4+, 3−]) = 0.985
Entropy([7+, 0−]) = 0
Gain(S, Humid) = 0.863 − (7/14)(0.985) − (7/14)(0) ≈ 0.37

22 © Daniel S. Weld 22 Evaluating Attributes
Candidate splits of the "Yes" leaf on Outlook, Temp, Humid, and Wind:
Gain(S, Outlook) = 0.401
Gain(S, Humid) = 0.375
Gain(S, Temp) = 0.029
Gain(S, Wind) = −0.029
(The values are actually different from this, but ignore this detail.)

23 © Daniel S. Weld 23 Resulting Tree…
Good day for tennis?
Outlook
  Sunny → No [2+, 3−]
  Overcast → Yes [4+]
  Rain → No [2+, 3−]

24 © Daniel S. Weld 24 Recurse!
The Sunny branch of Outlook contains these examples:
Day  Temp  Humid   Wind    Tennis?
d1   hot   high    weak    no
d2   hot   high    strong  no
d8   mild  high    weak    no
d9   cool  normal  weak    yes
d11  mild  normal  strong  yes

25 © Daniel S. Weld 25 One Step Later…
Outlook
  Sunny → Humidity
            High → No [3−]
            Normal → Yes [2+]
  Overcast → Yes [4+]
  Rain → No [2+, 3−]

26 © Daniel S. Weld 26 Decision Tree Algorithm
BuildTree(TrainingData)
  Split(TrainingData)

Split(D)
  If (all points in D are of the same class) Then Return
  For each attribute A
    Evaluate splits on attribute A
  Use the best split to partition D into D1, D2
  Split(D1)
  Split(D2)
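A minimal runnable version of this recursion (my own sketch, not the course code): it chooses the attribute with the highest information gain, does a multi-way split on discrete attribute values, and returns the tree as nested dicts. All names and the dict-based example format are assumptions.

```python
import math
from collections import Counter

def entropy_of_labels(labels):
    """Entropy of a list of class labels (same helper as in the slide-18 sketch)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Gain(S, A) for dict-style examples (same helper as in the slide-18 sketch)."""
    gain = entropy_of_labels(labels)
    for v in {ex[attribute] for ex in examples}:
        sub = [lab for ex, lab in zip(examples, labels) if ex[attribute] == v]
        gain -= (len(sub) / len(labels)) * entropy_of_labels(sub)
    return gain

def build_tree(examples, labels, attributes):
    """Recursive Split(D): stop when pure, otherwise split on the best attribute."""
    if len(set(labels)) == 1:            # all points in D are of the same class
        return labels[0]
    if not attributes:                   # nothing left to split on: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    tree = {best: {}}
    for v in {ex[best] for ex in examples}:
        keep = [i for i, ex in enumerate(examples) if ex[best] == v]
        tree[best][v] = build_tree([examples[i] for i in keep],
                                   [labels[i] for i in keep],
                                   [a for a in attributes if a != best])
    return tree
```

Called, for example, as build_tree(examples, labels, ["outlook", "temp", "humid", "wind"]) on the slide-6 data.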

27 © Daniel S. Weld 27 Movie Recommendation
What features would we use? (Example movies: Rambo, Matrix, Rambo 2.)

28 © Daniel S. Weld 28 Issues
Content vs. social
Non-Boolean attributes
Missing data
Scaling up

29 © Daniel S. Weld 29 Missing Data 1
Day  Temp  Humid   Wind    Tennis?
d1   hot   high    weak    no
d2   hot   high    strong  no
d8   mild  high    weak    no
d9   cool  ?       weak    yes
d11  mild  normal  strong  yes
Options for d9's missing Humid value:
  Don't use this instance for learning?
  Assign the attribute its most common value at the node, or
  its most common value given the classification.

30 © Daniel S. Weld 30 Fractional Values
Treat d9 as 75% high and 25% normal (the proportions observed at this node), and use these fractions in the gain calculations.
Further subdivide if other attributes are also missing.
Use the same approach to classify a test example with a missing attribute: the classification is the most probable one, summing over the leaves into which the example was divided.
Resulting counts: high → [0.75+, 3−], normal → [1.25+, 0−]
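A small sketch (mine, not from the slides) of the fractional-count idea: entropy computed from example weights rather than whole counts, using the [0.75+, 3−] and [1.25+, 0−] figures from the slide.

```python
import math

def weighted_entropy(weight_by_class):
    """Entropy from fractional class weights, e.g. {'yes': 0.75, 'no': 3.0}."""
    total = sum(weight_by_class.values())
    ent = 0.0
    for w in weight_by_class.values():
        if w > 0:
            p = w / total
            ent -= p * math.log2(p)
    return ent

# d9's missing Humid value is split 75% to 'high' and 25% to 'normal'.
high   = {"yes": 0.75, "no": 3.0}    # [0.75+, 3-]
normal = {"yes": 1.25, "no": 0.0}    # [1.25+, 0-]
print(weighted_entropy(high), weighted_entropy(normal))
```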

31 © Daniel S. Weld 31 Non-Boolean Features
Features with multiple discrete values:
  Construct a multi-way split?
  Test for one value vs. all of the others?
  Group values into two disjoint subsets?
Real-valued features:
  Discretize?
  Consider a threshold split using observed values? (See the sketch below.)
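A sketch (mine) of the threshold-split option: candidate thresholds are midpoints between consecutive observed values, and the one with the highest information gain is kept. Names and the example data are assumptions.

```python
import math
from collections import Counter

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold_split(values, labels):
    """Return (threshold, gain) for the best split 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    base = _entropy(labels)
    best_t, best_gain = None, -1.0
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2                        # midpoint between observed values
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        gain = (base
                - len(left) / len(pairs) * _entropy(left)
                - len(right) / len(pairs) * _entropy(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Hypothetical temperatures: splits perfectly at 71.0 with gain 1.0.
print(best_threshold_split([64, 68, 70, 72, 75, 80], ["y", "y", "y", "n", "n", "n"]))
```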

32 © Daniel S. Weld 32 Attributes with Many Values
An attribute with very many values divides the examples into tiny sets.
Each set is likely to be uniform ⇒ high information gain.
But such an attribute is a poor predictor…
We need to penalize these attributes.

33 © Daniel S. Weld 33 One Approach: Gain Ratio
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
SplitInformation ≡ the entropy of S with respect to the values of A
(contrast with Entropy(S), which is with respect to the target value).
This penalizes attributes with many uniformly distributed values:
e.g., if A splits S uniformly into n sets, SplitInformation = log2(n)… = 1 for a Boolean attribute.
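A sketch (mine) of the gain-ratio computation; it assumes the information_gain helper and dict-style examples from the earlier sketches.

```python
import math
from collections import Counter

def split_information(examples, attribute):
    """Entropy of S with respect to the *values* of attribute A."""
    n = len(examples)
    counts = Counter(ex[attribute] for ex in examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(examples, labels, attribute):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    si = split_information(examples, attribute)
    if si == 0:                 # only one observed value: attribute is useless here
        return 0.0
    # information_gain is the helper defined in the earlier sketches.
    return information_gain(examples, labels, attribute) / si
```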

34 © Daniel S. Weld 34 Machine Learning Outline
Machine learning
Supervised learning
Overfitting:
  What is the problem?
  Reduced-error pruning
Ensembles of classifiers

35 © Daniel S. Weld 35 Overfitting
(Figure: accuracy, from about 0.6 to 0.9, as a function of the number of nodes in the decision tree, plotted on the training data and on the test data.)

36 © Daniel S. Weld 36 Overfitting 2 (Figure from W. W. Cohen.)

37 © Daniel S. Weld 37 Overfitting…
A decision tree DT is overfit when there exists another tree DT′ such that DT has smaller error than DT′ on the training examples but larger error than DT′ on the test examples.
Causes of overfitting:
  Noisy data, or
  a training set that is too small.

40 © Daniel S. Weld 40 Effect of Reduced-Error Pruning
(Figure: the accuracy curves with an annotation marking the size to which the tree should be cut back.)
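A rough sketch (mine, not from the slides) of reduced-error pruning on the nested-dict trees produced by the earlier build_tree sketch: each subtree is tentatively replaced by a leaf carrying its majority leaf label, and the replacement is kept only if accuracy on a held-out validation set does not drop. The `default` label and the leaf-label majority (rather than a training-example majority) are simplifying assumptions.

```python
from collections import Counter

def classify(tree, example, default="yes"):
    """Follow a nested-dict tree down to a leaf label."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute].get(example.get(attribute), default)
    return tree

def accuracy(tree, examples, labels):
    return sum(classify(tree, ex) == lab
               for ex, lab in zip(examples, labels)) / len(labels)

def majority_label(tree):
    """Most common leaf label under a (sub)tree."""
    if not isinstance(tree, dict):
        return tree
    leaf_labels, stack = [], [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            stack.extend(next(iter(node.values())).values())
        else:
            leaf_labels.append(node)
    return Counter(leaf_labels).most_common(1)[0][0]

def reduced_error_prune(root, val_examples, val_labels):
    """Bottom-up: keep a pruned node only if validation accuracy does not drop."""
    holder = {"root": root}

    def prune(branches, key):
        node = branches[key]
        if not isinstance(node, dict):
            return
        attribute = next(iter(node))
        for value in list(node[attribute]):
            prune(node[attribute], value)
        before = accuracy(holder["root"], val_examples, val_labels)
        branches[key] = majority_label(node)      # tentatively cut the tree back here
        if accuracy(holder["root"], val_examples, val_labels) < before:
            branches[key] = node                  # pruning hurt: undo it

    prune(holder, "root")
    return holder["root"]
```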

41 © Daniel S. Weld 41 Machine Learning Outline
Machine learning
Supervised learning
Overfitting
Ensembles of classifiers:
  Bagging
  Cross-validated committees
  Boosting

