
1 Decision Tree Learning R&N: Chap. 18, Sect. 18.1–3

2 Predicate as a Decision Tree The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree: test A? (if False, return False); if A is True, test B? (if False, return True); if B is True, test C? (return True if C is True, False otherwise). Example: a mushroom is poisonous iff it is yellow and small, or yellow, big and spotted. Here x is a mushroom, CONCEPT = POISONOUS, A = YELLOW, B = BIG, C = SPOTTED.

3 Predicate as a Decision Tree The same predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) and the same decision tree as above, with two additional observable predicates: D = FUNNEL-CAP and E = BULKY (x is a mushroom, CONCEPT = POISONOUS, A = YELLOW, B = BIG, C = SPOTTED).

4 Training Set
Ex. #  A      B      C      D      E      CONCEPT
1      False  False  True   False  True   False
2      False  True   False  False  False  False
3      False  True   True   True   True   False
4      False  False  True   False  False  False
5      False  False  False  True   True   False
6      True   False  True   False  False  True
7      True   False  False  True   False  True
8      True   False  True   False  True   True
9      True   True   True   False  True   True
10     True   True   True   True   True   True
11     True   True   False  False  False  False
12     True   True   False  False  True   False
13     True   False  True   True   True   True

5 Possible Decision Tree [Figure: another decision tree that is consistent with the training set, testing D at the root and then C, E, B, and A at lower levels]

6 The tree above computes CONCEPT ⇔ (D ∧ (¬E ∨ A)) ∨ (¬D ∧ (C ∧ (B ∨ (¬B ∧ ((E ∧ A) ∨ (¬E ∧ A)))))), whereas the three-node tree A?, B?, C? from the earlier slides computes CONCEPT ⇔ A ∧ (¬B ∨ C).

7 Possible Decision Tree Both trees, CONCEPT ⇔ (D ∧ (¬E ∨ A)) ∨ (¬D ∧ (C ∧ (B ∨ (¬B ∧ ((E ∧ A) ∨ (¬E ∧ A)))))) and CONCEPT ⇔ A ∧ (¬B ∨ C), agree with all examples in the training set. KIS bias → build the smallest decision tree. Doing this optimally is a computationally intractable problem → use a greedy algorithm.

8 Getting Started: Top-Down Induction of Decision Tree The distribution of the training set (shown above) is: True: 6, 7, 8, 9, 10, 13; False: 1, 2, 3, 4, 5, 11, 12.

9 Getting Started: Top-Down Induction of Decision Tree The distribution of the training set is: True: 6, 7, 8, 9, 10, 13; False: 1, 2, 3, 4, 5, 11, 12. Without testing any observable predicate, we could report that CONCEPT is False (majority rule) with an estimated probability of error P(E) = 6/13. Assuming that we will include only one observable predicate in the decision tree, which predicate should we test to minimize the probability of error (i.e., the # of misclassified examples in the training set)? → Greedy algorithm.

10 Assume It's A If we split on A: the A=True branch contains True: 6, 7, 8, 9, 10, 13 and False: 11, 12; the A=False branch contains False: 1, 2, 3, 4, 5. If we test only A, we will report that CONCEPT is True if A is True (majority rule) and False otherwise → the number of misclassified examples from the training set is 2.

11 Assume It's B If we split on B: the B=True branch contains True: 9, 10 and False: 2, 3, 11, 12; the B=False branch contains True: 6, 7, 8, 13 and False: 1, 4, 5. If we test only B, we will report that CONCEPT is False if B is True and True otherwise → the number of misclassified examples from the training set is 5.

12 Assume It's C If we split on C: the C=True branch contains True: 6, 8, 9, 10, 13 and False: 1, 3, 4; the C=False branch contains True: 7 and False: 2, 5, 11, 12. If we test only C, we will report that CONCEPT is True if C is True and False otherwise → the number of misclassified examples from the training set is 4.

13 Assume It's D If we split on D: the D=True branch contains True: 7, 10, 13 and False: 3, 5; the D=False branch contains True: 6, 8, 9 and False: 1, 2, 4, 11, 12. If we test only D, we will report that CONCEPT is True if D is True and False otherwise → the number of misclassified examples from the training set is 5.

14 Assume It's E If we split on E: the E=True branch contains True: 8, 9, 10, 13 and False: 1, 3, 5, 12; the E=False branch contains True: 6, 7 and False: 2, 4, 11. If we test only E, we will report that CONCEPT is False independent of the outcome → the number of misclassified examples from the training set is 6.

15 Assume It's E As above, testing only E misclassifies 6 training examples. Comparing the five candidates (A: 2, B: 5, C: 4, D: 5, E: 6), the best predicate to test is A.
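As a sanity check on those counts, here is a minimal Python sketch (not from the slides; the EXAMPLES constant simply transcribes the training set above) that computes the number of misclassified examples for each single-predicate test under the majority rule:

```python
# Training set from the slides: one tuple (A, B, C, D, E, CONCEPT) per example 1..13.
EXAMPLES = [
    (False, False, True,  False, True,  False),   # 1
    (False, True,  False, False, False, False),   # 2
    (False, True,  True,  True,  True,  False),   # 3
    (False, False, True,  False, False, False),   # 4
    (False, False, False, True,  True,  False),   # 5
    (True,  False, True,  False, False, True),    # 6
    (True,  False, False, True,  False, True),    # 7
    (True,  False, True,  False, True,  True),    # 8
    (True,  True,  True,  False, True,  True),    # 9
    (True,  True,  True,  True,  True,  True),    # 10
    (True,  True,  False, False, False, False),   # 11
    (True,  True,  False, False, True,  False),   # 12
    (True,  False, True,  True,  True,  True),    # 13
]

def errors_if_testing(attr_index):
    """Misclassified training examples when we test one predicate and
    apply the majority rule on each branch."""
    total = 0
    for branch_value in (True, False):
        labels = [ex[-1] for ex in EXAMPLES if ex[attr_index] == branch_value]
        if labels:
            majority = labels.count(True) >= labels.count(False)
            total += sum(1 for label in labels if label != majority)
    return total

for index, name in enumerate("ABCDE"):
    print(name, errors_if_testing(index))   # A: 2, B: 5, C: 4, D: 5, E: 6
```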

16 Choice of Second Predicate Test A at the root and label the A=False branch False. On the A=True branch, test C: the C=True branch contains True: 6, 8, 9, 10, 13 (label it True); the C=False branch contains True: 7 and False: 11, 12 (label it False) → the number of misclassified examples from the training set is 1.

17 Choice of Third Predicate On the A=True, C=False branch, test B: the B=True branch contains False: 11, 12 (label it False); the B=False branch contains True: 7 (label it True). No training example is now misclassified.

18 Final Tree Test A (if False → False); if A is True, test C (if True → True); if C is False, test B (if True → False, if False → True). This tree computes CONCEPT ⇔ A ∧ (C ∨ ¬B), which is equivalent to the target predicate CONCEPT ⇔ A ∧ (¬B ∨ C) represented by the original three-node tree (A?, B?, C?).

19 Top-Down Induction of a DT
DTL(Δ, Predicates)
1. If all examples in Δ are positive then return True
2. If all examples in Δ are negative then return False
3. If Predicates is empty then return failure
4. A ← error-minimizing predicate in Predicates
5. Return the tree whose root is A, whose left branch is DTL(Δ+A, Predicates − {A}), and whose right branch is DTL(Δ−A, Predicates − {A}); here Δ+A is the subset of examples in Δ that satisfy A, and Δ−A the subset that do not.

20 Top-Down Induction of a DT Same procedure as above, with one caveat: when there is noise in the training set, step 3 may return the majority class (majority rule) instead of failure.
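A minimal Python sketch of this procedure under my own representation (examples as (attribute-dict, label) pairs, not the slides' notation); it applies the majority rule from the previous slide when predicates run out rather than returning failure:

```python
def dtl(examples, predicates):
    """Greedy top-down induction of a decision tree.
    examples: list of (attributes_dict, bool_label); predicates: set of attribute names."""
    if not examples:
        return False                      # arbitrary default for an empty branch
    labels = [label for _, label in examples]
    if all(labels):
        return True
    if not any(labels):
        return False
    if not predicates:
        # Noisy training set: fall back on the majority rule instead of failing.
        return labels.count(True) >= labels.count(False)

    def split_errors(p):
        """Misclassified examples if we split on p and use the majority rule."""
        errors = 0
        for value in (True, False):
            branch = [label for attrs, label in examples if attrs[p] == value]
            if branch:
                majority = branch.count(True) >= branch.count(False)
                errors += sum(label != majority for label in branch)
        return errors

    a = min(predicates, key=split_errors)                     # error-minimizing predicate
    satisfy = [(x, lab) for x, lab in examples if x[a]]       # Δ+A: examples satisfying a
    falsify = [(x, lab) for x, lab in examples if not x[a]]   # Δ-A: the rest
    return (a,
            dtl(satisfy, predicates - {a}),
            dtl(falsify, predicates - {a}))
```

Run on the training set above (with each example's attributes placed in a dict), this roots the tree at A and then tests C on the A=True branch, as in the hand construction; ties deeper in the tree may be broken differently.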

21 Comments Widely used algorithm; greedy; robust to noise (incorrect examples); not incremental.

22 Using Information Theory Rather than minimizing the probability of error, many existing learning procedures minimize the expected number of questions needed to decide if an object x satisfies CONCEPT. This minimization is based on a measure of the “quantity of information” contained in the truth value of an observable predicate. See R&N pp. 659–660.
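For concreteness, a small sketch of that measure: Shannon entropy and the information gain of a Boolean attribute (the helper names are my own; the formulas are the standard ones used in R&N):

```python
from math import log2

def entropy(pos, neg):
    """Entropy, in bits, of a Boolean label distribution with pos/neg counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * log2(p)
    return h

def information_gain(examples, attr_index):
    """Expected reduction in entropy from testing one Boolean attribute.
    examples: list of tuples whose last element is the Boolean label."""
    def counts(subset):
        pos = sum(1 for ex in subset if ex[-1])
        return pos, len(subset) - pos
    before = entropy(*counts(examples))
    remainder = 0.0
    for value in (True, False):
        branch = [ex for ex in examples if ex[attr_index] == value]
        if branch:
            remainder += len(branch) / len(examples) * entropy(*counts(branch))
    return before - remainder
```

Choosing the attribute with the highest information gain plays the same role as choosing the error-minimizing predicate in the sketch above.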

23 Miscellaneous Issues Assessing performance: –Training set and test set –Learning curve [Figure: typical learning curve, plotting % correct on the test set (up to 100) against the size of the training set]

24 Miscellaneous Issues Assessing performance: –Training set and test set –Learning curve [Figure: typical learning curve] Some concepts are unrealizable within a machine's capacity.

25 Miscellaneous Issues Assessing performance: –Training set and test set –Learning curve [Figure: typical learning curve] Overfitting: the risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set.

26 Miscellaneous Issues Assessing performance: –Training set and test set –Learning curve Overfitting: the risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set –Tree pruning: terminate the recursion when the number of errors (or the information gain) is small.

27 Miscellaneous Issues Assessing performance: –Training set and test set –Learning curve Overfitting: the risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set –Tree pruning: terminate the recursion when the number of errors (or the information gain) is small. The resulting decision tree plus majority rule may not correctly classify all examples in the training set.
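One way to realize that stopping rule, as a small tweak to the dtl sketch earlier (the min_gain threshold and its default value are my own choices, not from the slides):

```python
def dtl_pruned(examples, predicates, min_gain=1):
    """Like dtl, but stop and return the majority class when the best split
    reduces the number of training errors by less than min_gain."""
    labels = [label for _, label in examples]
    majority = labels.count(True) >= labels.count(False)
    baseline = sum(label != majority for label in labels)   # errors of the majority rule
    if baseline == 0 or not predicates:
        return majority

    def split_errors(p):
        errors = 0
        for value in (True, False):
            branch = [label for attrs, label in examples if attrs[p] == value]
            if branch:
                m = branch.count(True) >= branch.count(False)
                errors += sum(label != m for label in branch)
        return errors

    a = min(predicates, key=split_errors)
    if baseline - split_errors(a) < min_gain:                # gain too small: prune
        return majority
    satisfy = [(x, lab) for x, lab in examples if x[a]]
    falsify = [(x, lab) for x, lab in examples if not x[a]]
    return (a,
            dtl_pruned(satisfy, predicates - {a}, min_gain),
            dtl_pruned(falsify, predicates - {a}, min_gain))
```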

28 Miscellaneous Issues Assessing performance: –Training set and test set –Learning curve Overfitting –Tree pruning Other issues: incorrect examples, missing data, multi-valued and continuous attributes.

29 Continuous Attributes Continuous attributes can be converted into logical (Boolean) ones via thresholds: replace X by a test of the form X < a. When considering a split on X, pick the threshold a that minimizes the number of errors.
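A minimal sketch of that threshold search (my own function, assuming Boolean labels): sort the distinct observed values, try the midpoint between each consecutive pair as a candidate a, and keep the candidate whose two sides, labeled by the majority rule, misclassify the fewest examples.

```python
def best_threshold(values, labels):
    """Return (a, errors): the cut X < a that minimizes training errors."""
    best_errors, best_a = len(values) + 1, None
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        a = (low + high) / 2.0                     # candidate threshold
        errors = 0
        for side in (True, False):                 # side of the test X < a
            branch = [lab for x, lab in zip(values, labels) if (x < a) == side]
            if branch:
                majority = branch.count(True) >= branch.count(False)
                errors += sum(lab != majority for lab in branch)
        if errors < best_errors:
            best_errors, best_a = errors, a
    return best_a, best_errors
```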

30 Learnable Concepts Some simple concepts cannot be represented compactly in DTs: –Parity(x) = X1 xor X2 xor … xor Xn –Majority(x) = 1 if most of the Xi's are 1, 0 otherwise. Such trees are exponential in the # of attributes, and an exponential # of examples is needed to learn them exactly. The ease of learning depends on shrewdly (or luckily) chosen attributes that correlate with CONCEPT.

31 Applications of Decision Trees Medical diagnosis / drug design; evaluation of geological systems for assessing gas and oil basins; early detection of problems (e.g., jamming) during oil drilling operations; automatic generation of rules in expert systems.

32 Human-Readability DTs also have the advantage of being easily understood by humans. This is a legal requirement in many areas: –Loans & mortgages –Health insurance –Welfare

33 Ensemble Learning (Boosting) R&N 18.4

34 Idea It may be difficult to search for a single hypothesis that explains the data. Instead, construct multiple hypotheses (an ensemble) and combine their predictions. “Can a set of weak learners construct a single strong learner?” – Michael Kearns, 1988

35 Motivation Suppose we have 5 classifiers, each with 60% accuracy. On a new example, run them all and pick the prediction by majority voting. If their errors are independent, the majority vote is correct with probability about 68%, and the advantage grows quickly as more (or more accurate) independent classifiers are added. (In reality errors will not be independent, but we hope they will be mostly uncorrelated.)
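That figure is easy to check with an exact binomial computation under the independence assumption (a few lines of Python, not from the slides):

```python
from math import comb

def majority_vote_accuracy(n_classifiers, p_correct):
    """P(the majority of n independent classifiers is correct), for odd n."""
    need = n_classifiers // 2 + 1
    return sum(comb(n_classifiers, k)
               * p_correct ** k * (1 - p_correct) ** (n_classifiers - k)
               for k in range(need, n_classifiers + 1))

print(majority_vote_accuracy(5, 0.6))   # ~0.683: five 60%-accurate classifiers
print(majority_vote_accuracy(5, 0.8))   # ~0.942: five 80%-accurate classifiers
```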

36 Boosting Weighted training set: the same 13 examples (with the same A–E values and CONCEPT labels as in the training set above), but with a weight w_i attached to each example i, for i = 1, …, 13.

37 Boosting Start with equal weights w_i = 1/N. Use learner 1 to generate hypothesis h1. Adjust the weights to give higher importance to misclassified examples. Use learner 2 to generate hypothesis h2. … Weight the hypotheses according to their performance, and return the weighted majority.

38 Mushroom Example “Decision stumps”: single-attribute DTs. Each of the 13 training examples starts with weight 1/13 (attribute values and CONCEPT as in the training set above).

39 Mushroom Example Pick C first; learner 1 learns the hypothesis CONCEPT = C (all weights still 1/13).

40 Mushroom Example The stump CONCEPT = C misclassifies examples 1, 3, 4, and 7, which will be given higher weights in the next step.

41 Mushroom Example Update weights: the misclassified examples 1, 3, 4, and 7 now have weight 0.125 each; the correctly classified examples 2, 5, 6, 8, 9, 10, 11, 12, and 13 have weight 0.056 each.

42 Mushroom Example Next try A; learner 2 learns the hypothesis CONCEPT = A (weights as above).

43 Mushroom Example The stump CONCEPT = A misclassifies only examples 11 and 12.

44 Mushroom Example Update weights: examples 11 and 12 rise to 0.25 each; examples 1, 3, 4, and 7 drop to 0.07 each; the remaining examples drop to 0.03 each.

45 Mushroom Example Next try E (weights as above).

46 Mushroom Example Learner 3 learns the hypothesis CONCEPT = ¬E, which has a lower weighted error than CONCEPT = E under the current weights.

47 Mushroom Example Update the weights again, and continue with further stumps…

48 Mushroom Example Final classifier, built in the order C, A, E, D, B: –Weights on the hypotheses are determined by their overall error –Weighted-majority weights: A = 2.1, ¬B = 0.9, C = 0.8, D = 1.4, ¬E = 0.09 The combined classifier has 100% accuracy on the training set.

49 Boosting Strategies The weighting strategy above is the popular AdaBoost algorithm (see R&N p. 667). Many other strategies exist. Typically, as the number of hypotheses increases, accuracy increases as well. –Does this conflict with Occam's razor?
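A compact AdaBoost-style sketch with decision stumps as the weak learners (my own code, reusing the tuple format of the EXAMPLES list defined earlier; note the slide walkthrough fixes the order C, A, E, D, B for illustration, whereas a learner that always picks the minimum weighted-error stump may visit the attributes in a different order):

```python
from math import log

def adaboost_stumps(examples, n_attrs, rounds):
    """Boost single-attribute decision stumps on Boolean data.
    examples: tuples (attr_0, ..., attr_{n_attrs-1}, label)."""
    n = len(examples)
    weights = [1.0 / n] * n                       # start with equal weights
    ensemble = []                                 # (attr_index, polarity, vote_weight)
    for _ in range(rounds):
        # Weak learner: the stump "predict (attribute == polarity)" with the
        # smallest weighted error under the current weights.
        best = None
        for i in range(n_attrs):
            for polarity in (True, False):
                err = sum(w for w, ex in zip(weights, examples)
                          if (ex[i] == polarity) != ex[-1])
                if best is None or err < best[0]:
                    best = (err, i, polarity)
        err, i, polarity = best
        if err >= 0.5:                            # no stump better than chance
            break
        alpha = 0.5 * log((1 - err) / max(err, 1e-10))   # hypothesis vote weight
        ensemble.append((i, polarity, alpha))
        if err == 0:                              # perfect stump: nothing left to reweight
            break
        # Give higher importance to misclassified examples, then renormalize.
        weights = [w * ((1 - err) if (ex[i] == polarity) != ex[-1] else err)
                   for w, ex in zip(weights, examples)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, attrs):
    """Weighted-majority vote of the stumps on an attribute tuple."""
    score = sum(alpha if attrs[i] == polarity else -alpha
                for i, polarity, alpha in ensemble)
    return score > 0
```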

50 Next Time Neural networks; HW6 due.

