CS B551: DECISION TREES

AGENDA: Decision trees; Complexity; Learning curves; Combatting overfitting; Boosting

RECAP
Still in the supervised setting with logical attributes. Find a representation of CONCEPT in the form CONCEPT(x) ⇔ S(A, B, …), where S(A, B, …) is a sentence built from the observable attributes, e.g. CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)).

PREDICATE AS A DECISION TREE
The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree:
[Tree: test A? (if False, CONCEPT is False); if A is True, test B? (if False, CONCEPT is True); if B is True, test C? (True → True, False → False).]
Example: a mushroom is poisonous iff it is yellow and small, or yellow, big and spotted.
x is a mushroom; CONCEPT = POISONOUS, A = YELLOW, B = BIG, C = SPOTTED.

PREDICATE AS A DECISION TREE
The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree:
[Tree: test A? (if False, CONCEPT is False); if A is True, test B? (if False, CONCEPT is True); if B is True, test C? (True → True, False → False).]
Example: a mushroom is poisonous iff it is yellow and small, or yellow, big and spotted.
x is a mushroom; CONCEPT = POISONOUS, A = YELLOW, B = BIG, C = SPOTTED, D = FUNNEL-CAP, E = BULKY (D and E are additional observable attributes that do not appear in the concept).
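As a concrete illustration (not from the slides), here is a minimal Python sketch showing that the logical formula and the decision-tree traversal compute the same predicate; the function names and the use of Boolean arguments are assumptions.

```python
from itertools import product

def concept_formula(a, b, c):
    """CONCEPT(x) <=> A(x) AND (NOT B(x) OR C(x)), written as a sentence."""
    return a and ((not b) or c)

def concept_tree(a, b, c):
    """The same predicate evaluated as the decision tree: test A, then B, then C."""
    if not a:          # A? -> False branch: not yellow, not poisonous
        return False
    if not b:          # B? -> False branch: yellow and small, poisonous
        return True
    return c           # C? decides for yellow, big mushrooms

# The two representations agree on every input:
assert all(concept_formula(a, b, c) == concept_tree(a, b, c)
           for a, b, c in product([True, False], repeat=3))
```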

TRAINING SET
[Table: 13 training examples (Ex. #1–13), each listing the values of the Boolean attributes A–E and the CONCEPT label.]

POSSIBLE DECISION TREE
[Figure: a larger decision tree consistent with the training set, with root D and internal tests on C, E, B, and A.]

[Figure: the large tree (root D) shown next to the small tree (root A).]
Large tree: CONCEPT ⇔ (D ∧ (¬E ∨ A)) ∨ (¬D ∧ (C ∧ (B ∨ (¬B ∧ ((E ∧ ¬A) ∨ (¬E ∧ A))))))
Small tree (tests A?, B?, C?): CONCEPT ⇔ A ∧ (¬B ∨ C)

POSSIBLE DECISION TREE
[Figure: the large tree (root D) shown next to the small tree (root A).]
Large tree: CONCEPT ⇔ (D ∧ (¬E ∨ A)) ∨ (¬D ∧ (C ∧ (B ∨ (¬B ∧ ((E ∧ ¬A) ∨ (¬E ∧ A))))))
Small tree: CONCEPT ⇔ A ∧ (¬B ∨ C)
KIS bias → build the smallest decision tree. Finding it is a computationally intractable problem → use a greedy algorithm.

TOP-DOWN INDUCTION OF A DT
DTL(Δ, Predicates)
1. If all examples in Δ are positive, then return True
2. If all examples in Δ are negative, then return False
3. If Predicates is empty, then return the majority rule
4. A ← the error-minimizing predicate in Predicates
5. Return the tree whose root is A, whose left branch is DTL(Δ+A, Predicates − {A}), and whose right branch is DTL(Δ−A, Predicates − {A}), where Δ+A and Δ−A are the examples in Δ on which A is true and false, respectively
[Figure: a small example tree with root A and subtrees rooted at C and B.]
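The following is a minimal Python sketch of the DTL procedure above (an illustration, not the slides' code); the data representation — a list of (attribute-dict, Boolean label) pairs — and the handling of empty branches via a default label are assumptions.

```python
from collections import Counter

def majority(examples):
    """The 'majority rule': most common label among the examples."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def split_errors(examples, attr):
    """Training errors made by predicting the majority label on each branch of attr."""
    errors = 0
    for branch_value in (True, False):
        branch = [e for e in examples if e[0][attr] == branch_value]
        if branch:
            maj = majority(branch)
            errors += sum(1 for _, label in branch if label != maj)
    return errors

def DTL(examples, predicates, default=False):
    if not examples:                       # empty branch: fall back to the parent's majority
        return default
    labels = {label for _, label in examples}
    if labels == {True}:                   # 1. all examples positive
        return True
    if labels == {False}:                  # 2. all examples negative
        return False
    if not predicates:                     # 3. no predicates left: majority rule
        return majority(examples)
    A = min(predicates, key=lambda a: split_errors(examples, a))   # 4. error-minimizing predicate
    rest = [p for p in predicates if p != A]
    left  = DTL([e for e in examples if e[0][A]], rest, majority(examples))      # Δ+A
    right = DTL([e for e in examples if not e[0][A]], rest, majority(examples))  # Δ−A
    return (A, left, right)                # 5. tree with root A
```

A returned tree is either a Boolean leaf or a triple (attribute, true-subtree, false-subtree).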

LEARNABLE CONCEPTS
Some simple concepts cannot be represented compactly by decision trees:
Parity(x) = X1 xor X2 xor … xor Xn
Majority(x) = 1 if most of the Xi are 1, and 0 otherwise
These require trees of size exponential in the number of attributes, and an exponential number of examples to learn exactly. The ease of learning depends on shrewdly (or luckily) chosen attributes that correlate with CONCEPT.

PERFORMANCE ISSUES
Assessing performance: training set and test set; learning curve.
[Plot: a typical learning curve — % correct on the test set vs. size of the training set, rising toward 100%.]

PERFORMANCE ISSUES
Assessing performance: training set and test set; learning curve.
[Plot: a typical learning curve — % correct on the test set vs. size of the training set.]
Some concepts are unrealizable within a machine's capacity (the curve levels off below 100%).

PERFORMANCE ISSUES
Assessing performance: training set and test set; learning curve.
Overfitting: the risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set.
[Plot: a typical learning curve — % correct on the test set vs. size of the training set.]
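As an illustration (not from the slides), a learning curve can be measured by training on increasingly large prefixes of the data and scoring each hypothesis on a held-out test set; the learn/predict interface below is an assumed placeholder for any learner, e.g. the DTL sketch above.

```python
import random

def learning_curve(examples, learn, predict, test_fraction=0.25, steps=10):
    """Test-set accuracy as a function of training-set size."""
    examples = list(examples)
    random.shuffle(examples)
    split = int(len(examples) * (1 - test_fraction))
    train, test = examples[:split], examples[split:]
    curve = []
    for k in range(1, steps + 1):
        m = max(1, k * len(train) // steps)
        h = learn(train[:m])                      # hypothesis learned from m training examples
        accuracy = sum(predict(h, x) == y for x, y in test) / len(test)
        curve.append((m, accuracy))               # one point of the learning curve
    return curve
```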

PERFORMANCE ISSUES
Assessing performance: training set and test set; learning curve.
Overfitting: the risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set.
Tree pruning: terminate the recursion when the number of errors (or the information gain) is small.

PERFORMANCE ISSUES
Assessing performance: training set and test set; learning curve.
Overfitting: the risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set.
Tree pruning: terminate the recursion when the number of errors (or the information gain) is small.
The resulting decision tree plus majority rule may not classify all examples in the training set correctly.

PERFORMANCE ISSUES
Assessing performance: training set and test set; learning curve.
Overfitting; tree pruning.
Incorrect examples. Missing data. Multi-valued and continuous attributes.

USING INFORMATION THEORY
Rather than minimizing the probability of error, minimize the expected number of questions needed to decide whether an object x satisfies CONCEPT. Use the information-theoretic quantity known as information gain, and split on the variable with the highest information gain.

ENTROPY / INFORMATION GAIN
Entropy encodes the amount of uncertainty in a random variable:
H(X) = −Σ_{x ∈ Val(X)} P(x) log P(x)
Properties: H(X) = 0 if X is known, i.e. P(x) = 1 for some value x; H(X) > 0 if X is not known with certainty; H(X) is maximal when P(X) is the uniform distribution.
Information gain measures the reduction in uncertainty about X given knowledge of Y:
I(X;Y) = E_y[H(X) − H(X|Y)] = Σ_y P(y) Σ_x [P(x|y) log P(x|y) − P(x) log P(x)]
Properties: always nonnegative; equal to 0 if X and Y are independent. If Y is a choice, maximizing information gain is the same as minimizing E_y[H(X|Y)].
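A small Python sketch of these quantities (an illustration, not from the slides); the (attribute-dict, Boolean label) data representation is an assumption, and logarithms are taken base 2.

```python
import math

def entropy(distribution):
    """H(X) = -sum_x P(x) log2 P(x), with 0 log 0 taken as 0."""
    return -sum(p * math.log2(p) for p in distribution if p > 0)

def information_gain(examples, attr):
    """I(label; attr) = H(label) - sum_v P(attr=v) H(label | attr=v), for a Boolean attribute."""
    def label_entropy(subset):
        if not subset:
            return 0.0
        pos = sum(1 for _, label in subset if label)
        return entropy([pos / len(subset), 1 - pos / len(subset)])
    n = len(examples)
    gain = label_entropy(examples)                    # H(label) before the split
    for value in (True, False):
        branch = [e for e in examples if e[0][attr] == value]
        gain -= (len(branch) / n) * label_entropy(branch)   # expected entropy after the split
    return gain
```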

MAXIMIZING IG / MINIMIZING CONDITIONAL ENTROPY IN DECISION TREES
E_y[H(X|Y)] = −Σ_y P(y) Σ_x P(x|y) log P(x|y)
Let n be the number of examples, let n+ and n− be the numbers of examples on the True/False branches of Y, and let p+ and p− be the accuracies on the True/False branches: P(correct | Y) = p+, P(correct | ¬Y) = p−, so P(correct) = (p+ n+ + p− n−)/n.
Then E_y[H(X|Y)] ∝ −n+ [p+ log p+ + (1−p+) log(1−p+)] − n− [p− log p− + (1−p−) log(1−p−)]
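The expression above, written out as a small helper (an illustration, not from the slides); the names n_true/p_true stand in for n+/p+, and the 1/n normalization is made explicit.

```python
import math

def binary_entropy(p):
    """H(p) for a Boolean variable that is true with probability p."""
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_conditional_entropy(n_true, p_true, n_false, p_false):
    """E_y[H(X|Y)] = (n+ H(p+) + n- H(p-)) / n  -- the quantity a good split minimizes."""
    n = n_true + n_false
    return (n_true * binary_entropy(p_true) + n_false * binary_entropy(p_false)) / n
```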

CONTINUOUS ATTRIBUTES
Continuous attributes can be converted into logical ones via thresholds: X becomes the Boolean test X < a. When considering a split on X, pick the threshold a that minimizes the number of errors (or the entropy).
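A minimal sketch of threshold selection by error count (an illustration, not from the slides); candidate thresholds are taken at midpoints between consecutive distinct sorted values, which is an assumption about the search.

```python
def best_threshold(points):
    """Pick a for the Boolean test X < a that minimizes training errors under the majority rule.
       points: (value of X, Boolean label) pairs."""
    data = sorted(points, key=lambda p: p[0])
    best_err, best_a = float("inf"), None
    for (x1, _), (x2, _) in zip(data, data[1:]):
        if x1 == x2:
            continue
        a = (x1 + x2) / 2                              # midpoint between consecutive values
        left = [label for x, label in data if x < a]
        right = [label for x, label in data if x >= a]
        # errors of majority-rule prediction on each side of the split
        err = (min(left.count(True), left.count(False)) +
               min(right.count(True), right.count(False)))
        if err < best_err:
            best_err, best_a = err, a
    return best_a
```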

MULTI-VALUED ATTRIBUTES
Simple change: consider splits on all values A can take on.
Caveat: the more values A can take on, the more important it may appear to be, even if it is irrelevant. More values mean the dataset is split into smaller example sets when picking attributes, and smaller example sets are more likely to fit spurious noise well.

STATISTICAL METHODS FOR ADDRESSING OVERFITTING / NOISE
There may be few training examples that match the path leading to a deep node in the decision tree, and with a small sample the learner is more susceptible to choosing irrelevant or incorrect attributes.
Idea: make a statistical estimate of predictive power (which increases with larger samples) and prune branches with low predictive power. One such method is chi-squared pruning.

TOP-DOWN DT PRUNING
Consider an inner node X that by itself (majority rule) predicts p examples correctly and n examples incorrectly. At its k leaf nodes, the numbers of correct/incorrect examples are p1/n1, …, pk/nk.
Chi-squared statistical significance test:
Null hypothesis: example labels are randomly chosen with distribution p/(p+n) (X is irrelevant).
Alternative hypothesis: examples are not randomly chosen (X is relevant).
Prune X if the test on X is not statistically significant.

CHI-SQUARED TEST
Let Z = Σ_i [(p_i − p_i′)² / p_i′ + (n_i − n_i′)² / n_i′], where p_i′ = p (p_i + n_i)/(p + n) and n_i′ = n (p_i + n_i)/(p + n) are the expected numbers of true/false examples at leaf node i if the null hypothesis holds.
Z is a statistic approximately drawn from the chi-squared distribution with k degrees of freedom. Look up the p-value of Z from a table, and prune if the p-value exceeds α for some α (usually about 0.05).
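A sketch of this test in Python (an illustration, not from the slides); scipy.stats.chi2 is assumed to be available for the tail probability, and the guards against zero expected counts are an addition.

```python
from scipy.stats import chi2   # assumed dependency for the chi-squared tail probability

def chi_squared_prune(p, n, leaves, alpha=0.05):
    """leaves: list of (p_i, n_i) correct/incorrect counts at the k leaves below node X.
       Returns True if the split at X should be pruned (not statistically significant)."""
    Z = 0.0
    for p_i, n_i in leaves:
        expected_p = p * (p_i + n_i) / (p + n)   # expected positives under the null hypothesis
        expected_n = n * (p_i + n_i) / (p + n)   # expected negatives under the null hypothesis
        if expected_p > 0:
            Z += (p_i - expected_p) ** 2 / expected_p
        if expected_n > 0:
            Z += (n_i - expected_n) ** 2 / expected_n
    p_value = chi2.sf(Z, df=len(leaves))         # degrees of freedom as stated on the slide
    return p_value > alpha                       # prune when the observed split looks like noise
```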

ENSEMBLE LEARNING (BOOSTING)

IDEA
It may be difficult to search for a single hypothesis that explains the data. Instead, construct multiple hypotheses (an ensemble) and combine their predictions.
“Can a set of weak learners construct a single strong learner?” – Michael Kearns, 1988

MOTIVATION
Take 5 classifiers, each with 60% accuracy. On a new example, run them all and pick the prediction by majority vote. If the errors are independent, the combined classifier is correct about 68% of the time, and the benefit grows with the number and accuracy of the classifiers (five independent 80%-accurate classifiers already reach about 94%). In reality the errors will not be independent, but we hope they will be mostly uncorrelated.
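A quick check of the independence argument (an illustration, not from the slides): the probability that a majority of K independent classifiers, each correct with probability acc, gives the right answer (K odd, so there are no ties).

```python
from math import comb

def majority_vote_accuracy(K, acc):
    """P(at least a majority of K independent classifiers are correct)."""
    return sum(comb(K, i) * acc**i * (1 - acc)**(K - i)
               for i in range(K // 2 + 1, K + 1))

print(majority_vote_accuracy(5, 0.6))   # ≈ 0.68 for five 60%-accurate classifiers
print(majority_vote_accuracy(5, 0.8))   # ≈ 0.94 for five 80%-accurate classifiers
```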

BOOSTING
Main idea: if learner 1 fails to learn an example correctly, that example is more important for learner 2; if learners 1 and 2 both fail on an example, it is more important for learner 3; and so on. This is implemented with a weighted training set, where the weights encode importance.

BOOSTING
Weighted training set.
[Table: the 13 training examples from before, now each carrying a weight w1, …, w13 alongside attributes A–E and the CONCEPT label.]

BOOSTING
Start with uniform weights w_i = 1/N.
Use learner 1 to generate hypothesis h1.
Adjust the weights to give higher importance to misclassified examples.
Use learner 2 to generate hypothesis h2.
…
Weight the hypotheses according to their performance, and return the weighted majority.
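A compact sketch of this loop in the AdaBoost style (the exact reweighting formula the slides defer to R&N); decision stumps over Boolean attributes serve as the weak learners, and the (attribute-dict, Boolean label) data representation is an assumption.

```python
import math

def best_stump(examples, weights, attrs):
    """Weak learner: the single-attribute test (attr, polarity) with least weighted error."""
    best = None
    for attr in attrs:
        for polarity in (True, False):                  # stump predicts (x[attr] == polarity)
            err = sum(w for (x, y), w in zip(examples, weights)
                      if (x[attr] == polarity) != y)
            if best is None or err < best[0]:
                best = (err, attr, polarity)
    return best                                         # (weighted error, attr, polarity)

def adaboost(examples, attrs, rounds):
    N = len(examples)
    w = [1.0 / N] * N                                   # start with uniform weights
    ensemble = []
    for _ in range(rounds):
        err, attr, pol = best_stump(examples, w, attrs)
        err = max(err, 1e-9)                            # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)         # hypothesis weight
        ensemble.append((alpha, attr, pol))
        # increase weights of misclassified examples, decrease the rest, then renormalize
        for i, (x, y) in enumerate(examples):
            wrong = (x[attr] == pol) != y
            w[i] *= math.exp(alpha if wrong else -alpha)
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted majority vote of the stumps."""
    score = sum(alpha if x[attr] == pol else -alpha for alpha, attr, pol in ensemble)
    return score > 0
```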

MUSHROOM EXAMPLE
“Decision stumps”: single-attribute decision trees.
[Table: the 13 mushroom examples, each with initial weight 1/13.]

MUSHROOM EXAMPLE
Pick C first; learner 1 learns CONCEPT = C.
[Table: the same examples with uniform weights 1/13.]

MUSHROOM EXAMPLE
Update the weights (the precise formula is given in R&N).
[Table: the examples misclassified by C now have weight .125; the correctly classified ones have weight .056.]

MUSHROOM EXAMPLE
Next try A; learner 2 learns CONCEPT = A.
[Table: the same weighted examples (.125 / .056).]

MUSHROOM EXAMPLE
Update the weights.
[Table: the weights are now 0.07 and 0.03.]

MUSHROOM EXAMPLE
Next try E; learner 3 learns CONCEPT = ¬E.
[Table: the same weighted examples (0.07 / 0.03).]

MUSHROOM EXAMPLE
Update the weights…
[Table: the weights after the third update.]

MUSHROOM EXAMPLE
Final classifier, built in the order C, A, E, D, B. The weights on the hypotheses are determined by their overall error. Weighted-majority weights: A = 2.1, ¬B = 0.9, C = 0.8, D = 1.4, ¬E = …; the ensemble reaches …% accuracy on the training set.

BOOSTING STRATEGIES
The preceding weighting strategy is the popular AdaBoost algorithm (see R&N, p. 667). Many other strategies exist. Typically, as the number of hypotheses increases, accuracy increases as well. Does this conflict with Occam's razor?

ANNOUNCEMENTS
Next class: neural networks and function learning (R&N …).