
CS B351: Decision Trees

Agenda
Decision trees
Learning curves
Combatting overfitting

Classification Tasks
Supervised learning setting
The target function f(x) takes on values True and False
An example is positive if f is True, else it is negative
The set X of all possible examples is the example set
The training set is a subset of X (usually a small one!)

Logical Classification Dataset
Here, examples (x, f(x)) take on discrete values.

Logical Classification Dataset
Here, examples (x, f(x)) take on discrete values; the target column is the concept.
Note that the training set does not say whether an observable predicate is pertinent or not.

Logical Classification Task
Find a representation of CONCEPT in the form CONCEPT(x) ⇔ S(A, B, …), where S(A, B, …) is a sentence built with the observable attributes, e.g.:
CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x))

Predicate as a Decision Tree
The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree:

A?
  False → False
  True → B?
    False → True
    True → C?
      False → False
      True → True

Example: a mushroom is poisonous iff it is yellow and small, or yellow, big and spotted.
x is a mushroom; CONCEPT = POISONOUS; A = YELLOW; B = BIG; C = SPOTTED

Predicate as a Decision Tree
(Same predicate and tree as above; two more observable attributes are introduced: D = FUNNEL-CAP, E = BULKY.)
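To make the correspondence concrete, here is a small Python sketch (not from the original slides) that writes CONCEPT both as the logical sentence and as the nested tests of the tree above, and checks that the two agree on every assignment of the attributes.

```python
from itertools import product

def concept_formula(a, b, c):
    """CONCEPT as a logical sentence over the observable attributes."""
    return a and ((not b) or c)

def concept_tree(a, b, c):
    """The same concept, evaluated the way the decision tree above evaluates it."""
    if not a:          # A?
        return False
    if not b:          # B?
        return True
    return c           # C?

# The formula and the tree agree on all 8 attribute assignments.
assert all(concept_formula(a, b, c) == concept_tree(a, b, c)
           for a, b, c in product([True, False], repeat=3))
```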

Training Set

Ex. #   A      B      C      D      E      CONCEPT
1       False  False  True   False  True   False
2       False  True   False  False  False  False
3       False  True   True   True   True   False
4       False  False  True   False  False  False
5       False  False  False  True   True   False
6       True   False  True   False  False  True
7       True   False  False  True   False  True
8       True   False  True   False  True   True
9       True   True   True   False  True   True
10      True   True   True   True   True   True
11      True   True   False  False  False  False
12      True   True   False  False  True   False
13      True   False  True   True   True   True
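For the later sketches it helps to have this training set in code. The following Python dictionary transcribes the table above (one entry per example; the five Boolean attributes A–E followed by the CONCEPT label).

```python
# Training set transcribed from the table above: (A, B, C, D, E, CONCEPT).
TRAINING_SET = {
     1: (False, False, True,  False, True,  False),
     2: (False, True,  False, False, False, False),
     3: (False, True,  True,  True,  True,  False),
     4: (False, False, True,  False, False, False),
     5: (False, False, False, True,  True,  False),
     6: (True,  False, True,  False, False, True),
     7: (True,  False, False, True,  False, True),
     8: (True,  False, True,  False, True,  True),
     9: (True,  True,  True,  False, True,  True),
    10: (True,  True,  True,  True,  True,  True),
    11: (True,  True,  False, False, False, False),
    12: (True,  True,  False, False, True,  False),
    13: (True,  False, True,  True,  True,  True),
}
ATTRIBUTES = ["A", "B", "C", "D", "E"]
```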

Possible Decision Tree
(Figure: a large decision tree testing D, C, E, B, and A that fits all 13 training examples.)

The large tree corresponds to
CONCEPT ⇔ (D ∧ (¬E ∨ A)) ∨ (¬D ∧ (C ∧ (B ∨ (¬B ∧ ((E ∧ A) ∨ (¬E ∧ A))))))
whereas the small tree (A?, B?, C?) corresponds to
CONCEPT ⇔ A ∧ (¬B ∨ C)

Possible Decision Tree
Both trees fit the training set, but the KIS (keep it simple) bias says to build the smallest decision tree.
Finding the smallest tree is a computationally intractable problem → use a greedy algorithm.

Getting Started: Top-Down Induction of Decision Tree
(Training set as tabulated above.)
The distribution of the training set is:
True: 6, 7, 8, 9, 10, 13
False: 1, 2, 3, 4, 5, 11, 12

Getting Started: Top-Down Induction of Decision Tree
Without testing any observable predicate, we could report that CONCEPT is False (majority rule), with an estimated probability of error P(E) = 6/13.
Assuming that we will include only one observable predicate in the decision tree, which predicate should we test to minimize the probability of error (i.e., the number of misclassified examples in the training set)? → Greedy algorithm
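The following Python sketch (using TRAINING_SET and ATTRIBUTES from the earlier sketch) carries out exactly this greedy first step: split on each attribute alone, label each branch by majority rule, and count the misclassified training examples. The counts it prints match the per-attribute slides that follow.

```python
def misclassified_if_split_on(attr_index, examples):
    """Number of training errors if we test only this attribute and use majority rule."""
    errors = 0
    for branch_value in (True, False):
        labels = [concept for *attrs, concept in examples.values()
                  if attrs[attr_index] == branch_value]
        if labels:
            majority = labels.count(True) >= labels.count(False)
            errors += sum(1 for y in labels if y != majority)
    return errors

for i, name in enumerate(ATTRIBUTES):
    print(name, misclassified_if_split_on(i, TRAINING_SET))
# Prints: A 2, B 5, C 4, D 5, E 6  -> A is the best first test
```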

Assume It's A
A = True branch — CONCEPT True: 6, 7, 8, 9, 10, 13; CONCEPT False: 11, 12.
A = False branch — CONCEPT False: 1, 2, 3, 4, 5.
If we test only A, we will report that CONCEPT is True if A is True (majority rule) and False otherwise → the number of misclassified examples from the training set is 2.

Assume It's B
B = True branch — CONCEPT True: 9, 10; CONCEPT False: 2, 3, 11, 12.
B = False branch — CONCEPT True: 6, 7, 8, 13; CONCEPT False: 1, 4, 5.
If we test only B, we will report that CONCEPT is False if B is True and True otherwise → the number of misclassified examples from the training set is 5.

Assume It's C
C = True branch — CONCEPT True: 6, 8, 9, 10, 13; CONCEPT False: 1, 3, 4.
C = False branch — CONCEPT True: 7; CONCEPT False: 2, 5, 11, 12.
If we test only C, we will report that CONCEPT is True if C is True and False otherwise → the number of misclassified examples from the training set is 4.

Assume It's D
D = True branch — CONCEPT True: 7, 10, 13; CONCEPT False: 3, 5.
D = False branch — CONCEPT True: 6, 8, 9; CONCEPT False: 1, 2, 4, 11, 12.
If we test only D, we will report that CONCEPT is True if D is True and False otherwise → the number of misclassified examples from the training set is 5.

Assume It's E
E = True branch — CONCEPT True: 8, 9, 10, 13; CONCEPT False: 1, 3, 5, 12.
E = False branch — CONCEPT True: 6, 7; CONCEPT False: 2, 4, 11.
If we test only E, we will report that CONCEPT is False, independent of the outcome → the number of misclassified examples from the training set is 6.

So, the best predicate to test first is A.

Choice of Second Predicate
Having tested A (A = False → CONCEPT False), we next test C in the A = True branch:
C = True branch — CONCEPT True: 6, 8, 9, 10, 13.
C = False branch — CONCEPT True: 7; CONCEPT False: 11, 12.
→ The number of misclassified examples from the training set is 1.

Choice of Third Predicate
In the A = True, C = False branch we test B:
B = True branch — CONCEPT False: 11, 12.
B = False branch — CONCEPT True: 7.

Final Tree

A?
  False → False
  True → C?
    True → True
    False → B?
      True → False
      False → True

CONCEPT ⇔ A ∧ (C ∨ ¬B), i.e., CONCEPT ⇔ A ∧ (¬B ∨ C)

Top-Down Induction of a DT
DTL(Δ, Predicates)
1. If all examples in Δ are positive then return True
2. If all examples in Δ are negative then return False
3. If Predicates is empty then return failure
4. A ← error-minimizing predicate in Predicates
5. Return the tree whose root is A, whose left branch is DTL(Δ+A, Predicates − A), and whose right branch is DTL(Δ−A, Predicates − A)
(Δ+A is the subset of examples in Δ that satisfy A; Δ−A is the subset that do not.)

Top-Down Induction of a DT
Noise in the training set! In step 3 we may return the majority rule instead of failure.

Top-Down Induction of a DT
DTL(Δ, Predicates)
1. If all examples in Δ are positive then return True
2. If all examples in Δ are negative then return False
3. If Predicates is empty then return the majority rule
4. A ← error-minimizing predicate in Predicates
5. Return the tree whose root is A, whose left branch is DTL(Δ+A, Predicates − A), and whose right branch is DTL(Δ−A, Predicates − A)
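A minimal Python sketch of DTL as stated above (greedy, error-minimizing splits, and the majority rule when Predicates runs out). Examples are represented as (attribute-dict, label) pairs; the function names and the tuple encoding of the tree are illustrative choices, not part of the original slides.

```python
def majority(examples):
    labels = [label for _, label in examples]
    return labels.count(True) >= labels.count(False)

def split_error(examples, pred):
    """Training errors if we split on pred and label each branch by majority rule."""
    err = 0
    for value in (True, False):
        branch = [(x, y) for x, y in examples if x[pred] == value]
        if branch:
            m = majority(branch)
            err += sum(1 for _, y in branch if y != m)
    return err

def dtl(examples, predicates):
    labels = [y for _, y in examples]
    if all(labels):                      # step 1: all positive
        return True
    if not any(labels):                  # step 2: all negative
        return False
    if not predicates:                   # step 3: majority rule instead of failure
        return majority(examples)
    a = min(predicates, key=lambda p: split_error(examples, p))   # step 4: greedy choice
    pos = [(x, y) for x, y in examples if x[a]]
    neg = [(x, y) for x, y in examples if not x[a]]
    if not pos or not neg:               # degenerate split: stop with the majority rule
        return majority(examples)
    rest = [p for p in predicates if p != a]
    return (a, dtl(pos, rest), dtl(neg, rest))   # (root, A-true branch, A-false branch)
```

Run on the training set defined earlier, it reproduces the final tree from the previous slides:

```python
examples = [({name: value for name, value in zip(ATTRIBUTES, attrs)}, label)
            for *attrs, label in TRAINING_SET.values()]
print(dtl(examples, ATTRIBUTES))   # ('A', ('C', True, ('B', False, True)), False)
```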

Comments
Widely used algorithm
Easy to extend to k-class classification
Greedy
Robust to noise (incorrect examples)
Not incremental

Human-Readability
Decision trees also have the advantage of being easily understood by humans.
This is a legal requirement in many areas: loans & mortgages, health insurance, welfare.

Learnable Concepts
Some simple concepts cannot be represented compactly in decision trees:
Parity(x) = x1 xor x2 xor … xor xn
Majority(x) = 1 if most of the xi's are 1, 0 otherwise
These require trees of exponential size in the number of attributes, and an exponential number of examples to learn exactly.
The ease of learning depends on shrewdly (or luckily) chosen attributes that correlate with CONCEPT.
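For concreteness, both concepts are one-line Boolean functions; the point is that, although trivial to state, their decision trees are exponentially large in the number of attributes.

```python
def parity(x):
    """True iff an odd number of the attributes x1..xn are 1."""
    return sum(x) % 2 == 1

def majority_concept(x):
    """True iff more than half of the attributes x1..xn are 1."""
    return sum(x) > len(x) / 2
```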

Performance Issues
Assessing performance: training set and test set; learning curve.
(Figure: typical learning curve — % correct on the test set versus size of the training set.)
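A minimal sketch of how such a learning curve is typically produced: train on increasingly large prefixes of the training data and measure accuracy on a held-out test set. It assumes scikit-learn and matplotlib are available and uses a synthetic dataset purely as a stand-in for real data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; any feature matrix X and label vector y would do.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

sizes = np.linspace(20, len(X_train), 15, dtype=int)
scores = [DecisionTreeClassifier(random_state=0)
          .fit(X_train[:n], y_train[:n])
          .score(X_test, y_test) for n in sizes]

plt.plot(sizes, scores)
plt.xlabel("size of training set")
plt.ylabel("% correct on test set")
plt.title("Typical learning curve")
plt.show()
```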

Performance Issues
Some concepts are unrealizable within a machine's capacity, so the learning curve may plateau below 100%.

Performance Issues
Overfitting: the risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set.

Performance Issues
Tree pruning: terminate the recursion when the number of errors (or the information gain) is small.

Performance Issues
The resulting decision tree plus majority rule may not classify all examples in the training set correctly.

Statistical Methods for Addressing Overfitting / Noise
There may be few training examples that match the path leading to a deep node in the decision tree, making the algorithm more susceptible to choosing irrelevant or incorrect attributes when the sample is small.
Idea: make a statistical estimate of predictive power (which increases with larger samples) and prune branches with low predictive power.
Chi-squared pruning.

Top-Down DT Pruning
Consider an inner node X that by itself (majority rule) predicts p examples correctly and n examples incorrectly.
At its k leaf nodes, the numbers of correct/incorrect examples are p_1/n_1, …, p_k/n_k.
Chi-squared statistical significance test:
Null hypothesis: example labels are randomly chosen with distribution p/(p+n) (X is irrelevant).
Alternative hypothesis: examples are not randomly chosen (X is relevant).
Prune X if testing X is not statistically significant.

Chi-Squared Test
Let Z = Σ_i [ (p_i − p_i′)² / p_i′ + (n_i − n_i′)² / n_i′ ], where p_i′ = p·(p_i + n_i)/(p + n) and n_i′ = n·(p_i + n_i)/(p + n) are the expected numbers of true/false examples at leaf node i if the null hypothesis holds.
Z is a statistic that is approximately drawn from the chi-squared distribution with k − 1 degrees of freedom.
Look up the p-value of Z from a table, and prune if the p-value > α for some α (usually ≈ 0.05).
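A small Python sketch of this test for one inner node X, assuming SciPy is available. Here p and n are X's own counts of positive/negative examples and leaf_counts lists (p_i, n_i) for its k children; an inner node has p > 0 and n > 0, so the expected counts are nonzero.

```python
from scipy.stats import chi2

def chi_squared_prune(p, n, leaf_counts, alpha=0.05):
    """Return True if the split below X should be pruned (i.e., it is not significant)."""
    z = 0.0
    for p_i, n_i in leaf_counts:
        expected_p = p * (p_i + n_i) / (p + n)   # expected positives under the null hypothesis
        expected_n = n * (p_i + n_i) / (p + n)   # expected negatives under the null hypothesis
        z += (p_i - expected_p) ** 2 / expected_p + (n_i - expected_n) ** 2 / expected_n
    dof = len(leaf_counts) - 1                   # k - 1 degrees of freedom
    p_value = chi2.sf(z, dof)                    # P(chi-squared >= z)
    return p_value > alpha                       # deviation not significant -> prune X
```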

Performance Issues
Other practical issues: incorrect examples, missing data, multi-valued and continuous attributes.

Multi-Valued Attributes
Simple change: consider splits on all values A can take on.
Caveat: the more values A can take on, the more important it may appear to be, even if it is irrelevant.
More values → the dataset is split into smaller example sets when picking attributes; smaller example sets → more likely to fit spurious noise well.

Continuous Attributes
Continuous attributes can be converted into logical ones via thresholds: X → (X < a).
When considering splitting on X, pick the threshold a that minimizes the number of errors (or the entropy).
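A minimal sketch of threshold selection by misclassification count (entropy would be the usual alternative): sort the values and try the midpoints between consecutive distinct values.

```python
def best_threshold(values, labels):
    """Return (threshold a, training errors) for the best split X < a vs. X >= a."""
    pairs = sorted(zip(values, labels))
    best = (None, len(labels) + 1)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        a = (v1 + v2) / 2                      # candidate threshold: midpoint
        errors = 0
        for side in ([y for x, y in pairs if x < a], [y for x, y in pairs if x >= a]):
            if side:
                maj = side.count(True) >= side.count(False)
                errors += sum(1 for y in side if y != maj)
        if errors < best[1]:
            best = (a, errors)
    return best

# Hypothetical data: best_threshold([1.0, 2.0, 3.5, 4.0, 8.0],
#                                   [False, False, True, True, True]) -> (2.75, 0)
```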

Decision Boundaries
With continuous attributes, a decision boundary is the surface in example space that separates positive from negative examples.
(Figure: a single test x1 ≥ 20 splits the (x1, x2) plane with a vertical line.)

Decision Boundaries
(Figure: a second test, x2 ≥ 10, is added below the x1 ≥ 20 split, carving the plane into axis-aligned regions.)

Decision Boundaries
(Figure: a third test, x2 ≥ 15, is added, so the boundary becomes a union of horizontal and vertical segments.)
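As an illustration only (the exact branch arrangement in the figures is not specified here), a hypothetical depth-2 tree over two continuous attributes using thresholds like those above; each root-to-leaf path carves out an axis-aligned rectangle, so the decision boundary consists of horizontal and vertical segments.

```python
def classify(x1: float, x2: float) -> bool:
    """Hypothetical tree with thresholds x1 >= 20, x2 >= 15, x2 >= 10."""
    if x1 >= 20:
        return x2 >= 15    # right half-plane, refined by one horizontal split
    else:
        return x2 >= 10    # left half-plane, refined by a different horizontal split
```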


Exercise
With 2 attributes, what kinds of decision boundaries can be achieved by a decision tree with arbitrary splitting thresholds and maximum depth 1? 2? 3?
Describe the appearance and the complexity of these decision boundaries.

Reading
Next class: neural networks & function learning. R&N.