Decision Trees
Advanced Statistical Methods in NLP (Ling572)
January 10, 2012

Information Gain
InfoGain(S,A): the expected reduction in entropy of S from splitting on attribute A.
Select the attribute A with maximum InfoGain, i.e., the one resulting in the lowest average entropy across its branches.
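In standard notation (not taken verbatim from the slide), with S_a denoting the subset of S where A takes value a, this definition can be written as:

```latex
% Standard definition of information gain for a set S and attribute A.
% H(S) is the class entropy of S; S_a is the subset of S with A = a.
\[
\mathrm{InfoGain}(S, A) \;=\; H(S) \;-\; \sum_{a \in \mathrm{Values}(A)} \frac{|S_a|}{|S|}\, H(S_a),
\qquad
H(S) \;=\; -\sum_{c} p_c \log_2 p_c .
\]
```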

Computing Average Entropy
With |S| instances split across branches, the average entropy weights the disorder (class entropy) of each branch i by the fraction of samples sent down branch i. The slide's two-branch, two-class illustration labels the branch counts S_{a1,a} and S_{a1,b} for branch 1 and S_{a2,a} and S_{a2,b} for branch 2.
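A minimal sketch of this computation in Python (function names are illustrative, not from the course materials):

```python
import math
from collections import Counter

def entropy(labels):
    """Class entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def average_entropy(branches):
    """Weighted average entropy over branches.

    branches: list of label lists, one per branch of the split.
    """
    total = sum(len(b) for b in branches)
    return sum((len(b) / total) * entropy(b) for b in branches if b)

# Two-branch, two-class example: branch 1 = [2 a, 2 b], branch 2 = [0 a, 4 b]
print(average_entropy([["a", "a", "b", "b"], ["b", "b", "b", "b"]]))  # 0.5
```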

Sunburn Example
The eight training instances:
Name    Hair    Height   Weight   Lotion  Result
Sarah   Blonde  Average  Light    No      Sunburned (B)
Dana    Blonde  Tall     Average  Yes     None (N)
Alex    Brown   Short    Average  Yes     None (N)
Annie   Blonde  Short    Average  No      Sunburned (B)
Emily   Red     Average  Heavy    No      Sunburned (B)
Pete    Brown   Tall     Heavy    No      None (N)
John    Brown   Average  Heavy    No      None (N)
Katie   Blonde  Short    Light    Yes     None (N)

Picking a Test
Hair Color: Blonde → Sarah:B, Dana:N, Annie:B, Katie:N; Red → Emily:B; Brown → Alex:N, Pete:N, John:N
Height: Short → Alex:N, Annie:B, Katie:N; Average → Sarah:B, Emily:B, John:N; Tall → Dana:N, Pete:N
Weight: Light → Sarah:B, Katie:N; Average → Dana:N, Alex:N, Annie:B; Heavy → Emily:B, Pete:N, John:N
Lotion: No → Sarah:B, Annie:B, Emily:B, Pete:N, John:N; Yes → Dana:N, Alex:N, Katie:N

Entropy in Sunburn Example
S = [3B, 5N], so H(S) = -(3/8) log(3/8) - (5/8) log(5/8) ≈ 0.954 (log base 2 throughout)
Average entropy of each candidate split:
Hair color = 4/8·(-2/4 log 2/4 - 2/4 log 2/4) + 1/8·0 + 3/8·0 = 0.5
Height = 3/8·(-1/3 log 1/3 - 2/3 log 2/3) + 3/8·(-2/3 log 2/3 - 1/3 log 1/3) + 2/8·0 ≈ 0.69
Weight = 2/8·(-1/2 log 1/2 - 1/2 log 1/2) + 3/8·(-1/3 log 1/3 - 2/3 log 2/3) + 3/8·(-1/3 log 1/3 - 2/3 log 2/3) ≈ 0.94
Lotion = 5/8·(-3/5 log 3/5 - 2/5 log 2/5) + 3/8·0 ≈ 0.61
Hair color yields the lowest average entropy (highest InfoGain), so it becomes the root test.
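For readers who want to check these numbers, here is a self-contained sketch that recomputes the four average entropies from the eight instances (the data encoding and helper functions are illustrative, not from the slides):

```python
import math
from collections import Counter, defaultdict

# (hair, height, weight, lotion, label) for the eight sunburn instances
DATA = [
    ("Blonde", "Average", "Light",   "No",  "B"),  # Sarah
    ("Blonde", "Tall",    "Average", "Yes", "N"),  # Dana
    ("Brown",  "Short",   "Average", "Yes", "N"),  # Alex
    ("Blonde", "Short",   "Average", "No",  "B"),  # Annie
    ("Red",    "Average", "Heavy",   "No",  "B"),  # Emily
    ("Brown",  "Tall",    "Heavy",   "No",  "N"),  # Pete
    ("Brown",  "Average", "Heavy",   "No",  "N"),  # John
    ("Blonde", "Short",   "Light",   "Yes", "N"),  # Katie
]
FEATURES = ["Hair", "Height", "Weight", "Lotion"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def average_entropy(data, feature_index):
    branches = defaultdict(list)
    for row in data:
        branches[row[feature_index]].append(row[-1])
    n = len(data)
    return sum(len(lab) / n * entropy(lab) for lab in branches.values())

for i, name in enumerate(FEATURES):
    print(f"{name}: {average_entropy(DATA, i):.2f}")
# Expected: Hair 0.50, Height 0.69, Weight 0.94, Lotion 0.61
```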

Picking a Test (Blonde branch: Sarah:B, Dana:N, Annie:B, Katie:N)
Height: Short → Annie:B, Katie:N; Average → Sarah:B; Tall → Dana:N
Weight: Light → Sarah:B, Katie:N; Average → Dana:N, Annie:B; Heavy → (none)
Lotion: No → Sarah:B, Annie:B; Yes → Dana:N, Katie:N

Entropy in Sunburn Example
Remaining (Blonde) subset: S = [2B, 2N], so H(S) = 1.
InfoGain for each remaining feature:
Height = 1 - (2/4·(-1/2 log 1/2 - 1/2 log 1/2) + 1/4·0 + 1/4·0) = 1 - 0.5 = 0.5
Weight = 1 - (2/4·(-1/2 log 1/2 - 1/2 log 1/2) + 2/4·(-1/2 log 1/2 - 1/2 log 1/2)) = 1 - 1 = 0
Lotion = 1 - 0 = 1
Lotion gives the highest gain, and its branches are pure, so the tree is complete.

Building Decision Trees with Information Gain
Until there are no inhomogeneous leaves:
  Select an inhomogeneous leaf node.
  Replace that leaf node with a test node whose subsets yield the highest information gain.
This effectively carves the feature space into a set of rectangular regions, repeatedly drawing axis-parallel splits.
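A compact sketch of this greedy loop for discrete-valued features, using the entropy and information-gain definitions above (the dict-based tree representation and function names are illustrative assumptions, not the course's implementation):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feature):
    """Reduction in entropy from splitting (rows, labels) on a discrete feature."""
    branches = defaultdict(list)
    for row, y in zip(rows, labels):
        branches[row[feature]].append(y)
    n = len(labels)
    remainder = sum(len(b) / n * entropy(b) for b in branches.values())
    return entropy(labels) - remainder

def build_tree(rows, labels, features):
    """Greedy top-down induction: split on the highest-gain feature until leaves are pure."""
    # Homogeneous leaf (or nothing left to test): predict the majority class.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(rows, labels, f))
    node = {"test": best, "branches": {}}
    groups = defaultdict(lambda: ([], []))
    for row, y in zip(rows, labels):
        groups[row[best]][0].append(row)
        groups[row[best]][1].append(y)
    remaining = [f for f in features if f != best]
    for value, (sub_rows, sub_labels) in groups.items():
        node["branches"][value] = build_tree(sub_rows, sub_labels, remaining)
    return node

# Toy usage with sunburn-style data (dicts of feature -> value):
rows = [
    {"Hair": "Blonde", "Lotion": "No"},  {"Hair": "Blonde", "Lotion": "Yes"},
    {"Hair": "Red",    "Lotion": "No"},  {"Hair": "Brown",  "Lotion": "No"},
]
labels = ["B", "N", "B", "N"]
print(build_tree(rows, labels, ["Hair", "Lotion"]))
# Splits first on Hair, then on Lotion within the Blonde branch.
```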

Alternate Measures
Issue with Information Gain: it favors features with many values.
Option: Gain Ratio, which normalizes InfoGain by the entropy of the split itself (S_a: the elements of S with value A = a).
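The standard C4.5-style definition (presumably what the slide's formula showed) is:

```latex
% Gain ratio: information gain normalized by split information.
% S_a is the subset of S with A = a.
\[
\mathrm{GainRatio}(S, A) \;=\; \frac{\mathrm{InfoGain}(S, A)}{\mathrm{SplitInfo}(S, A)},
\qquad
\mathrm{SplitInfo}(S, A) \;=\; -\sum_{a \in \mathrm{Values}(A)} \frac{|S_a|}{|S|}\,\log_2 \frac{|S_a|}{|S|}.
\]
```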

Overfitting
Overfitting: the model fits the training data TOO well, fitting noise and irrelevant details.
Why is this bad? It harms generalization: the model fits the training data very well but fits new data badly.
Formally, for a model m, compare training_error(m) with D_error(m), the error on the full data D.
m overfits if there is another model m' with training_error(m) < training_error(m') but D_error(m) > D_error(m').

Avoiding Overfitting
Strategies to avoid overfitting:
Early stopping:
  Stop when InfoGain < threshold
  Stop when the number of instances < threshold
  Stop when tree depth > threshold
Post-pruning: grow the full tree, then remove branches.
Which is better? Unclear; both are used. For some applications, post-pruning works better.
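A minimal sketch of how such early-stopping checks might be wired into a recursive builder like the one sketched above (threshold names and default values are illustrative assumptions):

```python
def should_stop(info_gain_value, n_instances, depth,
                min_gain=1e-3, min_instances=5, max_depth=20):
    """Early-stopping test: True if tree growth should halt at this node."""
    return (info_gain_value < min_gain
            or n_instances < min_instances
            or depth > max_depth)

# e.g., inside build_tree: if should_stop(best_gain, len(labels), depth): make a leaf
print(should_stop(0.0005, 100, 3))   # True: gain below threshold
print(should_stop(0.2, 100, 3))      # False: keep splitting
```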

Post-Pruning
Divide the data into:
  Training set: used to build the original tree.
  Validation set: used to guide pruning.
Build the decision tree from the training data.
Then, until pruning no longer helps, compute validation-set performance for pruning each node (and its children), and greedily remove nodes whose removal does not reduce validation-set performance.
This yields a smaller tree with the best validation performance.
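A sketch of this greedy, validation-driven pruning loop, assuming the dict-based tree representation from the earlier sketches plus a "majority" label stored at each internal node (all names here are illustrative, not the course's code):

```python
# Tree format: a leaf is a class label; an internal node is
# {"test": feature, "branches": {value: subtree}, "majority": label, "pruned": bool}.

def classify(tree, row):
    while isinstance(tree, dict) and not tree.get("pruned"):
        tree = tree["branches"].get(row.get(tree["test"]), tree["majority"])
    return tree["majority"] if isinstance(tree, dict) else tree

def accuracy(tree, rows, labels):
    return sum(classify(tree, r) == y for r, y in zip(rows, labels)) / len(labels)

def prune(tree, val_rows, val_labels):
    """Greedy reduced-error pruning: keep collapsing internal nodes to their
    majority label as long as validation accuracy does not drop."""
    def nodes(t):
        out = []
        if isinstance(t, dict):
            out.append(t)
            for child in t["branches"].values():
                out.extend(nodes(child))
        return out

    changed = True
    while changed:
        changed = False
        base = accuracy(tree, val_rows, val_labels)
        for node in nodes(tree):
            if node.get("pruned"):
                continue
            node["pruned"] = True
            if accuracy(tree, val_rows, val_labels) >= base:
                changed = True            # keep this prune and rescan
                break
            node["pruned"] = False        # revert: pruning here hurt accuracy
    return tree

# Toy tree: the "Yes" subtree is a spurious split learned from noise.
tree = {"test": "Lotion", "majority": "N", "pruned": False, "branches": {
    "No":  "B",
    "Yes": {"test": "Hair", "majority": "N", "pruned": False,
            "branches": {"Blonde": "N", "Red": "B"}},
}}
val_rows = [{"Lotion": "No"}, {"Lotion": "No"}, {"Lotion": "Yes", "Hair": "Red"}]
val_labels = ["B", "B", "N"]
prune(tree, val_rows, val_labels)
print(tree["branches"]["Yes"]["pruned"])  # True: the noisy split was collapsed
```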

Performance Measures
Compute accuracy on a validation set, or via k-fold cross-validation.
Weighted classification error cost: weight some types of errors more heavily than others.
Minimum description length: favor good accuracy on compact models, e.g. MDL = error(tree) + model_size(tree).

Rule Post-Pruning
Convert the tree to rules.
Prune each rule independently.
Sort the final rule set.
Probably the most widely used method (in common toolkits).
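A small sketch of the tree-to-rules step, again assuming the illustrative dict-based tree format used above (one rule per root-to-leaf path):

```python
def tree_to_rules(tree, conditions=()):
    """Enumerate (conditions, label) rules, one per root-to-leaf path.

    Leaves are labels; internal nodes are {"test": ..., "branches": {...}}.
    """
    if not isinstance(tree, dict):                 # leaf: emit one rule
        return [(list(conditions), tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        rules.extend(tree_to_rules(subtree, conditions + ((tree["test"], value),)))
    return rules

toy = {"test": "Lotion",
       "branches": {"Yes": "N",
                    "No": {"test": "Hair", "branches": {"Blonde": "B", "Brown": "N"}}}}
for conds, label in tree_to_rules(toy):
    print(" AND ".join(f"{f}={v}" for f, v in conds) or "TRUE", "=>", label)
# Lotion=Yes => N
# Lotion=No AND Hair=Blonde => B
# Lotion=No AND Hair=Brown => N
```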

Modeling Features
Different types of features need different tests:
  Binary: test branches on true/false.
  Discrete: one branch for each discrete value.
  Continuous? Need to discretize.
    Enumerating all values is not possible (or desirable).
    Instead, pick a threshold value x and branch on value ≤ x vs. value > x.
How can we pick split points?

Picking Splits
Need useful but sufficient split points. What's a good strategy?
Approach:
  Sort all values of the feature in the training data.
  Identify adjacent instances with different classes; candidate split points lie between those instances.
  Select the candidate with the highest information gain.
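A sketch of this procedure for a single continuous feature (function and variable names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Candidate thresholds lie midway between adjacent differently-labelled values;
    return the one with the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    n = len(labels)
    best = (None, -1.0)
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 == y2 or v1 == v2:
            continue                      # only boundaries between classes qualify
        threshold = (v1 + v2) / 2
        left = [y for v, y in pairs if v <= threshold]
        right = [y for v, y in pairs if v > threshold]
        gain = base - (len(left) / n * entropy(left) + len(right) / n * entropy(right))
        if gain > best[1]:
            best = (threshold, gain)
    return best

# e.g., a continuous feature with class boundaries between 60/72 and 80/90
print(best_threshold([48, 60, 72, 80, 90], ["N", "N", "B", "B", "N"]))  # (66.0, ~0.42)
```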

Advanced Topics
Missing features: what do you do if an instance lacks a feature value?
Feature costs: how do you model different costs for features?
Regression trees: how do you build trees with real-valued predictions?

Missing Features
Problem: what if you don't know the value of a feature for an instance (not just binary presence/absence)?
Options for creating a synthetic value:
  'blank': allow a distinguished value 'blank'.
  Most common value: assign the most common value of the feature in the training set at that node.
  Most common value by class: assign the most common value of the feature at that node among training instances of the same class.
  Fractional instances: assign a probability p_i to each possible value v_i of A, and pass a fraction p_i of the example to each descendant in the tree.
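A sketch of the "most common value (optionally by class)" strategy, with illustrative names; the fractional-example strategy would instead weight instances by p_i as they are passed down each branch:

```python
from collections import Counter

def impute_most_common(rows, feature, labels=None, target_class=None):
    """Fill missing values (None) of `feature` with the most common observed value,
    optionally restricted to training rows of a given class."""
    observed = [r[feature] for r, y in zip(rows, labels or [None] * len(rows))
                if r[feature] is not None
                and (target_class is None or y == target_class)]
    fill = Counter(observed).most_common(1)[0][0]
    return [dict(r, **{feature: fill}) if r[feature] is None else r for r in rows]

rows = [{"Lotion": "No"}, {"Lotion": None}, {"Lotion": "No"}, {"Lotion": "Yes"}]
print(impute_most_common(rows, "Lotion"))
# The missing value becomes "No", the node's most common observed value.
```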

Features with Cost
Issue: obtaining a value for a feature can be expensive.
E.g., in medical diagnosis, a feature value may be the result of a diagnostic test.
Goal: build the best tree with the lowest expected cost.
Approach: modify feature selection; replace information gain with a measure that incorporates cost (Tan & Schlimmer, 1990).
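One commonly cited criterion of this kind, usually attributed to Tan and Schlimmer (1990), divides the squared gain by the feature's measurement cost; treat the exact form below as a textbook rendering rather than a quotation from the slide:

```latex
% A cost-sensitive replacement for plain information gain;
% Cost(A) is the cost of measuring feature A.
\[
\mathrm{Score}(S, A) \;=\; \frac{\mathrm{InfoGain}(S, A)^{2}}{\mathrm{Cost}(A)}
\]
```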

Regression Trees
Leaf nodes provide real-valued predictions:
  e.g., the level of sunburn rather than a binary label, or the height of a pitch accent rather than +/-.
Each leaf provides a value or a linear function, e.g., the mean of the training values reaching that leaf.
What measure of inhomogeneity to split on? Variance, standard deviation, ...
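A standard way to make "variance" concrete as the splitting criterion (a CART-style formulation, not taken from the slide):

```latex
% Variance reduction as the split criterion for regression trees:
% Var(S) is the variance of the target values in S, S_i the instances sent down branch i.
\[
\Delta \mathrm{Var}(S, A) \;=\; \mathrm{Var}(S) \;-\; \sum_{i} \frac{|S_i|}{|S|}\,\mathrm{Var}(S_i)
\]
```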

Decision Trees: Strengths
Conceptual simplicity
Built-in feature selection
Handling of diverse feature types: binary, discrete, continuous
Fast decoding
Perspicuousness (interpretability)

Decision Trees: Weaknesses
Features are assumed independent; to capture a group effect, it must be modeled explicitly (e.g., by creating a new feature AorB).
Feature tests are conjunctive.
Training can be inefficient: complex, with many repeated calculations.
Lack of formal guarantees: greedy training yields non-optimal trees.
Inductive bias: rectangular (axis-parallel) decision boundaries.
Sparse data problems: the data splits at each node.
Lack of stability/robustness.

Decision Trees: Summary
Train: build the tree by forming subsets of least disorder.
Predict: traverse the tree based on feature tests and assign the label of the leaf node reached.
Pros: robust to irrelevant features and some noise; fast prediction; perspicuous rule reading.
Cons: poor at feature combination and dependency; building the optimal tree is intractable.