Decision Tree Learning

Decision Tree Learning Machine Learning, T. Mitchell Chapter 3

Decision Trees
- One of the most widely used and practical methods for inductive inference
- Approximates discrete-valued target functions (including disjunctions)
- Can be used for classification (most common) or regression problems

Decision Tree Example
- Each internal node corresponds to a test
- Each branch corresponds to a result of the test
- Each leaf node assigns a classification

Decision Regions

Decision Trees for Regression

Divide and Conquer
Internal decision nodes:
- Univariate: uses a single attribute x_i
  - Discrete x_i: n-way split for the n possible values
  - Continuous x_i: binary split, e.g. x_i > w_m
- Multivariate: uses more than one attribute
Leaves:
- Classification: class labels, or class proportions
- Regression: a numeric value, e.g. the average of the training responses reaching the leaf, or a local fit
Once the tree is trained, a new instance is classified by starting at the root and following the path dictated by the test results for that instance.
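Below is a minimal sketch of that classification walk, assuming a simple dictionary-based node representation (the node layout and field names are illustrative, not from the original slides); the example tree mirrors the PlayTennis tree from Mitchell, Chapter 3.

```python
# Minimal sketch: classify an instance by walking a trained decision tree.
# Internal nodes test a single attribute (univariate, n-way split); leaves hold a class label.

def classify(node, instance):
    """Follow test outcomes from the root until a leaf is reached."""
    while node["type"] == "internal":
        value = instance[node["attribute"]]
        node = node["children"][value]
    return node["label"]

# The PlayTennis tree (Mitchell, Chapter 3) in this representation:
tree = {
    "type": "internal", "attribute": "Outlook",
    "children": {
        "Sunny": {"type": "internal", "attribute": "Humidity",
                  "children": {"High": {"type": "leaf", "label": "No"},
                               "Normal": {"type": "leaf", "label": "Yes"}}},
        "Overcast": {"type": "leaf", "label": "Yes"},
        "Rain": {"type": "internal", "attribute": "Wind",
                 "children": {"Strong": {"type": "leaf", "label": "No"},
                              "Weak": {"type": "leaf", "label": "Yes"}}},
    },
}

print(classify(tree, {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}))  # Yes
```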

Expressiveness
A decision tree can represent a disjunction of conjunctions of constraints on the attribute values of instances:
- Each path from the root to a leaf corresponds to a conjunction
- The tree itself corresponds to a disjunction

Decision Tree If (O=Sunny AND H=Normal) OR (O=Overcast) OR (O=Rain AND W=Weak) then YES “A disjunction of conjunctions of constraints on attribute values”

How expressive is this representation? How would we represent:
- (A AND B) OR C
- A XOR B
A decision tree can represent any Boolean function.
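As an illustration (not from the slides), the two Boolean functions above written as the nested tests a decision tree would perform; each if corresponds to an internal node, each return to a leaf.

```python
def a_and_b_or_c(a: bool, b: bool, c: bool) -> bool:
    if a:                 # root tests A
        if b:
            return True   # path A=True, B=True -> positive leaf
        return c          # otherwise the outcome depends on C
    return c              # the A=False branch also tests C

def a_xor_b(a: bool, b: bool) -> bool:
    if a:                 # root tests A
        return not b      # positive only when B is False
    return b              # positive only when B is True

# XOR needs a full tree of depth 2: no single attribute separates the classes,
# which is why greedy one-attribute-at-a-time splitting finds it awkward.
assert [a_xor_b(a, b) for a in (False, True) for b in (False, True)] == [False, True, True, False]
```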

Decision tree learning algorithm For a given training set, there are many trees that encode it without any error. Finding the smallest such tree is NP-complete (Quinlan, 1986), hence we are forced to use some (local) search algorithm to find reasonable solutions.

Learning is greedy: find the best split, then recurse (Breiman et al., 1984; Quinlan, 1986, 1993). If the decisions are binary, then in the best case each decision eliminates half of the regions (leaves); with b regions, the correct region can then be found in log2(b) decisions.

The basic decision tree learning algorithm A decision tree can be constructed by considering the attributes of instances one by one. Which attribute should be considered first? The height of the decision tree depends on the order in which the attributes are considered.

Top-Down Induction of Decision Trees

Entropy The entropy of a random variable X with possible values x_i is defined as H(X) = - Σ_i P(x_i) log2 P(x_i). It is a measure of uncertainty.

Entropy Example from coding theory: a random variable x is discrete with 8 possible states; how many bits are needed to transmit the state of x?
- If all states are equally likely: H(x) = 8 × (1/8) × log2(8) = 3 bits
- What if we have a non-uniform distribution for x?

Entropy In order to save on transmission costs, we would design codes that reflect this distribution
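A minimal sketch of the entropy computation; the skewed distribution below is illustrative (the original slide's figure with the actual distribution is not reproduced here).

```python
import math

def entropy(probs):
    """H(X) = -sum_i p_i * log2(p_i); terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# 8 equally likely states: 3 bits are needed on average.
print(entropy([1/8] * 8))                      # 3.0

# An illustrative skewed distribution needs fewer bits on average,
# which is exactly what a code designed around the distribution exploits.
print(entropy([1/2, 1/4, 1/8, 1/16, 1/16]))    # 1.875
```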

Use of Entropy in Choosing the Next Attribute

In summary We will use the entropy of the remaining tree as our measure for preferring one attribute over another: we consider the entropy of the distribution of samples falling under each leaf node and take a weighted average of those entropies, weighted by the proportion of samples falling under that leaf. We then choose the attribute that brings the biggest information gain or, equivalently, results in the tree with the lowest weighted entropy.
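A sketch of this computation (the helper names are mine, not from the slides); the check at the end reproduces Gain(S, Wind) ≈ 0.048 for Mitchell's PlayTennis data, where Wind splits the 9+/5- sample into Weak (6+, 2-) and Strong (3+, 3-).

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, labels, attribute):
    """Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)."""
    total = len(labels)
    partitions = {}
    for x, y in zip(examples, labels):                 # partition labels by the attribute's value
        partitions.setdefault(x[attribute], []).append(y)
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

# Quick check against Mitchell's numbers for Wind:
S      = ["Yes"] * 9 + ["No"] * 5
weak   = ["Yes"] * 6 + ["No"] * 2
strong = ["Yes"] * 3 + ["No"] * 3
print(round(entropy(S) - (8/14) * entropy(weak) - (6/14) * entropy(strong), 3))  # 0.048
```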

Training Examples

Selecting the Next Attribute We would select the Humidity attribute to split the root node, as it has the higher information gain (the example could be more pronounced; a small protest to the ML book here).

Selecting the Next Attribute Computing the information gain for each attribute, we select the Outlook attribute as the first test, resulting in the following partially learned tree. We can then repeat the same process recursively until the stopping conditions are satisfied.

Partially learned tree

Until stopped:
- Select one of the unused attributes to partition the remaining examples at each non-terminal node, using only the training samples associated with that node.
Stopping criteria:
- every example at a leaf node is of one type
- the algorithm has run out of attributes
- …
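A recursive sketch of this loop; it reuses the entropy and information_gain helpers from the earlier sketch and the dictionary node layout from the classification sketch (none of these names come from the original slides).

```python
from collections import Counter

def id3(examples, labels, attributes):
    """Grow a tree top-down: stop on pure nodes or exhausted attributes,
    otherwise split on the attribute with the highest information gain."""
    if len(set(labels)) == 1:                          # all examples have the same label
        return {"type": "leaf", "label": labels[0]}
    if not attributes:                                 # out of attributes: majority label
        return {"type": "leaf", "label": Counter(labels).most_common(1)[0][0]}

    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    node = {"type": "internal", "attribute": best, "children": {}}
    remaining = [a for a in attributes if a != best]
    for v in set(x[best] for x in examples):           # one branch per observed value
        subset = [(x, y) for x, y in zip(examples, labels) if x[best] == v]
        sub_x, sub_y = zip(*subset)
        node["children"][v] = id3(list(sub_x), list(sub_y), remaining)
    return node
```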

Other measures of impurity Entropy is not the only measure of impurity: any function satisfying certain criteria can be used. For a binary problem with positive proportion p:
Gini index: 2p(1-p)
- p = 0.5: Gini index = 0.5
- p = 0.9: Gini index = 0.18
- p = 1 or p = 0: Gini index = 0
Misclassification error: 1 - max(p, 1-p)
- p = 0.5: misclassification error = 0.5
- p = 0.9: misclassification error = 0.1
- p = 1 or p = 0: misclassification error = 0
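A small sketch comparing the three impurity measures on the two-class case (p is the proportion of positives at a node); the printed values match the ones listed above.

```python
import math

def entropy(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def gini(p):
    """Gini index for two classes: 2p(1-p)."""
    return 2 * p * (1 - p)

def misclassification_error(p):
    return 1 - max(p, 1 - p)

for p in (0.0, 0.5, 0.9, 1.0):
    print(f"p={p:.1f}  entropy={entropy(p):.3f}  gini={gini(p):.3f}  "
          f"error={misclassification_error(p):.3f}")
```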

Overfitting in Decision Trees Why "over"-fitting? A model can become more complex than the true target function (concept) when it tries to satisfy noisy data as well.

Consider adding the following training example, which is incorrectly labeled as negative:
Sky; Temp; Humidity; Wind; PlayTennis
Sunny; Hot; Normal; Strong; No

ID3 (the greedy algorithm that was outlined) will make a new split and will classify future examples following the new path as negative. The problem is due to "overfitting" the training data, which may be thought of as insufficient generalization from the training data, caused by:
- coincidental regularities in the data
- insufficient data
- differences between the training and test distributions
Definition of overfitting: a hypothesis is said to overfit the training data if there exists some other hypothesis that has larger error over the training data but smaller error over the entire distribution of instances.

From: http://kogs-www.informatik.uni-hamburg

Overfitting in Decision Trees

Avoiding overfitting the data How can we avoid overfitting? There are two approaches:
- Early stopping: stop growing the tree before it perfectly classifies the training data
- Pruning: grow the full tree, then prune it back (reduced-error pruning, rule post-pruning)
The pruning approach has been found to be more useful in practice.

SKIP the REST

Whether we are pre- or post-pruning, the important question is how to select the "best" tree:
- Measure performance over a separate validation data set
- Measure performance over the training data and apply a statistical test to see if expanding or pruning would produce an improvement beyond the training set (Quinlan, 1986)
- MDL: minimize size(tree) + size(misclassifications(tree))
- …

Reduced-Error Pruning (Quinlan, 1987) Split the data into a training set and a validation set. Do until further pruning is harmful:
1. Evaluate the impact on the validation set of pruning each possible node (plus those below it)
2. Greedily remove the one that most improves validation-set accuracy
This produces the smallest version of the most accurate tree. What if data is limited? We would not want to set aside a validation set.

Reduced error pruning Examine each decision node to see whether pruning it degrades the tree's performance over the evaluation data; if it does not, prune it. "Pruning" here means replacing a subtree with a leaf labeled with the most common classification in the subtree.
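A sketch of that loop over the dictionary-based trees used in the earlier sketches; evaluate (accuracy on the validation data) and majority_label (most common training class under a subtree) are assumed helpers, not defined on the slides.

```python
def internal_nodes(node):
    """Yield every internal node of the tree (children before parents)."""
    if node["type"] == "internal":
        for child in node["children"].values():
            yield from internal_nodes(child)
        yield node

def reduced_error_prune(tree, validation_set, evaluate, majority_label):
    """Repeatedly replace the subtree whose removal helps validation accuracy
    the most with a leaf; stop when every candidate prune hurts accuracy."""
    while True:
        baseline = evaluate(tree, validation_set)
        best_gain, best_node, best_label = None, None, None
        for node in list(internal_nodes(tree)):
            backup = dict(node)                              # remember the node's contents
            label = majority_label(node)
            node.clear()
            node.update({"type": "leaf", "label": label})    # trial prune
            gain = evaluate(tree, validation_set) - baseline
            node.clear()
            node.update(backup)                              # undo the trial prune
            if best_gain is None or gain > best_gain:
                best_gain, best_node, best_label = gain, node, label
        if best_node is None or best_gain < 0:               # further pruning is harmful
            return tree
        best_node.clear()                                    # commit this round's best prune
        best_node.update({"type": "leaf", "label": best_label})
```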

Rule Extraction from Trees C4.5Rules (Quinlan, 1993)

Other Issues With Decision Trees
- Continuous values
- Missing attributes
- …

Continuous Valued Attributes Create a discrete attribute to test a continuous one, e.g. Temperature = 82.5 becomes (Temperature > 72.3) with values {t, f}. How do we find the threshold?
Temperature: 40  48  60  72  80  90
PlayTennis:  No  No  Yes Yes Yes No

Incorporating continuous-valued attributes: where to cut?
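One common recipe, sketched below: sort the examples by the continuous attribute and take the midpoints between adjacent values where the class label changes as candidate thresholds, then score each candidate with information gain as before. Applied to the temperature example above it yields the candidates 54 and 85.

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(v1 + v2) / 2
            for (v1, y1), (v2, y2) in zip(pairs, pairs[1:])
            if y1 != y2]

temperature = [40, 48, 60, 72, 80, 90]
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temperature, play_tennis))   # [54.0, 85.0]
```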

Split Information? [Figure: two splits of the same 100 examples (50 positive, 50 negative): A1 produces two pure leaves (50+ and 50-), while A2 produces many small pure leaves (10+, 10-, etc.).] In each tree the leaves contain samples of only one kind, hence the remaining entropy is 0 in each one. Which split is better, in terms of information gain? In terms of gain ratio?

Attributes with Many Values One way to penalize such attributes is to use the following alternative measure:
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
SplitInformation(S, A) = - Σ_i (|S_i|/|S|) log2 (|S_i|/|S|)
where S_1, ..., S_c are the subsets of S induced by the c values of A. SplitInformation is the entropy of the attribute A itself, determined experimentally from the training samples.
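A sketch of the gain-ratio comparison for the split-information slide above; the exact leaf sizes for A2 (ten pure leaves of 10) are my reading of the figure, so treat them as illustrative.

```python
import math

def entropy_from_counts(counts):
    """Entropy (bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def split_information(subset_sizes):
    """Entropy of the attribute itself: how finely it fragments the data."""
    total = sum(subset_sizes)
    return -sum(s / total * math.log2(s / total) for s in subset_sizes)

# 100 examples, 50+/50-; both attributes yield pure leaves, so Gain = H(S) = 1 bit.
gain = entropy_from_counts([50, 50])

for name, sizes in [("A1", [50, 50]), ("A2", [10] * 10)]:
    si = split_information(sizes)
    print(f"{name}: gain={gain:.2f}  split_info={si:.2f}  gain_ratio={gain / si:.2f}")
# Same information gain, but A1 has the clearly higher gain ratio.
```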

Handling training examples with missing attribute values What if an example x is missing the value of an attribute A?
Simple solutions:
- Use the most common value of A among the examples at node n
- Or use the most common value of A among the examples at node n that have the classification c(x)
More complex, probabilistic approach:
- Assign a probability to each possible value of A based on the observed frequencies of the various values of A
- Then propagate examples down the tree with these probabilities
- The same probabilities can be used when classifying new instances (used in C4.5)
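A sketch of the probabilistic variant, again over the dictionary-based tree; storing the per-node value frequencies in a value_probs field is my assumption for the sketch, not something specified on the slides.

```python
from collections import Counter

def value_probabilities(examples, attribute):
    """Empirical distribution of an attribute's observed values at a node."""
    observed = [x[attribute] for x in examples if x.get(attribute) is not None]
    counts = Counter(observed)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def classify_with_missing(node, instance, weight=1.0):
    """Return a class -> probability mapping, sending fractional weight down
    every branch whenever the tested attribute is missing from the instance."""
    if node["type"] == "leaf":
        return {node["label"]: weight}
    value = instance.get(node["attribute"])
    branches = [(value, 1.0)] if value is not None else node["value_probs"].items()
    result = {}
    for v, p in branches:
        for label, w in classify_with_missing(node["children"][v], instance, weight * p).items():
            result[label] = result.get(label, 0.0) + w
    return result
```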

Handling attributes with differing costs Sometimes attribute values are expensive or difficult to obtain (e.g. in medical diagnosis, a BloodTest has cost $150). In practice, it may be desirable to postpone acquisition of such attribute values until they become necessary. To this end, one may modify the attribute selection measure to penalize expensive attributes, e.g. Tan and Schlimmer (1990), Nunez (1988).
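As a hedged illustration of such cost-sensitive selection measures, the two formulas below are the ones Mitchell attributes to Tan and Schlimmer (Gain^2 / Cost) and to Nunez ((2^Gain - 1) / (Cost + 1)^w); the gains and costs in the example are made up purely for the comparison.

```python
def tan_schlimmer(gain, cost):
    """Gain(S, A)^2 / Cost(A)."""
    return gain ** 2 / cost

def nunez(gain, cost, w=0.5):
    """(2^Gain(S, A) - 1) / (Cost(A) + 1)^w, with w in [0, 1] trading off gain against cost."""
    return (2 ** gain - 1) / (cost + 1) ** w

# Illustrative: a cheap, moderately informative test vs an expensive, stronger one.
for name, gain, cost in [("cheap test", 0.15, 1.0), ("blood test", 0.45, 150.0)]:
    print(f"{name}: tan_schlimmer={tan_schlimmer(gain, cost):.4f}  nunez={nunez(gain, cost):.4f}")
```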

C4.5 By Ross Quinlan. Latest code available at http://www.cse.unsw.edu.au/~quinlan/
How to use it?
- Download it
- Unpack it
- Build it ("make all")
- Read the accompanying manual files: groff -T ps c4.5.1 > c4.5.ps
- Use it:
  - c4.5 : tree generator
  - c4.5rules : rule generator
  - consult : use a generated tree to classify an instance
  - consultr : use a generated set of rules to classify an instance

Model Selection in Trees

Strengths and Advantages of Decision Trees
- Rule extraction from trees
- A decision tree can be used for feature extraction (e.g. seeing which features are useful)
- Interpretability: human experts may verify and/or discover patterns
- A compact and fast classification method