Decision Tree Learning, DA514 - Lecture Slides 2, Berrin Yanıkoğlu. Modified and expanded from: E. Alpaydin, Introduction to Machine Learning (Chapter 9); T. Mitchell, Machine Learning.

In these slides you will strengthen your understanding of the introductory concepts of machine learning and learn about the decision tree approach, one of the fundamental approaches in machine learning. Decision trees are also the basis of a method called "random forest", which we will see towards the end of the course.

Decision Trees One of the most widely used and practical methods for inductive inference. Can be used for classification (the most common use) or for regression problems.

Decision Tree for "Good credit applicant?" Each internal node corresponds to a test, each branch corresponds to a result of the test, and each leaf node assigns a classification. (The slide shows a tree that tests Current Debt?, Assets?, and Employed?, with leaves such as Good, Medium, and Bad applicant.)

Decision Regions

Divide and Conquer Internal decision nodes: univariate (uses a single attribute x_i; discrete x_i: n-way split for n possible values; continuous x_i: binary split x_i > w_m) or multivariate (uses more than one attribute). Leaves: for classification, class labels or proportions; for regression, a numeric value or a local fit. Once the tree is trained, a new instance is classified by starting at the root and following the path dictated by the test results for this instance (see the sketch below).
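
A minimal sketch of this classification procedure; the DecisionNode class, attribute names, and the toy credit tree below are illustrative assumptions, not taken from the slides:

```python
# Minimal sketch: classifying a new instance by walking a trained tree.
# The DecisionNode structure and the toy credit tree are illustrative only.

class DecisionNode:
    def __init__(self, attribute=None, children=None, label=None):
        self.attribute = attribute      # attribute tested at this internal node
        self.children = children or {}  # attribute value -> child node
        self.label = label              # class label if this is a leaf

def classify(node, instance):
    """Start at the root and follow the path dictated by the instance's attribute values."""
    while node.label is None:              # not yet at a leaf
        value = instance[node.attribute]   # answer this node's question
        node = node.children[value]        # follow the matching branch
    return node.label

# A rough, simplified version of the "Good credit applicant?" tree
bad, good = DecisionNode(label="Bad appl."), DecisionNode(label="Good appl.")
employed = DecisionNode("Employed?", {"Yes": good, "No": bad})
root = DecisionNode("Current Debt?", {"Large": bad, "Small": employed, "None": good})

print(classify(root, {"Current Debt?": "Small", "Employed?": "No"}))  # Bad appl.
```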

Multivariate Trees

Expressiveness A decision tree can represent a disjunction of conjunctions of constraints on the attribute values of instances. Each path corresponds to a conjunction; the tree itself corresponds to a disjunction.

Decision Tree If (Current Debt = Large) OR (Current Debt = Small AND Employed? = No) then Bad Applicant. "A disjunction of conjunctions of constraints on attribute values."

Decision tree learning algorithm For a given training set, there are many trees that encode it without any error. Finding the smallest such tree is NP-complete (Quinlan 1986), hence we are forced to use some (local) search algorithm to find reasonable solutions.

ID3 Decision Tree Learning Algorithm Learning is greedy: find the best split recursively (Breiman et al., 1984; Quinlan, 1986, 1993). At each internal node: choose an attribute to question about; according to the answers given, the problem is split into sub-problems; continue recursively. Which attribute should be considered first? The height of a decision tree depends on the order in which attributes are considered.

Top-Down Induction of Decision Trees
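
A rough Python sketch of this top-down, greedy induction. Assumptions not in the slides: discrete attributes, examples given as dicts, and a best_attribute() helper that returns the attribute with the highest information gain (defined in the slides that follow):

```python
# Rough sketch of top-down (ID3-style) decision tree induction.
# best_attribute() is an assumed helper that picks the highest-gain attribute.

from collections import Counter

def id3(examples, attributes, target):
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                  # all examples agree: return a leaf
        return labels[0]
    if not attributes:                         # out of attributes: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = best_attribute(examples, attributes, target)      # greedy choice
    tree = {best: {}}
    for value in set(ex[best] for ex in examples):            # one branch per observed value
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, remaining, target)    # recurse on the sub-problem
    return tree
```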

Which attribute is best? The slide compares two candidate splits of the same (25+, 25-) set: testing A1 yields the subsets (17+, 20-) and (8+, 5-), while testing A2 yields (24+, 5-) and (1+, 20-).

Entropy A measure of uncertainty: the expected number of bits needed to resolve the uncertainty. Entropy measures the amount of information in a message. It is an important quantity in coding theory, statistical physics, machine learning, ... (High school form example with a gender field.)

Entropy of a Binary Random Variable Entropy measures the impurity of S: Entropy(S) = -p log2(p) - (1-p) log2(1-p), where p is the proportion of positive examples in S. Example: consider a binary random variable X such that Pr{X = 0} = 0.1. Entropy(X) = -0.1 log2(0.1) - 0.9 log2(0.9) ≈ 0.47.

Entropy – General Case When the random variable has multiple possible outcomes, its entropy becomes: Entropy(X) = -Σ_i p_i log2(p_i), where p_i is the probability of the i-th outcome.
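
A small helper that evaluates this formula (a sketch, not from the slides), checked against the examples on the surrounding slides:

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)   # 0*log(0) is taken as 0

print(entropy([0.5, 0.5]))    # 1.0 bit: maximum uncertainty for a binary variable
print(entropy([0.1, 0.9]))    # ~0.47: the Pr{X = 0} = 0.1 example above
print(entropy([1.0]))         # 0.0: no uncertainty
print(entropy([0.125] * 8))   # 3.0 bits: 8 equally likely states (next slide)
```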

Entropy Example from coding theory: a discrete random variable x with 8 possible states; how many bits are needed to transmit the state of x? 1. All states equally likely: log2(8) = 3 bits. 2. What if x has a non-uniform distribution (as given on the slide)?

Use of Entropy in Choosing the Next Attribute

We will use the entropy of the remaining tree as our measure for preferring one attribute over another. In summary, we will consider the entropy over the distribution of samples falling under each leaf node, and take a weighted average of those entropies, weighted by the proportion of samples falling under each leaf. We will then choose the attribute that brings the biggest information gain, or equivalently, results in the tree with the lowest weighted entropy.

Please note: Gain is [what you started with] minus [the remaining entropy], i.e. Gain(S, A) = Entropy(S) - Σ_v (|S_v|/|S|) Entropy(S_v). So we can simply choose the tree with the smallest remaining entropy!
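
A sketch of this computation in Python; the function names and the tiny example dataset are illustrative only:

```python
from collections import Counter
import math

def label_entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, target):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    n = len(examples)
    total = label_entropy([ex[target] for ex in examples])
    remaining = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        remaining += (len(subset) / n) * label_entropy(subset)  # weighted leaf entropy
    return total - remaining  # equivalently: pick the attribute with the smallest 'remaining'

# Tiny illustrative example (not the slide's dataset)
toy = [{"Wind": "Weak",   "Play": "Yes"},
       {"Wind": "Strong", "Play": "No"},
       {"Wind": "Weak",   "Play": "Yes"},
       {"Wind": "Strong", "Play": "Yes"}]
print(information_gain(toy, "Wind", "Play"))  # ~0.311
```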

Training Examples (the table of training examples is shown on the slide)

Selecting the Next Attribute We would select the Humidity attribute to split the root node as it has a higher Information Gain.

Selecting the Next Attribute Computing the information gain for each attribute, we select the Outlook attribute as the first test, resulting in the following partially learned tree. We can repeat the same process recursively until the stopping conditions are satisfied.

Partially learned tree

Until stopped: select one of the unused attributes to partition the remaining examples at each non-terminal node, using only the training samples associated with that node. Stopping criteria: each leaf node contains examples of one type, the algorithm ran out of attributes, ...

Advanced: Other measures of impurity Entropy is not the only measure of impurity. If a function satisfies certain criteria, it can be used as a measure of impurity. For binary random variables (only the value p is indicated), we have:
- Gini index: 2p(1-p). p=0.5: Gini index = 0.5; p=0.9: Gini index = 0.18; p=1: Gini index = 0; p=0: Gini index = 0.
- Misclassification error: 1 - max(p, 1-p). p=0.5: error = 0.5; p=0.9: error = 0.1; p=1: error = 0; p=0: error = 0.
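
The three impurity measures side by side, as a quick check of the numbers above (a sketch, not from the slides):

```python
import math

def entropy(p):
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def gini(p):
    return 2 * p * (1 - p)

def misclassification(p):
    return 1 - max(p, 1 - p)

for p in (0.0, 0.5, 0.9, 1.0):
    print(p, round(entropy(p), 3), round(gini(p), 3), round(misclassification(p), 3))
# All three measures are 0 for pure nodes (p = 0 or 1) and maximal at p = 0.5.
```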

Other Issues With Decision Trees Continuous Values Missing Attributes …

Continuous Valued Attributes Create a discrete attribute to test continuous variables, e.g. Temperature = 82.5, (Temperature > 72.3) ∈ {t, f}. How to find the threshold? Temperature: … PlayTennis: No No Yes Yes Yes No

Incorporating continuous-valued attributes Where to cut? It can be shown that the best threshold always lies at a transition between the two classes (shown as red boundaries on the slide).
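
A sketch of this idea: candidate thresholds are placed midway between adjacent sorted values where the class label changes, and the one with the highest information gain is kept. The temperature values below are hypothetical; the slide's actual numbers are not reproduced here:

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values where the class label changes."""
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][1] != pairs[i + 1][1]]

# Hypothetical temperatures paired with the slide's PlayTennis labels
temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temps, labels))  # [54.0, 85.0]
```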

Split Information? In both trees on the slide, the leaves contain samples of only one kind (e.g. 50+, 10+, 10-, etc.), hence the remaining entropy is 0 in each one. But we would often prefer A1. How do we indicate this preference? (A1 splits 100 examples into two pure leaves of 50 positive and 50 negative; A2 splits the same 100 examples into ten pure leaves of 10 examples each.)

Attributes with Many Values – Formula (Non-Essential) One way to penalize such attributes is to use the gain ratio, Gain(S, A) / SplitInformation(S, A), where the split information is the entropy of the attribute A itself, experimentally determined from the training samples: SplitInformation(S, A) = -Σ_i (|S_i|/|S|) log2(|S_i|/|S|). The split information is 3.3 versus 1 for the previous example: -10 × 0.1 × log2(0.1) vs. -2 × 0.5 × log2(0.5).
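
A quick check of these numbers (a sketch; function names are not from the slides):

```python
import math

def split_information(subset_sizes):
    """SplitInformation(S, A) = -sum_i |S_i|/|S| * log2(|S_i|/|S|): the entropy of A itself."""
    n = sum(subset_sizes)
    return -sum((s / n) * math.log2(s / n) for s in subset_sizes if s > 0)

def gain_ratio(gain, subset_sizes):
    return gain / split_information(subset_sizes)   # penalizes many-valued attributes

print(split_information([50, 50]))   # 1.0   (A1: two branches of 50 examples each)
print(split_information([10] * 10))  # ~3.32 (A2: ten branches of 10 examples each)
```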

Handling training examples with missing attribute values What if an example x is missing the value of an attribute A? Simple solution: use the most common value of A among the examples at node n, or the most common value among the examples at node n that have the same classification c(x). More complex, probabilistic approach: assign a probability to each possible value of A based on the observed frequencies of the various values of A, then propagate examples down the tree with these probabilities. The same probabilities can be used when classifying new instances (this is what C4.5 does).
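
A sketch of the simple strategy only (the probabilistic, C4.5-style approach is not shown); the helper name and the toy data are illustrative assumptions:

```python
from collections import Counter

def most_common_value(examples, attribute, target=None, label=None):
    """Most common value of `attribute` among the examples at this node,
    optionally restricted to examples that share the classification `label`."""
    pool = [ex[attribute] for ex in examples
            if ex.get(attribute) is not None
            and (label is None or ex[target] == label)]
    return Counter(pool).most_common(1)[0][0]

# Fill in the missing value before propagating the example down the tree
example = {"Outlook": None, "Humidity": "High", "Play": "No"}
node_examples = [{"Outlook": "Sunny", "Humidity": "High",   "Play": "No"},
                 {"Outlook": "Rain",  "Humidity": "High",   "Play": "Yes"},
                 {"Outlook": "Sunny", "Humidity": "Normal", "Play": "No"}]
example["Outlook"] = most_common_value(node_examples, "Outlook", target="Play", label="No")
print(example["Outlook"])  # Sunny
```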

Handling attributes with differing costs Sometimes some attribute values are more expensive or difficult to obtain (e.g. in medical diagnosis, a BloodTest may cost $150). In practice, it may be desirable to postpone the acquisition of such attribute values until they become necessary. To this end, one may modify the attribute selection measure to penalize expensive attributes, e.g. Gain²(S, A) / Cost(A) (Tan and Schlimmer, 1990) or (2^Gain(S, A) - 1) / (Cost(A) + 1)^w (Nunez, 1988), where w weights the importance of cost.

Regression with Decision Trees
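
The slide's details are in its figure; as a minimal sketch of the idea (consistent with the earlier note that regression leaves hold a numeric value or local fit), a regression tree can predict the mean target at each leaf and choose splits that minimize the weighted variance of the children. The names and numbers below are illustrative only:

```python
def leaf_value(targets):
    """A regression leaf predicts a constant: the mean of the training targets reaching it."""
    return sum(targets) / len(targets)

def variance(targets):
    m = leaf_value(targets)
    return sum((t - m) ** 2 for t in targets) / len(targets)

def split_score(left_targets, right_targets):
    """Weighted variance after a candidate split; lower is better (analogue of remaining entropy)."""
    n = len(left_targets) + len(right_targets)
    return (len(left_targets) * variance(left_targets) +
            len(right_targets) * variance(right_targets)) / n

print(split_score([1.0, 1.2, 0.9], [4.8, 5.1, 5.3]))  # small: a good split
print(split_score([1.0, 5.1, 0.9], [4.8, 1.2, 5.3]))  # large: a poor split
```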

Model Selection in Trees:

We do not know which tree is better; one simple way to select the model (tree) is to use a validation set.

Strengths and Advantages of Decision Trees Rule extraction from trees: a decision tree can be used for feature extraction (e.g. seeing which features are useful). Interpretability: human experts may verify and/or discover patterns. It is a compact and fast classification method.

Back to Decision Trees

A typical behaviour in machine learning is that as the complexity of the model increases, it has a better chance of representing the training data. But then the generalization performance often suffers.

Avoiding over-fitting the data How can we avoid overfitting? There are two approaches: 1. Early stopping: stop growing the tree before it perfectly classifies the training data. 2. Pruning: grow the full tree, then prune it, i.e. replace a subtree with a leaf labelled with the most common classification in that subtree. The pruning approach is found to be more useful in practice.

Whether we are pre- or post-pruning, the important question is how to select the "best" tree: measure performance over a separate validation set, …

Reduced-Error Pruning (Quinlan 1987) Split the data into a training and a validation set. Starting with the leaves, do until further pruning is harmful: 1. Evaluate the impact on the validation set of pruning each possible node (plus those below it); when pruning a node, we return the majority decision before the considered split. 2. Greedily remove the node whose removal most improves validation set accuracy. This produces the smallest version of the most accurate tree (see the sketch below). What if data is limited? We would not want to set aside a validation set.
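
A minimal, self-contained sketch of the idea, simplified to one bottom-up pass that prunes the tree in place; the Node class, helpers, and data layout are assumptions, not Quinlan's exact procedure:

```python
class Node:
    def __init__(self, attribute=None, children=None, label=None, majority=None):
        self.attribute, self.children = attribute, children or {}
        self.label = label        # set only on leaves
        self.majority = majority  # majority class of the training examples at this node

def predict(node, x):
    while node.label is None:
        node = node.children[x[node.attribute]]
    return node.label

def accuracy(tree, data, target):
    return sum(predict(tree, x) == x[target] for x in data) / len(data)

def prune(node, tree, validation, target):
    """Bottom-up: tentatively replace each internal node by a majority-class leaf,
    and keep the replacement only if validation accuracy does not drop."""
    if node.label is not None:
        return
    for child in node.children.values():
        prune(child, tree, validation, target)       # prune the subtrees first
    before = accuracy(tree, validation, target)
    saved_children = node.children
    node.label, node.children = node.majority, {}    # tentatively prune this node
    if accuracy(tree, validation, target) < before:  # pruning hurt: undo it
        node.label, node.children = None, saved_children
```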

Reduced error pruning

Rule Extraction from Trees C4.5Rules (Quinlan, 1993)