Decision Tree Problems
CSE-391: Artificial Intelligence, University of Pennsylvania
Matt Huenerfauth, April 2005

Homework 7
– Perform some entropy and information gain calculations.
  – We'll also do some information gain ratio calculations in class today. You don't need to do these on the midterm, but you should understand generally how this metric is calculated and you should know when we should use it.
– Using the C4.5 decision tree learning software.
  – You'll learn trees to do word sense disambiguation.
– Read chapter 18.1 – 18.3.

Looking at some data

Color   Size   Shape      Edible?
Yellow  Small  Round      +
Yellow  Small  Round      -
Green   Small  Irregular  +
Green   Large  Irregular  -
Yellow  Large  Round      +
Yellow  Small  Round      +
Yellow  Small  Round      +
Yellow  Small  Round      +
Green   Small  Round      -
Yellow  Large  Round      -
Yellow  Large  Round      +
Yellow  Large  Round      -
Yellow  Large  Round      -
Yellow  Large  Round      -
Yellow  Small  Irregular  +
Yellow  Large  Irregular  +

Calculate Entropy

For many of the tree-building calculations we do today, we'll need to know the entropy of a data set.
– Entropy is the degree to which a dataset is mixed up; that is, how much variety of classifications (+/-) is still in the set.
– For example, a set that is still 50/50 +/- classified will have an entropy of 1.0.
– A set that's all + or all – will have an entropy of 0.0.

Entropy Calculations: I()

If we have a set with k different values in it, we can calculate the entropy as follows:

  I(S) = sum over i = 1..k of (-P(value_i)) * log2(P(value_i))

where P(value_i) is the probability of getting the i-th value when randomly selecting one from the set.

So, for the set R = {a,a,a,b,b,b,b,b} (three a-values, five b-values):

  I(R) = (-3/8)*log2(3/8) + (-5/8)*log2(5/8) = .9544
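
As a rough sketch of the same calculation in Python (not part of the original slides; the function name entropy and its count-list interface are just illustrative choices):

import math

def entropy(counts):
    """Entropy I() of a set, given the number of items in each class."""
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c > 0:                      # treat 0 * log2(0) as 0
            p = c / total
            result -= p * math.log2(p)
    return result

# The set R = {a,a,a,b,b,b,b,b} has 3 a-values and 5 b-values:
print(entropy([3, 5]))    # ~0.9544
# The 16-example berry set above has 9 positive and 7 negative examples:
print(entropy([9, 7]))    # ~0.9887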

Entropy for our data set

16 instances: 9 positive, 7 negative.

  I(all_data) = (-9/16)*log2(9/16) + (-7/16)*log2(7/16) = .9887

This makes sense: it's almost a 50/50 split, so the entropy should be close to 1.

How do we use this?

The computer needs a way to decide how to build a decision tree.
– First decision: what's the attribute it should use to 'branch on' at the root?
– Recursively: what's the attribute it should use to 'branch on' at all subsequent nodes?

Guideline: Always branch on the attribute that will divide the data into subsets that have as low entropy as possible (that are as unmixed +/- as possible).

Information Gain Metric: G()

When we select an attribute to use as our branching criterion at the root, we've effectively split our data into two sets: the set that goes down the left branch, and the set that goes down the right. If we know the entropy before the split, and we then calculate the entropy of each of these resulting subsets, we can calculate the information gain.

Information Gain Metric: G()

Why is reducing entropy a good idea?
– Eventually we'd like our tree to separate data items into groups that are fine-grained enough that we can label them as being either + or –.
– In other words, we'd like to separate our data in such a way that each group is as 'unmixed' in terms of +/- classifications as possible.
– So, the ideal attribute to branch on at the root would be the one that can separate the data into an entirely + group and an entirely – group.

Visualizing Information Gain

Branching on Size splits the full 16-example set (entropy = .9887) into two subsets of 8 examples each.

Small branch (entropy = .8113):

Color   Size   Shape      Edible?
Yellow  Small  Round      +
Yellow  Small  Round      -
Green   Small  Irregular  +
Yellow  Small  Round      +
Yellow  Small  Round      +
Yellow  Small  Round      +
Green   Small  Round      -
Yellow  Small  Irregular  +

Large branch (entropy = .9544):

Color   Size   Shape      Edible?
Green   Large  Irregular  -
Yellow  Large  Round      +
Yellow  Large  Round      -
Yellow  Large  Round      +
Yellow  Large  Round      -
Yellow  Large  Round      -
Yellow  Large  Round      -
Yellow  Large  Irregular  +

Visualizing Information Gain

The Size test sends 8 examples down the 'small' branch and 8 down the 'large' branch (16 examples at the parent). The data set that goes down each branch of the tree has its own entropy value, so for each possible attribute we can calculate its expected entropy: the entropy we expect to remain after branching on that attribute. You add the entropies of the two children, weighted by the proportion of examples from the parent node that ended up at that child.

Entropy of left child: I(size=small) = .8113
Entropy of right child: I(size=large) = .9544

G(attrib) = I(parent) – I(attrib)

We want to calculate the information gain (or entropy reduction): the reduction in 'uncertainty' from choosing 'size' as our first branch. We will represent information gain as "G."

Entropy of all data at the parent node: I(parent) = .9887
Child's expected entropy for the 'size' split: I(size) = .8829

G(size) = I(parent) – I(size) = .9887 – .8829 = .1058

So, we have gained .1058 bits of information about the dataset by choosing 'size' as the first branch of our decision tree.
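
A minimal sketch of the same gain computation, reusing the entropy helper sketched earlier (expected_entropy and information_gain are illustrative names, not functions from any particular library):

def expected_entropy(subsets):
    """Weighted average of the children's entropies; each child is a list of class counts."""
    total = sum(sum(counts) for counts in subsets)
    return sum(sum(counts) / total * entropy(counts) for counts in subsets)

def information_gain(parent_counts, subsets):
    """G = I(parent) - expected entropy after the split."""
    return entropy(parent_counts) - expected_entropy(subsets)

# Splitting the 16 examples (9+, 7-) on 'size': small has 6+/2-, large has 3+/5-.
print(information_gain([9, 7], [[6, 2], [3, 5]]))    # ~0.1058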

Using Information Gain

For each of the attributes we're thinking about branching on, and for all of the data that will reach this node (which is all of the data when at the root), do the following:
– Calculate the Information Gain if we were to split the current data on this attribute.
In the end, select the attribute with the greatest Information Gain to split on. Create two subsets of the data (one for each branch of the tree), and recurse on each branch.
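
The whole procedure can be outlined as a short recursive function. This is only a sketch under simplifying assumptions (categorical attributes, majority-vote leaves, no pruning or gain ratio); it is not the C4.5 program used in the homework. It reuses the entropy helper from the earlier sketch, and gain_on/build_tree are illustrative names:

from collections import Counter

def gain_on(examples, attribute):
    """Information gain of splitting (feature_dict, label) examples on one attribute."""
    def class_counts(subset):
        return list(Counter(label for _, label in subset).values())
    groups = {}
    for features, label in examples:
        groups.setdefault(features[attribute], []).append((features, label))
    total = len(examples)
    expected = sum(len(g) / total * entropy(class_counts(g)) for g in groups.values())
    return entropy(class_counts(examples)) - expected

def build_tree(examples, attributes):
    """examples: list of (feature_dict, label) pairs; attributes: names still available."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1 or not attributes:        # pure node, or nothing left to test
        return max(set(labels), key=labels.count)      # leaf: majority label
    best = max(attributes, key=lambda a: gain_on(examples, a))   # greatest information gain
    remaining = [a for a in attributes if a != best]             # drop it from consideration
    tree = {best: {}}
    for value in set(features[best] for features, _ in examples):
        subset = [(f, lab) for f, lab in examples if f[best] == value]
        tree[best][value] = build_tree(subset, remaining)        # recurse on each branch
    return tree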

Showing the calculations For color, size, shape. Select the one with the greatest info gain value as the attribute we’ll branch on at the root. Now imagine what our data set will look like on each side of the branch. We would then recurse on each of these data sets to select how to branch below.

Our Data Table

(The same 16-example table shown above.)

Sequence of Calculations

Calculate I(parent). This is the entropy of the data set before the split. Since we're at the root, this is simply the entropy for all the data.
  I(all_data) = (-9/16)*log2(9/16) + (-7/16)*log2(7/16)
Next, calculate I() for the subset of the data where color=green and for the subset where color=yellow.
  I(color=green) = (-1/3)*log2(1/3) + (-2/3)*log2(2/3)
  I(color=yellow) = (-8/13)*log2(8/13) + (-5/13)*log2(5/13)
Now calculate the expected entropy for 'color.'
  I(color) = (3/16)*I(color=green) + (13/16)*I(color=yellow)
Finally, the information gain for 'color.'
  G(color) = I(parent) – I(color)

Calculations

I(all_data) = .9887

I(size) = .8829      G(size) = .1058
  size=small: +6, -2;  I(size=small) = .8113
  size=large: +3, -5;  I(size=large) = .9544

I(color) = .9532     G(color) = .0355
  color=green: +1, -2;   I(color=green) = .9183
  color=yellow: +8, -5;  I(color=yellow) = .9612

I(shape) = .9528     G(shape) = .0359
  shape=round: +6, -6;     I(shape=round) = 1.0
  shape=irregular: +3, -1; I(shape=irregular) = .8113
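
These numbers can be reproduced with the information_gain helper sketched earlier; the class counts below are simply read off the 16-example table:

# (positive, negative) counts for each value of each attribute at the root.
splits = {
    "size":  [[6, 2], [3, 5]],    # small, large
    "color": [[1, 2], [8, 5]],    # green, yellow
    "shape": [[6, 6], [3, 1]],    # round, irregular
}
for attribute, subsets in splits.items():
    print(attribute, round(information_gain([9, 7], subsets), 4))
# size 0.1058, color 0.0355, shape 0.0359 -> branch on 'size' at the root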

Visualizing the Recursive Step

Now that we have split on a particular feature, we delete that feature from the set considered at the next layer. Since this effectively gives us a 'new' smaller dataset, with one less feature, at each of these child nodes we simply apply the same entropy calculation procedures recursively for each child.

After branching on Size and removing that column, the Small child sees:

Color   Shape      Edible?
Yellow  Round      +
Yellow  Round      -
Green   Irregular  +
Yellow  Round      +
Yellow  Round      +
Yellow  Round      +
Green   Round      -
Yellow  Irregular  +

and the Large child sees:

Color   Shape      Edible?
Green   Irregular  -
Yellow  Round      +
Yellow  Round      -
Yellow  Round      +
Yellow  Round      -
Yellow  Round      -
Yellow  Round      -
Yellow  Irregular  +

Calculations

Entropy of this whole set (+6, -2): I = .8113

I(color) = .7375     G(color) = .0738
  color=yellow: +5, -1;  I(color=yellow) = .65
  color=green: +1, -1;   I(color=green) = 1.0

I(shape) = .6887     G(shape) = .1226
  shape=round: +4, -2;     I(shape=round) = .9183
  shape=irregular: +2, -0; I(shape=irregular) = 0

(Computed from the Small-branch table above, with Size removed.)
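
The same information_gain helper reproduces these Small-branch numbers (again just an illustrative check):

# Small branch: 8 examples (6 positive, 2 negative), Size already removed.
print(information_gain([6, 2], [[5, 1], [1, 1]]))    # color: yellow 5+/1-, green 1+/1-   -> ~0.0738
print(information_gain([6, 2], [[4, 2], [2, 0]]))    # shape: round 4+/2-, irregular 2+/0- -> ~0.1226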

Binary Data

Sometimes most of our attributes are binary or have only a few possible values (like the berry example).
– In this case, the information gain metric is appropriate for selecting which attribute to branch on at each node.
When we have some attributes with very many values, there is another metric which is better to use.

Information Gain Ratio: GR()

The information gain metric has a bias toward branching on attributes that have very many possible values. To combat this bias, we use a different branching-attribute selection metric called the Information Gain Ratio, written GR(), e.g. GR(size).

Formula for Info Gain Ratio

GR(A) = G(A) / Sum(…), where Sum(…) adds up, over each value v of attribute A, the term (-P(v)) * log2(P(v)).

P(v) is the proportion of the values of this attribute that are equal to v.
– Note: we're not counting +/- in this case. We're counting the values in the 'attribute' column.

Let's use the information gain ratio metric to select the best attribute to branch on.

Calculation of GR()

GR(size) = G(size) / Sum(…)

GR(size) = .1058
  G(size) = .1058. 8 occurrences of size=small; 8 occurrences of size=large.
  Sum(…) = (-8/16)*log2(8/16) + (-8/16)*log2(8/16) = 1

GR(color) = .0510
  G(color) = .0355. 3 occurrences of color=green; 13 of color=yellow.
  Sum(…) = (-3/16)*log2(3/16) + (-13/16)*log2(13/16) = .6962

GR(shape) = .0442
  G(shape) = .0359. 12 occurrences of shape=round; 4 of shape=irregular.
  Sum(…) = (-12/16)*log2(12/16) + (-4/16)*log2(4/16) = .8113
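
A sketch of the gain ratio calculation, reusing the earlier helpers (split_information and gain_ratio are illustrative names, not part of C4.5's actual code):

def split_information(subset_sizes):
    """The Sum(…) term: entropy of the attribute's value distribution, ignoring +/- labels."""
    return entropy(subset_sizes)

def gain_ratio(parent_counts, subsets):
    """GR = information gain divided by the attribute's split information."""
    sizes = [sum(counts) for counts in subsets]
    return information_gain(parent_counts, subsets) / split_information(sizes)

# At the root (9+, 7-):
print(gain_ratio([9, 7], [[6, 2], [3, 5]]))    # size:  ~0.1058
print(gain_ratio([9, 7], [[1, 2], [8, 5]]))    # color: ~0.0510
print(gain_ratio([9, 7], [[6, 6], [3, 1]]))    # shape: ~0.0442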

Selecting the root Same as before, but now instead of selecting the attribute with the highest information gain, we select the one with the highest information gain ratio. We will use this attribute to branch at the root.

Data Subsets / Recursive Step

Same as before. After we select an attribute for the root, we partition the data set into subsets and then remove that attribute from consideration for the subsets below its node. Now, we recurse: we calculate what each of our subsets will be down each branch.
– We recursively calculate the info gain ratio for all the attributes on each of these data subsets in order to select how the tree will branch below the root.

Recursively

Entropy of this whole set (+6, -2): I = .8113

G(color) = .0738     GR(color) = .0909
  color=yellow: +5, -1;  I(color=yellow) = .65
  color=green: +1, -1;   I(color=green) = 1.0

G(shape) = .1226     GR(shape) = .1511
  shape=round: +4, -2;     I(shape=round) = .9183
  shape=irregular: +2, -0; I(shape=irregular) = 0

(Again computed from the Small-branch table, with Size removed; by gain ratio, shape is still the better attribute to branch on here.)
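
The recursive step can be checked with the same gain_ratio helper:

# Small branch (6+, 2-), with Size removed:
print(gain_ratio([6, 2], [[5, 1], [1, 1]]))    # color: ~0.0909
print(gain_ratio([6, 2], [[4, 2], [2, 0]]))    # shape: ~0.1511 -> branch on shape here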