LEARNING DECISION TREES

Slides:



Advertisements
Similar presentations
Learning from Observations
Advertisements

Learning from Observations Chapter 18 Section 1 – 3.
ICS 178 Intro Machine Learning
CPSC 502, Lecture 15Slide 1 Introduction to Artificial Intelligence (AI) Computer Science cpsc502, Lecture 15 Nov, 1, 2011 Slide credit: C. Conati, S.
Huffman code and ID3 Prof. Sin-Min Lee Department of Computer Science.
Classification Techniques: Decision Tree Learning
Final Exam: May 10 Thursday. If event E occurs, then the probability that event H will occur is p ( H | E ) IF E ( evidence ) is true THEN H ( hypothesis.
Learning Department of Computer Science & Engineering Indian Institute of Technology Kharagpur.
Learning from Observations Copyright, 1996 © Dale Carnegie & Associates, Inc. Chapter 18 Spring 2004.
Decision making in episodic environments
Machine Learning Group University College Dublin Decision Trees What is a Decision Tree? How to build a good one…
Learning from Observations Copyright, 1996 © Dale Carnegie & Associates, Inc. Chapter 18 Fall 2005.
Cooperating Intelligent Systems
LEARNING FROM OBSERVATIONS Yılmaz KILIÇASLAN. Definition Learning takes place as the agent observes its interactions with the world and its own decision-making.
18 LEARNING FROM OBSERVATIONS
Learning From Observations
Learning from Observations Copyright, 1996 © Dale Carnegie & Associates, Inc. Chapter 18 Fall 2004.
Induction of Decision Trees
Learning from Observations Copyright, 1996 © Dale Carnegie & Associates, Inc. Chapter 18.
ICS 273A Intro Machine Learning
LEARNING FROM OBSERVATIONS Yılmaz KILIÇASLAN. Definition Learning takes place as the agent observes its interactions with the world and its own decision-making.
Learning decision trees derived from Hwee Tou Ng, slides for Russell & Norvig, AI a Modern Approachslides Tom Carter, “An introduction to information theory.
Learning decision trees
Learning decision trees derived from Hwee Tou Ng, slides for Russell & Norvig, AI a Modern Approachslides Tom Carter, “An introduction to information theory.
ICS 273A Intro Machine Learning
Learning: Introduction and Overview
CS 4700: Foundations of Artificial Intelligence
Induction of Decision Trees (IDT) CSE 335/435 Resources: – –
Machine learning Image source:
Machine learning Image source:
Machine Learning CPS4801. Research Day Keynote Speaker o Tuesday 9:30-11:00 STEM Lecture Hall (2 nd floor) o Meet-and-Greet 11:30 STEM 512 Faculty Presentation.
INTRODUCTION TO MACHINE LEARNING. $1,000,000 Machine Learning  Learn models from data  Three main types of learning :  Supervised learning  Unsupervised.
Inductive learning Simplest form: learn a function from examples
LEARNING DECISION TREES Yılmaz KILIÇASLAN. Definition - I Decision tree induction is one of the simplest, and yet most successful forms of learning algorithm.
ASSESSING LEARNING ALGORITHMS Yılmaz KILIÇASLAN. Assessing the performance of the learning algorithm A learning algorithm is good if it produces hypotheses.
Learning from observations
Learning from Observations Chapter 18 Through
CHAPTER 18 SECTION 1 – 3 Learning from Observations.
Lecture 7 : Intro to Machine Learning Rachel Greenstadt & Mike Brennan November 10, 2008.
Learning from Observations Chapter 18 Section 1 – 3, 5-8 (presentation TBC)
Learning from Observations Chapter 18 Section 1 – 3.
Machine Learning II 부산대학교 전자전기컴퓨터공학과 인공지능연구실 김민호
For Monday No new reading Homework: –Chapter 18, exercises 3 and 4.
L6. Learning Systems in Java. Necessity of Learning No Prior Knowledge about all of the situations. Being able to adapt to changes in the environment.
Decision Trees. What is a decision tree? Input = assignment of values for given attributes –Discrete (often Boolean) or continuous Output = predicated.
ASSESSING LEARNING ALGORITHMS Yılmaz KILIÇASLAN. Assessing the performance of the learning algorithm A learning algorithm is good if it produces hypotheses.
ASSESSING LEARNING ALGORITHMS Yılmaz KILIÇASLAN. Assessing the performance of the learning algorithm A learning algorithm is good if it produces hypotheses.
Decision Trees Binary output – easily extendible to multiple output classes. Takes a set of attributes for a given situation or object and outputs a yes/no.
MULTI-INTERVAL DISCRETIZATION OF CONTINUOUS VALUED ATTRIBUTES FOR CLASSIFICATION LEARNING KIRANKUMAR K. TAMBALKAR.
DECISION TREE Ge Song. Introduction ■ Decision Tree: is a supervised learning algorithm used for classification or regression. ■ Decision Tree Graph:
CSC 8520 Spring Paula Matuszek DecisionTreeFirstDraft Paula Matuszek Spring,
Machine Learning Recitation 8 Oct 21, 2009 Oznur Tastan.
Chapter 18 Section 1 – 3 Learning from Observations.
Learning From Observations Inductive Learning Decision Trees Ensembles.
Anifuddin Azis LEARNING. Why is learning important? So far we have assumed we know how the world works Rules of queens puzzle Rules of chess Knowledge.
Decision Tree Learning CMPT 463. Reminders Homework 7 is due on Tuesday, May 10 Projects are due on Tuesday, May 10 o Moodle submission: readme.doc and.
Machine Learning Reading: Chapter Classification Learning Input: a set of attributes and values Output: discrete valued function Learning a continuous.
Learning from Observations
Learning from Observations
Learning from Observations
DECISION TREES An internal node represents a test on an attribute.
Introduce to machine learning
Presented By S.Yamuna AP/CSE
Decision making in episodic environments
Learning from Observations
Learning from Observations
Decision trees One possible representation for hypotheses
Decision Trees Jeff Storey.
Machine Learning: Decision Tree Learning
Decision Trees - Intermediate
Presentation transcript:

LEARNING DECISION TREES Yılmaz KILIÇASLAN

Definition - I Decision tree induction is one of the simplest, and yet most successful forms of learning algorithm. A decision tree takes as input an object or situation described by a set of attributes and returns a "decision” – the predicted output value for the input. The input attributes and the output value can be discrete or continuous.

Definition - II Learning a discrete-valued function is called classification learning. Learning a continuous function is called regression. We will concentrate on Boolean classification, wherein each example is classified as true (positive) or false (negative).

Decision Trees as Performance Elements A decision tree reaches its decision by performing a sequence of tests. Each internal node in the tree corresponds to a test of the value of one of the properties, and the branches from the node are labeled with the possible values of the test. Each leaf node in the tree specifies the value to be returned if that leaf is reached.

Example Aim: To decide whether to wait for a table at a restaurant. Attributes: 1. Alternate: whether there is a suitable alternative restaurant nearby. 2. Bar: whether the restaurant has a comfortable bar area to wait in. 3. Fri/Sat: true on Fridays and Saturdays. 4. Hungry: whether we are hungry. 5. Patrons: how many people are in the restaurant (values are None, Some, and Full). 6. Price: the restaurant's price range ($, $$, $$$). 7. Raining: whether it is raining outside. 8. Reservation: whether we made a reservation. 9. Type: the kind of restaurant (French, Italian, Thai, or burger). 10. WaitEstimate: the wait estimated by the host (0-10 minutes, 10-30, 30-60, >60).

A Decision Tree to Decide Whether to Wait for a Table

Expressiveness of Decision Trees - I Logically speaking, any particular decision tree hypotlhesis for the Will Wait goal predicate can be seen as an assertion of the form s WillWait(s)  ( P1( s ) V P2( s ) V . . . V Pn( s ) ), where each condition Pi(s) is a conjunction of tests corresponding to a path from the root of the tree to a leaf with a positive outcome. Although this looks like a first-order sentence, it is, in a sense, propositional.

Expressiveness of Decision Trees - II Decision trees are fully expressive within the class of propositional languages; that is, any Boolean function can be written as a decision tree. This can be done trivially by having each row in the truth table for the function correspond to a path in the tree. This would yield an exponentially large decision tree representation because the truth table has exponentially many rows. Decision trees can represent many (but not all) functions with much smaller trees.

Inducing decision trees from examples – I An example for a Boolean decision tree consists of a vector of input attributes, X, and a single Boolean output value y. The complete set of examples is called the training set. In order to find a decision tree that agrees with the training set, we could simply construct a decision tree that has one path to a leaf for each example, where the path tests each attribute in turn and follows the value for the example and the leaf has the classification of the example. When given the same example again, the decision tree will come up with the right classification. Unfortunately, it will not have much to say about any other cases!

Inducing decision trees from examples – II See figure 18.3.

Decision-Tree Leaning Algorithm - I The basic idea behind the Decision-Tree Learning Algorithm is to test the most important attribute first. By "most important," we mean the one that makes the most difference to the classification of an example. That way, we hope to get to the correct classification with a small number of tests, meaning that all paths in the tree will be short and the tree as a whole will be small.

Decision-Tree Leaning Algorithm - II Splitting the examples by testing on attributes (i) : (a) Splitting on Type brings us no nearer to distinguishing between positive and negative examples.

Decision-Tree Leaning Algorithm - III Splitting the examples by testing on attributes (ii) : Patrons? Splitting on Patrons does a good job of separating positive and negative examples. After splitting on Patrons, Hungry is a fairly good second test.

Decision-Tree Leaning Algorithm - IV In general, after the first attribute test splits up the examples, each outcome is a new decision tree learning problem in itself, with fewer examples and one fewer attribute. There are four cases to consider for these recursive problems: 1. If there are some positive and some negative examples, then choose the best attribute to split them. Figure 18.4(b) shows Hungry being used to split the remaining examples. 2. If all the remaining examples are positive (or all. negative), then we are done: we can answer Yes or No. Figure 18.4(b) shows examples of this in the None and Some cases.

Decision-Tree Leaning Algorithm - V 3. If there are no examples left, it means that no such example has been observed, and we return a default value calculated from the majority classification at the node's parent. 4. If there are no attributes left, but both positive and negative examples, we have a problem. It means that these examples have exactly the same description, but different classifications. This happens when some of the data are incorrect; we say there is noise in the data. It also happens either when the attributes do not give enough information to describe the situation fully, or when the domain is truly nondeterministic. One simple way out of the problem is to use a majority vote.

Decision-Tree Leaning Algorithm - VI See Figure 18.6.

A Decision Tree Induced From Examples

Choosing attribute tests The scheme used in decision tree learning for selecting attributes is designed to minimize the depth of the final tree. The idea is to pick the attribute that goes as far as possible toward providing an exact classification of the examples. A perfect attribute divides the examples into sets that are all positive or all negative. The Patrons attribute is not perfect, but it is fairly good. A really useless attribute, such as Type, leaves the example sets with roughly the same proportion of positive and negative examples as the original set. All we need, then, is a formal measure of "fairly good" and "really useless“. The measure should have its maximum value when the attribute is perfect and its minimum value when the attribute is of no use at all.

Amount of Information - I One suitable measure is the expected amount of information provided by the attribute, where we use the term in the mathematical sense first defined in Shannon and Weaver (1949). Information theory measures information content in bits. One bit of information is enough to answer a yeslno question about which one has no idea, such as the flip of a fair coin.

Amount of Information - II In general, if the possible answers vi have probabilities P(vi), then the information content I of the actual answer is given by:

Amount of Information - III For decision tree learning, the question that needs answering is; for a given example, what is the correct classification? An estimate of the probabilities of the possible answers before any of the attributes have been tested is given by the proportions of positive and negative examples in the training set. Suppose the training set contains p positive examples and n negative examples. Then an estimate of the information contained in a correct answer is:

Amount of Information - IV A test on a single attribute A will not usually tell us all the information, but it will give us some of it. We can measure exactly how much by looking at how much information we still need after the attribute test. Any attribute A divides the training set E into subsets E1, . . . , Ev, according to their values for A, where A can have v distinct values. Each subset Ei has pi positive examples and ni negative examples, so if we go along that branch, we will need an additional I(pi/(pi + ni), ni/(pi + ni)) bits of information to answer the question.

Amount of Information - V A randomly chosen example from the training set has the ith value for the attribute with probability (pi + n i ) / ( p + n), so on average, after testing attribute A, we will need the following amount of information to classify the example:

Amount of Information - VI The information gain from the attribute test is the difference between the original information requirement and the new requirement:

Amount of Information - VII The heuristic used in the CHOOSE-ATTRIBUTE function is just to choose the attribute with the largest gain. Returning to the attributes considered in Figure 18.4, we have: Gain(Patrons) = 0.541 bits.