Machine Learning: Lecture 3


Machine Learning: Lecture 3, Decision Tree Learning (based on Chapter 3 of Mitchell, T., Machine Learning, 1997). Thanks to Bryan Pardo (http://bryanpardo.com) for the illustrations on slides 9 and 18.

Decision Tree Representation
A decision tree for the concept PlayTennis: the root node tests Outlook (branches Sunny, Overcast, Rain); the Sunny branch tests Humidity (High → No, Normal → Yes); the Overcast branch is the leaf Yes; the Rain branch tests Wind (Strong → No, Weak → Yes).
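As an illustration only (not from the original slides), the tree above can also be written down as a small data structure. The sketch below encodes it as nested Python dictionaries and classifies an example by walking from the root to a leaf; the layout and the name classify are my own choices.

```python
# A decision tree as nested dicts: internal nodes are {"attribute": {value: subtree, ...}},
# leaves are plain class labels ("Yes" / "No").
play_tennis_tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No",  "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind":     {"Strong": "No", "Weak":  "Yes"}},
    }
}

def classify(tree, example):
    """Walk the tree: test the attribute at the current node, follow the branch
    matching the example's value, and repeat until a leaf label is reached."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[example[attribute]]
    return tree

# Example: a sunny day with high humidity is classified as "No" (don't play tennis).
print(classify(play_tennis_tree, {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}))
```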

Appropriate Problems for Decision Tree Learning
Instances are represented by discrete attribute-value pairs (though the basic algorithm has been extended to real-valued attributes as well).
The target function has discrete output values (there can be more than two possible output values, i.e., classes).
Disjunctive hypothesis descriptions may be required.
The training data may contain errors.
The training data may contain missing attribute values.

ID3: The Basic Decision Tree Learning Algorithm
See the database on the next slide. What is the "best" attribute? Answer: Outlook ("best" = the attribute with the highest information gain).
[Figure: the 14 training examples D1-D14 gathered at the root node before any split.]

The PlayTennis training examples (Mitchell, Table 3.2):
Day   Outlook    Temperature  Humidity  Wind    PlayTennis
D1    Sunny      Hot          High      Weak    No
D2    Sunny      Hot          High      Strong  No
D3    Overcast   Hot          High      Weak    Yes
D4    Rain       Mild         High      Weak    Yes
D5    Rain       Cool         Normal    Weak    Yes
D6    Rain       Cool         Normal    Strong  No
D7    Overcast   Cool         Normal    Strong  Yes
D8    Sunny      Mild         High      Weak    No
D9    Sunny      Cool         Normal    Weak    Yes
D10   Rain       Mild         Normal    Weak    Yes
D11   Sunny      Mild         Normal    Strong  Yes
D12   Overcast   Mild         High      Strong  Yes
D13   Overcast   Hot          Normal    Weak    Yes
D14   Rain       Mild         High      Strong  No

ID3 (Cont'd)
After splitting on Outlook, the Overcast branch (D3, D7, D12, D13) contains only Yes examples, so it becomes a Yes leaf. What are the "best" attributes for the remaining Sunny branch (D1, D2, D8, D9, D11) and Rain branch (D4, D5, D6, D10, D14)? Answer: Humidity and Wind.

What Attribute Should Be Chosen to "Best" Split a Node?
Choose the attribute that minimizes the disorder (or entropy) in the subtree rooted at the given node. Disorder and information are related as follows: the more disorderly a set, the more information is required to correctly guess an element of that set.
Information: what is the best strategy for guessing a number from a finite set of possible numbers, i.e., how many yes/no questions do we need to ask in order to know the answer (we are looking for the minimal number of questions)? Answer: log2 |S|, where S is the set of numbers and |S| its cardinality. The strategy is to repeatedly halve the remaining range, e.g., for the numbers 0 to 10: Q1: is it smaller than 5? Q2: is it smaller than 2? and so on.
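As a quick sanity check of the log2 |S| claim (an added illustration, not from the slides): for the numbers 0 through 10 we have |S| = 11 and log2 11 ≈ 3.46, so four yes/no questions always suffice under the halving strategy, while three can never be enough in general, since three questions distinguish at most 2^3 = 8 < 11 possibilities.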

Entropy (1)
The entropy of a set is a measure characterizing the degree of disorder or impurity in a collection of examples. The idea was borrowed from the field of thermodynamics, where it relates to the states (gas/liquid/solid) of a system. For classification, we use the following formula:
Entropy(S) = - p+ log2 p+ - p- log2 p-
where p+ and p- represent the proportions of positive and negative examples in S.

Entropy (2)
The entropy can also be thought of as the minimum number of bits of information necessary to encode the classification of an arbitrary member of S. Examples:
If p+ is 1 → all the examples are positive, i.e., no information is needed: Entropy(S) = 0.
If p+ is 1/2 → 1 bit is required to indicate whether the example is positive or negative: Entropy(S) = 1.
If p+ is 0.8 → the class can be encoded using (on average) less than 1 bit per example, by giving shorter codes to positive examples and longer codes to negative ones.
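The three examples above can be checked numerically. Below is a minimal sketch (the function name entropy is my own, not from the lecture) that evaluates the formula from the previous slide:

```python
import math

def entropy(p_pos):
    """Two-class entropy: -p+ log2 p+ - p- log2 p-, with 0 * log2(0) taken as 0."""
    result = 0.0
    for p in (p_pos, 1.0 - p_pos):
        if p > 0:
            result -= p * math.log2(p)
    return result

print(entropy(1.0))   # 0.0   -> all examples positive, no information needed
print(entropy(0.5))   # 1.0   -> one full bit per example
print(entropy(0.8))   # ~0.72 -> less than one bit per example on average
```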

Entropy (3)

Entropy (4)
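For reference (this is the standard generalization from Mitchell's Chapter 3, and may or may not be exactly what this slide showed): for a target attribute that can take on c different values, the entropy generalizes to

Entropy(S) = \sum_{i=1}^{c} -p_i \log_2 p_i

where p_i is the proportion of examples in S belonging to class i.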

Information Gain (1)
The information gain measures the effectiveness of an attribute in classifying the training data: it is the expected reduction in entropy caused by partitioning the examples according to that attribute. Gain(S, A) is the expected reduction in entropy caused by knowing the value of attribute A.

Information Gain (2)
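For reference, the verbal definition above corresponds to the standard formula from Mitchell's Chapter 3:

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)

where Values(A) is the set of all possible values of attribute A, and S_v is the subset of S for which attribute A has value v.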

The PlayTennis training examples D1-D14 (Day, Outlook, Temperature, Humidity, Wind, PlayTennis) are shown again on this slide; see the full table earlier in the lecture.

Example: PlayTennis
Calculate the information gain of attribute Wind: Step 1: calculate Entropy(S); Step 2: calculate Gain(S, Wind). Then calculate the information gain of all the other attributes (Outlook, Temperature, and Humidity) and choose the attribute with the highest information gain. <See the calculations in class>
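For reference, a worked version of the in-class calculation (an added sketch following Mitchell's Chapter 3; values rounded to three decimals). S contains 9 positive and 5 negative examples, so

Entropy(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940

Wind splits S into Weak (8 examples: 6 positive, 2 negative; entropy ≈ 0.811) and Strong (6 examples: 3 positive, 3 negative; entropy = 1.000), so

Gain(S, Wind) = 0.940 - \frac{8}{14}(0.811) - \frac{6}{14}(1.000) \approx 0.048

Repeating the computation for the other attributes gives Gain(S, Outlook) ≈ 0.246, Gain(S, Humidity) ≈ 0.151, and Gain(S, Temperature) ≈ 0.029, so Outlook has the highest information gain and is chosen for the root.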

Hypothesis Space Search in Decision Tree Learning
Hypothesis space: the set of possible decision trees (i.e., the complete space of finite discrete-valued functions).
Search method: simple-to-complex hill-climbing search; only a single current hypothesis is maintained (unlike the candidate-elimination method). No backtracking!
Evaluation function: the information gain measure.
Batch learning: ID3 uses all training examples at each step to make statistically based decisions (unlike the candidate-elimination method, which makes decisions incrementally) ==> the search is less sensitive to errors in individual training examples.
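To make the greedy, no-backtracking search concrete, here is a minimal self-contained sketch of ID3 in Python. It is an illustration of the algorithm as described above, not code from the lecture; the helper names (entropy, information_gain, id3) and the nested-dictionary tree representation are my own choices.

```python
import math
from collections import Counter

def entropy(examples, target):
    """Entropy of the target-label distribution over a list of example dicts."""
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute, target):
    """Expected reduction in entropy from partitioning the examples on attribute."""
    total = len(examples)
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attributes, target):
    """Greedy top-down induction: choose the highest-gain attribute, split, recurse.
    A single tree is grown with no backtracking; trees are nested dicts, leaves are labels."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                 # pure node -> leaf
        return labels[0]
    if not attributes:                        # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    remaining = [a for a in attributes if a != best]
    # Branch only on values observed here; Mitchell's ID3 also adds majority-label
    # leaves for attribute values with no training examples.
    return {best: {value: id3([ex for ex in examples if ex[best] == value],
                              remaining, target)
                   for value in {ex[best] for ex in examples}}}
```

Calling id3(examples, ["Outlook", "Temperature", "Humidity", "Wind"], "PlayTennis") on the 14 examples from the table (each encoded as a dict) should reproduce the tree shown earlier, with Outlook at the root.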

Inductive Bias in Decision Tree Learning ID3’s Inductive Bias: Shorter trees are preferred over longer trees. Trees that place high information gain attributes close to the root are preferred over those that do not. Note: this type of bias is different from the type of bias used by Candidate-Elimination: the inductive bias of ID3 follows from its search strategy (preference or search bias) whereas the inductive bias of the Candidate-Elimination algorithm follows from the definition of its hypothesis space (restriction or language bias).

Why Prefer Short Hypotheses?
Occam's razor: prefer the simplest hypothesis that fits the data [William of Occam (philosopher), circa 1320]. Scientists seem to do this: e.g., physicists seem to prefer simple explanations for the motion of planets over more complex ones.
Argument: since there are fewer short hypotheses than long ones, it is less likely that one will find a short hypothesis that coincidentally fits the training data.
Problem with this argument: it can be made about many other constraints. Why is the "short description" constraint more relevant than others?
Nevertheless: Occam's razor has been shown experimentally to be a successful strategy.

Issues in Decision Tree Learning: I. Overfitting
Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H such that h has smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances.
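Written symbolically (a restatement of the definition above, where error_train(h) denotes the error of h over the training examples and error_D(h) its error over the entire distribution D of instances): h ∈ H overfits the training data if there exists some h' ∈ H such that

error_{train}(h) < error_{train}(h') \quad \text{and} \quad error_{D}(h) > error_{D}(h')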

Avoiding Overfitting the Data (1)
There are two approaches for avoiding overfitting in decision trees:
Stop growing the tree before it perfectly fits the data.
Allow the tree to overfit the data, and then post-prune it.

Avoiding Overfitting the Data (2)
Three criteria can be used to determine the correct final tree size:
Train and validation set approach: use a separate set of examples (distinct from the training examples) to evaluate the utility of post-pruning nodes → Reduced Error Pruning.
Use all the available training data, but apply a statistical test to estimate the effect of expanding or pruning a particular node.
Use an explicit measure of the complexity of encoding the training examples and the decision tree.

Avoiding Overfitting the Data (3): Reduced Error Pruning
Consider each of the decision nodes in the tree. For each node, compare the performance of the tree over the validation set with that node expanded and with that node pruned (replaced by a leaf). Example: node 3 in the illustration; if the tree with node 3 pruned (right tree) is as accurate over the validation set as the original tree (left tree), then prune that node.
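Below is a minimal sketch of reduced-error pruning under the nested-dictionary tree representation used in the earlier sketches. It illustrates the idea rather than reproducing the lecture's procedure: the names classify, accuracy, majority_label, and reduced_error_prune are my own, it prunes bottom-up, and it assumes every attribute value appearing in the validation data also appears as a branch in the tree.

```python
from collections import Counter

def classify(tree, example):
    """Follow attribute tests in a nested-dict tree until a leaf label is reached."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[example[attribute]]
    return tree

def accuracy(tree, examples, target):
    """Fraction of examples whose predicted label matches their target label."""
    return sum(classify(tree, ex) == ex[target] for ex in examples) / len(examples)

def majority_label(examples, target):
    return Counter(ex[target] for ex in examples).most_common(1)[0][0]

def reduced_error_prune(tree, train, validation, target):
    """Bottom-up pass over the decision nodes: at each node, compare the subtree with
    a majority-vote leaf on the validation examples reaching that node, and keep the
    leaf (i.e., prune the node) whenever it is at least as accurate."""
    if not isinstance(tree, dict):
        return tree                                   # already a leaf
    attribute, branches = next(iter(tree.items()))
    pruned = {value: reduced_error_prune(
                  subtree,
                  [ex for ex in train if ex[attribute] == value],
                  [ex for ex in validation if ex[attribute] == value],
                  target)
              for value, subtree in branches.items()}
    candidate = {attribute: pruned}
    if not train or not validation:                   # too little data here to decide
        return candidate
    leaf = majority_label(train, target)              # what this node becomes if pruned
    if accuracy(leaf, validation, target) >= accuracy(candidate, validation, target):
        return leaf
    return candidate
```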

Avoiding Overfitting the Data (4): Rule Post-Pruning
This is the strategy used in C4.5:
Grow a tree.
Convert the tree into an equivalent set of rules.
Prune (generalize) each rule by removing any preconditions whose removal improves its estimated accuracy.
Sort the pruned rules by their estimated accuracy and consider them in this sequence when classifying subsequent instances.
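A rough sketch of the convert-and-prune steps, again over the nested-dictionary tree representation; tree_to_rules, rule_accuracy, and prune_rule are illustrative names, and accuracy is estimated here simply over a supplied set of examples (C4.5 itself uses a pessimistic estimate based on the training data).

```python
def tree_to_rules(tree, preconditions=()):
    """Enumerate root-to-leaf paths of a nested-dict tree as (preconditions, label) rules,
    where preconditions is a sequence of (attribute, value) tests."""
    if not isinstance(tree, dict):
        return [(list(preconditions), tree)]
    attribute, branches = next(iter(tree.items()))
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, preconditions + ((attribute, value),))
    return rules

def rule_accuracy(preconditions, label, examples, target):
    """Estimated accuracy of 'IF preconditions THEN label' over the examples it covers."""
    covered = [ex for ex in examples if all(ex[a] == v for a, v in preconditions)]
    if not covered:
        return 0.0
    return sum(ex[target] == label for ex in covered) / len(covered)

def prune_rule(preconditions, label, examples, target):
    """Greedily drop any single precondition whose removal improves estimated accuracy,
    repeating until no removal helps."""
    preconditions = list(preconditions)
    improved = True
    while improved and preconditions:
        improved = False
        base = rule_accuracy(preconditions, label, examples, target)
        for i in range(len(preconditions)):
            trial = preconditions[:i] + preconditions[i + 1:]
            if rule_accuracy(trial, label, examples, target) > base:
                preconditions, improved = trial, True
                break
    return preconditions, label
```

The pruned rules would then be sorted by estimated accuracy and applied in that order, as in the last step on the slide.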

Issues in Decision Tree Learning: II. Other Issues
Incorporating Continuous-Valued Attributes
Alternative Measures for Selecting Attributes
Handling Training Examples with Missing Attribute Values
Handling Attributes with Differing Costs