
CSE573 Autumn 1997, 03/11/98: Machine Learning

Administrative
–Finish this topic
–The rest of the time is yours
–Final exam Tuesday, Mar. 17, 2:30-4:20 p.m., here
–Additional help: today after class (me), Thursday 2-4 (SteveW), Friday after class (me), Saturday (me)

Last time
–general introduction to Machine Learning
–Machine Learning often isn't what we mean by the word "learning", as in "he learned the difference between right and wrong"
–varieties of Machine Learning:
   reinforcement learning (learn plans/policies)
   explanation-based learning (learn control rules to speed up problem solving)
   inductive (concept) learning (learn a general description from examples)
–decision tree learning is an interesting special case

This time
–finish the decision tree construction algorithm

CSE573 Autumn 1997: Example

[The slide shows the training-example table (days D1-D14 with their attribute values and classifications) used throughout the rest of the lecture; the table itself did not survive in the transcript.]

CSE573 Autumn 1997: Basic Algorithm

Recall, a node in the tree represents a conjunction of attribute values. We will try to build "the shortest possible" tree that classifies all the training examples correctly. In the algorithm we also store the list of attributes we have not used so far for classification.

Initialization:
–tree ← {}
–attributes ← {all attributes}
–examples ← {all training examples}

Recursion:
–Choose a new attribute A with possible values {a_i}
–For each a_i, add a subtree formed by recursively building the tree with
   the current node as root,
   all attributes except A,
   all examples where A = a_i

CSE573 Autumn 1997: Basic Algorithm (cont.)

Termination (working on a single node):
–If all examples have the same classification, then this combination of attribute values is sufficient to classify all (training) examples. Return the unanimous classification.
–If examples is empty, then there are no examples with this combination of attribute values. Associate some "guess" with this combination.
–If attributes is empty, then the training data is not sufficient to discriminate. Return some "guess" based on the remaining examples.
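The recursion on the previous slide and the termination cases above fit together into a short recursive procedure. Below is a minimal sketch in Python, assuming each example is a dict with a "label" key; the attribute-selection step is left abstract (any choose_attribute function can be plugged in, such as the information-gain heuristic developed on the next slides). Names like build_tree and majority_label are illustrative, not from the slides.

```python
from collections import Counter

def majority_label(examples):
    """Most common classification among the examples (the 'guess')."""
    return Counter(ex["label"] for ex in examples).most_common(1)[0][0]

def build_tree(examples, attributes, choose_attribute, default=None):
    # Termination: no examples with this combination of attribute values.
    if not examples:
        return default                      # some "guess" supplied by the parent
    # Termination: all examples agree -- return the unanimous classification.
    labels = {ex["label"] for ex in examples}
    if len(labels) == 1:
        return labels.pop()
    # Termination: no attributes left -- the data cannot discriminate further.
    if not attributes:
        return majority_label(examples)
    # Recursion: pick an attribute A and split the examples on its values.
    A = choose_attribute(examples, attributes)
    remaining = [a for a in attributes if a != A]
    tree = {A: {}}
    for value in {ex[A] for ex in examples}:
        subset = [ex for ex in examples if ex[A] == value]
        tree[A][value] = build_tree(subset, remaining, choose_attribute,
                                    default=majority_label(examples))
    return tree
```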

CSE573 Autumn 1997: What Makes a Good Attribute for Splitting?

[Figure: the 14 training examples D1-D14 partitioned by three candidate attributes: DAY (one branch per day, each holding a single example), HUMIDITY (branches high / normal), and OUTLOOK (branches sunny / overcast / rain).]

CSE573 Autumn 1997: How to Choose the Next Attribute

What is our goal in building the tree in the first place?
–Maximize accuracy over the entire data set
–Minimize the expected number of tests to classify an example in the training set
(In both cases this can argue for building the shortest tree.)

We can't really do the first looking only at the training set: we can only build a tree accurate for our subset and assume the characteristics of the full data set are the same.

To minimize the expected number of tests:
–the best test would be one where each branch has all positive or all negative instances
–the worst test would be one where the proportion of positive to negative instances is the same in every branch: knowledge of A would provide no information about the example's ultimate classification

CSE573 Autumn 1997: The Entropy (Disorder) of a Collection

Suppose S is a collection containing positive and negative examples of the target concept:
–Entropy(S) = -p+ log2(p+) - p- log2(p-)
–where p+ is the fraction of examples that are positive and p- is the fraction of examples that are negative

Good features
–minimum of 0 where p+ = 0 and where p- = 0
–maximum of 1 where p+ = p- = 0.5

Interpretation: how far away are we from having a leaf node in the tree? The best attribute would reduce the entropy in the child collections as quickly as possible.
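As a concrete check of the formula and its endpoint properties, here is a minimal Python sketch; the function name entropy and the choice to pass the positive fraction (rather than counts) are mine, not from the slides.

```python
from math import log2

def entropy(p_pos):
    """Entropy of a two-class collection, given the fraction of positive examples."""
    p_neg = 1.0 - p_pos
    if p_pos == 0.0 or p_neg == 0.0:
        return 0.0                      # a pure collection has no disorder
    return -(p_pos * log2(p_pos) + p_neg * log2(p_neg))

print(entropy(0.0), entropy(1.0))       # 0.0 0.0 : minimum at p+ = 0 or p- = 0
print(entropy(0.5))                     # 1.0     : maximum at p+ = p- = 0.5
print(round(entropy(9/14), 3))          # 0.94 (the 0.940 used on the example slides)
```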

CSE573 Autumn 1997: Entropy and Information Gain

The best attribute is one that maximizes the expected decrease in entropy
–if entropy decreases to 0, the tree need not be expanded further
–if entropy does not decrease at all, the attribute was useless

Gain is defined to be
–Gain(S, A) = Entropy(S) - Σ_{v ∈ values(A)} p_{A=v} · Entropy(S_{A=v})
–where p_{A=v} is the proportion of S where A = v, and
–S_{A=v} is the collection taken by selecting those elements of S where A = v

CSE573 Autumn 1997: Expected Information Gain Calculation

S = [10+, 15-], Entropy(S) = E(2/5) = 0.97
Splitting on attribute A gives three child collections:
–10 examples, [8+, 2-], E(8/10) = 0.72
–12 examples, [1+, 11-], E(1/12) = 0.41
–3 examples, [1+, 2-], E(1/3) = 0.92

Gain(S, A) = 0.97 - (10/25 · 0.72 + 12/25 · 0.41 + 3/25 · 0.92) = 0.97 - 0.60 = 0.37
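The same calculation can be checked in a few lines of Python. This sketch implements the Gain formula from the previous slide directly from positive/negative counts; the helper names entropy_of_counts and gain are my own, not from the slides.

```python
from math import log2

def entropy_of_counts(pos, neg):
    """Entropy of a collection given its positive/negative counts."""
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            e -= p * log2(p)
    return e

def gain(parent, children):
    """Gain(S, A) from the (pos, neg) counts of S and of each child S_{A=v}."""
    n = sum(parent)
    expected = sum((p + m) / n * entropy_of_counts(p, m) for p, m in children)
    return entropy_of_counts(*parent) - expected

# The worked example above: S = [10+, 15-] split into [8+, 2-], [1+, 11-], [1+, 2-]
print(round(gain((10, 15), [(8, 2), (1, 11), (1, 2)]), 2))   # 0.37
```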

CSE573 Autumn 1997: Example

The full training set: S: [9+, 5-], Entropy(S) = E(9/14) = 0.940

CSE573 Autumn 1997: Choosing the First Attribute

Humidity: S: [9+, 5-], E = 0.940
–Humidity = High: [3+, 4-], E = 0.985
–Humidity = Normal: [6+, 1-], E = 0.592

Wind: S: [9+, 5-], E = 0.940
–Wind = Weak: [6+, 2-], E = 0.811
–Wind = Strong: [3+, 3-], E = 1.00

Gain(S, Humidity) = 0.940 - (7/14)·0.985 - (7/14)·0.592 = 0.151
Gain(S, Wind) = 0.940 - (8/14)·0.811 - (6/14)·1.00 = 0.048
Gain(S, Outlook) = 0.246
Gain(S, Temperature) = 0.029
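The selection step can be run directly on the class counts from this slide. The snippet below repeats the small entropy/gain helpers from the earlier sketch so it runs on its own; the attribute names and counts are from the slide, the code structure is mine.

```python
from math import log2

def entropy_of_counts(pos, neg):
    total = pos + neg
    return sum(-c / total * log2(c / total) for c in (pos, neg) if c)

def gain(parent, children):
    n = sum(parent)
    return entropy_of_counts(*parent) - sum(
        (p + m) / n * entropy_of_counts(p, m) for p, m in children)

# (pos, neg) counts per branch, read off the slide above
candidates = {
    "Humidity": [(3, 4), (6, 1)],   # High, Normal
    "Wind":     [(6, 2), (3, 3)],   # Weak, Strong
}
gains = {attr: gain((9, 5), splits) for attr, splits in candidates.items()}
for attr, g in gains.items():
    print(attr, round(g, 3))        # Humidity 0.152, Wind 0.048 (slide: 0.151 and 0.048)
print(max(gains, key=gains.get))    # Humidity of these two; Outlook (gain 0.246) wins overall
```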

CSE573 Autumn 1997: After the First Iteration

Outlook is chosen as the root, splitting the examples D1, D2, ..., D14:
–Outlook = Sunny: D1, D2, D8, D9, D11, [3+, 2-], E = 0.970 (still ?)
–Outlook = Overcast: D3, D7, D12, D13, [4+, 0-] (Yes)
–Outlook = Rain: D4, D5, D6, D10, D14, [3+, 2-] (still ?)

Choosing the attribute for the Sunny node:
–Gain(S_sunny, Humidity) = 0.970
–Gain(S_sunny, Temp) = 0.570
–Gain(S_sunny, Wind) = 0.019

CSE573 Autumn 1997: Final Tree

Outlook
–Sunny: test Humidity
   High: No
   Normal: Yes
–Overcast: Yes
–Rain: test Wind
   Strong: No
   Weak: Yes
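The same tree can be written in the nested-dict representation used in the build_tree sketch, together with a small classifier that walks it; the dict layout and the classify helper are illustrative choices, not from the slides.

```python
# The learned tree: internal nodes are {attribute: {value: subtree_or_label}}
final_tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(tree, example):
    """Follow attribute tests until a leaf label is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))            # the attribute tested at this node
        tree = tree[attribute][example[attribute]]
    return tree

# Example: a sunny, high-humidity case is classified No
print(classify(final_tree, {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}))
```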

CSE573 Autumn 1997: Some Additional Technical Problems

Noise in the data
–Not much you can do about it

Overfitting
–What's good for the training set may not be good for the full data set

Missing values
–Attribute values omitted in training-set cases or in subsequent (untagged) cases to be classified

CSE573 Autumn 1997: Data Overfitting

Overfitting, definition
–Given a set of trees T, a tree t ∈ T is said to overfit the training data if there is some alternative tree t' such that t has better accuracy than t' over the training examples, but t' has better accuracy than t over the entire set of examples

The decision not to stop until attributes or examples are exhausted is somewhat arbitrary
–you could always stop and take the majority decision, and the tree would be shorter as a result!

The standard stopping rule provides 100% accuracy on the training set, but not necessarily on the test set
–if there is noise in the training data
–if the training data is too small to give good coverage, spurious correlations are likely

CSE573 Autumn 1997: Overfitting (continued)

How to avoid overfitting
–stop growing the tree before it perfectly classifies the training data
–allow overfitting, but post-prune the tree

Training and validation sets
–the training set is used to build the tree
–a separate validation set is used to evaluate the accuracy over subsequent data, and to evaluate the impact of pruning (the validation set is unlikely to exhibit the same noise and spurious correlations)
–rule of thumb: 2/3 of the data to the training set, 1/3 to the validation set

CSE573 Autumn 1997: Reduced Error Pruning

Pruning a node consists of removing all of its subtrees, making it a leaf, and assigning it the most common classification of the associated training examples.

Prune nodes iteratively and greedily: at each step remove the node that most improves accuracy over the validation set
–but never remove a node that decreases accuracy

A good method if you have lots of cases.
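A hedged sketch of reduced error pruning over the nested-dict trees used in the earlier sketches. It assumes examples are dicts with a "label" key and that validation examples only take attribute values that appear in the tree; the helper names are illustrative, and the greedy loop follows the rule above (prune the node that helps validation accuracy most, never one that hurts it).

```python
import copy
from collections import Counter

def classify(tree, example):
    """Same walk as in the earlier sketch: follow tests until a leaf label."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][example[attribute]]
    return tree

def accuracy(tree, examples):
    return sum(classify(tree, ex) == ex["label"] for ex in examples) / len(examples)

def internal_paths(tree, path=()):
    """Paths ((attribute, value), ...) from the root to every internal node."""
    if isinstance(tree, dict):
        yield path
        attribute = next(iter(tree))
        for value, subtree in tree[attribute].items():
            yield from internal_paths(subtree, path + ((attribute, value),))

def examples_reaching(examples, path):
    for attribute, value in path:
        examples = [ex for ex in examples if ex[attribute] == value]
    return examples

def pruned_copy(tree, path, label):
    """Copy of the tree with the node at `path` replaced by the leaf `label`."""
    if not path:
        return label                          # pruning the root collapses the tree
    new_tree = copy.deepcopy(tree)
    node = new_tree
    for attribute, value in path[:-1]:
        node = node[attribute][value]
    last_attribute, last_value = path[-1]
    node[last_attribute][last_value] = label
    return new_tree

def reduced_error_prune(tree, train, valid):
    best_acc = accuracy(tree, valid)
    while True:
        best_candidate = None
        for path in internal_paths(tree):
            reaching = examples_reaching(train, path)
            if not reaching:
                continue
            # Most common classification of the associated training examples
            label = Counter(ex["label"] for ex in reaching).most_common(1)[0][0]
            candidate = pruned_copy(tree, path, label)
            acc = accuracy(candidate, valid)
            # Never accept a prune that decreases validation accuracy;
            # among the rest, greedily keep the one that helps most.
            if acc >= best_acc and (best_candidate is None or acc > best_acc):
                best_acc, best_candidate = acc, candidate
        if best_candidate is None:
            return tree
        tree = best_candidate
```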

CSE573 Autumn 1997: Overfitting (continued)

How to avoid overfitting
–stop growing the tree before it perfectly classifies the training data
–allow overfitting, but post-prune the tree

Training and validation sets
–the training set is used to form the learned hypothesis
–the validation set is used to evaluate the accuracy over subsequent data, and to evaluate the impact of pruning
–justification: the validation set is unlikely to exhibit the same noise and spurious correlations
–rule of thumb: 2/3 of the data to the training set, 1/3 to the validation set

CSE573 Autumn 1997: Missing Attribute Values

Situations
–missing attribute value(s) in the training set
–missing value(s) in the validation set or in subsequent cases to be classified

Quick and dirty methods
–assign the value most common among the other training examples at the same node
–assign the value most common among the other training examples at the same node that have the same classification

"Fractional" method
–assign a probability to each value of A based on observed frequencies
–create "fractional cases" with these probabilities
–weight the information-gain computation by each case's fraction

CSE573 Autumn 1997: Example: Fractional Values

Node: D1, D2, D8, D9, D11, [3+, 2-], E = 0.970
D8's Wind value is treated as missing, so D8 is split 0.5/0.5 between the two branches:
–wind = weak: D1(1.0), D8(0.5), D9(1.0), [1.5+, 1.5-], E = 1
–wind = strong: D2(1.0), D8(0.5), D11(1.0), [1.5+, 1.5-], E = 1
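A minimal sketch of how fractional weights enter the entropy computation. Each case carries a weight; the particular labeled cases below are illustrative, chosen only to produce weighted counts of [1.5+, 1.5-] as on the slide.

```python
from math import log2

def weighted_entropy(cases):
    """Entropy of a collection of (label, weight) pairs; fractional weights allowed."""
    total = sum(w for _, w in cases)
    e = 0.0
    for label in {lbl for lbl, _ in cases}:
        p = sum(w for lbl, w in cases if lbl == label) / total
        e -= p * log2(p)
    return e

# A branch whose weighted counts come out to [1.5+, 1.5-] has entropy 1,
# as on the slide; the individual weighted cases here are illustrative.
branch = [("Yes", 1.0), ("Yes", 0.5), ("No", 1.0), ("No", 0.5)]
print(weighted_entropy(branch))   # 1.0
```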

CSE573 Autumn 1997: Decision Tree Learning

The problem: given a data set, produce the shortest-depth decision tree that accurately classifies the data.
The heuristic: build the tree greedily on the basis of expected entropy loss (information gain).

Common problems
–the training set is not a good surrogate for the full data set (noise, spurious correlations)
–thus the tree that is optimal for the training set may not be accurate for the full data set (overfitting)
–missing values in the training set or in subsequent cases