Learning Chapter 18 and Parts of Chapter 20


Learning Chapter 18 and Parts of Chapter 20 AI systems are complex and may have many parameters. It is impractical and often impossible to encode all the knowledge a system needs. Different types of data may require very different parameters. Instead of trying to hard code all the knowledge, it makes sense to learn it.

Learning from Observations Supervised Learning – learn a function from a set of training examples, which are preclassified feature vectors.

    feature vector       class
    (square, red)        I
    (square, blue)       I
    (circle, red)        II
    (circle, blue)       II
    (triangle, red)      I
    (triangle, green)    I
    (ellipse, blue)      II
    (ellipse, red)       II

Given a previously unseen feature vector, what is the rule that tells us if it is in class I or class II?

    (circle, green)      ?
    (triangle, blue)     ?
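
As a sketch of how such a training set might be represented (the attribute names and the classify function below are illustrative assumptions, not from the slides), one rule consistent with the examples is that shape alone determines the class:

    # Hypothetical representation of the training examples above.
    training_data = [
        (("square",   "red"),   "I"),
        (("square",   "blue"),  "I"),
        (("circle",   "red"),   "II"),
        (("circle",   "blue"),  "II"),
        (("triangle", "red"),   "I"),
        (("triangle", "green"), "I"),
        (("ellipse",  "blue"),  "II"),
        (("ellipse",  "red"),   "II"),
    ]

    def classify(shape, color):
        # One rule consistent with the training set: shape alone decides
        # the class, and color is irrelevant.
        return "I" if shape in ("square", "triangle") else "II"

    print(classify("circle", "green"))    # II under this rule
    print(classify("triangle", "blue"))   # I under this rule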

Learning from Observations Unsupervised Learning – No classes are given. The idea is to find patterns in the data. This generally involves clustering. Reinforcement Learning – learn from feedback after a decision is made.

Topics to Cover: Inductive Learning (decision trees, ensembles, Bayesian decision making, neural nets).

Decision Trees The theory is well understood. Decision trees are often used in pattern recognition problems, and they have the nice property that you can easily understand the decision rule they have learned.

Shall I play tennis today?

How do we choose the best attribute? What should that attribute do for us?

Which attribute to select? (Witten & Eibe)

Criterion for attribute selection Which is the best attribute? The one that will result in the smallest tree. Heuristic: choose the attribute that produces the “purest” nodes. We need a good measure of purity! Maximal when? Minimal when?

Information Gain Which test is more informative: a split over whether Balance exceeds 50K (branches: over 50K vs. less than or equal to 50K), or a split over whether the applicant is employed (branches: employed vs. unemployed)?

Information Gain Impurity/Entropy (informal): measures the level of impurity in a group of examples.

Impurity [figure: three example groups, ranging from a very impure group, to a less impure group, to a group with minimum impurity]

Entropy: a common way to measure impurity. Entropy = -Σ_i p_i log2(p_i), where p_i is the probability of class i, computed as the proportion of class i examples in the set. Entropy comes from information theory: the higher the entropy, the higher the information content. What does that mean for learning from examples?
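
As an illustration of this formula, here is a minimal Python sketch (not part of the original slides) that computes the entropy of a set of class labels:

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy of a list of class labels: -sum(p_i * log2(p_i))."""
        total = len(labels)
        counts = Counter(labels)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    print(entropy(["I", "I", "I", "I"]))      # 0.0  (pure group, minimum impurity)
    print(entropy(["I", "I", "II", "II"]))    # 1.0  (50/50 split, maximum impurity)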

2-Class Cases: What is the entropy of a group in which all examples belong to the same class? entropy = -1 log2(1) = 0. This is minimum impurity: not a good training set for learning. What is the entropy of a group with 50% in either class? entropy = -0.5 log2(0.5) - 0.5 log2(0.5) = 1. This is maximum impurity: a good training set for learning.

Information Gain We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned. Information gain tells us how important a given attribute of the feature vectors is. We will use it to decide the ordering of attributes in the nodes of a decision tree.

Calculating Information Gain Information Gain = entropy(parent) - [weighted average entropy(children)]. In the example, the entire population of 30 instances has parent entropy 0.996 and is split into one child of 17 instances and one child of 13 instances; the weighted average entropy of the children is 0.615, so Information Gain = 0.996 - 0.615 = 0.38.
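
To make the arithmetic concrete, the sketch below recomputes these numbers. The per-child class counts are assumptions chosen to be consistent with the figures quoted on the slide (30 instances, children of 17 and 13 instances, parent entropy about 0.996, weighted child entropy about 0.615); the original figure is not reproduced here.

    import math

    def entropy(pos, neg):
        total = pos + neg
        result = 0.0
        for count in (pos, neg):
            if count:
                p = count / total
                result -= p * math.log2(p)
        return result

    # Assumed class counts, consistent with the slide's numbers.
    parent = (16, 14)          # 30 instances, entropy ~0.996
    children = [(13, 4),       # 17-instance child
                (1, 12)]       # 13-instance child

    parent_entropy = entropy(*parent)
    n = sum(parent)
    weighted_child_entropy = sum(
        (sum(c) / n) * entropy(*c) for c in children)    # ~0.615
    gain = parent_entropy - weighted_child_entropy       # ~0.38
    print(round(parent_entropy, 3), round(weighted_child_entropy, 3), round(gain, 2))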

Entropy-Based Automatic Decision Tree Construction Given a training set S of feature vectors x1 = (f11, f12, …, f1m), x2 = (f21, f22, …, f2m), …, xn = (fn1, fn2, …, fnm), what feature should be used at the root node, and what values should it be split on? Quinlan suggested information gain in his ID3 system and later the gain ratio, both based on entropy.

Using Information Gain to Construct a Decision Tree Choose the attribute A with the highest information gain for the full training set S at the root of the tree. Construct a child node for each value v1, v2, …, vk of A; each child gets the associated subset of vectors in which A has that particular value, S_v = {s ∈ S | value(A) = v}. Repeat recursively on each child (until when?). Information gain has the disadvantage that it prefers attributes with a large number of values, which split the data into small, pure subsets. Quinlan’s gain ratio applies a normalization to reduce this bias.
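
A compact Python sketch of this recursive procedure, using information gain as the splitting criterion the way ID3 does; the data layout (class label last in each row) and helper names are my own illustrative choices:

    import math
    from collections import Counter

    def entropy(rows):
        total = len(rows)
        counts = Counter(row[-1] for row in rows)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def information_gain(rows, attr):
        total = len(rows)
        by_value = {}
        for row in rows:
            by_value.setdefault(row[attr], []).append(row)
        remainder = sum((len(s) / total) * entropy(s) for s in by_value.values())
        return entropy(rows) - remainder

    def build_tree(rows, attrs):
        classes = {row[-1] for row in rows}
        if len(classes) == 1 or not attrs:   # stop: pure node, or no attributes left
            return Counter(row[-1] for row in rows).most_common(1)[0][0]
        best = max(attrs, key=lambda a: information_gain(rows, a))
        tree = {"attr": best, "children": {}}
        for value in {row[best] for row in rows}:
            subset = [row for row in rows if row[best] == value]
            tree["children"][value] = build_tree(subset, [a for a in attrs if a != best])
        return tree

    # Example call: build_tree(rows, attrs=[0, 1, 2]), where each row ends with its class label.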

Information Content The information content I(C;F) of the class variable C with possible values {c1, c2, …, cm} with respect to the feature variable F with possible values {f1, f2, …, fd} is defined by

    I(C;F) = Σ_i Σ_j P(C=ci, F=fj) log2 [ P(C=ci, F=fj) / ( P(C=ci) P(F=fj) ) ]

where P(C=ci) is the probability of class C having value ci, P(F=fj) is the probability of feature F having value fj, and P(C=ci, F=fj) is the joint probability of class C = ci and feature F = fj. These probabilities are estimated from frequencies in the training data.

Simple Example

    X  Y  Z  C
    1  1  1  I
    1  1  0  I
    0  0  1  II
    1  0  0  II

How would you distinguish class I from class II?

Example (cont.)

    X  Y  Z  C
    1  1  1  I
    1  1  0  I
    0  0  1  II
    1  0  0  II

Which attribute is best? Which is worst? Does it make sense?
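
Working this example out with a short sketch (not from the slides) that scores each attribute by its information content I(C;F):

    import math
    from collections import Counter

    # The four training examples from the slide: (X, Y, Z, class).
    rows = [(1, 1, 1, "I"), (1, 1, 0, "I"), (0, 0, 1, "II"), (1, 0, 0, "II")]

    def information_content(rows, attr):
        """Mutual information I(C;F) between the class and feature column attr."""
        n = len(rows)
        joint = Counter((row[attr], row[-1]) for row in rows)
        p_f = Counter(row[attr] for row in rows)
        p_c = Counter(row[-1] for row in rows)
        return sum((cnt / n) * math.log2((cnt / n) / ((p_f[f] / n) * (p_c[c] / n)))
                   for (f, c), cnt in joint.items())

    for name, attr in [("X", 0), ("Y", 1), ("Z", 2)]:
        print(name, round(information_content(rows, attr), 3))
    # Y is best (I = 1.0, it matches the class exactly), Z is worst (I = 0.0),
    # and X is in between (~0.311).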

Using Information Content Start with the root of the decision tree and the whole training set. Compute I(C,F) for each feature F. Choose the feature F with highest information content for the root node. Create branches for each value f of F. On each branch, create a new node with reduced training set and repeat recursively.

On the training data the full tree looks great, but that is not the case for the test data. The tree is pruned back to the red line in the figure, where it gives more accurate results on the test data.
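
A minimal sketch of this effect using scikit-learn, added here as an illustration and run on synthetic data: training accuracy keeps improving as the tree grows deeper, while held-out accuracy peaks and then drops, and pruning amounts to cutting the tree back to the size that does best on held-out data.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                               random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for depth in (1, 2, 4, 8, None):   # None = fully grown (unpruned) tree
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        tree.fit(X_train, y_train)
        print(depth,
              round(tree.score(X_train, y_train), 3),   # training accuracy
              round(tree.score(X_test, y_test), 3))     # held-out accuracy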

Decision Trees: Summary Representation = decision trees. Bias = preference for small decision trees. Search algorithm = greedy top-down construction. Heuristic function = information gain, information content, or others. Overfitting is handled by pruning. The advantage is simplicity and easy conversion to rules.