Inductive Learning (1/2) Decision Tree Method


Inductive Learning (1/2) Decision Tree Method. Russell and Norvig: Chapter 18, Sections 18.1 through 18.3. CS121 – Winter 2003

Quotes: “Our experience of the world is specific, yet we are able to formulate general theories that account for the past and predict the future.” (Genesereth and Nilsson, Logical Foundations of Artificial Intelligence, 1987) “Entities are not to be multiplied without necessity.” (Ockham, 1285-1349)

Learning Agent. [diagram: an agent interacting with its environment; sensors deliver percepts and actuators perform actions; inside the agent, a learning element and a critic update the KB used by the problem solver]

Contents: 1. Introduction to inductive learning. 2. Logic-based inductive learning: decision tree method; version space method (+ why inductive learning works). 3. Function-based inductive learning: neural nets.

Inductive Learning Frameworks Function-learning formulation Logic-inference formulation

Function-Learning Formulation. Goal function f. Training set: (x_i, f(x_i)), i = 1, …, n. Inductive inference: find a function h that fits the points well. [plot: sample points (x_i, f(x_i)) and a smooth curve h fitted through them] Example: neural nets.
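
Below is a minimal sketch of this formulation in Python (not from the slides): the unknown f is taken to be a noisy sine purely for illustration, and the hypothesis h is restricted to degree-3 polynomials fitted by least squares.

    import numpy as np

    # Pretend nature's goal function f is a noisy sine (an arbitrary choice
    # for illustration); we only ever see the sample points (x_i, f(x_i)).
    rng = np.random.default_rng(0)
    xs = np.linspace(0.0, 1.0, 20)
    ys = np.sin(2 * np.pi * xs) + 0.1 * rng.standard_normal(xs.size)

    # Inductive inference: pick h from a restricted family (degree-3
    # polynomials) so that it fits the observed points well.
    h = np.poly1d(np.polyfit(xs, ys, deg=3))

    print("h(0.25) =", h(0.25))  # prediction at a point not in the training set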

Logic-Inference Formulation. Background knowledge KB. Training set D (observed knowledge) such that KB ⊭ D and {KB, D} is satisfiable. Inductive inference: find h (inductive hypothesis) such that {KB, h} is satisfiable and {KB, h} ⊨ D. Usually this is not a sound inference. h = D is a trivial, but uninteresting, solution (data caching).

Rewarded Card Example. Deck of cards, with each card designated by [r,s], its rank and suit, and some cards “rewarded”. Background knowledge KB:
((r=1) ∨ … ∨ (r=10)) ⇒ NUM(r)
((r=J) ∨ (r=Q) ∨ (r=K)) ⇒ FACE(r)
((s=S) ∨ (s=C)) ⇒ BLACK(s)
((s=D) ∨ (s=H)) ⇒ RED(s)
Training set D: REWARD([4,C]) ∧ REWARD([7,C]) ∧ REWARD([2,S]) ∧ ¬REWARD([5,H]) ∧ ¬REWARD([J,S]). Possible inductive hypothesis: h ≡ (NUM(r) ∧ BLACK(s) ⇒ REWARD([r,s])). There are several possible inductive hypotheses.
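
As a small illustrative check (not part of the slides), the following snippet encodes cards as (rank, suit) pairs, with J arbitrarily encoded as rank 11, and verifies that h agrees with every example in D:

    # Candidate hypothesis h: a card is rewarded iff it is numbered and black.
    NUM_RANKS = set(range(1, 11))   # ranks 1..10
    BLACK_SUITS = {"S", "C"}        # spades and clubs

    def h(rank, suit):
        return rank in NUM_RANKS and suit in BLACK_SUITS

    # Training set D as (rank, suit, rewarded?) triples; J is encoded as 11.
    D = [(4, "C", True), (7, "C", True), (2, "S", True),
         (5, "H", False), (11, "S", False)]

    assert all(h(r, s) == rewarded for r, s, rewarded in D)
    print("h agrees with every example in D")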

Learning a Predicate. Set E of objects (e.g., cards). Goal predicate CONCEPT(x), where x is an object in E, that takes the value True or False (e.g., REWARD). Example: CONCEPT may describe the precondition of an action, e.g., Unstack(C,A); E is the set of states, and CONCEPT(x) ⇔ HANDEMPTY ∈ x ∧ BLOCK(C) ∈ x ∧ BLOCK(A) ∈ x ∧ CLEAR(C) ∈ x ∧ ON(C,A) ∈ x. Learning CONCEPT is a step toward learning the action.

Learning a Predicate (continued). Observable predicates A(x), B(x), … (e.g., NUM, RED). Training set: values of CONCEPT for some combinations of values of the observable predicates.

A Possible Training Set. [table: ten examples (Ex. #1–10), each listing truth values for the observable predicates A, B, C, D, E and for CONCEPT] Note that the training set does not say whether an observable predicate A, …, E is pertinent or not.

Learning a Predicate (continued). Find a representation of CONCEPT in the form CONCEPT(x) ⇔ S(A,B,…), where S(A,B,…) is a sentence built with the observable predicates, e.g.: CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)).

Learning the Concept of an Arch. [figures: several block structures shown as positive examples of arches, and several shown as negative examples] ARCH(x) ⇔ HAS-PART(x,b1) ∧ HAS-PART(x,b2) ∧ HAS-PART(x,b3) ∧ IS-A(b1,BRICK) ∧ IS-A(b2,BRICK) ∧ ¬MEET(b1,b2) ∧ (IS-A(b3,BRICK) ∨ IS-A(b3,WEDGE)) ∧ SUPPORTED(b3,b1) ∧ SUPPORTED(b3,b2)

Example Set. An example consists of the values of CONCEPT and the observable predicates for some object x. An example is positive if CONCEPT is True, else it is negative. The set X of all examples is the example set. The training set is a subset of X, usually a small one!

Hypothesis Space. A hypothesis is any sentence h of the form CONCEPT(x) ⇔ S(A,B,…), where S(A,B,…) is a sentence built with the observable predicates. The set of all hypotheses is called the hypothesis space H. A hypothesis h agrees with an example if it gives the correct value of CONCEPT. (It is called a space because it has some internal structure.)

Inductive Learning Scheme. [diagram: a training set D of positive (+) and negative (−) examples is drawn from the example set X = {[A, B, …, CONCEPT]}; the learner searches the hypothesis space H = {CONCEPT(x) ⇔ S(A,B,…)} for an inductive hypothesis h]

Size of the Hypothesis Space. n observable predicates → 2^n entries in the truth table. In the absence of any restriction (bias), there are 2^(2^n) hypotheses to choose from. n = 6 → 2^64 ≈ 2×10^19 hypotheses!

Multiple Inductive Hypotheses. Need for a system of preferences, called a bias, to compare possible hypotheses. All of the following agree with all the examples in the training set:
h1 ≡ NUM(x) ∧ BLACK(x) ⇒ REWARD(x)
h2 ≡ BLACK([r,s]) ∧ ¬(r=J) ⇒ REWARD([r,s])
h3 ≡ (([r,s]=[4,C]) ∨ ([r,s]=[7,C]) ∨ ([r,s]=[2,S]) ⇒ REWARD([r,s])) ∧ (([r,s]=[5,H]) ∨ ([r,s]=[J,S]) ⇒ ¬REWARD([r,s]))

Keep-It-Simple (KIS) Bias. Motivation: if a hypothesis is too complex, it may not be worth learning (data caching might do the job just as well), and there are far fewer simple hypotheses than complex ones, so the hypothesis space is smaller. Examples: use far fewer observable predicates than suggested by the training set; constrain the learnt predicate, e.g., to use only “high-level” observable predicates such as NUM, FACE, BLACK, and RED, and/or to have a simple syntax (e.g., a conjunction of literals). If the bias allows only sentences S that are conjunctions of k << n predicates picked from the n observable predicates, then the size of H is O(n^k): there are at most C(n,k)·2^k ≤ (2n)^k such conjunctions, counting negated literals.

Putting Things Together. [diagram: from the object set, the goal predicate and observable predicates define the example set X; a training set D and a test set are drawn from X; the learning procedure L, guided by the bias, searches the hypothesis space H and outputs an induced hypothesis h, which is then evaluated (yes/no) against the test set]

Predicate-Learning Methods Decision tree Version space

Predicate as a Decision Tree. The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree: [tree: test A; if A is False, CONCEPT is False; if A is True, test B; if B is False, CONCEPT is True; if B is True, test C: True if C is True, False otherwise]. Example: a mushroom is poisonous iff it is yellow and small, or yellow, big and spotted; x is a mushroom, CONCEPT = POISONOUS, A = YELLOW, B = BIG, C = SPOTTED. Other observable predicates, e.g., D = FUNNEL-CAP and E = BULKY, may be available but irrelevant.

Training Set

Ex. #  A      B      C      D      E      CONCEPT
1      False  False  True   False  True   False
2      False  True   False  False  False  False
3      False  True   True   True   True   False
4      False  False  True   False  False  False
5      False  False  False  True   True   False
6      True   False  True   False  False  True
7      True   False  False  True   False  True
8      True   False  True   False  True   True
9      True   True   True   False  True   True
10     True   True   True   True   True   True
11     True   True   False  False  False  False
12     True   True   False  False  True   False
13     True   False  True   True   True   True
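
For the sketches that follow, here is the same training set encoded in Python; the names EXAMPLES and ATTRS are introduced here for convenience and are not part of the slides.

    # Rows 1..13 of the table above, as (A, B, C, D, E, CONCEPT) tuples.
    EXAMPLES = [
        (False, False, True,  False, True,  False),   # 1
        (False, True,  False, False, False, False),   # 2
        (False, True,  True,  True,  True,  False),   # 3
        (False, False, True,  False, False, False),   # 4
        (False, False, False, True,  True,  False),   # 5
        (True,  False, True,  False, False, True),    # 6
        (True,  False, False, True,  False, True),    # 7
        (True,  False, True,  False, True,  True),    # 8
        (True,  True,  True,  False, True,  True),    # 9
        (True,  True,  True,  True,  True,  True),    # 10
        (True,  True,  False, False, False, False),   # 11
        (True,  True,  False, False, True,  False),   # 12
        (True,  False, True,  True,  True,  True),    # 13
    ]
    ATTRS = ["A", "B", "C", "D", "E"]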

Possible Decision Tree. [figure: a large decision tree testing D, E, C, B, and A] CONCEPT ⇔ (D ∧ (E ∨ A)) ∨ (C ∧ (B ∨ ((E ∧ A) ∨ A))). Compare with the small tree: CONCEPT ⇔ A ∧ (¬B ∨ C). KIS bias → build the smallest decision tree. Finding the smallest tree is a computationally intractable problem, so we use a greedy algorithm.

Getting Started. The distribution of the training set is: True: 6, 7, 8, 9, 10, 13; False: 1, 2, 3, 4, 5, 11, 12. Without testing any observable predicate, we could report that CONCEPT is False (majority rule), with an estimated probability of error P(E) = 6/13. Assuming that we will only include one observable predicate in the decision tree, which predicate should we test to minimize the probability of error?

Assume It’s A. A = True branch: True: 6, 7, 8, 9, 10, 13; False: 11, 12. A = False branch: False: 1, 2, 3, 4, 5. If we test only A, we will report that CONCEPT is True if A is True (majority rule) and False otherwise. The estimated probability of error is: Pr(E) = (8/13)×(2/8) + (5/13)×0 = 2/13.

Assume It’s B. B = True branch: True: 9, 10; False: 2, 3, 11, 12. B = False branch: True: 6, 7, 8, 13; False: 1, 4, 5. If we test only B, we will report that CONCEPT is False if B is True and True otherwise. The estimated probability of error is: Pr(E) = (6/13)×(2/6) + (7/13)×(3/7) = 5/13.

Assume It’s C. C = True branch: True: 6, 8, 9, 10, 13; False: 1, 3, 4. C = False branch: True: 7; False: 2, 5, 11, 12. If we test only C, we will report that CONCEPT is True if C is True and False otherwise. The estimated probability of error is: Pr(E) = (8/13)×(3/8) + (5/13)×(1/5) = 4/13.

Assume It’s D. D = True branch: True: 7, 10, 13; False: 3, 5. D = False branch: True: 6, 8, 9; False: 1, 2, 4, 11, 12. If we test only D, we will report that CONCEPT is True if D is True and False otherwise. The estimated probability of error is: Pr(E) = (5/13)×(2/5) + (8/13)×(3/8) = 5/13.

Assume It’s E. E = True branch: True: 8, 9, 10, 13; False: 1, 3, 5, 12. E = False branch: True: 6, 7; False: 2, 4, 11. If we test only E, we will report that CONCEPT is False, independent of the outcome. The estimated probability of error is unchanged: Pr(E) = (8/13)×(4/8) + (5/13)×(2/5) = 6/13. So the best predicate to test is A.
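
The computation carried out on the last five slides can be reproduced with a short sketch (reusing EXAMPLES and ATTRS from the earlier snippet): split the training set on each predicate, apply the majority rule in each branch, and count the misclassified examples.

    def split_error(examples, i):
        """Training-set errors if we test only attribute i (majority rule per branch)."""
        errors = 0
        for value in (True, False):
            branch = [ex for ex in examples if ex[i] == value]
            if branch:
                positives = sum(ex[-1] for ex in branch)
                errors += min(positives, len(branch) - positives)
        return errors

    for i, name in enumerate(ATTRS):
        print(name, f"{split_error(EXAMPLES, i)}/13")
    # Prints A 2/13, B 5/13, C 4/13, D 5/13, E 6/13, matching the slides.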

Choice of Second Predicate. Within the A = True branch, test C. C = True: True: 6, 8, 9, 10, 13. C = False: True: 7; False: 11, 12. The majority rule gives the probability of error Pr(E|A) = 1/8 and Pr(E) = 1/13.

Choice of Third Predicate. Within the A = True, C = False branch, test B. B = True: False: 11, 12. B = False: True: 7. All examples are now classified correctly.

Final Tree. [tree: test A; if False → False; if True test C; if True → True; if False test B; if True → False, if False → True] CONCEPT ⇔ A ∧ (C ∨ ¬B)

Learning a Decision Tree

DTL(D, Predicates):
  If all examples in D are positive, then return True
  If all examples in D are negative, then return False
  If Predicates is empty, then return failure
    (with noise in the training set, return the majority rule instead of failure)
  A ← the most discriminating predicate in Predicates
  Return the tree whose root is A,
    whose left branch is DTL(D+A, Predicates − A),
    and whose right branch is DTL(D−A, Predicates − A),
  where D+A is the subset of examples in D that satisfy A, and D−A those that do not.
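
A hedged Python rendering of DTL, using the majority-rule error count from the earlier sketch as the measure of “most discriminating” and returning the majority class instead of failure, as the note on noise suggests:

    def majority(examples):
        positives = sum(ex[-1] for ex in examples)
        return 2 * positives >= len(examples)

    def dtl(examples, attrs):
        if all(ex[-1] for ex in examples):
            return True
        if not any(ex[-1] for ex in examples):
            return False
        if not attrs:
            return majority(examples)   # noisy data: majority rule, not failure
        # Most discriminating = fewest majority-rule errors after the split.
        a = min(attrs, key=lambda i: split_error(examples, i))
        yes = [ex for ex in examples if ex[a]]
        no = [ex for ex in examples if not ex[a]]
        if not yes or not no:
            return majority(examples)   # attribute does not split the data
        rest = [i for i in attrs if i != a]
        return (ATTRS[a], dtl(yes, rest), dtl(no, rest))

    tree = dtl(EXAMPLES, list(range(5)))
    print(tree)  # ('A', ('C', True, ('B', False, True)), False): the final tree above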

Using Information Theory. Rather than minimizing the probability of error, most existing learning procedures try to minimize the expected number of questions needed to decide whether an object x satisfies CONCEPT. This minimization is based on a measure of the “quantity of information” contained in the truth value of an observable predicate.
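
A sketch of the ID3-style variant, assuming the usual Boolean entropy B(q) = −(q log2 q + (1 − q) log2(1 − q)); on the training set above, maximizing information gain again selects A first:

    from math import log2

    def entropy(examples):
        """Entropy (in bits) of CONCEPT's value over a set of examples."""
        if not examples:
            return 0.0
        p = sum(ex[-1] for ex in examples) / len(examples)
        if p in (0.0, 1.0):
            return 0.0
        return -(p * log2(p) + (1 - p) * log2(1 - p))

    def information_gain(examples, i):
        """Expected reduction in entropy from testing attribute i."""
        remainder = 0.0
        for value in (True, False):
            branch = [ex for ex in examples if ex[i] == value]
            remainder += len(branch) / len(examples) * entropy(branch)
        return entropy(examples) - remainder

    best = max(range(5), key=lambda i: information_gain(EXAMPLES, i))
    print("highest information gain:", ATTRS[best])  # A, as with error counts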

Miscellaneous Issues. Assessing performance: training set and test set; learning curve. [plot: a typical learning curve, % correct on the test set rising toward 100 as the size of the training set grows]

Miscellaneous Issues (continued). Overfitting: the risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set. Tree pruning: terminate the recursion when the information gain is too small; the resulting decision tree plus majority rule may then not classify correctly all examples in the training set.
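
A minimal sketch of that stopping rule, layered on the dtl and information_gain sketches above; MIN_GAIN is an arbitrary threshold chosen for illustration (on the tiny 13-example set it happens to prune nothing):

    MIN_GAIN = 0.05  # arbitrary small-gain threshold for illustration

    def dtl_pruned(examples, attrs):
        if all(ex[-1] for ex in examples):
            return True
        if not any(ex[-1] for ex in examples):
            return False
        if not attrs:
            return majority(examples)
        a = max(attrs, key=lambda i: information_gain(examples, i))
        if information_gain(examples, a) < MIN_GAIN:
            return majority(examples)   # gain too small: stop growing the tree
        yes = [ex for ex in examples if ex[a]]
        no = [ex for ex in examples if not ex[a]]
        if not yes or not no:
            return majority(examples)
        rest = [i for i in attrs if i != a]
        return (ATTRS[a], dtl_pruned(yes, rest), dtl_pruned(no, rest))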

Miscellaneous Issues (continued). Assessing performance: cross-validation. Missing data: when the value of an observable predicate P is unknown for an example x, construct a decision tree for both values of P and select the value that ends up classifying x in the largest class.

Miscellaneous Issues (continued). Multi-valued and continuous attributes: for a continuous attribute, select the threshold that maximizes the information gain. These issues occur with virtually any learning method.
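
For the continuous case, a sketch of the threshold choice (assuming candidate thresholds at midpoints between consecutive sorted values, a common convention):

    from math import log2

    def best_threshold(values, labels):
        """Return (threshold, gain) maximizing the info gain of the test value <= t."""
        def H(pos, total):  # Boolean entropy of a branch with pos positives
            if total == 0 or pos in (0, total):
                return 0.0
            p = pos / total
            return -(p * log2(p) + (1 - p) * log2(1 - p))

        pairs = sorted(zip(values, labels))
        n = len(pairs)
        total_pos = sum(label for _, label in pairs)
        best, pos_left = (None, -1.0), 0
        for k in range(1, n):                # k examples go to the left branch
            pos_left += pairs[k - 1][1]
            if pairs[k - 1][0] == pairs[k][0]:
                continue                     # no threshold between equal values
            t = (pairs[k - 1][0] + pairs[k][0]) / 2
            remainder = (k / n) * H(pos_left, k) \
                        + ((n - k) / n) * H(total_pos - pos_left, n - k)
            gain = H(total_pos, n) - remainder
            if gain > best[1]:
                best = (t, gain)
        return best

    print(best_threshold([1.2, 3.4, 0.5, 4.1, 2.8], [False, True, False, True, True]))
    # (2.0, 0.9709...): the test value <= 2.0 separates the labels perfectly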

Multi-Valued Attributes. WillWait predicate (Russell and Norvig). [tree: the learned WillWait decision tree, testing Patrons? (None / Some / Full), then Hungry?, Type? (e.g., Thai, Italian, Burger), and Fri/Sat?, with leaves True and False]

Applications of Decision Trees. Medical diagnosis / drug design. Evaluation of geological systems for assessing gas and oil basins. Early detection of problems (e.g., jamming) during oil-drilling operations. Automatic generation of rules in expert systems.

Summary Inductive learning frameworks Logic inference formulation Hypothesis space and KIS bias Inductive learning of decision trees Assessing performance Overfitting