Inductive Learning (1/2) Decision Tree Method


1 Inductive Learning (1/2) Decision Tree Method
Russell and Norvig: Chapter 18, Sections 18.1 through 18.4
CS121 – Winter 2003

2 Quotes
“Our experience of the world is specific, yet we are able to formulate general theories that account for the past and predict the future.” Genesereth and Nilsson, Logical Foundations of AI, 1987
“Entities are not to be multiplied without necessity.” Ockham, 14th century

3 Learning Agent
[Diagram: an agent interacting with its environment through sensors and actuators; percepts feed a critic and a learning element, which update the KB used by the problem solver to choose actions.]

4 Contents
Introduction to inductive learning
Logic-based inductive learning: decision tree method, version space method
Function-based inductive learning: neural nets

5 Contents
Introduction to inductive learning
Logic-based inductive learning: decision tree method, version space method (+ why inductive learning works)
Function-based inductive learning: neural nets

6 Inductive Learning Frameworks
Function-learning formulation
Logic-inference formulation

7 Function-Learning Formulation
Goal function f. Training set: (x_i, f(x_i)), i = 1, …, n. Inductive inference: find a function h that fits the points well. [Plot: sample points (x, f(x)) and a fitted curve h.] ⇒ Neural nets

8 Logic-Inference Formulation
Background knowledge KB. Training set D (observed knowledge) such that KB ⊭ D and {KB, D} is satisfiable. Inductive inference: find h (inductive hypothesis) such that {KB, h} is satisfiable and KB, h ⊨ D. Usually this is not a sound inference. h = D is a trivial but uninteresting solution (data caching).

9 Rewarded Card Example
Deck of cards, with each card designated by [r,s], its rank and suit, and some cards “rewarded”.
Background knowledge KB:
((r=1) v … v (r=10)) ⇔ NUM(r)
((r=J) v (r=Q) v (r=K)) ⇔ FACE(r)
((s=S) v (s=C)) ⇔ BLACK(s)
((s=D) v (s=H)) ⇔ RED(s)
Training set D:
REWARD([4,C]) ∧ REWARD([7,C]) ∧ REWARD([2,S]) ∧ ¬REWARD([5,H]) ∧ ¬REWARD([J,S])

10 Rewarded Card Example (continued)
With the same background knowledge KB and training set D as above, a possible inductive hypothesis is:
h ≡ (NUM(r) ∧ BLACK(s) ⇒ REWARD([r,s]))
There are several possible inductive hypotheses.
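As a quick illustration (not part of the original slides), the following Python sketch encodes the five training cards and checks that the hypothesis above agrees with every example; the encoding and helper names are invented for this example.

def num(rank):                      # background knowledge: ranks 1..10 are NUM
    return rank in range(1, 11)

def black(suit):                    # background knowledge: spades and clubs are BLACK
    return suit in ("S", "C")

# Training set D as (rank, suit, rewarded?) triples
D = [(4, "C", True), (7, "C", True), (2, "S", True),
     (5, "H", False), ("J", "S", False)]

def h(rank, suit):
    # Hypothesis h: NUM(r) and BLACK(s) => REWARD([r,s]), read here as a
    # definition of REWARD, the way a decision tree would use it
    return num(rank) and black(suit)

print(all(h(r, s) == rewarded for r, s, rewarded in D))   # True: h agrees with D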

11 Learning a Predicate
Set E of objects (e.g., cards).
Goal predicate CONCEPT(x), where x is an object in E, that takes the value True or False (e.g., REWARD).
Example: CONCEPT describes the precondition of an action, e.g., Unstack(C,A): E is the set of states, and
CONCEPT(x) ⇔ HANDEMPTY ∈ x, BLOCK(C) ∈ x, BLOCK(A) ∈ x, CLEAR(C) ∈ x, ON(C,A) ∈ x
Learning CONCEPT is a step toward learning the action.

12 Learning a Predicate
Set E of objects (e.g., cards).
Goal predicate CONCEPT(x), where x is an object in E, that takes the value True or False (e.g., REWARD).
Observable predicates A(x), B(x), … (e.g., NUM, RED).
Training set: values of CONCEPT for some combinations of values of the observable predicates.

13 A Possible Training Set
[Table: 10 examples (Ex. #1–10), each giving True/False values for the observable predicates A–E and for CONCEPT.]
Note that the training set does not say whether an observable predicate A, …, E is pertinent or not.

14 Learning a Predicate
Set E of objects (e.g., cards).
Goal predicate CONCEPT(x), where x is an object in E, that takes the value True or False (e.g., REWARD).
Observable predicates A(x), B(x), … (e.g., NUM, RED).
Training set: values of CONCEPT for some combinations of values of the observable predicates.
Find a representation of CONCEPT in the form CONCEPT(x) ⇔ S(A,B,…), where S(A,B,…) is a sentence built with the observable predicates, e.g.: CONCEPT(x) ⇔ A(x) ∧ (B(x) v C(x))

15 Learning the concept of an Arch
These objects are arches (positive examples); these aren't (negative examples). [Images of positive and negative example structures.]
ARCH(x) ⇔ HAS-PART(x,b1) ∧ HAS-PART(x,b2) ∧ HAS-PART(x,b3) ∧ IS-A(b1,BRICK) ∧ IS-A(b2,BRICK) ∧ ¬MEET(b1,b2) ∧ (IS-A(b3,BRICK) v IS-A(b3,WEDGE)) ∧ SUPPORTED(b3,b1) ∧ SUPPORTED(b3,b2)

16 Example Set
An example consists of the values of CONCEPT and the observable predicates for some object x. An example is positive if CONCEPT is True, else it is negative. The set E of all examples is the example set. The training set is a subset of E (usually a small one!).

17 Hypothesis Space
An hypothesis is any sentence h of the form CONCEPT(x) ⇔ S(A,B,…), where S(A,B,…) is a sentence built with the observable predicates. The set of all hypotheses is called the hypothesis space H. An hypothesis h agrees with an example if it gives the correct value of CONCEPT. (It is called a space because it has some internal structure.)

18 Inductive Learning Scheme
[Diagram: the example set X = {[A, B, …, CONCEPT]} contains the training set D (positive and negative examples); from D and the hypothesis space H = {CONCEPT(x) ⇔ S(A,B,…)}, learning produces an inductive hypothesis h.]

19 Size of Hypothesis Space
n observable predicates ⇒ 2^n entries in the truth table. In the absence of any restriction (bias), there are 2^(2^n) hypotheses to choose from. n = 6 ⇒ about 2×10^19 hypotheses!
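A back-of-the-envelope check of this count (an illustrative calculation, not from the slides):

n = 6
truth_table_rows = 2 ** n            # 2^n = 64 rows in the truth table
hypotheses = 2 ** truth_table_rows   # 2^(2^n) boolean functions of n predicates
print(truth_table_rows, hypotheses)  # 64 18446744073709551616 (about 2 x 10^19)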

20 Multiple Inductive Hypotheses
Need for a system of preferences, called a bias, to compare possible hypotheses.
h1 ≡ NUM(x) ∧ BLACK(x) ⇒ REWARD(x)
h2 ≡ BLACK([r,s]) ∧ ¬(r=J) ⇒ REWARD([r,s])
h3 ≡ ([r,s]=[4,C]) v ([r,s]=[7,C]) v ([r,s]=[2,S]) ⇒ REWARD([r,s])
h4 ≡ ¬([r,s]=[5,H]) ∧ ¬([r,s]=[J,S]) ⇒ REWARD([r,s])
All of these agree with all the examples in the training set.

21 Keep-It-Simple (KIS) Bias
Motivation: If an hypothesis is too complex, it may not be worth learning it (data caching might just do the job as well). There are far fewer simple hypotheses than complex ones, hence the hypothesis space is smaller.
Examples: Use far fewer observable predicates than suggested by the training set. Constrain the learnt predicate, e.g., to use only “high-level” observable predicates such as NUM, FACE, BLACK, and RED, and/or to have a simple syntax (e.g., a conjunction of literals).
If the bias allows only sentences S that are conjunctions of k << n predicates picked from the n observable predicates, then the size of H is O(n^k).
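To see how drastically such a bias shrinks H, the sketch below counts conjunctions of exactly k of the n observable predicates (with and without negated literals) and compares them to the unbiased count; the numbers are purely illustrative.

from math import comb

n, k = 6, 2
plain = comb(n, k)                    # choose which k predicates appear
with_negations = comb(n, k) * 2 ** k  # each chosen predicate may also be negated
unbiased = 2 ** (2 ** n)              # all boolean functions of n predicates
print(plain, with_negations, unbiased)  # 15 60 18446744073709551616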

22 Putting Things Together
[Diagram: the object set, goal predicate, and observable predicates define the example set X, which is split into a training set D and a test set; a bias defines the hypothesis space H; the learning procedure L induces a hypothesis h from D and H, and h is evaluated on the test set (yes/no).]

23 Predicate-Learning Methods
Decision tree Version space

24 Predicate as a Decision Tree
The predicate CONCEPT(x) ⇔ A(x) ∧ (B(x) v C(x)) can be represented by the following decision tree:
[Decision tree: the root tests A?; the A = False branch is a False leaf; the A = True branch tests B?, with the remaining case decided by C?.]
Example: A mushroom is poisonous iff it is yellow and small, or yellow, big and spotted. x is a mushroom; CONCEPT = POISONOUS; A = YELLOW; B = BIG; C = SPOTTED.

25 Predicate as a Decision Tree
Same predicate, tree, and example as above, with two additional observable predicates in the legend: D = FUNNEL-CAP, E = BULKY.
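A decision tree is just a nested sequence of tests, so the tree above can be written directly as code; this sketch uses abstract boolean arguments rather than the mushroom attributes.

def concept(a: bool, b: bool, c: bool) -> bool:
    # Decision tree for CONCEPT(x) <=> A(x) and (B(x) or C(x))
    if not a:        # A?
        return False
    if b:            # B?
        return True
    return c         # C?

# The tree computes exactly the logical sentence:
assert all(concept(a, b, c) == (a and (b or c))
           for a in (False, True) for b in (False, True) for c in (False, True))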

26 Training Set
[Table: 13 examples (Ex. #1–13), each giving True/False values for the observable predicates A–E and for CONCEPT.]

27 Possible Decision Tree
[Diagram: a large decision tree over the predicates A–E that is consistent with the training set.]

28 Possible Decision Tree
[Diagram: the large tree from the previous slide, corresponding to the complex sentence CONCEPT ⇔ (D ∧ (E v A)) v (C ∧ (B v ((E ∧ A) v A))), shown next to the three-test tree for CONCEPT ⇔ A ∧ (B v C).]
KIS bias ⇒ build the smallest decision tree. This is a computationally intractable problem ⇒ greedy algorithm.

29 Getting Started
The distribution of the training set is:
True: 6, 7, 8, 9, 10, 13
False: 1, 2, 3, 4, 5, 11, 12

30 Getting Started
The distribution of the training set is:
True: 6, 7, 8, 9, 10, 13
False: 1, 2, 3, 4, 5, 11, 12
Without testing any observable predicate, we could report that CONCEPT is False (majority rule) with an estimated probability of error P(E) = 6/13.

31 Getting Started
The distribution of the training set is:
True: 6, 7, 8, 9, 10, 13
False: 1, 2, 3, 4, 5, 11, 12
Without testing any observable predicate, we could report that CONCEPT is False (majority rule) with an estimated probability of error P(E) = 6/13. Assuming that we will only include one observable predicate in the decision tree, which predicate should we test to minimize the probability of error?

32 Assume It's A
If A is True: CONCEPT is True for 6, 7, 8, 9, 10, 13 and False for 11, 12. If A is False: CONCEPT is False for 1, 2, 3, 4, 5.
If we test only A, we will report that CONCEPT is True if A is True (majority rule) and False otherwise. The estimated probability of error is: Pr(E) = (8/13)×(2/8) + (5/13)×0 = 2/13

33 Assume It's B
If B is True: CONCEPT is True for 9, 10 and False for 2, 3, 11, 12. If B is False: CONCEPT is True for 6, 7, 8, 13 and False for 1, 4, 5.
If we test only B, we will report that CONCEPT is False if B is True and True otherwise. The estimated probability of error is: Pr(E) = (6/13)×(2/6) + (7/13)×(3/7) = 5/13

34 Assume It's C
If C is True (8 examples, of which 6, 8, 9, 10, 13 are positive), we report that CONCEPT is True; if C is False (5 examples, of which only 7 is positive), we report that CONCEPT is False. The estimated probability of error is: Pr(E) = (8/13)×(3/8) + (5/13)×(1/5) = 4/13

35 Assume It's D
If D is True: CONCEPT is True for 7, 10, 13 and False for 3, 5. If D is False: CONCEPT is True for 6, 8, 9 and False for 1, 2, 4, 11, 12.
If we test only D, we will report that CONCEPT is True if D is True and False otherwise. The estimated probability of error is: Pr(E) = (5/13)×(2/5) + (8/13)×(3/8) = 5/13

36 Assume It's E
If E is True: CONCEPT is True for 8, 9, 10, 13 and False for 1, 3, 5, 12. If E is False: CONCEPT is True for 6, 7 and False for 2, 4, 11.
If we test only E, we will report that CONCEPT is False, independent of the outcome. The estimated probability of error is unchanged: Pr(E) = (8/13)×(4/8) + (5/13)×(2/5) = 6/13
So, the best predicate to test is A.
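The error computations on slides 32 through 36 can be reproduced mechanically. The sketch below (an illustration, not part of the deck) records, for each predicate, how many positive and negative examples fall in each branch and applies the majority rule.

from fractions import Fraction

# (true-branch (pos, neg), false-branch (pos, neg)), from the counts above
splits = {
    "A": ((6, 2), (0, 5)),
    "B": ((2, 4), (4, 3)),
    "C": ((5, 3), (1, 4)),
    "D": ((3, 2), (3, 5)),
    "E": ((4, 4), (2, 3)),
}

def majority_rule_error(branches, total=13):
    # weighted fraction of misclassified examples when each branch answers
    # with its majority class
    return sum(Fraction(min(pos, neg), total) for pos, neg in branches)

for name, branches in splits.items():
    print(name, majority_rule_error(branches))
# A 2/13, B 5/13, C 4/13, D 5/13, E 6/13  ->  A is the best single test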

37 Choice of Second Predicate
The A = False branch is a False leaf. In the A = True branch, test C: if C is True, CONCEPT is True for 6, 8, 9, 10, 13; if C is False, CONCEPT is True for 7 and False for 11, 12. The majority rule gives the probability of error Pr(E|A) = 1/8 and Pr(E) = 1/13.

38 Choice of Third Predicate
In the branch A = True, C = False (examples 7, 11, 12), test B: examples 11, 12 (CONCEPT False) fall in one branch and example 7 (CONCEPT True) in the other, so every training example is now classified correctly.

39 Final Tree
CONCEPT ⇔ A ∧ (C v B). [Decision tree: the root tests A?; the A = False branch is a False leaf; the A = True branch tests C?; its True branch is a True leaf and its False branch tests B?.]

40 Learning a Decision Tree
DTL(D, Predicates)
  If all examples in D are positive then return True
  If all examples in D are negative then return False
  If Predicates is empty then return failure (noise in the training set! may return the majority rule instead of failure)
  A ← most discriminating predicate in Predicates
  Return the tree whose root is A, whose left branch is DTL(D+A, Predicates − A), and whose right branch is DTL(D−A, Predicates − A), where D+A is the subset of examples in D that satisfy A and D−A the subset that do not.
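A minimal Python rendering of DTL is sketched below, under two assumptions of its own: the "most discriminating" predicate is taken to be the one minimizing the majority-rule error, and an empty predicate list returns the majority class rather than failure. The data format is invented for the example.

def dtl(examples, predicates):
    # examples: list of (assignment dict predicate -> bool, CONCEPT bool)
    # returns True, False, or a tuple (predicate, subtree_if_true, subtree_if_false)
    labels = [c for _, c in examples]
    if all(labels):
        return True
    if not any(labels):
        return False
    if not predicates:                        # noise in the training set:
        return labels.count(True) >= labels.count(False)   # majority rule

    def error(p):                             # majority-rule error after testing p
        err = 0
        for branch in (True, False):
            part = [c for e, c in examples if e[p] == branch]
            err += min(part.count(True), part.count(False))
        return err

    a = min(predicates, key=error)            # most discriminating predicate
    rest = [p for p in predicates if p != a]
    d_plus = [(e, c) for e, c in examples if e[a]]        # examples that satisfy A
    d_minus = [(e, c) for e, c in examples if not e[a]]
    return (a, dtl(d_plus, rest), dtl(d_minus, rest))

# Tiny made-up training set for CONCEPT = A and B:
data = [({"A": a, "B": b}, a and b) for a in (False, True) for b in (False, True)]
print(dtl(data, ["A", "B"]))   # ('A', ('B', True, False), False)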

41 Using Information Theory
Rather than minimizing the probability of error, most existing learning procedures try to minimize the expected number of questions needed to decide if an object x satisfies CONCEPT. This minimization is based on a measure of the "quantity of information" contained in the truth value of an observable predicate.
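For reference, the quantity in question is the usual Shannon entropy of the class distribution, and the value of a test is its information gain; the short sketch below computes both (standard formulas, illustrative code).

from math import log2

def entropy(pos, neg):
    # Shannon entropy (in bits) of a boolean class distribution
    total = pos + neg
    return -sum((k / total) * log2(k / total) for k in (pos, neg) if k)

def information_gain(parent, branches):
    # parent and each branch are (pos, neg) counts before/after testing a predicate
    total = sum(p + n for p, n in branches)
    remainder = sum((p + n) / total * entropy(p, n) for p, n in branches)
    return entropy(*parent) - remainder

# Testing A on the 13-example training set (6 positive, 7 negative overall):
print(information_gain((6, 7), [(6, 2), (0, 5)]))   # about 0.50 bits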

42 Miscellaneous Issues
Assessing performance: training set and test set.
Learning curve: [Plot: a typical learning curve, % correct on the test set (up to 100) versus the size of the training set.]

43 Miscellaneous Issues
Assessing performance: training set and test set; learning curve.
Overfitting: the risk of using irrelevant observable predicates to generate an hypothesis that agrees with all examples in the training set.
Tree pruning: terminate the recursion when the information gain is too small; the resulting decision tree + majority rule may then not classify correctly all examples in the training set.

44 Miscellaneous Issues
Assessing performance: training set and test set; learning curve. Overfitting. Tree pruning. Cross-validation.
Missing data: if the value of an observable predicate P is unknown for an example x, construct a decision tree for both values of P and select the value that ends up classifying x in the largest class (see the sketch below).
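One way to realize this at classification time is sketched below: when the example lacks a value for the predicate tested at the current node, follow both branches and keep the answer coming from the leaf that holds more training examples. The tree encoding and counts are made up for illustration.

# Leaves are (label, number of training examples at the leaf);
# internal nodes are (predicate name, subtree_if_true, subtree_if_false).
tree = ("A", ("C", (True, 6), (False, 2)), (False, 5))

def classify(node, x):
    if not isinstance(node[0], str):          # leaf
        return node
    pred, if_true, if_false = node
    if pred not in x:                         # missing value: try both branches,
        return max(classify(if_true, x),      # keep the larger class
                   classify(if_false, x), key=lambda leaf: leaf[1])
    return classify(if_true if x[pred] else if_false, x)

print(classify(tree, {"C": True})[0])   # A is missing: leaf (True, 6) beats (False, 5)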

45 Miscellaneous Issues
Assessing performance: training set and test set; learning curve. Overfitting. Tree pruning. Cross-validation. Missing data.
Multi-valued and continuous attributes: select the threshold that maximizes the information gain.
These issues occur with virtually any learning method.
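For a continuous attribute, the selection can be done by sorting the observed values and scoring every candidate threshold halfway between consecutive distinct values; the sketch below does this with made-up numbers (the entropy helper is the standard Shannon entropy).

from math import log2

def entropy(labels):
    return -sum((labels.count(c) / len(labels)) * log2(labels.count(c) / len(labels))
                for c in set(labels))

def best_threshold(values, labels):
    # choose the split "value <= t" maximizing information gain over boolean labels
    pairs = sorted(zip(values, labels))
    base, best = entropy(labels), (float("-inf"), None)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2                     # candidate threshold between samples
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        best = max(best, (gain, t))
    return best

print(best_threshold([1.0, 2.0, 3.5, 4.0, 6.0], [False, False, True, True, True]))
# (0.97..., 2.75): the threshold 2.75 separates the two classes perfectly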

46 Multi-Valued Attributes
WillWait predicate (Russell and Norvig). [Decision tree with multi-valued attributes: Patrons? (None / Some / Full), Hungry?, Type? (Thai / Italian / Burger), Fri/Sat?, with True/False leaves.]

47 Applications of Decision Trees
Medical diagnosis / drug design
Evaluation of geological systems for assessing gas and oil basins
Early detection of problems (e.g., jamming) during oil drilling operations
Automatic generation of rules in expert systems

48 Summary
Inductive learning frameworks
Logic-inference formulation
Hypothesis space and KIS bias
Inductive learning of decision trees
Assessing performance
Overfitting

