
1 Classifier Learning: Rule Learning. Intelligent Systems Lab., Soongsil University. Thanks to Raymond J. Mooney of the University of Texas at Austin.

2 Introduction. A set of if-then rules – the hypothesis is easy to interpret. Goal – look at a new method for learning rules. Rules – propositional rules (rules without variables) and first-order predicate rules (rules with variables).

3 Introduction (2). GOAL: learning a target function as a set of IF-THEN rules. BEFORE: learning with decision trees – learn the decision tree, then translate the tree into a set of IF-THEN rules (one rule per leaf). OTHER POSSIBILITY: learning with genetic algorithms – each rule set is coded as a bit vector, and several genetic operators are applied on the hypothesis space. TODAY AND HERE: first, learning rules in propositional form; second, learning rules in first-order form (Horn clauses, which include variables); sequential search for rules, one after the other.

4 Rule induction. To learn a set of IF-THEN rules for classification – suitable when the target function can be represented by a set of IF-THEN rules. Target function h ≡ { Rule 1, Rule 2, ..., Rule m }, where Rule j ≡ IF (precond j1 Λ precond j2 Λ ... Λ precond jn) THEN postcond j. IF-THEN rules are an expressive representation and among the most readable and understandable for humans.

5 Rule induction – Example (1). Learning a set of propositional rules (without variables): e.g., the target function (concept) Buy_Computer is represented by IF (Age=Old Λ Student=No) THEN Buy_Computer=No; IF (Student=Yes) THEN Buy_Computer=Yes; IF (Age=Medium Λ Income=High) THEN Buy_Computer=Yes. Learning a set of first-order rules (with variables): e.g., the target function (concept) Ancestor is represented by IF parent(x,y) THEN ancestor(x,y); IF parent(x,y) Λ ancestor(y,z) THEN ancestor(x,z). (parent(x,y) is a predicate saying that y is the father/mother of x.)

6 Rule induction – Example (2). Rule: IF (Age=Old Λ Student=No) THEN Buy_Computer=No. Which instances are correctly classified by the above rule?

7 Learning Rules. If-then rules in logic are a standard representation of knowledge that has proven useful in expert systems and other AI systems – in propositional logic, a set of rules for a concept is equivalent to DNF (Disjunctive Normal Form). Methods for automatically inducing rules from data have been shown to build more accurate expert systems than human knowledge engineering for some applications (automatic generation of accurate rules for expert systems!). Rule-learning methods have been extended to first-order logic to handle relational (structural) representations – Inductive Logic Programming (ILP) for learning Prolog programs from I/O pairs – allowing us to move beyond simple feature-vector representations of data.

8 For example, formulas such as (A Λ ¬B) ∨ (C Λ D) and (A Λ B) ∨ C are in DNF. Converting a formula to DNF involves using logical equivalences such as double-negative elimination, De Morgan's laws, and the distributive law.

9 Rule Learning Approaches. Translate decision trees into rules (C4.5). Sequential (set) covering algorithms – general-to-specific (top-down): RIPPER, CN2, FOIL; specific-to-general (bottom-up): GOLEM, CIGOL; hybrid search: AQ, CHILLIN, Progol. Translate neural nets into rules (TREPAN).

10 Decision Trees to Rules. For each path in a decision tree from the root to a leaf, create a rule with the conjunction of tests along the path as the antecedent and the leaf label as the consequent. Example tree (root test on color; the red branch tests shape): red Λ circle → A; blue → B; red Λ square → B; green → C; red Λ triangle → C.
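To make the conversion concrete, here is a minimal Python sketch (not taken from the slides): the nested-tuple tree encoding and the function name tree_to_rules are illustrative assumptions, but the example tree is the color/shape tree above.

```python
# Minimal sketch: turn each root-to-leaf path of a decision tree into a rule.
# The nested-tuple tree encoding below is an assumption made for illustration.

def tree_to_rules(tree, path=()):
    """Return one (antecedent, class) rule per root-to-leaf path."""
    if not isinstance(tree, tuple):                 # leaf node: a class label
        return [(path, tree)]
    feature, branches = tree                        # internal node: (feature, {value: subtree})
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, path + ((feature, value),)))
    return rules

# The tree from the slide: color -> {red -> shape -> {...}, blue -> B, green -> C}
tree = ("color", {
    "red": ("shape", {"circle": "A", "square": "B", "triangle": "C"}),
    "blue": "B",
    "green": "C",
})

for antecedent, label in tree_to_rules(tree):
    conds = " AND ".join(f"{f}={v}" for f, v in antecedent)
    print(f"IF {conds} THEN class={label}")   # e.g. IF color=red AND shape=circle THEN class=A
```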

11 Concerns when post-processing decision-tree rules. Resulting rules may contain unnecessary antecedents that are not needed to remove negative examples and that cause over-fitting. Example (Prism algorithm): for the rule "If humidity=normal and windy=false then play=yes", adding windy=false in order to remove the second negative example below sacrifices positive (yes) instances of play=yes from the 4/5 covered; if 200 data points were used, roughly 40 positive play=yes examples would be sacrificed — which is excessive.

12 Concerns when post-processing decision-tree rules. Rules are post-pruned by greedily removing antecedents or rules until performance on the training data or a validation set is significantly harmed – that is, to simplify the model, post-pruning proceeds incrementally as long as measures such as accuracy do not degrade. Resulting rules may lead to competing or conflicting conclusions on some instances. Sort the rules by training (validation) accuracy to create an ordered decision list; the first rule in the list that applies is used to classify a test instance. Example: red Λ circle → A (97% train accuracy); red Λ big → B (95% train accuracy); ... Test case: assigned to class A.

13 Propositional rule induction – Training (Sequential Covering Algorithms). To learn a set of rules with a sequential (incremental) covering strategy: Step 1, learn one rule; Step 2, remove from the training set the instances correctly classified by that rule; then repeat Steps 1 and 2 to learn another rule (using the remaining training set). The learning procedure learns (i.e., covers) the rules sequentially (incrementally) and can be repeated as many times as desired to learn a set of rules that covers the desired portion of (or the full) training set. The set of learned rules is sorted by some performance measure (e.g., classification accuracy), and the rules are consulted in this order when classifying a future instance. A minimal sketch of this loop follows.
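The two-step loop can be outlined in a few lines of Python. This is an illustrative sketch rather than the slide's exact pseudocode; learn_one_rule, rule_covers, rule_accuracy, and the min_accuracy threshold are assumed placeholders to be supplied by whatever LEARN-ONE-RULE variant is used.

```python
# Sketch of the sequential covering strategy: learn a rule, remove what it
# covers, repeat, then sort the learned rules by a performance measure.

def sequential_covering(examples, target_class, learn_one_rule,
                        rule_covers, rule_accuracy, min_accuracy=0.8):
    rules = []
    remaining = list(examples)
    while remaining:
        rule = learn_one_rule(remaining, target_class)          # Step 1: learn one rule
        if rule is None or rule_accuracy(rule, remaining) < min_accuracy:
            break                                               # no acceptable rule left
        rules.append(rule)
        # Step 2: remove the instances covered by the new rule, then repeat
        remaining = [x for x in remaining if not rule_covers(rule, x)]
    # Sort the learned rules by a performance measure (here: accuracy on the full set)
    rules.sort(key=lambda r: rule_accuracy(r, examples), reverse=True)
    return rules
```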

14 Propositional rule induction – Classification. Given a test instance, the learned rules are tried (tested) sequentially, rule by rule, in the order produced by the training phase. The first encountered rule that covers the instance (i.e., whose preconditions in the IF clause match the instance) classifies it. If no rule covers the instance, it is classified by the default rule: the most frequent value of the target attribute among the training instances. A small sketch is shown below.
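A minimal sketch of this classification step, assuming rules are (antecedent, class) pairs whose antecedents are {attribute: value} dicts — an encoding chosen here for illustration, not prescribed by the slides.

```python
# Try the ordered rules one by one; fall back to the default rule (the most
# frequent target value in the training data) when no rule covers the instance.

from collections import Counter

def classify(instance, ordered_rules, training_labels):
    for antecedent, label in ordered_rules:
        if all(instance.get(a) == v for a, v in antecedent.items()):
            return label                                  # first covering rule classifies it
    return Counter(training_labels).most_common(1)[0][0]  # default rule

# Using the Buy_Computer rules from the earlier example slide:
rules = [({"Age": "Old", "Student": "No"}, "No"),
         ({"Student": "Yes"}, "Yes"),
         ({"Age": "Medium", "Income": "High"}, "Yes")]
print(classify({"Age": "Old", "Student": "No"}, rules, ["Yes", "Yes", "No"]))  # -> No
```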

15 Rule Coverage and Accuracy. Coverage of a rule: the fraction of records that satisfy the antecedent of the rule. Accuracy of a rule: the fraction of the records satisfying the antecedent that also satisfy the consequent. Example rule: IF (Status=Single) THEN No — Coverage = 40% (4/10), Accuracy = 50% (2/4).
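A small sketch that reproduces the 40% / 50% figures above; the ten toy records are invented solely so that four of them are Single and two of those are labelled No.

```python
# Coverage = fraction of records satisfying the antecedent;
# accuracy  = fraction of those covered records that also satisfy the consequent.

records = ([{"Status": "Single", "Class": "No"}] * 2 +
           [{"Status": "Single", "Class": "Yes"}] * 2 +
           [{"Status": "Married", "Class": "Yes"}] * 6)

def coverage_and_accuracy(records, antecedent, consequent):
    covered = [r for r in records
               if all(r.get(a) == v for a, v in antecedent.items())]
    coverage = len(covered) / len(records)
    correct = [r for r in covered if r["Class"] == consequent]
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

print(coverage_and_accuracy(records, {"Status": "Single"}, "No"))  # (0.4, 0.5)
```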

16 Sequential Covering Algorithms. Require each rule to have high accuracy, but not necessarily high coverage – high accuracy means its predictions are correct; accepting low coverage means the rule need not make a prediction for every training example.

17 Sequential Covering Algorithms (cont.)

18 Sequential Covering. A set of rules is learned one at a time: each time, find a single rule that covers a large number of positive instances without covering any negatives, remove the positives it covers, and learn additional rules to cover the rest – one rule is learned at a time, and each rule covers many positive instances but no negatives. This is an instance of the greedy algorithm for minimum set covering and does not guarantee a minimum number of learned rules (Occam's Razor is not guaranteed). Minimum set covering is an NP-hard problem, and the greedy algorithm is a standard approximation algorithm. Methods for learning individual rules vary.

19 Reference: Set Covering Problem. Let S = {1, 2, …, n}, and suppose S_j ⊆ S for each j. We say that an index set J is a cover of S if ∪_{j∈J} S_j = S. Set covering problem: find a minimum-cardinality set cover of S. Applications: locating 119 fire stations, locating hospitals, locating Starbucks, and many non-obvious applications.

20–26 Greedy Sequential Covering Example. (A sequence of X–Y scatter plots of positive (+) examples: at each step a rule covering a group of positives is learned, the covered positives are removed, and the process repeats until no positives remain.)

27–35 Non-optimal Covering Example. (The same X–Y data covered greedily from a different first rule: the greedy sequence ends up needing more rules than the minimum set cover, illustrating that greedy sequential covering is not guaranteed to produce a minimal rule set.)

36 Strategies for Learning a Single Rule. Top-down (general to specific): start with the most general (empty) rule; repeatedly add antecedent constraints on features that eliminate negative examples while maintaining as many positives as possible; stop when only positives are covered. Bottom-up (specific to general): start with a most specific rule (e.g., the complete instance description of a random instance); repeatedly remove antecedent constraints in order to cover more positives; stop when further generalization would cover negatives.

37 Sequential Covering Algorithms (cont.) – General-to-Specific Beam Search. How do we learn each individual rule? Requirements for LEARN-ONE-RULE: high accuracy, but not necessarily high coverage. One approach is to implement LEARN-ONE-RULE in a way similar to decision-tree learning (ID3), but following only the most promising branch of the tree at each step. The search begins with the most general rule precondition possible (the empty test that matches every instance) and then greedily adds the attribute test that most improves rule performance over the training examples.

38 Sequential Covering Algorithms (cont.) – Specializing search. Organizes the hypothesis-space search in generally the same fashion as ID3, but follows only the most promising branch of the tree at each step: 1. begin with the most general rule (no/empty precondition); 2. follow the most promising branch, greedily adding the attribute test that most improves the measured performance of the rule over the training examples; 3. this is a greedy depth-first search with no backtracking, so there is a danger of sub-optimal choices. To reduce the risk, use beam search (the CN2 algorithm): the algorithm maintains a list of the k best candidates; in each search step, descendants are generated for each of these k best candidates, and the resulting set is reduced to its k most promising members. A sketch of this beam search appears below.
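A rough Python sketch of this general-to-specific beam search, under simplifying assumptions (relative frequency of the target class as the performance measure, attribute=value constraints, a fixed maximum rule length); it is not the CN2 paper's exact procedure.

```python
# Learn one rule by beam search: start from the empty antecedent, specialise
# every beam member with one extra constraint, and keep the k best candidates.

def covers(antecedent, x):
    return all(x.get(a) == v for a, v in antecedent)

def performance(antecedent, examples, target_attr, target_val):
    covered = [x for x in examples if covers(antecedent, x)]
    if not covered:
        return 0.0
    return sum(x[target_attr] == target_val for x in covered) / len(covered)

def learn_one_rule(examples, target_attr, target_val, k=3, max_len=3):
    constraints = {(a, x[a]) for x in examples for a in x if a != target_attr}
    best, beam = (), [()]                      # the empty tuple is the most general rule
    for _ in range(max_len):
        candidates = {tuple(sorted(h + (c,))) for h in beam
                      for c in constraints if c not in h}
        if not candidates:
            break
        scored = sorted(candidates, reverse=True,
                        key=lambda h: performance(h, examples, target_attr, target_val))
        if performance(scored[0], examples, target_attr, target_val) > \
           performance(best, examples, target_attr, target_val):
            best = scored[0]                   # update best_hypothesis
        beam = scored[:k]                      # keep only the k best candidates
    return best, target_val
```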

39 Sequential Covering Algorithms (cont.) – General-to-Specific Beam Search. Starting rule: IF { } THEN Play-Tennis = Yes. First-level specializations: IF {Humidity = Normal} THEN Play-Tennis = Yes; IF {Wind = Strong} THEN Play-Tennis = No; IF {Wind = Weak} THEN Play-Tennis = Yes; IF {Humidity = High} THEN Play-Tennis = No; … Second-level specializations: IF {Humidity = Normal, Outlook = Sunny} THEN Play-Tennis = Yes; IF {Humidity = Normal, Wind = Strong} THEN Play-Tennis = Yes; IF {Humidity = Normal, Wind = Weak} THEN Play-Tennis = Yes; IF {Humidity = Normal, Outlook = Rain} THEN Play-Tennis = Yes.

40 Sequential Covering Algorithms (cont.) – Variations (modifications are possible in the following, exceptional, cases). Learn only rules that cover positive examples – useful when the fraction of positive examples is small; generally, when accuracy on the positives is low, conditions are added to raise accuracy, but a rule is not selected if, despite high accuracy, it covers extremely few positive examples. In this case we can modify the algorithm to learn only from those rare examples and classify anything not covered by any rule as negative – e.g., generating rules for abnormal data or outliers, and letting the negative class be handled by default. Instead of entropy, use a measure that evaluates the fraction of positive examples covered by the hypothesis – use accuracy!

41–45 Top-Down Rule Learning Example. (X–Y scatter plots of positive (+) examples; the problem is to induce a rule that classifies the + class when the fraction of positives is small — when accuracy on + is low, more specific antecedent constraints are added.) Antecedent constraints are added one at a time: C1 < Y (accuracy 7/15), then C2 < X (7/11), then Y < C3 (7/11), and finally X < C4, each constraint eliminating negatives while keeping as many positives as possible.

46–56 Bottom-Up Rule Learning Example. (The same X–Y scatter plots of positive (+) examples; the problem is again to induce a rule that classifies the + class. Starting from a single positive instance with its given X and Y values, the rule is repeatedly generalized to take in nearby positives; when a rule is finished, its covered positives are removed and the process restarts from another seed instance until the remaining positives are covered.)

57 Sequential Covering Algorithms (cont.) – General-to-Specific Beam Search. Greedy search without backtracking carries the danger of a sub-optimal choice at any step. The algorithm can be extended using beam search: keep a list of the k best candidates at each step – at each search step, descendants are generated for the k best candidates, and the resulting rule set is again reduced to the k best candidates. Example specializations: IF { } THEN Play-Tennis = Yes; IF {Humidity = Normal} THEN Play-Tennis = Yes; IF {Wind = Strong} THEN Play-Tennis = No; IF {Wind = Weak} THEN Play-Tennis = Yes; IF {Humidity = High} THEN Play-Tennis = No; … IF {Humidity = Normal, Outlook = Sunny} THEN Play-Tennis = Yes; IF {Humidity = Normal, Wind = Strong} THEN Play-Tennis = Yes; IF {Humidity = Normal, Wind = Weak} THEN Play-Tennis = Yes; IF {Humidity = Normal, Outlook = Rain} THEN Play-Tennis = Yes.

58 Learning Rule Sets: Summary. Key design issues for learning sets of rules: Sequential or simultaneous? Sequential: learn one rule at a time, remove the covered examples, and repeat the process on the remaining examples. Simultaneous: learn the entire set of disjuncts simultaneously as part of a single search for an acceptable decision tree, as in ID3. General-to-specific or specific-to-general? G→S: Learn-One-Rule; S→G: Find-S. Generate-and-test (G&T) — search through syntactically legal hypotheses — or example-driven (ED), as in Find-S and Candidate-Elimination? Post-pruning of rules? A method similar to the one discussed for decision-tree learning.

59 Reference: Generate-And-Test. 1. Generate a possible solution. 2. Test to see whether it is the expected solution. 3. If the solution has been found, quit; else go to step 1. Generate-and-test, like depth-first search, requires that complete solutions be generated for testing. In its most systematic form, it is simply an exhaustive search of the problem space. Solutions can also be generated randomly, but then a solution is not guaranteed. This approach is known as the British Museum algorithm: finding an object in the British Museum by wandering randomly.

60 Measuring the performance of a rule (1). Relative frequency (= accuracy): Accuracy(R) = n_c / n, where D_train_R is the set of training instances that match the preconditions of rule R, n is the number of examples matched by the rule (i.e., the size of D_train_R), and n_c is the number of examples the rule classifies correctly.

61 Measuring the performance of a rule (2). When data are insufficient and a rule must be evaluated on only a few examples, use the m-estimate of accuracy: (n_c + m·p) / (n + m). Here p is the prior probability that an instance drawn at random from the entire dataset has the classification assigned by rule R — p is the prior assumed accuracy (initial expectation); e.g., if 12 of 100 examples have the predicted class, p = 0.12. m is a weight indicating how much the prior probability p influences the rule performance measure: if m = 0, the m-estimate reduces to the relative-frequency measure; as the weight m increases, more training data are needed — that is, n and n_c must grow to outweigh the prior p.

62 Measuring the performance of a rule (3). Entropy measure: Entropy(D_train_R) = − Σ_{i=1..c} p_i log2 p_i, where c is the number of possible values (classes) of the target attribute and p_i is the proportion of instances in D_train_R for which the target attribute takes the i-th value (class). A small sketch computing the three measures follows.
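The three measures can be computed as below; the m-estimate is written in its usual form (n_c + m·p)/(n + m), and the numeric arguments are only illustrative.

```python
# Relative frequency, m-estimate, and entropy of the instances matched by a rule.

from math import log2

def relative_frequency(n_c, n):
    return n_c / n

def m_estimate(n_c, n, p, m):
    # p: prior probability of the class the rule predicts, m: weight given to that prior
    return (n_c + m * p) / (n + m)

def entropy(class_counts):
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

print(relative_frequency(6, 8))        # 0.75
print(m_estimate(6, 8, p=0.12, m=5))   # ~0.51: the prior pulls the estimate toward p
print(entropy([6, 2]))                 # ~0.811 bits over the covered instances
```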

63 Sequential covering algorithms – Issues. They reduce the (more difficult) problem of learning a disjunctive set of rules to a sequence of (simpler) problems, each of which learns a single conjunctive rule. After a rule is learned (found), the training instances it covers (classifies) are removed from the training set (covered data are removed!). Each rule therefore may not be treated as independent of the other rules (the rules may not be mutually independent!). They perform a greedy search for a sequence of rules without backtracking, so they are guaranteed to find neither the smallest nor the best set of rules.

64 Learning First-Order Rules. Example: Ancestor(x, y) ← Parent(x, y). From now on we consider learning rules that contain variables (first-order rules). Inductive learning of first-order rules is called inductive logic programming (ILP) and can be viewed as automatically inferring Prolog programs. Two methods are considered: FOIL, and induction as inverted deduction.

65 Learning First-Order Rules (cont.) First-order rules are rules that contain variables. Example: Ancestor(x, y) ← Parent(x, y); Ancestor(x, y) ← Parent(x, z) Λ Ancestor(z, y) (recursive). They are more expressive than propositional rules: compare IF (Father1 = Bob) Λ (Name2 = Bob) Λ (Female1 = True) THEN Daughter1,2 = True with IF Father(y, x) Λ Female(y) THEN Daughter(x, y).

66 Learning First-Order Rules (cont.) Formal definitions in first-order logic: Constants, e.g., John, Kansas, 42. Variables, e.g., Name, State, x. Predicates, e.g., male, as in male(junwoo). Functions, e.g., age, cosine, as in age(gunho), cosine(x). A term is a constant, a variable, or a function applied to terms. A literal is a predicate (or its negation) applied to a set of terms, e.g., greater_than(age(John), 20), ¬male(x). A first-order rule is a Horn clause, where H and the L_i (i = 1..n) are literals. A clause is a disjunction of literals with implicit universal quantification; a Horn clause has at most one positive literal: H ∨ ¬L1 ∨ ¬L2 ∨ … ∨ ¬Ln.

67 Learning First-Order Rules (cont.) First-order Horn clauses are rules with one or more preconditions and a single consequent, whose predicates may have variables. The following forms of a Horn clause are equivalent: H ∨ ¬L1 ∨ … ∨ ¬Ln; H ← (L1 Λ … Λ Ln); IF (L1 Λ … Λ Ln) THEN H.

68 Learning Sets of First-Order Rules: FOIL. First-Order Inductive Learning (FOIL) is a natural extension of sequential covering plus Learn-One-Rule. A FOIL rule is similar to a Horn clause with two exceptions: a syntactic restriction (no functions), and greater expressiveness than Horn clauses because negation is allowed in rule bodies.

69 Learning a Single Rule in FOIL. A top-down approach originally applied to first-order logic (Quinlan, 1990). Basic algorithm for instances with discrete-valued features: Let A = { } (the set of rule antecedents); let N be the set of negative examples; let P be the current set of uncovered positive examples. Until N is empty do: for every feature-value pair (literal) (F_i = V_ij), calculate Gain(F_i = V_ij, P, N); pick the literal L with the highest gain; add L to A; remove from N any examples that do not satisfy L; remove from P any examples that do not satisfy L. Return the rule A_1 Λ A_2 Λ … Λ A_n → Positive. A runnable sketch of this loop is given below.
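A runnable sketch of this single-rule learner for attribute=value data. The gain function follows the Foil_Gain formula given a few slides below, with t = p1 (the positives covered after adding a literal are a subset of those covered before); the guards against empty candidate sets are safety additions, not part of the slide.

```python
# Greedily add the literal (feature = value) with the highest gain until the
# rule covers no negative examples; examples are {attribute: value} dicts.

from math import log2

def foil_gain(literal, pos, neg):
    a, v = literal
    p0, n0 = len(pos), len(neg)
    p1 = sum(1 for x in pos if x.get(a) == v)   # positives still covered after adding the literal
    n1 = sum(1 for x in neg if x.get(a) == v)   # negatives still covered after adding the literal
    if p1 == 0:
        return float("-inf")                    # the literal would lose every positive
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

def learn_one_rule_foil(pos, neg):
    antecedents = []                            # A = {}
    P, N = list(pos), list(neg)                 # uncovered positives, remaining negatives
    while N and P:
        literals = {(a, x[a]) for x in P for a in x} - set(antecedents)
        if not literals:
            break
        best = max(literals, key=lambda L: foil_gain(L, P, N))
        if foil_gain(best, P, N) == float("-inf"):
            break
        antecedents.append(best)                # add L to A
        a, v = best
        P = [x for x in P if x.get(a) == v]     # drop examples that do not satisfy L
        N = [x for x in N if x.get(a) == v]
    return antecedents, "Positive"              # rule: A1 ^ A2 ^ ... ^ An -> Positive
```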

70 Learning first-order rules – the FOIL algorithm.

71 Sequential (set) covering algorithms. CN2 algorithm: start from an empty conjunct, IF { } THEN yes class; add the conjunct that minimizes the entropy measure ({A}, {A,B}, …); determine the rule consequent by taking the majority class of the instances covered by the rule. RIPPER algorithm: start from an empty rule R0: {} => class (initial rule); add the conjunct that maximizes FOIL's information-gain measure, giving R1: {A} => class (rule after adding a conjunct), where t is the number of positive instances covered by both R0 and R1, p0 and n0 are the numbers of positive and negative instances covered by R0, and p1 and n1 are the numbers of positive and negative instances covered by R1.

72 FOIL Gain Metric. We want to achieve two goals: decrease coverage of negative examples (measure the increase in the percentage of positives covered when a literal is added to the rule) and maintain coverage of as many positives as possible (count the number of positives covered).

73 Foil_Gain measure. Foil_Gain(R0, R1) = t × [ log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) ], where R0: IF {} THEN class (initial rule); R1: IF {A} THEN class (rule after adding a conjunct); t is the number of positive instances covered by both R0 and R1; p0 and n0 are the numbers of positive and negative instances covered by R0; p1 and n1 are the numbers of positive and negative instances covered by R1.

74 Example: Foil_Gain measure. Assume the initial rule is R0: IF { } THEN +. This rule covers p0 = 100 positive examples and n0 = 400 negative examples. Candidate refinements: R1: IF {A} THEN class covers 4 positive examples and 1 negative example; R2: IF {B} THEN class covers 30 positive examples and 10 negative examples; R3: IF {C} THEN class covers 100 positive examples and 90 negative examples. Which rule should be chosen?

75 Example: Foil_Gain measure. R0: { } => + (initial rule), covering 100 positive and 400 negative examples. R1: {A} => class (rule after adding a conjunct), covering 4 positive examples and 1 negative example. Then t = 4 (positive instances covered by both R0 and R1), p0 = 100, n0 = 400, p1 = 4, n1 = 1, so Foil_Gain(R0, R1) = 4 × [log2(4/5) − log2(100/500)] = 4 × 2.0 = 8.0.

76 Example: Foil_Gain measure. R0: { } => + (initial rule), covering 100 positive and 400 negative examples. R2: {B} => + (rule after adding a conjunct), covering 30 positive examples and 10 negative examples. Then t = 30, p0 = 100, n0 = 400, p1 = 30, n1 = 10, so Foil_Gain(R0, R2) = 30 × [log2(30/40) − log2(100/500)] ≈ 57.2.

77 Example: Foil_Gain measure. R0: { } => + (initial rule), covering 100 positive and 400 negative examples. R3: {C} => + (rule after adding a conjunct), covering 100 positive examples and 90 negative examples. Then t = 100, p0 = 100, n0 = 400, p1 = 100, n1 = 90, so Foil_Gain(R0, R3) = 100 × [log2(100/190) − log2(100/500)] ≈ 139.6 — the largest gain of the three candidates. The sketch below reproduces these numbers.
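The short sketch below reproduces the three gain values from the formula, using p0 = 100, n0 = 400 and t = p1.

```python
# Foil_Gain = t * [log2(p1/(p1+n1)) - log2(p0/(p0+n0))] for the three candidates.

from math import log2

def foil_gain(t, p0, n0, p1, n1):
    return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

for name, (p1, n1) in {"R1 {A}": (4, 1), "R2 {B}": (30, 10), "R3 {C}": (100, 90)}.items():
    print(name, round(foil_gain(t=p1, p0=100, n0=400, p1=p1, n1=n1), 1))
# R1 {A} 8.0, R2 {B} 57.2, R3 {C} 139.6 -> R3 has the largest gain even though
# R1 has the highest accuracy (4/5), because R3 keeps far more positives covered.
```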

78 Example for Variable Bindings. Training data assertions: GrandDaughter(Victor, Sharon); Female(Sharon); Father(Sharon, Bob); Father(Tom, Bob); Father(Bob, Victor). Use the closed-world assumption: any literal involving the specified predicates and constants that is not listed is assumed to be false, e.g., ¬GrandDaughter(Tom, Bob), ¬GrandDaughter(Tom, Tom), ¬GrandDaughter(Bob, Victor), ¬Female(Tom), etc.

79 Possible Variable Bindings. Initial rule: GrandDaughter(x, y). Possible bindings from the training assertions (how many ways can the 4 constants be bound to the initial rule?): positive binding {x/Victor, y/Sharon}; negative bindings {x/Victor, y/Victor}, {x/Tom, y/Sharon}, etc. Positive bindings provide positive evidence for, and negative bindings provide negative evidence against, the rule under consideration (supporting evidence exists for both positive and negative bindings!).

80 Rule Pruning in FOIL. Pre-pruning is based on the minimum description length (MDL) principle; post-pruning eliminates unnecessary complexity caused by the limitations of the greedy algorithm. For each rule R, for each antecedent A of the rule: if deleting A from R does not cause negatives to become covered, then delete A — i.e., if removing condition A does not admit negative results, it is pruned from R because it does not affect the outcome. For each rule R (removing redundantly applicable rules!): if deleting R does not uncover any positives (because they are redundantly covered by other rules), then delete R. A sketch of this post-pruning loop follows.
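A sketch of this post-pruning loop; covers(antecedents, instance) is an assumed helper that tests whether the instance satisfies every antecedent in the list, and the rule/data encodings are illustrative.

```python
# Post-pruning: drop an antecedent if removing it covers no new negatives,
# then drop a whole rule if all the positives it covers are covered elsewhere.

def prune_antecedents(rules, negatives, covers):
    for rule in rules:                                    # rule: mutable list of antecedents
        for literal in list(rule):
            shorter = [l for l in rule if l != literal]
            newly_covered = [x for x in negatives
                             if covers(shorter, x) and not covers(rule, x)]
            if not newly_covered:                         # deleting A covers no new negatives
                rule[:] = shorter
    return rules

def prune_rules(rules, positives, covers):
    kept = list(rules)
    for rule in list(kept):
        others = [r for r in kept if r is not rule]
        redundant = all(any(covers(r, x) for r in others)       # every positive this rule covers
                        for x in positives if covers(rule, x))  # is covered by another rule
        if redundant and len(kept) > 1:
            kept.remove(rule)
    return kept
```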

81 Rule Growing: two common strategies.

82 Reminder: Rule Growing. CN2 algorithm: start from an empty conjunct {}; add the conjunct that minimizes the entropy measure ({A}, {A,B}, …); determine the rule consequent by taking the majority class of the instances covered by the rule. RIPPER algorithm: start from an empty rule {} => class; add the conjunct that maximizes FOIL's information-gain measure. With R0: {} => class (initial rule) and R1: {A} => class (rule after adding a conjunct), Gain(R0, R1) = t [ log2( p1/(p1+n1) ) − log2( p0/(p0+n0) ) ], where t is the number of positive instances covered by both R0 and R1, p0 and n0 are the positive and negative instances covered by R0, and p1 and n1 are the positive and negative instances covered by R1.

83 General-to-Specific Beam Search.

84 General-to-Specific Beam Search (4): the CN2 algorithm.
LearnOneRule( target_attribute, attributes, examples, k )
  Initialize best_hypothesis ← Ø, the most general hypothesis
  Initialize candidate_hypotheses ← { best_hypothesis }
  while ( candidate_hypotheses is not empty ) do
    1. Generate the next, more specific, candidate_hypotheses
    2. Update best_hypothesis
    3. Update candidate_hypotheses
  return a rule of the form "IF best_hypothesis THEN prediction", where prediction is the most frequent value of target_attribute among the examples that match best_hypothesis.
Performance( h, examples, target_attribute )
  h_examples ← the subset of examples that match h
  return -Entropy( h_examples ), where Entropy is computed with respect to target_attribute.

85 General-to-Specific Beam Search (5).
Generate the next, more specific, candidate_hypotheses:
  all_constraints ← the set of all constraints (a = v), where a ∈ attributes and v is a value of a occurring in the current set of examples
  new_candidate_hypotheses ← for each h in candidate_hypotheses and each c in all_constraints, create a specialization of h by adding the constraint c
  Remove from new_candidate_hypotheses any hypotheses that are duplicates, inconsistent, or not maximally specific
Update best_hypothesis:
  for all h in new_candidate_hypotheses do
    if ( statistically significant when tested on examples, and Performance( h, examples, target_attribute ) > Performance( best_hypothesis, examples, target_attribute ) )
      then best_hypothesis ← h

86 General-to-Specific Beam Search (6).
Update the candidate hypotheses:
  candidate_hypotheses ← the k best members of new_candidate_hypotheses, according to the Performance function
The Performance function guides the search in Learn-One-Rule: s is the current set of training examples, c is the number of possible values of the target attribute, and p_i is the proportion of those examples classified with the i-th value; Performance returns the negative entropy, −Σ_{i=1..c} p_i log2 p_i, negated.

87 Example for the CN2 Algorithm. LearnOneRule(EnjoySport, {Sky, AirTemp, Humidity, Wind, Water, Forecast, EnjoySport}, examples, 2). best_hypothesis = Ø; candidate_hypotheses = {Ø}; all_constraints = {Sky=Sunny, Sky=Rainy, AirTemp=Warm, AirTemp=Cold, Humidity=Normal, Humidity=High, Wind=Strong, Water=Warm, Water=Cool, Forecast=Same, Forecast=Change}. For convenience, use Performance = n_c / n, where n is the number of examples covered by the rule and n_c is the number of examples covered by the rule and classified correctly.

88 Example for the CN2 Algorithm (2) – Pass 1. Starting rule: IF { } THEN EnjoySport = Yes (Performance = n_c / n). The Remove step delivers no result; after scoring the single-constraint hypotheses, candidate_hypotheses = {Sky=Sunny, AirTemp=Warm}. Is the best_hypothesis Sky=Sunny or AirTemp=Warm?

89 Example for the CN2 Algorithm (3) – Pass 2. Current rule: IF {Sky=Sunny and ?} THEN EnjoySport = Yes. Hypotheses, their Performance, and their status:
– Sky=Sunny AND Sky=Sunny: not maximally specific
– Sky=Sunny AND Sky=Rainy: inconsistent
– Sky=Sunny AND AirTemp=Warm: 1 = 3/3 (candidate)
– Sky=Sunny AND AirTemp=Cold: undefined (n = 0)
– Sky=Sunny AND Humidity=High: 1 = 2/2 (candidate)
– Sky=Sunny AND Humidity=Normal: 1 = 1/1
– Sky=Sunny AND Wind=Strong: 1
– Sky=Sunny AND Water=Warm: 1
– Sky=Sunny AND Water=Cool: 1
– Sky=Sunny AND Forecast=Same: 1
– Sky=Sunny AND Forecast=Change: 0.5
– AirTemp=Warm AND Sky=Sunny: duplicate (see hypothesis 3)
– AirTemp=Warm AND Sky=Rainy: undefined (n = 0)
– AirTemp=Warm AND AirTemp=Warm: not maximally specific
– AirTemp=Warm AND AirTemp=Cold: inconsistent
– AirTemp=Warm AND Humidity=High: 1
– AirTemp=Warm AND Humidity=Normal: 1
– AirTemp=Warm AND Wind=Strong: 1
– AirTemp=Warm AND Water=Warm: 1
– AirTemp=Warm AND Water=Cool: 1
– AirTemp=Warm AND Forecast=Same: 1
– AirTemp=Warm AND Forecast=Change: 0.5
After removing duplicate, inconsistent, and not maximally specific hypotheses, candidate_hypotheses = {Sky=Sunny AND AirTemp=Warm, Sky=Sunny AND Humidity=High}; best_hypothesis remains Sky=Sunny.

90 Relational Learning and Inductive Logic Programming (ILP). Fixed feature vectors are a very limited representation of instances, e.g., {Sky=Sunny AND AirTemp=Warm, Sky=Sunny AND Humidity=High} — ILP can overcome this limitation of propositional rules! Examples or the target concept may require a relational representation that includes multiple entities with relationships between them (e.g., a graph with labeled edges and nodes) — relationships among multiple interrelated entities can be represented! First-order predicate logic is a more powerful representation for handling such relational descriptions — an advantage of first-order rules! Horn clauses (if-then rules in predicate logic, Prolog programs) are a useful restriction of full first-order logic that allows decidable inference — if-then rules in predicate logic are useful as a restricted representation for inference, i.e., restricted statements such as "only for ..., if ... is satisfied, then only for ...". This allows learning programs from sample I/O pairs — learning from sample data!

91 Inductive Logic Programming (ILP) Examples. Learn definitions of family relationships given data for primitive types and relations: uncle(A,B) :- brother(A,C), parent(C,B). uncle(A,B) :- husband(A,C), sister(C,D), parent(D,B). Learn recursive list programs from I/O pairs: member(X, [X | Y]). member(X, [Y | Z]) :- member(X, Z). append([ ], L, L). append([X|L1], L2, [X|L12]) :- append(L1, L2, L12).

92 Inductive Logic Programming (ILP). The goal is to induce a Horn-clause definition for some target predicate P, given definitions of a set of background predicates Q — i.e., to find a syntactically simple Horn-clause definition D for P, given background knowledge B defining the background predicates Q, such that: for every positive example p_i of P, D ∪ B |-- p_i; and for every negative example n_i of P, D ∪ B does not entail n_i. (X |-- Y means "Y follows deductively from X", or "X entails Y".) Background definitions are provided either extensionally (a list of ground tuples satisfying the predicate) or intensionally (Prolog definitions of the predicate).

93 ILP Systems. Top-down: FOIL (Quinlan, 1990). Bottom-up: CIGOL (Muggleton & Buntine, 1988), GOLEM (Muggleton, 1990). Hybrid: CHILLIN (Mooney & Zelle, 1994), PROGOL (Muggleton, 1995), ALEPH (Srinivasan, 2000).

94 FOIL: First-Order Inductive Learning. A top-down sequential covering algorithm "upgraded" to learn Prolog clauses, but without logical functions. Background knowledge must be provided extensionally. Initialize the clause for target predicate P to P(X_1, …, X_T) :- . Possible specializations of a clause include adding any of the following literals: Q_i(V_1, …, V_Ti); not(Q_i(V_1, …, V_Ti)); X_i = X_j; not(X_i = X_j), where the X's are "bound" variables already in the existing clause, at least one of V_1, …, V_Ti is a bound variable, and the others can be new. Recursive literals P(V_1, …, V_T) are allowed if they do not cause an infinite regress. Alternative possible values of new intermediate variables are handled by maintaining examples as tuples of all variable values — new intermediate variables can be handled because examples are kept as tuples over all variable values.

95 FOIL Training Data. For learning a recursive definition, the positive set must consist of all tuples of constants that satisfy the target predicate, given some fixed universe of constants. The background knowledge consists of the complete set of tuples for each background predicate over this universe. Example: consider learning a definition for the target predicate path, for finding a path in a directed acyclic graph over the nodes 1–6: path(X,Y) :- edge(X,Y). path(X,Y) :- edge(X,Z), path(Z,Y). Background knowledge B: edge: {,,,,, } path: {,,,,,,,,, } (the target predicate P and its Horn-clause definition D).

96 FOIL Negative Training Data. Negative examples of the target predicate can be provided directly, or generated indirectly by making a closed-world assumption — every pair of constants not appearing in the positive tuples for the path predicate. For convenience, writing path(X, Y), the negative path tuples are path: {,,,,,,,,,,,,,,,,, }.

97 Example: FOIL Induction (graph over nodes 1–6). Pos: path: {,,,,,,,,, } Neg: path: {,,,,,,,,,,,,,,,,, } Start with the clause path(X,Y) :- . Possible literals to add: edge(X,X), edge(Y,Y), edge(X,Y), edge(Y,X), edge(X,Z), edge(Y,Z), edge(Z,X), edge(Z,Y), path(X,X), path(Y,Y), path(X,Y), path(Y,X), path(X,Z), path(Y,Z), path(Z,X), path(Z,Y), X=Y, plus the negations of all of these.

98 Example: FOIL Induction. Test: path(X,Y) :- edge(X,X). Covers 0 positive examples — not a good literal. edge: {,,,,, } path: {,,,,,,,,, } Pos: path: {,,,,,,,,, } Neg: path: {,,,,,,,,,,,,,,,,, }

99 Sample FOIL Induction. Test: path(X,Y) :- edge(X,Y). Covers 6 positive examples and 0 negative examples — chosen as the best literal; the result is the base clause. Pos: path: {,,,,,,,,, } Neg: path: {,,,,,,,,,,,,,,,,, } edge: {,,,,, } path: {,,,,,,,,, } Possible literals: edge(X,X), edge(Y,Y), edge(X,Y), edge(Y,X), edge(X,Z), edge(Y,Z), edge(Z,X), edge(Z,Y), path(X,X), path(Y,Y), path(X,Y), path(Y,X), path(X,Z), path(Y,Z), path(Z,X), path(Z,Y), X=Y.

100 Sample FOIL Induction. Test: path(X,Y) :- edge(X,Y). Covers 6 positive examples and 0 negative examples — chosen as the best literal; the result is the base clause. Remove the covered positive tuples! Pos: {,,, } Neg: {,,,,,,,,,,,,,,,,, } edge: {,,,,, } path: {,,,,,,,,, }

101 Sample FOIL Induction. First clause learned: path(X,Y) :- edge(X,Y). Start a new clause: path(X,Y) :- . Pos: {,,, } Neg: {,,,,,,,,,,,,,,,,, } Possible literals: edge(X,X), edge(Y,Y), edge(X,Y), edge(Y,X), edge(X,Z), edge(Y,Z), edge(Z,X), edge(Z,Y), path(X,X), path(Y,Y), path(X,Y), path(Y,X), path(X,Z), path(Y,Z), path(Z,X), path(Z,Y), X=Y.

102 Sample FOIL Induction. Test: path(X,Y) :- edge(Y,X). Covers 0 positive examples and ? negative examples — not a good literal. Pos: {,,, } Neg: {,,,,,,,,,,,,,,,,, } edge: {,,,,, } path: {,,,,,,,,, }

103 Sample FOIL Induction. Test: path(X,Y) :- edge(X,Z). Covers all 4 positive examples and 14 of the 26 negative examples. Negatives are still covered; remove the uncovered examples! Pos: {,,, } Neg: {,,,,,,,,,,,,,,,,, } edge: {,,,,, } path: {,,,,,,,,, } Possible literals: edge(X,X), edge(Y,Y), edge(X,Y), edge(Y,X), edge(X,Z), edge(Y,Z), edge(Z,X), edge(Z,Y), path(X,X), path(Y,Y), path(X,Y), path(Y,X), path(X,Z), path(Y,Z), path(Z,X), path(Z,Y), X=Y.

104 Example: FOIL Induction. Test: path(X,Y) :- edge(X,Z). Covers all 4 positive examples and 14 of the 26 negative examples — eventually chosen as the best possible literal! Negatives are still covered; remove the uncovered examples! Pos: {,,, } Neg: {,,,,,,,,,,,,, } edge: {,,,,, } path: {,,,,,,,,, }

105 Example: FOIL Induction. Test: path(X,Y) :- edge(X,Z). Covers all 4 positive examples and 14 of the 26 negative examples — eventually chosen as the best possible literal. Negatives are still covered; remove the uncovered examples. Expand the tuples to account for the possible Z values! Pos: {,,,, } Neg: {,,,,,,,,,,,,, } edge: {,,,,, } path: {,,,,,,,,, }

106 Sample FOIL Induction. Test: path(X,Y) :- edge(X,Z). Covers all 4 positive examples and 15 of the 26 negative examples — eventually chosen as the best possible literal. Negatives are still covered; remove the uncovered examples. Expand the tuples to account for the possible Z values! Pos: {,,,,, } Neg: {,,,,,,,,,,,,, } edge: {,,,,, } path: {,,,,,,,,, }

107 Sample FOIL Induction. Test: path(X,Y) :- edge(X,Z). Covers all 4 positive examples and 15 of the 26 negative examples — eventually chosen as the best possible literal. Negatives are still covered; remove the uncovered examples. Expand the tuples to account for the possible Z values! Pos: {,,,,, } Neg: {,,,,,,,,,,,,, } edge: {,,,,, } path: {,,,,,,,,, }

108 Sample FOIL Induction. Test: path(X,Y) :- edge(X,Z). Covers all 4 positive examples and 15 of the 26 negative examples — eventually chosen as the best possible literal. Negatives are still covered; remove the uncovered examples. Expand the tuples to account for the possible Z values! Pos: {,,,,,, } Neg: {,,,,,,,,,,,,, } edge: {,,,,, } path: {,,,,,,,,, }

109 Example: FOIL Induction. Test: path(X,Y) :- edge(X,Z). Covers all 4 positive examples and 14 of the 26 negative examples — eventually chosen as the best possible literal. Negatives are still covered; remove the uncovered examples. Expand the tuples to account for the possible Z values. Pos: {,,,,,, } Neg: {,,,,,,,,,,,, } edge: {,,,,, } path: {,,,,,,,,, }

110 Example: FOIL Induction. Continue specializing the clause path(X,Y) :- edge(X,Z). Pos: {,,,,,, } Neg: {,,,,,,,,,,,, } edge: {,,,,, } path: {,,,,,,,,, } Possible literals: edge(X,X), edge(Y,Y), edge(X,Y), edge(Y,X), edge(X,Z), edge(Y,Z), edge(Z,X), edge(Z,Y), path(X,X), path(Y,Y), path(X,Y), path(Y,X), path(X,Z), path(Y,Z), path(Z,X), path(Z,Y), X=Y.

111 Sample FOIL Induction. Test: path(X,Y) :- edge(X,Z), edge(Z,Y). Covers 3 positive examples and 0 negative examples. For the given edges, path(3,5) :- edge(3,6), edge(6,5) exists, and path(1,6) :- edge(1,3), edge(3,6) exists; but, for example, there are no facts satisfying path(1,Y) :- edge(1,2), edge(2,Y), since the required path(1,Y) and edge(2,Y) do not exist. Pos: {,,,,,, } Neg: {,,,,,,,,,,,, } edge: {,,,,, } path: {,,,,,,,,, } Possible literals: edge(X,X), edge(Y,Y), edge(X,Y), edge(Y,X), edge(X,Z), edge(Y,Z), edge(Z,X), edge(Z,Y), path(X,X), path(Y,Y), path(X,Y), path(Y,X), path(X,Z), path(Y,Z), path(Z,X), path(Z,Y), X=Y.

112 Sample FOIL Induction. Test: path(X,Y) :- edge(X,Z), path(Z,Y). Covers 4 positive examples and 0 negative examples — eventually chosen as the best literal; this completes the clause. The definition is complete, since all original tuples are covered (by way of covering some tuple). Pos: {,,,,,, } Neg: {,,,,,,,,,,,, } edge: {,,,,, } path: {,,,,,,,,, } Pos: {,,, }

113 Rules learned by FOIL Induction: path(X,Y) :- edge(X,Y). path(X,Y) :- edge(X,Z), path(Z,Y).

114 Logic Program Induction in FOIL. FOIL has also learned: append, given components and null; reverse, given append, components, and null; quicksort, given partition, append, components, and null; and other programs from the first few chapters of a Prolog text. Learning recursive programs in FOIL requires a complete set of positive examples for some constrained universe of constants, so that a recursive call can always be evaluated extensionally — for lists, all lists of a limited length composed from a small set of constants (e.g., all lists up to length 3 over {a,b,c}); the size of the extensional background grows combinatorially. path: {,, … } Negative examples are usually computed using a closed-world assumption — this grows combinatorially large for higher-arity target predicates, e.g., path(X1, X2, …, Xn) — so negatives can be randomly sampled to keep things tractable. append([ ], L, L). append([X|L1], L2, [X|L12]) :- append(L1, L2, L12). del(X, [X|T], T). del(X, [Y|T], [Y|T1]) :- del(X, T, T1).

115 More Realistic Applications: PROLOG.
parent(pam, bob).   % Pam is a parent of Bob
parent(tom, bob).
parent(tom, liz).
parent(bob, ann).
parent(bob, pat).
parent(pat, jim).
female(pam).        % Pam is female
male(tom).
male(bob).
female(ann).
female(pat).
female(liz).
offspring(Y, X) :- parent(X, Y).
mother(X, Y) :- parent(X, Y), female(X).   % X is the mother of Y if X is a parent of Y and X is female
grandparent(X, Z) :- parent(X, Y), parent(Y, Z).
sister(X, Y) :- parent(Z, X), parent(Z, Y), female(X), female(Y).
% rule1
predecessor(X, Z) :- parent(X, Z).
% rule2
predecessor(X, Z) :- parent(X, Y), predecessor(Y, Z).
?- predecessor(pam, X).
X = bob; X = ann; X = pat; X = jim

116 Homework: Induce a Logic Program with FOIL! 1. Generate the background knowledge B, for example: parent(pam, bob). % Pam is a parent of Bob. ancestor(pam, bob). % Pam is an ancestor of Bob. ancestor(pam, pat). % Pam is an ancestor of Pat. 2. Generate negative examples by making a closed-world assumption. Target: % rule1 ancestor(X, Z) :- … % rule2 ancestor(X, Z) :- … (Family tree from the previous slide: pam and tom are parents of bob, tom of liz, bob of ann and pat, and pat of jim.)

117 FOIL Limitations. The search space of literals (branching factor) can become intractable — use aspects of bottom-up search to limit the search. It requires large extensional background definitions — use an intensional background via Prolog inference. Hill-climbing search gets stuck at local optima and may not even find a consistent clause — use limited backtracking (beam search), include determinate literals with zero gain, or use relational pathfinding or relational clichés. It requires complete examples to learn recursive definitions — use an intensional interpretation of the learned recursive clauses.

118 Rule Learning and ILP Summary. There are effective methods for learning symbolic rules from data using greedy sequential covering and top-down or bottom-up search. These methods have been extended to first-order logic to learn relational rules and recursive Prolog programs. Knowledge represented by rules is generally more interpretable by people, allowing human insight into what is learned and possible human approval and correction of the learned knowledge.

119 Learning Task. A learning algorithm induces a model (tree) from the training set, and the model is then applied deductively to the test set. Induction: O(B, D) = h such that B ∧ h ∧ x_i |– f(x_i). Deduction: F(A, B) = C, where A ∧ B |– C.

120 Induction as Inverted Deduction. Induction is finding h such that (∀⟨x_i, f(x_i)⟩ ∈ D) B ∧ h ∧ x_i |– f(x_i), where x_i is the i-th training instance, f(x_i) is the target function value for x_i, and B is other background knowledge. So let's design inductive algorithms by inverting operators for automated deduction.

121 Induction as Inverted Deduction. Example: "pairs of people ⟨u, v⟩ such that the child of u is v". f(x_i): Child(Bob, Sharon). x_i: Male(Bob), Female(Sharon), Father(Sharon, Bob). B: Parent(u, v) ← Father(u, v). What satisfies (∀⟨x_i, f(x_i)⟩ ∈ D) B ∧ h ∧ x_i |– f(x_i)? h1: Child(u, v) ← Father(v, u) — h1 ∧ x_i entails f(x_i) with no need for B. h2: Child(u, v) ← Parent(v, u) — here B ∧ h2 ∧ x_i is needed.

122 Induction as Inverted Deduction. Same setting: f(x_i): Child(Bob, Sharon); x_i: Male(Bob), Female(Sharon), Father(Sharon, Bob); B: Parent(u, v) ← Father(u, v). What satisfies (∀⟨x_i, f(x_i)⟩ ∈ D) B ∧ h ∧ x_i |– f(x_i)? h1: Child(u, v) ← Father(v, u) — h1 ∧ x_i entails f(x_i) with no need for B: from x_i, which contains Father(Sharon, Bob), and h1 we derive f(x_i) = Child(Bob, Sharon).

123 Induction as Inverted Deduction. Same setting. h2: Child(u, v) ← Parent(v, u) — here B ∧ h2 ∧ x_i is needed: B gives Parent(u, v) ← Father(u, v), x_i contains Father(Sharon, Bob), so Parent(Sharon, Bob) holds, and h2 then yields f(x_i) = Child(Bob, Sharon).

124 Induction as Inverted Deduction. We have mechanical deductive operators F(A, B) = C, where A ∧ B |– C; we need inductive operators O(B, D) = h such that (∀⟨x_i, f(x_i)⟩ ∈ D) B ∧ h ∧ x_i |– f(x_i).

125 Induction as Inverted Deduction. Positives: it subsumes the earlier idea of finding an h that "fits" the training data, and the domain theory B helps define the meaning of "fitting" the data: B ∧ h ∧ x_i |– f(x_i). Negatives: it does not allow for noisy data — consider (∀⟨x_i, f(x_i)⟩ ∈ D) B ∧ h ∧ x_i |– f(x_i) — and first-order logic gives a huge hypothesis space H, leading to overfitting and to intractability of calculating all acceptable h's.

126 Deduction: Resolution Rule. C1: P ∨ L; C2: ¬L ∨ R; resolvent C: P ∨ R. 1. Given initial clauses C1 and C2, find a literal L from clause C1 such that ¬L occurs in clause C2. 2. Form the resolvent C by including all literals from C1 and C2 except for L and ¬L. More precisely, the set of literals occurring in the conclusion C is C = (C1 − {L}) ∪ (C2 − {¬L}), where ∪ denotes set union and "−" set difference. A tiny sketch of this operator follows.
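A tiny sketch of this operator, with clauses encoded as Python sets of string literals and negation written as a leading "~" — an encoding chosen here purely for illustration.

```python
# Propositional resolution: C = (C1 - {L}) | (C2 - {negate(L)}) for a literal L
# in C1 whose negation occurs in C2.

def negate(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def resolve(c1, c2):
    for lit in c1:
        if negate(lit) in c2:
            return (c1 - {lit}) | (c2 - {negate(lit)})
    return None   # no complementary pair: the clauses do not resolve

c1 = {"PassExam", "~KnowMaterial"}      # PassExam v ~KnowMaterial
c2 = {"KnowMaterial", "~Study"}         # KnowMaterial v ~Study
print(resolve(c1, c2))                  # {'PassExam', '~Study'}
```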

127 Inverting Resolution. C = (C1 − {L}) ∪ (C2 − {¬L}). Example: C1: PassExam ∨ ¬KnowMaterial ({L} = {¬KnowMaterial}); C2: KnowMaterial ∨ ¬Study ({¬L} = {KnowMaterial}); C: PassExam ∨ ¬Study.

128 Inverted Resolution (Propositional). Given initial clauses C1 and C, find a literal L that occurs in clause C1 but not in clause C. Form the second clause C2 by including the following literals: C2 = (C − (C1 − {L})) ∪ {¬L}. Example: C1: PassExam ∨ ¬KnowMaterial; C: PassExam ∨ ¬Study; C2: KnowMaterial ∨ ¬Study. A sketch follows.
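The inverse operator can be sketched the same way, using the PassExam/KnowMaterial/Study clauses from the previous slide; the "~" literal encoding is again only illustrative.

```python
# Inverse resolution: given C and C1, recover one possible C2
# via C2 = (C - (C1 - {L})) | {negate(L)} for a literal L in C1 that is not in C.

def negate(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def inverse_resolve(c, c1):
    for lit in c1:
        if lit not in c:
            return (c - (c1 - {lit})) | {negate(lit)}
    return None

c1 = {"PassExam", "~KnowMaterial"}      # C1: PassExam v ~KnowMaterial
c  = {"PassExam", "~Study"}             # C : PassExam v ~Study
print(inverse_resolve(c, c1))           # {'KnowMaterial', '~Study'} = C2
```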

129 Inverted Resolution: First-Order Resolution. A substitution is any mapping of variables to terms, e.g., θ = {x/Bob, y/z}. A unifying substitution θ for two literals L1 and L2 is one for which L1θ = L2θ. Example: with θ = {x/Bill, z/y}, L1 = Father(x, y) and L2 = Father(Bill, z) give L1θ = L2θ = Father(Bill, y).

130 Inverted Resolution: First-Order Resolution. Resolution operator (first-order form): find a literal L1 from clause C1, a literal L2 from clause C2, and a substitution θ such that L1θ = ¬L2θ. Form the resolvent C by including all literals from C1θ and C2θ except for L1θ and ¬L2θ. More precisely, the set of literals occurring in the conclusion C is C = (C1 − {L1})θ ∪ (C2 − {L2})θ.

131 Inverted Resolution: First-Order Resolution. Example — find the resolvent C. C1 = White(x) ← Swan(x), i.e., C1 = White(x) ∨ ¬Swan(x), so L1 = ¬Swan(x); C2 = Swan(Fred), so L2 = Swan(Fred). Using the unifying substitution θ = {x/Fred}: L1θ = ¬Swan(Fred) = ¬L2θ. Then (C1 − {L1})θ = White(Fred) and (C2 − {L2})θ = Ø, so by C = (C1 − {L1})θ ∪ (C2 − {L2})θ we get C = White(Fred).

132 First-Order Resolution. 1. Find a literal L1 from clause C1, a literal L2 from clause C2, and a substitution θ such that L1θ1 = ¬L2θ2 (where θ = θ1θ2 by factorization). 2. Form the resolvent C by including all literals from C1θ and C2θ except for L1θ1 and ¬L2θ2; more precisely, the set of literals occurring in the conclusion is C = (C1 − {L1})θ1 ∪ (C2 − {L2})θ2. Rearranging: C − (C1 − {L1})θ1 = (C2 − {L2})θ2, so C − (C1 − {L1})θ1 + {L2}θ2 = C2θ2. Inverting: C2 = (C − (C1 − {L1})θ1)θ2⁻¹ ∪ {¬L1θ1θ2⁻¹}.

133 First-Order Resolution — inverse resolution, first-order case. C = (C1 − {L1})θ1 ∪ (C2 − {L2})θ2 (where θ = θ1θ2 by factorization); C − (C1 − {L1})θ1 = (C2 − {L2})θ2 (where L2 = ¬L1θ1θ2⁻¹); therefore C2 = (C − (C1 − {L1})θ1)θ2⁻¹ ∪ {¬L1θ1θ2⁻¹}.

134 Inverting Resolution (cont.) Inverse resolution, first-order case. We wish to learn rules for the target predicate GrandChild(y, x). Given training data D: GrandChild(Bob, Shannon); background information B: {Father(Shannon, Tom), Father(Tom, Bob)}. Take C = GrandChild(Bob, Shannon), C1 = Father(Shannon, Tom), L1 = Father(Shannon, Tom).

135 Inverting Resolution (cont.) Inverse resolution, first-order case. With C1 = Father(Shannon, Tom), L1 = Father(Shannon, Tom), C = GrandChild(Bob, Shannon), θ1⁻¹ = {} and θ2⁻¹ = {Shannon/x}, the operator C2 = (C − (C1 − {L1})θ1)θ2⁻¹ ∪ {¬L1θ1θ2⁻¹} gives (C)θ2⁻¹ = GrandChild(Bob, x) and {¬L1θ1θ2⁻¹} = ¬Father(x, Tom), so C2 = GrandChild(Bob, x) ∨ ¬Father(x, Tom). Resolving further against B = Father(Tom, Bob) with the inverse substitution {Bob/y, Tom/z} eventually yields GrandChild(y, x) ∨ ¬Father(x, z) ∨ ¬Father(z, y), i.e., GrandChild(y, x) ← Father(x, z), Father(z, y).

136 Inverting Resolution (cont.) Inverse resolution, first-order case. We wish to learn rules for the target predicate GrandChild(y, x). Given training data D: GrandChild(Bob, Shannon); background information B: {Father(Shannon, Tom), Father(Tom, Bob)}. C = GrandChild(Bob, Shannon), C1 = Father(Shannon, Tom), L1 = Father(Shannon, Tom). Suppose we choose the inverse substitutions θ1⁻¹ = {} and θ2⁻¹ = {Shannon/x}. Then (C − (C1 − {L1})θ1)θ2⁻¹ = (C)θ2⁻¹ = GrandChild(Bob, x) and {¬L1θ1θ2⁻¹} = ¬Father(x, Tom), so by C2 = (C − (C1 − {L1})θ1)θ2⁻¹ ∪ {¬L1θ1θ2⁻¹} we get C2 = GrandChild(Bob, x) ∨ ¬Father(x, Tom), or equivalently GrandChild(Bob, x) ← Father(x, Tom).

137 Rule Learning Issues. Which is better, rules or trees? Trees share structure between disjuncts; rules allow completely independent features in each disjunct; mapping some rule sets to decision trees results in an exponential increase in size. Example: the rule set A Λ B → P, C Λ D → P requires a tree that tests A, then B, and then repeats the C/D subtree under each failure branch — and what happens if we add the rule E Λ F → P?

138 Rule Learning Issues. Which is better, top-down or bottom-up search? Bottom-up is more subject to noise, e.g., the random seed instances that are chosen may be noisy. Top-down is wasteful when there are many features that do not even occur in the positive examples (e.g., text categorization).

139 Summary: Learning Rules from Data. Sequential covering algorithms — learning single rules by search (beam search, alternative covering methods) and learning rule sets. First-order rules — learning single first-order rules (representation: first-order Horn clauses) and extending Sequential-Covering and Learn-One-Rule with variables in the rule preconditions.

140 Summary (cont.) FOIL — learning first-order rule sets: the idea is to induce logical rules from observed relations; guiding the search in FOIL; learning recursive rule sets. Induction as inverted deduction: the idea is to induce logical rules as inverted deduction, O(B, D) = h such that (∀⟨x_i, f(x_i)⟩ ∈ D) (B ∧ h ∧ x_i) ├ f(x_i), generating only hypotheses that satisfy this constraint (cf. FOIL, which generates many hypotheses at each search step based on syntax, including ones that do not satisfy it). The inverse resolution operator can consider only a small fraction of the available data (cf. FOIL, which considers all available data).

