
1 Machine Learning: Rule Learning

2 Learning Rules
If-then rules in logic are a standard representation of knowledge that has proven useful in expert systems and other AI systems.
– In propositional logic, a set of rules for a concept is equivalent to a DNF formula.
Rules are fairly easy for people to understand and can therefore help provide insight and comprehensible results for human users.
– They are frequently used in data-mining applications where the goal is discovering understandable patterns in data.
Methods for automatically inducing rules from data have been shown to build more accurate expert systems than human knowledge engineering for some applications.
Rule-learning methods have been extended to first-order logic to handle relational (structural) representations.
– Inductive Logic Programming (ILP) learns Prolog programs from I/O pairs.
– This allows moving beyond simple feature-vector representations of data.

3 Two types of rules
Propositional: Sunny ∧ Warm → PlayTennis
First-order predicate logic:
Father(x,y) ∧ Female(y) → Daughter(y,x)
Parent(x,y) ∧ Parent(y,z) → Ancestor(x,z)
Propositional logic assumes the world contains facts; its expressive power is much lower than that of FOL. You cannot state a general rule such as Daughter(y,x) over variables, only specific ground facts about particular individuals.

4 Terminology in FOL
All expressions are composed of:
– Constants (Bob, Louise)
– Variables (x, y)
– Predicate symbols (Loves, Greater_Than)
– Function symbols (age)
Term: any constant, any variable, or any function applied to any term (Bob, x, age(Bob)).
Literal: any predicate, or its negation, applied to any terms.
Loves(Bob, Louise) is a positive literal; ¬Greater_Than(age(x), 20) is a negative literal.

5 More terminology
Ground literal: a literal that does not contain any variables, e.g. Loves(Bob, Louise).
Clause: any disjunction of literals, where all variables are assumed to be universally quantified.
Horn clause: a clause containing at most one positive (non-negated) literal,
H ∨ ¬L1 ∨ … ∨ ¬Ln
or equivalently
(L1 ∧ … ∧ Ln) → H
where L1 … Ln is the body (the antecedents) and H is the head (the consequent).
Examples of literals Li: Loves(x,y), Color(blue), or equivalently Color=blue.
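As a concrete aside (not part of the original slides), here is a minimal Python sketch of one way literals and Horn clauses might be represented; the class and field names are illustrative choices, not a standard API.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Literal:
    predicate: str             # e.g. "Loves" or "Father"
    args: Tuple[str, ...]      # constants or variables, e.g. ("Bob", "Louise")
    negated: bool = False      # True for a negative literal

    def __str__(self):
        s = f"{self.predicate}({', '.join(self.args)})"
        return ("¬" + s) if self.negated else s

@dataclass(frozen=True)
class HornClause:
    head: Literal                 # the single positive literal (consequent)
    body: Tuple[Literal, ...]     # conjunction of antecedent literals

    def __str__(self):
        return " ∧ ".join(map(str, self.body)) + " → " + str(self.head)

# The rule Father(x,y) ∧ Female(y) → Daughter(y,x) from the previous slide:
rule = HornClause(
    head=Literal("Daughter", ("y", "x")),
    body=(Literal("Father", ("x", "y")), Literal("Female", ("y",))),
)
print(rule)    # Father(x, y) ∧ Female(y) → Daughter(y, x)
```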

6 Rule Learning Approaches
Translate decision trees into rules (C4.5).
Sequential (set) covering algorithms:
– general-to-specific (top-down): CN2, FOIL
– specific-to-general (bottom-up): GOLEM, CIGOL
– hybrid search: AQ, Chillin, Progol
Translate neural nets into rules (TREPAN) (not in this course).

7 Decision Trees to Rules
For each path in a decision tree from the root to a leaf, create a rule with the conjunction of tests along the path as the antecedent and the leaf label as the consequent.
[Figure: a tree that tests color (red / blue / green) at the root and tests shape (circle / square / triangle) under the red branch; the leaves are labelled A, B, C.]
red ∧ circle → A
blue → B
red ∧ square → B
green → C
red ∧ triangle → C
Notice again that these are all equivalent notations:
Color(red) ∧ Shape(circle) → A
Color=red ∧ Shape=circle → A
red ∧ circle → A
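A small sketch of this path extraction (my own illustration; the nested-dict tree encoding is an assumption made for the example, mirroring the color/shape tree above):

```python
# Decision tree as nested dicts: an internal node maps a feature name to
# {value: subtree}; a leaf is simply a class label (a string).
tree = {
    "color": {
        "red":   {"shape": {"circle": "A", "square": "B", "triangle": "C"}},
        "blue":  "B",
        "green": "C",
    }
}

def tree_to_rules(node, path=()):
    """Return one (antecedent, label) pair per root-to-leaf path."""
    if isinstance(node, str):                      # leaf: emit the accumulated tests
        return [(path, node)]
    (feature, branches), = node.items()            # exactly one test per internal node
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, path + ((feature, value),)))
    return rules

for antecedent, label in tree_to_rules(tree):
    print(" ∧ ".join(f"{f}={v}" for f, v in antecedent), "→", label)
# color=red ∧ shape=circle → A,  color=blue → B,  color=green → C, ...
```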

8 Post-Processing Decision-Tree Rules
1. The resulting rules may contain unnecessary antecedents, i.e. tests that are not actually needed to exclude negative examples, and which result in over-fitting.
2. Rules are post-pruned by greedily removing antecedents or rules until performance on the training data or on a validation set is significantly harmed.
Example: a ∧ b ∧ c ∧ d → A vs. a ∧ b ∧ c → A, or a ∧ b → A, etc.
(Remember that the support of test nodes decreases from the root towards the leaves; the most informative features are tested first, according to e.g. information gain.)

9 Post-Processing Decision-Tree Rules
3. The resulting rules may lead to competing, conflicting conclusions on some instances, e.g.:
a ∧ b ∧ c ∧ d → A
a ∧ b ∧ c ∧ f → B
4. Sort the rules by training (or validation) accuracy to create an ordered decision list. The first rule in the list that applies is used to classify a test instance.
red ∧ circle → A (97% train accuracy)
red ∧ big → B (95% train accuracy)
:
A test case that matches both rules (i.e. a big red circle) is assigned to class A.
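A minimal sketch of classifying with such an ordered decision list (my own illustration; the set-of-feature-values instance encoding and the default class are assumptions made for the example):

```python
# Each rule: (required feature values, class label, training accuracy).
rules = [
    ({"red", "circle"}, "A", 0.97),
    ({"red", "big"},    "B", 0.95),
]
rules.sort(key=lambda r: r[2], reverse=True)    # order by (training) accuracy

def classify(instance, rules, default="C"):
    """Return the label of the first rule whose antecedents are all satisfied."""
    for tests, label, _acc in rules:
        if tests <= instance:                   # every required value is present
            return label
    return default                              # no rule fires

print(classify({"red", "big", "circle"}, rules))    # -> A (first applicable rule)
```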

10 Sequential Covering
A set of rules is learned one at a time: each iteration finds a single rule that covers a large number of positive instances without covering any negatives, removes the positives that it covers, and learns additional rules to cover the rest.
Let P be the set of positive examples.
Until P is empty do:
– Learn a rule R that covers a large number of elements of P but no negatives.
– Add R to the list of rules.
– Remove the positives covered by R from P.
This is an instance of the greedy algorithm for minimum set covering and does not guarantee a minimum number of learned rules. Minimum set covering is an NP-hard problem, and the greedy algorithm is a standard approximation algorithm. Methods for learning individual rules vary (see the following slides); a sketch of the outer loop is given below.
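A sketch of that outer loop (my own illustration; learn_one_rule is a stand-in for whichever single-rule learner is used, such as the top-down search described later):

```python
def sequential_covering(positives, negatives, learn_one_rule):
    """Greedy set covering: learn rules until all positives are covered.

    learn_one_rule(P, N) must return a pair (rule, covers), where covers(x)
    tests whether the learned rule fires on instance x.
    """
    rules = []
    P = list(positives)
    while P:
        rule, covers = learn_one_rule(P, negatives)
        rules.append(rule)
        remaining = [x for x in P if not covers(x)]
        if len(remaining) == len(P):    # safeguard: no progress was made, stop
            break
        P = remaining
    return rules
```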

11–17 Greedy Sequential Covering Example
[Figures: a 2D feature space with axes X and Y, the two features describing each instance, containing positive (+) and negative instances. Each learned rule is an axis-parallel rectangle, e.g. (A<X<B) ∧ (C<Y<D); after each rule is learned, the positives it covers are removed and the process repeats until no positives remain.]

18 Non-optimal Covering Example
[Figure: the same kind of data, arranged so that the greedy choice of rectangles does not yield the minimum possible number of rules.]

19–26 Greedy Sequential Covering Example (continued)
[Figures: the greedy procedure run on the non-optimal configuration, again removing the covered positives after each rule until none remain.]

27 Strategies for Learning a Single Rule in GSC
Top-down (general to specific):
– Start with the most general (empty) rule.
– Repeatedly add antecedent constraints on features that eliminate negative examples while maintaining as many positives as possible.
– Stop when only positives are covered.
Bottom-up (specific to general):
– Start with a most specific rule (e.g. the complete instance description of a random instance).
– Repeatedly remove antecedent constraints in order to cover more positives.
– Stop when further generalization results in covering negatives.
(A top-down sketch follows below.)
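A minimal top-down sketch (my own illustration, for instances represented as dicts of discrete feature values; the purity-then-coverage scoring heuristic is a simple placeholder, not the exact measure used by any particular system):

```python
def learn_one_rule_top_down(P, N):
    """Greedily add feature=value tests until no negatives are covered.

    P, N: lists of dicts mapping feature names to discrete values.
    Returns (rule, covers): rule is a dict of required feature values,
    covers(x) tests whether an instance satisfies all of them.
    """
    rule = {}                                    # most general rule: empty antecedent
    P_cov, N_cov = list(P), list(N)
    while N_cov:                                 # negatives still covered -> specialize
        candidates = {(f, v) for x in P_cov for f, v in x.items() if f not in rule}
        if not candidates:
            break                                # nothing left to add
        def score(test):                         # keep positives, drop negatives
            f, v = test
            p = sum(1 for x in P_cov if x.get(f) == v)
            n = sum(1 for x in N_cov if x.get(f) == v)
            return (p / (p + n + 1e-9), p)
        f, v = max(candidates, key=score)
        rule[f] = v                              # add the antecedent constraint
        P_cov = [x for x in P_cov if x.get(f) == v]
        N_cov = [x for x in N_cov if x.get(f) == v]
    covers = lambda x: all(x.get(f) == v for f, v in rule.items())
    return rule, covers
```

This plugs directly into the sequential_covering sketch above as its learn_one_rule argument.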

28–32 Top-Down Rule Learning Example
[Figures: the same 2D data; starting from the empty rule, constraints are added one at a time, Y>C1, then X>C2, then Y<C3, then X<C4, shrinking the covered region until only positives remain inside it.]

33–43 Bottom-Up Rule Learning Example
[Figures: starting from a maximally specific rule built from a single positive instance, the covered region is generalized step by step to take in more positives, stopping before a negative would be covered.]

44 Learning First-Order Rules
First-order predicate logic rules are much more expressive than propositional rules, as noted before.
Learning them is often referred to as inductive logic programming (ILP).

45 FOIL
A top-down approach originally applied to first-order logic (Quinlan, 1990).
Basic algorithm for instances with discrete-valued features:
Let A = {} (the set of rule antecedents)
Let N be the set of negative examples
Let P be the current set of uncovered positive examples
Until N is empty do:
– For every feature-value pair (literal) (Fi = Vij), calculate Gain(Fi = Vij, P, N).
– Pick the literal L with the highest gain.
– Add L to A.
– Remove from N any examples that do not satisfy L.
– Remove from P any examples that do not satisfy L.
Return the rule: A1 ∧ A2 ∧ … ∧ An → Positive

46 Two Differences between Sequential Covering and FOIL
FOIL uses a different approach for generating general-to-specific candidate specializations.
FOIL employs a performance measure, Foil_Gain, that is different from entropy (cf. decision trees).

47 Rules for Generating Candidate Specializations in FOIL
FOIL generates new literals that can each be added to a rule of the form:
L1 ∧ … ∧ Ln → P(x1, x2, …, xk)
A new candidate literal Ln+1 can be:
– a new literal Q(v1, …, vr), where Q is any allowable predicate, each vi is either a new variable (e.g. xk+1) or a variable already present in the rule (one of x1, …, xk), and at least one of the vi in the new literal is already in the rule;
– the literal Equal(xj, xk), where xj and xk are variables already in the rule;
– the negation of either of the above forms of literals.
(A sketch of this generation step follows below.)
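A rough sketch of this generation step (my own illustration; literals are produced as plain (negated, predicate, args) tuples, and real FOIL implementations apply further typing and pruning constraints not shown here):

```python
from itertools import product

def candidate_literals(predicates, rule_vars, new_var):
    """Generate FOIL-style candidate literals.

    predicates: dict mapping predicate name to arity, e.g. {"Father": 2, "Female": 1}
    rule_vars:  variables already occurring in the rule, e.g. ["x", "y"]
    new_var:    the fresh variable that may be introduced, e.g. "z"
    """
    old = set(rule_vars)
    all_vars = list(rule_vars) + [new_var]
    literals = []
    for pred, arity in predicates.items():
        for args in product(all_vars, repeat=arity):
            if old & set(args):                     # at least one "old" variable
                literals.append((False, pred, args))
    for a, b in product(rule_vars, repeat=2):       # Equal over existing variables only
        if a != b:
            literals.append((False, "Equal", (a, b)))
    literals += [(True, p, args) for (_, p, args) in list(literals)]   # add negations
    return literals

cands = candidate_literals({"Father": 2, "Female": 1}, ["x", "y"], "z")
# Yields Father(y,z), Female(x), Equal(x,y), ¬Father(x,z), ... as on the next slide,
# plus repeated-variable literals such as Father(x,x), which the slide omits.
```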

48 Example

49 An Example
Goal: learn rules to predict GrandDaughter(x,y).
Possible predicates:
– Father
– Female
Initial rule (empty antecedent): → GrandDaughter(x,y)

50 1. Generate Candidate Literals (see the previous rules)
Candidates for specializing → GrandDaughter(x,y):
Equal(x,y), Female(x), Female(y), Father(x,y), Father(y,x), Father(x,z), Father(z,x), Father(y,z), Father(z,y)
¬Equal(x,y), ¬Female(x), ¬Female(y), ¬Father(x,y), ¬Father(y,x), ¬Father(x,z), ¬Father(z,x), ¬Father(y,z), ¬Father(z,y)
Notes:
1. The new variable is z.
2. A new variable cannot be added unless the literal includes at least one "old" variable.
3. Equal applies only to the existing variables x and y.
4. For any literal, its negation is also added.

51 Consider all alternatives and select the best (see Foil_Gain, later).

52 Extending the Best Rule
Suppose Father(y,z) is selected as the best of the candidate literals (according to the gain measure, see later).
The rule that is generated is: Father(y,z) → GrandDaughter(x,y)
New literals for the next iteration include:
– all remaining previous literals;
– plus new literals (now adding a new variable w): Female(z), Equal(z,x), Equal(z,y), Father(z,x), Father(z,y), Father(z,w), Father(w,z), and the negations of all of these.


54 Termination of Specialization
When only positive examples are covered by the rule, the search for specializations of the rule terminates.
All positive examples covered by the rule are removed from the training set.
If additional positive examples remain, a new search for an additional rule is started.


56 Guiding the Search
To determine the performance of candidate rules over the training set at each step, all possible bindings of variables to the rule must be considered.
In general, rules whose bindings are mostly positive are preferred over rules with many negative bindings.

57 Example
Training data assertions:
GrandDaughter(Victor, Sharon)
Female(Sharon)
Father(Sharon, Bob)
Father(Tom, Bob)
Father(Bob, Victor)
Use the closed-world assumption: any literal involving the specified predicates and constants that is not listed is assumed to be false (what is not currently known to be true is false), e.g.:
¬GrandDaughter(Tom, Bob)
¬GrandDaughter(Victor, Victor)
¬GrandDaughter(Bob, Victor)
¬Female(Tom)
etc.

58 Possible Variable Bindings
Initial rule: → GrandDaughter(x,y)
Possible bindings from the training assertions (how many possible bindings of the 4 constants to the variables of the initial rule?):
Positive binding: {x/Victor, y/Sharon}
Negative bindings: {x/Victor, y/Victor}, {x/Tom, y/Bob}, etc.
Positive bindings provide evidence for, and negative bindings provide evidence against, the rule under consideration.

59 How many bindings?
Order matters, so we must count permutations (e.g. GD(Sharon,Sue) ≠ GD(Sue,Sharon)).
Repetitions are allowed (e.g. GD(Sharon,Sharon)).
So how many? N^k: with N=4 persons and k=2 variables, that gives 16!
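A small sketch (my own illustration) that enumerates the 4^2 = 16 bindings for GrandDaughter(x,y) over the four constants and labels them using the closed-world assumption:

```python
from itertools import product

constants = ["Victor", "Sharon", "Bob", "Tom"]
positive_facts = {("GrandDaughter", ("Victor", "Sharon"))}     # from the training data

bindings = list(product(constants, repeat=2))                  # all (x, y) pairs: N**k = 16
pos = [b for b in bindings if ("GrandDaughter", b) in positive_facts]
neg = [b for b in bindings if ("GrandDaughter", b) not in positive_facts]   # closed world

print(len(bindings), len(pos), len(neg))                       # 16 1 15
```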

60 Evaluation Function: Foil_Gain
An estimate of the utility of adding a new literal, based on the numbers of positive and negative bindings covered before and after adding the new literal.

61 Foil_Gain
R: a rule
L: a candidate literal
R': the rule created by adding L to R
p0: number of positive bindings of rule R
p1: number of positive bindings of rule R'
n0: number of negative bindings of rule R
n1: number of negative bindings of rule R'
t: number of positive bindings of R still covered after adding L to yield R'
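The formula built from these quantities is the standard FOIL gain (consistent with the bit-counting interpretation on the next slide):

Foil_Gain(L, R) = t · ( log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) )

A direct Python transcription, as a small illustrative sketch:

```python
from math import log2

def foil_gain(p0, n0, p1, n1, t):
    """FOIL gain of adding literal L to rule R (notation as in the slide above)."""
    bits_before = -log2(p0 / (p0 + n0))    # bits to encode a positive binding under R
    bits_after  = -log2(p1 / (p1 + n1))    # bits to encode a positive binding under R'
    return t * (bits_before - bits_after)  # reduction summed over the t surviving positives
```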

62 We have 16 bindings (4^2) for GrandDaughter(x,y), of which one is positive.

63 Interpretation of Foil_Gain
Interpreted in terms of information theory:
– The second log term is the minimum number of bits needed to encode the classification of an arbitrary positive binding among the bindings covered by R.
– The first log term is the number of bits required if the binding is one of those covered by rule R'.
– Because t is the number of bindings of R that remain covered by R', Foil_Gain(L,R) is the reduction, due to L, in the total number of bits needed to encode the classification of all positive bindings of R.

64 For rule 1 [figure]

65 For rule 2 [figure]

66 Example: summer-school database

67 [Figure: positive and negative examples of the target concept, drawn from the relations subscription, course, company and person.]

68 Learning Recursive Rule Sets
Allow new literals added to the rule body to refer to the target predicate itself.
Example:
edge(X,Y) → path(X,Y)
edge(X,Z) ∧ path(Z,Y) → path(X,Y)
FOIL can accommodate recursive rules, but must take precautions to avoid infinite recursion.
(A small sketch of evaluating this definition follows below.)
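A minimal sketch (my own illustration, in Python rather than Prolog) of what the learned definition computes over a set of edge facts; the depth-first search mirrors the two clauses, base case first, and the edge set used here is purely illustrative:

```python
def path_holds(x, y, edges, visited=None):
    """True iff path(x, y) follows from edge(X,Y) → path(X,Y)
    and edge(X,Z) ∧ path(Z,Y) → path(X,Y)."""
    if (x, y) in edges:                        # base clause
        return True
    visited = visited or set()
    for (a, z) in edges:                       # recursive clause: edge(x, z) ∧ path(z, y)
        if a == x and z not in visited:
            if path_holds(z, y, edges, visited | {x}):
                return True
    return False

edges = {(1, 2), (1, 3), (3, 6), (6, 5)}       # hypothetical acyclic graph, not the slide's
print(path_holds(1, 5, edges), path_holds(2, 5, edges))    # True False
```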

69 FOIL Example 2 (recursive)
Consider learning a definition for the target predicate path, for finding a path in a directed acyclic graph.
edge(X,Y) → path(X,Y)
edge(X,Z) ∧ path(Z,Y) → path(X,Y)
[Figure: a directed acyclic graph on nodes 1–6, together with the extension of edge (6 tuples) and of path (10 tuples).]

70 FOIL Negative Training Data
Negative examples of the target predicate can be provided directly or, as in the previous example, generated indirectly by making a closed-world assumption:
– every pair of constants not among the positive tuples for the path predicate.
[Figure: the same graph, with the list of negative path tuples.]

71 Sample FOIL Induction
Pos: {…}   Neg: {…}
Start with the clause: → path(X,Y)
Possible literals to add: edge(X,X), edge(Y,Y), edge(X,Y), edge(Y,X), edge(X,Z), edge(Y,Z), edge(Z,X), edge(Z,Y), path(X,X), path(Y,Y), path(X,Y), path(Y,X), path(X,Z), path(Y,Z), path(Z,X), path(Z,Y), X=Y, plus the negations of all of these.

72 Sample FOIL Induction
Pos: {…}   Neg: {…}
Test: edge(X,X) → path(X,Y)
Covers 0 positive examples and 6 negative examples. Not a good literal.

73 Sample FOIL Induction
Pos: {…}   Neg: {…}
Test: edge(X,Y) → path(X,Y)
Covers 6 positive examples and 0 negative examples. Chosen as the best literal; the result is the base clause.

74 Sample FOIL Induction
Pos: {…} (the positives not covered by the base clause)   Neg: {…}
Test: edge(X,Y) → path(X,Y)
Covers 6 positive examples and 0 negative examples. Chosen as the best literal; the result is the base clause.
Remove the covered positive tuples.

75 Sample FOIL Induction
Pos: {…}   Neg: {…}
Start a new clause: → path(X,Y)

76 Sample FOIL Induction
Pos: {…}   Neg: {…}
Test: edge(X,Y) → path(X,Y)
Covers 0 positive examples and 0 negative examples. Not a good literal.

77 Sample FOIL Induction
Pos: {…}   Neg: {…}
Test: edge(X,Z) → path(X,Y)
Covers all 4 positive examples and 14 of 26 negative examples. Eventually chosen as the best possible literal.

78–80 Sample FOIL Induction
Pos: {…}   Neg: {…}
Test: edge(X,Z) → path(X,Y)
Covers all 4 positive examples and 15 of 26 negative examples. Eventually chosen as the best possible literal.
Since negatives are still covered, the no-longer-covered examples are removed and the positive tuples are expanded to account for the possible values of Z.

81 Sample FOIL Induction
Pos: {…}   Neg: {…}
Continue specializing the clause: edge(X,Z) → path(X,Y)

82 Sample FOIL Induction
Pos: {…}   Neg: {…}
Test: edge(X,Z) ∧ edge(Z,Y) → path(X,Y)
Covers 3 positive examples and 0 negative examples.

83 Sample FOIL Induction
Pos: {…}   Neg: {…}
Test: edge(X,Z) ∧ path(Z,Y) → path(X,Y)
Covers 4 positive examples and 0 negative examples. Eventually chosen as the best literal; completes the clause.
The definition is complete, since all of the original tuples are covered (by way of covering some expanded tuple).

84 Recursion Limitation
Must not build a clause that results in an infinite regress:
– path(X,Y) → path(X,Y)
– path(X,Y) → path(Y,X)
To guarantee termination of the learned clause, at least one argument must be "reduced" according to some well-founded partial ordering.
A binary predicate R is a well-founded partial ordering if its transitive closure does not contain R(a,a) for any constant a, e.g.:
– rest(A,B)
– edge(A,B) for an acyclic graph

85 Ensuring Termination in FOIL
FOIL first empirically determines all binary predicates in the background that form a well-founded partial ordering, by computing their transitive closures.
It then only allows recursive calls in which one of the arguments is reduced according to a known well-founded partial ordering.
In edge(X,Z) ∧ path(Z,Y) → path(X,Y), X is reduced to Z by edge, so this recursive call is OK.
This may prevent legal recursive calls that terminate for some other, more complex reason; due to the halting problem, it cannot be determined whether an arbitrary recursive definition is guaranteed to halt.
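A rough sketch (my own illustration) of the well-foundedness check described above: compute the transitive closure of a binary predicate's extension and reject it if any R(a,a) appears.

```python
def is_well_founded(pairs):
    """True iff the transitive closure of the binary relation contains no (a, a)."""
    closure = set(pairs)
    changed = True
    while changed:                                   # naive transitive-closure fixpoint
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return all(a != b for (a, b) in closure)

print(is_well_founded({(1, 2), (2, 3)}))             # True: usable for reducing an argument
print(is_well_founded({(1, 2), (2, 1)}))             # False: closure contains (1, 1)
```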

86 Some Applications
Classifying chemical compounds as mutagenic (cancer-causing) based on their graphical molecular structure and chemical background knowledge.
Classifying web documents based on both the content of the page and its links to and from other pages with particular content, e.g.:
– A web page is a university faculty home page if:
it contains the words "Professor" and "University", and
it is pointed to by a page with the word "faculty", and
it points to a page with the words "course" and "exam".

87 FOIL Limitations
The search space of literals (branching factor) can become intractable.
– Use aspects of bottom-up search to limit the search.
Requires large extensional background definitions.
– Use intensional background knowledge via Prolog inference.
Hill-climbing search gets stuck at local optima and may not even find a consistent clause.
– Use limited backtracking (beam search).
– Include determinate literals with zero gain.
– Use relational pathfinding or relational clichés.
Requires complete examples to learn recursive definitions.
– Use an intensional interpretation of learned recursive clauses.

88 FOIL Limitations (cont.)
Requires a large set of closed-world negatives.
– Exploit "output completeness" to provide implicit negatives, e.g. past-tense([s,i,n,g], [s,a,n,g]).
Inability to handle logical functions.
– Use bottom-up methods that handle functions.
Background predicates must be sufficient to construct the definition; e.g. reverse cannot be learned unless append is given.
– Predicate invention: learn reverse by inventing append; learn sort by inventing insert.

89 Rule Learning and ILP Summary
There are effective methods for learning symbolic rules from data using greedy sequential covering and top-down or bottom-up search.
These methods have been extended to first-order logic to learn relational rules and recursive Prolog programs.
Knowledge represented by rules is generally more interpretable by people, allowing human insight into what is learned and possible human approval and correction of the learned knowledge.


91 Literals and clauses [figure]

92
IF triangle(T1) ∧ in(T1, T2) ∧ triangle(T2) THEN Class = yes
ELSIF triangle(T1) ∧ in(T1, T2) THEN Class = no
ELSIF triangle(T1) THEN Class = no
ELSIF circle(C) THEN Class = no
ELSE Class = yes

