Rule-based Learning Propositional Version. Rule Learning Based on generalization operations A generalization (resp. specialization) operation is an operation.

1 Rule-based Learning Propositional Version

2 Rule Learning Based on generalization operations A generalization (resp. specialization) operation is an operation that transforms a concept X into a new concept Y such that Y is more general (resp. specific) than X Examples include dropping/adding conditions, adding/removing disjuncts, etc.

3 Covering Relation Generalization operations induce a partial ordering on the concept space, called the covering relation Partial order = reflexive, antisymmetric and transitive Lattice topology
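As a rough illustration of the covering relation, the sketch below assumes that hypotheses are conjunctions of attribute = value conditions stored as Python dicts (an encoding chosen here, not taken from the slides). Dropping conditions generalizes, so a simple syntactic test is whether one conjunction's conditions are a subset of the other's. The last line shows two hypotheses that are not comparable, which is why the ordering is only partial.

```python
def more_general_or_equal(g, s):
    """Syntactic test for 'g is at least as general as s'.

    Assumption: a hypothesis is a dict {attribute: value} read as a conjunction,
    so g is at least as general as s when g's conditions are a subset of s's.
    """
    return set(g.items()) <= set(s.items())


top = {}                                                  # no conditions: covers everything
a = {"Income Level": "High"}
b = {"Income Level": "High", "Credit History": "Good"}    # a plus one more condition
c = {"Credit History": "Good"}

print(more_general_or_equal(top, b))   # True
print(more_general_or_equal(a, b))     # True
print(more_general_or_equal(b, a))     # False
print(more_general_or_equal(a, c), more_general_or_equal(c, a))   # False False
```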

4 Sequential Covering (I) Learning consists of iteratively learning rules that cover yet-uncovered training instances. Assume the existence of a Learn_one_Rule function: Input: a set of training instances. Output: a single high-accuracy (not necessarily high-coverage) rule. Then, we can define the following generic rule set learning algorithm:

5 Sequential Covering (II)
Algorithm Sequential_Covering( Instances )
  Learned_rules = {}
  Rule = Learn_one_Rule( Instances )
  While Quality( Rule, Instances ) > Threshold Do
    Learned_rules = Learned_rules + Rule
    Instances = Instances - {instances correctly classified by Rule }
    Rule = Learn_one_Rule( Instances )
  Sort Learned_rules by Quality over Instances
  Return Learned_rules
where Quality() is a user-defined rule quality evaluation function
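A minimal Python sketch of the generic covering loop might look as follows. The rule encoding (a dict of conditions plus a predicted value), the accuracy-based quality function, the single-condition learner and the toy data are all assumptions made for illustration. Two guards that the slide does not spell out are added so the loop always terminates, and the final sort is done over the full training set.

```python
from collections import Counter


def matches(conds, x):
    """A rule body is a dict of attribute = value conditions (an assumed encoding)."""
    return all(x[a] == v for a, v in conds.items())


def accuracy(rule, instances, target):
    """One possible Quality(): fraction of covered instances the rule gets right."""
    conds, pred = rule
    covered = [x for x in instances if matches(conds, x)]
    return sum(x[target] == pred for x in covered) / len(covered) if covered else 0.0


def learn_one_condition_rule(instances, target):
    """Toy stand-in for Learn_one_Rule: best single-condition rule by (accuracy, coverage)."""
    best_key, best_rule = None, None
    for x in instances:
        for a, v in x.items():
            if a == target:
                continue
            covered = [y for y in instances if y[a] == v]
            pred = Counter(y[target] for y in covered).most_common(1)[0][0]
            rule = ({a: v}, pred)
            key = (accuracy(rule, instances, target), len(covered))
            if best_key is None or key > best_key:
                best_key, best_rule = key, rule
    return best_rule


def sequential_covering(instances, target, learn_one_rule, quality, threshold):
    remaining, learned = list(instances), []
    while remaining:  # extra guard: stop once every instance is covered
        rule = learn_one_rule(remaining, target)
        if quality(rule, remaining, target) <= threshold:
            break
        learned.append(rule)
        conds, pred = rule
        kept = [x for x in remaining if not (matches(conds, x) and x[target] == pred)]
        if len(kept) == len(remaining):  # extra guard: no progress, avoid looping forever
            break
        remaining = kept
    # Rank the learned rules on the full training set, best first.
    learned.sort(key=lambda r: quality(r, instances, target), reverse=True)
    return learned


data = [  # made-up toy data, not the training set used in the slides
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
]
rules = sequential_covering(data, "play", learn_one_condition_rule, accuracy, 0.5)
for conds, pred in rules:
    print("IF", conds, "THEN", pred)
```

The result is read as an ordered rule list: an instance is classified by the first rule whose conditions it matches.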

6 CN2 (I)
Algorithm Learn_one_Rule_CN2( Instances, k )
  Best_hypo = {}
  Candidate_hypo = { Best_hypo }
  While Candidate_hypo <> {} Do
    All_constraints = {( a = v ): a is an attribute and v is a value of a found in Instances}
    New_candidate_hypo =
      For each h in Candidate_hypo
        For each c in All_constraints, specialize h by adding c
    Remove from New_candidate_hypo any hypotheses that are duplicates, inconsistent or not maximally specific
    For all h in New_candidate_hypo
      If Quality_CN2( h, Instances ) > Quality_CN2( Best_hypo, Instances )
        Best_hypo = h
    Candidate_hypo = the k best members of New_candidate_hypo as per Quality_CN2
  Return a rule of the form “IF Best_hypo THEN Pred ” where Pred = most frequent target attribute's value among the instances that match Best_hypo

7 CN2 (II)
Algorithm Quality_CN2( h, Instances )
  h_instances = { i in Instances : i matches h }
  Return -Entropy( h_instances )
where Entropy is computed with respect to the target attribute
Note that CN2 performs a general-to-specific beam search, keeping not the single best candidate at each step, but a list of the k best candidates.
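The beam search and its quality function can be sketched in Python as below. This is one possible reading of the pseudocode, not a reference implementation: hypotheses are frozensets of (attribute, value) pairs, a hypothesis that matches no instance is given the worst score, ties are broken alphabetically, and the four-instance data set is invented for the example.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def matching(h, instances):
    return [x for x in instances if all(x[a] == v for a, v in h)]


def quality_cn2(h, instances, target):
    labels = [x[target] for x in matching(h, instances)]
    # Assumption: a hypothesis matching no instance gets the worst possible score.
    return -entropy(labels) if labels else float("-inf")


def learn_one_rule_cn2(instances, target, k):
    constraints = sorted({(a, v) for x in instances for a, v in x.items() if a != target})
    best = frozenset()             # the empty conjunction matches everything
    candidates = [best]
    while candidates:
        new = set()                # using a set drops duplicate hypotheses
        for h in candidates:
            used = {a for a, _ in h}
            for a, v in constraints:
                if a not in used:  # a second value for the same attribute would be inconsistent
                    new.add(h | {(a, v)})
        # Best first; ties broken alphabetically so the result is deterministic.
        ranked = sorted(new, key=lambda h: (-quality_cn2(h, instances, target), sorted(h)))
        for h in ranked:
            if quality_cn2(h, instances, target) > quality_cn2(best, instances, target):
                best = h
        candidates = ranked[:k]    # beam of width k
    labels = [x[target] for x in matching(best, instances)]
    return dict(best), Counter(labels).most_common(1)[0][0]


data = [  # invented toy data, not the slides' training set
    {"income": "low", "credit": "bad", "risk": "high"},
    {"income": "low", "credit": "good", "risk": "high"},
    {"income": "high", "credit": "good", "risk": "low"},
    {"income": "high", "credit": "bad", "risk": "moderate"},
]
rule = learn_one_rule_cn2(data, "risk", k=2)
print(rule)
```

The search stops on its own because each round adds one attribute to every hypothesis, so the candidate set becomes empty once all attributes are used.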

8 Illustrative Training Set

9 CN2 Example (I) First pass: Full instance set (coverage counts are given as HIGH-MODERATE-LOW) 2-best1: « Income Level = Low » (4-0-0), « Income Level = High » (0-1-5) Can’t do better than (4-0-0) Best_hypo: « Income Level = Low » First rule: IF Income Level = Low THEN HIGH

10 CN2 Example (II) Second pass: Instances 2-3, 5-6, 8-10, 12-14 2-best1: « Income Level = High » (0-1-5), « Credit History = Good » (0-1-3) Best_hypo: « Income Level = High » 2-best2: « Income Level = High AND Credit History = Good » (0-0-3), « Income Level = High AND Collateral = None » (0-0-3) Best_hypo: « Income Level = High AND Credit History = Good » Can’t do better than (0-0-3) Second rule: IF Income Level = High AND Credit History = Good THEN LOW

11 CN2 Example (III) Third pass: Instances 2-3, 5-6, 8, 12, 14 2-best1: « Credit History = Good » (0-1-0), « Debt Level = High » (2-1-0) Best_hypo: « Credit History = Good » Can’t do better than (0-1-0) Third rule: IF Credit History = Good THEN MODERATE

12 CN2 Example (IV) Fourth pass: Instances 2-3, 5-6, 8, 14 2-best1: « Debt Level = High » (2-0-0), « Income Level = Medium » (2-1-0) Best_hypo: « Debt Level = High » Can’t do better than (2-0-0) Fourth rule: IF Debt Level = High THEN HIGH

13 CN2 Example (V) Fifth pass: Instances 3, 5-6, 8 2-best1: « Credit History = Bad » (0-1-0), « Income Level = Medium » (0-1-0) Best_hypo: « Credit History = Bad » Can’t do better than (0-1-0) Fifth rule: IF Credit History = Bad THEN MODERATE

14 CN2 Example (VI) Sixth pass: Instances 3, 5-6 2-best1: « Income Level = High » (0-0-2), « Collateral = Adequate » (0-0-1) Best_hypo: « Income Level = High » Can’t do better than (0-0-2) Sixth rule: IF Income Level = High THEN LOW

15 CN2 Example (VII) Seventh pass: Instance 3 2-best1: « Credit History = Unknown » (0-1-0), « Debt Level = Low » (0-1-0) Best_hypo: « Credit History = Unknown » Can’t do better than (0-1-0) Seventh rule: IF Credit History = Unknown THEN MODERATE

16 CN2 Example (VIII) Rules are ranked by the entropy of the instances they cover in the full training set, $-\sum_i p_i \log(p_i)$ (lower is better):
  Rule 1: (4-0-0), Rank 1
  Rule 2: (0-0-3), Rank 2
  Rule 3: (1-1-3), Rank 5
  Rule 4: (4-1-2), Rank 6
  Rule 5: (3-1-0), Rank 4
  Rule 6: (0-1-5), Rank 3
  Rule 7: (2-1-2), Rank 7
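The ranking can be checked numerically. The sketch below computes the base-2 entropy of each rule's coverage counts as listed on the slide and sorts the rules from lowest to highest entropy. Rules 1 and 2 both have entropy 0; breaking that tie by rule number is an assumption made here that happens to reproduce the slide's order.

```python
import math


def entropy(counts):
    """Base-2 entropy of a tuple of class counts (zero counts contribute nothing)."""
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c)


# Coverage counts (HIGH-MODERATE-LOW) of the seven rules over the full training set.
coverage = {
    1: (4, 0, 0), 2: (0, 0, 3), 3: (1, 1, 3), 4: (4, 1, 2),
    5: (3, 1, 0), 6: (0, 1, 5), 7: (2, 1, 2),
}

# Lower entropy ranks first; ties are broken by rule number (assumption).
ranked = sorted(coverage, key=lambda r: (entropy(coverage[r]), r))
for rank, r in enumerate(ranked, start=1):
    print(f"Rank {rank}: Rule {r}, entropy {entropy(coverage[r]):.3f}")
```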

17 CN2 Example (IX) IF Income Level = Low THEN HIGH IF Income Level = High AND Credit History = Good THEN LOW IF Income Level = High THEN LOW IF Credit History = Bad THEN MODERATE IF Credit History = Good THEN MODERATE IF Debt Level = High THEN HIGH IF Credit History = Unknown THEN MODERATE

18 Rule-based Learning First-order Version

19 Motivation (I)
Consider the following (MONK1) problem:
  6 attributes:
    A1, A2, A4: 1, 2, 3
    A3, A6: 1, 2
    A5: 1, 2, 3, 4
  2 classes: 0, 1
  Target concept: If (A1=A2 or A5=1) then Class 1
Decision tree representation?

20 Motivation (II) [Figure: decision tree over A1, A2 and A5 with leaves labeled 0 or 1]

21 Motivation (II) How about a rule list?
  If A1=1 and A2=1 then Class=1
  If A1=2 and A2=2 then Class=1
  If A1=3 and A2=3 then Class=1
  If A5=1 then Class=1
  Class=0
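To see that the propositional rule list really does encode the MONK1 target concept, though only by spelling out A1=A2 value by value, the following sketch enumerates every combination of attribute values and compares the rule list with the target. The dict-based encoding of instances and rules is an assumption of this example.

```python
from itertools import product

# Attribute domains as given on the MONK1 slide.
domains = {
    "A1": [1, 2, 3], "A2": [1, 2, 3], "A3": [1, 2],
    "A4": [1, 2, 3], "A5": [1, 2, 3, 4], "A6": [1, 2],
}


def target(x):
    return 1 if (x["A1"] == x["A2"] or x["A5"] == 1) else 0


# The rule list from the slide, tried in order, with Class=0 as the default.
rule_list = [
    ({"A1": 1, "A2": 1}, 1),
    ({"A1": 2, "A2": 2}, 1),
    ({"A1": 3, "A2": 3}, 1),
    ({"A5": 1}, 1),
]


def classify(x):
    for conds, label in rule_list:
        if all(x[a] == v for a, v in conds.items()):
            return label
    return 0


instances = [dict(zip(domains, values)) for values in product(*domains.values())]
agree = sum(classify(x) == target(x) for x in instances)
print(len(instances), agree)   # 432 432
```

The equality A1=A2 costs one rule per value of A1, which is the limitation that motivates a first-order language.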

22 First-order Language What is the problem? What we really want is a language of generalization that supports first-order concepts, so that relations between attributes may be accounted for in a natural way For simplicity, we restrict ourselves to Horn clauses

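23 Horn Clauses A literal is a predicate or its negation. A clause is any disjunction of literals whose variables are universally quantified. A Horn clause is an expression of the form: H <= L1 AND … AND Ln, where H and the Li are positive literals (equivalently, the clause H OR NOT L1 OR … OR NOT Ln)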

24 FOIL (I)
Algorithm FOIL( Target_predicate, Predicates, Examples )
  Pos = those Examples for which Target_predicate is true
  Neg = those Examples for which Target_predicate is false
  Learned_rules = {}
  While Pos <> {} Do
    New_rule = the rule that predicts Target_predicate with no precondition
    New_rule_neg = Neg
    While New_rule_neg <> {} Do
      Candidate_literals = GenCandidateLit( New_rule, Predicates )
      Best_literal = argmax over L in Candidate_literals of FoilGain( L, New_rule )
      Add Best_literal to New_rule's preconditions
      New_rule_neg = subset of New_rule_neg that satisfies New_rule's preconditions
    Learned_rules = Learned_rules + New_rule
    Pos = Pos - {members of Pos covered by New_rule }
  Return Learned_rules

25 FOIL (II)
Algorithm GenCandidateLit( Rule, Predicates )
  Let Rule be P( x1, …, xk ) <= L1, …, Ln
  Return all literals of the form:
    Q( v1, …, vr ) where Q is any predicate in Predicates and the vi's are either new variables or variables already present in Rule, with the constraint that at least one of the vi's must already exist as a variable in Rule
    Equal( xj, xk ) where xj and xk are variables already present in Rule
    The negation of all of the above forms of literals
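Candidate generation can be sketched as follows. This is a simplified illustration rather than FOIL's actual generator: literals are plain tuples, predicates are given as a hypothetical name-to-arity dict, no variable is repeated inside a literal, and the fresh variable names are assumed not to clash with variables already in the rule.

```python
from itertools import combinations, product


def gen_candidate_literals(rule_vars, predicates, fresh=("z", "w", "u")):
    """Candidate literals for a rule whose current variables are rule_vars.

    predicates maps a predicate name to its arity (hypothetical encoding);
    fresh lists names for new variables, which limits arity to 4 here.
    """
    positive = []
    for name, arity in predicates.items():
        pool = list(rule_vars) + list(fresh[:arity - 1])
        for args in product(pool, repeat=arity):
            new_vars = [a for a in args if a not in rule_vars]
            if len(new_vars) == arity:        # must reuse at least one existing variable
                continue
            if len(set(args)) < arity:        # assumption: no variable repeated in a literal
                continue
            if new_vars != list(fresh[:len(new_vars)]):  # skip mere renamings of fresh variables
                continue
            positive.append((name, *args))
    positive += [("Equal", a, b) for a, b in combinations(rule_vars, 2)]
    return positive + [("not", lit) for lit in positive]


lits = gen_candidate_literals(["x", "y"], {"Father": 2, "Female": 1})
for lit in lits:
    print(lit)
```

For a rule over x and y with a binary Father and a unary Female predicate, this yields nine positive literals and their nine negations.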

26 FOIL (III)
Algorithm FoilGain( L, Rule )
  Return $t \left( \log_2 \frac{p_1}{p_1 + n_1} - \log_2 \frac{p_0}{p_0 + n_0} \right)$
where
  $p_0$ is the number of positive bindings of Rule
  $n_0$ is the number of negative bindings of Rule
  $p_1$ is the number of positive bindings of Rule+L
  $n_1$ is the number of negative bindings of Rule+L
  $t$ is the number of positive bindings of Rule that are still covered after adding L to Rule
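As a small sketch, the gain can be written as a Python function using the standard FOIL gain formula, t (log2(p1/(p1+n1)) - log2(p0/(p0+n0))). Returning 0 when no positive binding survives is a convention assumed here, since the logarithm is undefined for p1 = 0. The three calls use the binding counts from the GrandDaughter illustration in this deck and reproduce the gains reported there.

```python
import math


def foil_gain(p0, n0, p1, n1, t):
    """FOIL gain of adding a literal L to a rule, from binding counts.

    p0, n0: positive/negative bindings of the rule before adding L
    p1, n1: positive/negative bindings of the rule after adding L
    t:      positive bindings of the rule still covered after adding L
    """
    if p1 == 0 or t == 0:
        return 0.0  # assumed convention: nothing positive survives, so no gain
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))


print(round(foil_gain(1, 15, 1, 11, 1), 3))   # 0.415
print(round(foil_gain(1, 11, 1, 1, 1), 3))    # 2.585
print(round(foil_gain(1, 1, 1, 0, 1), 3))     # 1.0
```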

27 Illustration (I) Consider the data: GrandDaughter(Victor, Sharon) Father(Sharon, Bob) Father(Tom, Bob) Female(Sharon) Father(Bob, Victor) Target concept: GrandDaughter( x, y ) Closed-world assumption

28 Illustration (II) Training set: Positive examples: GrandDaughter(Victor, Sharon) Negative examples: GrandDaughter(Victor, Victor) GrandDaughter(Victor, Bob) GrandDaughter(Victor, Tom) GrandDaughter(Sharon, Victor) GrandDaughter(Sharon, Sharon) GrandDaughter(Sharon, Bob) GrandDaughter(Sharon, Tom) GrandDaughter(Bob, Victor) GrandDaughter(Bob, Sharon) GrandDaughter(Bob, Bob) GrandDaughter(Bob, Tom) GrandDaughter(Tom, Victor) GrandDaughter(Tom, Sharon) GrandDaughter(Tom, Bob) GrandDaughter(Tom, Tom)

29 Illustration (III) Most general rule: GrandDaughter( x, y ) <=
Specializations:
  Father( x, y )
  Father( x, z )
  Father( y, x )
  Father( y, z )
  Father( z, y )
  Father( z, x )
  Female( x )
  Female( y )
  Equal( x, y )
  Negations of each of the above

30 Illustration (IV) Consider the 1st specialization: GrandDaughter( x, y ) <= Father( x, y )
16 possible bindings: x/Victor, y/Victor; x/Victor, y/Sharon; …; x/Tom, y/Tom
FoilGain:
  p0 = 1 (x/Victor, y/Sharon), n0 = 15
  p1 = 0, n1 = 3: (x/Sharon, y/Bob) (x/Tom, y/Bob) (x/Bob, y/Victor)
  t = 0
So that FoilGain(1st specialization) = 0

31 Illustration (V) Consider the 4th specialization: GrandDaughter( x, y ) <= Father( y, z )
64 possible bindings: x/Victor, y/Victor, z/Victor; x/Victor, y/Victor, z/Sharon; …; x/Tom, y/Tom, z/Tom
FoilGain:
  p0 = 1 (x/Victor, y/Sharon), n0 = 15
  p1 = 1 (x/Victor, y/Sharon, z/Bob)
  n1 = 11: (x/Victor, y/Bob, z/Victor) (x/Victor, y/Tom, z/Bob) (x/Sharon, y/Bob, z/Victor) (x/Sharon, y/Tom, z/Bob) (x/Bob, y/Tom, z/Bob) (x/Bob, y/Sharon, z/Bob) (x/Tom, y/Sharon, z/Bob) (x/Tom, y/Bob, z/Victor) (x/Sharon, y/Sharon, z/Bob) (x/Bob, y/Bob, z/Victor) (x/Tom, y/Tom, z/Bob)
  t = 1
So that FoilGain(4th specialization) = 0.415

32 Illustration (VI) Assume the 4th specialization is indeed selected. Partial rule: GrandDaughter( x, y ) <= Father( y, z ). Still covers 11 negative examples. New set of candidate literals: All of the previous ones, Female( z ), Equal( x, z ), Equal( y, z ), Father( z, w ), Father( w, z ), Negations of each of the above

33 Illustration (VII) Consider the specialization: GrandDaughter( x, y ) <= Father( y, z ), Equal( x, z )
64 possible bindings: x/Victor, y/Victor, z/Victor; x/Victor, y/Victor, z/Sharon; …; x/Tom, y/Tom, z/Tom
FoilGain:
  p0 = 1 (x/Victor, y/Sharon, z/Bob), n0 = 11
  p1 = 0, n1 = 3: (x/Victor, y/Bob, z/Victor) (x/Bob, y/Tom, z/Bob) (x/Bob, y/Sharon, z/Bob)
  t = 0
So that FoilGain(specialization) = 0

34 Illustration (VIII) Consider the specialization: GrandDaughter( x, y ) <= Father( y, z ), Father( z, x )
64 possible bindings: x/Victor, y/Victor, z/Victor; x/Victor, y/Victor, z/Sharon; …; x/Tom, y/Tom, z/Tom
FoilGain:
  p0 = 1 (x/Victor, y/Sharon, z/Bob), n0 = 11
  p1 = 1 (x/Victor, y/Sharon, z/Bob), n1 = 1 (x/Victor, y/Tom, z/Bob)
  t = 1
So that FoilGain(specialization) = 2.585

35 Illustration (IX) Assume that specialization is indeed selected Partial rule: GrandDaughter( x, y ) <= Father( y, z ), Father( z, x ) Still covers 1 negative example No new set of candidate literals Use all of the previous ones

36 Illustration (X) Consider the specialization: GrandDaughter( x, y ) <= Father( y, z ), Father( z, x ), Female( y )
64 possible bindings: x/Victor, y/Victor, z/Victor; x/Victor, y/Victor, z/Sharon; …; x/Tom, y/Tom, z/Tom
FoilGain:
  p0 = 1 (x/Victor, y/Sharon, z/Bob), n0 = 1
  p1 = 1 (x/Victor, y/Sharon, z/Bob), n1 = 0
  t = 1
So that FoilGain(specialization) = 1

37 Illustration (XI) No negative examples are covered and all positive examples are covered So, we get the final correct rule: GrandDaughter( x, y ) <= Father( y, z ), Father( z, x ), Female( y )
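The binding counts used throughout this illustration can be reproduced by brute-force enumeration. The sketch below encodes the five facts from the data slide, applies the closed-world assumption, and counts positive and negative bindings for a given rule body. The tuple encoding of literals and the function names are choices made for this example.

```python
from itertools import product

PEOPLE = ["Victor", "Sharon", "Bob", "Tom"]
FACTS = {
    ("Father", "Sharon", "Bob"), ("Father", "Tom", "Bob"),
    ("Father", "Bob", "Victor"), ("Female", "Sharon"),
}
# GrandDaughter(Victor, Sharon) is the only positive example; under the
# closed-world assumption every other (x, y) pair is negative.
POSITIVES = {("Victor", "Sharon")}


def holds(literal, env):
    """Check one body literal, e.g. ("Father", "y", "z"), under a variable binding."""
    pred, *args = literal
    values = tuple(env[a] for a in args)
    if pred == "Equal":
        return values[0] == values[1]
    return (pred, *values) in FACTS


def count_bindings(body, variables):
    """Return (positive, negative) counts of bindings that satisfy the rule body."""
    p = n = 0
    for values in product(PEOPLE, repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(holds(lit, env) for lit in body):
            if (env["x"], env["y"]) in POSITIVES:
                p += 1
            else:
                n += 1
    return p, n


body = []
print(count_bindings(body, ["x", "y"]))            # (1, 15)
for lit in [("Father", "y", "z"), ("Father", "z", "x"), ("Female", "y")]:
    body.append(lit)
    print(lit, count_bindings(body, ["x", "y", "z"]))
# ('Father', 'y', 'z') (1, 11)
# ('Father', 'z', 'x') (1, 1)
# ('Female', 'y') (1, 0)
```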

38 Recursive Predicates If the target predicate is included in the list Predicates, then FOIL can learn recursive definitions such as: Ancestor( x, y ) <= Parent( x, y ) Ancestor( x, y ) <= Parent( x, z ), Ancestor( z, y )
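To make the meaning of the recursive definition concrete, the sketch below evaluates the two Ancestor clauses over a set of Parent facts by iterating until nothing new can be derived. It shows how a learned recursive rule set would be applied, not how FOIL learns it, and the Parent facts are invented for the example.

```python
def ancestors(parent_facts):
    """Least fixed point of the two clauses:
         Ancestor(x, y) <= Parent(x, y)
         Ancestor(x, y) <= Parent(x, z), Ancestor(z, y)
    """
    ancestor = set(parent_facts)           # base clause
    changed = True
    while changed:                         # repeat the recursive clause until stable
        changed = False
        for (x, z) in parent_facts:
            for (z2, y) in list(ancestor):
                if z == z2 and (x, y) not in ancestor:
                    ancestor.add((x, y))
                    changed = True
    return ancestor


# Hypothetical facts: a chain of four individuals.
parent = {("Ann", "Bob"), ("Bob", "Cid"), ("Cid", "Dee")}
result = ancestors(parent)
print(sorted(result))
```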

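39 Exercise Consider learning the definition of directed acyclic graphs from the data: Edge(x, y): (six pairs, not recoverable from the transcript) Path(x, y): (ten pairs, not recoverable from the transcript) [Figure: graph over nodes 1 to 6]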

40 Going Further… What if the domain calls for richer structure and/or expressiveness? In principle, we can always flatten the representation, BUT: From a knowledge acquisition point of view, structure may be essential to induce good concepts. From a knowledge representation point of view, it seems desirable to be able to capture physical structures in the data with corresponding abstract structures in its representation.

41 Proposal Use highly-expressive representation language (based on higher-order logic) Sets, multisets, graphs, etc. Functions as well as predicates In principle, arbitrary data structures Functions/predicates as arguments Three algorithms Decision-tree learner Rule-based learner Strongly-typed evolutionary programming

42 Illustration (I) NIEHS’ DB of chemical compounds (337 registered at the time)
Two sets of descriptive features:
  Structural: atoms and bond connectives
  Non-structural: outcomes of laboratory analyses (e.g., Ashby alerts, Ames test results)
Information on carcinogenicity obtained by carrying out long-term bioassays
Using labeled compounds, build a model that:
  Correctly predicts the carcinogenicity of 23 new compounds that were, at the time, undergoing testing by the NTP
  Offers insight into the features that govern chemical carcinogenicity

43 Illustration (II) Knowledge representation
  Atom: (Label, Element, AtomType, Charge)
  Bond: ((Label, Label), BondType)
  Structure: ({Atom}, {Bond}) (i.e., a graph!)
  Non-structure: (F1, F2, …, Fn)
Target function: Carcinogenic: Molecule -> Boolean
Expected form: IF Cond THEN C1 ELSE C2
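One way to mirror this representation in Python is with typed tuples, as in the sketch below. The field types (integer labels, string elements, integer type codes) and the three-atom fragment are assumptions for illustration only. The small neighbours helper shows that the structure can be read directly as a graph.

```python
from typing import FrozenSet, NamedTuple, Tuple


class Atom(NamedTuple):
    label: int         # assumed to be an integer identifier
    element: str
    atom_type: int
    charge: float


class Bond(NamedTuple):
    ends: Tuple[int, int]   # labels of the two bonded atoms
    bond_type: int


class Structure(NamedTuple):
    atoms: FrozenSet[Atom]
    bonds: FrozenSet[Bond]


def neighbours(structure, label):
    """Labels of atoms bonded to the given atom: the structure read as a graph."""
    out = set()
    for (a, b), _ in structure.bonds:
        if a == label:
            out.add(b)
        elif b == label:
            out.add(a)
    return out


# A made-up three-atom fragment; every value here is purely illustrative.
mol = Structure(
    atoms=frozenset({Atom(1, "C", 22, -0.1), Atom(2, "C", 22, -0.1), Atom(3, "O", 45, -0.4)}),
    bonds=frozenset({Bond((1, 2), 7), Bond((2, 3), 1)}),
)
print(neighbours(mol, 2))   # {1, 3}
```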

44 Illustration (III) Conjecture: toxicological information only makes explicit properties that are already implicit in the molecular structure of chemicals => Use structural information only. Expected advantages: Faster, more economical predictions. Less reliance on laboratory animals. Potential for increased insight into the mechanistic paths and features that govern chemical toxicity, since the solutions produced are readily interpretable as chemical structures.

45 Illustration (IV)
carcinogenic(v1) =
  if ((((card (setfilter (\v3 -> ((proj2 v3) == O)) (proj5 v1))) < 5)
        && ((card (setfilter (\v5 -> ((proj2 v5) == 7)) (proj6 v1))) > 19))
       || exists \v4 -> ((elem v4 (proj6 v1)) && ((proj2 v4) == 3)))
     || (exists \v2 -> ((elem v2 (proj5 v1))
           && ((((((proj3 v2) == 42) || ((proj3 v2) == 8)) || ((proj2 v2) == I)) || ((proj2 v2) == F))
               || ((((proj4 v2) within (-0.812,-0.248)) && ((proj4 v2) > -0.316))
                   || (((proj3 v2) == 51) || (((proj3 v2) == 93) && ((proj4 v2) < -0.316))))))
         && ((card (setfilter (\v5 -> ((proj2 v5) == 7))(proj6 v1))) < 15))
  then Inactive else Active;

A molecule is Inactive if it contains fewer than 5 oxygen atoms and has more than 19 aromatic bonds, or it contains a triple bond, or it has fewer than 15 aromatic bonds and contains an atom that is of type 8, 42 or 51, or is an iodine or a fluorine atom, or is of type 93 with a partial charge less than -0.316, or has a partial charge between -0.316 and -0.248. Otherwise, it is Active.
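For readers more comfortable with Python, the learned condition can be transcribed roughly as follows. This is a hand translation under stated assumptions: atoms are (label, element, atom_type, charge) tuples and bonds are ((label, label), bond_type) tuples, bond type 7 is read as aromatic and 3 as triple (as the English description suggests), and `within` is taken to be an open interval. The two test molecules are invented.

```python
def is_inactive(atoms, bonds):
    """Hand transcription of the learned rule; returns True for Inactive."""
    oxygens = sum(1 for a in atoms if a[1] == "O")
    aromatic = sum(1 for b in bonds if b[1] == 7)      # bond type 7: aromatic (assumed reading)
    has_triple = any(b[1] == 3 for b in bonds)         # bond type 3: triple (assumed reading)

    def flagged(atom):
        _, element, atom_type, charge = atom
        return (
            atom_type in (42, 8)
            or element in ("I", "F")
            # 'within' is treated as an open interval here (assumption)
            or (-0.812 < charge < -0.248 and charge > -0.316)
            or atom_type == 51
            or (atom_type == 93 and charge < -0.316)
        )

    return (
        (oxygens < 5 and aromatic > 19)
        or has_triple
        or (any(flagged(a) for a in atoms) and aromatic < 15)
    )


# Invented molecules, for illustration only.
plain = ({(1, "C", 22, 0.0), (2, "C", 22, 0.0)}, {((1, 2), 1)})
fluorinated = ({(1, "C", 22, 0.0), (2, "F", 92, 0.0)}, {((1, 2), 1)})
print(is_inactive(*plain))         # False, i.e. Active
print(is_inactive(*fluorinated))   # True, i.e. Inactive
```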

46 Illustration (IV) Best solution found with the structural-information-only configuration, with C1 = Inactive: Accuracy: 78% (rank: joint 2nd of 10). Insightfulness: inconclusive (some « useful » bits and some « noise »). When comprehensibility is sacrificed: Accuracy: 87% (rank: joint 1st of 10).

