Learning Sets of Rules

Learning Sets of Rules

Content
  Introduction
  Sequential Covering Algorithms
  Learning Rule Sets: Summary
  Learning First-Order Rules
  Learning Sets of First-Order Rules: FOIL
  Summary

Introduction
If-Then rules are very expressive and easy to understand.
Rules with variables are Horn clauses; a set of Horn clauses builds up a PROLOG program.
Learning of Horn clauses: Inductive Logic Programming (ILP).
Example first-order rule set for the target concept Ancestor:
  IF Parent(x,y) THEN Ancestor(x,y)
  IF Parent(x,z) ∧ Ancestor(z,y) THEN Ancestor(x,y)

Introduction 2
GOAL: learn a target function as a set of IF-THEN rules.
BEFORE: learning with decision trees: learn the decision tree, then translate the tree into a set of IF-THEN rules (one rule for each leaf).
OTHER POSSIBILITY: learning with genetic algorithms: each set of rules is coded as a bit vector, and several genetic operators are used on the hypothesis space.
TODAY AND HERE:
  First: learning rules in propositional form.
  Second: learning rules in first-order form (Horn clauses, which include variables).
  Rules are searched for sequentially, one after the other.

Content
  Introduction
  Sequential Covering Algorithms
    General to Specific Beam Search
    Variations
  Learning Rule Sets: Summary
  Learning First-Order Rules
  Learning Sets of First-Order Rules: FOIL
  Summary

Sequential Covering Algorithms
Goal of such an algorithm: learn a disjunctive set of rules that together classify the training data as well as possible.
Principle: learn rule sets based on the strategy of learning one rule, removing the examples it covers, then iterating this process.
Requirements for the Learn-One-Rule method:
  As input, it accepts a set of positive and negative training examples.
  As output, it delivers a single rule that covers many of the positive examples and perhaps a few of the negative examples.
  The output rule must have high accuracy, but not necessarily high coverage.

Sequential Covering Algorithms 2
Procedure: the rule-set learner invokes the Learn-One-Rule method on all of the available training examples, then removes every positive example covered by the learned rule and iterates.
Finally, the learned rules can be sorted so that more accurate rules are considered first.
This is a greedy search: it is not guaranteed to find the smallest or best set of rules that covers the training examples.

Sequential Covering Algorithms 3

SequentialCovering( target_attribute, attributes, examples, threshold )
  learned_rules ← { }
  rule ← LearnOneRule( target_attribute, attributes, examples )
  while Performance( rule, examples ) > threshold do
    learned_rules ← learned_rules + rule
    examples ← examples - { examples correctly classified by rule }
    rule ← LearnOneRule( target_attribute, attributes, examples )
  learned_rules ← sort learned_rules according to Performance over examples
  return learned_rules
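For concreteness, a minimal Python sketch of this loop follows; learn_one_rule, performance and the rule's covers_correctly method are hypothetical stand-ins for whatever rule representation and learner are used.

def sequential_covering(target_attribute, attributes, examples,
                        learn_one_rule, performance, threshold):
    """Greedy sequential covering: learn one rule, drop the examples it
    classifies correctly, and repeat while new rules still perform well."""
    learned_rules = []
    rule = learn_one_rule(target_attribute, attributes, examples)
    while performance(rule, examples) > threshold:
        learned_rules.append(rule)
        # Remove every example the new rule already classifies correctly.
        examples = [e for e in examples if not rule.covers_correctly(e)]
        rule = learn_one_rule(target_attribute, attributes, examples)
    # More accurate rules should be considered first at classification time.
    learned_rules.sort(key=lambda r: performance(r, examples), reverse=True)
    return learned_rules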

General to Specific Beam Search
Specialising search: organises the hypothesis space search in the same general fashion as ID3, but follows only the most promising branch of the tree at each step.
Begin with the most general rule (an empty precondition).
Follow the most promising branch: greedily add the attribute test that most improves the measured performance of the rule over the training examples.
This is a greedy depth-first search with no backtracking, so there is a danger of sub-optimal choices.
To reduce this risk, use a beam search (the CN2 algorithm): the algorithm maintains a list of the k best candidates; in each search step, descendants are generated for each of these k best candidates, and the resulting set is then reduced to the k most promising members.

General to Specific Beam Search 2 (figure: learning with a decision tree)

General to Specific Beam Search 3

General to Specific Beam Search 4: The CN2 Algorithm

LearnOneRule( target_attribute, attributes, examples, k )
  Initialise best_hypothesis to the most general hypothesis Ø
  Initialise candidate_hypotheses to the set { best_hypothesis }
  while candidate_hypotheses is not empty do
    1. Generate the next more-specific candidate_hypotheses
    2. Update best_hypothesis
    3. Update candidate_hypotheses
  return a rule of the form "IF best_hypothesis THEN prediction",
    where prediction is the most frequent value of target_attribute among those examples that match best_hypothesis

Performance( h, examples, target_attribute )
  h_examples ← the subset of examples that match h
  return -Entropy( h_examples ), where Entropy is computed with respect to target_attribute
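A compact Python sketch of this beam search is given below; hypotheses are assumed to be frozensets of (attribute, value) constraints, and generate_specialisations, performance and is_significant are hypothetical helpers corresponding to the three steps detailed on the next slides.

from collections import Counter

def learn_one_rule(target_attribute, attributes, examples, k,
                   generate_specialisations, performance, is_significant):
    """Beam search of width k, from the most general hypothesis (no
    constraints) towards progressively more specific ones."""
    best_hypothesis = frozenset()                 # the most general hypothesis
    candidate_hypotheses = [best_hypothesis]
    while candidate_hypotheses:
        # 1. Generate the next, more specific candidate hypotheses.
        new_candidates = list(generate_specialisations(candidate_hypotheses,
                                                       attributes, examples))
        # 2. Update best_hypothesis.
        for h in new_candidates:
            if (is_significant(h, examples)
                    and performance(h, examples, target_attribute)
                        > performance(best_hypothesis, examples, target_attribute)):
                best_hypothesis = h
        # 3. Keep only the k most promising candidates (the beam).
        new_candidates.sort(key=lambda h: performance(h, examples, target_attribute),
                            reverse=True)
        candidate_hypotheses = new_candidates[:k]
    # The rule predicts the most frequent target value among matching examples.
    matching = [e for e in examples
                if all(e[a] == v for a, v in best_hypothesis)]
    prediction = Counter(e[target_attribute] for e in matching).most_common(1)[0][0]
    return best_hypothesis, prediction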

General to Specific Beam Search 5

Generate the next more-specific candidate_hypotheses:
  all_constraints ← the set of all constraints (a = v), where a ∈ attributes and v is a value of a occurring in the current set of examples
  new_candidate_hypotheses ← for each h in candidate_hypotheses and each c in all_constraints, create a specialisation of h by adding the constraint c
  Remove from new_candidate_hypotheses any hypotheses that are duplicates, inconsistent, or not maximally specific

Update best_hypothesis:
  for all h in new_candidate_hypotheses do
    if ( h is statistically significant when tested on examples
         and Performance( h, examples, target_attribute ) > Performance( best_hypothesis, examples, target_attribute ) )
    then best_hypothesis ← h

General to Specific Beam Search 6

Update candidate_hypotheses:
  candidate_hypotheses ← the k best members of new_candidate_hypotheses, according to the Performance function

The Performance function guides the search in Learn-One-Rule. It is the negative entropy of the covered examples:
  Entropy(s) = - Σ_{i=1}^{c} p_i log2 p_i,   Performance = -Entropy(s)
where s is the current set of training examples, c is the number of possible values of the target attribute, and p_i is the proportion of examples in s classified with the i-th value.
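A sketch of this performance measure in Python, using the same frozenset-of-constraints hypotheses and dict-valued examples as in the beam-search sketch above:

import math
from collections import Counter

def performance(h, examples, target_attribute):
    """Negative entropy of target_attribute over the examples that match
    hypothesis h; higher is better, 0 means the covered set is pure."""
    h_examples = [e for e in examples if all(e[a] == v for a, v in h)]
    if not h_examples:
        return float("-inf")       # a hypothesis matching nothing is useless
    n = len(h_examples)
    counts = Counter(e[target_attribute] for e in h_examples)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return -entropy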

Example for the CN2 Algorithm

LearnOneRule(EnjoySport, {Sky, AirTemp, Humidity, Wind, Water, Forecast, EnjoySport}, examples, 2)
  best_hypothesis = Ø
  candidate_hypotheses = { Ø }
  all_constraints = { Sky=Sunny, Sky=Rainy, AirTemp=Warm, AirTemp=Cold, Humidity=Normal, Humidity=High, Wind=Strong, Water=Warm, Water=Cool, Forecast=Same, Forecast=Change }
  Performance = nc / n, where n is the number of examples covered by the rule and nc is the number of covered examples that are classified correctly.
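The nc / n measure of this example fits the same interface; a small sketch, with target_value = "Yes" as an assumed default for EnjoySport:

def accuracy_performance(h, examples, target_attribute, target_value="Yes"):
    """nc / n: the fraction of examples covered by hypothesis h whose
    classification equals the value the rule predicts."""
    covered = [e for e in examples if all(e[a] == v for a, v in h)]
    if not covered:
        return 0.0
    nc = sum(1 for e in covered if e[target_attribute] == target_value)
    return nc / len(covered)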

Example for the CN2 Algorithm (2): Pass 1
The Remove step eliminates nothing.
candidate_hypotheses = { Sky=Sunny, AirTemp=Warm }
best_hypothesis is Sky=Sunny

Example for the CN2 Algorithm (3): Pass 2
The Remove step eliminates hypotheses that are duplicates, inconsistent, or not maximally specific.
candidate_hypotheses = { Sky=Sunny AND AirTemp=Warm, Sky=Sunny AND Humidity=High }
best_hypothesis remains Sky=Sunny

Content
  Introduction
  Sequential Covering Algorithms
  Learning Rule Sets: Summary
  Learning First-Order Rules
  Learning Sets of First-Order Rules: FOIL
  Summary

Learning Rule Sets: Summary
A key dimension in the design of a rule learning algorithm is the covering strategy.
Here, sequential covering: learn one rule, remove the positive examples it covers, and iterate on the remaining examples. ID3, by contrast, performs simultaneous covering. Which one should be preferred?
The key difference is the choice made at the most primitive step in the search: ID3 chooses among attributes by comparing the partitions of the data they generate, whereas CN2 chooses among attribute-value pairs by comparing the subsets of data they cover.
Number of choices: to learn n rules, each containing k attribute-value tests in its preconditions, CN2 performs n*k primitive search steps, while ID3 makes fewer independent search steps.
If data is plentiful, it may support the larger number of independent decisions; if data is scarce, sharing decisions about the preconditions of different rules may be more effective.

Learning Rule Sets: Summary 2
CN2 searches general-to-specific (cf. Find-S, which searches specific-to-general). Advantage: there is a single maximally general hypothesis from which to begin the search, whereas there are many maximally specific ones.
GOLEM chooses several positive examples at random to initialise and guide the search; the best hypothesis obtained over these multiple random choices is selected.
CN2 is a generate-then-test method, whereas Find-S, Candidate-Elimination, AQ and Cigol are example-driven. Advantage of generate-then-test: each choice in the search is based on the hypothesis performance over many examples.
A further design dimension: how are the rules post-pruned?

Content
  Introduction
  Sequential Covering Algorithms
  Learning Rule Sets: Summary
  Learning First-Order Rules
    First-Order Horn Clauses
    Terminology
  Learning Sets of First-Order Rules: FOIL
  Summary

Learning First-Order Rules
Previously: algorithms for learning sets of propositional rules.
Now: extension of these algorithms to learning first-order rules, in particular inductive logic programming (ILP), the learning of first-order rules or theories.
ILP commonly uses PROLOG, a general-purpose, Turing-equivalent programming language in which programs are expressed as collections of Horn clauses.

First-Order Horn Clauses
Advantage of the representation: consider learning the target concept Daughter(x,y). A propositional learner such as C4.5 or CN2 can only produce rules over fixed attribute values, whereas an ILP learner can produce the general rule IF Father(y,x) ∧ Female(y) THEN Daughter(x,y).
Problem: propositional representations offer no general way to describe the essential relations among the values of the attributes.
In first-order Horn clauses, variables that do not occur in the postcondition may occur in the precondition.
It is also possible to use the same predicates in the rule postconditions and preconditions (recursive rules).

Terminology
Every well-formed expression is composed of constants, variables, predicates and functions.
A term is any constant, any variable, or any function applied to any term.
A literal is any predicate (or its negation) applied to any set of terms.
A ground literal is a literal that does not contain any variables.
A negative literal is a literal containing a negated predicate; a positive literal is a literal with no negation sign.
A clause is any disjunction of literals M1 ∨ ... ∨ Mn whose variables are universally quantified.
A Horn clause is an expression of the form H ← (L1 ∧ L2 ∧ ... ∧ Ln), where H, L1, ..., Ln are positive literals. H is called the head, postcondition or consequent of the Horn clause; the conjunction of literals L1 ∧ ... ∧ Ln is called the body, preconditions or antecedents of the Horn clause.

Terminology 2
For any literals A and B, the expression (A ← B) is equivalent to (A ∨ ¬B), and the expression ¬(A ∧ B) is equivalent to (¬A ∨ ¬B). A Horn clause can therefore equivalently be written as the disjunction H ∨ ¬L1 ∨ ... ∨ ¬Ln.
A substitution is any function that replaces variables by terms. For example, the substitution θ = {x/3, y/z} replaces the variable x by the term 3 and replaces the variable y by the term z. Given a substitution θ and a literal L, Lθ denotes the result of applying the substitution θ to L.
A unifying substitution for two literals L1 and L2 is any substitution θ such that L1θ = L2θ.
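The following Python sketch makes substitution and unification concrete for function-free literals represented as tuples (predicate, term1, ..., termN); the lowercase-variables convention mirrors the slides and is an assumption of this sketch.

def is_variable(term):
    # Assumed convention: variables are lowercase names (x, y, z),
    # constants are capitalised names (Bob, Sharon).
    return isinstance(term, str) and term[:1].islower()

def resolve(term, theta):
    # Follow chains of bindings, e.g. {"x": "y", "y": "Bob"} resolves x to Bob.
    while is_variable(term) and term in theta:
        term = theta[term]
    return term

def apply_substitution(literal, theta):
    """Compute L-theta: apply the substitution theta (a dict variable -> term)
    to a literal given as a tuple (predicate, term1, ..., termN)."""
    predicate, *terms = literal
    return (predicate, *[resolve(t, theta) for t in terms])

def unify(l1, l2):
    """Return a unifying substitution theta with l1*theta == l2*theta, or
    None if the two function-free literals do not unify."""
    if l1[0] != l2[0] or len(l1) != len(l2):   # different predicate or arity
        return None
    theta = {}
    for t1, t2 in zip(l1[1:], l2[1:]):
        t1, t2 = resolve(t1, theta), resolve(t2, theta)
        if t1 == t2:
            continue
        if is_variable(t1):
            theta[t1] = t2
        elif is_variable(t2):
            theta[t2] = t1
        else:
            return None                        # two distinct constants clash
    return theta

For example, unify(("GrandDaughter", "x", "y"), ("GrandDaughter", "Victor", "Sharon")) returns {"x": "Victor", "y": "Sharon"}, and applying that substitution to the first literal reproduces the second.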

Content
  Introduction
  Sequential Covering Algorithms
  Learning Rule Sets: Summary
  Learning First-Order Rules
  Learning Sets of First-Order Rules: FOIL
    Generating Candidate Specialisations in FOIL
    Guiding the Search in FOIL
    Learning Recursive Rule Sets
    Summary of FOIL
  Summary

Learning Sets of First-Order Rules: FOIL

FOIL( target_predicate, predicates, examples )
  pos ← those examples for which target_predicate is true
  neg ← those examples for which target_predicate is false
  learned_rules ← { }
  while pos is not empty do
    /* learn a new rule */
    new_rule ← the rule that predicts target_predicate with no preconditions
    new_rule_neg ← neg
    while new_rule_neg is not empty do
      /* add a new literal to specialise new_rule */
      candidate_literals ← generate candidate new literals for new_rule, based on predicates
      best_literal ← argmax over L in candidate_literals of FoilGain( L, new_rule )
      add best_literal to the preconditions of new_rule
      new_rule_neg ← the subset of new_rule_neg that satisfies new_rule's preconditions
    learned_rules ← learned_rules + new_rule
    pos ← pos - { members of pos covered by new_rule }
  return learned_rules
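A Python skeleton of these two loops is sketched below; the first-order machinery (candidate generation, gain computation and the bindings-based coverage test) is left to hypothetical helpers supplied by the caller.

def foil(target_predicate, predicates, pos, neg,
         generate_candidate_literals, foil_gain, covers):
    """FOIL skeleton: the outer loop adds rules until every positive example
    is covered; the inner loop adds literals to the current rule until no
    negative example satisfies its preconditions."""
    learned_rules = []
    pos = list(pos)
    while pos:
        head, body = target_predicate, []           # start with no preconditions
        new_rule_neg = list(neg)
        while new_rule_neg:
            candidates = generate_candidate_literals((head, body), predicates)
            best = max(candidates,
                       key=lambda L: foil_gain(L, (head, body), pos, new_rule_neg))
            body.append(best)
            new_rule_neg = [e for e in new_rule_neg if covers((head, body), e)]
        learned_rules.append((head, list(body)))
        pos = [e for e in pos if not covers((head, body), e)]
    return learned_rules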

Learning Sets of First-Order Rules: FOIL 2
External loop: specific-to-general.
The set of rules is generalised by adding a new rule to the existing hypothesis; each new rule generalises the current hypothesis and increases the number of instances classified as positive.
The loop starts with the most specific hypothesis, the empty set of rules, which classifies no instance as a positive example of the target concept.

Learning Sets of First-Order Rules: FOIL 3 (the FOIL algorithm repeated; its internal loop is examined next)

Learning Sets of First-Order Rules: FOIL 4
Internal loop: general-to-specific. The current rule is specialised by generating new literals.
Current rule: P(x1, x2, ..., xk) ← L1 ∧ ... ∧ Ln
New candidate literals take one of the following forms:
  Q(v1, ..., vr), where Q is a predicate name occurring in predicates (Q ∈ predicates) and the vi are either new variables or variables already present in the rule; at least one of the vi in the created literal must already exist as a variable in the rule.
  Equal(xj, xk), where xj and xk are variables already present in the rule.
  The negation of either of the above forms of literals.

Learning Sets of First-Order Rules: FOIL 5
Example: predict GrandDaughter(x,y); the other available predicates are Father and Female.
FOIL begins with the most general rule: GrandDaughter(x,y) ←
Specialisation: the following literals are generated as candidates: Equal(x,y), Female(x), Female(y), Father(x,y), Father(y,x), Father(x,z), Father(z,x), Father(y,z), Father(z,y), plus the negation of each of these literals.
Suppose FOIL greedily selects Father(y,z): GrandDaughter(x,y) ← Father(y,z)
The candidate literals for further specialising the rule are all literals generated in the previous step, plus Female(z), Equal(z,x), Equal(z,y), Father(z,w), Father(w,z), and their negations.
Suppose FOIL greedily selects Father(z,x) at this point, and Female(x) on the next iteration:
GrandDaughter(x,y) ← Father(y,z) ∧ Father(z,x) ∧ Female(x)
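As an illustration of the candidate generation used in this example, here is a rough Python sketch; it enumerates Q(v1, ..., vr) over existing and fresh variables, plus Equal tests and negations, and ignores the further pruning FOIL applies (duplicates, typing of arguments).

from itertools import product

def generate_candidate_literals(rule_variables, predicates):
    """Candidate literals for specialising a rule. 'predicates' maps each
    predicate name to its arity; every candidate Q(v1, ..., vr) must reuse
    at least one variable already present in the rule."""
    rule_variables = list(rule_variables)
    fresh = [f"z{i}" for i in range(max(predicates.values()))]   # new variables
    pool = rule_variables + fresh
    candidates = []
    for name, arity in predicates.items():
        for args in product(pool, repeat=arity):
            if any(a in rule_variables for a in args):
                candidates.append((name, *args))
    for i in range(len(rule_variables)):                          # Equal(xj, xk)
        for j in range(i + 1, len(rule_variables)):
            candidates.append(("Equal", rule_variables[i], rule_variables[j]))
    candidates += [("not", lit) for lit in candidates]            # negations
    return candidates

With rule_variables = ["x", "y"] and predicates = {"Father": 2, "Female": 1}, this yields the Female, Father and Equal candidates listed above (plus a few extra symmetric variants using the fresh variables) together with their negations.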

Guiding the Search in FOIL
To select the most promising literal, FOIL considers the performance of the rule over the training data.
It considers all possible bindings of each variable in the current rule for each candidate literal.
As new literals are added to the rule, the set of bindings changes; if a literal that introduces a new variable is added, the bindings for the rule grow in length.
If the new variable can be bound to several different constants, the number of bindings satisfying the extended rule can be greater than the number associated with the original rule.
The evaluation function FOIL uses to estimate the utility of adding a new literal (FoilGain) is based on the numbers of positive and negative bindings before and after adding the new literal.

Guiding the Search in FOIL 2: FoilGain
Consider some rule R and a candidate literal L that might be added to the body of R, yielding the rule R'. The information gained by adding L is
  FoilGain(L, R) = t * ( log2( p1 / (p1 + n1) ) - log2( p0 / (p0 + n0) ) )
where
  p0: the number of positive bindings of R
  n0: the number of negative bindings of R
  p1: the number of positive bindings of R'
  n1: the number of negative bindings of R'
  t: the number of positive bindings of R that are still covered after adding literal L to R
A high value means a high information gain.
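The formula translates directly into Python:

import math

def foil_gain(p0, n0, p1, n1, t):
    """FoilGain(L, R) = t * ( log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)) ),
    where p0/n0 count the positive/negative bindings of R, p1/n1 those of
    the extended rule R', and t the positive bindings of R still covered."""
    if p0 == 0 or p1 == 0:
        return float("-inf")     # a rule with no positive bindings is useless
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

For example, foil_gain(p0=10, n0=20, p1=8, n1=2, t=8) is roughly 10.1: the candidate literal gives up two positive bindings but makes the rule much purer. In the FOIL skeleton above, a wrapper would first count the bindings of the current and extended rule and then call this function.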

Summary of FOIL
FOIL extends CN2 to learn first-order rules similar to Horn clauses.
During learning, FOIL performs a general-to-specific search, at each step adding a single new literal to the rule preconditions.
At each step, FoilGain is used to select among the candidate literals.
If new literals are allowed to refer to the target predicate, then FOIL can, in principle, learn sets of recursive rules.
To handle noisy data, the search is continued until some trade-off occurs between rule accuracy, coverage and complexity. FOIL uses a minimum description length approach to limit the growth of rules: new literals are added only if their description length is shorter than the description length of the training data they explain.

Summary
Sequential covering learns a disjunctive set of rules by first learning a single accurate rule, then removing the positive examples covered by this rule and iterating the process over the remaining training examples.
There is a variety of such methods, which differ in their search strategy.
CN2 performs a general-to-specific beam search, generating and testing progressively more specific rules until a sufficiently accurate rule is found.
Sets of first-order rules provide a highly expressive representation; FOIL learns them.