Decision-Tree Induction & Decision-Rule Induction



Presentation on theme: "Decision-Tree Induction & Decision-Rule Induction"— Presentation transcript:

1 Decision-Tree Induction & Decision-Rule Induction
Evgueni Smirnov

2 Overview
Instances, Concepts, Languages
Types of Tasks
Decision Trees
Decision Rules
Evaluation Techniques
Intro to Weka

3 Instances, Concepts, Languages
A concept is a set of objects in a world that are unified by a reason. A reason may be a similar appearance, structure or function. friendly robots Suppose that we have a world of robots. Any concept in this world is a set of robots that are unified by some reason. For example, these three robots represent the concept "friendly robots". In order to study concepts we have to represent robots in an instance language. For example, the yellow robot is represented with one description, called an instance: "head = square, body = round, smiling = yes, holding = flag, color = yellow". In the same manner we describe the other robots. In this way the concept "friendly robots" is represented with three descriptions, one per robot. These descriptions form the extensional representation of the concept in the instance language, and we can study the concept using this representation. The problem with the extensional representation of any concept is that it can be large. That is why we introduce a second language, a language of concepts, in which any concept is represented with an intensional description. To make this possible we introduce a membership relation that determines the relation between instance and concept descriptions. In this way the intensional description of the target concept "friendly robots" is "smiling = yes". Example: the set {children, photos, cat, diplomas} can be viewed as a concept "Most important things to take out of your apartment when it catches fire".

4 Instances, Concepts, Languages
Instance language Li: head = square, body = round, smiling = yes, holding = flag, color = yellow (an instance of the concept "friendly robots").

5 Instances, Concepts, Languages
Concept language Lc: smiling = yes → friendly robots. Instance language Li: head = square, body = round, smiling = yes, holding = flag, color = yellow. M is the membership relation between Li and Lc.

6 The Concept Learning Task
<Li, Lc, M, <I+, I->> After introducing the main elements of the concept learning task, I define the concept learning task itself. Given an instance language Li, a concept language Lc, a membership relation M between the instance and concept languages, a set I+ of positive instances and a set I- of negative instances of the target concept, the goal of the task is to find the descriptions of the target concept in the concept language.

7 The Clustering Task <Li, Lc, M, I> Lc M Li

8 The Association Learning Task
<Li, Lc, M, I> Lc M Li

9 Decision Trees
Decision trees
Appropriate problems for decision trees
Entropy and Information Gain
The ID3 algorithm
Avoiding Overfitting via Pruning
Handling Continuous-Valued Attributes
Handling Missing Attribute Values

10 Decision Trees Definition: A decision tree is a tree s.t.:
Each internal node tests an attribute. Each branch corresponds to an attribute value. Each leaf node assigns a classification.
Example tree: Outlook = Sunny → test Humidity (High → no, Normal → yes); Outlook = Overcast → yes; Outlook = Rainy → test Windy (False → yes, True → no).
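As a rough sketch of how such a tree can be represented and used to classify an instance (the Node class below is illustrative and not part of the slides; the tree itself is the PlayTennis tree shown on slide 12):

```python
class Node:
    """Minimal decision-tree node: internal nodes test an attribute, leaves carry a class label."""
    def __init__(self, attribute=None, label=None):
        self.attribute = attribute   # attribute tested at an internal node (None for a leaf)
        self.children = {}           # attribute value -> child Node
        self.label = label           # classification assigned at a leaf (None for internal nodes)

    def classify(self, instance):
        """Follow the branch matching the instance's value of the tested attribute until a leaf."""
        if self.label is not None:
            return self.label
        return self.children[instance[self.attribute]].classify(instance)

# The PlayTennis tree from the slides, built by hand:
tree = Node("Outlook")
tree.children["Sunny"] = Node("Humidity")
tree.children["Sunny"].children = {"High": Node(label="no"), "Normal": Node(label="yes")}
tree.children["Overcast"] = Node(label="yes")
tree.children["Rainy"] = Node("Windy")
tree.children["Rainy"].children = {"False": Node(label="yes"), "True": Node(label="no")}

print(tree.classify({"Outlook": "Rainy", "Windy": "True"}))   # -> no
```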

11 Data Set for Playing Tennis
Outlook   Temp. Humidity Windy Play
Sunny     Hot   High     False no
Sunny     Hot   High     True  no
Overcast  Hot   High     False yes
Rainy     Mild  High     False yes
Rainy     Cool  Normal   False yes
Rainy     Cool  Normal   True  no
Overcast  Cool  Normal   True  yes
Sunny     Mild  High     False no
Sunny     Cool  Normal   False yes
Rainy     Mild  Normal   False yes
Sunny     Mild  Normal   True  yes
Overcast  Mild  High     True  yes
Overcast  Hot   Normal   False yes
Rainy     Mild  High     True  no

12 Decision Tree For Playing Tennis
Outlook = Sunny → Humidity (High → no, Normal → yes); Outlook = Overcast → yes; Outlook = Rainy → Windy (False → yes, True → no).

13 When to Consider Decision Trees
Instances are described by attributes with discrete values (e.g. Outlook = Sunny). The classification is over discrete values (e.g. yes/no). It is okay to need disjunctive descriptions: each path in the tree is a conjunction of attribute tests, and the tree as a whole is a disjunction of these conjunctions, so any Boolean function can be represented! It is okay for the training data to contain errors: decision trees are robust to classification errors in the training data. It is okay for the training data to contain missing values: decision trees can be used even if instances have missing attributes.

14 Rules in Decision Trees
For the PlayTennis tree above (Outlook = Sunny → Humidity; Outlook = Overcast → yes; Outlook = Rainy → Windy):
If Outlook = Sunny & Humidity = High then Play = no
If Outlook = Sunny & Humidity = Normal then Play = yes
If Outlook = Overcast then Play = yes
If Outlook = Rainy & Windy = False then Play = yes
If Outlook = Rainy & Windy = True then Play = no

15 Decision Tree Induction
Basic Algorithm:
1. A ← the "best" decision attribute for a node N.
2. Assign A as the decision attribute for node N.
3. For each value of A, create a new descendant of node N.
4. Sort the training examples to the leaf nodes.
5. IF the training examples are perfectly classified, THEN STOP; ELSE iterate over the new leaf nodes.

16 Decision Tree Induction
Splitting on Outlook partitions the training examples:
Outlook = Sunny:
Outlook Temp Humidity Windy Play
Sunny Hot High False no
Sunny Hot High True no
Sunny Mild High False no
Sunny Cool Normal False yes
Sunny Mild Normal True yes
Outlook = Overcast:
Overcast Hot High False yes
Overcast Cool Normal True yes
Outlook = Rainy:
Rainy Mild High False yes
Rainy Cool Normal False yes
Rainy Cool Normal True no
Rainy Mild Normal False yes
Rainy Mild High True no

17 Entropy Let S be a sample of training examples, and
p+ is the proportion of positive examples in S and p- is the proportion of negative examples in S. Then entropy measures the impurity of S:
E(S) = - p+ log2 p+ - p- log2 p-
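A minimal Python sketch of the entropy formula (the function name and the use of a plain list of class labels are my own choices, not from the slides):

```python
from math import log2

def entropy(labels):
    """E(S) = - sum over classes c of p_c * log2(p_c), for the class proportions in `labels`."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels))

# 9 positive and 5 negative examples, as in the PlayTennis data set:
print(round(entropy(["yes"] * 9 + ["no"] * 5), 3))   # -> 0.94
```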

18 Entropy Example from the Dataset
The 14-instance PlayTennis data set (slide 11) contains 9 positive (yes) and 5 negative (no) examples, so
E(S) = - (9/14) log2 (9/14) - (5/14) log2 (5/14) ≈ 0.94.

19 Information Gain
Information Gain is the expected reduction in entropy caused by partitioning the instances according to a given attribute:
Gain(S, A) = E(S) - Σv∈Values(A) (|Sv| / |S|) E(Sv), where Sv = { s ∈ S | A(s) = v }.
For a two-valued attribute, S is partitioned into Sv1 = { s ∈ S | A(s) = v1 } and Sv2 = { s ∈ S | A(s) = v2 }.
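A sketch of the gain computation in the same style, assuming each example is a dict of attribute values with a "Play" target (these representation choices are mine, not the slides'):

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels))

def information_gain(examples, attribute, target="Play"):
    """Gain(S, A) = E(S) - sum over values v of |Sv|/|S| * E(Sv)."""
    labels = [e[target] for e in examples]
    gain = entropy(labels)
    for v in set(e[attribute] for e in examples):
        subset = [e[target] for e in examples if e[attribute] == v]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain
```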

20 Example Which attribute should be tested here?
Outlook = Sunny:
Outlook Temp Humidity Windy Play
Sunny Hot High False no
Sunny Hot High True no
Sunny Mild High False no
Sunny Cool Normal False yes
Sunny Mild Normal True yes
Outlook = Overcast:
Overcast Hot High False yes
Overcast Cool Normal True yes
Outlook = Rainy:
Rainy Mild High False yes
Rainy Cool Normal False yes
Rainy Cool Normal True no
Rainy Mild Normal False yes
Rainy Mild High True no
Which attribute should be tested at the Sunny branch?
Gain(Ssunny, Humidity) = .970 - (3/5) 0.0 - (2/5) 0.0 = .970
Gain(Ssunny, Temperature) = .970 - (2/5) 0.0 - (2/5) 1.0 - (1/5) 0.0 = .570
Gain(Ssunny, Windy) = .970 - (2/5) 1.0 - (3/5) .918 = .019
Humidity gives the largest gain, so it is tested at the Sunny branch.
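The numbers above can be reproduced with the information_gain helper sketched after slide 19 (the snippet below assumes that helper is already defined):

```python
# The five Sunny instances from the table above:
sunny = [
    {"Temp": "Hot",  "Humidity": "High",   "Windy": "False", "Play": "no"},
    {"Temp": "Hot",  "Humidity": "High",   "Windy": "True",  "Play": "no"},
    {"Temp": "Mild", "Humidity": "High",   "Windy": "False", "Play": "no"},
    {"Temp": "Cool", "Humidity": "Normal", "Windy": "False", "Play": "yes"},
    {"Temp": "Mild", "Humidity": "Normal", "Windy": "True",  "Play": "yes"},
]
for attribute in ["Humidity", "Temp", "Windy"]:
    print(attribute, round(information_gain(sunny, attribute), 3))
# Humidity 0.971, Temp 0.571, Windy 0.02 -- Humidity wins, matching the slide up to rounding.
```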

21 ID3 Algorithm Informally:
Determine the attribute with the highest information gain on the training set. Use this attribute as the root and create a branch for each of the values the attribute can have. For each branch, repeat the process with the subset of the training set that is classified by that branch.
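A minimal recursive sketch of ID3 along these lines, reusing the information_gain helper from the sketch after slide 19 and returning the tree as nested dicts (both of these choices are mine, not the slides'):

```python
from collections import Counter

def id3(examples, attributes, target="Play"):
    """Recursive ID3 sketch; returns either a class label (leaf) or {attribute: {value: subtree}}."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                       # all examples agree: make a leaf
        return labels[0]
    if not attributes:                              # no attributes left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for v in set(e[best] for e in examples):        # one branch per observed value of `best`
        subset = [e for e in examples if e[best] == v]
        tree[best][v] = id3(subset, [a for a in attributes if a != best], target)
    return tree
```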

22 Hypothesis Space Search in ID3
The hypothesis space is the set of all decision trees defined over the given set of attributes. ID3's hypothesis space is a complete space; i.e., the target description is there! ID3 performs a simple-to-complex, hill-climbing search through this space.

23 Hypothesis Space Search in ID3
The evaluation function is the information gain. ID3 maintains only a single current decision tree. ID3 performs no backtracking in its search. ID3 uses all training instances at each step of the search.

24 Overfitting Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some hypothesis h' ∈ H such that h has a smaller error than h' over the training instances, but h' has a smaller error than h over the entire distribution of instances.

25 Reasons for Overfitting
Noisy training instances. Consider a noisy training example:
Outlook = Sunny; Temp = Hot; Humidity = Normal; Windy = True; PlayTennis = No
This instance is sorted to the same leaf (Outlook = Sunny, Humidity = Normal, labelled yes in the tree above) as the training instances:
Outlook = Sunny; Temp = Cool; Humidity = Normal; Windy = False; PlayTennis = Yes
Outlook = Sunny; Temp = Mild; Humidity = Normal; Windy = True; PlayTennis = Yes

26 Reasons for Overfitting
Because the noisy instance
Outlook = Sunny; Temp = Hot; Humidity = Normal; Windy = True; PlayTennis = No
ends up at the same leaf as
Outlook = Sunny; Temp = Cool; Humidity = Normal; Windy = False; PlayTennis = Yes
Outlook = Sunny; Temp = Mild; Humidity = Normal; Windy = True; PlayTennis = Yes
the leaf is no longer pure, and ID3 keeps growing the tree: below Humidity = Normal it adds a test on Windy and, on one branch, a further test on Temp, fitting the noise instead of stopping at the leaf yes.

27 Reasons for Overfitting
Small numbers of instances associated with leaf nodes. In this case it is possible for coincidental regularities to occur that are unrelated to the actual target concept.
(Figure: a scatter of positive and negative instances with a sparsely populated region marked "area with probably wrong predictions".)

28 Approaches to Avoiding Overfitting
Pre-pruning: stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data Post-pruning: Allow the tree to overfit the data, and then post-prune the tree.

29 Validation Set A validation set is a set of instances used to evaluate the utility of nodes in decision trees. The validation set has to be chosen so that it is unlikely to suffer from the same errors or fluctuations as the training set. Usually, before pruning, the training data is split randomly into a growing set and a validation set.

30 Reduced-Error Pruning
Split data into growing and validation sets. Pruning a decision node d consists of:
1. removing the subtree rooted at d,
2. making d a leaf node,
3. assigning d the most common classification of the training instances associated with d.
(Tree: Outlook = Sunny → Humidity (High → no, Normal → yes); Overcast → yes; Rainy → Windy (False → yes, True → no); leaf annotations: 3 instances, 2 instances.)
Accuracy of the tree on the validation set is 90%.

31 Reduced-Error Pruning
Split data into growing and validation sets. Pruning a decision node d consists of:
1. removing the subtree rooted at d,
2. making d a leaf node,
3. assigning d the most common classification of the training instances associated with d.
(Pruned tree: Outlook = Sunny → no; Overcast → yes; Rainy → Windy (False → yes, True → no).)
Accuracy of the tree on the validation set is 92.4%.

32 Reduced-Error Pruning
Split data into growing and validation sets. Pruning a decision node d consists of:
1. removing the subtree rooted at d,
2. making d a leaf node,
3. assigning d the most common classification of the training instances associated with d.
Do until further pruning is harmful:
1. Evaluate the impact on the validation set of pruning each possible node (plus those below it).
2. Greedily remove the one that most improves validation-set accuracy.
(Pruned tree: Outlook = Sunny → no; Overcast → yes; Rainy → Windy (False → yes, True → no).)
Accuracy of the tree on the validation set is 92.4%.
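A simplified sketch of reduced-error pruning for trees in the nested-dict form used in the earlier ID3 sketch. It prunes bottom-up in a single pass and keeps a pruned node whenever validation accuracy does not drop, rather than running the greedy best-node-first loop described on the slide; all function names are illustrative:

```python
from collections import Counter

def classify(tree, instance, default="yes"):
    """Walk the nested-dict tree; `default` is returned for unseen attribute values."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(instance[attr], default)
    return tree

def accuracy(tree, examples, target="Play"):
    return sum(classify(tree, e) == e[target] for e in examples) / len(examples)

def reduced_error_prune(tree, grow, valid, target="Play"):
    """Replace a subtree by a majority-class leaf when that does not hurt validation accuracy."""
    if not isinstance(tree, dict):
        return tree
    attr = next(iter(tree))
    for v, subtree in tree[attr].items():                      # prune the children first
        g = [e for e in grow if e[attr] == v]
        va = [e for e in valid if e[attr] == v]
        tree[attr][v] = reduced_error_prune(subtree, g or grow, va, target)
    # Candidate leaf: the most common class of the growing-set instances at this node.
    leaf = Counter(e[target] for e in grow).most_common(1)[0][0]
    if not valid or accuracy(leaf, valid, target) >= accuracy(tree, valid, target):
        return leaf                                            # pruning does not hurt: keep the leaf
    return tree
```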

33 Reduced Error Pruning Example

34 Posterior Class Probability Distribution of Decision Trees
Each leaf stores the class counts of the training instances sorted to it:
Outlook = Sunny: 2 pos and 3 neg → Ppos = 0.4, Pneg = 0.6
Outlook = Overcast: 2 pos and 0 neg → Ppos = 1.0, Pneg = 0.0
Outlook = Rainy, Windy = False: 3 pos and 0 neg → Ppos = 1.0, Pneg = 0.0
Outlook = Rainy, Windy = True: 0 pos and 2 neg → Ppos = 0.0, Pneg = 1.0

35 Rule Post-Pruning Convert tree to equivalent set of rules.
Prune each rule independently of the others. Sort the final rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances. For the PlayTennis tree:
IF (Outlook = Sunny) & (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) & (Humidity = Normal) THEN PlayTennis = Yes
……….
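A small sketch of the tree-to-rules conversion for the nested-dict tree representation used earlier (the helper name and printing format are mine):

```python
def tree_to_rules(tree, conditions=()):
    """One rule per root-to-leaf path: the conjunction of tests on the path implies the leaf label."""
    if not isinstance(tree, dict):
        return [(list(conditions), tree)]
    attr = next(iter(tree))
    rules = []
    for value, subtree in tree[attr].items():
        rules += tree_to_rules(subtree, conditions + ((attr, value),))
    return rules

tennis = {"Outlook": {"Sunny": {"Humidity": {"High": "no", "Normal": "yes"}},
                      "Overcast": "yes",
                      "Rainy": {"Windy": {"False": "yes", "True": "no"}}}}
for conds, label in tree_to_rules(tennis):
    print("IF", " & ".join(f"{a} = {v}" for a, v in conds), "THEN PlayTennis =", label)
```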

36 Continuous Valued Attributes
Create a set of discrete candidate tests from a continuous attribute and apply information gain to choose the best one. Candidate thresholds are placed between adjacent (sorted) attribute values whose classifications differ. Example: sorting the instances by Temperature gives the PlayTennis sequence No No Yes Yes Yes No, which yields two candidate tests, one at each label change (e.g. Temp > 85 at the upper change).
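A sketch of the threshold-generation step; the helper name is mine, and the temperature values below are illustrative only (the slide's actual numbers are not in the transcript), chosen so that the label sequence matches the one above:

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted attribute values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][1] != pairs[i + 1][1]]

temps  = [40, 48, 60, 72, 80, 90]                         # hypothetical readings
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]          # label sequence from the slide
print(candidate_thresholds(temps, labels))                # -> [54.0, 85.0]
```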

37 Missing Attribute Values
Strategies:
1. Assign the most common value of A among the other instances belonging to the same concept.
2. If node n tests the attribute A, assign the most common value of A among the other instances sorted to node n.
3. If node n tests the attribute A, assign a probability to each of the possible values of A, estimated from the observed frequencies of the values of A among the instances sorted to node n. These probabilities are then used in the information gain measure.
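A small sketch of the first two strategies, filling a missing value with the most common observed value (pass only the instances sorted to node n to get strategy 2); the function name and the use of None as the missing-value marker are my own conventions:

```python
from collections import Counter

def fill_missing(examples, attribute, missing=None):
    """Replace a missing value of `attribute` by its most common observed value."""
    observed = [e[attribute] for e in examples if e[attribute] != missing]
    most_common = Counter(observed).most_common(1)[0][0]   # assumes at least one observed value
    return [dict(e, **{attribute: most_common}) if e[attribute] == missing else e
            for e in examples]
```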

38 Summary Points Decision tree learning provides a practical method for concept learning. ID3-like algorithms search a complete hypothesis space. The inductive bias of decision trees is a preference (search) bias. Overfitting the training data is an important issue in decision tree learning. A large number of extensions of the ID3 algorithm have been proposed for overfitting avoidance, handling missing attributes, handling numerical attributes, etc.

39 Learning Decision Rules
Basic Sequential Covering Algorithm
Learn-One-Rule Procedure
Pruning

40 Definition of Decision Rules
Definition: Decision rules are rules with the following form: if <conditions> then concept C.
Example: if you run the Prism algorithm from Weka on the weather data you will get the following set of decision rules:
if outlook = overcast then PlayTennis = yes
if humidity = normal and windy = FALSE then PlayTennis = yes
if temperature = mild and humidity = normal then PlayTennis = yes
if outlook = rainy and windy = FALSE then PlayTennis = yes
if outlook = sunny and humidity = high then PlayTennis = no
if outlook = rainy and windy = TRUE then PlayTennis = no

41 Why Decision Rules? Decision rules are more compact.
Decision rules are more understandable. Example: let X ∈ {0,1}, Y ∈ {0,1}, Z ∈ {0,1}, W ∈ {0,1}. The rules are:
if X = 1 and Y = 1 then 1
if Z = 1 and W = 1 then 1
otherwise 0
The equivalent decision tree has to repeat the tests on Z and W in several branches, so the rule set is more compact.

42 Why Decision Rules?
(Figure: decision boundaries of decision trees compared with decision boundaries of decision rules.)

43 How to Learn Decision Rules?
We can convert trees to rules. We can use specific rule-learning methods.

44 Sequential Covering Algorithms
function LearnRuleSet(Target, Attrs, Examples, Threshold):
  LearnedRules := ∅
  Rule := LearnOneRule(Target, Attrs, Examples)
  while performance(Rule, Examples) > Threshold do
    LearnedRules := LearnedRules ∪ {Rule}
    Examples := Examples \ {examples covered by Rule}
    Rule := LearnOneRule(Target, Attrs, Examples)
  sort LearnedRules according to performance
  return LearnedRules
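A Python sketch of the same loop; the rule representation is left abstract, so learn_one_rule, performance and covers are assumed callables supplied by the caller (none of these names come from the slides):

```python
def learn_rule_set(examples, attributes, target, threshold,
                   learn_one_rule, performance, covers):
    """Sequential covering: learn one rule at a time, keep it while it performs well enough,
    and remove the examples it covers before learning the next rule."""
    learned, rest = [], list(examples)
    while rest:
        rule = learn_one_rule(rest, attributes, target)
        score = performance(rule, rest)
        if score <= threshold:                       # the new rule is not good enough: stop
            break
        learned.append((score, rule))
        rest = [e for e in rest if not covers(rule, e)]
    learned.sort(key=lambda item: item[0], reverse=True)   # sort rules by performance
    return [rule for _, rule in learned]
```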

45 Illustration
(Figure: a set of positive and negative training instances; the maximally general rule IF true THEN pos covers all of them.)

46 Illustration
(Figure: the rule is specialized from IF true THEN pos to IF A THEN pos, excluding part of the negatives.)

47 Illustration
(Figure: further specialization to IF A & B THEN pos, which covers mostly positive instances.)

48 Illustration
(Figure: the rule IF A & B THEN pos is added to the rule set; the instances it covers are removed before the next rule is learned.)

49 Illustration
(Figure: on the remaining instances, learning restarts from IF true THEN pos and specializes to IF C THEN pos.)

50 Illustration
(Figure: the second rule is specialized further to IF C & D THEN pos; the learned rule set now contains IF A & B THEN pos and IF C & D THEN pos.)

51 Learning One Rule
To learn one rule we use one of the strategies below (a top-down sketch follows the list):
Top-down: start with the maximally general rule and add literals one by one.
Bottom-up: start with a maximally specific rule and remove literals one by one.
Combination of top-down and bottom-up: the candidate-elimination algorithm.
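A minimal top-down sketch, representing a rule as a dict of attribute = value conditions and greedily adding the literal that most improves accuracy for the target class (the representation and helper names are mine, not the slides'):

```python
def covers(rule, instance):
    return all(instance.get(a) == v for a, v in rule.items())

def rule_accuracy(rule, examples, target, cls):
    covered = [e for e in examples if covers(rule, e)]
    return sum(e[target] == cls for e in covered) / len(covered) if covered else 0.0

def learn_one_rule_top_down(examples, attributes, target, cls):
    """Start from the maximally general rule (no conditions) and add literals one by one."""
    rule = {}
    while True:
        best, best_acc = None, rule_accuracy(rule, examples, target, cls)
        for a in attributes:
            if a in rule:
                continue
            for v in set(e[a] for e in examples):
                acc = rule_accuracy({**rule, a: v}, examples, target, cls)
                if acc > best_acc:
                    best, best_acc = (a, v), acc
        if best is None:                 # no literal improves the rule any further
            return rule
        rule[best[0]] = best[1]
```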

52 Bottom-up vs. Top-down
Bottom-up: typically more specific rules.
Top-down: typically more general rules.

53 Learning One Rule
Bottom-up: example-driven (the AQ family).
Top-down: generate-then-test (CN2).

54 Example of Learning One Rule

55 Heuristics for Learning One Rule
When is a rule "good"? High accuracy; less important: high coverage.
Possible evaluation functions:
Relative frequency: nc / n, where nc is the number of correctly classified instances and n is the number of instances covered by the rule.
m-estimate of accuracy: (nc + m p) / (n + m), where nc is the number of correctly classified instances, n is the number of instances covered by the rule, p is the prior probability of the class predicted by the rule, and m is the weight given to p.
Entropy.
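The two accuracy-based evaluation functions as one-liners, followed by a small worked example (the numbers are illustrative):

```python
def relative_frequency(n_correct, n_covered):
    """nc / n"""
    return n_correct / n_covered

def m_estimate(n_correct, n_covered, prior, m):
    """(nc + m*p) / (n + m): shrinks the observed accuracy toward the class prior p."""
    return (n_correct + m * prior) / (n_covered + m)

# A rule covering only 3 instances, all correct, with class prior 0.5 and m = 2:
print(relative_frequency(3, 3), m_estimate(3, 3, 0.5, 2))   # -> 1.0 0.8, the m-estimate is more cautious
```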

56 How to Arrange the Rules
The rules can be arranged in several ways:
1. The rules are ordered according to the order in which they have been learned, and this order is used for instance classification.
2. The rules are ordered according to their accuracy, and this order is used for instance classification.
3. The rules are not ordered, but there is a strategy for applying them (e.g., an instance covered by conflicting rules gets the classification of the rule that correctly classifies more training instances; an instance not covered by any rule gets the classification of the majority class in the training data).

57 Approaches to Avoiding Overfitting
Pre-pruning: stop learning the decision rules before they reach the point where they perfectly classify the training data Post-pruning: allow the decision rules to overfit the training data, and then post-prune the rules.

58 Post-Pruning
Split instances into Growing Set and Pruning Set;
Learn set SR of rules using Growing Set;
Find the best simplification BSR of SR;
while Accuracy(BSR, Pruning Set) > Accuracy(SR, Pruning Set) do
  SR := BSR;
  Find the best simplification BSR of SR;
return BSR;

59 Incremental Reduced Error Pruning
(Figure: comparison with global post-pruning; the data sets D1, D2, D3 are re-split into growing and validation parts, D21 and D22, as successive rules are learned.)

60 Incremental Reduced Error Pruning
1. Split the Training Set into a Growing Set and a Validation Set;
2. Learn a rule R using the Growing Set;
3. Prune the rule R using the Validation Set;
4. if performance(R, Training Set) > Threshold then
     add R to the Set of Learned Rules;
     remove from the Training Set the instances covered by R;
     go to 1;
   else return the Set of Learned Rules.
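A sketch of this loop in Python; learn_one_rule, prune_rule, performance and covers are assumed callables, and the 2/3 growing / 1/3 validation split is an illustrative choice, not something specified on the slide:

```python
import random

def irep(examples, attributes, target, threshold,
         learn_one_rule, prune_rule, performance, covers):
    """Incremental reduced-error pruning sketch: each rule is grown on a growing set,
    pruned on a validation set, and kept only if it still performs well enough."""
    rules, remaining = [], list(examples)
    while remaining:
        random.shuffle(remaining)
        split = max(1, (2 * len(remaining)) // 3)    # simplistic 2/3 vs 1/3 split
        grow, valid = remaining[:split], remaining[split:]
        rule = prune_rule(learn_one_rule(grow, attributes, target), valid)
        if performance(rule, remaining) <= threshold:
            return rules
        rules.append(rule)
        remaining = [e for e in remaining if not covers(rule, e)]
    return rules
```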

61 Summary Points Decision rules are easier for human comprehension than decision trees. Decision rules have simpler decision boundaries than decision trees. Decision rules are learned by sequential covering of the training instances.

62 Model Evaluation Techniques
Evaluation on the training set: too optimistic, because the same data is used both to train and to test the classifier.

63 Model Evaluation Techniques
Hold-out method: the data is split into a training set and a test set; the resulting estimate depends on the make-up of the test set. To improve the precision of the hold-out method, it is repeated many times.

64 Model Evaluation Techniques
k-fold Cross-Validation: the data is split into k folds; each fold is used once as the test set while the classifier is trained on the remaining k-1 folds.
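A sketch of the k-fold procedure; train_and_evaluate stands for whatever learner and accuracy measure are being evaluated (an assumed callable, not a Weka API):

```python
import random

def k_fold_cross_validation(data, k, train_and_evaluate):
    """Each fold serves once as the test set while the other k-1 folds form the training set."""
    data = list(data)
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        scores.append(train_and_evaluate(train, test))
    return sum(scores) / k                            # average accuracy over the k folds
```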

65 Intro to Weka
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {TRUE, FALSE}
@data
sunny,hot,high,FALSE,FALSE
sunny,hot,high,TRUE,FALSE
overcast,hot,high,FALSE,TRUE
rainy,mild,high,FALSE,TRUE
rainy,cool,normal,FALSE,TRUE
rainy,cool,normal,TRUE,FALSE
overcast,cool,normal,TRUE,TRUE
………….

