
1 The methods discussed so far are Linear Discriminants. [Figure: two classes of points, labelled Class1 and Class2.]

2 XOR Problem: Not Linearly Separable! [Figure: the XOR cases plotted on 0/1 axes.]

3 Decision Rules for XOR Problem:
If x = 0 then
    If y = 0 then class = 0
    Else class = 1
Else if x = 1 then
    If y = 0 then class = 1
    Else class = 0
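
As a quick check, here is a minimal sketch of those decision rules in Python (the function name xor_class and the 0/1 encoding are illustrative assumptions, not from the slides):

def xor_class(x: int, y: int) -> int:
    """Decision rules for the XOR problem from the slide."""
    if x == 0:
        return 0 if y == 0 else 1
    else:  # x == 1
        return 1 if y == 0 else 0

# The four corners of the XOR problem:
for x in (0, 1):
    for y in (0, 1):
        print((x, y), "->", xor_class(x, y))  # (0,0)->0, (0,1)->1, (1,0)->1, (1,1)->0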

4 A Sample Decision Tree. [Figure: a small tree testing features X and Y, with false (f) and true (t) branches and leaves labelled C1 and C2.] By default, a false value is to the left, true to the right. It is easy to generate a tree to perfectly classify the training data; it is much harder to generate a tree that works well on the test data!

5 Decision Tree Induction
- Pick a feature to test, say X.
- Split the training cases into a set where X = True and a set where X = False.
- If a set is entirely cases of one class, label it as a leaf. Alternately, label a set as a leaf if there are fewer than some threshold number of cases, e.g. 5.
- Repeat the process on the sets that are not leaves.
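
A minimal Python sketch of this induction loop, assuming binary True/False features. The case representation (a dict mapping feature names to bools plus a 'class' label), the names build_tree and pick_feature, and the threshold constant are illustrative assumptions, not from the slides; pick_feature is only a placeholder that the entropy heuristic on the following slides would normally fill in.

from collections import Counter

LEAF_THRESHOLD = 5  # label a set as a leaf if it has fewer than this many cases

def pick_feature(cases, features):
    # Placeholder: just take the first untested feature.
    # The entropy heuristic (next slides) would go here instead.
    return features[0]

def build_tree(cases, features):
    classes = [c["class"] for c in cases]
    majority = Counter(classes).most_common(1)[0][0]
    # Leaf if the set is pure, too small, or there is nothing left to test.
    if len(set(classes)) == 1 or len(cases) < LEAF_THRESHOLD or not features:
        return {"leaf": majority}
    feature = pick_feature(cases, features)
    true_set = [c for c in cases if c[feature]]
    false_set = [c for c in cases if not c[feature]]
    if not true_set or not false_set:        # the split separated nothing
        return {"leaf": majority}
    rest = [f for f in features if f != feature]
    return {"test": feature,
            "true": build_tree(true_set, rest),
            "false": build_tree(false_set, rest)}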

6 Decision Tree Induction
- How do we pick which feature to test? Use heuristic search!
- The entropy heuristic attempts to reduce the degree of randomness, or "impurity", of the selected feature (the number of bits needed to encode it).
- E.g. High randomness: a feature that splits into two sets, each set 50% class 1 and 50% class 2.
- Low randomness: a feature that splits into two sets, with everything in set 1 = class 1 and everything in set 2 = class 2.

7 The entropy of a particular state is the negative sum, over all the classes, of the probability of each class multiplied by the log of that probability:

Entropy = -Σ_i P(C_i) · lg P(C_i)

Example 1: 2 classes, C1 and C2, 100 cases. For this state, 50 cases are in C1 and 50 cases are in C2, so the probability of each class, P1 and P2, is 0.5. The entropy of this node = -[ (0.5)(lg 0.5) + (0.5)(lg 0.5) ] = 1.

Example 2: 75 cases in C1 and 25 cases in C2, so P(C1) = 0.75 and P(C2) = 0.25. Entropy = -[ (0.75)(lg 0.75) + (0.25)(lg 0.25) ] = 0.81.
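
A minimal sketch of that formula in Python, using base-2 logs as in the worked examples (the function name entropy and the list-of-counts input are assumptions for illustration):

from math import log2

def entropy(counts):
    """Entropy of a node given the number of cases in each class."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([50, 50]))  # 1.0
print(entropy([75, 25]))  # about 0.81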

8 Our algorithm will pick the feature or test that reduces the entropy the most. In the event that there are only two classes, this can be achieved by selecting the feature test that maximizes:

Entropy(parent) - [ p_true · Entropy(node_true) + p_false · Entropy(node_false) ]

where p_true and p_false are the fractions of the parent's cases that fall into the true and false branches. If there were more classes, we would have to include a p_class · Entropy(node_class) term for each additional class.

9 For example, let's say that we are at a node with an entropy of 1, as calculated previously. If we split the current set of cases by testing whether Feature A is true or false:

Feature A (100 cases)
- F branch: 10 cases in C1, 20 cases in C2 → E = -[ (1/3)·lg(1/3) + (2/3)·lg(2/3) ] = 0.92
- T branch: 50 cases in C1, 20 cases in C2 → E = -[ (5/7)·lg(5/7) + (2/7)·lg(2/7) ] = 0.86

10 If we split the current set of cases by testing whether Feature B is true or false:

Feature B (100 cases)
- F branch: 50 cases in C1 → E = -[ (1)·lg(1) + (0)·lg(0) ] = 0 (lg(0) is actually undefined; the 0·lg(0) term is treated as 0)
- T branch: 50 cases in C2 → E = -[ (0)·lg(0) + (1)·lg(1) ] = 0

Larger change in entropy, so we will pick Feature B over Feature A!
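
A minimal sketch that reproduces the Feature A vs. Feature B comparison; gain() is an illustrative name for the reduction in entropy, and the counts are taken straight from the two slides.

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gain(parent_counts, branch_counts):
    """Parent entropy minus the case-weighted entropy of the branches."""
    total = sum(parent_counts)
    weighted = sum(sum(b) / total * entropy(b) for b in branch_counts)
    return entropy(parent_counts) - weighted

# Feature A: F branch has 10 C1 / 20 C2, T branch has 50 C1 / 20 C2.
print(gain([50, 50], [[10, 20], [50, 20]]))  # about 0.12
# Feature B: each branch is pure, 50 C1 vs. 50 C2.
print(gain([50, 50], [[50, 0], [0, 50]]))    # 1.0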

11 # of Terminals vs. Error Rates (for Iris Data problem)

12 Reduction in Tree Size
Prune branches:
- Induce the tree.
- From the bottom, move up to the subtree starting at a non-terminal node.
- Prune this node.
- Test the new tree on the *test* cases.
- If it performs better than the original tree, keep the changes and continue.
Subtle flaw: this trains on the test data. You need a large sample size to get valid results.
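
A minimal sketch of that bottom-up pruning loop, reusing the dict-based tree from the earlier induction sketch. classify(), accuracy(), and leaf_classes() are illustrative helpers, the heldout argument plays the role of the slide's *test* cases, and for simplicity each subtree is scored on all held-out cases rather than only on those that reach it.

from collections import Counter

def classify(tree, case):
    while "leaf" not in tree:
        tree = tree["true"] if case[tree["test"]] else tree["false"]
    return tree["leaf"]

def accuracy(tree, cases):
    return sum(classify(tree, c) == c["class"] for c in cases) / len(cases)

def leaf_classes(tree):
    if "leaf" in tree:
        return [tree["leaf"]]
    return leaf_classes(tree["true"]) + leaf_classes(tree["false"])

def prune(tree, heldout):
    if "leaf" in tree:
        return tree
    # Work from the bottom up: prune the children first.
    tree = {"test": tree["test"],
            "true": prune(tree["true"], heldout),
            "false": prune(tree["false"], heldout)}
    majority = Counter(leaf_classes(tree)).most_common(1)[0][0]
    candidate = {"leaf": majority}
    # Keep the pruned version if it does at least as well on the held-out cases.
    if accuracy(candidate, heldout) >= accuracy(tree, heldout):
        return candidate
    return tree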

13 Web Demo: http://www.cs.ualberta.ca/~aixplore/learning/DecisionTrees/Applet/DecisionTreeApplet.html

14 Rule Induction Overview
- Generic separate-and-conquer strategy
- CN2 rule induction algorithm
- Improvements to rule induction

15 Problem
Given:
- A target concept
- Positive and negative examples
- Examples composed of features
Find:
- A simple set of rules that discriminates between (unseen) positive and negative examples of the target concept

16 Sample Unordered Rules
- If X then C1
- If X and Y then C2
- If NOT X and Z and Y then C3
- If B then C2
What if two rules fire at once? Just OR them together?

17 Target Concept
The target concept is in the form of rules. If we only have 3 features, X, Y, and Z, then we could generate the following possible rules:
- If X then…
- If X and Y then…
- If X and Y and Z then…
- If X and Z then…
- If Y then…
- If Y and Z then…
- If Z then…
This is an exponentially large space, and larger still if we allow NOTs (a quick enumeration is sketched below).
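
A quick way to see that space is to enumerate every non-empty conjunction over the three features (an illustrative sketch only; allowing NOTs would double the number of literals and grow the space much further):

from itertools import combinations

features = ["X", "Y", "Z"]
conjunctions = [c for size in range(1, len(features) + 1)
                for c in combinations(features, size)]
print(len(conjunctions))          # 7 = 2**3 - 1 possible rule bodies
for c in conjunctions:
    print("If " + " and ".join(c) + " then ...")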

18 Generic Separate-and-Conquer Strategy

TargetConcept = NULL
While NumPositive(Examples) > 0
    BestRule = TRUE
    Rule = BestRule
    Cover = ApplyRule(Rule)
    While NumNegative(Cover) > 0
        For each feature ∈ Features
            Refinement = Rule ∧ feature
            If Heuristic(Refinement, Examples) > Heuristic(BestRule, Examples)
                BestRule = Refinement
        Rule = BestRule
        Cover = ApplyRule(Rule)
    TargetConcept = TargetConcept ∨ Rule
    Examples = Examples - Cover
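
A minimal Python sketch of this strategy, under some simplifying assumptions: a rule is a set of required features, an example is a (feature-set, is_positive) pair, and the heuristic is simply positives covered minus negatives covered. All of these choices, and the function names, are illustrative rather than part of the slide.

def covers(rule, example_features):
    return rule <= example_features          # every conjunct is present

def heuristic(rule, examples):
    # Positives covered minus negatives covered (a stand-in for Heuristic()).
    return sum(1 if pos else -1
               for feats, pos in examples if covers(rule, feats))

def separate_and_conquer(examples, features):
    examples = list(examples)
    target_concept = []                       # disjunction of learned rules
    while any(pos for _, pos in examples):    # NumPositive(Examples) > 0
        rule = frozenset()                    # BestRule = TRUE
        cover = [e for e in examples if covers(rule, e[0])]
        while any(not pos for _, pos in cover):       # negatives still covered
            candidates = [rule | {f} for f in features if f not in rule]
            if not candidates:                # no refinement left to try
                break
            rule = max(candidates, key=lambda r: heuristic(r, examples))
            cover = [e for e in examples if covers(rule, e[0])]
        remaining = [e for e in examples if not covers(rule, e[0])]
        if len(remaining) == len(examples):   # rule covers nothing; give up
            break
        target_concept.append(rule)           # TargetConcept = TargetConcept OR Rule
        examples = remaining                  # Examples = Examples - Cover
    return target_concept

# On the trivial example of the next slide this returns a single rule, {'b'}:
examples = [({"a", "b"}, True), ({"b", "c"}, True),
            ({"c", "d"}, False), ({"d", "e"}, False)]
print(separate_and_conquer(examples, ["a", "b", "c", "d", "e"]))  # [frozenset({'b'})]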

19 Trivial Example
Examples: 1: a,b (+)   2: b,c (+)   3: c,d (-)   4: d,e (-)

H(T) = 2/4   H(a) = 1/1   H(b) = 2/2   H(c) = 1/2   H(d) = 0/1   H(e) = 0/1

Say we pick a. Remove the covered examples, leaving: 2: b,c (+)   3: c,d (-)   4: d,e (-)

H(a ∧ b) = 1/1   H(a ∧ c) = 1/2   H(a ∧ d) = 0/2   H(a ∧ e) = 0/1

Pick as our rule: a ∧ b.

20 CN2 Rule Induction (Clark & Boswell, 1991)
A more specialized version of separate-and-conquer:

CN2Unordered(allexamples, allclasses)
    Ruleset ← {}
    For each class in allclasses
        Generate rules by CN2ForOneClass(allexamples, class)
        Add rules to ruleset
    Return ruleset

21 CN2
CN2ForOneClass(examples, class)
    Rules ← {}
    Repeat
        Bestcond ← FindBestCondition(examples, class)
        If bestcond <> null then
            Add the rule "IF bestcond THEN PREDICT class"
            Remove from examples all + cases in class covered by bestcond
    Until bestcond = null
    Return rules

Keeps negative examples around so future rules won't impact existing negatives (allows unordered rules).

22 CN2
FindBestCondition(examples, class)
    MGC ← true   ' most general condition
    Star ← MGC, Newstar ← {}, Bestcond ← null
    While Star is not empty (or loopcount < MAXCONJUNCTS)
        For each rule R in Star
            For each possible feature F
                R' ← specialization of R formed by adding F as an extra conjunct to R
                     (i.e. R' = R AND F), removing null conditions (i.e. A AND NOT A),
                     removing redundancies (i.e. A AND A) and previously generated rules
                If LaPlaceHeuristic(R', class) > LaPlaceHeuristic(Bestcond, class)
                    Bestcond ← R'
                Add R' to Newstar
        If size(Newstar) > MAXRULESIZE then
            Remove worst in Newstar until size = MAXRULESIZE
        Star ← Newstar
    Return Bestcond

23 LaPlace Heuristic
LaPlaceAccuracy = (number of covered cases in the predicted class + 1) / (total number of covered cases + NumClasses)

In our case, NumClasses = 2. A common problem is a specific rule that covers only 1 example. In this case, LaPlace = (1 + 1) / (1 + 2) ≈ 0.667. However, a rule that covers, say, 2 examples gets a higher value: (2 + 1) / (2 + 2) = 0.75.
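
A minimal sketch of the heuristic as written above (covered_in_class and covered_total are illustrative parameter names):

def laplace(covered_in_class, covered_total, num_classes=2):
    return (covered_in_class + 1) / (covered_total + num_classes)

print(laplace(1, 1))  # about 0.667: a rule covering a single example of the class
print(laplace(2, 2))  # 0.75: a rule covering two examples of the class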

24 Trivial Example Revisited
Examples: 1: a,b (+)   2: b,c (+)   3: c,d (-)   4: d,e (-)

L(T) = 3/6   L(a) = 2/3   L(b) = 3/4   L(c) = 1/4   L(d) = 0/4   L(e) = 0/3

Say we pick beam = 3. Keep T, a, b.

Specialize T: (all already done)
Specialize a: L(a ∧ b) = 2/3   L(a ∧ c) = 0   L(a ∧ d) = 0   L(a ∧ e) = 0
Specialize b: L(b ∧ a) = 2/3   L(b ∧ c) = 2/3   L(b ∧ d) = 0   L(b ∧ e) = 0

Our best rule out of all of these is just "b". Continue until we run out of features, or the maximum number of conjuncts is reached.
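
A minimal sketch of the beam search behind this example, assuming examples are (feature-set, class-label) pairs and conditions are sets of required features; the beam width, the conjunct limit, and the way ties and redundant refinements are handled are simplifications of the real CN2 implementation.

MAX_CONJUNCTS = 3     # cap on rule length (loopcount in the pseudocode)
BEAM_WIDTH = 3        # MAXRULESIZE; the "beam = 3" of this example

def laplace(cond, examples, target, num_classes=2):
    covered = [label for feats, label in examples if cond <= feats]
    in_class = sum(1 for label in covered if label == target)
    return (in_class + 1) / (len(covered) + num_classes)

def find_best_condition(examples, target, features):
    star = [frozenset()]                 # start from the most general condition
    best = frozenset()
    for _ in range(MAX_CONJUNCTS):
        new_star = []
        for cond in star:
            for f in features:
                refined = cond | {f}
                if refined == cond or refined in new_star:
                    continue             # skip redundant or already-generated rules
                if laplace(refined, examples, target) > laplace(best, examples, target):
                    best = refined
                new_star.append(refined)
        # Keep only the best BEAM_WIDTH candidates for the next round.
        new_star.sort(key=lambda c: laplace(c, examples, target), reverse=True)
        star = new_star[:BEAM_WIDTH]
    return best

examples = [(frozenset("ab"), "+"), (frozenset("bc"), "+"),
            (frozenset("cd"), "-"), (frozenset("de"), "-")]
print(find_best_condition(examples, "+", "abcde"))  # frozenset({'b'})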

25 Improvements to Rule Induction
- Better feature selection algorithm
- Add a rule pruning phase
  - Addresses the problem of overfitting the data
  - Split the training examples into a GrowSet (2/3) and a PruneSet (1/3); train on the GrowSet, test the pruned rules on the PruneSet, and keep the rule with the best results (a split helper is sketched below)
  - Needs more training examples!
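
A minimal sketch of that GrowSet / PruneSet split (the 2/3 and 1/3 proportions come from the slide; the shuffling and the fixed seed are arbitrary illustrative choices):

import random

def grow_prune_split(examples, grow_fraction=2/3, seed=0):
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * grow_fraction)
    return shuffled[:cut], shuffled[cut:]     # (GrowSet, PruneSet)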

26 Improvements to Rule Induction
- Ripper / Slipper
  - Rule induction with pruning, plus new heuristics on when to stop adding rules and when to prune rules
  - Slipper builds on Ripper, but uses boosting to reduce the weight of negative examples instead of removing them entirely
- Other search approaches
  - Instead of beam search: genetic algorithms, pure hill climbing (which would be faster), etc.

27 In-Class VB Demo Rule Induction for Multiplexer

