# Data Mining using Decision Trees Professor J. F. Baldwin.


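Decision Trees from Data Base

| Ex Num | Size (Att) | Colour (Att) | Shape (Att) | Satisfied (Concept) |
|---|---|---|---|---|
| 1 | med | blue | brick | yes |
| 2 | small | red | wedge | no |
| 3 | small | red | sphere | yes |
| 4 | large | red | wedge | no |
| 5 | large | green | pillar | yes |
| 6 | large | red | pillar | no |
| 7 | large | green | sphere | yes |

Choose target: Concept satisfied. Use all attributes except Ex Num.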

CLS - Concept Learning System - Hunt et al.

Tree structure: a parent node containing a mixture of +ve and -ve examples is split on an attribute V with values v1, v2, v3. Each value leads to a child node.

CLS ALGORITHM

1. Initialise the tree T by setting it to consist of one node containing all the examples, both +ve and -ve, in the training set.
2. If all the examples in T are +ve, create a YES node and HALT.
3. If all the examples in T are -ve, create a NO node and HALT.
4. Otherwise, select an attribute F with values v1, ..., vn. Partition T into subsets T1, ..., Tn according to the values on F. Create branches with F as parent and T1, ..., Tn as child nodes.
5. Apply the procedure recursively to each child node.
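The five steps above can be sketched in a few lines of Python. This is only an illustration, not the original CLS implementation: the function name `cls`, the nested-dict tree representation, the fixed attribute order passed in by the caller, and the majority-class fallback when attributes run out are all assumptions made for the sketch.

```python
from collections import Counter

# Training examples from the slide table (Ex Num is kept only as an identifier).
EXAMPLES = [
    {"id": 1, "size": "med", "colour": "blue", "shape": "brick", "satisfied": "yes"},
    {"id": 2, "size": "small", "colour": "red", "shape": "wedge", "satisfied": "no"},
    {"id": 3, "size": "small", "colour": "red", "shape": "sphere", "satisfied": "yes"},
    {"id": 4, "size": "large", "colour": "red", "shape": "wedge", "satisfied": "no"},
    {"id": 5, "size": "large", "colour": "green", "shape": "pillar", "satisfied": "yes"},
    {"id": 6, "size": "large", "colour": "red", "shape": "pillar", "satisfied": "no"},
    {"id": 7, "size": "large", "colour": "green", "shape": "sphere", "satisfied": "yes"},
]


def cls(examples, attributes, target="satisfied"):
    """Build a tree as nested dicts; leaves are the strings 'yes' or 'no'."""
    labels = {e[target] for e in examples}
    if len(labels) == 1:
        # Steps 2 and 3: a pure node becomes a YES or NO leaf.
        return labels.pop()
    if not attributes:
        # Assumption (not in the slides): fall back to the majority class.
        return Counter(e[target] for e in examples).most_common(1)[0][0]
    # Step 4: CLS takes attributes in any order; here, the order given by the caller.
    attribute, rest = attributes[0], attributes[1:]
    branches = {}
    for e in examples:
        branches.setdefault(e[attribute], []).append(e)
    # Step 5: recurse into each child node.
    return {attribute: {v: cls(sub, rest, target) for v, sub in branches.items()}}


tree = cls(EXAMPLES, ["size", "shape", "colour"])
print(tree)
```

Passing the attributes in a different order gives a different, possibly larger or smaller, tree for the same data.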

Data Base Example

Using attribute SIZE, the root node {1, 2, 3, 4, 5, 6, 7} splits as follows:

- SIZE = med: {1}, YES
- SIZE = small: {2, 3}, expand
- SIZE = large: {4, 5, 6, 7}, expand

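Expanding

Root {1, 2, 3, 4, 5, 6, 7}, split on SIZE:

- med: {1}, YES
- small: {2, 3}, split on COLOUR (both examples are red, so they stay together), then on SHAPE
  - wedge: {2}, NO
  - sphere: {3}, YES
- large: {4, 5, 6, 7}, split on SHAPE
  - wedge: {4}, NO
  - sphere: {7}, YES
  - pillar: {5, 6}, split on COLOUR
    - red: {6}, NO
    - green: {5}, YES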

Rules from Tree

IF (SIZE = large AND (SHAPE = wedge OR (SHAPE = pillar AND COLOUR = red))) OR (SIZE = small AND SHAPE = wedge) THEN NO

IF (SIZE = large AND ((SHAPE = pillar AND COLOUR = green) OR SHAPE = sphere)) OR (SIZE = small AND SHAPE = sphere) OR (SIZE = medium) THEN YES

Disjunctive Normal Form - DNF

IF (SIZE = medium) OR (SIZE = small AND SHAPE = sphere) OR (SIZE = large AND SHAPE = sphere) OR (SIZE = large AND SHAPE = pillar AND COLOUR = green) THEN CONCEPT = satisfied ELSE CONCEPT = not satisfied

ID3 - Quinlan

ID3 = CLS + efficient ordering of attributes. Entropy is used to order the attributes.

In the CLS algorithm, attributes are chosen in any order. This can result in large decision trees if the ordering is not optimal. An optimal ordering would result in the smallest decision tree, but no method is known to determine the optimal ordering. We therefore use a heuristic that provides an efficient ordering and results in a near-optimal tree.

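Entropy

For a random variable V which can take values $\{v_1, v_2, \ldots, v_n\}$ with $\Pr(v_i) = p_i$ for all i, the entropy of V is given by

$$S(V) = -\sum_{i=1}^{n} p_i \ln p_i$$

Entropy for a fair die: $-\sum_{i=1}^{6} \tfrac{1}{6} \ln \tfrac{1}{6} = \ln 6 = 1.7917$

Entropy for a fair die known to show an even score: $-\sum_{i=1}^{3} \tfrac{1}{3} \ln \tfrac{1}{3} = \ln 3 = 1.0986$

Information gain = difference between the entropies = 1.7917 - 1.0986 = 0.6931

These values use natural logarithms.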
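The die figures on this slide can be reproduced with a short Python check. The helper name `entropy` is chosen here for illustration. Natural logarithms are used because the slide values 1.7917 and 1.0986 equal ln 6 and ln 3.

```python
import math


def entropy(probabilities):
    """Entropy in nats: -sum(p * ln p), skipping zero-probability outcomes."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


fair_die = [1 / 6] * 6      # six equally likely scores
even_score = [1 / 3] * 3    # a fair die known to show 2, 4 or 6

gain = entropy(fair_die) - entropy(even_score)
print(round(entropy(fair_die), 4), round(entropy(even_score), 4), round(gain, 4))
```

Learning that the score is even halves the number of possible outcomes, so the gain equals ln 2.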

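Attribute Expansion

Expand attribute $A_i$, which has values $a_{i1}, \ldots, a_{im}$, for target T.

At the node being expanded, the examples carry a probability distribution $\Pr(A_1, \ldots, A_i, \ldots, A_n, T)$. The examples are equally likely unless specified otherwise.

Each branch $A_i = a_{ij}$ receives the distribution over the other attributes and the target, for example $\Pr(A_1, \ldots, A_{i-1}, A_{i+1}, \ldots, A_n, T \mid A_i = a_{i1})$ for the first branch. To obtain it, pass down the probabilities corresponding to $a_{i1}$ from above and re-normalise. The result is equally likely again if the distribution above was equally likely.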

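Expected Entropy for an Attribute

For attribute $A_i$ with values $a_{i1}, \ldots, a_{im}$ and target T, each branch $a_{ij}$ has a conditional distribution $\Pr(T \mid A_i = a_{ij})$. To obtain it, pass down the probabilities corresponding to each target value $t_k$ from above for $a_{ij}$ and re-normalise. The entropy of that distribution is

$$S(a_{ij}) = -\sum_{k} \Pr(t_k \mid A_i = a_{ij}) \log \Pr(t_k \mid A_i = a_{ij})$$

The expected entropy for $A_i$ is

$$S(A_i) = \sum_{j=1}^{m} \Pr(A_i = a_{ij})\, S(a_{ij})$$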

How to choose attribute and Information gain

Determine the expected entropy for each attribute, i.e. $S(A_i)$ for all i. Choose s such that

$$S(A_s) = \min_i S(A_i)$$

and expand attribute $A_s$. By choosing attribute $A_s$ the information gain is $S - S(A_s)$, where

$$S = -\sum_{k} \Pr(t_k) \log \Pr(t_k)$$

is the entropy of the target T before the expansion. Minimising expected entropy is equivalent to maximising information gain.

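Previous Example

| Ex Num | Size (Att) | Colour (Att) | Shape (Att) | Satisfied (Concept) | Pr |
|---|---|---|---|---|---|
| 1 | med | blue | brick | yes | 1/7 |
| 2 | small | red | wedge | no | 1/7 |
| 3 | small | red | sphere | yes | 1/7 |
| 4 | large | red | wedge | no | 1/7 |
| 5 | large | green | pillar | yes | 1/7 |
| 6 | large | red | pillar | no | 1/7 |
| 7 | large | green | sphere | yes | 1/7 |

| Concept satisfied | Pr |
|---|---|
| yes | 4/7 |
| no | 3/7 |

$$S = -\tfrac{4}{7}\log_2\tfrac{4}{7} - \tfrac{3}{7}\log_2\tfrac{3}{7} = 0.99$$

Logarithms in this example are to base 2.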

Entropy for attribute Size

| Size | Concept satisfied | Pr |
|---|---|---|
| med | yes | 1/7 |
| small | no | 1/7 |
| small | yes | 1/7 |
| large | no | 2/7 |
| large | yes | 2/7 |

Conditional distributions of the concept for each value of Size:

| Size | Pr(no) | Pr(yes) | Entropy |
|---|---|---|---|
| small | 1/2 | 1/2 | S(small) = 1 |
| med | 0 | 1 | S(med) = 0 |
| large | 1/2 | 1/2 | S(large) = 1 |

Pr(small) = 2/7, Pr(med) = 1/7, Pr(large) = 4/7

S(Size) = (2/7)1 + (1/7)0 + (4/7)1 = 6/7 = 0.86

Information Gain for Size = 0.99 - 0.86 = 0.13

First Expansion

| Attribute | Information Gain |
|---|---|
| SIZE | 0.13 |
| COLOUR | 0.52 |
| SHAPE | 0.7 |

Choose the maximum, SHAPE. The root {1, 2, 3, 4, 5, 6, 7} splits on SHAPE:

- wedge: {2, 4}, NO
- brick: {1}, YES
- pillar: {5, 6}, expand
- sphere: {3, 7}, YES
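The three gains in the table can be recomputed from the seven training examples. The sketch below assumes base-2 logarithms and equally likely examples, as in the worked example. The helper names (`entropy`, `expected_entropy`, `information_gain`) are illustrative.

```python
import math
from collections import Counter

# (size, colour, shape, satisfied) for examples 1 to 7 of the slide table.
ROWS = [
    ("med", "blue", "brick", "yes"),
    ("small", "red", "wedge", "no"),
    ("small", "red", "sphere", "yes"),
    ("large", "red", "wedge", "no"),
    ("large", "green", "pillar", "yes"),
    ("large", "red", "pillar", "no"),
    ("large", "green", "sphere", "yes"),
]
COLUMNS = {"size": 0, "colour": 1, "shape": 2}


def entropy(labels):
    """Base-2 entropy of a list of class labels, each example equally likely."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def expected_entropy(rows, column):
    """Weighted average of the branch entropies after splitting on one column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[-1])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())


def information_gain(rows, column):
    return entropy([row[-1] for row in rows]) - expected_entropy(rows, column)


gains = {name: information_gain(ROWS, col) for name, col in COLUMNS.items()}
best = max(gains, key=gains.get)
print({name: round(g, 2) for name, g in gains.items()}, best)
```

Choosing the attribute with the largest gain is the same as choosing the one with the smallest expected entropy, since the entropy before the split is common to all three.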

Complete Decision Tree

Root {1, 2, 3, 4, 5, 6, 7}, split on SHAPE:

- wedge: {2, 4}, NO
- brick: {1}, YES
- pillar: {5, 6}, split on COLOUR
  - red: {6}, NO
  - green: {5}, YES
- sphere: {3, 7}, YES

Rule: IF Shape is wedge OR (Shape is pillar AND Colour is red) THEN NO ELSE YES

A new case

| Size (Att) | Colour (Att) | Shape (Att) | Satisfied (Concept) |
|---|---|---|---|
| med | red | pillar | ? |

Following the tree: SHAPE = pillar, then COLOUR = red, so ? = NO.
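Classifying a new case amounts to walking the ID3 tree from the root. A minimal sketch follows, with the tree hard-coded as conditionals. The function name `classify` and the choice to return `None` for shapes or colours that never occurred in training are assumptions, since the slides do not say how unseen values are handled.

```python
def classify(shape, colour):
    """Follow the ID3 tree: test SHAPE first, then COLOUR for pillars only."""
    if shape == "wedge":
        return "no"
    if shape in ("brick", "sphere"):
        return "yes"
    if shape == "pillar":
        if colour == "red":
            return "no"
        if colour == "green":
            return "yes"
    # Assumption: values not covered by the tree are left unclassified.
    return None


# The new case from the slide: size med, colour red, shape pillar.
print(classify("pillar", "red"))
```

Size plays no part in the decision, because the ID3 tree never tests it.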

Post Pruning

Consider any node S with N examples in the node, n of which are cases of class C, where C is one of {YES, NO}. Let C be the class with most examples, i.e. the majority class.

Suppose we terminate this node and make it a leaf with classification C. What will be the expected error, E(S), if we use the tree for new cases and we reach this node?

E(S) = Pr(class of new case is not C)

Bayes Updating for Post Pruning

Let p denote the probability of class C for a new case arriving at S. We do not know p. Let f(p) be a prior probability distribution for p on [0, 1]. We can update this prior using Bayes updating with the information at node S. The information at node S is "n C in S", i.e. n of the examples in S are of class C.

$$f(p \mid n\ C \text{ in } S) = \frac{\Pr(n\ C \text{ in } S \mid p)\, f(p)}{\int_0^1 \Pr(n\ C \text{ in } S \mid p)\, f(p)\, dp}$$

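Mathematics of Post Pruning

Assume f(p) to be uniform over [0, 1]. Then

$$f(p \mid n\ C \text{ in } S) = \frac{p^{n} (1-p)^{N-n}}{\int_0^1 p^{n} (1-p)^{N-n}\, dp}$$

The expected error is the posterior expectation of (1 - p):

$$E(S) = E_{f(p \mid n\ C \text{ in } S)}(1 - p) = \frac{\int_0^1 p^{n} (1-p)^{N-n+1}\, dp}{\int_0^1 p^{n} (1-p)^{N-n}\, dp} = \frac{N - n + 1}{N + 2}$$

The integrals are evaluated using Beta functions:

$$\int_0^1 x^{a} (1-x)^{b}\, dx = \frac{a!\, b!}{(a + b + 1)!}$$

so that, for example, the numerator is

$$\int_0^1 p^{n} (1-p)^{N-n+1}\, dp = \frac{n!\, (N - n + 1)!}{(N + 2)!}$$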
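The closed form can be checked against the ratio of the two Beta integrals using exact rational arithmetic. This is only a verification sketch. The function names are illustrative, and the factorial form of the integral applies to non-negative integer exponents.

```python
from fractions import Fraction
from math import factorial


def beta_integral(a, b):
    """Integral of x**a * (1 - x)**b over [0, 1] for non-negative integers a, b."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


def expected_error(N, n):
    """Posterior mean of (1 - p) under a uniform prior, as a ratio of Beta integrals."""
    return beta_integral(n, N - n + 1) / beta_integral(n, N - n)


def closed_form(N, n):
    """The closed form (N - n + 1) / (N + 2)."""
    return Fraction(N - n + 1, N + 2)


# A node with 10 examples, 6 of them in the majority class.
print(expected_error(10, 6), closed_form(10, 6))
```

Both expressions give 5/12 for this node, which is about 0.417.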

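Post Pruning for Binary Case

Consider a node S with children S1, S2, ..., Sm, reached with proportions P1, P2, ..., Pm and having errors Error(S1), Error(S2), ..., Error(Sm).

For any node S which is not a leaf node we can calculate

$$\text{BackUpError}(S) = \sum_i P_i\, \text{Error}(S_i)$$

$$\text{Error}(S) = \min\{E(S),\ \text{BackUpError}(S)\}$$

$$P_i = \frac{\text{number of examples in } S_i}{\text{number of examples in } S}$$

For leaf nodes $S_i$, Error($S_i$) = E($S_i$).

Decision: prune at S if Error(S) = E(S), that is, if the backed-up error of the sub-tree is no smaller than the error E(S) obtained by making S a leaf.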

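Example of Post Pruning

Before pruning. [x, y] means x YES cases and y NO cases.

- a [6, 4]
  - b [4, 2]
    - [3, 2]
    - [1, 0]
  - c [2, 2]
    - d [1, 2]
      - [1, 1]
      - [0, 1]
    - [1, 0]

| Node | [YES, NO] | E(S) | BackUpError(S) | Error(S) | Decision |
|---|---|---|---|---|---|
| a | [6, 4] | 0.417 | 0.378 | 0.378 | keep |
| b | [4, 2] | 0.375 | 0.413 | 0.375 | PRUNE |
| c | [2, 2] | 0.5 | 0.383 | 0.383 | keep |
| d | [1, 2] | 0.4 | 0.444 | 0.4 | PRUNE |
| leaf under b | [3, 2] | 0.429 | | 0.429 | |
| leaf under b | [1, 0] | 0.333 | | 0.333 | |
| leaf under d | [1, 1] | 0.5 | | 0.5 | |
| leaf under d | [0, 1] | 0.333 | | 0.333 | |
| leaf under c | [1, 0] | 0.333 | | 0.333 | |

PRUNE means cut the sub-tree below this point.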
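The figures in this example can be reproduced with a small recursive routine. The sketch below is one possible reading of the procedure, not the original code: the tuple representation `(name, yes, no, children)`, the leaf names `b1`, `b2`, `d1`, `d2`, `c2`, and the choice to prune on ties are assumptions made for illustration.

```python
def static_error(yes, no):
    """E(S) = (N - n + 1) / (N + 2), with n the majority-class count."""
    total, majority = yes + no, max(yes, no)
    return (total - majority + 1) / (total + 2)


def prune(node):
    """Return (Error(S), pruned node) for a node (name, yes, no, children)."""
    name, yes, no, children = node
    static = static_error(yes, no)
    if not children:
        return static, node                      # leaf: Error(S) = E(S)
    backed_up, kept = 0.0, []
    for child in children:
        child_error, child_pruned = prune(child)
        weight = (child[1] + child[2]) / (yes + no)   # P_i
        backed_up += weight * child_error
        kept.append(child_pruned)
    if static <= backed_up:                      # assumption: ties are pruned
        return static, (name, yes, no, [])
    return backed_up, (name, yes, no, kept)


# The tree from the slide; leaf names are hypothetical labels.
TREE = ("a", 6, 4, [
    ("b", 4, 2, [("b1", 3, 2, []), ("b2", 1, 0, [])]),
    ("c", 2, 2, [
        ("d", 1, 2, [("d1", 1, 1, []), ("d2", 0, 1, [])]),
        ("c2", 1, 0, []),
    ]),
])

error, pruned = prune(TREE)
print(round(error, 3), pruned)
```

Errors are backed up bottom-up, so the decision at c already uses the pruned error of d (0.4) rather than d's backed-up error (0.444).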

Result of Pruning

After pruning:

- a [6, 4]
  - [4, 2]
  - c [2, 2]
    - [1, 2]
    - [1, 0]

Generalisation

For the case in which we have k classes, the generalisation for E(S) is

$$E(S) = \frac{N - n + k - 1}{N + k}$$

Otherwise, the pruning method is the same.
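A short sketch of the k-class formula follows. The function name `expected_error_k` is illustrative, and k is taken to be the length of the list of class counts, so classes with no examples at the node must still be listed with a count of 0.

```python
from fractions import Fraction


def expected_error_k(class_counts):
    """E(S) = (N - n + k - 1) / (N + k) for a node with the given per-class counts."""
    N = sum(class_counts)       # examples in the node
    n = max(class_counts)       # examples in the majority class
    k = len(class_counts)       # number of classes (assumption: zero counts are listed)
    return Fraction(N - n + k - 1, N + k)


print(expected_error_k([6, 4]))     # binary case, same as (N - n + 1) / (N + 2)
print(expected_error_k([3, 2, 1]))  # three classes
```

With k = 2 the expression reduces to the binary formula, so the same function can serve both cases.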

Testing

Split the database into a Training Set and a Test Set.

1. Learn rules using the Training Set and prune.
2. Test the rules on the Training Set and record % correct.
3. Test the rules on the Test Set and record % correct.

The % accuracy on the test set should be close to that of the training set. This indicates good generalisation.

Over-fitting can occur if noisy data is used or if attributes that are too specific are used. Pruning will overcome noise to some extent but not completely. Attributes that are too specific must be dropped.