
1 COMP3740 CR32: Knowledge Management and Adaptive Systems Supervised ML to learn Classifiers: Decision Trees and Classification Rules Eric Atwell, School of Computing, University of Leeds (including re-use of teaching resources from other sources, esp. Knowledge Management by Stuart Roberts, School of Computing, University of Leeds)

2 Reminder: Objectives of data mining
Data mining aims to find useful patterns in data. For this we need:
– Data mining techniques, algorithms and tools, e.g. WEKA
– A methodological framework to guide us in collecting data and applying the best algorithms: CRISP-DM
TODAY'S objective: learn how to learn classifiers: Decision Trees and Classification Rules.
Supervised Machine Learning: the training set has the answer (the class) for each example (instance).

3 Reminder: Concepts that can be learnt
The types of concepts we try to learn include:
Clusters or natural partitions
– E.g. we might cluster customers according to their shopping habits.
Rules for classifying examples into pre-defined classes
– E.g. mature students studying Information Systems with a high grade in General Studies A-level are likely to get a 1st-class degree.
General associations
– E.g. people who buy nappies are in general also likely to buy beer.
Numerical prediction
– E.g. Salary = a*A-level + b*Age + c*Gender + d*Programme + e*Degree (but are Gender and Programme really numbers???)


5 Output: decision tree
Outlook
– sunny → Humidity
    – high → Play = no
    – normal → Play = yes
– rainy → Windy
    – true → Play = no
    – false → Play = yes


10 Decision Tree Analysis
Example instance set (table shown on slide): can we predict, from the first 3 columns, the risk of getting a virus?
For convenience later: F = shares Files, S = uses Scanner, I = Infected before.

11 Decision tree building method
Forms a decision tree:
– tries for a small tree covering all or most of the training set
– internal nodes represent a test on an attribute value
– branches represent the outcomes of the test
Decides which attribute to test at each node:
– this is based on a measure of entropy
Must avoid over-fitting:
– if the tree is complex enough it might describe the training set exactly, but be no good for prediction
– so the tree may leave some exceptions (misclassified training instances)

12 Building a decision tree (DT)
The algorithm is recursive. At any step:
T = set of (remaining) training instances, {C1, …, Ck} = set of classes.
If all instances in T belong to a single class Ci, then the DT is a leaf node identifying class Ci. (done!)
…continued

13 Building a decision tree (DT) …continued
If T contains instances belonging to mixed classes, then choose a test based on a single attribute that will partition T into subsets {T1, …, Tn} according to the n outcomes of the test.
The DT for T comprises a root node identifying the test and one branch for each outcome of the test.
The branches are formed by applying the rules above recursively to each of the subsets {T1, …, Tn}.
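
Slides 12–13 translate almost directly into code. Below is a minimal recursive sketch in Python; this is an illustration, not the lecture's own implementation: instances are (attribute-dict, class) pairs, choose_test is assumed to pick the attribute with the largest information gain (see slide 22 onwards), and pruning and missing values are ignored.

from collections import Counter

def build_tree(instances, attributes, choose_test):
    """instances: list of (attribute_dict, class_label) pairs."""
    classes = Counter(label for _, label in instances)
    # Base case (slide 12): all instances in one class -> leaf node
    if len(classes) == 1:
        return ('leaf', next(iter(classes)))
    # Recursive case (slide 13): pick an attribute and partition T
    attr = choose_test(instances, attributes)  # e.g. highest information gain
    branches = {}
    for value in set(inst[attr] for inst, _ in instances):
        subset = [(inst, label) for inst, label in instances if inst[attr] == value]
        remaining = [a for a in attributes if a != attr]
        branches[value] = build_tree(subset, remaining, choose_test)
    return ('test', attr, branches)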

14 Tree Building example
T = the full instance set (table on slide); Classes = {High, Medium, Low}.
Choose a test based on F; number of outcomes n = 2 (Yes or No):
F
– yes → T1 (the instances with F = yes)
– no → T2 (the instances with F = no)

15 Tree Building example
T1 (the F = yes subset) still contains mixed classes from {High, Medium, Low}.
Choose a test based on I; number of outcomes n = 2 (Yes or No):
F
– yes → I
    – yes → T3
    – no → T4
– no → T2

16 Tree Building example
T4 (F = yes, I = no) contains a single class, so it becomes a leaf:
F
– yes → I
    – yes → T3
    – no → Risk = High
– no → T2

17 Tree Building example
T3 still contains mixed classes from {High, Medium, Low}.
Choose a test based on S; number of outcomes n = 2 (Yes or No):
F
– yes → I
    – yes → S (splitting T3)
    – no → Risk = High
– no → T2

18 Tree Building example
Both outcomes of the S test are now single classes:
F
– yes → I
    – yes → S
        – yes → Risk = Low
        – no → Risk = High
    – no → Risk = High
– no → T2

19 Tree Building example
Finally, T2 (the F = no subset) is split with a test on S; number of outcomes n = 2 (Yes or No):
F
– yes → I
    – yes → S
        – yes → Risk = Low
        – no → Risk = High
    – no → Risk = High
– no → S (splitting T2)

20 Tree Building example
All remaining subsets are single classes, so the tree is complete:
F
– yes → I
    – yes → S
        – yes → Risk = Low
        – no → Risk = High
    – no → Risk = High
– no → S
    – yes → Risk = Low
    – no → Risk = Medium

21 Example Decision Tree
The same tree with the attribute names written out:
Shares files?
– yes → Infected before?
    – yes → Uses scanner?
        – yes → Risk = low
        – no → Risk = high
    – no → Risk = high
– no → Uses scanner?
    – yes → Risk = low
    – no → Risk = medium
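
For illustration, the finished tree can be written out directly as a tiny classifier. This is a hand-coded Python sketch of the tree above, not part of the lecture; f, s and i follow the abbreviations on slide 10.

def classify(f, s, i):
    """Predict virus risk from: f = shares files, s = uses scanner,
    i = infected before (all booleans). Mirrors the tree above."""
    if f:                                   # Shares files?
        if i:                               # Infected before?
            return 'low' if s else 'high'   # Uses scanner?
        return 'high'
    return 'low' if s else 'medium'         # Uses scanner?

print(classify(f=True, s=True, i=True))     # -> low
print(classify(f=False, s=False, i=False))  # -> medium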

22 Which attribute to test?
The ROOT could be S or I instead of F, leading to a different decision tree.
The best DT is the smallest, most concise model.
The search space in general is too large to find the smallest tree by exhaustive search (trying them all). Instead we look for the attribute which splits the training set into the most homogeneous subsets.
The measure used for homogeneity is based on entropy.
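
The slide stops at "based on entropy" without giving the formulae. The standard definitions, as used by decision-tree learners in the ID3/C4.5 family that this method follows, are: for a set T whose k classes occur with proportions p_1, …, p_k, and an attribute A splitting T into subsets T_1, …, T_n,

H(T) = -\sum_{i=1}^{k} p_i \log_2 p_i
\qquad
\mathrm{Gain}(T, A) = H(T) - \sum_{j=1}^{n} \frac{|T_j|}{|T|} H(T_j)

Entropy is 0 for a completely homogeneous subset and maximal when the classes are evenly mixed, so the attribute with the largest gain gives the most homogeneous split. Slides 23–28 apply exactly this comparison to F, S and I.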

23 Tree Building example (modified)
Now suppose there are just two classes: Classes = {Yes, No} (high risk or not).
Choose a test based on F; number of outcomes n = 2 (Yes or No).

24 Tree Building example (modified)
Testing F:
F
– yes → High Risk = yes (5 instances, 1 error)
– no → High Risk = no (2 instances, 0 errors)

25 Tree Building example (modified)
Classes = {Yes, No}. Alternatively, choose a test based on S; number of outcomes n = 2 (Yes or No).

26 Tree Building example (modified)
Testing S:
S
– yes → High Risk = no (4 instances, 2 errors)
– no → High Risk = yes (3 instances, 1 error)

27 Tree Building example (modified)
Classes = {Yes, No}. Or choose a test based on I; number of outcomes n = 2 (Yes or No).

28 Tree Building example (modified)
Testing I:
I
– yes → High Risk = no (3 instances, 1 error)
– no → High Risk = yes (4 instances, 1 error)
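
Putting slides 24, 26 and 28 together, we can check numerically that F gives the largest information gain, which is why it becomes the root. A minimal Python sketch, assuming the (instances, errors) leaf annotations transcribed above (so, e.g., "High Risk = yes, 5 instances, 1 error" means 4 yes and 1 no):

from math import log2

def entropy(counts):
    """Entropy of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain(parent, subsets):
    """Information gain of a split; subsets is a list of class-count lists."""
    n = sum(sum(s) for s in subsets)
    return entropy(parent) - sum(sum(s) / n * entropy(s) for s in subsets)

parent = [4, 3]  # 4 High Risk = yes, 3 High Risk = no in the full set T
splits = {
    'F': [[4, 1], [0, 2]],  # slide 24: (5 instances, 1 error), (2, 0)
    'S': [[2, 2], [2, 1]],  # slide 26: (4, 2), (3, 1)
    'I': [[1, 2], [3, 1]],  # slide 28: (3, 1), (4, 1)
}
for attr, subsets in splits.items():
    print(attr, round(gain(parent, subsets), 3))
# F ~0.469, S ~0.020, I ~0.128 -> F is chosen as the root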

29 Decision tree building algorithm
For each decision point:
– If the remaining examples are all +ve or all -ve, stop.
– Else if there are some +ve and some -ve examples left and some attributes left, pick the remaining attribute with the largest information gain.
– Else if there are no examples left, no such example has been observed: return a default.
– Else if there are no attributes left, examples with the same description have different classifications: noise, insufficient attributes, or a nondeterministic domain.

30 Evaluation of decision trees
At the leaf nodes two numbers are given:
– N: the coverage of that node: how many instances it classifies
– E: the errors: how many of those instances are wrongly classified
The whole tree can be evaluated in terms of its size (number of nodes) and overall error rate, expressed as the number and percentage of cases wrongly classified. We seek small trees that have low error rates.

31 Evaluation of decision trees
The error rate for the whole tree can also be displayed in terms of a confusion matrix:

(A) (B) (C)  ← classified as
 35   2   1  Class (A) = high
  4  41   5  Class (B) = medium
  2   5  68  Class (C) = low

Rows are the true classes and columns the predicted ones, so the off-diagonal entries are the errors.
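
Assuming the digit grouping above is the right reconstruction of the matrix, the overall error rate falls out directly:

# Rows = true class, columns = predicted class (A = high, B = medium, C = low)
matrix = [[35, 2, 1],
          [4, 41, 5],
          [2, 5, 68]]
total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(3))           # diagonal = right answers
print(f"error rate = {(total - correct) / total:.1%}")  # -> about 11.7%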

32 Evaluation of decision trees
The error rates mentioned on previous slides are normally computed using:
a. the training set of instances, or
b. a test set of instances: some different examples!
If the decision tree algorithm has over-fitted the data, then the error rate based on the training set will be far less than that based on the test set.

33 Evaluation of decision trees
10-fold cross-validation can be used when the training data is limited in size:
– Divide the data set randomly into 10 subsets.
– Build a tree from 9 of the subsets and test it on the 10th.
– Repeat the experiment 9 more times, using a different test subset each time.
– The overall error rate is the average over the 10 experiments.
10-fold cross-validation leads to up to 10 different decision trees being built. The method for selecting or constructing the best tree from these is not clear.
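
As an aside, the whole procedure is one call in scikit-learn; this is my choice for illustration (the module itself uses WEKA, where the equivalent is the Cross-validation test option):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in data set for the sketch
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print(f"mean error rate over 10 folds: {1 - scores.mean():.1%}")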

34 From decision trees to rules
Decision trees may not be easy to interpret:
– tests associated with lower nodes have to be read in the context of tests further up the tree
– sub-concepts may sometimes be split up and distributed to different parts of the tree (see next slide)
– Computer Scientists may prefer if … then … rules!

35 DT for F = G = 1 or J = K = 1
F
– 0 → J
    – 0 → no
    – 1 → K
        – 0 → no
        – 1 → yes
– 1 → G
    – 1 → yes
    – 0 → J
        – 0 → no
        – 1 → K
            – 0 → no
            – 1 → yes
The sub-concept J = K = 1 is split across two subtrees.

36 Converting DT to rules
Step 1: every path from root to leaf represents a rule:
If F = 0 and J = 0 then class no
If F = 0 and J = 1 and K = 0 then class no
If F = 0 and J = 1 and K = 1 then class yes
…
If F = 1 and G = 0 and J = 1 and K = 1 then class yes
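
Step 1 is mechanical, so it is easy to code. A sketch, assuming the tree is stored as nested tuples in the ('test', attribute, branches) shape used in the earlier building sketch:

def tree_to_rules(node, conditions=()):
    """Yield one 'If <conditions> then <class>' rule per root-to-leaf path."""
    if node[0] == 'leaf':
        cond = ' and '.join(conditions) or 'true'
        yield f"If {cond} then class {node[1]}"
    else:  # ('test', attr, {value: subtree})
        _, attr, branches = node
        for value, subtree in branches.items():
            yield from tree_to_rules(subtree, conditions + (f"{attr} = {value}",))

# The F = 0 part of the tree on slide 35, written out by hand:
subtree = ('test', 'F', {
    0: ('test', 'J', {0: ('leaf', 'no'),
                      1: ('test', 'K', {0: ('leaf', 'no'), 1: ('leaf', 'yes')})}),
})
for rule in tree_to_rules(subtree):
    print(rule)  # reproduces the first three rules above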

37 Generalising rules
Rules from the tree, such as:
If F = 0 and J = 1 and K = 1 then class yes
If F = 1 and G = 0 and J = 1 and K = 1 then class yes
generalise (by dropping redundant conditions) to:
If G = 1 then class yes
If J = 1 and K = 1 then class yes

38 Tidying up rule sets
Generalisation leads to 2 problems:
The rules are no longer mutually exclusive:
– Order the rules and use the first matching rule as the operative rule (as sketched below).
– The ordering is based on how many false-positive errors each rule makes.
The rule set is no longer exhaustive:
– Choose a default value for the class when no rule applies.
– The default class is the one containing the most training cases not covered by any rule.
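
A sketch of applying an ordered rule list with a default. The representation here (each rule as a condition-function paired with a class) is illustrative, not from the lecture:

def apply_rules(instance, rules, default):
    """rules: list of (condition, class_label) pairs, already ordered by
    false-positive rate; return the first match, else the default class."""
    for condition, label in rules:
        if condition(instance):
            return label
    return default

rules = [
    (lambda x: x['G'] == 1, 'yes'),
    (lambda x: x['J'] == 1 and x['K'] == 1, 'yes'),
]
print(apply_rules({'F': 0, 'G': 0, 'J': 1, 'K': 1}, rules, default='no'))  # yes
print(apply_rules({'F': 0, 'G': 0, 'J': 0, 'K': 0}, rules, default='no'))  # no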

39 Decision Tree - Revision
The decision-tree building algorithm discovers rules for classifying instances.
At each step it needs to decide which attribute to test at that point in the tree; a measure of information gain can be used.
The output is a decision tree based on the training instances, evaluated with separate test instances.
Leaf nodes which have a small coverage may be pruned if the error rate of the pruned tree remains small.

40 Pruning example (from Witten & Frank)
A subtree tests 'Health plan contribution', with three outcomes:
– none → 4 bad, 2 good
– half → 1 bad, 1 good
– full → 4 bad, 2 good
None of the branches improves on the parent, so we replace the whole subtree with a single leaf:
Bad (14 instances, 5 errors)
The two numbers are the number of instances covered and the number of errors.
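
On the raw counts the replacement costs nothing, which is the intuition behind the pruning. A quick arithmetic check; note this reasoning is only implicit in the slide, and the procedure in Witten & Frank actually uses a pessimistic error estimate rather than raw counts:

# Errors if the subtree is kept: each leaf predicts its local majority class
subtree_errors = min(4, 2) + min(1, 1) + min(4, 2)  # none, half, full -> 2 + 1 + 2
# Errors if the subtree is replaced by one 'bad' leaf over all 14 instances
leaf_errors = 2 + 1 + 2                             # the 5 'good' instances
print(subtree_errors, leaf_errors)  # 5 5: no loss in accuracy, smaller tree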

41 Decision trees v classification rules
Decision trees can be used for prediction or interpretation:
– Prediction: compare an unclassified instance against the tree and predict what class it is in (with an error estimate).
– Interpretation: examine the tree and try to understand why instances end up in the class they are in.
Rule sets are often better for interpretation:
– Small, accurate rules can be examined individually, even if the overall accuracy of the rule set is poor.

42 Self Check
You should be able to:
– Describe how decision trees are built from a set of instances.
– Build a decision tree based on a given attribute.
– Explain what the training and test sets are for.
– Explain what 'supervised' means, and why classification is an example of supervised ML.

