1 1 Data Mining Chapter 4, Part 1 Algorithms: The Basic Methods Kirk Scott

2 Dendrogramma, a New Phylum 2

3 Did you have any idea that symbols like these could be inserted into PowerPoint presentations? Ѿ ۞ ۩ ҂ 3

4 Basic Methods A good rule of thumb is to try the simple things first Quite frequently the simple things will be good enough or will provide useful insights for further explorations One of the meta-tasks of data mining is figuring out which algorithm is the right one for a given data set 4

5 Certain data sets have a certain structure Certain algorithms are designed to elicit particular kinds of structures The right algorithm applied to the right set will give straightforward results A mismatch between algorithm and data set will give complicated, cloudy results 5

6 This is just another thing where you have to accept the fact that it’s magic, or a guessing game Or you could call it exploratory research Based on experience, you might have some idea of which data mining algorithm might be right for a given data set Otherwise, you just start trying them and seeing what kind of results you get 6

7 Chapter 4 is divided into 8 basic algorithm descriptions plus a 9th topic The first four algorithms are covered by this set of overheads They are listed on the following overhead 7

8 4.1 Inferring Rudimentary Rules 4.2 Statistical Modeling 4.3 Divide and Conquer: Constructing Decision Trees 4.4 Covering Algorithms: Constructing Rules 8

9 4.1 Inferring Rudimentary Rules The 1R (1-rule) approach Given a training set with correct classifications, make a one-level decision tree based on one attribute Each value of the attribute generates a branch Each branch contains the collection of instances that have that attribute value 9

10 For each collection of instances belonging to an attribute value, count up the number of occurrences of each classification Let the predicted classification for that branch be the classification that has the greatest number of occurrences in the training set 10

11 Do this for each attribute Out of all of the trees generated, pick the one that has the lowest error rate The error rate is the count of the total number of misclassified instances across the training set for the rule set 11
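
As an aside (not from the book or the overheads): the whole 1R procedure just described fits in a few lines of Python. The sketch below assumes the training set is a list of attribute-value dictionaries with a parallel list of class labels; the names are illustrative.

from collections import Counter, defaultdict

def one_r(instances, labels):
    # For every attribute, build a one-level "tree" (a value -> class lookup)
    # and keep the attribute with the fewest misclassifications on the training set.
    best = None
    for attr in instances[0]:
        counts = defaultdict(Counter)
        for inst, label in zip(instances, labels):
            counts[inst[attr]][label] += 1
        rules = {val: c.most_common(1)[0][0] for val, c in counts.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best   # (chosen attribute, value -> predicted class, training error count)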

12 This simple approach frequently works well This suggests that for a lot of data sets, one dominant attribute is a strong determinant 12

13 Missing Values Missing values are easily handled by 1R Missing is just one of the branches in the decision tree 1R is fundamentally nominal 13

14 Numeric Attributes If 1R is fundamentally nominal, how do you decide how to branch on a numeric attribute? One approach to branching on numerics: Sort the numeric instances Create a break point everywhere in the sequence that the classification changes This partitions the domain 14

15 Overfitting The problem is that you may end up with lots of break points/partitions If there are lots of break points, this is counterproductive Rather than grouping things into categories, you’re fragmenting them This is a sign that you’re overfitting to the existing, individual instances in the data set 15

16 In the extreme case, there are as many partitions as there are training instances This is not good You’ve essentially determined a 1-1 coding from the attribute values to the classifications of the individual instances 16

17 If this happens, instances in the future that do not have these values for the attributes can’t be classified by the system They will not fall into any known partition This is an extreme case It is the classic case of overfitting 17

18 The less extreme case is when a model tends to have poor prediction performance due to overfitting to the existing data In other words, the model will make predictions, but the error rate is higher than a model that is less tightly fitted to the training data 18

19 Dealing with Overfitting in 1R This is a blanket rule of thumb to deal with the foregoing problem: When picking the numeric ranges that define the branches of the tree, specify that a minimum number, n, of instances from the training set have to be in each partition 19

20 The ranges are taken from the sorted values in order Wherever you’re at in the scheme, the problem that arises now is that the next n instances may include more than one classification The solution is to take the majority classification of the n as the rule for that branch 20

21 The latest rule of thumb, given above, may result in neighboring partitions with the same classification The solution to that is to merge those partitions This potentially will reduce the number of partitions significantly 21
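
For concreteness, here is a rough sketch of this discretization heuristic as the overheads describe it (not the book’s exact procedure): sort by the numeric value, take at least a minimum number of instances per partition, extend the partition while the following instances agree with its majority class, then merge neighboring partitions with the same majority.

def discretize(values, labels, min_size=3):
    pairs = sorted(zip(values, labels))
    partitions = []                      # list of (upper break value, majority class)
    i = 0
    while i < len(pairs):
        chunk = pairs[i:i + min_size]    # at least min_size instances per partition
        i += len(chunk)
        chunk_labels = [l for _, l in chunk]
        majority = max(set(chunk_labels), key=chunk_labels.count)
        while i < len(pairs) and pairs[i][1] == majority:
            chunk.append(pairs[i])       # absorb further instances that agree with the majority
            i += 1
        partitions.append((chunk[-1][0], majority))
    merged = [partitions[0]]             # merge neighbors that predict the same class
    for cut, cls in partitions[1:]:
        if cls == merged[-1][1]:
            merged[-1] = (cut, cls)
        else:
            merged.append((cut, cls))
    return merged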

22 Notice how rough and ready this is It’s a series of rules of thumb to fix problems caused by the previous rule of thumb You are essentially guaranteed some misclassifications However, on the whole, you hope that these heuristics result in a significant proportion of correct classifications 22

23 Discussion 1R is fast and easy 1R quite often performs only slightly less well than advanced techniques It makes sense to start simple in order to get a handle on a data set Go to something more complicated if desired 23

24 At the end of this section, the text describes a more complicated kind of 1R The details are unimportant Their concluding point is that, with experience, an initial examination of the data with simple techniques may give you insight into which more advanced technique might be suitable for it 24

25 4.2 Statistical Modeling 25

26 This is basically a discussion of an application of Bayes’ Theorem Bayes’ Theorem makes a statement about what is known as conditional probability I will cover the same ideas as the book, but I will do it in a slightly different way Whichever explanation makes the most sense to you is the “right” one 26

27 The book refers to this approach as Naïve Bayes It is based on the simplifying assumption that the attributes in a data set are independent Independence isn’t typical Otherwise there would be no associations to mine Even so, the technique gives good results 27

28 The Weather Example Table 4.2, shown on the following overhead, summarizes the outcome of play yes/no for each weather attribute for all instances in the training set 28

29 [Table 4.2: counts and relative frequencies of each attribute value under play = yes and play = no for the weather data; image not captured in this transcript] 29

30 Note this in particular on first viewing: 9 total yeses, 5 total nos The middle part of the table shows the raw counts of attribute value occurrences in the data set The bottom part of the table is the instructive one 30

31 Take the Outlook attribute for example: Sunny: yes 2/9, no 3/5; Overcast: yes 4/9, no 0/5; Rainy: yes 3/9, no 2/5 Given the outcome, yes/no (the denominator), these fractions tell you the likelihood that there was a given outlook (the numerator) 31

32 Bayes’ Theorem Bayes’ Theorem involves a hypothesis, H, and some evidence, E, relevant to the hypothesis The theorem gives a formula for finding the probability that H is true under the condition that you know that E is true 32

33 The theorem is based on knowing some probabilistic quantities related to the problem It is a statement about one conditional probability, expressed in terms of other probabilities This is the notation: P(A|B) = the probability of A, given that B is true 33

34 This is a statement of Bayes’ theorem: P(H|E) = P(E|H)P(H) / P(E) 34

35 Illustrating the Application of the Theorem with the Weather Example The book does its example with all of the attributes at once I will do this with one attribute and then generalize I will use the Outlook attribute Let H = (play = yes) Let E = (outlook = sunny) 35

36 Then P(H|E) = P(play = yes | outlook = sunny) By Bayes’ Theorem this equals P(outlook = sunny | play = yes) P(play = yes) / P(outlook = sunny) 36

37 Typing these things into PowerPoint is making me cry The following overheads show in words how you can intuitively understand what’s going on with the weather example They are based on the idea that you can express the probabilities in the expression in terms of fractions involving counts 37

38 [Slides 38-39: the Bayes expression above restated in terms of counts; the totals of 9 cancel, and the fraction reduces to (number of sunny days with play = yes) / (number of sunny days) = 2/5; the original images were not captured in this transcript] 39

40 After you’ve simplified like this, it’s apparent that you could do the calculation by just pulling 2 values out of the table However, the full formula where you can have multiple different E (weather attributes, for example) is based on using Bayes’ formula with all of the intermediate expressions 40

41 Before considering the case with multiple E, here is the simple case using the full Bayes’ formula Here are the fractions for the parts: P(E|H) = P(outlook = sunny | play = yes) = 2/9 P(H) = P(play = yes) = 9/14 P(E) = P(outlook = sunny) = 5/14 41

42 Then P(H|E) = P(E|H)P(H) / P(E) = (2/9 * 9/14) / (5/14) = (2/14) / (5/14) = 2/5 = .4 42

43 In other words, there were 2 sunny days when play = yes out of 5 sunny days total Using the same approach, you can find P(H|E) where H = (play = no) The arithmetic gives this result: .6 43

44 Using Bayes’ Theorem to Classify with More Than One Attribute The preceding example illustrated Bayes’ Theorem applied to one attribute A full Bayesian expression (conditional probability) will be derived including multiple pieces of evidence, one piece Ei for each attribute i The totality of evidence, including all attributes, is denoted E 44

45 Prediction can be done for a new instance with values for the i attributes The fractions from the weather table corresponding to the instance’s attribute values can be plugged into the Bayesian expression The result would be a probability, or prediction that play = yes or play = no for that set of attribute values 45

46 Statistical Independence Recall that Naïve Bayes assumed statistical independence of the attributes Stated simply, one of the results of statistics is that the probability that two independent events both occur is the product of their individual probabilities This is reflected when forming the expression for the general case 46

47 A Full Bayesian Example with Four Attributes The weather data had four attributes: outlook, temperature, humidity, windy Let E be the composite of E1, E2, E3, and E4 In other words, an instance has values for each of these attributes, and fractional values for each of the attributes can be read from the table based on the training set 47

48 Bayes’ Theorem extended to this case looks like this: P(H|E) = P(E1|H) P(E2|H) P(E3|H) P(E4|H) P(H) / P(E) 48

49 Suppose you get a new data item and you’d like to make a prediction based on its attribute values Let them be: Outlook = sunny Temperature = cool Humidity = high Windy = true Let the hypothesis be that Play = yes 49

50 Referring back to the original table: P(E1|H) = P(outlook = sunny | play = yes) = 2/9 P(E2|H) = P(temperature = cool | play = yes) = 3/9 P(E3|H) = P(humidity = high | play = yes) = 3/9 P(E4|H) = P(windy = true | play = yes) = 3/9 P(H) = P(play = yes) = 9/14 50

51 The product of the quantities given on the previous overhead is the numerator of the Bayesian expression This product = .0053 Doing the same calculation to find the numerator of the expression for play = no gives the value .0206 51

52 In general, you would also be interested in the denominator of the expressions, P(E) However, in the case where there are only two alternative predictions, you don’t have to do a separate calculation You can arrive at the needed value by other means 52

53 The universe of possible values is just yes or no, so you can form the denominator by just adding the two numerator values The denominator is .0053 + .0206 So P(yes|E) = (.0053) / (.0053 + .0206) = 20.5% And P(no|E) = (.0206) / (.0053 + .0206) = 79.5% 53
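
For what it’s worth, the whole calculation can be checked in a few lines of Python (a sketch, with the fractions for the instance sunny / cool / humidity = high / windy = true hard-coded; the play = no fractions come from the no columns of the same table).

likelihood_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # numerator for play = yes, about .0053
likelihood_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # numerator for play = no, about .0206

total = likelihood_yes + likelihood_no                    # stands in for P(E)
print(round(likelihood_yes / total, 3))                   # 0.205, i.e. about 20.5%
print(round(likelihood_no / total, 3))                    # 0.795, i.e. about 79.5%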

54 In effect, instead of computing the complete expressions, we’ve normalized them However we arrive at the numeric figures, it is their relative values that make the prediction 54

55 Given the set of attribute values, it is approximately 4 times more likely that play will have the value no than that play will have the value yes Therefore, the prediction, or classification of the instance will be no (with approximately 80% confidence) 55

56 A Small Problem with the Bayes’ Theorem Formula If one of the probabilities in the numerator is 0, the whole numerator goes to 0 This would happen when the training set did not contain any instances with a particular value for an attribute i, but a new instance did You can’t compare yes/no probabilities if they have gone to 0 56

57 One solution approach is to add constants to the top and bottom of fractions in the expression This can be accomplished without changing the relative yes/no outcome I don’t propose to go into this in detail (now) To me it seems more appropriate for an advanced discussion later, if needed 57

58 Missing Values Missing values for one or more attributes in an instance to be classified are not a problem If an attribute value is missing, a fraction for its conditional probability is simply not included in the Bayesian formula In other words, you’re only doing prediction based on the attributes that do exist in the instance 58

59 Numeric Attributes The discussion so far has been based on the weather example As initially given, all of the attributes are categorical It is also possible to handle numeric attributes This involves a bit more work, but it’s straightforward 59

60 For the purposes of illustration, assume that the distribution of numeric attributes is normal In the summary of the training set, instead of forming fractions of occurrence counts as for nominal values, find the mean and standard deviation for numeric ones 60

61 The important thing to remember is that in the earlier example, you did summaries for both the yes and no cases For numeric data, you need to find the mean and standard deviation for both the yes and no cases These parameters, µ and σ, will appear in the calculation of the parts of the Bayesian expression 61

62 This is the normal probability density function (p.d.f.): f(x) = (1 / (σ sqrt(2π))) e^(-(x - µ)² / (2σ²)) If x is distributed according to this p.d.f., then f(x) gives the likelihood (strictly, the density) of observing the value x 62

63 Let µ and σ be the mean and standard deviation of some attribute for those cases in the training set where the value of play = yes Then put the value of x for that attribute into the equation 63

64 f(x) is the probability of x, given that the value of play = yes In other words, this is P(E|H), the kind of thing in the numerator of the formula for Bayes’ theorem 64

65 Now you can plug this into the Bayesian expression just like the fractions for the nominal attributes in the earlier example This procedure isn’t proven correct, but based on background knowledge in statistics it seems to make sense We’ll just accept it as given in the book and apply it as needed 65
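
A minimal sketch of how such a term could be computed (assuming normality as above; the example values of µ and σ are purely illustrative, not taken from the overheads):

import math

def gaussian_density(x, mu, sigma):
    # Value of the normal p.d.f. at x; this number is plugged into the
    # Bayesian product in place of a count-based fraction.
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# e.g., with illustrative per-class statistics mu = 73, sigma = 6.2 for a
# numeric temperature attribute given play = yes:
# gaussian_density(66, 73.0, 6.2)    # roughly 0.034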

66 Naïve Bayes for Document Classification The details for this appear in a box in the text, which means it’s advanced and not to be covered in detail in this course The basic idea is that documents can be classified by which words appear in them The occurrence of a word can be modeled as a Boolean yes/no 66

67 The classification can be improved if the frequency of words is also taken into account This is the barest introduction to the topic It may come up again later in the book 67

68 Discussion Like 1R, Naïve Bayes can often produce good results The rule of thumb remains, start simple It is true that dependency among attributes is a theoretical problem with Bayesian analysis and can lead to results which aren’t accurate 68

69 The presence of dependent attributes means multiple factors for the same feature in the Bayesian expression Potentially too much weight will be put on a feature with multiple factors One solution is to try and select only a subset of independent attributes to work with as part of preprocessing 69

70 For numeric attributes, if they’re not normally distributed, the normal p.d.f. shouldn’t be used If the attributes do fall into a known distribution, you can use its p.d.f. In the absence of any knowledge, the uniform distribution might be a starting point for an analysis, with statistical analysis revealing the actual distribution 70

71 4.3 Divide-and-Conquer: Constructing Decision Trees Note that like everything else in this course, this is a purely pragmatic presentation Ideas will be given Nothing will be proven The book gives things in a certain order I will try to cover pretty much the same things I will do it in a different order 71

72 When Forming a Tree… 1. The fundamental question at each level of the tree is always which attribute to split on In other words, given attributes x1, x2, x3, …, do you branch first on x1 or x2 or x3 …? Having chosen the first to branch on, which of the remaining ones do you branch on next, and so on? 72

73 2. Suppose you can come up with a function, the information (info) function This function is a measure of how much information is needed in order to make a decision at each node in a tree 3. You split on the attribute that gives the greatest information gain from level to level 73

74 4. A split is good if it means that little information will be needed at the next level down You measure the gain by subtracting the amount of information needed at the next level down from the amount needed at the current level 74

75 Defining an Information Function It is necessary to describe the information function more fully, first informally, then formally This is the guiding principle: The best split results in branches where the instances in each of the branches are all of the same classification The branches are leaves and you’ve arrived at a complete classification 75

76 No more splitting is needed No more information is needed Expressed somewhat formally: If the additional information needed is 0, then the information gain from the previous level(s) is 100% I.e., whatever information remained to be gained has been gained 76

77 Developing Some Notation Taking part of the book’s example as a starting point: Let some node have a total of 9 cases Suppose that eventually 2 classify as yes and 7 classify as no This notation represents the information function: info([2, 7]) 77

78 info([2, 7]) The idea is this: In practice, in advance we wouldn’t know that 9 would split into 2 and 7 This is a symbolic way of indicating the information that would be needed to classify the instances into 2 and 7 At this point we haven’t assigned a value to the expression info([2, 7]) 78

79 The next step in laying the groundwork of the function is simple arithmetic Let p1 = 2/9 and p2 = 7/9 These fractions represent the proportion of each case out of the total They will appear in calculations of information gain 79

80 Properties Required of the Information Function Remember the general description of the information function It is a measure of how much information is needed in order to make a decision at each node in a tree 80

81 This is a relatively formal description of the characteristics required of the information function 1. When a split gives a leaf that’s all one classification, no additional information should be needed at that leaf That is, the function applied to the leaf should evaluate to 0 81

82 2. Assuming you’re working with binary attributes, when a split gives a result node that’s exactly half and half in classification, the information needed should be a maximum We don’t know the function yet and how it computes a value, but however it does so, the half and half case should generate a maximum value for the information function at that node 82

83 The Multi-Stage Property 3. The function should have what is known as the multi-stage property An attribute may not be binary If it is not binary, you can accomplish the overall splitting by a series of binary splits 83

84 The multi-stage property says that not only should you be able to accomplish the splitting with a series of binary splits It should also be possible to compute the information function value of the overall split based on the information function values of the series of binary splits 84

85 Here is an example of the multi-stage property Let there be 9 cases overall, with 3 different classifications Let there be 2, 3, and 4 instances of each of the cases, respectively 85

86 This is the multi-stage requirement: info([2, 3, 4]) = info([2, 7]) + 7/9 * info([3, 4]) How to understand this: Consider the first term The info needed to split [2, 3, 4] includes the full cost of splitting 9 instances into two classifications, [2, 7] 86

87 You could rewrite the first term in this way: info([2, 7]) = 2/9 * info([2, 7]) + 7/9 * info([2, 7]) In other words, the cost is apportioned on an instance-by-instance basis A decision is computed for each instance The first term is “full cost” because it involves all 9 instances 87

88 It is the apportionment idea that leads to the second term in the expression overall: 7/9 * info([3, 4]) Splitting [3, 4] involves a particular cost per instance After the [2, 7] split is made, the [3, 4] cost is incurred in only 7 out of the 9 total cases in the original problem 88

89 Reiterating the Multi-Stage Property You can summarize this verbally as follows: Any split can be arrived at by a series of binary splits At each branch there is a per instance cost of computing/splitting The total cost at each branch is proportional to the number of cases at that branch 89

90 What is the Information Function? It is Entropy The information function for splitting trees is based on a concept from physics, known as entropy In physics (thermodynamics), entropy can be regarded as a measure of how “disorganized” a system is For our purposes, an unclassified data set is disorganized, while a classified data set is organized 90

91 The book’s information function, based on logarithms, will be presented No derivation from physics will be given Also, no proof that this function meets the requirements for an information function will be given 91

92 For what it’s worth, I tried to show that the function has the desired properties using calculus I was not successful Kamal Narang looked at the problem and speculated that it could be done as a multi-variable problem rather than a single variable problem… 92

93 I didn’t have the time or energy to pursue the mathematical analysis any further We will simply accept that the formula given is what is used to compute the information function 93

94 Intuitively, when looking at this, keep in mind the following: The information function should be 0 when there is no split to be done The information function should be maximum when the split is half and half The multi-stage property has to hold 94

95 Definition of the entropy() Function by Example Using the concrete example presented so far, this defines the information function based on a definition of an entropy function: info([2, 7]) = entropy(2/9, 7/9) = -(2/9) log2(2/9) - (7/9) log2(7/9) 95

96 Note that since the logarithms of values <1 are negative, the minus signs on the terms lead to a positive value overall 96

97 General Definition of the entropy() Function Recall that pi can be used to represent proportional fractions of classification at nodes in general Then the info(), entropy() function for 2 classifications can be written: -p1 log2(p1) - p2 log2(p2) 97

98 For multiple classifications you get: -p1 log2(p1) - p2 log2(p2) - … - pn log2(pn) 98
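
Written out as code (a small sketch, using the usual convention that a zero fraction contributes zero to the sum):

from math import log2

def entropy(*fractions):
    return -sum(p * log2(p) for p in fractions if p > 0)

def info(counts):
    total = sum(counts)
    return entropy(*(c / total for c in counts))

# info([2, 7]) equals entropy(2/9, 7/9), roughly 0.764 bits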

99 Information, Entropy, with the Multi-Stage Property Remember that in the entropy version of the information function the pi are fractions Let p, q, and r represent fractions where p + q + r = 1 Then this is the book’s presentation of the formula based on the multi-stage property: entropy(p, q, r) = entropy(p, q + r) + (q + r) × entropy(q / (q + r), r / (q + r)) 99

100 Characteristics of the Information, Entropy Function Each of the logarithms in the expression is taken on a positive fraction less than one The logarithms of these fractions are negative The minus signs on the terms of the expression reverse this The value of the information function is positive overall 100

101 Note also that each term consists of a fraction multiplied by the logarithm of a fraction, where the sum of the coefficient fractions is 1 For the two-classification case the expression reaches its maximum value of 1 at the half and half split In general, with n classifications the information value is at most log2(n) 101

102 Logarithms base 2 are used If the properties for the information function hold for base 2 logarithms, they would hold for logarithms of any base In a binary world, it’s convenient to use 2 Incidentally, although we will use decimal numbers, the values of the information function can be referred to as “bits” of information 102

103 An Example Applying the Information Function to Tree Formation The book’s approach is to show you the formation of a tree by splitting and then explain where the information function came from My approach has been to tell you about the information function first Now I will work through the example, applying it to forming a tree 103

104 Start by considering Figure 4.2, shown on the following overhead The basic question is this: Which of the four attributes is best to branch on? If a decision leads to pure branches, that’s the best If the branches are not all pure, you use the information function to decide which branching is the best 104

105 [Figure 4.2: tree stumps for branching on each of the four weather attributes; image not captured in this transcript] 105

106 Another way of posing the question is: Which branching option gives the greatest information gain? 1. Calculate the amount of information needed at the previous level 2. Calculate the information needed if you branch on each of the four attributes 3. Calculate the information gain by finding the difference 106

107 In the following presentation I am not going to show the arithmetic of finding the logs, multiplying by the fractions, and summing up the terms I will just present the numerical results given in the book 107

108 Basic Elements of the Example In this example there isn’t literally a previous level We are at the first split, deciding which of the 4 attributes to split on There are 14 instances The end result classification is either yes or no (binary) And in the training data set there are 9 yeses and 5 nos 108

109 The “Previous Level” in the Example The first measure, the so-called previous level, is simply a measure of the information needed overall to split 14 instances between 2 categories of 9 and 5 instances, respectively info([9, 5]) = entropy(9/14, 5/14) = .940 109

110 The “Next Level” Branching on the “outlook” Attribute in the Example Now consider branching on the first attribute, outlook It is a three-valued attribute, so it gives three branches You calculate the information needed for each branch You multiply each value by the proportion of instances for that branch You then add these values up 110

111 This sum represents the total information needed after branching You subtract the information still needed after branching from the information needed before branching to arrive at the information gained by branching 111

112 The Three “outlook” Branches Branch 1 gives: info([2, 3]) = entropy(2/5, 3/5) = .971 Branch 2 gives: info([4, 0]) = entropy(4/4, 0/4) = 0 Branch 3 gives: info([3, 2]) = entropy(3/5, 2/5) = .971 112

113 In total: info([2, 3], [4, 0], [3, 2]) = (5/14)(.971) + (4/14)(0) + (5/14)(.971) = .693 Information gain = .940 - .693 = .247 113

114 Branching on the Other Attributes If you do the same calculations for the other three attributes, you get this: Temperature: info gain = .029 Humidity: info gain = .152 Windy: info gain = .048 Branching on outlook gives the greatest information gain 114
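
These numbers are easy to reproduce. The sketch below is self-contained (it repeats the entropy helper); the per-branch class counts for the other three attributes are my own reading of the weather data, not shown explicitly on the overheads.

from math import log2

def info(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

def split_info(branches):
    n = sum(sum(b) for b in branches)
    return sum(sum(b) / n * info(b) for b in branches)

before = info([9, 5])                                     # 0.940
splits = {"outlook":     [[2, 3], [4, 0], [3, 2]],
          "temperature": [[2, 2], [4, 2], [3, 1]],
          "humidity":    [[3, 4], [6, 1]],
          "windy":       [[6, 2], [3, 3]]}
for attr, branches in splits.items():
    print(attr, round(before - split_info(branches), 3))  # 0.247, 0.029, 0.152, 0.048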

115 Tree Formation Forming a tree in this way is a “greedy” algorithm You split on the attribute with the greatest information gain (outlook) You continue recursively with the remaining attributes/levels of the tree The desired outcome of this greedy approach is to have as small and simple a tree as possible 115

116 A Few More Things to Note 1. Intuitively you might suspect that outlook is the best choice because one of its branches is pure For the overcast outcome, there is no further branching to be done Intuition is nice, but you can’t say anything for sure until you’ve done the math 116

117 2. When you do this, the goal is to end up with leaves that are all pure Keep in mind that the instances in a training set may not be consistent It is possible to end up, after a series of splits, with both yes and no instances in the same leaf node It is simply the case that values for the attributes at hand don’t fully determine the classification outcome 117

118 Highly Branching Attributes Recall the following idea: It is possible to do data mining and “discover” a 1-1 mapping from an identifier to a corresponding class value This is correct information, but you have “overfitted” No future instance will have the same identifier, so this is useless for practical classification prediction 118

119 A similar problem can arise with trees From a node that represented an ID, you’d get a branch for each ID value, and one correctly classified instance in each child If such a key attribute existed in a data set, splitting based on information gain as described above would find it 119

120 This is because at each ID branch, the resulting leaf would be pure It would contain exactly one correctly classified instance No more information would be needed for any of the branches, so none would be needed for all of them collectively 120

121 Whatever the gain was, it would equal the total information still needed at the previous level The information gain would be 100% Recall that you start forming the tree by trying to find the best attribute to branch on The ID attribute will win every time and no further splitting will be needed 121

122 A Related Idea As noted, a tree based on an ID will have as many leaves as there are instances In general, this method for building trees will prefer attributes that have many branches, even if these aren’t ID attributes This goes against the grain of the goal of having small, simple trees 122

123 The preference for many branches can be informally explained The greater the number of branches, the fewer the number of instances per branch, on average The fewer the number of instances per branch, the greater the likelihood of a pure branch or nearly pure branch 123

124 Counteracting the Preference for Many Branches The book goes into further computational detail I’m ready to simplify The basic idea is that instead of calculating the desirability of a split based on information gain alone, you calculate the gain ratio of the different branches and choose on that basis 124

125 The gain ratio takes into account the number and size of the child nodes a split generates What happens, more or less, is that the information gain is divided by the intrinsic information of the split itself (the entropy of the branch sizes), which grows as the number of branches grows 125

126 Once you go down this road, there are further complications to keep in mind First of all, this adjustment doesn’t protect you from a split on an ID attribute because it will still win Secondly, if you use the gain ratio this can lead to branching on a less desirable attribute 126

127 The book cites this rule of thumb: Provisionally pick the attribute with the highest gain ratio Find the average absolute information gain for branching on each attribute Check the absolute information gain of the provisional choice against the average Only take the ratio winner if it is greater than the average 127
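
A hedged sketch of the gain ratio idea, using the entropy of the branch sizes as the divisor (the standard C4.5-style “intrinsic information”; the overheads only describe this adjustment approximately):

from math import log2

def H(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

def gain_ratio(branches, parent_counts):
    n = sum(sum(b) for b in branches)
    gain = H(parent_counts) - sum(sum(b) / n * H(b) for b in branches)
    intrinsic = H([sum(b) for b in branches])   # entropy of the branch sizes alone
    return gain / intrinsic if intrinsic else 0.0

# Outlook on the weather data: gain of about 0.247 divided by the intrinsic
# information of a 5/4/5 split (about 1.577), giving a gain ratio near 0.157.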

128 Discussion Divide and conquer construction of decision trees is also known as top-down induction of decision trees The developers of this scheme have come up with methods for dealing with: –Numeric attributes –Missing values –Noisy data –The generation of rule sets from trees 128

129 For what it’s worth, the algorithms discussed above have names ID3 is the name for the basic algorithm C4.5 refers to a practical implementation of this algorithm, with improvements, for decision tree induction 129

130 4.4 Covering Algorithms: Constructing Rules Recall, in brief, how top-down tree construction worked: Attribute by attribute, pick the one that does the best job of separating instances into distinct classes Covering is a converse approach It works from the bottom up 130

131 The goal is to create a set of classification rules directly, without building a tree By definition, a training set with a classification attribute is explicitly grouped into classes The goal is to come up with (simple) rules that will correctly classify some new instance that comes along I.e., the goal is prediction 131

132 The process goes like this: Choose one of the classes present in the training data set Devise a rule based on one attribute that “covers” only instances of that class 132

133 In the ideal case, “cover” would mean: You wrote a single rule on a single attribute (a single condition) That rule identified all of the instances of one of the classes in a data set That rule identified only the instances of that class in the data set 133

134 In practice some rule may not identify or “capture” all of the instances of the class It may also not capture only instances of the class If so, the rule may be refined by adding conditions on other attributes (or other conditions on the attribute already used) 134

135 Measures of Goodness of Rules Recycling some of the terminology for trees, the idea can be expressed this way: When you pick a rule, you want its results to be “pure” “Pure” in this context is analogous to pure with trees The rule, when applied to instances, should ideally give only one classification as a result 135

136 The rule should ideally also cover all instances You make the rule stronger by adding conditions Each addition should improve purity/coverage 136

137 When building trees, picking the branching attribute was guided by an objective measure of information gain When building rule sets, you also need a measure of the goodness of the covering achieved by rules on different attributes 137

138 The Rule Building Process in More Detail Just to cut to the chase, evaluating the goodness of a rule just comes down to counting It doesn’t involve anything fancy like entropy You just compare rules based on how well they cover a classification 138

139 Suppose you pick some classification value, say Y The covering process generates rules of this kind: If(the attribute of interest takes on value X) Then the classification of the instance is Y 139

140 Notation In order to discuss this with some precision, let this notation be given: A given data set will have m attributes Let an individual attribute be identified by the subscript i Thus, one of the attributes would be identified in this way: attribute i 140

141 A given attribute will have n different values Let an individual value be identified by the subscript j Thus, the specific, j th value for the i th attribute could be identified in this way: X i,j 141

142 There could be many different classifications However, in describing what’s going on, we’re only concerned with one classification at a time There is no need to introduce subscripting on the classifications A classification will simply be known as Y 142

143 The kind of rule generated by covering could be made more specific and compact: If(the attribute of interest takes on value X) Then the classification of the instance is Y ≡ if(attribute i = X i,j ) then classification = Y 143

144 Then for some combination of attribute and value: attribute i = X i,j Let t represent the total number of instances in the training set where this condition holds true Let p represent the number of those instances that are classified as Y 144

145 The ratio p/t is like a success rate for predicting Y with each attribute and value pair Notice how this is pretty much like a confidence ratio in association rule mining Now you want to pick the best rule to add to your rule set 145

146 Find p/t for all of the attributes and values in the data set Pick the highest p/t (success) ratio Then the condition for the attribute and value with the highest p/t ratio becomes part of the covering rule: if(attribute i = X i,j ) then classification = Y 146

147 Costs of Doing This Note that it is not complicated to identify all of the cases, but potentially there are many of them to consider There are m attributes with a varying number of different values, represented as n In general, there are m × n cases For each case you have to count t and p 147

148 Reiteration of the Sources of Error Given the rule just devised: There may also be instances in the data set where attribute i = X i,j However, these may be instances that are not classified as Y 148

149 This means you’ve got a rule with “impure” results You want a high p/t ratio of success You want a low absolute t – p count of failures In reality, you would prefer to add only rules with a 100% confidence rate to your rule set 149

150 This is the second kind of problem with any interim rule: There may be instances in the data set where attribute i <> X i,j However, these may be instances that are classified as Y Here, the problem is less severe This just means that the rule’s coverage is not complete 150

151 Refining the Rule You refine the rule by repeating the process outlined above You just added a rule with this condition: attribute i = X i,j To that you want to add a new condition that does not involve attribute i 151

152 When counting successes for the new condition, you count only the instances in the data set that the rule so far already covers, i.e., those where attribute i = X i,j holds Find p and t for the remaining attributes and their values over that covered subset Pick the attribute-value condition with the highest p/t ratio Add this condition to the existing rule using AND 152

153 Continue until you’re satisfied Or until you’ve run through all of the attributes Or until you’ve run out of instances to classify 153
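
To make the rule-growing loop concrete, here is a minimal sketch of building one rule for one class in the spirit just described (the data format, a list of attribute dictionaries plus a list of labels, and the names are my assumptions, not the book’s):

def grow_rule(instances, labels, target_class):
    rule = []                                  # list of (attribute, value) conditions, ANDed together
    covered = list(zip(instances, labels))
    used = set()
    while covered and any(label != target_class for _, label in covered):
        best, best_ratio = None, -1.0
        for attr in covered[0][0]:
            if attr in used:
                continue
            for val in {inst[attr] for inst, _ in covered}:
                subset = [label for inst, label in covered if inst[attr] == val]
                t = len(subset)
                p = sum(1 for label in subset if label == target_class)
                if p / t > best_ratio:         # highest p/t (success) ratio wins
                    best, best_ratio = (attr, val), p / t
        if best is None:
            break                              # ran out of attributes to refine on
        rule.append(best)
        used.add(best[0])
        covered = [(i, l) for i, l in covered if i[best[0]] == best[1]]
    return rule

Repeating this, removing the instances a finished rule covers, and cycling over the classes gives the separate-and-conquer behavior summarized on the following overheads.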

154 Remember that what has been described thus far is the process of covering 1 classification You repeat the process for all classifications Or you do n – 1 of the classifications and leave the nth one as the default case 154

155 Suppose you do this exhaustively, completely and explicitly covering every class The rules will tend to have many conditions If you only added individual rules with p/t = 100% they will also be “perfect” In other words, for the training set there will be no ambiguity whatsoever 155

156 Compare this with trees Successive splitting on attributes from the top down doesn’t guarantee pure (perfect) leaves Working from the bottom up you can always devise sufficiently complex rule sets to cover all of the existing classes 156

157 Rules versus Decision Lists The rules derived from the process given above can be applied in any order For any one class, it’s true that the rule is composed of multiple conditions which successively classify more tightly However, the end result of the process is a single rule with conditions in conjunction 157

158 In theory, you could apply these parts of a rule in succession That would be the moral equivalent of testing the conditions in order, from left to right However, since the conditions are in conjunction, you can test them in any order with the same result 158

159 In the same vein, it doesn’t matter which order you handle the separate classes in If an instance doesn’t fall into one class, move on and try the next 159

160 The derived rules are “perfect” at most for all of the cases in the training set (only) It’s possible to get an instance in the future where >1 rule applies or no rule applies As usual, the solutions to these problems are of the form, “Assign it to the most frequently occurring…” 160

161 The book summarizes the approach given above as a separate-and-conquer algorithm This is pretty much analogous to a divide and conquer algorithm Working class by class is clearly divide and conquer 161

162 Within a given class the process is also progressive, step-by-step You bite off a chunk with one rule, then bite off the next chunk with another rule, until you’ve eaten everything in the class Notice that like with trees, the motivation is “greedy” You always take the best p/t ratio first, and then refine from there 162

163 For what it’s worth, this algorithm for finding a covering rule set is known as PRISM 163

164 The End 164

