1 1 Data Mining Chapter 4, Part 1 Algorithms: The Basic Methods Kirk Scott

2 Dendrogramma, a New Phylum 2

3 Did you have any idea that symbols like these could be inserted into PowerPoint presentations? Ѿ ۞ ۩ ҂ 3

4 Basic Methods A good rule of thumb is to try the simple things first Quite frequently the simple things will be good enough or will provide useful insights for further explorations One of the meta-tasks of data mining is figuring out which algorithm is the right one for a given data set 4

5 Certain data sets have a certain structure Certain algorithms are designed to elicit particular kinds of structures The right algorithm applied to the right set will give straightforward results A mismatch between algorithm and data set will give complicated, cloudy results 5

6 This is just another thing where you have to accept the fact that it’s magic, or a guessing game Or you could call it exploratory research Based on experience, you might have some idea of which data mining algorithm might be right for a given data set Otherwise, you just start trying them and seeing what kind of results you get 6

7 Chapter 4 is divided into 8 basic algorithm descriptions plus a 9th topic The first four algorithms are covered by this set of overheads They are listed on the following overhead 7

8 4.1 Inferring Rudimentary Rules 4.2 Statistical Modeling 4.3 Divide and Conquer: Constructing Decision Trees 4.4 Covering Algorithms: Constructing Rules 8

9 4.1 Inferring Rudimentary Rules The 1R (1-rule) approach Given a training set with correct classifications, make a one-level decision tree based on one attribute Each value of the attribute generates a branch Each branch contains the collection of instances that have that attribute value 9

10 For each collection of instances belonging to an attribute value, count up the number of occurrences of each classification Let the predicted classification for that branch be the classification that has the greatest number of occurrences in the training set 10

11 Do this for each attribute Out of all of the trees generated, pick the one that has the lowest error rate The error rate is the count of the total number of misclassified instances across the training set for the rule set 11
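
As an aside (not from the book or the overheads): the whole 1R procedure just described fits in a few lines of Python. The sketch below assumes the training set is a list of attribute-value dictionaries with a parallel list of class labels; the names are illustrative.

from collections import Counter, defaultdict

def one_r(instances, labels):
    # For every attribute, build a one-level "tree" (a value -> class lookup)
    # and keep the attribute with the fewest misclassifications on the training set.
    best = None
    for attr in instances[0]:
        counts = defaultdict(Counter)
        for inst, label in zip(instances, labels):
            counts[inst[attr]][label] += 1
        rules = {val: c.most_common(1)[0][0] for val, c in counts.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best   # (chosen attribute, value -> predicted class, training error count)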

12 This simple approach frequently works well This suggests that for a lot of data sets, one dominant attribute is a strong determinant 12

13 Missing Values Missing values are easily handled by 1R Missing is just one of the branches in the decision tree 1R is fundamentally nominal 13

14 Numeric Attributes If 1R is fundamentally nominal, how do you decide how to branch on a numeric attribute? One approach to branching on numerics: Sort the numeric instances Create a break point everywhere in the sequence that the classification changes This partitions the domain 14

15 Overfitting The problem is that you may end up with lots of break points/partitions If there are lots of break points, this is counterproductive Rather than grouping things into categories, you’re fragmenting them This is a sign that you’re overfitting to the existing, individual instances in the data set 15

16 In the extreme case, there are as many partitions as there are training instances This is not good You’ve essentially determined a 1-1 coding from the attribute values to the classifications of the individual instances 16

17 If this happens, instances in the future that do not have these values for the attributes can’t be classified by the system They will not fall into any known partition This is an extreme case It is the classic case of overfitting 17

18 The less extreme case is when a model tends to have poor prediction performance due to overfitting to the existing data In other words, the model will make predictions, but the error rate is higher than a model that is less tightly fitted to the training data 18

19 Dealing with Overfitting in 1R This is a blanket rule of thumb to deal with the foregoing problem: When picking the numeric ranges that define the branches of the tree, specify that a minimum number, n, of instances from the training set have to be in each partition 19

20 The ranges are taken from the sorted values in order Wherever you’re at in the scheme, the problem that arises now is that the next n instances may include more than one classification The solution is to take the majority classification of the n as the rule for that branch 20

21 The latest rule of thumb, given above, may result in neighboring partitions with the same classification The solution to that is to merge those partitions This potentially will reduce the number of partitions significantly 21
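
For concreteness, here is a rough sketch of this discretization heuristic as the overheads describe it (not the book’s exact procedure): sort by the numeric value, take at least a minimum number of instances per partition, extend the partition while the following instances agree with its majority class, then merge neighboring partitions with the same majority.

def discretize(values, labels, min_size=3):
    pairs = sorted(zip(values, labels))
    partitions = []                      # list of (upper break value, majority class)
    i = 0
    while i < len(pairs):
        chunk = pairs[i:i + min_size]    # at least min_size instances per partition
        i += len(chunk)
        chunk_labels = [l for _, l in chunk]
        majority = max(set(chunk_labels), key=chunk_labels.count)
        while i < len(pairs) and pairs[i][1] == majority:
            chunk.append(pairs[i])       # absorb further instances that agree with the majority
            i += 1
        partitions.append((chunk[-1][0], majority))
    merged = [partitions[0]]             # merge neighbors that predict the same class
    for cut, cls in partitions[1:]:
        if cls == merged[-1][1]:
            merged[-1] = (cut, cls)
        else:
            merged.append((cut, cls))
    return merged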

22 Notice how rough and ready this is It’s a series of rules of thumb to fix problems caused by the previous rule of thumb You are essentially guaranteed some misclassifications However, on the whole, you hope that these heuristics result in a significant proportion of correct classifications 22

23 Discussion 1R is fast and easy 1R quite often performs only slightly less well than advanced techniques It makes sense to start simple in order to get a handle on a data set Go to something more complicated if desired 23

24 At the end of this section, the text describes a more complicated kind of 1R The details are unimportant Their concluding point is that, with experience, an initial examination of the data with simple techniques may give you insight into which more advanced technique might be suitable for it 24

25 4.2 Statistical Modeling 25

26 This is basically a discussion of an application of Bayes’ Theorem Bayes’ Theorem makes a statement about what is known as conditional probability I will cover the same ideas as the book, but I will do it in a slightly different way Whichever explanation makes the most sense to you is the “right” one 26

27 The book refers to this approach as Naïve Bayes It is based on the simplifying assumption that the attributes in a data set are independent Independence isn’t typical Otherwise there would be no associations to mine Even so, the technique gives good results 27

28 The Weather Example Table 4.2, shown on the following overhead, summarizes the outcome of play yes/no for each weather attribute for all instances in the training set 28

29 [Table 4.2: counts and relative frequencies of each attribute value under play = yes and play = no for the weather data; image not captured in this transcript] 29

30 Note this in particular on first viewing: 9 total yeses, 5 total nos The middle part of the table shows the raw counts of attribute value occurrences in the data set The bottom part of the table is the instructive one 30

31 Take the Outlook attribute for example: Sunny: yes 2/9, no 3/5; Overcast: yes 4/9, no 0/5; Rainy: yes 3/9, no 2/5 Given the outcome, yes/no (the denominator), these fractions tell you the likelihood that there was a given outlook (the numerator) 31

32 Bayes’ Theorem Bayes’ Theorem involves a hypothesis, H, and some evidence, E, relevant to the hypothesis The theorem gives a formula for finding the probability that H is true under the condition that you know that E is true 32

33 The theorem is based on knowing some probabilistic quantities related to the problem It is a statement about one conditional probability, expressed in terms of other probabilities This is the notation: P(A|B) = the probability of A, given that B is true 33

34 This is a statement of Bayes’ theorem: P(H|E) = P(E|H)P(H) / P(E) 34

35 Illustrating the Application of the Theorem with the Weather Example The book does its example with all of the attributes at once I will do this with one attribute and then generalize I will use the Outlook attribute Let H = (play = yes) Let E = (outlook = sunny) 35

36 Then P(H|E) = P(play = yes | outlook = sunny) By Bayes’ Theorem this equals P(outlook = sunny | play = yes) P(play = yes) / P(outlook = sunny) 36

37 Typing these things into PowerPoint is making me cry The following overheads show in words how you can intuitively understand what’s going on with the weather example They are based on the idea that you can express the probabilities in the expression in terms of fractions involving counts 37

38 [Slides 38-39: the Bayes expression above restated in terms of counts; the totals of 9 cancel, and the fraction reduces to (number of sunny days with play = yes) / (number of sunny days) = 2/5; the original images were not captured in this transcript] 39

40 After you’ve simplified like this, it’s apparent that you could do the calculation by just pulling 2 values out of the table However, the full formula where you can have multiple different E (weather attributes, for example) is based on using Bayes’ formula with all of the intermediate expressions 40

41 Before considering the case with multiple E, here is the simple case using the full Bayes’ formula Here are the fractions for the parts: P(E|H) = P(outlook = sunny | play = yes) = 2/9 P(H) = P(play = yes) = 9/14 P(E) = P(outlook = sunny) = 5/14 41

42 Then P(H|E) = P(E|H)P(H) / P(E) = (2/9 * 9/14) / (5/14) = (2/14) / (5/14) = 2/5 = .4 42

43 In other words, there were 2 sunny days when play = yes out of 5 sunny days total Using the same approach, you can find P(H|E) where H = (play = no) The arithmetic gives this result: .6 43

44 Using Bayes’ Theorem to Classify with More Than One Attribute The preceding example illustrated Bayes’ Theorem applied to one attribute A full Bayesian expression (conditional probability) will be derived including multiple pieces of evidence, one piece Ei for each attribute i The totality of evidence, including all attributes, is denoted E 44

45 Prediction can be done for a new instance with values for the i attributes The fractions from the weather table corresponding to the instance’s attribute values can be plugged into the Bayesian expression The result would be a probability, or prediction that play = yes or play = no for that set of attribute values 45

46 Statistical Independence Recall that Naïve Bayes assumed statistical independence of the attributes Stated simply, one of the results of statistics is that the probability that two independent events both occur is the product of their individual probabilities This is reflected when forming the expression for the general case 46

47 A Full Bayesian Example with Four Attributes The weather data had four attributes: outlook, temperature, humidity, windy Let E be the composite of E1, E2, E3, and E4 In other words, an instance has values for each of these attributes, and fractional values for each of the attributes can be read from the table based on the training set 47

48 Bayes’ Theorem extended to this case looks like this: P(H|E) = P(E1|H) P(E2|H) P(E3|H) P(E4|H) P(H) / P(E) 48

49 Suppose you get a new data item and you’d like to make a prediction based on its attribute values Let them be: Outlook = sunny Temperature = cool Humidity = high Windy = true Let the hypothesis be that Play = yes 49

50 Referring back to the original table: P(E1|H) = P(outlook = sunny | play = yes) = 2/9 P(E2|H) = P(temperature = cool | play = yes) = 3/9 P(E3|H) = P(humidity = high | play = yes) = 3/9 P(E4|H) = P(windy = true | play = yes) = 3/9 P(H) = P(play = yes) = 9/14 50

51 The product of the quantities given on the previous overhead is the numerator of the Bayesian expression This product = .0053 Doing the same calculation to find the numerator of the expression for play = no gives the value .0206 51

52 In general, you would also be interested in the denominator of the expressions, P(E) However, in the case where there are only two alternative predictions, you don’t have to do a separate calculation You can arrive at the needed value by other means 52

53 The universe of possible values is just yes or no, so you can form the denominator by just adding the two numerator values The denominator is .0053 + .0206 So P(yes|E) = (.0053) / (.0053 + .0206) = 20.5% And P(no|E) = (.0206) / (.0053 + .0206) = 79.5% 53
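
For what it’s worth, the whole calculation can be checked in a few lines of Python (a sketch, with the fractions for the instance sunny / cool / humidity = high / windy = true hard-coded; the play = no fractions come from the no columns of the same table).

likelihood_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # numerator for play = yes, about .0053
likelihood_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # numerator for play = no, about .0206

total = likelihood_yes + likelihood_no                    # stands in for P(E)
print(round(likelihood_yes / total, 3))                   # 0.205, i.e. about 20.5%
print(round(likelihood_no / total, 3))                    # 0.795, i.e. about 79.5%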

54 In effect, instead of computing the complete expressions, we’ve normalized them However we arrive at the numeric figures, it is their relative values that make the prediction 54

55 Given the set of attribute values, it is approximately 4 times more likely that play will have the value no than that play will have the value yes Therefore, the prediction, or classification of the instance will be no (with approximately 80% confidence) 55

56 A Small Problem with the Bayes’ Theorem Formula If one of the probabilities in the numerator is 0, the whole numerator goes to 0 This would happen when the training set did not contain any instances with a particular value for an attribute i, but a new instance did You can’t compare yes/no probabilities if they have gone to 0 56

57 One solution approach is to add constants to the top and bottom of fractions in the expression This can be accomplished without changing the relative yes/no outcome I don’t propose to go into this in detail (now) To me it seems more appropriate for an advanced discussion later, if needed 57

58 Missing Values Missing values for one or more attributes in an instance to be classified are not a problem If an attribute value is missing, a fraction for its conditional probability is simply not included in the Bayesian formula In other words, you’re only doing prediction based on the attributes that do exist in the instance 58

59 Numeric Attributes The discussion so far has been based on the weather example As initially given, all of the attributes are categorical It is also possible to handle numeric attributes This involves a bit more work, but it’s straightforward 59

60 For the purposes of illustration, assume that the distribution of numeric attributes is normal In the summary of the training set, instead of forming fractions of occurrence counts as for nominal values, find the mean and standard deviation for numeric ones 60

61 The important thing to remember is that in the earlier example, you did summaries for both the yes and no cases For numeric data, you need to find the mean and standard deviation for both the yes and no cases These parameters, µ and σ, will appear in the calculation of the parts of the Bayesian expression 61

62 This is the normal probability density function (p.d.f.): f(x) = (1 / (σ sqrt(2π))) e^(-(x - µ)² / (2σ²)) If x is distributed according to this p.d.f., then f(x) gives the likelihood (strictly, the density) of observing the value x 62

63 Let µ and σ be the mean and standard deviation of some attribute for those cases in the training set where the value of play = yes Then put the value of x for that attribute into the equation 63

64 f(x) is the probability of x, given that the value of play = yes In other words, this is P(E|H), the kind of thing in the numerator of the formula for Bayes’ theorem 64

65 Now you can plug this into the Bayesian expression just like the fractions for the nominal attributes in the earlier example This procedure isn’t proven correct, but based on background knowledge in statistics it seems to make sense We’ll just accept it as given in the book and apply it as needed 65
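
A minimal sketch of how such a term could be computed (assuming normality as above; the example values of µ and σ are purely illustrative, not taken from the overheads):

import math

def gaussian_density(x, mu, sigma):
    # Value of the normal p.d.f. at x; this number is plugged into the
    # Bayesian product in place of a count-based fraction.
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# e.g., with illustrative per-class statistics mu = 73, sigma = 6.2 for a
# numeric temperature attribute given play = yes:
# gaussian_density(66, 73.0, 6.2)    # roughly 0.034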

66 Naïve Bayes for Document Classification The details for this appear in a box in the text, which means it’s advanced and not to be covered in detail in this course The basic idea is that documents can be classified by which words appear in them The occurrence of a word can be modeled as a Boolean yes/no 66

67 The classification can be improved if the frequency of words is also taken into account This is the barest introduction to the topic It may come up again later in the book 67

68 Discussion Like 1R, Naïve Bayes can often produce good results The rule of thumb remains, start simple It is true that dependency among attributes is a theoretical problem with Bayesian analysis and can lead to results which aren’t accurate 68

69 The presence of dependent attributes means multiple factors for the same feature in the Bayesian expression Potentially too much weight will be put on a feature with multiple factors One solution is to try and select only a subset of independent attributes to work with as part of preprocessing 69

70 For numeric attributes, if they’re not normally distributed, the normal p.d.f. shouldn’t be used If the attributes do fall into a known distribution, you can use its p.d.f. In the absence of any knowledge, the uniform distribution might be a starting point for an analysis, with statistical analysis revealing the actual distribution 70

71 4.3 Divide-and-Conquer: Constructing Decision Trees Note that like everything else in this course, this is a purely pragmatic presentation Ideas will be given Nothing will be proven The book gives things in a certain order I will try to cover pretty much the same things I will do it in a different order 71

72 When Forming a Tree… 1. The fundamental question at each level of the tree is always which attribute to split on In other words, given attributes x1, x2, x3, …, do you branch first on x1 or x2 or x3 …? Having chosen the first to branch on, which of the remaining ones do you branch on next, and so on? 72

73 2. Suppose you can come up with a function, the information (info) function This function is a measure of how much information is needed in order to make a decision at each node in a tree 3. You split on the attribute that gives the greatest information gain from level to level 73

74 4. A split is good if it means that little information will be needed at the next level down You measure the gain by subtracting the amount of information needed at the next level down from the amount needed at the current level 74

75 Defining an Information Function It is necessary to describe the information function more fully, first informally, then formally This is the guiding principle: The best split results in branches where the instances in each of the branches are all of the same classification The branches are leaves and you’ve arrived at a complete classification 75

76 No more splitting is needed No more information is needed Expressed somewhat formally: If the additional information needed is 0, then the information gain from the previous level(s) is 100% I.e., whatever information remained to be gained has been gained 76

77 Developing Some Notation Taking part of the book’s example as a starting point: Let some node have a total of 9 cases Suppose that eventually 2 classify as yes and 7 classify as no This notation represents the information function: info([2, 7]) 77

78 info([2, 7]) The idea is this: In practice, in advance we wouldn’t know that 9 would split into 2 and 7 This is a symbolic way of indicating the information that would be needed to classify the instances into 2 and 7 At this point we haven’t assigned a value to the expression info([2, 7]) 78

79 The next step in laying the groundwork of the function is simple arithmetic Let p1 = 2/9 and p2 = 7/9 These fractions represent the proportion of each case out of the total They will appear in calculations of information gain 79

80 Properties Required of the Information Function Remember the general description of the information function It is a measure of how much information is needed in order to make a decision at each node in a tree 80

81 This is a relatively formal description of the characteristics required of the information function 1. When a split gives a leaf that’s all one classification, no additional information should be needed at that leaf That is, the function applied to the leaf should evaluate to 0 81

82 2. Assuming you’re working with binary attributes, when a split gives a result node that’s exactly half and half in classification, the information needed should be a maximum We don’t know the function yet and how it computes a value, but however it does so, the half and half case should generate a maximum value for the information function at that node 82

83 The Multi-Stage Property 3. The function should have what is known as the multi-stage property An attribute may not be binary If it is not binary, you can accomplish the overall splitting by a series of binary splits 83

84 The multi-stage property says that not only should you be able to accomplish the splitting with a series of binary splits It should also be possible to compute the information function value of the overall split based on the information function values of the series of binary splits 84

85 Here is an example of the multi-stage property Let there be 9 cases overall, with 3 different classifications Let there be 2, 3, and 4 instances of each of the cases, respectively 85

86 This is the multi-stage requirement: info([2, 3, 4]) = info([2, 7]) + 7/9 * info([3, 4]) How to understand this: Consider the first term The info needed to split [2, 3, 4] includes the full cost of splitting 9 instances into two classifications, [2, 7] 86

87 You could rewrite the first term in this way: info([2, 7]) = 2/9 * info([2, 7]) + 7/9 * info([2, 7]) In other words, the cost is apportioned on an instance-by-instance basis A decision is computed for each instance The first term is “full cost” because it involves all 9 instances 87

88 It is the apportionment idea that leads to the second term in the expression overall: 7/9 * info([3, 4]) Splitting [3, 4] involves a particular cost per instance After the [2, 7] split is made, the [3, 4] cost is incurred in only 7 out of the 9 total cases in the original problem 88

89 Reiterating the Multi-Stage Property You can summarize this verbally as follows: Any split can be arrived at by a series of binary splits At each branch there is a per instance cost of computing/splitting The total cost at each branch is proportional to the number of cases at that branch 89

90 What is the Information Function? It is Entropy The information function for splitting trees is based on a concept from physics, known as entropy In physics (thermodynamics), entropy can be regarded as a measure of how “disorganized” a system is For our purposes, an unclassified data set is disorganized, while a classified data set is organized 90

91 The book’s information function, based on logarithms, will be presented No derivation from physics will be given Also, no proof that this function meets the requirements for an information function will be given 91

92 For what it’s worth, I tried to show that the function has the desired properties using calculus I was not successful Kamal Narang looked at the problem and speculated that it could be done as a multi-variable problem rather than a single variable problem… 92

93 I didn’t have the time or energy to pursue the mathematical analysis any further We will simply accept that the formula given is what is used to compute the information function 93

94 Intuitively, when looking at this, keep in mind the following: The information function should be 0 when there is no split to be done The information function should be maximum when the split is half and half The multi-stage property has to hold 94

95 Definition of the entropy() Function by Example Using the concrete example presented so far, this defines the information function based on a definition of an entropy function: info([2, 7]) = entropy(2/9, 7/9) = -(2/9) log2(2/9) - (7/9) log2(7/9) 95

96 Note that since the logarithms of values <1 are negative, the minus signs on the terms lead to a positive value overall 96

97 General Definition of the entropy() Function Recall that pi can be used to represent proportional fractions of classification at nodes in general Then the info(), entropy() function for 2 classifications can be written: -p1 log2(p1) - p2 log2(p2) 97

98 For multiple classifications you get: -p1 log2(p1) - p2 log2(p2) - … - pn log2(pn) 98
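
Written out as code (a small sketch, using the usual convention that a zero fraction contributes zero to the sum):

from math import log2

def entropy(*fractions):
    return -sum(p * log2(p) for p in fractions if p > 0)

def info(counts):
    total = sum(counts)
    return entropy(*(c / total for c in counts))

# info([2, 7]) equals entropy(2/9, 7/9), roughly 0.764 bits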

99 Information, Entropy, with the Multi-Stage Property Remember that in the entropy version of the information function the pi are fractions Let p, q, and r represent fractions where p + q + r = 1 Then this is the book’s presentation of the formula based on the multi-stage property: entropy(p, q, r) = entropy(p, q + r) + (q + r) × entropy(q / (q + r), r / (q + r)) 99

100 Characteristics of the Information, Entropy Function Each of the logarithms in the expression is taken on a positive fraction less than one The logarithms of these fractions are negative The minus signs on the terms of the expression reverse this The value of the information function is positive overall 100

101 Note also that each term consists of a fraction multiplied by the logarithm of a fraction, where the sum of the coefficient fractions is 1 For the two-classification case the expression reaches its maximum value of 1 at the half and half split In general, with n classifications the information value is at most log2(n) 101

102 Logarithms base 2 are used If the properties for the information function hold for base 2 logarithms, they would hold for logarithms of any base In a binary world, it’s convenient to use 2 Incidentally, although we will use decimal numbers, the values of the information function can be referred to as “bits” of information 102

103 An Example Applying the Information Function to Tree Formation The book’s approach is to show you the formation of a tree by splitting and then explain where the information function came from My approach has been to tell you about the information function first Now I will work through the example, applying it to forming a tree 103

104 Start by considering Figure 4.2, shown on the following overhead The basic question is this: Which of the four attributes is best to branch on? If a decision leads to pure branches, that’s the best If the branches are not all pure, you use the information function to decide which branching is the best 104

105 [Figure 4.2: tree stumps for branching on each of the four weather attributes; image not captured in this transcript] 105

106 Another way of posing the question is: Which branching option gives the greatest information gain? 1. Calculate the amount of information needed at the previous level 2. Calculate the information needed if you branch on each of the four attributes 3. Calculate the information gain by finding the difference 106

107 In the following presentation I am not going to show the arithmetic of finding the logs, multiplying by the fractions, and summing up the terms I will just present the numerical results given in the book 107

108 Basic Elements of the Example In this example there isn’t literally a previous level We are at the first split, deciding which of the 4 attributes to split on There are 14 instances The end result classification is either yes or no (binary) And in the training data set there are 9 yeses and 5 nos 108

109 The “Previous Level” in the Example The first measure, the so-called previous level, is simply a measure of the information needed overall to split 14 instances between 2 categories of 9 and 5 instances, respectively info([9, 5]) = entropy(9/14, 5/14) = .940 109

110 The “Next Level” Branching on the “outlook” Attribute in the Example Now consider branching on the first attribute, outlook It is a three-valued attribute, so it gives three branches You calculate the information needed for each branch You multiply each value by the proportion of instances for that branch You then add these values up 110

111 This sum represents the total information needed after branching You subtract the information still needed after branching from the information needed before branching to arrive at the information gained by branching 111

112 The Three “outlook” Branches Branch 1 gives: info([2, 3]) = entropy(2/5, 3/5) = .971 Branch 2 gives: info([4, 0]) = entropy(4/4, 0/4) = 0 Branch 3 gives: info([3, 2]) = entropy(3/5, 2/5) = .971 112

113 In total: info([2, 3], [4, 0], [3, 2]) = (5/14)(.971) + (4/14)(0) + (5/14)(.971) = .693 Information gain = .940 - .693 = .247 113

114 Branching on the Other Attributes If you do the same calculations for the other three attributes, you get this: Temperature: info gain = .029 Humidity: info gain = .152 Windy: info gain = .048 Branching on outlook gives the greatest information gain 114
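
These numbers are easy to reproduce. The sketch below is self-contained (it repeats the entropy helper); the per-branch class counts for the other three attributes are my own reading of the weather data, not shown explicitly on the overheads.

from math import log2

def info(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

def split_info(branches):
    n = sum(sum(b) for b in branches)
    return sum(sum(b) / n * info(b) for b in branches)

before = info([9, 5])                                     # 0.940
splits = {"outlook":     [[2, 3], [4, 0], [3, 2]],
          "temperature": [[2, 2], [4, 2], [3, 1]],
          "humidity":    [[3, 4], [6, 1]],
          "windy":       [[6, 2], [3, 3]]}
for attr, branches in splits.items():
    print(attr, round(before - split_info(branches), 3))  # 0.247, 0.029, 0.152, 0.048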

115 Tree Formation Forming a tree in this way is a “greedy” algorithm You split on the attribute with the greatest information gain (outlook) You continue recursively with the remaining attributes/levels of the tree The desired outcome of this greedy approach is to have as small and simple a tree as possible 115

116 A Few More Things to Note 1. Intuitively you might suspect that outlook is the best choice because one of its branches is pure For the overcast outcome, there is no further branching to be done Intuition is nice, but you can’t say anything for sure until you’ve done the math 116

117 2. When you do this, the goal is to end up with leaves that are all pure Keep in mind that the instances in a training set may not be consistent It is possible to end up, after a series of splits, with both yes and no instances in the same leaf node It is simply the case that values for the attributes at hand don’t fully determine the classification outcome 117

118 Highly Branching Attributes Recall the following idea: It is possible to do data mining and “discover” a 1-1 mapping from an identifier to a corresponding class value This is correct information, but you have “overfitted” No future instance will have the same identifier, so this is useless for practical classification prediction 118

119 A similar problem can arise with trees From a node that represented an ID, you’d get a branch for each ID value, and one correctly classified instance in each child If such a key attribute existed in a data set, splitting based on information gain as described above would find it 119

120 This is because at each ID branch, the resulting leaf would be pure It would contain exactly one correctly classified instance No more information would be needed for any of the branches, so none would be needed for all of them collectively 120

121 Whatever the gain was, it would equal the total information still needed at the previous level The information gain would be 100% Recall that you start forming the tree by trying to find the best attribute to branch on The ID attribute will win every time and no further splitting will be needed 121

122 A Related Idea As noted, a tree based on an ID will have as many leaves as there are instances In general, this method for building trees will prefer attributes that have many branches, even if these aren’t ID attributes This goes against the grain of the goal of having small, simple trees 122

123 The preference for many branches can be informally explained The greater the number of branches, the fewer the number of instances per branch, on average The fewer the number of instances per branch, the greater the likelihood of a pure branch or nearly pure branch 123

124 Counteracting the Preference for Many Branches The book goes into further computational detail I’m ready to simplify The basic idea is that instead of calculating the desirability of a split based on information gain alone, you calculate the gain ratio of the different branches and choose on that basis 124

125 The gain ratio takes into account the number and size of the child nodes a split generates What happens, more or less, is that the information gain is divided by the intrinsic information of the split itself (the entropy of the branch sizes), which grows as the number of branches grows 125

126 Once you go down this road, there are further complications to keep in mind First of all, this adjustment doesn’t protect you from a split on an ID attribute because it will still win Secondly, if you use the gain ratio this can lead to branching on a less desirable attribute 126

127 The book cites this rule of thumb: Provisionally pick the attribute with the highest gain ratio Find the average absolute information gain for branching on each attribute Check the absolute information gain of the provisional choice against the average Only take the ratio winner if it is greater than the average 127
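
A hedged sketch of the gain ratio idea, using the entropy of the branch sizes as the divisor (the standard C4.5-style “intrinsic information”; the overheads only describe this adjustment approximately):

from math import log2

def H(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

def gain_ratio(branches, parent_counts):
    n = sum(sum(b) for b in branches)
    gain = H(parent_counts) - sum(sum(b) / n * H(b) for b in branches)
    intrinsic = H([sum(b) for b in branches])   # entropy of the branch sizes alone
    return gain / intrinsic if intrinsic else 0.0

# Outlook on the weather data: gain of about 0.247 divided by the intrinsic
# information of a 5/4/5 split (about 1.577), giving a gain ratio near 0.157.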

128 Discussion Divide and conquer construction of decision trees is also known as top-down induction of decision trees The developers of this scheme have come up with methods for dealing with: –Numeric attributes –Missing values –Noisy data –The generation of rule sets from trees 128

129 For what it’s worth, the algorithms discussed above have names ID3 is the name for the basic algorithm C4.5 refers to a practical implementation of this algorithm, with improvements, for decision tree induction 129

130 4.4 Covering Algorithms: Constructing Rules Recall, in brief, how top-down tree construction worked: Attribute by attribute, pick the one that does the best job of separating instances into distinct classes Covering is a converse approach It works from the bottom up 130

131 The goal is to create a set of classification rules directly, without building a tree By definition, a training set with a classification attribute is explicitly grouped into classes The goal is to come up with (simple) rules that will correctly classify some new instance that comes along I.e., the goal is prediction 131

132 The process goes like this: Choose one of the classes present in the training data set Devise a rule based on one attribute that “covers” only instances of that class 132

133 In the ideal case, “cover” would mean: You wrote a single rule on a single attribute (a single condition) That rule identified all of the instances of one of the classes in a data set That rule identified only the instances of that class in the data set 133

134 In practice some rule may not identify or “capture” all of the instances of the class It may also not capture only instances of the class If so, the rule may be refined by adding conditions on other attributes (or other conditions on the attribute already used) 134

135 Measures of Goodness of Rules Recycling some of the terminology for trees, the idea can be expressed this way: When you pick a rule, you want its results to be “pure” “Pure” in this context is analogous to pure with trees The rule, when applied to instances, should ideally give only one classification as a result 135

136 The rule should ideally also cover all instances You make the rule stronger by adding conditions Each addition should improve purity/coverage 136

137 When building trees, picking the branching attribute was guided by an objective measure of information gain When building rule sets, you also need a measure of the goodness of the covering achieved by rules on different attributes 137

138 The Rule Building Process in More Detail Just to cut to the chase, evaluating the goodness of a rule just comes down to counting It doesn’t involve anything fancy like entropy You just compare rules based on how well they cover a classification 138

139 Suppose you pick some classification value, say Y The covering process generates rules of this kind: If(the attribute of interest takes on value X) Then the classification of the instance is Y 139

140 Notation In order to discuss this with some precision, let this notation be given: A given data set will have m attributes Let an individual attribute be identified by the subscript i Thus, one of the attributes would be identified in this way: attribute i 140

141 A given attribute will have n different values Let an individual value be identified by the subscript j Thus, the specific, j th value for the i th attribute could be identified in this way: X i,j 141

142 There could be many different classifications However, in describing what’s going on, we’re only concerned with one classification at a time There is no need to introduce subscripting on the classifications A classification will simply be known as Y 142

143 The kind of rule generated by covering could be made more specific and compact: If(the attribute of interest takes on value X) Then the classification of the instance is Y ≡ if(attribute i = X i,j ) then classification = Y 143

144 Then for some combination of attribute and value: attribute i = X i,j Let t represent the total number of instances in the training set where this condition holds true Let p represent the number of those instances that are classified as Y 144

145 The ratio p/t is like a success rate for predicting Y with each attribute and value pair Notice how this is pretty much like a confidence ratio in association rule mining Now you want to pick the best rule to add to your rule set 145

146 Find p/t for all of the attributes and values in the data set Pick the highest p/t (success) ratio Then the condition for the attribute and value with the highest p/t ratio becomes part of the covering rule: if(attribute i = X i,j ) then classification = Y 146

147 Costs of Doing This Note that it is not complicated to identify all of the cases, but potentially there are many of them to consider There are m attributes with a varying number of different values, represented as n In general, there are m × n cases For each case you have to count t and p 147

148 Reiteration of the Sources of Error Given the rule just devised: There may also be instances in the data set where attribute i = X i,j However, these may be instances that are not classified as Y 148

149 This means you’ve got a rule with “impure” results You want a high p/t ratio of success You want a low absolute t – p count of failures In reality, you would prefer to add only rules with a 100% confidence rate to your rule set 149

150 This is the second kind of problem with any interim rule: There may be instances in the data set where attribute i <> X i,j However, these may be instances that are classified as Y Here, the problem is less severe This just means that the rule’s coverage is not complete 150

151 Refining the Rule You refine the rule by repeating the process outlined above You just added a rule with this condition: attribute i = X i,j To that you want to add a new condition that does not involve attribute i 151

152 When counting successes for the new condition, you count only the instances in the data set that the rule so far already covers, i.e., those where attribute i = X i,j holds Find p and t for the remaining attributes and their values over that covered subset Pick the attribute-value condition with the highest p/t ratio Add this condition to the existing rule using AND 152

153 Continue until you’re satisfied Or until you’ve run through all of the attributes Or until you’ve run out of instances to classify 153
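
To make the rule-growing loop concrete, here is a minimal sketch of building one rule for one class in the spirit just described (the data format, a list of attribute dictionaries plus a list of labels, and the names are my assumptions, not the book’s):

def grow_rule(instances, labels, target_class):
    rule = []                                  # list of (attribute, value) conditions, ANDed together
    covered = list(zip(instances, labels))
    used = set()
    while covered and any(label != target_class for _, label in covered):
        best, best_ratio = None, -1.0
        for attr in covered[0][0]:
            if attr in used:
                continue
            for val in {inst[attr] for inst, _ in covered}:
                subset = [label for inst, label in covered if inst[attr] == val]
                t = len(subset)
                p = sum(1 for label in subset if label == target_class)
                if p / t > best_ratio:         # highest p/t (success) ratio wins
                    best, best_ratio = (attr, val), p / t
        if best is None:
            break                              # ran out of attributes to refine on
        rule.append(best)
        used.add(best[0])
        covered = [(i, l) for i, l in covered if i[best[0]] == best[1]]
    return rule

Repeating this, removing the instances a finished rule covers, and cycling over the classes gives the separate-and-conquer behavior summarized on the following overheads.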

154 Remember that what has been described thus far is the process of covering 1 classification You repeat the process for all classifications Or you do n – 1 of the classifications and leave the nth one as the default case 154

155 Suppose you do this exhaustively, completely and explicitly covering every class The rules will tend to have many conditions If you only added individual rules with p/t = 100% they will also be “perfect” In other words, for the training set there will be no ambiguity whatsoever 155

156 Compare this with trees Successive splitting on attributes from the top down doesn’t guarantee pure (perfect) leaves Working from the bottom up you can always devise sufficiently complex rule sets to cover all of the existing classes 156

157 Rules versus Decision Lists The rules derived from the process given above can be applied in any order For any one class, it’s true that the rule is composed of multiple conditions which successively classify more tightly However, the end result of the process is a single rule with conditions in conjunction 157

158 In theory, you could apply these parts of a rule in succession That would be the moral equivalent of testing the conditions in order, from left to right However, since the conditions are in conjunction, you can test them in any order with the same result 158

159 In the same vein, it doesn’t matter which order you handle the separate classes in If an instance doesn’t fall into one class, move on and try the next 159

160 The derived rules are “perfect” at most for all of the cases in the training set (only) It’s possible to get an instance in the future where >1 rule applies or no rule applies As usual, the solutions to these problems are of the form, “Assign it to the most frequently occurring…” 160

161 The book summarizes the approach given above as a separate-and-conquer algorithm This is pretty much analogous to a divide and conquer algorithm Working class by class is clearly divide and conquer 161

162 Within a given class the process is also progressive, step-by-step You bite off a chunk with one rule, then bite off the next chunk with another rule, until you’ve eaten everything in the class Notice that like with trees, the motivation is “greedy” You always take the best p/t ratio first, and then refine from there 162

163 For what it’s worth, this algorithm for finding a covering rule set is known as PRISM 163

164 The End 164

