1
Data Mining Chapter 4, Part 1 Algorithms: The Basic Methods
Kirk Scott
2
Dendrogramma, a New Phylum
3
Did you have any idea that symbols like these could be inserted into PowerPoint presentations?
Ѿ ҉ ҈ ۞ ۩ ҂
4
Basic Methods A good rule of thumb is to try the simple things first
Quite frequently the simple things will be good enough or will provide useful insights for further explorations One of the meta-tasks of data mining is figuring out which algorithm is the right one for a given data set
5
Certain data sets have a certain structure
Certain algorithms are designed to elicit particular kinds of structures The right algorithm applied to the right set will give straightforward results A mismatch between algorithm and data set will give complicated, cloudy results
6
This is just another thing where you have to accept the fact that it’s magic, or a guessing game
Or you could call it exploratory research Based on experience, you might have some idea of which data mining algorithm might be right for a given data set Otherwise, you just start trying them and seeing what kind of results you get
7
Chapter 4 is divided into 8 basic algorithm descriptions plus a 9th topic
The first four algorithms are covered by this set of overheads They are listed on the following overhead
8
4.1 Inferring Rudimentary Rules
4.2 Statistical Modeling 4.3 Divide and Conquer: Constructing Decision Trees 4.4 Covering Algorithms: Constructing Rules
9
4.1 Inferring Rudimentary Rules
The 1R (1-rule) approach Given a training set with correct classifications, make a one-level decision tree based on one attribute Each value of the attribute generates a branch Each branch contains the collection of instances that have that attribute value
10
For each collection of instances belonging to an attribute value, count up the number of occurrences of each classification Let the predicted classification for that branch be the classification that has the greatest number of occurrences in the training set
11
Do this for each attribute
Out of all of the trees generated, pick the one that has the lowest error rate The error rate is the count of the total number of misclassified instances across the training set for the rule set
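As an illustration, here is a minimal Python sketch of 1R for nominal attributes as just described. The function name one_r and the (attribute-value dictionary, class label) data format are assumptions made for the example, not anything from the book.

from collections import Counter, defaultdict

def one_r(instances, attributes):
    # instances: list of (dict of attribute values, class label) pairs
    best = None
    for attr in attributes:
        # Count classification occurrences for each value of this attribute
        by_value = defaultdict(Counter)
        for values, label in instances:
            by_value[values[attr]][label] += 1
        # The rule for each branch is the majority classification for that value
        rules = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
        # Error rate: misclassified instances across the whole training set
        errors = sum(1 for values, label in instances if rules[values[attr]] != label)
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best  # (chosen attribute, {value: predicted class}, error count)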
12
This simple approach frequently works well
This suggests that for a lot of data sets, one dominant attribute is a strong determinant
13
Missing Values Missing values are easily handled by 1R
Missing is just one of the branches in the decision tree 1R is fundamentally nominal
14
Numeric Attributes If 1R is fundamentally nominal, how do you decide how to branch on a numeric attribute? One approach to branching on numerics: Sort the numeric instances Create a break point everywhere in the sequence that the classification changes This partitions the domain
15
Overfitting The problem is that you may end up with lots of break points/partitions If there are lots of break points, this is counterproductive Rather than grouping things into categories, you’re fragmenting them This is a sign that you’re overfitting to the existing, individual instances in the data set
16
In the extreme case, there are as many determinant values as there are classifications
This is not good You’ve essentially determined a 1-1 coding from the attribute in question to the classification
17
If this happens, instances in the future that do not have these values for the attributes can’t be classified by the system They will not fall into any known partition This is an extreme case It is the classic case of overfitting
18
The less extreme case is when a model tends to have poor prediction performance due to overfitting to the existing data In other words, the model will make predictions, but the error rate is higher than a model that is less tightly fitted to the training data
19
Dealing with Overfitting in 1R
This is a blanket rule of thumb to deal with the foregoing problem: When picking the numeric ranges that define the branches of the tree, specify that a minimum number, n, of instances from the training set have to be in each partition
20
The ranges are taken from the sorted values in order
Wherever you’re at in the scheme, the problem that arises now is that the next n instances may include more than one classification The solution is to take the majority classification of the n as the rule for that branch
21
The latest rule of thumb, given above, may result in neighboring partitions with the same classification The solution to that is to merge those partitions This potentially will reduce the number of partitions significantly
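The numeric handling just described can be sketched in a few lines of Python. This follows the rough procedure in these overheads (take the sorted instances n at a time, predict the majority class, merge neighboring partitions with the same prediction); the name partition_numeric and the (value, class) pair format are assumptions for the example.

from collections import Counter

def partition_numeric(pairs, n=3):
    # pairs: list of (numeric value, class label); n: minimum instances per partition
    if not pairs:
        return []
    pairs = sorted(pairs, key=lambda p: p[0])
    chunks = [pairs[i:i + n] for i in range(0, len(pairs), n)]
    labeled = []
    for chunk in chunks:
        majority = Counter(label for _, label in chunk).most_common(1)[0][0]
        labeled.append([chunk[0][0], chunk[-1][0], majority])  # [low, high, predicted class]
    # Merge neighboring partitions that predict the same classification
    merged = [labeled[0]]
    for low, high, cls in labeled[1:]:
        if cls == merged[-1][2]:
            merged[-1][1] = high
        else:
            merged.append([low, high, cls])
    return merged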
22
Notice how rough and ready this is
It’s a series of rules of thumb to fix problems caused by the previous rule of thumb You are essentially guaranteed some misclassifications However, on the whole, you hope that these heuristics result in a significant proportion of correct classifications
23
Discussion 1R is fast and easy
1R quite often performs only slightly less well than advanced techniques It makes sense to start simple in order to get a handle on a data set Go to something more complicated if desired
24
At the end of this section, the text describes a more complicated kind of 1R
The details are unimportant Their concluding point is that, with experience, an initial examination of the data with simple techniques may give you insight into which more advanced technique might be suitable for it
25
4.2 Statistical Modeling
26
This is basically a discussion of an application of Bayes’ Theorem
Bayes’ Theorem makes a statement about what is known as conditional probability I will cover the same ideas as the book, but I will do it in a slightly different way Whichever explanation makes the most sense to you is the “right” one
27
The book refers to this approach as Naïve Bayes
It is based on the simplifying assumption that the attributes in a data set are independent Independence isn’t typical Otherwise there would be no associations to mine Even so, the technique gives good results
28
The Weather Example The next overhead shows the table of weather/game play data as a starting point for considering the application of Bayes’ Theorem
30
Table 4.2, shown on the following overhead, summarizes the outcome of play yes/no for each weather attribute for all instances in the training set
32
Note this in particular on first viewing:
9 total yeses, 5 total nos The middle part of the table shows the raw counts of attribute value occurrences in the data set The bottom part of the table is the instructive one, containing ratios of yes/no for given cases to total yes/no overall
33
Take the Outlook attribute for example:
Sunny: yes 2/9, no 3/5
Overcast: yes 4/9, no 0/5
Rainy: yes 3/9, no 2/5
Given the outcome, yes/no (the denominator), these fractions tell you the likelihood that there was a given outlook (the numerator)
34
Bayes’ Theorem Bayes’ Theorem involves a hypothesis, H, and some evidence, E, relevant to the hypothesis The theorem gives a formula for finding the probability that H is true under the condition that you know that E is true
35
The theorem is based on knowing some probabilistic quantities related to the problem
These known quantities are called conditions The thinking is based on the concept, “if a certain condition holds, what is the probability that…?”
36
Bayes’ Theorem is a statement about what is called conditional probability
This is the notation: P(H|E) = the probability of H given that E is true
37
These are the other probabilistic factors which appear in Bayes’s Theorem:
P(H), the probability of the hypothesis without any conditions P(E), the probability of the condition by itself P(E|H), the probability that the condition holds, given that the hypothesis does hold
38
This is a statement of Bayes’ theorem:
P(H|E) = P(E|H)P(H) / P(E)
39
Note 1: No proof will be given
However, why this works may become intuitively apparent when working an example with real numbers Note 2: The theorem is given in terms of P(H|E) But notice that if you have any 3 of the factors, you can solve for the 4th one
40
Illustrating the Application of the Theorem with the Weather Example
The book does its example with all of the attributes at once I will do this with one attribute and then generalize I will use the Outlook attribute Let H = (play = yes) Let E = (outlook = sunny)
41
Then P(H|E) = P(play = yes | outlook = sunny)
By Bayes’ Theorem this equals
P(outlook = sunny | play = yes)P(play = yes) / P(outlook = sunny)
42
Typing these things into PowerPoint is making me cry
The following overheads show in words how you can intuitively understand what’s going on with the weather example They are based on the idea that you can express the probabilities in the expression in terms of fractions involving counts
45
This is the result for just one attribute
After you’ve simplified like this, it’s apparent that you could form the fraction by just pulling 2 values out of the table
46
However, Bayes’ Theorem can be extended to include all of the attributes
At that point, you need to apply the formula and it will be necessary to calculate intermediate expressions It is not as simple as just pulling two values out of the table
47
Based on the knowledge that it will become necessary to apply the formula, here is the previous example done with the formula
These are the fractions for the parts:
P(E|H) = P(outlook = sunny | play = yes) = 2/9
P(H) = P(play = yes) = 9/14
P(E) = P(outlook = sunny) = 5/14
48
Then P(H|E) = P(E|H)P(H) / P(E) = (2/9 * 9/14) / (5/14) = (2/14) / (5/14) = 2/5 = .4
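As a quick arithmetic check in Python, using P(outlook = sunny) = 5/14 as above:

p_e_given_h = 2 / 9     # P(outlook = sunny | play = yes)
p_h = 9 / 14            # P(play = yes)
p_e = 5 / 14            # P(outlook = sunny): 5 sunny days out of 14
print(p_e_given_h * p_h / p_e)   # 0.4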
49
In other words, there were 2 sunny days when play = yes out of 5 sunny days total
Using the same approach, you can find P(H|E) where H = (play = no) The arithmetic gives this result: .6
50
Of course, .4 + .6 = 1, the total probability
But note that before we’re finished we will be interested in situations where the probabilistic quantities we’re working with don’t sum up to 1
51
Using Bayes’ Theorem to Classify with More Than One Attribute
The preceding example illustrated Bayes’ Theorem applied to one attribute A full Bayesian expression (conditional probability) will be derived, including multiple pieces of evidence, Ei, one for each of the i attributes The totality of evidence, including all attributes would be shown as E without a subscript
52
Prediction can be done for a new instance with values for the i attributes
The fractions from the weather table corresponding to the instance’s attribute values can be plugged into the Bayesian expression
53
The result would be a probability, or prediction that play = yes or play = no, for that set of attribute values Step back and consider what this means We’re saying that based on conditional probabilities in a given (training) data set, it is possible to make predictions
54
We’re not talking about a statistical technique, like multivariate regression analysis
This is purely probabilistic reasoning based on counts of existing outcomes The application of Bayes’ Theorem is a form of data mining
55
Statistical Independence
Recall that Naïve Bayes assumed statistical independence of the attributes Stated simply, one of the results of statistics is that the probability of two events’ occurring is the product of their individual probabilities This is reflected when forming the expression for the general case
56
A Full Bayesian Example with Four Attributes
The weather data had four attributes: outlook, temperature, humidity, windy Let E be the composite of E1, E2, E3, and E4 In other words, an instance has values for each of these attributes, and fractions for the proportion of values for each of the attributes can be read from the table based on the training set
57
Bayes’ Theorem extended to this case looks like this:
P(H|E) = P(E1|H)P(E2|H)P(E3|H)P(E4|H)P(H) / P(E)
58
Note how statistical independence comes into the numerator of the expression
What could be expressed overall as P(E|H) (where E is now composite) is expressed as follows: P(E1|H)P(E2|H)P(E3|H)P(E4|H) This is true if the Ei are independent
59
Suppose you get a new data item and you’d like to make a prediction based on its attribute values
Let them be: Outlook = sunny Temperature = cool Humidity = high Windy = true Let the hypothesis be that Play = yes
60
Referring back to the original table:
P(E1|H) = P(sunny | play = yes) = 2/9
P(E2|H) = P(cool | play = yes) = 3/9
P(E3|H) = P(high humidity | play = yes) = 3/9
P(E4|H) = P(windy | play = yes) = 3/9
P(H) = P(play = yes) = 9/14
61
The product of the quantities given on the previous overhead is the numerator of the Bayesian expression This product = .0053 Doing the same calculation to find the numerator of the expression for play = no gives the value .0206
62
This is the case where you might have expected the two values to sum to 1
These are only the numerators of the conditional probabilities, and they sum to just .0259 However, together, the sum of these values represents the “universe” of possible outcomes, yes or no, for the given E
63
In other words, .0259 = P(E), the overall probability of the condition(s)
And P(E) is the value needed in the denominator when applying the formula for Bayes’ Theorem There is no need to calculate P(E) in some other way—it is this sum
64
When you complete the formula by dividing the yes/no numerator values by P(E) = .0259, you get:
P(yes|E) = .0053 / .0259 = 20.5%
P(no|E) = .0206 / .0259 = 79.5%
Notice that now the probabilities sum to 1
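The whole calculation can be reproduced in a few lines of Python. The play = yes fractions are the ones listed above; the play = no fractions (3/5, 1/5, 4/5, 3/5 and 5/14) are my reading of the book’s summary table, so treat them as assumptions.

# New day: outlook = sunny, temperature = cool, humidity = high, windy = true
yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # numerator for play = yes, about .0053
no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # numerator for play = no, about .0206

p_e = yes + no                                 # about .0259
print(yes / p_e, no / p_e)                     # about 0.205 and 0.795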
65
The products that formed the numerators of the Bayesian expressions for yes and no did not sum to 1
In effect, by dividing by P(E), their sum, we’ve normalized them The end result is a set of relative values that makes it possible to make a probabilistic prediction of H given E
66
Given the set of attribute values we started with, it is approximately 4 times more likely that play will have the value no than that play will have the value yes Therefore, the prediction, or classification of the instance will be no, with a confidence of 79.5% Note that this “confidence” is not the same as the formal concept of confidence in other areas of statistics
67
A Small Problem with the Bayes’ Theorem Formula
If one of the probabilities in the numerator is 0, the whole numerator goes to 0 This would happen when, for a given class, the training set contained no instances with a particular value of some attribute, but a new instance to be classified has that value You can’t compare yes/no probabilities if one of them has gone to 0
68
One solution approach is to add constants to the top and bottom of fractions in the expression
This can be accomplished without changing the relative yes/no outcome I don’t propose to go into this in detail now A similar idea will come up again in a later part of the text
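A minimal sketch of the simplest version of this idea, an add-one (Laplace) estimate, is shown below; the book’s version is more general, so take this only as an illustration of the effect.

def smoothed_fraction(count, total, k):
    # count: instances of this class with this attribute value
    # total: instances of this class
    # k:     number of distinct values the attribute can take
    return (count + 1) / (total + k)

# Outlook = overcast never occurs with play = no (0 out of 5 in the training set),
# but the smoothed estimate is nonzero instead of wiping out the whole product:
print(smoothed_fraction(0, 5, 3))   # 0.125 rather than 0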
69
Missing Values Missing values for one or more attributes in an instance to be classified are not a problem Do not confuse this with the previous problem We’re saying now that a given instance is missing an attribute value, not that the entire data set is missing an attribute value
70
If an instance attribute value is missing, a fraction for its conditional probability is simply not included in the Bayesian formula In other words, you’re only doing prediction based on the attributes that do exist in the instance
71
Numeric Attributes The discussion so far has been based on the weather example As initially given, all of the attributes are categorical It is also possible to handle numeric attributes with Bayes’ Theorem This involves a bit more work, but it’s straightforward
72
In the summary of the training set of nominal values, fractions representing frequencies of occurrence were formed For a data set containing numeric attributes, assume that the distributions of values for those attributes are normal Find the mean and standard deviation of the distributions of the numeric attributes
73
The important thing to remember is that in the earlier example, you did summaries for both the yes and no cases For numeric data, you need to find the mean and standard deviation for both the yes and no cases These parameters, µ and σ, will appear in the calculation of the parts of the Bayesian expression
74
This is the normal probability density function (p.d.f.):
f(x) = (1 / (σ √(2π))) e^(−(x − µ)² / (2σ²))
If x is distributed according to this p.d.f., then f(x) is the probability density at the value x; it is treated as the (relative) likelihood that the value x will occur
75
Let µ and σ be the mean and standard deviation of some attribute for those cases in the training set where the value of play = yes Then put the value of x for that attribute into the equation
76
f(x) is the probability of x, given that the value of play = yes
In other words, this is P(E|H), the kind of thing in the numerator of the formula for Bayes’ theorem
77
Now you can plug this into the Bayesian expression just like the fractions for the nominal attributes in the earlier example This procedure isn’t proven correct, but based on background knowledge in statistics it seems to make sense We’ll just accept it as given in the book and apply it as needed
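A small Python sketch of this step, under the assumption of normality: compute f(x) from µ and σ and use it in place of a count fraction. The example values µ = 73 and σ = 6.2 (temperature over the play = yes instances) are meant only as illustration.

import math

def normal_pdf(x, mu, sigma):
    # Normal density: the f(x) used in place of a count fraction in the numerator
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Density of temperature = 66 given play = yes, with illustrative parameters
print(normal_pdf(66, mu=73.0, sigma=6.2))   # about 0.034, plays the role of P(E_i | H)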
78
Naïve Bayes for Document Classification
The details for this appear in a box in the text, which means it’s advanced and not to be covered in detail in this course The basic idea is that documents can be classified by which words appear in them The occurrence of a word can be modeled as a Boolean yes/no
79
The classification can be improved if the frequency of words is also taken into account
This is the barest introduction to the topic It may come up again later in the book
80
Discussion Like 1R, Naïve Bayes can often produce good results
The rule of thumb remains, start simple It is true that dependency among attributes is a theoretical problem with Bayesian analysis and can lead to results which aren’t accurate
81
The presence of dependent attributes means multiple factors for the same feature in the Bayesian expression Potentially too much weight will be put on a feature with multiple factors One solution is to try and select only a subset of independent attributes to work with as part of preprocessing
82
For numeric attributes, if they’re not normally distributed, the normal p.d.f. shouldn’t be used
If the attributes do fall into a known distribution, you can use its p.d.f. In the absence of any knowledge, the uniform distribution might be a starting point for an analysis, with statistical analysis revealing the actual distribution
83
4.3 Divide-and-Conquer: Constructing Decision Trees
Note that like everything else in this course, this is a purely pragmatic presentation Ideas will be given Nothing will be proven The book gives things in a certain order I will try to cover pretty much the same things I will do it in a different order
84
When Forming a Tree… 1. The fundamental question at each level of the tree is always which attribute to split on In other words, given attributes x1, x2, x3…, do you branch first on x1 or x2 or x3…? Having chosen the first to branch on, which of the remaining ones do you branch on next, and so on?
85
2. Suppose you can come up with a function, the information (info) function
This function is a measure of how much information is needed in order to make a decision at each node in a tree
86
3. You split on the attribute that gives the greatest information gain from level to level
In other words, you split on the attribute that represents the greatest use of information at a given level
87
4. A split is good if it means that little information will be needed at the next level down
You measure the gain by subtracting the amount of information still needed at the next level down from the amount needed at the current level
88
Defining an Information Function
It is necessary to describe the information function more fully, first informally, then formally This is the guiding principle: The best possible split results in branches where the instances in each of the branches are all of the same classification In this case the branches are leaves and you’ve arrived at a complete classification
89
No more splitting is needed
No more information is needed Expressed somewhat formally: If the additional information needed is 0, then the information gain from the previous level(s) is 100% I.e., whatever information remained to be gained has been gained
90
Developing Some Notation
Taking part of the book’s example as a starting point: Let some node have a total of 9 cases Suppose that eventually 2 classify as yes and 7 classify as no
91
This is the notation for a 2, 7 split:
[2, 7] This is the notation for the information function for the split: info([2, 7])
92
The idea is this: In practice, in advance we wouldn’t know that 9 would split into [2, 7] info([2, 7]) is a symbolic way of indicating the information that would be needed to classify the instances into [2, 7] At this point we haven’t assigned a value to the expression info([2, 7])
93
The next step in laying the groundwork of the function is simple arithmetic
Let p1 = 2/9 and p2 = 7/9 These fractions represent the proportion of each case out of the total They will appear in calculations of information gain
94
Properties Required of the Information Function
Remember the general description of the information function It is a measure of how much information is needed in order to make a decision at each node in a tree
95
This is a relatively formal description of the characteristics required of the information function
1. When a split gives a branch/leaf that’s all one classification, no additional information should be needed at that leaf That is, the function applied to the leaf should evaluate to 0
96
2. Assuming you’re working with binary attributes, when a split gives a branch/result node that’s exactly half and half in classification, the information needed should be a maximum In a binary world, achieving half is like flipping a coin This is the “no knowledge” gained outcome
97
We don’t know the function yet and how it computes a value, but however it does so, the half and half case should generate a maximum value for the information function at that node
98
The Multi-Stage Property
3. The function should have what is known as the multi-stage property An attribute may not be binary If it is not binary, you can accomplish the overall splitting by a series of binary splits
99
The multi-stage property says that not only should you be able to accomplish the splitting with a series of binary splits It should also be possible to compute the information function value of the overall split based on the information function values of the series of binary splits
100
Here is an example of the multi-stage property
Let there be 9 cases overall, with 3 different classifications Let there be 2, 3, and 4 instances of each of the cases, respectively
101
This is the multi-stage requirement:
info([2, 3, 4]) = info([2, 7]) + 7/9 * info([3, 4]) How to understand this: Consider the first term The info needed to split [2, 3, 4] includes the full cost of splitting 9 instances into two classifications, [2, 7]
102
You could rewrite the first term in this way:
info([2, 7]) = 2/9 * info([2, 7]) + 7/9 * info([2, 7]) In other words, the cost is apportioned on an instance-by-instance basis A decision is computed for each instance The first term is “full cost” because it involves all 9 instances
103
It is the apportionment idea that leads to the second term in the expression overall:
7/9 * info([3, 4]) Splitting [3, 4] involves a particular cost per instance After the [2, 7] split is made, the [3, 4] cost is incurred in only 7 out of the 9 total cases in the original problem
104
Reiterating the Multi-Stage Property
You can summarize this verbally as follows: Any split can be arrived at by a series of binary splits At each branch there is a per instance cost of computing/splitting The total cost at each branch is proportional to the number of cases at that branch
105
What is the Information Function? It is Entropy
The information function for splitting trees is based on a concept from physics, known as entropy In physics (thermodynamics), entropy can be regarded as a measure of how “disorganized” a system is For our purposes, an unclassified data set is disorganized, while a classified data set is organized
106
The book’s information function, based on logarithms, will be presented
No derivation from physics will be given Also, no proof that this function meets the requirements for an information function will be given
107
For what it’s worth, I tried to show that the function has the desired properties using calculus
I was not successful Kamal Narang looked at the problem and speculated that it could be done as a multi-variable problem rather than a single variable problem…
108
I didn’t have the time or energy to pursue the mathematical analysis any further
We will simply accept that the formula given is what is used to compute the information function
109
Intuitively, when looking at this, keep in mind the following:
The information function should be 0 when there is no split to be done The information function should be maximum when the split is half and half The multi-stage property has to hold
110
Definition of the entropy() Function by Example
Using the concrete example presented so far, this defines the information function based on a definition of an entropy function: info([2, 7]) = entropy(2/9, 7/9) = -2/9 log2(2/9) – 7/9 log2(7/9)
111
Note that since the logarithms of values <1 are negative, the leading minus signs on the terms make the value positive overall
112
General Definition of the entropy() Function
Recall that pi can be used to represent proportional fractions of classification at nodes in general Then the info(), entropy() function for 2 classifications can be written: -p1 log2(p1) – p2 log2(p2)
113
For multiple classifications you get:
-p1 log2(p1) – p2 log2(p2) - … - pn log2(pn)
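A direct Python translation of this definition, as a sketch (the function name entropy and the skipping of zero fractions are my choices):

import math

def entropy(*fractions):
    # info([...]) as defined above: -sum of p_i * log2(p_i), skipping zero fractions
    return -sum(p * math.log2(p) for p in fractions if p > 0)

print(entropy(2/9, 7/9))   # info([2, 7]), about 0.764
print(entropy(0.5, 0.5))   # half and half: the maximum for two classes, 1.0
print(entropy(1.0))        # a pure node: 0.0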
114
Information, Entropy, with the Multi-Stage Property
Remember that in the entropy version of the information function the pi are fractions Let p, q, and r represent fractions where p + q + r = 1 Then this is the book’s presentation of the formula based on the multi-stage property:
entropy(p, q, r) = entropy(p, q + r) + (q + r) × entropy(q / (q + r), r / (q + r))
115
Characteristics of the Information, Entropy Function
Refer back to the general expression for the case with multiple classifications: -p1 log2(p1) – p2 log2(p2) - … - pn log2(pn)
116
Each of the logarithms in the expression is taken on a positive fraction less than one
The logarithms of these fractions are negative The minus signs on the terms of the expression reverse this The value of the information function is positive overall
117
Note also that each term consists of a fraction multiplied by the logarithm of a fraction, where the coefficient fractions sum to 1 For two classifications this means the value is at most 1 bit, with the maximum reached at the half-and-half split (in general, with n classifications, the maximum is log2(n))
118
Logarithms base 2 are used
If the properties for the information function hold for base 2 logarithms, they would hold for logarithms of any base In a binary world, it’s convenient to use 2 Incidentally, although we will use decimal numbers, the values of the information function can be referred to as “bits” of information
119
Information Gain Level-by-level, when forming a tree, you choose which attribute to branch on based on the greatest information gain Information gain is found by subtracting the information still needed at the next level from the information needed at the previous level
120
Remember that the information function is entropy
It is a measure of how disorganized a data set is In our case, it is a measure of how far from purity the branches are The more mixed the branches, the more information needed to arrive at purity
121
Information gain is a measure of this:
How much closer to purity have you come in the branches by splitting on a given attribute Again, you measure the gain by subtracting the information you need “now” (after branching) from what you needed “before” (before branching on the given attribute)
122
An Example Applying the Information Function to Tree Formation
The book’s approach is to show you the formation of a tree by splitting and then explain where the information function came from My approach has been to tell you about the information function first Now I will work through the example, applying it to forming a tree
123
Consider a case where there are 4 attributes
The basic question is this: Which attribute do you split on first? From then on, which do you split on next? The algorithm for deciding is greedy Always do the biggest gain split first/next
124
The book illustrates with the weather example
A decision that leads to pure branches is the best The four possible choices of splitting are shown in Figure 4.2, on the following overhead None of the choices gives pure results
126
You have to pick the best option based on which gives the greatest information gain
1. Calculate the amount of information needed at the previous level 2. Calculate the information still needed if you branch on each of the four attributes 3. Calculate the information gain by finding the difference, previous - next
127
In the following presentation I am not going to show the arithmetic of finding the logarithms, multiplying by the fractions, and summing up the terms I will just present the numerical results given in the book
128
Basic Elements of the Example
In this example there isn’t literally a previous level We are at the first split, deciding which of the 4 attributes to split on There are 14 instances The end result classification is either yes or no (binary) And in the training data set there are 9 yeses and 5 nos
129
The “Previous Level” in the Example
The first measure, the so-called previous level, is simply the information needed overall to classify 14 instances into 2 categories containing 9 and 5 instances, respectively
info([9, 5]) = entropy(9/14, 5/14) = .940
130
The “Next Level” Branching on the “outlook” Attribute in the Example
Now consider branching on the first attribute, outlook It is a three-valued attribute, so it gives three branches
131
You calculate the information needed for each branch
You multiply each value by the proportion of instances for that branch You then add these values up This sum represents the total information needed after branching
132
You subtract the information still needed after branching from the information needed before branching This gives the information gained by branching
133
The Three “outlook” Branches
Branch 1 gives: info([2, 3]) = entropy(2/5, 3/5) = .971
Branch 2 gives: info([4, 0]) = entropy(4/4, 0/4) = 0
Branch 3 gives: info([3, 2]) = entropy(3/5, 2/5) = .971
134
In total: info([2, 3], [4, 0], [3, 2]) = (5/14)(.971) + (4/14)(0) + (5/14)(.971) = .693
Information gain = .940 - .693 = .247
135
Branching on the Other Attributes
If you do the same calculations for the other three attributes, you get this:
Temperature: info gain = .029
Humidity: info gain = .152
Windy: info gain = .048
Branching on outlook gives the greatest information gain
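These gains can be reproduced with a short Python sketch. The branch counts for outlook are the ones given above; the counts for temperature, humidity, and windy are my reading of the book’s weather data, so treat them as assumptions.

import math

def entropy(*fractions):
    return -sum(p * math.log2(p) for p in fractions if p > 0)

def info(branches):
    # branches: list of [yes, no] counts, one pair per branch of the split
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * entropy(*[c / sum(b) for c in b]) for b in branches)

before = entropy(9/14, 5/14)                      # .940, the "previous level"
splits = {
    "outlook":     [[2, 3], [4, 0], [3, 2]],
    "temperature": [[2, 2], [4, 2], [3, 1]],      # assumed from the book's weather data
    "humidity":    [[3, 4], [6, 1]],
    "windy":       [[6, 2], [3, 3]],
}
for attr, branches in splits.items():
    print(attr, round(before - info(branches), 3))   # .247, .029, .152, .048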
136
Tree Formation As noted, forming a tree in this way is a “greedy” algorithm You check all four attributes You split on the attribute with the greatest information gain (outlook)
137
Now you check all three remaining attributes
You continue recursively with the remaining attributes/levels of the tree The computations are clear, and the number of computations is also clear The outcome of this algorithm is a tree that potentially branches on all attributes, and (hopefully) achieves purity of leaf nodes
138
A Few More Things to Note
1. Looking back at Figure 4.2, you might intuitively suspect that outlook is the best choice because one of its branches is pure For the overcast outcome, there is no further branching to be done Intuition is nice, but you can’t say anything for sure until you’ve done the math
139
2. When you do this, the goal is to end up with leaves that are all pure
Keep in mind that the instances in a training set may not be consistent It is possible to end up, after a series of splits, with both yes and no instances in the same leaf node It is simply the case that values for the attributes at hand don’t fully determine the classification outcome
140
Also keep in mind that with a particular problem domain or data set, you may not have to branch on all attributes in order to achieve purity If you choose to branch on a particular attribute first, it’s possible that you’ll get a pure branch and not have to continue checking the other attributes
141
Also keep in mind that a greedy algorithm is not necessarily optimal
It is always possible that if you had branched on a less desirable attribute first, you would have gotten better results later The thinking is that this is unlikely, but it’s still possible
142
Highly Branching Attributes
Consider the following idea: It is possible to do data mining and “discover” a 1-1 mapping from a unique identifier to a corresponding class value This is correct information, but you have “overfitted” No future instance will have the same identifier, so this is useless for practical classification prediction
143
A similar problem can arise with trees
From a node that represented an ID, you’d get a branch for each ID value, and one correctly classified instance in each child If such a key attribute existed in a data set, splitting based on information gain as described above would find it
144
This is because at each ID branch, the resulting leaf would be pure
It would contain exactly one correctly classified instance The algorithm would tell you not to branch anymore, on any other attribute You already got a complete classification based on the one attribute
145
Whatever the information gain was, it would equal the total information still needed at the previous level The information gain would be 100% Recall that you start forming the tree by trying to find the best attribute to branch on The ID attribute will win every time and no further splitting will be needed
146
Some addition will have to be made to the algorithm in order to prevent this outcome
147
A Related Idea A tree with one level of branching on an ID attribute, as just described, will have as many leaves as there are instances The tree algorithm as given will tend to prefer attributes that have many branches, even if these aren’t ID attributes
148
The algorithm’s preference for many branches can be informally explained
The greater the number of branches, the fewer the number of instances per branch, on average The fewer the number of instances per branch, the greater the likelihood of a pure branch or nearly pure branch
149
This tendency of the algorithm goes against the grain of building small, simple trees
All else being equal (i.e., if the predictions are still good): The fewer the branches, the shorter the branches, the bigger the leaf nodes, the better the tree is for practical purposes
150
Counteracting the Preference for Many Branches
The book goes into further computational detail I’m ready to simplify The basic idea is that instead of calculating the desirability of a split based on information gain alone— You decide on the basis of the ratio of gain to branches
151
The gain ratio takes into account the number and size of the child nodes a split generates
In simple terms, the information gain is divided by the intrinsic information of the split itself (the entropy of the branch sizes), which grows with the number of branches The more fragmented the branching, the less desirable the split
152
Once you go down this road, there are further complications to keep in mind
First of all, this adjustment doesn’t necessarily protect you from a split on an ID attribute Because it achieves purity in one fell swoop, it can still win
153
Secondly, if you use the gain ratio this can lead to branching on a less desirable attribute
This is the curse of all heuristics You are trying to solve one problem, but by definition, you’re devaluing information gain, which may lead to sub-optimal results
154
As in all games of this nature, you pile heuristics on top of heuristics, “solving” problems, until you decide you’ve gone as far as you can go The book ultimately suggests the following approach:
155
Provisionally pick the attribute with the highest gain ratio
Find the average absolute information gain for branching on each attribute Check the absolute information gain of the provisional choice against the average Only take the ratio winner if it is greater than the average
156
Discussion Divide and conquer construction of decision trees is also known as top-down induction of decision trees The developers of this scheme have come up with methods for dealing with: Numeric attributes Missing values Noisy data The generation of rule sets from trees
157
For what it’s worth, the algorithms discussed above have names
ID3 is the name for the basic algorithm C4.5 refers to a practical implementation of this algorithm, with improvements, for decision tree induction
158
4.4 Covering Algorithms: Constructing Rules
Recall, in brief, how top-down tree construction worked: Attribute by attribute, pick the one that does the best job of separating instances into distinct classes Covering is a converse approach It works from the bottom up
159
The goal is to create a set of classification rules directly, without building a tree
By definition, a training set with a classification attribute is explicitly grouped into classes The goal is to come up with (simple) rules that will correctly classify some new instance that comes along I.e., the goal is prediction
160
The process goes like this:
Choose one of the classes present in the training data set Devise a rule based on one attribute that “covers” only instances of that class
161
In the ideal case, “cover” would mean:
You wrote a single rule on a single attribute (a single condition) That rule identified all of the instances of one of the classes in a data set That rule identified only the instances of that class in the data set
162
In practice some rule may not identify or “capture” all of the instances of the class
It may also not capture only instances of the class If so, the rule may be refined by adding conditions on other attributes (or other conditions on the attribute already used)
163
Measures of Goodness of Rules
Recycling some of the terminology for trees, the idea can be expressed this way: When you pick a rule, you want its results to be “pure” “Pure” in this context is analogous to pure with trees The rule, when applied to instances, should ideally give only one classification as a result
164
The rule should ideally also cover all instances
You make the rule stronger by adding conditions Each addition should improve purity/coverage
165
When building trees, picking the branching attribute was guided by an objective measure of information gain When building rule sets, you also need a measure of the goodness of the covering achieved by rules on different attributes
166
The Rule Building Process in More Detail
To cut to the chase, evaluating the goodness of a rule just comes down to counting It doesn’t involve anything fancy like entropy You compare rules based on how well they cover a classification
167
Suppose you pick some classification value, say Y
The covering process generates rules of this kind: If(the attribute of interest takes on value X) Then the classification of the instance is Y
168
Notation In order to discuss this with some precision, let this notation be given: A given data set will have m attributes Let an individual attribute be identified by the subscript i Thus, one of the attributes would be identified in this way: attribute_i
169
A given attribute will have n different values
Let an individual value be identified by the subscript j Thus, the specific, jth value for the ith attribute could be identified in this way: X_i,j
170
There could be many different classifications
However, in describing what’s going on, we’re only concerned with one classification at a time There is no need to introduce subscripting on the classifications A classification will simply be known as Y
171
The kind of rule generated by covering could be made more specific and compact:
If(the attribute of interest takes on value X) Then the classification of the instance is Y ≡ if(attribute_i = X_i,j) then classification = Y
172
Then for some combination of attribute and value:
attribute_i = X_i,j Let t represent the total number of instances in the training set where this condition holds true Let p represent the number of those instances that are classified as Y
173
The ratio p/t is like a success rate for predicting Y with each attribute and value pair
Notice how this is pretty much like a confidence ratio in association rule mining Now you want to pick the best rule to add to your rule set
174
Find p/t for all of the attributes and values in the data set
Pick the highest p/t (success) ratio Then the condition for the attribute and value with the highest p/t ratio becomes part of the covering rule: if(attribute_i = X_i,j) then classification = Y
175
Costs of Doing This Note that it is not complicated to identify all of the cases, but potentially there are many of them to consider There are m attributes with a varying number of different values, represented as n In general, there are m × n cases For each case you have to count t and p
176
Reiteration of the Sources of Error
Given the rule just devised: There may also be instances in the data set where attribute_i = X_i,j However, these may be instances that are not classified as Y
177
This means you’ve got a rule with “impure” results
You want a high p/t ratio of success You want a low absolute t – p count of failures In reality, you would prefer to add only rules with a 100% confidence rate to your rule set
178
This is the second kind of problem with any interim rule:
There may be instances in the data set where attribute_i <> X_i,j However, these may be instances that are classified as Y Here, the problem is less severe This just means that the rule’s coverage is not complete
179
Refining the Rule You refine the rule by repeating the process outlined above You just added a rule with this condition: attribute_i = X_i,j To that you want to add a new condition that does not involve attribute_i
180
Find p and t for the remaining attributes and their values
Pick the attribute-value condition which gives the composite rule the highest p/t ratio Add this condition to the existing rule using AND
181
Continue until you’re satisfied with the composite p/t ratio
Or until you’ve run through all of the attributes, potentially adding up to as many conditions as there are attributes Or until you’ve run out of instances to classify—i.e., until you’ve developed rules which correctly classify each instance in that classification
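A rough Python sketch of this loop for one class, in the spirit of the covering procedure described above (the helper names and the (attribute dictionary, class label) data format are assumptions for the example):

def best_condition(instances, attributes, target_class, used):
    # instances: list of (dict of attribute values, class label) pairs
    best, best_ratio, best_p = None, -1.0, 0
    for attr in attributes:
        if attr in used:
            continue
        for value in {vals[attr] for vals, _ in instances}:
            covered = [(vals, lbl) for vals, lbl in instances if vals[attr] == value]
            t = len(covered)
            p = sum(1 for _, lbl in covered if lbl == target_class)
            if (p / t, p) > (best_ratio, best_p):   # ties broken by larger coverage p
                best, best_ratio, best_p = (attr, value), p / t, p
    return best

def grow_rule(instances, attributes, target_class):
    # Add conditions until the instances still covered are all of the target class
    conditions, used = [], set()
    while instances and any(lbl != target_class for _, lbl in instances):
        attr, value = best_condition(instances, attributes, target_class, used)
        conditions.append((attr, value))
        used.add(attr)
        instances = [(v, l) for v, l in instances if v[attr] == value]
        if len(used) == len(attributes):
            break
    return conditions   # read as: if all these tests hold, then class = target_class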
182
Remember that what has been described thus far is the process of covering 1 classification
You repeat the process for all classifications Or you do n – 1 of the classifications and leave the nth one as the default case
183
Suppose you do this exhaustively, completely and explicitly covering every class
The rules will tend to have many conditions If you only added individual rules with p/t = 100% they will also be “perfect” In other words, for the training set there will be no ambiguity whatsoever
184
Compare this with trees
Successive splitting on attributes from the top down doesn’t guarantee pure (perfect) leaves
185
Working from the bottom up you can always devise sufficiently complex rule sets to cover all of the existing classes Keep in mind that if you devise rules that are perfect for a given training set, you may have overfitted
186
Rules versus Decision Lists
The rules derived from the process given above can be applied in any order For any one class, it’s true that the rule is composed of multiple conditions which successively classify more tightly However, the end result of the process is a single rule with conditions in conjunction
187
In theory, you could apply these parts of a rule in succession
That would be the moral equivalent of testing the conditions in order, from left to right However, since the conditions are in conjunction, you can test them in any order with the same result
188
In the same vein, it doesn’t matter which order you handle the separate classes in
If an instance doesn’t fall into one class, move on and try the next
189
The derived rules are “perfect” at most for all of the cases in the training set (only)
It’s possible to get an instance in the future where >1 rule applies or no rule applies As usual, the solutions to these problems are of the form, “Assign it to the most frequently occurring…”
190
The book summarizes the approach given above as a separate-and-conquer algorithm
This is pretty much analogous to a divide and conquer algorithm Working class by class is clearly divide and conquer
191
Within a given class the process is also progressive, step-by-step
You bite off a chunk with one rule, then bite off the next chunk with another rule, until you’ve eaten everything in the class Notice that like with trees, the motivation is “greedy” You always take the best p/t ratio first, and then refine from there
192
For what it’s worth, this algorithm for finding a covering rule set is known as PRISM
193
The End