
1 Chapter 6. Classification and Prediction
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Classification by backpropagation
- Classification based on concepts from association rule mining
- Other classification methods
- Prediction
- Classification accuracy
- Summary

2 Classification by Decision Tree Induction
Decision tree:
- A flowchart-like tree structure
- Each internal node denotes a test on an attribute
- Each branch represents an outcome of the test
- Leaf nodes represent class labels or class distributions
Decision tree generation consists of two phases:
- Tree construction: at the start, all the training examples are at the root; examples are then partitioned recursively based on selected attributes
- Tree pruning: identify and remove branches that reflect noise or outliers
Use of a decision tree: classify an unknown sample by testing its attribute values against the decision tree

3 Terminology: A Simple Decision Tree Example, Predicting Responses to a Credit Card Marketing Campaign
[Figure: a decision tree whose root node tests Income; internal nodes test Debts, Gender, and Children; branches (arcs) are labeled Low/High, Male/Female, and Many/Few; leaf nodes carry the class labels Responder and Non-Responder.]
This decision tree says that people with low income and high debts, and high-income males with many children, are likely responders.

4 Decision Tree (Formally)
Given:
- D = {t_1, ..., t_n}, where each t_i = <t_i1, ..., t_ih>
- The database schema contains attributes {A_1, A_2, ..., A_h}
- Classes C = {C_1, ..., C_m}
A decision (or classification) tree is a tree associated with D such that:
- Each internal node is labeled with an attribute A_i
- Each arc is labeled with a predicate that can be applied to the attribute at its parent node
- Each leaf node is labeled with a class C_j

5 Using a Decision Tree: Classifying a New Example for our Credit Card Marketing Campaign
[Figure: the same Income / Debts / Gender / Children tree as on slide 3.]
Feed the new example into the root of the tree and follow the relevant path towards a leaf, based on the attribute values of the example.
Assume Sarah has high income; following the Income = High and then the Gender = Female branch, the tree predicts that Sarah will not respond to our campaign.
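
The lookup just described is mechanical enough to code directly. Below is a minimal Python sketch (not from the slides): it encodes the credit-card tree of slides 3 and 41 as nested dictionaries and walks from the root to a leaf. The record for Sarah is a hypothetical example for illustration.

```python
# A minimal sketch (not from the slides): the credit-card marketing tree
# encoded as nested structures of the form (attribute, {branch_value: subtree}).
# Leaves are plain class-label strings.
credit_tree = (
    "Income",
    {
        "Low":  ("Debts", {"Low": "Non-Responder", "High": "Responder"}),
        "High": ("Gender", {
            "Male":   ("Children", {"Many": "Responder", "Few": "Non-Responder"}),
            "Female": "Non-Responder",
        }),
    },
)

def classify(tree, example):
    """Follow the branches matching the example's attribute values until a leaf."""
    while not isinstance(tree, str):          # internal node: (attribute, branches)
        attribute, branches = tree
        tree = branches[example[attribute]]   # descend along the matching branch
    return tree                               # leaf = predicted class label

# Hypothetical record for Sarah: high income, female.
sarah = {"Income": "High", "Gender": "Female"}
print(classify(credit_tree, sarah))           # -> Non-Responder
```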

6 Algorithms for Decision Trees
- ID3, ID4, ID5, C4.0, C4.5, C5.0, ACLS, and ASSISTANT: use information gain or gain ratio as the splitting criterion.
- CART (Classification And Regression Trees): uses the Gini diversity index or the Twoing criterion as the measure of impurity when deciding on a split.
- CHAID: a statistical approach that uses the Chi-squared test (of correlation/association, dealt with in an earlier lecture) when deciding on the best split.
- Other statistical approaches, such as those by Goodman and Kruskal, and Zhou and Dillon, use the Asymmetrical Tau or Symmetrical Tau to choose the best discriminator.
- Hunt's Concept Learning System (CLS) and MINIMAX: minimize the cost of classifying examples correctly or incorrectly.
(See Sestito & Dillon, Chapter 3; Witten & Frank, Section 4.3; or Dunham, Section 4.4, if you're interested in learning more.)
(If interested, see also: http://www.kdnuggets.com/software/classification-tree-rules.html)

7 Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm; a code sketch follows this list):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (continuous-valued attributes are discretized in advance).
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping partitioning:
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning (majority voting is employed to classify the leaf).
- There are no samples left.
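
To make the greedy recursion concrete, here is a compact Python sketch of the basic algorithm under the assumptions stated on the slide (categorical attributes, information gain as the selection measure). The function and variable names are my own, not part of the lecture.

```python
import math
from collections import Counter

def entropy(labels):
    """Bits of information needed to identify the class of a sample in `labels`."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Reduction in entropy obtained by partitioning the rows on attribute `attr`."""
    parts = {}
    for row, label in zip(rows, labels):
        parts.setdefault(row[attr], []).append(label)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

def build_tree(rows, labels, attributes, default="?"):
    """Top-down, recursive, divide-and-conquer induction with the three stopping rules."""
    if not rows:                                    # no samples left
        return default
    if len(set(labels)) == 1:                       # all samples belong to one class
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:                              # no attributes left: majority voting
        return majority
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in {row[best] for row in rows}:       # partition recursively on `best`
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = map(list, zip(*subset))
        branches[value] = build_tree(sub_rows, sub_labels, remaining, default=majority)
    return (best, branches)
```

Given the loan table of slide 20, a call like build_tree(rows, labels, ["Children", "Income"]) should return a tree that splits once on Children, matching the worked example later in this deck.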

8 DT Induction

9 DT Issues
- How to choose splitting attributes: the performance of the decision tree depends heavily on the choice of the splitting attributes. This includes choosing which attributes will be used as splitting attributes and the ordering of the splitting attributes.
- Number of splits: how many branches at each internal node.
- Tree structure: in many cases, a balanced tree is desirable.
- Stopping criteria: when to stop splitting? There is an accuracy vs. performance tradeoff; stopping earlier also helps prevent overfitting.
- Training data: too small gives an inaccurate tree, too large can lead to overfitting.
- Pruning: to improve performance during classification.

10 Training Set

11 Which attribute to select?

12 Height Example Data

13 Comparing DTs
[Figure: two decision trees built from the same data, one balanced and one deep.]
Tree structure: in many cases, a balanced tree is desirable.

14 Building a compact tree
- The key to building a decision tree is which attribute to choose in order to branch (choosing the splitting attributes).
- The heuristic is to choose the attribute with the maximum information gain, based on information theory.
- Decision tree induction is therefore often based on information theory.

15 Information theory
Information theory provides a mathematical basis for measuring the information content of a message. To understand the notion of information, think of it as providing the answer to a question, for example, whether a coin will come up heads. If one already has a good guess about the answer, then the actual answer is less informative. If one already knows that the coin is unbalanced, so that it comes up heads with probability 0.99, then a message (advance information) about the actual outcome of a flip is worth less than it would be for an honest coin.

16 Information theory (cont.)
For a fair (honest) coin, you have no information, and you are willing to pay more (say, in dollars) for advance information: the less you know, the more valuable the information. Information theory uses this same intuition, but instead of measuring the value of information in dollars, it measures information content in bits. One bit of information is enough to answer a yes/no question about which one has no idea, such as the flip of a fair coin.

17 Information theory
In general, if the possible answers v_i have probabilities P(v_i), then the information content I (entropy) of the actual answer is given by
  I(P(v_1), ..., P(v_n)) = Σ_i -P(v_i) log2 P(v_i)
For example, for the tossing of a fair coin we get
  I(1/2, 1/2) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1 bit
If the coin is loaded to give 99% heads, we get I = 0.08 bits, and as the probability of heads goes to 1, the information content of the actual answer goes to 0.
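
A quick numeric check of these two values, as a small Python illustration (the function name is mine, not from the slides):

```python
import math

def information_content(probs):
    """I(p1, ..., pn) = sum over i of -p_i * log2(p_i); zero-probability outcomes contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(information_content([0.5, 0.5]))    # 1.0 bit    (fair coin)
print(information_content([0.99, 0.01]))  # ~0.08 bits (coin loaded to 99% heads)
```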


19 Attribute Selection Measure: Information Gain (ID3/C4.5)
Select the attribute with the highest information gain. Suppose S contains s_i tuples of class C_i, for i = 1, ..., m.
- Information measure (expected information required to classify an arbitrary tuple):
  I(s_1, ..., s_m) = -Σ_i (s_i / s) log2(s_i / s)
- Entropy of attribute A with values {a_1, a_2, ..., a_v}:
  E(A) = Σ_j ((s_1j + ... + s_mj) / s) I(s_1j, ..., s_mj)
- Information gained by branching on attribute A:
  Gain(A) = I(s_1, ..., s_m) - E(A)

20 Building a Tree: Choosing a Split Example
Example: applicants who defaulted on loans (i.e., had payment difficulties). The slide's table (which also shows a City attribute, Philly, not used in the splits) is:

ApplicantID  Children  Income   Status
1            Few       Medium   PAYS
2            Few       High     PAYS
3            Many      Medium   DEFAULTS
4            Many      Low      DEFAULTS

Try a split on the Children attribute (branches Many / Few), then try a split on the Income attribute (branches Low / Medium / High). Notice how the split on the Children attribute gives purer partitions. It is therefore chosen as the first (and, in this case, only) split.

21 Building a Tree, Choosing a Split: Worked Example (Entropy)
Split on the Children attribute:
(1) Entropy of each sub-node (using the probability of each class at that node):
- Sub-node Many: both examples are DEFAULTS, so Entropy = -(2/2) log2(2/2) = 0
- Sub-node Few: both examples are PAYS, so Entropy = 0
(2) Weighted entropy of this split option (each sub-node's entropy weighted by the proportion of examples in that sub-node):
  (2/4)(0) + (2/4)(0) = 0

22 Building a Tree, Choosing a Split: Worked Example (Entropy)
Split on the Income attribute:
(1) Entropy of each sub-node (using the probability of each class at that node):
- Sub-node Low: the single example is DEFAULTS, so Entropy = 0
- Sub-node High: the single example is PAYS, so Entropy = 0
- Sub-node Medium: one PAYS and one DEFAULTS, so Entropy = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1
(2) Weighted entropy of this split option (each sub-node's entropy weighted by the proportion of examples in that sub-node):
  (1/4)(0) + (1/4)(0) + (2/4)(1) = 0.5

23 Building a Tree, Choosing a Split: Worked Example
(3) Information gain of the split options (starting entropy of the parent node, with 2 PAYS and 2 DEFAULTS, is 1 bit):
- Split on Children: weighted entropy = 0 (calculated earlier), so Information Gain = 1 - 0 = 1
- Split on Income: weighted entropy = 0.5 (calculated earlier), so Information Gain = 1 - 0.5 = 0.5
Notice how the split on the Children attribute gives the higher information gain (purer partitions) and is therefore the preferred split.
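
The arithmetic above can be replayed in a few lines of Python; the snippet below is a self-contained check (not from the slides) using the Children and Income columns of the loan table on slide 20.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

# (Children, Income, Status) for the four applicants of slide 20
rows = [("Few", "Medium", "PAYS"), ("Few", "High", "PAYS"),
        ("Many", "Medium", "DEFAULTS"), ("Many", "Low", "DEFAULTS")]
status = [r[2] for r in rows]

def weighted_entropy(attr_index):
    groups = {}
    for r in rows:
        groups.setdefault(r[attr_index], []).append(r[2])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

print(entropy(status))          # 1.0: entropy of the parent node (2 PAYS, 2 DEFAULTS)
print(1 - weighted_entropy(0))  # 1.0: information gain of the split on Children
print(1 - weighted_entropy(1))  # 0.5: information gain of the split on Income
```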

24 Building a decision tree: an example training dataset (ID3)
[Table: 14 training tuples with attributes age, income, student, and credit_rating, and class label buys_computer (9 yes, 5 no).]

25 Output: A Decision Tree for "buys_computer"
[Figure: the root tests age, with branches <=30, 31..40, and >40. The <=30 branch tests student (no leads to a "no" leaf, yes leads to a "yes" leaf); the 31..40 branch leads directly to a "yes" leaf; the >40 branch tests credit_rating, with its fair and excellent branches leading to "yes" and "no" leaves.]

26 Attribute Selection by Information Gain Computation (ID3)
- Class P: buys_computer = "yes"
- Class N: buys_computer = "no"
- I(p, n) = I(9, 5) = 0.940

27 Attribute Selection by Information Gain Computation
- Class P: buys_computer = "yes"; Class N: buys_computer = "no"; I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age. For example, "age <= 30" has 5 out of 14 samples, with 2 yes's and 3 no's, so I(2, 3) = 0.971; "age 31..40" has 4 samples, all yes, so I(4, 0) = 0; "age > 40" has 5 samples, with 3 yes's and 2 no's, so I(3, 2) = 0.971. Hence
  E(age) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694
  Gain(age) = I(9, 5) - E(age) = 0.940 - 0.694 = 0.246
- Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048.
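
The same figures can be reproduced directly from the class counts quoted on the slide; here is a short Python check (the helper name I is mine):

```python
import math

def I(*counts):
    """Expected information I(s1, ..., sm) for a node with the given class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# 9 "yes" / 5 "no" overall; per age branch: <=30 -> (2, 3), 31..40 -> (4, 0), >40 -> (3, 2)
print(round(I(9, 5), 3))             # 0.94 (the slide reports 0.940)
E_age = (5/14) * I(2, 3) + (4/14) * I(4, 0) + (5/14) * I(3, 2)
print(round(I(9, 5) - E_age, 3))     # about 0.247; the slide's 0.246 rounds intermediate terms
```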

28 Attribute Selection: Age Selected

29 Example 2: ID3 (Output 1)

30 Example 2: ID3 (Output 1)
Starting state entropy (logs here are base 10): 4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384
Gain using gender:
- Female: 3/9 log(9/3) + 6/9 log(9/6) = 0.2764
- Male: 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392
- Weighted sum: (9/15)(0.2764) + (6/15)(0.4392) = 0.34152
- Gain: 0.4384 - 0.34152 = 0.09688
Gain using height:
- Divide the continuous data into the ranges (0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, infinity)
- The 2 values in (1.9, 2.0] give entropy 0 + (1/2)(0.301) + (1/2)(0.301) = 0.301; the entropies of the other ranges are zero
- Gain: 0.4384 - (2/15)(0.301) = 0.3983
Choose height as the first splitting attribute.

31 Example 3: ID3, Weather data set

32 Example 3: ID3

33 Example 3: ID3

34 Example 3: ID3

35 Attribute Selection Measures
- Information gain (ID3): all attributes are assumed to be categorical; can be modified for continuous-valued attributes.
- Gini index (IBM IntelligentMiner): all attributes are assumed to be continuous-valued; assumes there exist several possible split values for each attribute; may need other tools, such as clustering, to get the possible split values; can be modified for categorical attributes.
- Information gain ratio (C4.5).

36 Gini Index (IBM IntelligentMiner)
If a data set T contains examples from n classes, the gini index gini(T) is defined as
  gini(T) = 1 - Σ_j p_j^2
where p_j is the relative frequency of class j in T.
If the data set T is split into two subsets T_1 and T_2 with sizes N_1 and N_2 respectively, the gini index of the split data is defined as
  gini_split(T) = (N_1 / N) gini(T_1) + (N_2 / N) gini(T_2)
The attribute that provides the smallest gini_split(T) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).
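
A small Python sketch of both definitions (the names are mine, not IntelligentMiner's), checked against the loan example of slide 20, where the binary split on Children separates the classes perfectly:

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum of squared relative class frequencies."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Gini index of a binary split of T into T1 and T2, weighted by subset sizes."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini(["PAYS", "PAYS", "DEFAULTS", "DEFAULTS"]))          # 0.5 before splitting
print(gini_split(["PAYS", "PAYS"], ["DEFAULTS", "DEFAULTS"]))  # 0.0 after splitting on Children
```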

37 C4.5 Algorithm (Gain Ratio), Dunham, page 102
An improvement over the ID3 decision tree algorithm.
- Splitting: the gain value used in ID3 favors attributes with high cardinality. C4.5 takes the cardinality of the attribute into consideration and uses the gain ratio instead of the gain.
- Missing data: during tree construction, training records that have missing values are simply ignored; only records that have attribute values are used to calculate the gain ratio. (Note: the formula for computing the gain ratio changes a bit to accommodate missing data.) During classification, if the splitting attribute of the query is missing, we need to descend to every child node; the final class assigned to the query depends on the probabilities associated with the classes at the leaf nodes.

38 C4.5 Algorithm (Gain Ratio)
- Continuous data: break the continuous attribute values into ranges.
- Pruning, 2 strategies (prune the tree as long as the resulting increase in classification error stays within a limit):
  Subtree replacement: a subtree is replaced by a leaf node (bottom up).
  Subtree raising: a subtree is replaced by its most used subtree.
- Rules: C4.5 allows classification directly via the decision trees or via rules generated from them. In addition, there are techniques to simplify complex or redundant rules.

39 CART Algorithm (Dunham, page 102)
- Acronym for Classification And Regression Trees.
- Binary splits only (creates a binary tree).
- Uses an impurity-based goodness-of-split measure (slide 6 lists the Gini diversity index and the Twoing criterion) to choose the split point, s, for node t; the formula is reproduced below.
- P_L and P_R are the probabilities that a tuple in the training set falls on the left or the right side of the tree.
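
The formula itself did not survive the transcript. As a hedged reconstruction from Dunham (page 102, which the slide cites), the goodness of a candidate binary split s at node t is usually written as
  Phi(s | t) = 2 P_L P_R Σ_j | P(C_j | t_L) - P(C_j | t_R) |
where t_L and t_R are the left and right children produced by s, the sum runs over the classes C_j, and the split with the largest Phi(s | t) is chosen.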

40 Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules.
- One rule is created for each path from the root to a leaf.
- Each attribute-value pair along a path forms a conjunction.
- The leaf node holds the class prediction.
- Rules are easier for humans to understand.
Example:
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31..40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "no"

41 Extracting Classification Rules from Trees for the Credit Card Marketing Campaign
For each leaf in the tree, read the rule from the root to that leaf; you will arrive at a set of rules.
[Figure: the Income / Debts / Gender / Children tree from slide 3.]
IF Income=Low AND Debts=Low THEN Non-Responder
IF Income=Low AND Debts=High THEN Responder
IF Income=High AND Gender=Male AND Children=Many THEN Responder
IF Income=High AND Gender=Male AND Children=Few THEN Non-Responder
IF Income=High AND Gender=Female THEN Non-Responder
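
Reading one rule per leaf is easy to automate. The sketch below (mine, not from the slides) walks every root-to-leaf path of the nested-dict tree used in the earlier classification sketch and prints the corresponding IF-THEN rules.

```python
credit_tree = (
    "Income",
    {
        "Low":  ("Debts", {"Low": "Non-Responder", "High": "Responder"}),
        "High": ("Gender", {
            "Male":   ("Children", {"Many": "Responder", "Few": "Non-Responder"}),
            "Female": "Non-Responder",
        }),
    },
)

def extract_rules(tree, conditions=()):
    """Yield one IF-THEN rule for each root-to-leaf path; the conditions form a conjunction."""
    if isinstance(tree, str):                       # leaf: it holds the class prediction
        yield "IF " + " AND ".join(conditions) + " THEN " + tree
        return
    attribute, branches = tree
    for value, subtree in branches.items():
        yield from extract_rules(subtree, conditions + (f"{attribute}={value}",))

for rule in extract_rules(credit_tree):
    print(rule)   # e.g. IF Income=Low AND Debts=Low THEN Non-Responder
```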

42 Avoid Overfitting in Classification
The generated tree may overfit the training data:
- Too many branches, some of which may reflect anomalies due to noise or outliers.
- The result is poor accuracy for unseen samples.
Two approaches to avoid overfitting:
- Prepruning: halt tree construction early, i.e., do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
- Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees, then use a set of data different from the training data to decide which is the "best pruned tree".
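
For experimentation, the two strategies map naturally onto scikit-learn's decision-tree API. The sketch below is an illustration under stated assumptions (scikit-learn installed, the bundled iris data used as a stand-in for real training data), not part of the original lecture.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=1/3, random_state=0)

# Prepruning: halt construction early via thresholds on depth / leaf size.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_train, y_train)

# Postpruning: grow a full tree, then pick the cost-complexity pruned tree
# that does best on data not used for training.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
post = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_valid, y_valid),
)
print(pre.score(X_valid, y_valid), post.score(X_valid, y_valid))
```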

43 Approaches to Determine the Final Tree Size
- Separate training (2/3) and testing (1/3) sets.
- Use cross-validation, e.g., 10-fold cross-validation.
- Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution.
- Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized.
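
A hedged sketch of the first two approaches with scikit-learn (again assuming the library is available and using iris as placeholder data): hold out a third of the data, or score each candidate tree size with 10-fold cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Separate training (2/3) and testing (1/3) sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
print(DecisionTreeClassifier(random_state=0).fit(X_train, y_train).score(X_test, y_test))

# 10-fold cross-validation over a few candidate tree sizes.
for depth in (2, 3, 5, None):
    scores = cross_val_score(DecisionTreeClassifier(max_depth=depth, random_state=0), X, y, cv=10)
    print(depth, scores.mean().round(3))
```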

44 Enhancements to Basic Decision Tree Induction
- Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
- Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values.
- Attribute construction: create new attributes based on existing ones that are sparsely represented; this reduces fragmentation, repetition, and replication.

45 Classification in Large Databases
- Classification: a classical problem extensively studied by statisticians and machine learning researchers.
- Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed.
- Why decision tree induction in data mining?
  relatively fast learning speed (compared with other classification methods)
  convertible to simple and easy-to-understand classification rules
  can use SQL queries for accessing databases
  comparable classification accuracy with other methods

46 Scalable Decision Tree Induction Methods in Data Mining Studies
- SLIQ (EDBT'96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory.
- SPRINT (VLDB'96, J. Shafer et al.): constructs an attribute-list data structure.
- PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning; stops growing the tree earlier.
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label).

47 Data Cube-Based Decision-Tree Induction
- Integration of generalization with decision-tree induction (Kamber et al., 1997).
- Classification at primitive concept levels (e.g., precise temperature, humidity, outlook, etc.): low-level concepts, scattered classes, and bushy classification trees lead to semantic interpretation problems.
- Cube-based multi-level classification: relevance analysis at multiple levels; information-gain analysis with dimension + level.

48 Presentation of Classification Results

