Basic Data Mining Techniques


Basic Data Mining Techniques: Decision Trees

Basic concepts
Decision trees are constructed using only those attributes best able to differentiate the concepts to be learned.
A decision tree is built by initially selecting a subset of instances from a training set.
This subset is then used to construct a decision tree.
The remaining training set instances test the accuracy of the constructed tree.
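As a minimal illustration of this construction/test split (the random sampling strategy and all names are assumptions, not specified in the slides), in Python:

import random

def split_training_set(instances, build_fraction=0.7, seed=0):
    # Split training instances into a tree-building subset and a test subset.
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)                      # random sampling is an assumption
    cut = int(len(shuffled) * build_fraction)
    return shuffled[:cut], shuffled[cut:]      # (build subset, test subset)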

The Accuracy Score and the Goodness Score
The accuracy score is the ratio (usually expressed as a percentage) of the number of correctly classified instances to the total number of instances in the training set.
The goodness score is the ratio of the accuracy score to the total number of branches added to the tree by the attribute used to make the decision.
Of two candidate trees, the one with the higher accuracy and goodness scores is preferred.
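A minimal sketch of the two scores as just defined (function names are illustrative):

def accuracy_score(num_correct, num_total):
    # Fraction of training instances the tree classifies correctly.
    return num_correct / num_total

def goodness_score(accuracy, branches_added):
    # Accuracy divided by the number of branches the chosen attribute adds.
    return accuracy / branches_added

# Example matching one of the single-node trees scored later in the slides:
acc = accuracy_score(12, 15)       # 0.80
good = goodness_score(acc, 2)      # 0.40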

An Algorithm for Building Decision Trees
1. Let T be the set of training instances.
2. Choose an attribute that best differentiates the instances in T.
3. Create a tree node whose value is the chosen attribute.
   - Create child links from this node where each link represents a unique value for the chosen attribute.
   - Use the child link values to further subdivide the instances into subclasses.
4. For each subclass created in step 3:
   - If the instances in the subclass satisfy predefined criteria (minimum training set classification accuracy), or if the set of remaining attribute choices for this path is empty, verify the classification for the remaining training set instances following this decision path and STOP.
   - If the subclass does not satisfy the criteria and there is at least one attribute to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.
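A compact Python sketch of this loop follows. The attribute-scoring heuristic (per-branch majority-class accuracy) and all names are assumptions made for illustration; the algorithm above only requires choosing the attribute that best differentiates the instances.

from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def best_attribute(rows, labels, attributes):
    # Score an attribute by how many instances its branches classify correctly
    # when each branch predicts its majority class (a simplifying assumption).
    def score(attr):
        correct = 0
        for value in {r[attr] for r in rows}:
            branch = [l for r, l in zip(rows, labels) if r[attr] == value]
            correct += branch.count(majority(branch))
        return correct / len(rows)
    return max(attributes, key=score)

def build_tree(rows, labels, attributes, min_accuracy=0.9):
    # Step 4 stopping criteria: accuracy threshold met or no attributes left.
    if not attributes or labels.count(majority(labels)) / len(labels) >= min_accuracy:
        return majority(labels)                      # leaf node
    attr = best_attribute(rows, labels, attributes)  # step 2
    node = {attr: {}}                                # step 3
    for value in {r[attr] for r in rows}:            # one child link per value
        idx = [i for i, r in enumerate(rows) if r[attr] == value]
        node[attr][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            [a for a in attributes if a != attr], min_accuracy)
    return node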

Attribute Selection
The attribute choice made when building a decision tree determines the size of the constructed tree.
A main goal is to minimize the number of tree levels and tree nodes and to maximize data generalization.

Example: The Credit Card Promotion Database
The life insurance promotion is designated as the output attribute.
Our input attributes are: income range, credit card insurance, sex, and age.

Partial Decision Trees for the Credit Card Promotion Database

Accuracy score: 11/15 ≈ 0.73 (73%). Goodness score: 0.73/4 ≈ 0.18.

Accuracy score: 9/15 = 0.60 (60%). Goodness score: 0.6/2 = 0.3.

Accuracy score: 12/15 = 0.80 (80%). Goodness score: 0.8/2 = 0.4.

Multiple-Node Decision Trees for the Credit Card Promotion Database

Accuracy score: 14/15 ≈ 0.93 (93%). Goodness score: 0.93/6 ≈ 0.16.

Accuracy score: 13/15 ≈ 0.87 (87%). Goodness score: 0.87/4 ≈ 0.22.

Decision Tree Rules

A Rule for the Tree in Figure 3.4 IF Age <=43 & Sex = Male & Credit Card Insurance = No THEN Life Insurance Promotion = No

A Simplified Rule Obtained by Removing Attribute Age IF Sex = Male & Credit Card Insurance = No THEN Life Insurance Promotion = No
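A small sketch of how such rules can be read off a tree, continuing the dict-based build_tree sketch above (every root-to-leaf path becomes one IF ... THEN ... rule; names are illustrative):

def tree_to_rules(tree, conditions=()):
    if not isinstance(tree, dict):                 # leaf: emit one rule
        return [(conditions, tree)]
    (attr, branches), = tree.items()
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attr, value),)))
    return rules

Simplifying a rule by dropping a condition, as in the slide above, would be a separate post-processing step and is not shown here.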

Advantages of Decision Trees Easy to understand. Map nicely to a set of production rules. Have been successfully applied to real problems. Make no prior assumptions about the data. Able to process both numerical and categorical data.

Disadvantages of Decision Trees Output attribute must be categorical. Limited to one output attribute. Decision tree algorithms are unstable (small variations in the training data can produce very different trees). Trees created from numeric datasets can be complex.

Generating Association Rules

Confidence and Support Traditional classification rules usually limit the consequent of a rule to a single attribute. Association rule generators allow the consequent of a rule to contain one or several attribute values.

Example Suppose we want to determine whether there are any interesting relationships to be found in customer purchasing trends among the following grocery store products: milk, cheese, bread, and eggs.

Possible associations:
1. If customers purchase milk, they also purchase bread.
2. If customers purchase bread, they also purchase milk.
3. If customers purchase milk and eggs, they also purchase cheese and bread.
4. If customers purchase milk, cheese, and eggs, they also purchase bread.

Confidence Analyzing the first rule, we come to a natural question: “How likely is it that a milk purchase will lead to a bread purchase?” To answer this question, each rule has an associated confidence, which in our case is the conditional probability of a bread purchase given a milk purchase.

Rule Confidence Given a rule of the form “If A then B”, rule confidence is the conditional probability that B is true when A is known to be true.

Rule Support The minimum percentage of instances in the database that contain all items listed in a given association rule.
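In symbols (the standard formulas, stated here for reference; N denotes the total number of instances):

\[
\text{confidence}(A \Rightarrow B) = P(B \mid A) = \frac{\text{count}(A \text{ and } B)}{\text{count}(A)},
\qquad
\text{support}(A \Rightarrow B) = \frac{\text{count}(A \text{ and } B)}{N}
\]

Rule support as defined above is then a minimum threshold placed on this support value.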

Mining Association Rules: An Example

Apriori Algorithm This algorithm generates item sets. Item sets are attribute-value combinations that meet a specified coverage requirement. Those attribute-value combinations that do not meet the coverage requirement are discarded.

Apriori Algorithm The first step: item set generation. The second step: creating a set of association rules from the generated item sets.

The “income range” and “age” attributes are eliminated

Generation of the item sets First, we generate “single-item” sets. Minimum attribute-value coverage requirement: four items (i.e., a combination must appear in at least four instances). Single-item sets represent individual attribute-value combinations extracted from the original data set.

Single-Item Sets

Two-Item Sets and Multiple-Item Sets Two-item sets can be created by combining single-item sets (usually with the same coverage restriction). The next step is to use the attribute-value combinations from the two-item sets to create three-item sets, and so on. The process continues until a value of n is reached for which the n-item set contains a single instance.
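A minimal Python sketch of this levelwise generation (data is assumed to be a list of attribute-to-value dictionaries; min_coverage corresponds to the coverage requirement of four used in the slides; all names are illustrative):

def covers(itemset, instance):
    # An item set covers an instance if every (attribute, value) pair matches.
    return all(instance.get(attr) == val for attr, val in itemset)

def coverage(itemset, data):
    return sum(covers(itemset, row) for row in data)

def generate_item_sets(data, min_coverage=4):
    items = {(a, v) for row in data for a, v in row.items()}
    current = [frozenset([item]) for item in items
               if coverage([item], data) >= min_coverage]   # single-item sets
    levels = [current]
    while current:
        # Combine k-item sets into candidate (k+1)-item sets, keep covered ones.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == len(a) + 1}
        current = [c for c in candidates if coverage(c, data) >= min_coverage]
        if current:
            levels.append(current)
    return levels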

Two-Item Sets

Three-Item Set The only three-item set that satisfies the coverage criterion is: (Watch Promotion = No) & (Life Insurance Promotion = No) & (Credit Card Insurance = No)

Rule Creation The first step is to specify a minimum rule confidence. Next, association rules are generated from the two- and three-item set tables. Any rule not meeting the minimum confidence value is discarded.
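Continuing the sketch above, candidate rules can be generated from a frequent item set and filtered by confidence as follows (again a simplified illustration; it reuses the coverage helper defined earlier):

from itertools import combinations

def rules_from_item_set(itemset, data, min_confidence):
    items = list(itemset)
    rules = []
    for k in range(1, len(items)):
        for antecedent in combinations(items, k):
            consequent = tuple(i for i in items if i not in antecedent)
            # confidence = coverage of the whole item set / coverage of the antecedent
            conf = coverage(items, data) / coverage(antecedent, data)
            if conf >= min_confidence:
                rules.append((antecedent, consequent, conf))
    return rules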

Two Possible Two-Item Set Rules
IF Magazine Promotion = Yes THEN Life Insurance Promotion = Yes (5/7)
(Rule confidence is 5/7 × 100% ≈ 71%)
IF Life Insurance Promotion = Yes THEN Magazine Promotion = Yes (5/5)
(Rule confidence is 5/5 × 100% = 100%)

Three-Item Set Rules
IF Watch Promotion = No & Life Insurance Promotion = No THEN Credit Card Insurance = No (4/4)
(Rule confidence is 4/4 × 100% = 100%)
IF Watch Promotion = No THEN Life Insurance Promotion = No & Credit Card Insurance = No (4/6)
(Rule confidence is 4/6 × 100% ≈ 66.7%)

General Considerations We are interested in association rules that show a lift in product sales, where the lift is the result of the product’s association with one or more other products. We are also interested in association rules that show a lower than expected confidence for a particular association.
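One standard way to quantify whether a rule’s confidence is higher or lower than expected is the lift measure (not defined explicitly in the slides; stated here for reference):

\[
\text{lift}(A \Rightarrow B) = \frac{\text{confidence}(A \Rightarrow B)}{\text{support}(B)}
\]

A lift greater than 1 indicates that purchasing A raises the likelihood of purchasing B above its baseline rate, while a lift below 1 corresponds to the lower-than-expected confidence mentioned above.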

Homework: Problems 2 and 3 (p. 102 of the book)