CSci 8980: Data Mining (Fall 2002)

Presentation transcript:

CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance Computing Research Center Department of Computer Science University of Minnesota http://www.cs.umn.edu/~kumar

Sampling Sampling is the main technique employed for data selection. It is often used for both the preliminary investigation of the data and the final data analysis. Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming. Sampling is used in data mining because it is too expensive or time consuming to process all the data.

Sampling … The key principle for effective sampling is the following: using a sample will work almost as well as using the entire data set, if the sample is representative. A sample is representative if it has approximately the same property (of interest) as the original set of data.

Types of Sampling Simple random sampling: there is an equal probability of selecting any particular item. Sampling without replacement: as each item is selected, it is removed from the population. Sampling with replacement: objects are not removed from the population as they are selected for the sample, so the same object can be picked more than once.
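The two sampling modes above can be sketched with Python's standard `random` module. This is a minimal illustration, not part of the original slides; the helper names and the toy population are assumptions.

```python
import random

def sample_without_replacement(items, k, seed=None):
    """Draw k distinct items; each selected item leaves the population."""
    rng = random.Random(seed)
    return rng.sample(items, k)

def sample_with_replacement(items, k, seed=None):
    """Draw k items; the same item may be picked more than once."""
    rng = random.Random(seed)
    return rng.choices(items, k=k)

# Hypothetical toy population of 10 objects.
population = list(range(10))
without = sample_without_replacement(population, 5, seed=0)
with_repl = sample_with_replacement(population, 20, seed=0)
```

Drawing 20 items from a population of 10 is only possible with replacement, which is why `with_repl` necessarily contains repeats.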

Sample Size [Figure: the same data set shown with 8000 points, 2000 points, and 500 points]

Sample Size What sample size is necessary to get at least one object from each of 10 groups?
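One way to approach this question is to compute the probability that a sample of a given size covers all groups. The sketch below assumes that every draw is equally likely to land in any of the groups (equal-sized groups, sampling with replacement); neither assumption is stated on the slide. It uses inclusion-exclusion over the number of missed groups.

```python
from math import comb

def prob_all_groups_covered(n_draws, n_groups=10):
    """P(every group appears at least once), assuming each draw hits
    any group with equal probability (an assumption, see lead-in)."""
    return sum(
        (-1) ** j * comb(n_groups, j) * (1 - j / n_groups) ** n_draws
        for j in range(n_groups + 1)
    )

# Print the coverage probability for a few sample sizes.
for n in (10, 20, 40, 60):
    print(n, prob_all_groups_covered(n))
```

With exactly 10 draws the probability is 10!/10^10, so a sample several times larger than the number of groups is needed before coverage becomes likely.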

Discretization Some techniques don’t use class labels. [Figure panels: Data; Equal interval width; Equal frequency; K-means]
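Two of the unsupervised schemes named above, equal interval width and equal frequency, can be sketched in plain Python. This is an illustrative sketch; the function names and sample values are assumptions, and ties in the equal-frequency version may be split across bins.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign values to k bins holding (roughly) the same number of items."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

# Hypothetical skewed data: most values are small, two are large.
data = [1, 2, 3, 4, 100, 200]
print(equal_width_bins(data, 3))      # [0, 0, 0, 0, 1, 2]
print(equal_frequency_bins(data, 3))  # [0, 0, 1, 1, 2, 2]
```

On skewed data, equal-width bins leave most values in one interval, while equal-frequency bins spread them evenly.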

Discretization Some techniques use class labels, e.g., the entropy based approach. [Figure panels: 3 categories for both x and y; 5 categories for both x and y]

Aggregation Combine data or attributes. Aggregated data shows more stable behavior. [Figure panels: Standard Deviation of Average Monthly Precipitation; Standard Deviation of Average Yearly Precipitation]

Dimensionality Reduction Principal Components Analysis Singular Value Decomposition Curse of Dimensionality

Feature Subset Selection Redundant features duplicate much or all of the information contained in one or more other attributes, e.g., the purchase price of a product and the amount of sales tax paid contain much the same information. Irrelevant features contain no information that is useful for the data mining task at hand, e.g., students' ID numbers should be irrelevant to the task of predicting students' grade point averages.

Mapping Data to a New Space Fourier transform. Wavelet transform. [Figure panels: Two Sine Waves; Two Sine Waves + Noise; Frequency]

Classification: Outline Decision tree classifiers: what decision trees are; tree induction (ID3, C4.5, CART); tree pruning. Other classifiers: memory based, neural net, Bayesian.

Classification: Definition Given a collection of records (training set), where each record contains a set of attributes and one of the attributes is the class, find a model for the class attribute as a function of the values of the other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
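The train/test procedure described above can be sketched as follows. This is a minimal holdout split, assuming records are `(attributes, class)` pairs and a model is any callable mapping attributes to a class; those representations and the function names are illustrative assumptions.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and hold out a fraction as the test set."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(model, test_records):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for attrs, label in test_records if model(attrs) == label)
    return correct / len(test_records)

# Hypothetical records: the class is the parity of attribute x.
records = [({"x": i}, i % 2) for i in range(10)]
train, test = train_test_split(records)
print(accuracy(lambda attrs: attrs["x"] % 2, test))
```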

Classification Example [Figure: a training set with two categorical attributes, one continuous attribute, and a class column is used to learn a classifier (model), which is then applied to a test set]

Classification Techniques Decision Tree based Methods Rule-based Methods Memory based reasoning Neural Networks Genetic Algorithms Naïve Bayes and Bayesian Belief Networks Support Vector Machines

Decision Tree Based Classification Decision tree models are better suited for data mining: inexpensive to construct; easy to interpret; easy to integrate with database systems; comparable or better accuracy in many applications.

Example Decision Tree Training data columns: Refund (categorical), MarSt (categorical), TaxInc (continuous), class. Splitting attributes: Refund = Yes → NO; Refund = No → test MarSt. MarSt = Married → NO; MarSt = Single, Divorced → test TaxInc. TaxInc < 80K → NO; TaxInc > 80K → YES. The splitting attribute at a node is determined based on the Gini index.

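Another Example of Decision Tree Same training data (two categorical attributes, one continuous attribute, class). MarSt = Married → NO; MarSt = Single, Divorced → test Refund. Refund = Yes → NO; Refund = No → test TaxInc. TaxInc < 80K → NO; TaxInc > 80K → YES. There could be more than one tree that fits the same data!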

Decision Tree Algorithms Many algorithms: Hunt’s Algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, SPRINT. General structure: tree induction, then tree pruning.

Hunt’s Method An Example: Attributes: Refund (Yes, No), Marital Status (Single, Married, Divorced), Taxable Income (Continuous). Class: Cheat, Don’t Cheat. [Figure: the tree grown in stages. First a split on Refund (Yes → Don’t Cheat); then the Refund = No branch is split on Marital Status (Single, Divorced vs. Married); then the Single, Divorced branch is split on Taxable Income (< 80K vs. >= 80K).]

Tree Induction Greedy strategy. Choose to split records based on an attribute that optimizes the splitting criterion. Two phases at each node: Split Determining Phase: How to Split a Given Attribute? Which attribute to split on? Use Splitting Criterion. Splitting Phase: Split the records into children.
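The greedy strategy above can be sketched as a short recursive procedure. This is a simplified illustration, not the algorithm of any specific system named in these slides: it assumes categorical attributes only, multi-way splits, the weighted Gini index as the splitting criterion, and no pruning. The record layout, function names, and toy data are assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(records, attributes):
    """records: list of (attribute_dict, label). Returns a label or a node dict."""
    labels = [y for _, y in records]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority  # leaf node

    def split_score(attr):
        # Split determining phase: weighted Gini of a multi-way split on attr.
        groups = {}
        for x, y in records:
            groups.setdefault(x[attr], []).append(y)
        return sum(len(g) / len(records) * gini(g) for g in groups.values())

    best = min(attributes, key=split_score)
    # Splitting phase: partition the records among the children.
    parts = {}
    for x, y in records:
        parts.setdefault(x[best], []).append((x, y))
    rest = [a for a in attributes if a != best]
    children = {v: build_tree(p, rest) for v, p in parts.items()}
    return {"attr": best, "default": majority, "children": children}

def classify(tree, x):
    """Follow the tree; fall back to the node's majority class for unseen values."""
    while isinstance(tree, dict):
        tree = tree["children"].get(x[tree["attr"]], tree["default"])
    return tree

# Hypothetical toy records (not the data set from the slides).
records = [
    ({"Refund": "Yes", "MarSt": "Single"}, "No"),
    ({"Refund": "No", "MarSt": "Married"}, "No"),
    ({"Refund": "No", "MarSt": "Single"}, "Yes"),
    ({"Refund": "Yes", "MarSt": "Married"}, "No"),
    ({"Refund": "No", "MarSt": "Divorced"}, "Yes"),
]
tree = build_tree(records, ["Refund", "MarSt"])
```

Because each node picks the locally best split and never revisits it, the resulting tree is not guaranteed to be the smallest tree consistent with the data.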

Splitting Based on Nominal Attributes Each partition has a subset of values signifying it. Multi-way split: use as many partitions as distinct values, e.g., CarType → Family | Sports | Luxury. Binary split: divides values into two subsets; need to find the optimal partitioning, e.g., CarType → {Sports, Luxury} | {Family} or CarType → {Family, Luxury} | {Sports}.
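Finding the optimal binary partitioning of a nominal attribute means examining every way to divide its values into two non-empty subsets. The sketch below enumerates them; for k distinct values there are 2^(k-1) - 1 candidates. The function name is an assumption made for illustration.

```python
from itertools import combinations

def binary_partitions(values):
    """List every split of the values into two non-empty subsets.

    The first value is pinned to the left subset so that each
    unordered partition is produced exactly once.
    """
    values = list(values)
    first, rest = values[0], values[1:]
    partitions = []
    # r = how many of the remaining values join `first`; r == len(rest)
    # is excluded because the right subset would then be empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = {first, *extra}
            right = set(values) - left
            partitions.append((left, right))
    return partitions

for left, right in binary_partitions(["Family", "Sports", "Luxury"]):
    print(sorted(left), "|", sorted(right))
```

The count grows exponentially with the number of distinct values, which is why exhaustive search is only practical for attributes with few values.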

Splitting Based on Ordinal Attributes Each partition has a subset of values signifying it. Multi-way split: use as many partitions as distinct values, e.g., Size → Small | Medium | Large. Binary split: divides values into two subsets; need to find the optimal partitioning, e.g., Size → {Small, Medium} | {Large} or Size → {Medium, Large} | {Small}. What about the split Size → {Small, Large} | {Medium}? It groups non-adjacent values, so it does not preserve the order of the attribute.

Splitting Based on Continuous Attributes Different ways of handling: Discretization to form an ordinal categorical attribute. Static: discretize once at the beginning. Dynamic: ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering. Binary decision: (A < v) or (A ≥ v). Considers all possible splits and finds the best cut; can be more compute intensive.

Splitting Criterion Gini Index Entropy and Information Gain Misclassification error

Splitting Criterion: GINI Gini Index for a given node t: $\mathrm{GINI}(t) = 1 - \sum_{j} [p(j \mid t)]^2$ (NOTE: p( j | t) is the relative frequency of class j at node t). Measures impurity of a node. Maximum (1 - 1/n_c, where n_c is the number of classes) when records are equally distributed among all classes, implying least interesting information. Minimum (0.0) when all records belong to one class, implying most interesting information.

Examples for computing GINI P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Gini = 1 – P(C1)^2 – P(C2)^2 = 1 – 0 – 1 = 0. P(C1) = 1/6, P(C2) = 5/6: Gini = 1 – (1/6)^2 – (5/6)^2 = 0.278. P(C1) = 2/6, P(C2) = 4/6: Gini = 1 – (2/6)^2 – (4/6)^2 = 0.444.
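The three worked examples can be reproduced with a few lines of Python. This is a minimal sketch; the function name and the choice to pass per-class record counts are assumptions.

```python
def gini(counts):
    """Gini index of a node, given the number of records in each class."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(round(gini([0, 6]), 3))  # 0.0
print(round(gini([1, 5]), 3))  # 0.278
print(round(gini([2, 4]), 3))  # 0.444
```

An evenly split two-class node, `gini([3, 3])`, gives 0.5, which is the maximum 1 - 1/n_c for two classes.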

Splitting Based on GINI Used in CART, SLIQ, SPRINT. Splitting criterion: minimize the Gini index of the split. When a node p is split into k partitions (children), the quality of the split is computed as $\mathrm{GINI}_{split} = \sum_{i=1}^{k} \frac{n_i}{n} \, \mathrm{GINI}(i)$, where n_i = number of records at child i, n = number of records at node p.
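The split quality is the Gini index of each child weighted by the share of records it receives. A minimal sketch follows; the function names and the class counts in the usage example are illustrative assumptions.

```python
def gini(counts):
    """Gini index of a node, given the number of records in each class."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini index of a split.

    children: one list of per-class record counts for each child node.
    """
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

# Hypothetical binary split of 12 records (6 per class) into two children.
print(round(gini_split([[5, 2], [1, 4]]), 3))  # 0.371
```

The parent node in this example has Gini 0.5, so the split reduces impurity; a split into two pure children would score 0.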

Binary Attributes: Computing GINI Index Splits into two partitions. Effect of weighting partitions: larger and purer partitions are sought. [Figure: test on attribute B with Yes → Node N1 and No → Node N2]

Categorical Attributes: Computing Gini Index For each distinct value, gather counts for each class in the dataset Use the count matrix to make decisions Multi-way split Two-way split (find best partition of values)

Continuous Attributes: Computing Gini Index Use binary decisions based on one value. Several choices for the splitting value: number of possible splitting values = number of distinct values. Each splitting value has a count matrix associated with it: class counts in each of the partitions, A < v and A ≥ v. Simple method to choose the best v: for each v, scan the database to gather the count matrix and compute its Gini index. Computationally inefficient! Repetition of work.

Continuous Attributes: Computing Gini Index... For efficient computation, for each attribute: sort the attribute on values; linearly scan these values, each time updating the count matrix and computing the Gini index; choose the split position that has the least Gini index. [Figure: sorted values with the candidate split positions between them]
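The sort-then-scan procedure can be sketched as below. Instead of rescanning the data for every candidate value, it moves one record at a time from the right partition to the left and updates the class counts incrementally. Placing the threshold at the midpoint between adjacent distinct values is an assumption of this sketch, as are the function names and the toy data.

```python
def gini(counts):
    """Gini index of a node, given the number of records in each class."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def best_threshold(values, labels):
    """Return (weighted_gini, v) for the best split into A < v and A >= v."""
    pairs = sorted(zip(values, labels))  # sort once
    n = len(pairs)
    left = {c: 0 for c in set(labels)}   # counts for A < v
    right = {c: 0 for c in set(labels)}  # counts for A >= v
    for _, y in pairs:
        right[y] += 1
    best = (float("inf"), None)
    for i in range(n - 1):
        value, y = pairs[i]
        left[y] += 1   # update the count matrix incrementally
        right[y] -= 1
        next_value = pairs[i + 1][0]
        if next_value == value:
            continue   # no split position between equal values
        n_left = i + 1
        score = (n_left / n) * gini(list(left.values())) \
            + ((n - n_left) / n) * gini(list(right.values()))
        if score < best[0]:
            best = (score, (value + next_value) / 2)
    return best

# Hypothetical toy attribute with a clean class boundary.
print(best_threshold([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"]))
# (0.0, 6.5)
```

After the initial sort, the scan costs one count update per record, rather than one full pass over the data per candidate value.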

Alternative Splitting Criteria based on INFO Entropy at a given node t: $\mathrm{Entropy}(t) = -\sum_{j} p(j \mid t) \log_2 p(j \mid t)$ (NOTE: p( j | t) is the relative frequency of class j at node t). Measures the impurity (lack of homogeneity) of a node. Maximum (log_2 n_c, where n_c is the number of classes) when records are equally distributed among all classes, implying least information. Minimum (0.0) when all records belong to one class, implying most information. Entropy based computations are similar to the GINI index computations.

Examples for computing Entropy P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0 (taking 0 log 0 = 0). P(C1) = 1/6, P(C2) = 5/6: Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65. P(C1) = 2/6, P(C2) = 4/6: Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92.
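The entropy examples can be checked with a short function. This is a minimal sketch; the function name is an assumption, and empty classes are skipped to apply the usual convention 0 log 0 = 0.

```python
from math import log2

def entropy(counts):
    """Entropy (base 2) of a node, given the number of records in each class."""
    n = sum(counts)
    # Classes with zero records contribute nothing (0 log 0 = 0).
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)

print(round(entropy([0, 6]), 2))  # 0.0
print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92
```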

Splitting Based on INFO... Information Gain: $\mathrm{GAIN}_{split} = \mathrm{Entropy}(p) - \sum_{i=1}^{k} \frac{n_i}{n} \, \mathrm{Entropy}(i)$. Parent node p is split into k partitions; n_i is the number of records in partition i and n is the number of records at node p. Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN). Used in ID3 and C4.5. Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.

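Splitting Based on INFO... Gain Ratio: $\mathrm{GainRATIO}_{split} = \frac{\mathrm{GAIN}_{split}}{\mathrm{SplitINFO}}$, with $\mathrm{SplitINFO} = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$. Parent node p is split into k partitions; n_i is the number of records in partition i and n is the number of records at node p. Adjusts Information Gain by the entropy of the partitioning (SplitINFO). Higher entropy partitioning (large number of small partitions) is penalized! Used in C4.5. Designed to overcome the disadvantage of Information Gain.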
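The difference between information gain and gain ratio shows up when a two-way split is compared with a split into many tiny partitions. The sketch below uses base-2 logarithms, as in the entropy examples; the function names and the toy counts are assumptions.

```python
from math import log2

def entropy(counts):
    """Entropy (base 2) of a list of counts; zero counts are skipped."""
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

def split_info(children):
    """Entropy of the partition sizes themselves."""
    return entropy([sum(c) for c in children])

def gain_ratio(parent, children):
    """Information gain normalised by SplitINFO."""
    si = split_info(children)
    return information_gain(parent, children) / si if si > 0 else 0.0

# Hypothetical node with 3 records per class.
parent = [3, 3]
two_way = [[3, 0], [0, 3]]                               # two pure children
six_way = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]  # one record each

print(information_gain(parent, two_way), information_gain(parent, six_way))
print(gain_ratio(parent, two_way), gain_ratio(parent, six_way))
```

Both splits achieve the same information gain, but the six-way split has a larger SplitINFO and therefore a lower gain ratio.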

Splitting Criteria based on Classification Error Classification error at a node t: $\mathrm{Error}(t) = 1 - \max_{j} p(j \mid t)$. Measures the misclassification error made by a node. Maximum (1 - 1/n_c, where n_c is the number of classes) when records are equally distributed among all classes, implying least interesting information. Minimum (0.0) when all records belong to one class, implying most interesting information.

Examples for Computing Error P(C1) = 0/6 = 0 P(C2) = 6/6 = 1 Error = 1 – max (0, 1) = 1 – 1 = 0 P(C1) = 1/6 P(C2) = 5/6 Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6 P(C1) = 2/6 P(C2) = 4/6 Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3
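These error values can be reproduced directly from the class counts. A minimal sketch; the function name is an assumption.

```python
def classification_error(counts):
    """Misclassification error of a node that predicts its majority class."""
    n = sum(counts)
    return 1.0 - max(counts) / n

print(classification_error([0, 6]))  # 0.0
print(classification_error([1, 5]))  # approximately 1/6
print(classification_error([2, 4]))  # approximately 1/3
```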

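Comparison among Splitting Criteria For a 2-class problem: [Figure: Gini index, entropy, and misclassification error plotted against the fraction p of records in one class]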
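For a 2-class problem, all three measures can be written as functions of p, the fraction of records in one class. The sketch below tabulates them for a few values of p; the function name and the chosen grid are assumptions.

```python
from math import log2

def two_class_measures(p):
    """Return (gini, entropy, error) for a node with class fractions p and 1 - p."""
    probs = [p, 1.0 - p]
    gini = 1.0 - sum(q * q for q in probs)
    entropy = sum(-q * log2(q) for q in probs if q > 0)
    error = 1.0 - max(probs)
    return gini, entropy, error

for p in [0.0, 0.1, 0.25, 0.5, 0.75, 1.0]:
    g, h, e = two_class_measures(p)
    print(f"p={p:.2f}  gini={g:.3f}  entropy={h:.3f}  error={e:.3f}")
```

All three measures are zero for a pure node (p = 0 or p = 1) and peak at p = 0.5, where Gini and error reach 0.5 and entropy reaches 1.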

C4.5 Simple depth-first construction. Sorts Continuous Attributes at each node. Needs entire data to fit in memory. Unsuitable for Large Datasets. Needs out-of-core sorting. Classification Accuracy shown to improve when entire datasets are used!

Decision Tree for Boolean Function

Decision Tree for Boolean Function… Can simplify the tree: