Big Data Analysis and Mining Qinpei Zhao 赵钦佩 2015 Fall Decision Tree

Illustrating Classification Task

Classification: Definition Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class, find a model for the class attribute as a function of the values of the other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
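As a concrete illustration of this train/validate workflow (added here; the use of scikit-learn and its bundled Iris data is an assumption, not part of the original slides), a minimal sketch:

    # Split a labeled data set, fit a decision tree on the training part,
    # and estimate accuracy on the held-out test part.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)                 # records and their class labels
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)          # training set / test set

    model = DecisionTreeClassifier(criterion="entropy")
    model.fit(X_train, y_train)                       # build the model on the training set

    y_pred = model.predict(X_test)                    # classify previously unseen records
    print("Test accuracy:", accuracy_score(y_test, y_pred))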

Examples of Classification Task Predicting tumor cells as benign or malignant Classifying credit card transactions as legitimate or fraudulent Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil Categorizing news stories as finance, weather, entertainment, sports, etc.

What is a Decision Tree? An inductive learning task: use particular facts to make more generalized conclusions. A predictive model based on a branching series of Boolean tests; these smaller Boolean tests are less complex than a one-stage classifier. Let’s look at a sample decision tree…

Example – Tax cheating [Figure: training data with categorical attributes, a continuous attribute, and a class label (Cheat), alongside the resulting decision tree model. Splitting attributes: Refund (Yes / No), then MarSt (Single, Divorced / Married), then TaxInc (< 80K / >= 80K); leaves are labeled YES / NO.]

Example – Tax cheating [Figure: an alternative decision tree for the same training data that splits first on MarSt, then on Refund and TaxInc.] There could be more than one tree that fits the same data!

Decision Tree Classification Task [Figure: the classification workflow of the earlier slide, with a decision tree as the induced model.]

Apply Model to Test Data Start from the root of the tree and, at each node, follow the branch that matches the test record: first Refund (Yes / No), then MarSt (Single, Divorced / Married), then TaxInc (< 80K / >= 80K). [Figures: six slides stepping the sample test record down the tree, one node at a time.] The test record reaches the NO leaf, so assign Cheat to “No”.

Example – Predicting Commute Time [Figure: a decision tree for commute time. The root splits on Leave At: the 8 AM branch predicts Long; the 9 AM branch splits on Accident? (Yes → Long, No → Medium); the 10 AM branch splits on Stall? (Yes → Long, No → Short).] If we leave at 10 AM and there are no cars stalled on the road, what will our commute time be?

Inductive Learning In this decision tree, we made a series of Boolean decisions and followed the corresponding branch:  Did we leave at 10 AM?  Did a car stall on the road?  Is there an accident on the road? By answering each of these yes/no questions, we came to a conclusion about how long our commute might take.

Decision Trees as Rules We did not have to represent this tree graphically; we could have represented it as a set of rules. However, this may be much harder to read…

if hour == 8am
    commute time = long
else if hour == 9am
    if accident == yes
        commute time = long
    else
        commute time = medium
else if hour == 10am
    if stall == yes
        commute time = long
    else
        commute time = short
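To make the rule form concrete, here is a minimal executable sketch of the same rules (added for illustration; the function name and the string encodings of the attribute values are assumptions, not from the original slides):

    # Hypothetical helper mirroring the rule set above; attribute values are
    # passed as plain strings such as "8am", "yes", "no".
    def predict_commute(hour, accident, stall):
        if hour == "8am":
            return "long"
        elif hour == "9am":
            return "long" if accident == "yes" else "medium"
        elif hour == "10am":
            return "long" if stall == "yes" else "short"
        return None  # an unseen hour value: the rules make no prediction

    print(predict_commute("10am", accident="no", stall="no"))  # -> short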

How to Create a Decision Tree We first make a list of attributes that we can measure.  These attributes (for now) must be discrete. We then choose a target attribute that we want to predict. Then we create an experience table that lists what we have seen in the past.

Sample Experience Table

Example  Hour   Weather  Accident  Stall  Commute
D1       8 AM   Sunny    No        No     Long
D2       8 AM   Cloudy   No        Yes    Long
D3       10 AM  Sunny    No        No     Short
D4       9 AM   Rainy    Yes       No     Long
D5       9 AM   Sunny    Yes       Yes    Long
D6       10 AM  Sunny    No        No     Short
D7       10 AM  Cloudy   No        No     Short
D8       9 AM   Rainy    No        No     Medium
D9       9 AM   Sunny    Yes       No     Long
D10      10 AM  Cloudy   Yes       Yes    Long
D11      10 AM  Rainy    No        No     Short
D12      8 AM   Cloudy   Yes       No     Long
D13      9 AM   Sunny    No        No     Medium

Tree Induction Greedy strategy:  Split the records based on an attribute test that optimizes a certain criterion. Issues:  Determine how to split the records  How to specify the attribute test condition?  How to determine the best split?  Determine when to stop splitting

How to Specify Test Condition? Depends on attribute types  Nominal  Ordinal  Continuous Depends on number of ways to split  2-way split  Multi-way split

Splitting Based on Nominal Attributes Multi-way split: use as many partitions as there are distinct values, e.g. CarType → {Family} | {Sports} | {Luxury}. Binary split: divides the values into two subsets and needs to find the optimal partitioning, e.g. CarType → {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}.

Splitting Based on Ordinal Attributes Multi-way split: use as many partitions as there are distinct values, e.g. Size → {Small} | {Medium} | {Large}. Binary split: divides the values into two subsets and needs to find the optimal partitioning, e.g. Size → {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}. What about the split Size → {Small, Large} vs. {Medium}? (It does not preserve the order of the values.)

Splitting Based on Continuous Attributes Different ways of handling:  Discretization to form an ordinal categorical attribute  Static – discretize once at the beginning  Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.  Binary decision: (A < v) or (A ≥ v)  consider all possible splits and find the best cut  can be more compute-intensive.
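To make the binary-decision option concrete, here is a small sketch (added for illustration; the function names and the sample income/label values are made up) that scores every candidate cut (A < v) by the weighted entropy of the two resulting subsets and returns the best one:

    # Exhaustive search for the best cut point v on a continuous attribute,
    # scoring each split (A < v) vs (A >= v) by weighted class entropy.
    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def best_cut(values, labels):
        pairs = sorted(zip(values, labels))
        best_v, best_score = None, float("inf")
        # Candidate cuts: midpoints between consecutive distinct attribute values.
        for i in range(1, len(pairs)):
            if pairs[i][0] == pairs[i - 1][0]:
                continue
            v = (pairs[i][0] + pairs[i - 1][0]) / 2
            left = [label for x, label in pairs if x < v]
            right = [label for x, label in pairs if x >= v]
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
            if score < best_score:
                best_v, best_score = v, score
        return best_v, best_score

    # Made-up taxable-income values (in K) with Cheat labels:
    print(best_cut([60, 70, 75, 85, 90, 95, 100, 120, 125, 220],
                   ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]))
    # -> (97.5, 0.6)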

Choosing Attributes The previous experience table showed 4 attributes: hour, weather, accident and stall, but the decision tree only used 3: hour, accident and stall. Why is that? Methods for selecting attributes (which will be described later) show that weather is not a discriminating attribute. We use the principle of Occam’s Razor: given a number of competing hypotheses, the simplest one is preferable. Notice that not every attribute has to be used in each path of the decision; as we will see, some attributes may not even appear in the tree.

Identifying the Best Attributes Refer back to our original decision tree. [Figure: the commute-time tree, splitting on Leave At (8 AM / 9 AM / 10 AM) and then on Accident? and Stall?.] How did we know to split on Leave At, and then on Stall and Accident, but not on Weather?

Tree Induction Greedy strategy:  Split the records based on an attribute test that optimizes a certain criterion. Issues:  Determine how to split the records  How to specify the attribute test condition?  How to determine the best split?  Determine when to stop splitting

Entropy or Purity Impurity/entropy (informal): measures the level of impurity in a group of examples.

Entropy or Purity [Figure: three groups of examples, from a very impure group (the two classes evenly mixed), to a less impure group, to minimum impurity (all examples from a single class).]

Entropy: a common way to measure impurity Entropy = ∑ (over classes i) -p_i log2(p_i), where p_i is the probability of class i, computed as the proportion of class i in the set. Entropy comes from information theory: the higher the entropy, the higher the information content.
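As a quick illustration (added here; the function below is a sketch, not part of the slides), computing this entropy for a few small groups in Python:

    # Entropy of a list of class labels: -sum(p_i * log2(p_i)) over classes i.
    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    print(entropy(["A"] * 8))              # 0.0    minimum impurity (pure group)
    print(entropy(["A"] * 6 + ["B"] * 2))  # ~0.811 less impure
    print(entropy(["A"] * 4 + ["B"] * 4))  # 1.0    very impure (maximum for two classes)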

Information “Information” answers questions: the more clueless I am about a question, the more information the answer to the question contains. Example – fair coin, prior (0.5, 0.5). By definition, the information of the prior (or entropy of the prior) is I(P1, P2) = -P1 log2(P1) - P2 log2(P2), so I(0.5, 0.5) = -0.5 log2(0.5) - 0.5 log2(0.5) = 1. We need 1 bit to convey the outcome of the flip of a fair coin. Why does a biased coin carry less information? (How can we code the outcomes of a biased coin sequence?) Scale: 1 bit = the answer to a Boolean question with prior (0.5, 0.5).

Information or Entropy or Purity Information in an answer, given possible answers v_1, v_2, …, v_n with prior probabilities P(v_1), …, P(v_n): I(P(v_1), …, P(v_n)) = ∑ (i=1 to n) -P(v_i) log2(P(v_i)). Example – biased coin, prior (1/100, 99/100): I(1/100, 99/100) = -1/100 log2(1/100) - 99/100 log2(99/100) ≈ 0.08 bits (so not much information is gained from the “answer”). Example – fully biased coin, prior (1, 0): I(1, 0) = -1 log2(1) - 0 log2(0) = 0 bits, using the convention 0 log2(0) = 0, i.e., no uncertainty is left in the source! (This quantity is also called the entropy of the prior.)

Shape of the Entropy Function [Figure: the binary entropy function plotted against p, rising from 0 at p = 0 to its maximum of 1 bit at p = 1/2 and falling back to 0 at p = 1.] Example: the roll of an unbiased die. The more uniform the probability distribution, the greater its entropy.
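A worked number for the die example: a fair six-sided die has entropy ∑ (i=1 to 6) -(1/6) log2(1/6) = log2(6) ≈ 2.58 bits, the maximum possible for six outcomes; a loaded die has lower entropy.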

Splitting Based on Information Information gain: GAIN_split = Entropy(p) - ∑ (i=1 to k) (n_i / n) Entropy(i), where parent node p is split into k partitions, n_i is the number of records in partition i, and n is the number of records in p.  Measures the reduction in entropy achieved because of the split; choose the split that achieves the most reduction (maximizes GAIN).  Used in ID3 and C4.5.  Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
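A small worked example (numbers added for illustration): a parent node with 10 records, 5 of class C0 and 5 of class C1, has Entropy(p) = 1 bit. A split into two pure partitions of 5 records each gives weighted child entropy (5/10)·0 + (5/10)·0 = 0, so GAIN = 1 - 0 = 1 bit, while a split that leaves both children mixed 3:2 gives weighted entropy ≈ 0.97 and GAIN ≈ 0.03.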

Decision Tree Algorithms The basic idea behind any decision tree algorithm is as follows:  Choose the best attribute(s) to split the remaining instances and make that attribute a decision node  Repeat this process recursively for each child  Stop when:  All the instances have the same target attribute value  There are no more attributes  There are no more instances
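The sketch below (an illustrative implementation added here, with made-up function and field names; it assumes discrete attributes and records given as Python dicts) follows exactly this recipe, choosing at each node the attribute with the highest information gain:

    import math
    from collections import Counter

    def entropy(records, target):
        n = len(records)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(r[target] for r in records).values())

    def info_gain(records, attr, target):
        n = len(records)
        gain = entropy(records, target)
        for value in set(r[attr] for r in records):
            subset = [r for r in records if r[attr] == value]
            gain -= (len(subset) / n) * entropy(subset, target)
        return gain

    def build_tree(records, attributes, target):
        classes = [r[target] for r in records]
        # Stop: all instances have the same target attribute value.
        if len(set(classes)) == 1:
            return classes[0]
        # Stop: no attributes left; fall back to the majority class.
        if not attributes:
            return Counter(classes).most_common(1)[0][0]
        # Choose the best attribute (highest gain) as the decision node.
        best = max(attributes, key=lambda a: info_gain(records, a, target))
        node = {"attribute": best, "branches": {}}
        remaining = [a for a in attributes if a != best]
        # Branches are created only for values that occur, so every recursive
        # call receives at least one instance.
        for value in set(r[best] for r in records):
            subset = [r for r in records if r[best] == value]
            node["branches"][value] = build_tree(subset, remaining, target)
        return node

    # Toy usage on a few rows in the spirit of the experience table:
    data = [
        {"hour": "8 AM", "accident": "No", "stall": "No", "commute": "Long"},
        {"hour": "9 AM", "accident": "Yes", "stall": "No", "commute": "Long"},
        {"hour": "9 AM", "accident": "No", "stall": "No", "commute": "Medium"},
        {"hour": "10 AM", "accident": "No", "stall": "Yes", "commute": "Long"},
        {"hour": "10 AM", "accident": "No", "stall": "No", "commute": "Short"},
    ]
    print(build_tree(data, ["hour", "accident", "stall"], "commute"))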

Entropy Entropy is minimized when all values of the target attribute are the same.  If we know that commute time will always be short, then entropy = 0 Entropy is maximized when there is an equal chance of all values for the target attribute (i.e. the result is random)  If commute time = short in 3 instances, medium in 3 instances and long in 3 instances, entropy is maximized
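A worked check of the second case: with 3 short, 3 medium and 3 long instances, each value has probability 1/3, so the entropy is 3 · (-(1/3) log2(1/3)) = log2(3) ≈ 1.58 bits, the maximum for a target attribute with three values.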

Entropy Calculation of entropy:  Entropy(S) = ∑ (i=1 to l) -(|S_i| / |S|) log2(|S_i| / |S|)  where S is the set of examples, S_i is the subset of S with value v_i under the target attribute, and l is the size of the range of the target attribute.

ID3 ID3 splits on the attribute with the lowest expected entropy. We calculate the expected entropy of an attribute as the weighted sum of the subset entropies:  ∑ (i=1 to k) (|S_i| / |S|) Entropy(S_i), where k is the size of the range of the attribute we are testing. Equivalently, we can measure the information gain, which is largest exactly when the expected entropy is smallest:  Gain = Entropy(S) - ∑ (i=1 to k) (|S_i| / |S|) Entropy(S_i)

ID3 Given our commute time sample set, we can calculate the expected entropy and information gain of each attribute at the root node:

Attribute   Expected Entropy   Information Gain
Hour        0.65               0.77
Weather     1.29               0.13
Accident    0.92               0.50
Stall       1.17               0.25

Hour has the lowest expected entropy (highest gain), so it becomes the root split.

Problems with ID3 ID3 is not optimal  Uses expected entropy reduction, not actual reduction Must use discrete (or discretized) attributes  What if we left for work at 9:30 AM?  We could break down the attributes into smaller values…

Problems with Decision Trees While decision trees classify quickly, the time needed to build the tree can be higher than for other types of classifiers. Decision trees also suffer from errors propagating down the tree, a problem that becomes more serious as the number of classes increases.