Math 5364 Notes Chapter 4: Classification


Math 5364 Notes Chapter 4: Classification
Jesse Crawford
Department of Mathematics, Tarleton State University

Today's Topics
  Preliminaries
  Decision Trees
  Hunt's Algorithm
  Impurity measures

Preliminaries
  Data: a table with rows and columns.
    Rows: the people or objects being studied.
    Columns: characteristics of those objects.
  Other names for rows: objects, subjects, records, cases, observations, sample elements.
  Other names for columns: characteristics, attributes, variables, features.

  Dependent variable Y: the variable being predicted. Also called the response or output variable.
  Independent variables Xj: the variables used to make predictions. Also called predictors, explanatory variables, control variables, covariates, or input variables.

  Nominal variable: values are names or categories with no ordinal structure. Examples: eye color, gender, refund, marital status, tax fraud.
  Ordinal variable: values are names or categories with an ordinal structure. Examples: T-shirt size (small, medium, large) or grade in a class (A, B, C, D, F).
  Binary/dichotomous variable: only two possible values. Examples: refund and tax fraud.
  Categorical/qualitative variable: a term that includes all nominal and ordinal variables.
  Quantitative variable: a variable with numerical values for which meaningful arithmetic operations can be applied. Examples: blood pressure, cholesterol, taxable income.

  Regression: determining or predicting the value of a quantitative variable using other variables.
  Classification: determining or predicting the value of a categorical variable using other variables. Examples:
    Classifying tumors as benign or malignant.
    Classifying credit card transactions as legitimate or fraudulent.
    Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil.
    Classifying a user of a website as a real person or a bot.
    Predicting whether a student will be retained/academically successful at a university.

  Related fields: data mining/data science, machine learning, artificial intelligence, and statistics.
  Classification learning algorithms:
    Decision trees
    Rule-based classifiers
    Nearest-neighbor classifiers
    Bayesian classifiers
    Artificial neural networks
    Support vector machines

Decision Trees

Training Data (excerpt):

  Name      Body Temperature   Skin Cover   Gives Birth   Aquatic Creature   Has Legs   Class Label
  Human     Warm-blooded       hair         yes           no                            mammal
  Python    Cold-blooded       scales                                                   non-mammal
  Salmon
  Whale
  Penguin                      feathers                   semi
  ⋮

Decision tree built from the training data:

  Body Temperature?
    Warm-blooded: Gives Birth?
      Yes: Mammal
      No:  Non-mammal
    Cold-blooded: Non-mammal

Classifying new records with this tree:
  Chicken: classified as non-mammal
  Dog: classified as mammal
  Frog: classified as non-mammal
  Duck-billed platypus: classified as non-mammal (mistake)

Decision tree for the tax data:
  Refund?
    Yes: NO
    No:  MarSt?
      Single, Divorced: TaxInc?
        < 80K: NO
        > 80K: YES
      Married: NO

Hunt’s Algorithm (Basis of ID3, C4.5, and CART)
Step 1: Start with a single node containing all 10 training records, class distribution (7, 3), labeled NO.

Hunt’s Algorithm (Basis of ID3, C4.5, and CART)
Step 2: Split the root node (7, 3) on Refund.
  Refund = Yes: N = 3, (3, 0), labeled NO
  Refund = No:  N = 7, (4, 3), labeled NO

Hunt’s Algorithm (Basis of ID3, C4.5, and CART)
Step 3: Split the Refund = No node (N = 7, (4, 3)) on MarSt.
  MarSt = Married:            N = 3, (3, 0), labeled NO
  MarSt = Single or Divorced: N = 4, (1, 3), labeled YES

Hunt’s Algorithm (Basis of ID3, C4.5, and CART)
Step 4: Split the Single or Divorced node (N = 4, (1, 3)) on TaxInc.
  TaxInc < 80K: N = 1, (1, 0), labeled NO
  TaxInc > 80K: N = 3, (0, 3), labeled YES

Impurity Measures
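For a node t whose records fall into classes with proportions p_1, ..., p_c, the three standard impurity measures for classification trees are

$$\mathrm{Gini}(t) = 1 - \sum_{i=1}^{c} p_i^2, \qquad \mathrm{Entropy}(t) = -\sum_{i=1}^{c} p_i \log_2 p_i, \qquad \mathrm{Error}(t) = 1 - \max_i p_i.$$

All three are 0 for a pure node and are largest when the classes are equally represented.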


Types of Splits
  Binary split:    Marital Status split into {Single, Divorced} and {Married}
  Multi-way split: Marital Status split into Single, Married, and Divorced

Types of Splits

Hunt’s Algorithm Details
  Which variable should be used to split first? Answer: the one that decreases impurity the most (see the sketch following this list).
  How should each variable be split? Answer: in the manner that minimizes the impurity measure.
  Stopping conditions:
    If all records in a node have the same class label, it becomes a terminal node with that class label.
    If all records in a node have the same attribute values, it becomes a terminal node whose label is determined by majority rule.
    If the gain in impurity falls below a given threshold.
    If the tree reaches a given depth.
    If other prespecified conditions are met.
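As an illustration of the split-selection step, the following sketch computes the decrease in Gini impurity for the Refund split in the example above. The helper functions gini and split_impurity are illustrative, not from the slides; the class counts come from the tree-growing steps shown earlier.

# Gini impurity of a set of class labels
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted impurity of the two children produced by a binary split
split_impurity <- function(left_labels, right_labels) {
  n <- length(left_labels) + length(right_labels)
  (length(left_labels) / n) * gini(left_labels) +
    (length(right_labels) / n) * gini(right_labels)
}

# Refund split from the example: root (7, 3); Refund = Yes (3, 0); Refund = No (4, 3)
parent     <- c(rep("No", 7), rep("Yes", 3))
refund_yes <- rep("No", 3)
refund_no  <- c(rep("No", 4), rep("Yes", 3))
gini(parent) - split_impurity(refund_yes, refund_no)   # impurity decrease for this split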

Today's Topics
  Data sets included in R
  Decision trees with the rpart and party packages
  Using a tree to classify new data
  Confusion matrices
  Classification accuracy

Iris Data Set
  Iris flowers, 3 species: Setosa, Versicolor, and Virginica
  Variables: Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width

head(iris)
attach(iris)
plot(Petal.Length,Petal.Width)
plot(Petal.Length,Petal.Width,col=Species)
plot(Petal.Length,Petal.Width,col=c('blue','red','purple')[Species])

Iris Data Set plot(Petal.Length,Petal.Width,col=c('blue','red','purple')[Species])

The rpart Package

library(rpart)
library(rattle)
iristree=rpart(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width, data=iris)
iristree=rpart(Species~.,data=iris)
fancyRpartPlot(iristree)

predSpecies=predict(iristree,newdata=iris,type="class")
confusionmatrix=table(Species,predSpecies)
confusionmatrix

plot(jitter(Petal.Length),jitter(Petal.Width),col=c('blue','red','purple')[Species])
lines(1:7,rep(1.8,7),col='black')
lines(rep(2.4,4),0:3,col='black')


Confusion Matrix

                          Predicted Class
                          Class = 1    Class = 0
  Actual     Class = 1    f11          f10
  Class      Class = 0    f01          f00

Here fij is the number of records from actual class i that are predicted to be in class j.
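These counts determine the accuracy and error rate used below:

$$\text{accuracy} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}, \qquad \text{error rate} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}} = 1 - \text{accuracy}.$$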

Accuracy for Iris Decision Tree

accuracy=sum(diag(confusionmatrix))/sum(confusionmatrix)

The accuracy is 96% and the error rate is 4%.

The party Package

library(party)
iristree2=ctree(Species~.,data=iris)
plot(iristree2)

The party Package plot(iristree2,type='simple')

Predictions with ctree

predSpecies=predict(iristree2,newdata=iris)
confusionmatrix=table(Species,predSpecies)
confusionmatrix

iristree3=ctree(Species~.,data=iris, controls=ctree_control(maxdepth=2))
plot(iristree3)

Today's Topics
  Training and test data
  Training error, test error, and generalization error
  Underfitting and overfitting
  Confidence intervals and hypothesis tests for classification accuracy

Training and Testing Sets

Training and Testing Sets
  Divide the data into training data and test data.
    Training data: used to construct the classifier/statistical model.
    Test data: used to test the classifier/model.
  Types of errors:
    Training error rate: error rate on the training data.
    Generalization error rate: error rate on all nontraining data.
    Test error rate: error rate on the test data.
  Generalization error is the most important; we use the test error to estimate it.
  The entire process is called cross-validation.

Example Data

Split: 30% training data and 70% test data.

extree=rpart(class~.,data=traindata)
fancyRpartPlot(extree)
plot(extree)

Training accuracy = 79%, training error = 21%, testing error = 29%.
dim(extree$frame) tells us there are 27 nodes.
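The code that creates traindata and testdata is not shown in these notes; here is a minimal sketch, assuming the example data sits in a data frame called exdata with a factor column named class (both names are hypothetical):

library(rpart)
set.seed(1)
n <- nrow(exdata)
train_idx <- sample(n, size = round(0.3 * n))   # 30% of the records used for training, as on the slide
traindata <- exdata[train_idx, ]
testdata  <- exdata[-train_idx, ]

extree    <- rpart(class ~ ., data = traindata)
trainpred <- predict(extree, newdata = traindata, type = "class")
testpred  <- predict(extree, newdata = testdata,  type = "class")
mean(trainpred != traindata$class)   # training error
mean(testpred  != testdata$class)    # testing error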

Training error = 40%, testing error = 40%; 1 node.

extree=rpart(class~.,data=traindata, control=rpart.control(maxdepth=1))

Training error = 36%, testing error = 39%; 3 nodes.

extree=rpart(class~.,data=traindata, control=rpart.control(maxdepth=2))

Training error = 30%, testing error = 34%; 5 nodes.

extree=rpart(class~.,data=traindata, control=rpart.control(maxdepth=4))

Training error = 28%, testing error = 34%; 9 nodes.

extree=rpart(class~.,data=traindata, control=rpart.control(maxdepth=5))

Training error = 24%, testing error = 30%; 21 nodes.

extree=rpart(class~.,data=traindata, control=rpart.control(maxdepth=6))

Training error = 21%, testing error = 29%; 27 nodes.

extree=rpart(class~.,data=traindata, control=rpart.control(minsplit=1,cp=0.004))

The default value of cp is 0.01; lower values of cp make the tree more complex.
Training error = 16%, testing error = 30%; 81 nodes.

extree=rpart(class~.,data=traindata, control=rpart.control(minsplit=1,cp=0.0025))

Training error = 9%, testing error = 31%; 195 nodes.

extree=rpart(class~.,data=traindata, control=rpart.control(minsplit=1,cp=0.0015))

Training error = 6%, testing error = 33%; 269 nodes.

extree=rpart(class~.,data=traindata, control=rpart.control(minsplit=1,cp=0))

Training error = 0%, testing error = 34%; 477 nodes.

Figure: training error and testing error for the trees above, plotted against the number of nodes.
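A figure like this can be traced out by refitting the tree over a grid of cp values and recording both error rates; a sketch, reusing the traindata and testdata from above:

library(rpart)
cps <- c(0.05, 0.01, 0.004, 0.0025, 0.0015, 0)
results <- t(sapply(cps, function(cp) {
  fit <- rpart(class ~ ., data = traindata,
               control = rpart.control(minsplit = 1, cp = cp))
  c(nodes = nrow(fit$frame),
    train = mean(predict(fit, traindata, type = "class") != traindata$class),
    test  = mean(predict(fit, testdata,  type = "class") != testdata$class))
}))
# Plot both error rates against tree size (column 1 = training error, column 2 = testing error)
matplot(results[, "nodes"], results[, c("train", "test")], type = "b",
        xlab = "Number of nodes", ylab = "Error rate")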

Underfitting and Overfitting
  Underfitting: the model is not complex enough.
    High training error
    High generalization error
  Overfitting: the model is too complex.
    Low training error
    High generalization error

A Linear Regression Example Training error = 0.0129

A Linear Regression Example Training error = 0.0129 Test error = 0.00640

A Linear Regression Example Training error = 0

A Linear Regression Example Training error = 0 Test error = 50458.33
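The data behind this regression example is not included in the notes, but the same phenomenon can be reproduced on simulated data: a simple fit has small training and test error, while a polynomial flexible enough to interpolate the training points drives the training error to 0 and the test error to something enormous. A sketch, in which the data and polynomial degree are hypothetical choices:

set.seed(1)
n <- 10
x <- sort(runif(n))
y <- 2 + 3 * x + rnorm(n, sd = 0.1)            # training data
xtest <- runif(50)
ytest <- 2 + 3 * xtest + rnorm(50, sd = 0.1)   # test data

simple  <- lm(y ~ x)                  # simple model
complex <- lm(y ~ poly(x, n - 1))     # interpolates the training points

mse <- function(fit, x, y) mean((y - predict(fit, data.frame(x = x)))^2)
c(train = mse(simple, x, y),  test = mse(simple, xtest, ytest))   # both small
c(train = mse(complex, x, y), test = mse(complex, xtest, ytest))  # ~0 training error, huge test error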

Occam's Razor Occam's Razor/Principle of Parsimony: Simpler models are preferred to more complex models, all other things being equal.

Confidence Interval for Classification Accuracy
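A standard approximate interval uses the normal approximation to the binomial: if a classifier is correct on a fraction $\hat{p}$ of $N$ independent test records, an approximate $100(1-\alpha)\%$ confidence interval for its true accuracy is

$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{N}}.$$

With 1488 correct predictions out of 2100 test records (the example below), this reproduces the second interval on the next slide, approximately (0.6891, 0.7280).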

Confidence Interval for Example Data (0.6888, 0.7276) (0.6891, 0.7280)

Exact Binomial Confidence Interval

binom.test(1488,2100)

(0.6886, 0.7279)

Comparing Two Classifiers

                            Classifier 2 Correct   Classifier 2 Incorrect
  Classifier 1 Correct              a                        b
  Classifier 1 Incorrect            c                        d

a, b, c, and d are the numbers of records in each category.

Exact McNemar Test

library(exact2x2)

Use the mcnemar.exact function.
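A minimal sketch of running the test on the table above, assuming mcnemar.exact accepts a 2x2 table the same way base R's mcnemar.test does; the counts used here are hypothetical:

library(exact2x2)

# Rows: Classifier 1 correct/incorrect; columns: Classifier 2 correct/incorrect
counts <- matrix(c(730, 45,
                   62, 163),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Classifier1 = c("Correct", "Incorrect"),
                                 Classifier2 = c("Correct", "Incorrect")))
mcnemar.exact(counts)   # exact test based on the discordant counts b and c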

K-fold Cross-validation
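A minimal sketch of K-fold cross-validation for a decision tree, using rpart on the iris data as an illustration (the data set and the choice K = 10 are assumptions for the sketch, not taken from the slides):

library(rpart)
set.seed(1)
K <- 10
fold <- sample(rep(1:K, length.out = nrow(iris)))   # randomly assign each record to a fold

accuracies <- sapply(1:K, function(k) {
  train <- iris[fold != k, ]
  test  <- iris[fold == k, ]
  fit   <- rpart(Species ~ ., data = train)
  mean(predict(fit, newdata = test, type = "class") == test$Species)   # accuracy on held-out fold
})
mean(accuracies)   # cross-validated estimate of classification accuracy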

Other Types of Cross-validation
  Leave-one-out CV:
    For each record:
      Use that record as a test set.
      Use all other records as a training set.
      Compute the accuracy.
    Afterwards, average all the accuracies.
    (Equivalent to K-fold CV with K = n.)
  Delete-d CV:
    Repeat the following m times:
      Randomly select d records.
      Use those d records as a test set and all other records as a training set.
      Compute the accuracy.
    Afterwards, average all the accuracies.
  n = number of records in the original data.

Other Types of Cross-validation
  Bootstrap:
    Repeat the following b times:
      Randomly select n records with replacement and use them as a training set.
      Use all other records as a test set.
      Compute the accuracy.
    Afterwards, average all the accuracies.
  n = number of records in the original data.