Classification. Based in part on Chapter 10 of Hand, Mannila, & Smyth and Chapter 7 of Han and Kamber. David Madigan.


Predictive Modeling
Goal: learn a mapping y = f(x; θ)
Need: 1. A model structure 2. A score function 3. An optimization strategy
Categorical y ∈ {c_1, …, c_m}: classification
Real-valued y: regression
Note: usually assume {c_1, …, c_m} are mutually exclusive and exhaustive

Probabilistic Classification
Let p(c_k) = prob. that a randomly chosen object comes from class c_k
Objects from c_k have density p(x | c_k, θ_k) (e.g., multivariate normal)
Then: p(c_k | x) ∝ p(x | c_k, θ_k) p(c_k)
Bayes Error Rate: p* = ∫ [1 − max_k p(c_k | x)] p(x) dx, a lower bound on the best possible error rate
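A toy numeric illustration of the rule above (the priors, the Gaussian class densities, and the query point x are invented for the example):

```python
from scipy.stats import norm

priors = {1: 0.6, 2: 0.4}
likelihood = {1: norm(loc=0.0, scale=1.0), 2: norm(loc=2.0, scale=1.0)}   # p(x | c_k)

x = 1.2
unnorm = {k: likelihood[k].pdf(x) * priors[k] for k in priors}            # p(x | c_k) p(c_k)
posterior = {k: v / sum(unnorm.values()) for k, v in unnorm.items()}      # p(c_k | x)
print(posterior)   # the Bayes rule picks the class with the larger posterior
```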

[Figure: Bayes error rate about 6%]

Classifier Types
Discrimination: direct mapping from x to {c_1, …, c_m} - e.g. perceptron, SVM, CART
Regression: model p(c_k | x) - e.g. logistic regression, CART
Class-conditional: model p(x | c_k, θ_k) - e.g. “Bayesian classifiers”, LDA

Simple Two-Class Perceptron
Define: h(x) = Σ_j w_j x_j
Classify as class 1 if h(x) > 0, class 2 otherwise
Score function: # misclassification errors on training data
For training, replace class 2 x_j’s by −x_j; now need h(x) > 0 for every training point
Initialize weight vector w
Repeat one or more times:
 For each training data point x_i
  If point correctly classified, do nothing
  Else set w ← w + x_i
Guaranteed to converge when there is perfect separation
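A minimal NumPy sketch of the update rule just described; the toy data, the fixed budget of 10 sweeps, and the absence of a bias term are assumptions made for illustration:

```python
import numpy as np

def train_perceptron(X, y, n_passes=10):
    """Two-class perceptron; y holds +1 / -1 labels.

    As on the slide, class-2 points are sign-flipped so that a correct
    classification always means w @ z > 0.
    """
    Z = X * y[:, None]            # flip the class-2 (y = -1) points
    w = np.zeros(X.shape[1])      # initialize weight vector
    for _ in range(n_passes):     # repeat one or more times
        for z in Z:
            if w @ z <= 0:        # misclassified (or on the boundary)
                w = w + z         # perceptron update
    return w

# toy usage: two roughly linearly separable clusters in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.r_[np.ones(20), -np.ones(20)]
w = train_perceptron(X, y)
print(np.mean(np.sign(X @ w) == y))   # training accuracy
```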

Linear Discriminant Analysis
K classes, X an n × p data matrix. p(c_k | x) ∝ p(x | c_k, θ_k) p(c_k)
Could model each class density as multivariate normal:
 p(x | c_k) = (2π)^(−p/2) |Σ_k|^(−1/2) exp( −½ (x − μ_k)′ Σ_k^(−1) (x − μ_k) )
LDA assumes Σ_k = Σ for all k. Then:
 log [ p(c_k | x) / p(c_l | x) ] = log(π_k/π_l) − ½ (μ_k + μ_l)′ Σ^(−1) (μ_k − μ_l) + x′ Σ^(−1) (μ_k − μ_l)
This is linear in x.

Linear Discriminant Analysis (cont.)
It follows that the classifier should predict argmax_k δ_k(x), where
 δ_k(x) = x′ Σ^(−1) μ_k − ½ μ_k′ Σ^(−1) μ_k + log π_k
is the “linear discriminant function”
If we don’t assume the Σ_k’s are identical, get Quadratic DA:
 δ_k(x) = −½ log |Σ_k| − ½ (x − μ_k)′ Σ_k^(−1) (x − μ_k) + log π_k

Linear Discriminant Analysis (cont.)
Can estimate the LDA parameters via maximum likelihood:
 π̂_k = N_k / N
 μ̂_k = Σ_{i: y_i = k} x_i / N_k
 Σ̂ = Σ_k Σ_{i: y_i = k} (x_i − μ̂_k)(x_i − μ̂_k)′ / (N − K)
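A plug-in sketch of these estimators together with the linear discriminant function δ_k(x) above (NumPy only; no regularization of the pooled covariance, which a careful implementation might add):

```python
import numpy as np

def fit_lda(X, y):
    """Plug-in LDA estimates: class priors, class means, pooled covariance."""
    classes = np.unique(y)
    N, p = X.shape
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    # pooled within-class covariance with the N - K denominator
    Sigma = sum(np.cov(X[y == k], rowvar=False) * (np.sum(y == k) - 1)
                for k in classes) / (N - len(classes))
    return classes, priors, means, np.linalg.inv(Sigma)

def predict_lda(X, classes, priors, means, Sigma_inv):
    """delta_k(x) = x' S^{-1} mu_k - 0.5 mu_k' S^{-1} mu_k + log pi_k"""
    scores = (X @ Sigma_inv @ means.T
              - 0.5 * np.sum(means @ Sigma_inv * means, axis=1)
              + np.log(priors))
    return classes[np.argmax(scores, axis=1)]
```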

[Figure: LDA vs. QDA]

LDA (cont.)
Fisher’s rule is optimal if the classes are MVN with a common covariance matrix
Computational complexity O(m p² n)

Logistic Regression
Note that LDA is linear in x:
 log [ p(c_k | x) / p(c_K | x) ] = α_k0 + α_k′ x
Linear logistic regression looks the same:
 log [ p(c_k | x) / p(c_K | x) ] = β_k0 + β_k′ x
But the estimation procedure for the coefficients is different. LDA maximizes the joint likelihood [y, X]; logistic regression maximizes the conditional likelihood [y | X]. Usually similar predictions.

Logistic Regression MLE
For the two-class case, the log-likelihood is:
 l(β) = Σ_i { y_i β′x_i − log(1 + exp(β′x_i)) }
To maximize, need to solve the (non-linear) score equations:
 ∂l/∂β = Σ_i x_i ( y_i − p(x_i; β) ) = 0,  where p(x; β) = exp(β′x) / (1 + exp(β′x))
Typically solved by Newton-Raphson, i.e., iteratively reweighted least squares.
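A bare-bones Newton-Raphson (IRLS) sketch for these score equations; the fixed iteration cap, tolerance, and lack of regularization or step-size control are simplifying assumptions:

```python
import numpy as np

def fit_logistic(X, y, n_iter=25, tol=1e-8):
    """Newton-Raphson / IRLS for two-class logistic regression.

    X: n x p design matrix (include a column of ones for the intercept),
    y: 0/1 labels.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))      # fitted probabilities p(x_i; beta)
        W = p * (1 - p)                          # diagonal of the IRLS weight matrix
        score = X.T @ (y - p)                    # the score equations
        hessian = X.T @ (X * W[:, None])         # X' W X
        step = np.linalg.solve(hessian, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```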

Logistic Regression Modeling
South African Heart Disease example (y = MI)
[Table: coefficient, S.E., and Wald Z score for Intercept, sbp, tobacco, ldl, famhist, obesity, alcohol, age; numeric values not recoverable from the transcript]

Tree Models
Easy to understand
Can handle mixed data, missing values, etc.
Sequential fitting method can be sub-optimal
Usually grow a large tree and prune it back rather than attempt to optimally stop the growing process

Training Dataset
This follows an example from Quinlan’s ID3
[Table: 14 training examples with attributes age, income, student, credit_rating and class label buys_computer]

Output: A Decision Tree for “buys_computer”
 age?
  <=30 → student?  (no → no, yes → yes)
  31..40 → yes
  >40 → credit_rating?  (excellent → no, fair → yes)

Confusion matrix

Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
–Tree is constructed in a top-down recursive divide-and-conquer manner
–At start, all the training examples are at the root
–Attributes are categorical (if continuous-valued, they are discretized in advance)
–Examples are partitioned recursively based on selected attributes
–Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning
–All samples for a given node belong to the same class
–There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
–There are no samples left
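A compact sketch of this greedy recursion for categorical attributes, using information gain as the selection measure (the dict-of-dicts tree representation and helper names are illustrative choices, not part of the original algorithm statement):

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def build_tree(rows, labels, attributes):
    """Greedy top-down induction over categorical attributes.

    rows: list of dicts mapping attribute name -> value
    labels: list of class labels, parallel to rows
    attributes: attribute names still available for splitting
    """
    # stopping: all samples in one class, or no attributes left -> majority vote
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    def remainder(attr):  # expected information after splitting on attr
        total = len(rows)
        rem = 0.0
        for v in set(r[attr] for r in rows):
            subset = [lab for r, lab in zip(rows, labels) if r[attr] == v]
            rem += len(subset) / total * entropy(subset)
        return rem

    best = min(attributes, key=remainder)        # equivalent to max information gain
    tree = {best: {}}
    for v in set(r[best] for r in rows):
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [lab for r, lab in zip(rows, labels) if r[best] == v]
        rest = [a for a in attributes if a != best]
        tree[best][v] = build_tree(sub_rows, sub_labels, rest)
    return tree
```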

Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Assume there are two classes, P and N
–Let the set of examples S contain p elements of class P and n elements of class N
–The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as
 I(p, n) = −(p/(p+n)) log₂(p/(p+n)) − (n/(p+n)) log₂(n/(p+n))
e.g. I(0.5, 0.5) = 1; I(0.9, 0.1) = 0.47; I(0.99, 0.01) = 0.08

Information Gain in Decision Tree Induction
Assume that using attribute A a set S will be partitioned into sets {S_1, S_2, …, S_v}
–If S_i contains p_i examples of P and n_i examples of N, the entropy, or the expected information needed to classify objects in all subtrees S_i, is
 E(A) = Σ_{i=1..v} ((p_i + n_i)/(p + n)) I(p_i, n_i)
The encoding information that would be gained by branching on A:
 Gain(A) = I(p, n) − E(A)

Attribute Selection by Information Gain Computation
Class P: buys_computer = “yes”
Class N: buys_computer = “no”
I(p, n) = I(9, 5) = 0.940
Compute the entropy for age:
 E(age) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694
Hence
 Gain(age) = I(9, 5) − E(age) = 0.246
Similarly
 Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
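The same arithmetic as a quick check; the per-age-group counts (2/3, 4/0, 3/2) are the ones from the standard buys_computer training set assumed above:

```python
import math

def I(p, n):
    """Information needed for a p-vs-n split; 0·log 0 is taken as 0."""
    total = p + n
    return -sum(c / total * math.log2(c / total) for c in (p, n) if c > 0)

print(round(I(9, 5), 3))                  # ~0.940, the overall information

# splitting on age: (<=30: 2 yes / 3 no), (31..40: 4/0), (>40: 3/2)
groups = [(2, 3), (4, 0), (3, 2)]
E_age = sum((p + n) / 14 * I(p, n) for p, n in groups)
print(round(E_age, 3))                    # ~0.694
print(I(9, 5) - E_age)                    # ~0.247; the slide's 0.246 comes from rounding I and E first
```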

Gini Index (IBM IntelligentMiner)
If a data set T contains examples from n classes, the gini index gini(T) is defined as
 gini(T) = 1 − Σ_j p_j²
where p_j is the relative frequency of class j in T.
If a data set T is split into two subsets T_1 and T_2 with sizes N_1 and N_2 respectively, the gini index of the split data is defined as
 gini_split(T) = (N_1/N) gini(T_1) + (N_2/N) gini(T_2)
The attribute that provides the smallest gini_split(T) is chosen to split the node.
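A matching sketch for the gini criterion (the candidate split in the usage lines is made up purely for illustration):

```python
from collections import Counter

def gini(labels):
    counts = Counter(labels)
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

def gini_split(left, right):
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# evaluate a hypothetical binary split of the 9-yes / 5-no node
left = ["yes"] * 6 + ["no"] * 1
right = ["yes"] * 3 + ["no"] * 4
print(gini(left + right), gini_split(left, right))   # parent impurity vs. split impurity
```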

Avoid Overfitting in Classification
The generated tree may overfit the training data
–Too many branches, some may reflect anomalies due to noise or outliers
–Result is poor accuracy for unseen samples
Two approaches to avoid overfitting
–Prepruning: Halt tree construction early – do not split a node if this would result in the goodness measure falling below a threshold. Difficult to choose an appropriate threshold
–Postpruning: Remove branches from a “fully grown” tree – get a sequence of progressively pruned trees. Use a set of data different from the training data to decide which is the “best pruned tree”
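One way to realize postpruning in practice is cost-complexity pruning; the sketch below uses scikit-learn's pruning path together with a held-out set to pick the "best pruned tree". The dataset and split fraction are arbitrary choices, and cost-complexity pruning is one specific scheme, not necessarily the one the slide has in mind:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
# hold out data the trees never see while growing, to choose the pruned tree
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=1/3, random_state=0)

# grow a full tree, then obtain the sequence of progressively pruned trees
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_val, y_val),
)
print(best.get_n_leaves(), best.score(X_val, y_val))
```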

Approaches to Determine the Final Tree Size
Separate training (2/3) and testing (1/3) sets
Use cross validation, e.g., 10-fold cross validation
Use the minimum description length (MDL) principle:
–halting growth of the tree when the encoding is minimized

Nearest Neighbor Methods
k-NN assigns an unknown object to the most common class of its k nearest neighbors
Choice of k? (bias-variance tradeoff again)
Choice of metric?
Need all the training data to be present to classify a new point (“lazy methods”)
Surprisingly strong asymptotic results (e.g. no decision rule is more than twice as accurate as 1-NN)
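A short brute-force k-NN sketch (Euclidean distance and k = 5 are arbitrary choices; real implementations typically use spatial indexes rather than scanning every training point):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_new, k=5):
    """Classify each row of X_new by majority vote among its k nearest
    training points, using Euclidean distance."""
    preds = []
    for x in X_new:
        dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
        nearest = np.argsort(dists)[:k]               # indices of the k closest
        preds.append(Counter(y_train[nearest]).most_common(1)[0][0])
    return np.array(preds)
```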

Flexible Metric NN Classification

Naïve Bayes Classification
Recall: p(c_k | x) ∝ p(x | c_k) p(c_k)
Now suppose the features are conditionally independent given the class:
 p(x | c_k) = Π_{j=1..p} p(x_j | c_k)
Then:
 p(c_k | x) ∝ p(c_k) Π_j p(x_j | c_k)
Equivalently:
 log [ p(c_k | x) / p(c_l | x) ] = log [ p(c_k) / p(c_l) ] + Σ_j log [ p(x_j | c_k) / p(x_j | c_l) ]
with the per-feature terms acting as “weights of evidence”
[Figure: naïve Bayes as a graphical model: class node C with arrows to features x_1, x_2, …, x_p]
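A minimal sketch of this classifier for categorical features, scoring classes by log prior plus summed per-feature log likelihoods; add-one smoothing and the assumption that test-time feature values were seen during training are simplifications:

```python
import numpy as np
from collections import defaultdict

class CategoricalNaiveBayes:
    """Minimal naive Bayes for categorical features with add-one smoothing."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.log_prior = {c: np.log(np.mean([yi == c for yi in y])) for c in self.classes}
        self.log_cond = defaultdict(dict)      # log p(x_j = v | c) per class and feature
        n_features = len(X[0])
        for c in self.classes:
            rows = [x for x, yi in zip(X, y) if yi == c]
            for j in range(n_features):
                values = set(x[j] for x in X)
                counts = {v: 1 for v in values}          # add-one smoothing
                for x in rows:
                    counts[x[j]] += 1
                total = sum(counts.values())
                self.log_cond[c][j] = {v: np.log(n / total) for v, n in counts.items()}
        return self

    def predict(self, X):
        preds = []
        for x in X:
            scores = {c: self.log_prior[c]
                         + sum(self.log_cond[c][j][x[j]] for j in range(len(x)))
                      for c in self.classes}
            preds.append(max(scores, key=scores.get))
        return preds
```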

Evidence Balance Sheet

Naïve Bayes (cont.)
Despite the crude conditional independence assumption, it works well in practice (see Friedman, 1997 for a partial explanation)
Can be further enhanced with boosting, bagging, model averaging, etc.
Can relax the conditional independence assumptions in myriad ways (“Bayesian networks”)

Dietterich (1999) Analysis of 33 UCI datasets