# Classification and Supervised Learning

Credits: Hand, Mannila and Smyth; Cook and Swayne; Padhraic Smyth’s notes; Shawndra Hill’s notes. Data Mining - Massey University

Outline
- Supervised learning overview
- Linear discriminant analysis
- Tree models
- Probability-based and Bayes models

Classification
- Classification (supervised learning) is prediction for a categorical response
  - for a binary (T/F) response, it can be used as an alternative to logistic regression
  - the response is often a quantized real value or a non-scaled numeric
- Can be used with categorical predictors
- Great for missing data: "missing" can be a response in itself!
- Methods for fitting can be
  - parametric
  - algorithmic

Because labels are known, you can
- build parametric models for the classes
- define decision regions and decision boundaries

Examples of classifiers
- Generative/class-conditional/probabilistic, based on $p(x \mid c_k)$
  - Naïve Bayes (simple, but often effective in high dimensions)
  - Parametric generative models, e.g., Gaussian: linear discriminant analysis
- Regression-based, based on $p(c_k \mid x)$
  - Logistic regression: simple, linear in “odds” space
  - Neural network: non-linear extension of logistic regression
- Discriminative models, which focus on locating optimal decision boundaries
  - Decision trees: “swiss army knife”, often effective in high dimensions
  - Linear discriminants
  - Support vector machines (SVM): generalization of linear discriminants; can be quite effective, but computational complexity is an issue
  - Nearest neighbor: simple, can scale poorly in high dimensions

Evaluation of Classifiers
Already seen some of this. Assume the output is a probability vector over the classes.
- Classification error: P(true Y | predicted Y)
- ROC area: the area under the ROC plot
- Top-k analysis: sometimes all you care about is how well you can do at the top of the list
  - plan A: top 50 candidates have 44 sales, top 500 have 300 sales
  - plan B: top 50 have 48 sales, top 500 have 270 sales
  - which do you choose?
  - often used with imbalanced class distributions (fraud, etc.), where a good classification error is easy to achieve!
- Calibration is sometimes important: if you say something has a 90% chance, does it?
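
The plan A / plan B comparison comes down to the hit rate within the top k candidates. A minimal R sketch of that arithmetic, using only the counts quoted on the slide (the data frame and its column names are illustrative):

```r
# Hit rate (sales per candidate contacted) for each plan at each cutoff k.
top_k <- data.frame(
  plan  = c("A", "A", "B", "B"),
  k     = c(50, 500, 50, 500),
  sales = c(44, 300, 48, 270)
)
top_k$hit_rate <- top_k$sales / top_k$k
print(top_k)
```

Plan B has the higher hit rate in the top 50 and plan A in the top 500, so the choice depends on how many candidates you can afford to contact.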

Linear Discriminant Analysis
LDA: parametric classification
- Fisher (1936), Rao (1948)
- Finds the linear combination of variables that separates two classes, by comparing the difference between the class means with the variance within each class
- Assumes a multivariate normal distribution for each class (cluster)
- Pros:
  - easy to define the likelihood
  - easy to define the boundary
  - easy to measure goodness of fit
  - easy to interpret
- Cons:
  - it is very rare for data to come close to multivariate normal!
  - works only on numeric predictors
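
The comparison of mean difference and within-class variance is usually written as Fisher's criterion. The symbols here are assumptions for illustration and do not appear on the slide: $w$ is the vector of linear-combination weights, and $\mu_k$ and $\Sigma_k$ are the mean vector and covariance matrix of class $k$.

$$
J(w) = \frac{\left(w^\top \mu_1 - w^\top \mu_2\right)^2}{w^\top \Sigma_1 w + w^\top \Sigma_2 w},
\qquad
w^{*} \propto \left(\Sigma_1 + \Sigma_2\right)^{-1} (\mu_1 - \mu_2)
$$

The numerator is the squared distance between the projected class means and the denominator is the summed variance of the projected classes, so maximizing $J(w)$ pushes the classes apart along $w$ relative to their spread.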

painters data: 54 painters rated on a score of 0-21 for composition, drawing, colour and expression, and classified into 8 schools.
- Columns: Composition, Drawing, Colour, Expression, School
- First ten rows (all School A): Da Udine, Da Vinci, Del Piombo, Del Sarto, Fr. Penni, Guilio Romano, Michelangelo, Perino del Vaga, Perugino, Raphael

    > library(MASS)
    > lda1 <- lda(School ~ ., data = painters)

LDA - predictions
To check how good the model is, you can see how well it predicts what actually happened:

    > predict(lda1)
    $class
     [1] D H D A A H A C A A A A A C A B B E C C B E D D D D G D D D D D E D G H E E E F G A F D G A G G E
    [50] G C H H H
    Levels: A B C D E F G H

    $posterior
    [matrix of posterior probabilities: one row per painter (Da Udine, Da Vinci, Del Piombo, Del Sarto, ...), one column per school A to H]

    > table(predict(lda1)$class, painters$School)
    [confusion matrix: predicted school (rows, A to H) by actual school (columns, A to H)]
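
The confusion matrix can be summarised as a single error rate. The sketch below refits the same model and computes the training (resubstitution) error, then a leave-one-out estimate via the `CV = TRUE` argument of `lda()`. The variable names `conf_mat`, `train_err` and `cv_err` are illustrative and not from the slides.

```r
library(MASS)  # provides lda() and the painters data

lda1 <- lda(School ~ ., data = painters)
pred <- predict(lda1)$class

# Confusion matrix: predicted school (rows) vs. actual school (columns)
conf_mat <- table(predicted = pred, actual = painters$School)

# Training error: share of painters off the diagonal
train_err <- 1 - sum(diag(conf_mat)) / sum(conf_mat)

# Leave-one-out cross-validation gives a less optimistic estimate
lda_cv <- lda(School ~ ., data = painters, CV = TRUE)
cv_err <- mean(lda_cv$class != painters$School)

print(c(training = train_err, leave_one_out = cv_err))
```

The training error is measured on the data used to fit the model, so it tends to flatter the classifier. The leave-one-out figure is the more honest check.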

Classification (Decision) Trees
Trees are among the most popular and useful of all data mining models.
- Algorithmic version of classification: no distributional assumptions
- Competing algorithms: CART, C4.5, DBMiner
- Pros:
  - can handle real and nominal inputs
  - speed and scalability
  - robustness to outliers and missing values
  - interpretability
  - compactness of classification rules
- Cons:
  - interpretability (?)
  - several tuning parameters to set, with little guidance
  - decision boundary is non-continuous

Decision Tree Example
[Figure, built up over several slides: scatter plot of Debt against Income, partitioned in turn by the splits Income > t1, Debt > t2 and Income > t3, shown alongside the corresponding tree.]
Note: tree boundaries are piecewise linear and axis-parallel.

Example: Titanic Data
- On the Titanic: 1313 passengers, 34% survived
- Was it a random sample, or did survival depend on features of the individual?
  - sex
  - age
  - class

Sample rows (columns: pclass, survived, name, age, embarked, sex; the survived and age values are not shown):

| row | pclass | name | embarked | sex |
|---|---|---|---|---|
| 1 | 1st | Allen, Miss Elisabeth Walton | Southampton | female |
| 2 | 1st | Allison, Miss Helen Loraine | Southampton | female |
| 3 | 1st | Allison, Mr Hudson Joshua Creighton | Southampton | male |
| 4 | 1st | Allison, Mrs Hudson J.C. (Bessie Waldo Daniels) | Southampton | female |
| 5 | 1st | Allison, Master Hudson Trevor | Southampton | male |
| 6 | 2nd | Anderson, Mr Harry | Southampton | male |

Decision trees
At the first ‘split’, decide which variable creates the best separation between the survivors and non-survivors (N = number of cases, p = proportion surviving):
- All cases: N = 1313, p = 0.34
  - Sex = Male: N = 850, p = 0.16
    - Age less than 12: N = 29, p = 0.73
    - Age greater than 12: N = 821, p = 0.15
      - 1st class: N = 175, p = 0.31
      - 2nd or 3rd class: N = 646, p = 0.10
  - Sex = Female: N = 463, p = 0.66
    - 1st or 2nd class: N = 250, p = 0.912
    - 3rd class: N = 213, p = 0.37

Goodness of split is determined by the ‘purity’ of the leaves.

Decision Tree Induction
- Basic algorithm (a greedy algorithm)
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all the training examples are at the root
  - Examples are partitioned recursively to create pure subgroups
  - Purity is measured by information gain, Gini index, entropy, etc.
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - All leaf nodes are smaller than a specified threshold
  - BUT: building a tree too big will overfit the data, and it will predict poorly
- Predictions
  - each leaf has class probability estimates (CPE), based on the training data that ended up in that leaf
  - majority voting is used to classify all members of the leaf

Purity in tree building
Why do we care about pure subgroups? The purity of a subgroup gives us confidence that new cases falling into this “leaf” have a given label.

Purity measures
If a data set T contains examples from n classes, the Gini index gini(T) is defined as

$$
gini(T) = 1 - \sum_{j=1}^{n} p_j^2
$$

where $p_j$ is the relative frequency of class j in T. If T is split into two subsets $T_1$ and $T_2$ with sizes $N_1$ and $N_2$ respectively (with $N = N_1 + N_2$), the Gini index of the split data is defined as

$$
gini_{split}(T) = \frac{N_1}{N}\, gini(T_1) + \frac{N_2}{N}\, gini(T_2)
$$

For the Titanic split on sex:

$$
\frac{850}{1313}\left(1 - 0.16^2 - 0.84^2\right) + \frac{463}{1313}\left(1 - 0.66^2 - 0.34^2\right) \approx 0.33
$$

The attribute that provides the smallest $gini_{split}(T)$ is chosen to split the node (this requires enumerating all possible splitting points for each attribute). Another often-used measure: entropy.
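
As a check on the Titanic numbers, here is a small R sketch of the weighted Gini index for the split on sex. It assumes the standard definition gini(T) = 1 - sum of squared class frequencies; the helper `gini()` and the variable names are illustrative.

```r
# Gini index of a node, given its vector of class proportions
gini <- function(p) 1 - sum(p^2)

# Node sizes and survival rates quoted for the Titanic split on sex
n_male   <- 850; p_male   <- 0.16
n_female <- 463; p_female <- 0.66
n <- n_male + n_female

gini_parent <- gini(c(0.34, 0.66))
gini_split  <- n_male / n   * gini(c(p_male,   1 - p_male)) +
               n_female / n * gini(c(p_female, 1 - p_female))

print(c(parent = gini_parent, split_on_sex = gini_split))
```

The split lowers the impurity from about 0.45 at the root to about 0.33, which is why sex is an attractive first split.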

Calculating Information Gain
Information Gain = Impurity(parent) – [Impurity(children)]
- Entire population: 30 instances
- Balance>=50K: 17 instances
- Balance<50K: 13 instances

(Weighted) average impurity of children = $\frac{17}{30}(0.787) + \frac{13}{30}(0.39) \approx 0.615$

Information Gain = Entropy(parent) – Entropy(children) = $0.996 - 0.615 \approx 0.38$

Information Gain
Information Gain = Impurity(parent) – [Impurity(children)]. At each node, choose the attribute that yields the maximum information gain.
[Figure: tree separating bad risk (default) from good risk (not default).]
- A (entire population): Impurity(A) = 0.996
  - Split on Balance: Impurity(B,C) = 0.61, Gain = 0.38
  - B (Balance>=50K): Impurity(B) = 0.787
    - Split on Age: Impurity(D,E) = 0.405, Gain = 0.205
    - D (Age>=45): Impurity(D) = $-0\log_2 0 - 1\log_2 1 = 0$
    - E (Age<45): Impurity(E) = $-\frac{3}{7}\log_2\frac{3}{7} - \frac{4}{7}\log_2\frac{4}{7} = 0.985$
  - C (Balance<50K): Impurity(C) = 0.39
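
The entropy values in this example can be reproduced with a few lines of R. This sketch uses only the node sizes and impurities quoted on the slides; the helper `entropy()` and the variable names are illustrative.

```r
# Entropy (in bits) of a node, given its class proportions; 0 * log2(0) is treated as 0
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

impurity_E <- entropy(c(3/7, 4/7))  # node E
impurity_D <- entropy(c(0, 1))      # node D is pure

# Gain of the Balance split: parent A (30 instances), children B (17) and C (13)
impurity_children <- 17/30 * 0.787 + 13/30 * 0.39
gain_balance <- 0.996 - impurity_children

print(c(E = impurity_E, D = impurity_D, gain = gain_balance))
```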

Avoid Overfitting in Classification
- The generated tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - The result is poor accuracy for unseen samples
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early. Do not split a node if this would result in the goodness measure falling below a threshold
    - It is difficult to choose an appropriate threshold
  - Postpruning: remove branches from a “fully grown” tree to get a sequence of progressively pruned trees
    - Use a set of data different from the training data to decide which is the “best pruned tree”

Which attribute to split over?
Brute-force search:
- At each node, examine splits over each of the attributes
- Select the attribute for which the maximum information gain is obtained

[Figure: example split on Balance (<50K vs. >=50K).]

Finding the right size
Use a hold-out sample (n-fold cross-validation):
- overfit a tree, with many leaves
- snip the tree back and use the hold-out sample for prediction; calculate the predictive error
- record the error rate for each tree size
- repeat for n folds
- plot the average error rate as a function of tree size
- fit a tree of the optimal size to the entire data set

R note: can use cv.tree() from the tree package
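
A hedged sketch of this procedure in R. It uses the `rpart` package, which runs the cross-validation internally while fitting, rather than the function named in the R note. The built-in `iris` data is a stand-in chosen for illustration; it is not one of the course data sets.

```r
library(rpart)

set.seed(1)  # the cross-validation folds are chosen at random

# Deliberately overfit: cp = 0 and minsplit = 2 grow a very large tree,
# and xval = 10 requests 10-fold cross-validation of every pruned subtree
big_tree <- rpart(Species ~ ., data = iris, method = "class",
                  control = rpart.control(cp = 0, minsplit = 2, xval = 10))

# One row per candidate tree size; "xerror" is the cross-validated error
cp_table <- big_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Snip the tree back to the size with the lowest cross-validated error
pruned <- prune(big_tree, cp = best_cp)

print(cp_table)
```

`plotcp(big_tree)` draws the cross-validated error against tree size, which corresponds to the plot described in the list above.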

Olive oil data
- classification of Italian olive oils by their components
- 9 areas, from 3 regions
- columns: X, region, area, palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic
- the first six rows shown are labelled 1.North-Apulia to 6.North-Apulia

Regression Trees
Trees can also be used for regression, when the response is real-valued:
- the leaf prediction is the mean value instead of class probability estimates (CPE)
- helpful with categorical predictors

Tips data

Treating Missing Data in Trees
Missing values are common in practice. Approaches to handling missing values:
- During training
  - Ignore rows with missing values (inefficient)
- During testing
  - Send the example being classified down both branches and average the predictions
  - Replace missing values with an “imputed value”
- Other approaches
  - Treat “missing” as a unique value (useful if missing values are correlated with the class)
  - Surrogate splits method: search for and store “surrogate” variables/splits during training

Other Issues with Classification Trees
- Can use non-binary splits
  - Multi-way splits
  - Linear combinations
  - These tend to increase complexity substantially and don’t improve performance
  - Binary splits are interpretable, even by non-experts, and easy to compute and visualize
- Model instability
  - A small change in the data can lead to a completely different tree
  - Model averaging techniques (like bagging) can be useful
- Restricted to splits along coordinate axes
- Discontinuities in prediction space

Why Trees are widely used in Practice
- Can handle high-dimensional data
  - builds a model using one dimension at a time
- Can handle any type of input variable
  - categorical, real-valued, etc.
- Invariant to monotonic transformations of input variables
  - e.g., using x, 10x + 2, log(x), 2^x, etc., will not change the tree
  - so scaling is not a factor: the user can be sloppy!
- Trees are (somewhat) interpretable
  - a domain expert can “read off” the tree’s logic
- Tree algorithms are relatively easy to code and test
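
The invariance claim can be checked empirically. The sketch below fits one tree to raw predictors and another to their logarithms and compares the fitted classes. It assumes the `rpart` package and the built-in `iris` data, neither of which comes from the slides.

```r
library(rpart)

# Tree on the raw measurements
fit_raw <- rpart(Species ~ ., data = iris, method = "class")

# Same data with every numeric predictor log-transformed (all values are positive)
iris_log <- iris
iris_log[1:4] <- lapply(iris[1:4], log)
fit_log <- rpart(Species ~ ., data = iris_log, method = "class")

# A monotonic transformation keeps the ordering of each variable, so the same
# partitions of the training data are available and the fitted classes should agree
same <- all(predict(fit_raw, type = "class") == predict(fit_log, type = "class"))
print(same)
```

Only the split thresholds change (they move to the log scale); the partition of the training cases is expected to stay the same.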

Limitations of Trees
- Representational bias
  - classification: piecewise linear boundaries, parallel to the axes
  - regression: piecewise constant surfaces
- High variance
  - trees can be “unstable” as a function of the sample
    - e.g., a small change in the data -> a completely different tree
  - this causes two problems
    1. High variance contributes to prediction error
    2. High variance reduces interpretability
  - Trees are good candidates for model combining
    - often used with boosting and bagging

Decision Trees are not stable
Moving just one example slightly may lead to quite different trees and a quite different partition of the space! Trees lack stability against small perturbations of the data. Figure from Duda, Hart & Stork, Chap. 8.

Random Forests
- Another con for trees: they are sensitive to the primary split, which can lead the tree in inappropriate directions
  - one way to see this: fit a tree on a random sample, or a bootstrapped sample, of the data
- Solution: random forests, an ensemble of unpruned decision trees
  - each tree is built on a random subset of the training data
  - at each split point, only a random subset of the predictors is considered
  - many parameters to fiddle with!
  - the prediction is simply the majority vote of the trees (or the mean prediction of the trees)
- Has the advantages of trees, with more robustness and a smoother decision rule
- Also, they are trendy!

Other Models: k-NN
k-Nearest Neighbors (kNN): to classify a new point
- find its k nearest neighbors in the training set, i.e. the circle of radius r around the point that just reaches the kth nearest neighbor
- what is the class distribution within this circle?

Advantages
- simple to understand
- simple to implement

Disadvantages
- what is k?
  - k = 1: high variance, sensitive to the data
  - k large: robust, reduces variance, but blends everything together (includes ‘far away’ points)
- what is near?
  - Euclidean distance assumes all inputs are equally important
  - how do you deal with categorical data?
- no interpretable model

Best to use cross-validation and visualization techniques to pick k.
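
A minimal sketch of picking k by cross-validation in R, assuming the `class` package (its `knn.cv()` function does leave-one-out cross-validation) and the built-in `iris` data as a stand-in. The grid of k values is arbitrary.

```r
library(class)  # provides knn() and knn.cv()

set.seed(1)  # ties among neighbours are broken at random

# Standardise the inputs so that no variable dominates the Euclidean distance
X <- scale(iris[, 1:4])
y <- iris$Species

# Leave-one-out misclassification rate for each candidate k
ks <- c(1, 3, 5, 7, 9, 15, 25)
loo_err <- sapply(ks, function(k) mean(knn.cv(X, y, k = k) != y))

best_k <- ks[which.min(loo_err)]
print(data.frame(k = ks, loo_error = loo_err))
```

Scaling addresses the "what is near?" question only partly: after standardising, every input still counts equally, whether or not it is relevant to the class.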

Probabilistic (Bayesian) Models for Classification
If you belong to class k, you have a distribution over input vectors:

$$
p(x \mid c_k)
$$

Then, given priors $p(c_k)$, we can get the posterior distribution on classes:

$$
p(c_k \mid x) = \frac{p(x \mid c_k)\, p(c_k)}{\sum_j p(x \mid c_j)\, p(c_j)}
$$

At each point in the x space, we have a predicted class vector, allowing for decision boundaries.

Example of Probabilistic Classification
[Figure: class-conditional densities p( x | c1 ) and p( x | c2 ) plotted against x, with the posterior p( c1 | x ) drawn on a scale from 0 to 1 and the 0.5 level marked.]

Decision Regions and Bayes Error Rate
[Figure: densities p( x | c1 ) and p( x | c2 ), with the x axis divided into alternating regions labelled Class c2, Class c1, Class c2, Class c1, Class c2.]
- Optimal decision regions = regions where one class is more likely
- Optimal decision regions => optimal decision boundaries
- Bayes error rate = fraction of examples misclassified by the optimal classifier (the shaded area in the figure)
- If $\max_k p(c_k \mid x) = 1$, then there is no error at x. Hence:

$$
p_{error} = \int \left(1 - \max_k p(c_k \mid x)\right) p(x)\, dx
$$

Procedure for optimal Bayes classifier
- For each class, learn a model $p(x \mid c_k)$
  - e.g., each class is multivariate Gaussian with its own mean and covariance
- Use Bayes rule to obtain $p(c_k \mid x)$
  - => this yields the optimal decision regions/boundaries
  - => use these decision regions/boundaries for classification
- Correct in theory, but practical problems include:
  - How do we model $p(x \mid c_k)$?
  - Even if we know the model for $p(x \mid c_k)$, modeling a distribution or density will be very difficult in high dimensions (e.g., p = 100)
- Alternative approach: model the decision boundaries directly
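
A one-dimensional sketch of the procedure in R, assuming two Gaussian classes. The means, standard deviations and prior are made-up illustrative values, and `posterior_c1()` is a hypothetical helper, not something from the slides.

```r
# Posterior p(c1 | x) by Bayes rule for two univariate Gaussian classes.
# All parameter defaults are illustrative assumptions.
posterior_c1 <- function(x, mu1 = 0, sd1 = 1, mu2 = 2, sd2 = 1, prior1 = 0.5) {
  num <- dnorm(x, mu1, sd1) * prior1              # p(x | c1) p(c1)
  den <- num + dnorm(x, mu2, sd2) * (1 - prior1)  # sum over both classes
  num / den
}

xs   <- seq(-3, 5, by = 0.01)
post <- posterior_c1(xs)

# Decision regions: predict c1 wherever its posterior exceeds 0.5
predicted <- ifelse(post > 0.5, "c1", "c2")
```

With equal priors and equal variances the boundary falls halfway between the two means. Changing `prior1` or the standard deviations moves it, and unequal variances can produce more than one boundary.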

Bayesian Classification: Why?
- Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
- Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Naïve Bayes Classifiers
- Generative probabilistic model with a conditional independence assumption on $p(x \mid c_k)$, i.e. $p(x \mid c_k) = \prod_j p(x_j \mid c_k)$
- Typically used with nominal variables
  - real-valued variables are discretized to create nominal versions
- Comments:
  - simple to train (just estimate conditional probabilities for each feature-class pair)
  - often works surprisingly well in practice, e.g., state of the art for text classification and the basis of many widely used spam filters

Naïve Bayes
When all variables are categorical, classification should be easy, since all x's can be enumerated. But remember the curse of dimensionality!

Naïve Bayes Classification
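Recall: $p(c_k \mid x) \propto p(x \mid c_k)\, p(c_k)$

Now assume the variables are conditionally independent given the classes:

$$
p(x \mid c_k) = \prod_{j=1}^{p} p(x_j \mid c_k)
$$

[Figure: graphical model with the class node C pointing to each of x1, x2, ..., xp.]
- Is this a valid assumption? Probably not, but perhaps still useful
- Example: symptoms and diseases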

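Naïve Bayes
Estimate of the probability that a point x will belong to $c_k$:

$$
p(c_k \mid x) \propto p(c_k) \prod_{j=1}^{p} p(x_j \mid c_k)
$$

If there are two classes, the log odds decompose into a sum of “weights of evidence”, one per variable:

$$
\log \frac{p(c_1 \mid x)}{p(c_2 \mid x)} = \log \frac{p(c_1)}{p(c_2)} + \sum_{j=1}^{p} \log \frac{p(x_j \mid c_1)}{p(x_j \mid c_2)}
$$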

Play-tennis example: estimating P(xi|C)
Each entry is P(value | class), for class y (play) or n (don't play).

| attribute | value | y | n |
|---|---|---|---|
| outlook | sunny | 2/9 | 3/5 |
| outlook | overcast | 4/9 | 0 |
| outlook | rain | 3/9 | 2/5 |
| temperature | hot | 2/9 | 2/5 |
| temperature | mild | 4/9 | 2/5 |
| temperature | cool | 3/9 | 1/5 |
| humidity | high | 3/9 | 4/5 |
| humidity | normal | 6/9 | 1/5 |
| windy | true | 3/9 | 3/5 |
| windy | false | 6/9 | 2/5 |

Priors: P(y) = 9/14, P(n) = 5/14

Play-tennis example: classifying X
An unseen sample X = <rain, hot, high, false>
- P(X|y)·P(y) = P(rain|y)·P(hot|y)·P(high|y)·P(false|y)·P(y) = 3/9·2/9·3/9·6/9·9/14 = 0.010582
- P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286

Sample X is classified in class n (you’ll lose!)
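
The same calculation as a short R sketch. The conditional probabilities are the ones used in the products above; the object names `cond`, `prior` and `scores` are illustrative.

```r
# P(value | class) for the four observed values of X = <rain, hot, high, false>
cond <- list(
  y = c(rain = 3/9, hot = 2/9, high = 3/9, false = 6/9),
  n = c(rain = 2/5, hot = 2/5, high = 4/5, false = 2/5)
)
prior <- c(y = 9/14, n = 5/14)

# Naive Bayes score for each class: product of the conditionals times the prior
scores <- sapply(names(prior), function(cl) prod(cond[[cl]]) * prior[[cl]])

# Normalising the scores gives posterior probabilities
posterior <- scores / sum(scores)
predicted <- names(which.max(scores))

print(round(scores, 6))
print(predicted)
```

Only the ratio of the two scores matters for the decision, so the normalising constant p(X) never has to be computed.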

The independence hypothesis…
- … makes computation possible
- … yields optimal classifiers when satisfied
- … but is seldom satisfied in practice, as attributes (variables) are often correlated
- Yet, empirically, naïve Bayes performs really well in practice

Lab #5: Olive Oil Data
- From the Cook and Swayne book
- Consists of the % composition of fatty acids found in the lipid fraction of Italian olive oils
- The study was done to determine the authenticity of olive oils
- Variables:
  - region (North, South, and Sardinia)
  - area (nine areas)
  - 9 fatty acids and their percentages

Lab #5: Spam Data
- Collected at Iowa State University (Cook and Swayne)
- 2171 cases, 21 variables
- Be careful: 3 variables (spampct, category, and spam) were determined by spam models. Do not use these for fitting!
- Goal: distinguish spam from valid mail