1 CSE 4705 Artificial Intelligence
Jinbo Bi, Department of Computer Science & Engineering

2 Machine learning (1) Supervised learning algorithms

3 A bit review of last class
Fisher’s discriminant analysis: assuming each class follows a Gaussian distribution, X | Y = k ~ N(μ_k, Σ_k), we can use the maximum a posteriori rule to derive a quadratic classification rule (quadratic in x):
C(X) = argmin_k { (X − μ_k)' Σ_k⁻¹ (X − μ_k) + log|Σ_k| }
If we further assume that the two classes share the same covariance matrix, Σ_k = Σ, the discriminant rule becomes linear (linear discriminant analysis, or LDA; FLDA for k = 2).
We then gave a separate derivation of the linear discriminant rule using a geometric interpretation.
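To make the quadratic rule concrete, here is a minimal sketch in Python/NumPy (not from the slides); the helper names fit_gaussian_classes and qda_predict are illustrative, and equal class priors are assumed, matching the rule above:

```python
import numpy as np

def fit_gaussian_classes(X, y):
    """Estimate a mean vector and covariance matrix per class (illustrative helper)."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = (Xk.mean(axis=0), np.cov(Xk, rowvar=False))
    return params

def qda_predict(x, params):
    """Quadratic rule: argmin_k (x - mu_k)' Sigma_k^{-1} (x - mu_k) + log|Sigma_k|."""
    scores = {}
    for k, (mu, Sigma) in params.items():
        d = x - mu
        scores[k] = d @ np.linalg.solve(Sigma, d) + np.log(np.linalg.det(Sigma))
    return min(scores, key=scores.get)
```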

4 A bit review of last class
Using the geometric interpretation, we maximize the signal-to-noise ratio: between-class separation over within-class cohesion. That gives exactly the same discriminant rule. Hence, we can conclude that Fisher’s linear discriminant rule does find a direction along which the data shows the best separability.
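As a concrete sketch (not part of the slides), the Fisher direction for two classes can be computed as w ∝ S_W⁻¹(μ1 − μ2); the function name and scatter-matrix construction below are illustrative:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Direction w maximizing between-class separation over within-class scatter."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the two per-class scatter matrices.
    S_w = np.cov(X1, rowvar=False) * (len(X1) - 1) + np.cov(X2, rowvar=False) * (len(X2) - 1)
    w = np.linalg.solve(S_w, mu1 - mu2)
    return w / np.linalg.norm(w)
```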

5 Geometric interpretation (two-class)
The geometric-interpretation version of linear discriminant analysis can be extended to multi-class classification (not covered in class; interested students can see the additional slides from the last lecture).

6 Supervised learning: classification
Underfitting and overfitting can also happen in classification approaches. We now illustrate these practical issues on a classification problem. We will cover neural networks next week.

7 Underfitting and overfitting
500 circular and 500 triangular data points. Circular points: 0.5 ≤ sqrt(x1² + x2²) ≤ 1. Triangular points: sqrt(x1² + x2²) > 1 or sqrt(x1² + x2²) < 0.5.
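A minimal sketch of how such a synthetic data set could be generated; the rejection-sampling scheme and the outer radius of 1.5 are assumptions, since the slides only specify the two regions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_points(n, low, high):
    """Rejection-sample n points in [-high, high]^2 whose radius lies in [low, high)."""
    pts = []
    while len(pts) < n:
        p = rng.uniform(-high, high, size=2)
        if low <= np.sqrt(p[0]**2 + p[1]**2) < high:
            pts.append(p)
    return np.array(pts)

# Circular class: 0.5 <= r <= 1; triangular class: r < 0.5 or r > 1 (sampled here up to r = 1.5).
circles = sample_points(500, 0.5, 1.0)
inner = sample_points(250, 0.0, 0.5)
outer = sample_points(250, 1.0, 1.5)
triangles = np.vstack([inner, outer])
```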

8 Overfitting
500 circular and 500 triangular data points. Circular points: 0.5 ≤ sqrt(x1² + x2²) ≤ 1. Triangular points: sqrt(x1² + x2²) > 1 or sqrt(x1² + x2²) < 0.5.

9 Overfitting due to noise
The decision boundary is distorted by the noise points.

10 Overfitting due to insufficient examples
Lack of data points in the lower half of the diagram makes it difficult to predict the class labels of that region correctly. The insufficient number of training records in the region causes the neural net to predict the test examples using other training records that are irrelevant to the classification task.

11 Notes on over-fitting
Overfitting results in classifiers (a neural net, or a support vector machine) that are more complex than necessary. Even if the training error is small, the test error could be large: the training error no longer provides a good estimate of how well the classifier will perform on previously unseen records. We need new ways of estimating errors; regularized errors are again one kind of solution.

12 Regularization and Occam’s razor
Regularization means adding a penalty on model complexity, such as the 2-norm or the 1-norm of the parameter vector. It is fundamentally similar to Occam’s razor: given two models with similar test errors, one should prefer the simpler model over the more complex one. For a complex model, there is a greater chance that it was fitted accidentally to noise in the data. Therefore, one should take model complexity into account when selecting the best model.
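A minimal sketch of a regularized objective (squared-error loss on a linear model plus a 2-norm or 1-norm penalty); the loss, the value of lam, and the function name are illustrative, not from the slides:

```python
import numpy as np

def regularized_loss(w, X, y, lam=0.1, norm="l2"):
    """Squared-error training loss plus a complexity penalty on the weights."""
    residual = X @ w - y
    data_term = np.mean(residual ** 2)
    penalty = np.sum(w ** 2) if norm == "l2" else np.sum(np.abs(w))  # 2-norm vs 1-norm
    return data_term + lam * penalty
```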

14 How to address over-fitting
Minimizing the training error no longer guarantees a good model (a classifier or a regressor). We need a better estimate of the error on the true population, the generalization error P_population( f(x) ≠ y ). In practice, we design procedures that give a better estimate of the error than the training error. In theoretical analysis, we find an analytical bound on the generalization error or use Bayesian formulas.

15 How to evaluate a model or an algorithm
Even if we use regularization or other schemes to overcome overfitting, how do we know whether the resulting model is better or worse? This motivates us to study model evaluation.

16 Model evaluation
Metrics for performance evaluation: how to evaluate the performance of a model?
Methods for performance evaluation: how to obtain reliable estimates?
Methods for model comparison: how to compare the relative performance among competing models?

17 Metric for performance evaluation
Regression (accuracy-related performance):
- Sum of squares
- Root mean square error (RMSE)
- Sum of deviation
- Exponential function of the deviation
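For concreteness, a short sketch of two of these regression metrics (the function names are illustrative):

```python
import numpy as np

def sse(y_true, y_pred):
    """Sum of squared errors."""
    return np.sum((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```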

18 Metric for performance evaluation
Classification: focus on the predictive capability of a model, rather than how fast it classifies or builds models, scalability, etc. Confusion matrix (for two-class classification):

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL   Class=Yes          a           b
CLASS    Class=No           c           d

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
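A minimal sketch of computing the four counts from true and predicted labels (the function name and label encoding are illustrative):

```python
import numpy as np

def confusion_counts(y_true, y_pred, positive="Yes"):
    """Return (TP, FN, FP, TN) for a two-class problem."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == positive) & (y_pred == positive))
    fn = np.sum((y_true == positive) & (y_pred != positive))
    fp = np.sum((y_true != positive) & (y_pred == positive))
    tn = np.sum((y_true != positive) & (y_pred != positive))
    return tp, fn, fp, tn
```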

19 Metric for performance evaluation
Most widely used metric: accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN).

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL   Class=Yes       a (TP)      b (FN)
CLASS    Class=No        c (FP)      d (TN)

20 Limitation of accuracy
Consider a 2-class problem: the number of class 0 examples is 9990 and the number of class 1 examples is 10. If the model predicts everything to be class 0, its accuracy is 9990/10000 = 99.9%. Accuracy is misleading here because the model does not detect any class 1 example.

21 Cost matrix
C(i|j): cost of misclassifying a class j example as class i.

                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL   Class=Yes     C(Yes|Yes)   C(No|Yes)
CLASS    Class=No      C(Yes|No)    C(No|No)

22 Cost function of classification
Cost matrix C(i|j):
                    PREDICTED CLASS
                       +       -
ACTUAL    +           -1      100
CLASS     -            1        0

Model M1:
                    PREDICTED CLASS
                       +       -
ACTUAL    +          150       40
CLASS     -           60      250
Accuracy = 80%, Cost = 3910

Model M2:
                    PREDICTED CLASS
                       +       -
ACTUAL    +          250       45
CLASS     -            5      200
Accuracy = 90%, Cost = 4255
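A short sketch that reproduces the accuracy and cost numbers above from the two confusion matrices and the cost matrix; the array layout (rows = actual, columns = predicted) is an assumption:

```python
import numpy as np

# Rows: actual (+, -); columns: predicted (+, -).
cost_matrix = np.array([[-1, 100],
                        [ 1,   0]])
M1 = np.array([[150,  40],
               [ 60, 250]])
M2 = np.array([[250,  45],
               [  5, 200]])

for name, cm in [("M1", M1), ("M2", M2)]:
    accuracy = np.trace(cm) / cm.sum()
    cost = np.sum(cm * cost_matrix)   # element-wise product, then sum
    print(name, f"accuracy={accuracy:.0%}", f"cost={cost}")
# M1 accuracy=80% cost=3910, M2 accuracy=90% cost=4255
```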

23 Cost versus accuracy
Count matrix:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL   Class=Yes          a           b
CLASS    Class=No           c           d

Cost matrix:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL   Class=Yes          p           q
CLASS    Class=No           q           p

N = a + b + c + d
Accuracy = (a + d) / N
Cost = p(a + d) + q(b + c)
     = p(a + d) + q(N − a − d)
     = qN − (q − p)(a + d)
     = N [q − (q − p) × Accuracy]
Hence accuracy and cost are equivalent criteria (cost is a linear function of accuracy) if:
1. C(Yes|No) = C(No|Yes) = q
2. C(Yes|Yes) = C(No|No) = p

24 Cost-sensitive measures
Count matrix:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL   Class=Yes          a           b
CLASS    Class=No           c           d

Precision = a / (a + c); it is biased towards C(Yes|Yes) and C(Yes|No).
Recall = a / (a + b); it is biased towards C(Yes|Yes) and C(No|Yes).
A model that declares every record to be the positive class (b = d = 0) has high recall.
A model that assigns the positive class only to the test records it is sure about (c is small) has high precision.
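A minimal sketch of both measures from the four counts (the function name is illustrative; the zero-denominator handling is an added convention):

```python
def precision_recall(tp, fn, fp, tn):
    """Precision = TP/(TP+FP); Recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall
```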

25 Cost-sensitive measures
Count matrix:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL   Class=Yes       a (TP)      b (FN)
CLASS    Class=No        c (FP)      d (TN)

26 A more comprehensive metric
When a classifier outputs a membership probability (or a numerical score) for a class for each example (and most classifiers do), another, more comprehensive metric can be used: the receiver operating characteristic curve, or ROC curve for short.

27 ROC curve
TPR = TP/(TP + FN)
FPR = FP/(FP + TN)

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL   Class=Yes       a (TP)      b (FN)
CLASS    Class=No        c (FP)      d (TN)

At threshold t: TP = 50, FN = 50, FP = 12, TN = 88, so TPR = 0.50 and FPR = 0.12.

28 ROC curve
(TPR, FPR) points:
(0,0): declare everything to be the negative class (TP = 0, FP = 0)
(1,1): declare everything to be the positive class (FN = 0, TN = 0)
(1,0): ideal (FN = 0, FP = 0)

TPR = TP/(TP + FN), FPR = FP/(FP + TN)

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL   Class=Yes       a (TP)      b (FN)
CLASS    Class=No        c (FP)      d (TN)

29 ROC curve
(TPR, FPR) points: (0,0): declare everything to be the negative class; (1,1): declare everything to be the positive class; (1,0): ideal.
Diagonal line: random guessing.
Below the diagonal line: prediction is the opposite of the true class.

30 How to construct an ROC curve
Use a classifier that produces a posterior probability P(+|X) for each test instance X.
Sort the instances according to P(+|X) in decreasing order.
Apply a threshold at each unique value of P(+|X).
Count the number of TP, FP, TN, FN at each threshold.
TP rate: TPR = TP/(TP + FN)
FP rate: FPR = FP/(FP + TN)

Instance   P(+|X)   True class
1          0.95     +
2          0.93
3          0.87     -
4          0.85
5          0.85
6          0.85
7          0.76
8          0.53
9          0.43
10         0.25

31 How to construct an ROC curve
Use a classifier that produces a posterior probability P(+|X) for each test instance and sort the instances by P(+|X) in decreasing order (see the table on the previous slide).
Pick a threshold, e.g., 0.85: if P(+|X) >= 0.85, predict positive (P); if P(+|X) < 0.85, predict negative (N).
This gives TP = 3, FP = 3, TN = 2, FN = 2, so
TP rate: TPR = 3/5 = 60%
FP rate: FPR = 3/5 = 60%
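A minimal sketch of the whole construction. The scores come from the table above; the class labels are a hypothetical assignment chosen to be consistent with the slide's counts (5 positives, 5 negatives, TP = 3 and FP = 3 at threshold 0.85), since not all labels survive in this transcript:

```python
import numpy as np

# Scores from the slides; labels below are a hypothetical assignment consistent
# with the slide's counts (5 positives, 5 negatives; TP=3, FP=3 at t=0.85).
scores = np.array([0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25])
labels = np.array([1,    1,    0,    1,    0,    0,    0,    1,    0,    1])

def roc_points(scores, labels):
    """Sweep a threshold over the unique scores and record (FPR, TPR) pairs."""
    P, N = labels.sum(), len(labels) - labels.sum()
    points = [(0.0, 0.0)]
    for t in sorted(np.unique(scores), reverse=True):
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        points.append((fp / N, tp / P))
    return points

print(roc_points(scores, labels))  # at t=0.85 this yields (0.6, 0.6), matching the slide
```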

32 How to construct an ROC curve
(Figure: table of TP, FP, TN, FN and TPR/FPR at each threshold value "Threshold >= ...", together with the resulting ROC curve.)

33 Model evaluation
Metrics for performance evaluation: how to evaluate the performance of a model?
Methods for performance evaluation: how to obtain reliable estimates?
Methods for model comparison: how to compare the relative performance among competing models?

34 Methods for performance evaluation
How to obtain a reliable estimate of performance? The performance of a model may depend on other factors besides the learning algorithm: the class distribution, the cost of misclassification, and the sizes of the training and test sets.

36 Methods of estimation
Holdout: for instance, reserve 2/3 for training and 1/3 for testing.
Random subsampling: repeated holdout.
Cross validation: partition the data into k disjoint subsets; k-fold: train on k-1 partitions, test on the remaining one; leave-one-out: k = n.
Stratified sampling: oversampling vs. undersampling.
Bootstrap: sampling with replacement.

37 Methods of estimation
Holdout method: the given data is randomly partitioned into two independent sets, a training set (e.g., 2/3) for model construction and a test set (e.g., 1/3) for accuracy estimation.
Random sampling: a variation of holdout; repeat holdout k times, and take the accuracy as the average of the accuracies obtained.
Cross-validation (k-fold, where k = 10 is most popular): randomly partition the data into k mutually exclusive subsets D_1, ..., D_k, each of approximately equal size; at the i-th iteration, use D_i as the test set and the others as the training set (see the sketch below).
Leave-one-out: k folds where k = the number of tuples; for small data sets.
Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data.
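A minimal k-fold sketch; train_fn and predict_fn are placeholders for any learning algorithm, X and y are NumPy arrays, and libraries such as scikit-learn provide equivalent utilities:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k disjoint folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cross_val_accuracy(train_fn, predict_fn, X, y, k=10):
    """Average accuracy over k folds: hold out one fold, train on the rest."""
    folds = kfold_indices(len(X), k)
    accs = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train_idx], y[train_idx])
        accs.append(np.mean(predict_fn(model, X[test_idx]) == y[test_idx]))
    return np.mean(accs)
```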

38 Methods of estimation
Bootstrap: works well with small data sets. It samples the given training tuples uniformly with replacement, i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set.
There are several bootstrap methods; a common one is the .632 bootstrap: suppose we are given a data set of d examples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data points that did not make it into the training set end up forming the test set. About 63.2% of the original data will end up in the bootstrap sample, and the remaining 36.8% will form the test set (since (1 − 1/d)^d ≈ e^(−1) = 0.368).
Repeat the sampling procedure k times; the overall accuracy of the model is then estimated by averaging over the k rounds, weighting each round's test-set accuracy by 0.632 and its training-set accuracy by 0.368:
Acc(M) = (1/k) Σ_{i=1}^{k} [ 0.632 × Acc(M_i)_test + 0.368 × Acc(M_i)_train ]
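A short sketch of one bootstrap round, illustrating that roughly 63.2% of the examples end up in the training sample (the classifier itself is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10000
idx = rng.integers(0, d, size=d)                     # sample d times with replacement
in_train = np.unique(idx)                            # distinct examples that were drawn
out_of_bag = np.setdiff1d(np.arange(d), in_train)    # examples left for testing
print(len(in_train) / d)                             # ≈ 0.632
print(len(out_of_bag) / d)                           # ≈ 0.368
```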

39 Model evaluation
Metrics for performance evaluation: how to evaluate the performance of a model?
Methods for performance evaluation: how to obtain reliable estimates?
Methods for model comparison: how to compare the relative performance among competing models?

40 Methods for model comparison
Compute a specific accuracy metric for all available models using the same test data set, and compare the values.
For instance, compute the RMSE of two regression models: RMSE(M1) = RMSE(M2) = 0.3. Which one is better?
For instance (classification), compute sensitivity and specificity: Sen(M1) = 0.7, Spec(M1) = 0.8; Sen(M2) = 0.6, Spec(M2) = 0.65.

41 Using ROC for model comparison
No model consistently outperforms the other: M1 is better for small FPR, while M2 is better for large FPR.
Area under the ROC curve (AUC): ideal: area = 1; random guess: area = 0.5.
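A minimal sketch of the area under an ROC curve via the trapezoidal rule; it can be applied to the (FPR, TPR) points produced by the illustrative roc_points helper sketched earlier:

```python
def auc(points):
    """Trapezoidal area under (FPR, TPR) points, assumed to span FPR from 0 to 1."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts[:-1], pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0   # trapezoid between consecutive points
    return area
```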

42 Questions?

