CpSc 810: Machine Learning - Evaluation of Classifiers


2 Copyright Notice Most slides in this presentation are adapted from the textbook slides and various other sources. The copyright belongs to the original authors. Thanks!

3 Classifier Accuracy Measures
Confusion matrix (actual class in rows, predicted class in columns):

                Predicted C1            Predicted C2
  Actual C1     True positive (TP)      False negative (FN)
  Actual C2     False positive (FP)     True negative (TN)

The slide also shows an example confusion matrix for the classes buy_computer = yes and buy_computer = no, with per-class and overall totals.

4 Classifier Accuracy Measures
Sensitivity: the percentage of correctly predicted positive data over the total number of positive data.
Specificity: the percentage of correctly identified negative data over the total number of negative data.
Accuracy: the percentage of correctly predicted positive and negative data over the sum of positive and negative data.

5 Classifier Accuracy Measures
Precision: the percentage of correctly predicted positive data over the total number of predicted positive data.
The F-measure (also called the F-score) is a weighted harmonic mean of precision and recall; it considers both the precision and the recall of the test to compute the score.
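
For reference, a minimal sketch (an assumed helper, not part of the original slides) that computes the measures of slides 3-5 from the four confusion-matrix counts:

```python
def classification_metrics(tp, fn, fp, tn, beta=1.0):
    """Compute the accuracy measures of slides 4-5 from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                    # recall / true positive rate
    specificity = tn / (tn + fp)                    # true negative rate
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    # F-measure: weighted harmonic mean of precision and recall (beta = 1 gives the F1 score)
    f_measure = ((1 + beta**2) * precision * sensitivity) / (beta**2 * precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "precision": precision, "f_measure": f_measure}

# Example: 90 true positives, 10 false negatives, 20 false positives, 880 true negatives
print(classification_metrics(tp=90, fn=10, fp=20, tn=880))
```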

6 ROC curves
ROC = Receiver Operating Characteristic. The method started in electronic signal detection theory (1940s-1950s) and has become a very popular method in machine learning applications for assessing classifiers.

7 ROC curves: simplest case
Consider a diagnostic test for a disease. The test has 2 possible outcomes: 'positive' = suggesting presence of the disease, and 'negative' = suggesting absence of the disease. An individual can test either positive or negative for the disease.

8 Hypothesis testing refresher
Two 'competing theories' regarding a population parameter: the NULL hypothesis H0 and the ALTERNATIVE hypothesis HA.
H0: NO DIFFERENCE; any observed deviation from what we expect to see is due to chance variability.
HA: THE DIFFERENCE IS REAL.

9 Test Statistic
Measure how far the observed data are from what is expected under the NULL hypothesis H0 by computing the value of a test statistic (TS) from the data.
The particular TS computed depends on the parameter. For example, to test the population mean μ, the TS is the sample mean (or standardized sample mean).
The NULL is rejected if the TS falls in a user-specified 'rejection region'.
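
As a concrete illustration of the refresher, here is a minimal sketch of a one-sample z-test for a population mean with known standard deviation (the numbers are made up for illustration; this example is not from the slides):

```python
import math

mu_0, sigma = 50.0, 10.0                      # hypothesized mean under H0, known population std. dev.
sample = [52.1, 49.3, 55.0, 51.2, 48.7, 53.4, 50.9, 54.1]

n = len(sample)
x_bar = sum(sample) / n                       # sample mean
z = (x_bar - mu_0) / (sigma / math.sqrt(n))   # standardized sample mean: the test statistic

# Two-sided rejection region at significance level 0.05: |z| > 1.96
print(f"z = {z:.3f}, reject H0: {abs(z) > 1.96}")
```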

10 True disease state vs. Test result

                      Test negative (not rejected)    Test positive (rejected)
  No disease (D = 0)  correct: specificity            Type I error (false +), α
  Disease (D = 1)     Type II error (false -), β      correct: power = 1 - β; sensitivity

11 Specific Example
The figure shows the distributions of test results for patients with the disease and patients without the disease.

12 Test Result
A threshold is placed on the test result: patients below it are called "negative", patients above it are called "positive".

13 Test Result
Patients with the disease whose result falls above the threshold are the true positives.

14 Test Result
Patients without the disease whose result falls above the threshold are the false positives.

15 Test Result
Patients without the disease whose result falls below the threshold are the true negatives.

16 Test Result
Patients with the disease whose result falls below the threshold are the false negatives.

17 Moving the Threshold: right
Moving the threshold to the right enlarges the "-" region: fewer false positives, but more false negatives.

18 Moving the Threshold: left
Moving the threshold to the left enlarges the "+" region: more true positives, but also more false positives.

19 ROC curve
The ROC curve plots the True Positive Rate (sensitivity), from 0% to 100%, against the False Positive Rate (1 - specificity), also from 0% to 100%.
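
A minimal sketch of how the curve is traced, assuming each individual has a numeric test score and a 0/1 disease label (names and data are illustrative, not from the slides):

```python
def roc_points(scores, labels):
    """Sweep the threshold over the observed scores and record (FPR, TPR) points."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]                                # the curve starts at (0%, 0%)
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))  # (false positive rate, true positive rate)
    return points                                        # the lowest threshold gives (100%, 100%)

print(roc_points(scores=[0.9, 0.8, 0.7, 0.4, 0.3], labels=[1, 1, 0, 1, 0]))
```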

20 ROC curve comparison
A good test: the ROC curve rises steeply toward the top-left corner (high true positive rate at a low false positive rate). A poor test: the curve stays close to the diagonal.

21 ROC curve extremes
Best test: the two distributions don't overlap at all, and the curve passes through the top-left corner. Worst test: the distributions overlap completely, and the curve follows the diagonal.

22 Area under ROC curve (AUC)
An overall measure of test performance. Comparisons between two tests are based on differences between their (estimated) AUCs. For continuous data, the AUC is equivalent to the Mann-Whitney U-statistic (a nonparametric test of difference in location between two populations).

23 AUC for ROC curves
Example ROC curves (True Positive Rate vs. False Positive Rate) with AUC = 100%, AUC = 90%, AUC = 65%, and AUC = 50%.

24 Interpretation of AUC
The AUC can be interpreted as the probability that the test result from a randomly chosen diseased individual is more indicative of disease than that from a randomly chosen nondiseased individual: P(X_i > X_j | D_i = 1, D_j = 0). So we can think of the AUC as a nonparametric distance between the disease and nondisease test results.
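
A minimal sketch of this interpretation (an illustrative helper, not from the slides): estimate the AUC as the fraction of diseased/nondiseased pairs in which the diseased individual has the higher score, counting ties as one half.

```python
from itertools import product

def auc_by_pairs(scores_diseased, scores_nondiseased):
    """Estimate AUC as P(X_i > X_j | D_i = 1, D_j = 0), counting ties as 1/2."""
    pairs = list(product(scores_diseased, scores_nondiseased))
    wins = sum(1.0 if xi > xj else 0.5 if xi == xj else 0.0 for xi, xj in pairs)
    return wins / len(pairs)

# Toy example: test scores for diseased vs. nondiseased individuals
print(auc_by_pairs([0.9, 0.8, 0.7, 0.6], [0.55, 0.4, 0.3, 0.2]))  # no overlap: AUC = 1.0
print(auc_by_pairs([0.9, 0.5, 0.4], [0.8, 0.6, 0.3]))             # partial overlap: AUC between 0.5 and 1
```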

25 Predictor Error Measures
Measuring predictor accuracy: measure how far off the predicted value is from the actual known value.
Loss function: measures the error between the true value y_i and the predicted value y_i'.
Absolute error: |y_i - y_i'|
Squared error: (y_i - y_i')^2

26 Predictor Error Measures
Test error (generalization error): the average loss over the test set (d test examples; y_bar denotes the mean of the actual values).
Mean absolute error: (1/d) * sum_i |y_i - y_i'|
Mean squared error: (1/d) * sum_i (y_i - y_i')^2
Relative absolute error: sum_i |y_i - y_i'| / sum_i |y_i - y_bar|
Relative squared error: sum_i (y_i - y_i')^2 / sum_i (y_i - y_bar)^2
The mean squared error exaggerates the presence of outliers. The (square) root mean squared error is popularly used instead; similarly, the root relative squared error.
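
A minimal sketch of these error measures, assuming plain Python lists of actual and predicted values (the helper name is illustrative, not from the slides):

```python
import math

def predictor_errors(y_true, y_pred):
    """Compute the predictor error measures listed on slide 26."""
    d = len(y_true)
    y_bar = sum(y_true) / d
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    sq_err = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    mae = sum(abs_err) / d                                      # mean absolute error
    mse = sum(sq_err) / d                                       # mean squared error
    rae = sum(abs_err) / sum(abs(t - y_bar) for t in y_true)    # relative absolute error
    rse = sum(sq_err) / sum((t - y_bar) ** 2 for t in y_true)   # relative squared error
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse),
            "RAE": rae, "RSE": rse, "RRSE": math.sqrt(rse)}

print(predictor_errors(y_true=[3.0, 5.0, 2.5, 7.0], y_pred=[2.8, 5.4, 2.0, 6.5]))
```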

27 Evaluating the Accuracy of a Classifier or Predictor (I)
Holdout method: the given data is randomly partitioned into two independent sets, a training set (e.g., 2/3) for model construction and a test set (e.g., 1/3) for accuracy estimation.
Random sampling: a variation of holdout. Repeat holdout k times; accuracy = average of the accuracies obtained.
Cross-validation (k-fold, where k = 10 is most popular): randomly partition the data into k mutually exclusive subsets D_1, ..., D_k, each of approximately equal size. At the i-th iteration, use D_i as the test set and the others as the training set.
Leave-one-out: k folds where k = the number of examples; for small data sets.
Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data.
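
A minimal sketch of k-fold cross-validation, assuming placeholder train_fn and eval_fn functions (these names are not from the slides):

```python
import random

def k_fold_cross_validation(examples, k, train_fn, eval_fn, seed=0):
    """Partition the data into k folds; in iteration i, fold i is the test set."""
    data = list(examples)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]               # k roughly equal-sized folds
    accuracies = []
    for i in range(k):
        test_set = folds[i]
        train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(train_set)
        accuracies.append(eval_fn(model, test_set))
    return sum(accuracies) / k                           # average accuracy over the k folds
```

Leave-one-out is the special case k = len(examples); a stratified variant would build the folds per class before merging them.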

28 Evaluating the Accuracy of a Classifier or Predictor (II)
Bootstrap: works well with small data sets. It samples the given training examples uniformly with replacement, i.e., each time an example is selected, it is equally likely to be selected again and re-added to the training set.
There are several bootstrap methods; a common one is the .632 bootstrap. Suppose we are given a data set of d examples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data examples that did not make it into the training set form the test set. About 63.2% of the original data will end up in the bootstrap sample, and the remaining 36.8% will form the test set (since (1 - 1/d)^d ≈ e^(-1) = 0.368).
Repeat the sampling procedure k times; the overall accuracy of the model is the average over the k iterations:
Acc(M) = (1/k) * sum_{i=1..k} (0.632 * Acc(M_i)_test_set + 0.368 * Acc(M_i)_train_set)
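
A minimal sketch of the .632 bootstrap, again assuming placeholder train_fn and eval_fn functions (not from the slides):

```python
import random

def bootstrap_632(examples, k, train_fn, eval_fn, seed=0):
    """Repeat bootstrap sampling k times and combine test- and training-set accuracies."""
    rng = random.Random(seed)
    data = list(examples)
    d = len(data)
    acc = 0.0
    for _ in range(k):
        idx = [rng.randrange(d) for _ in range(d)]        # sample d indices with replacement
        train_set = [data[i] for i in idx]
        chosen = set(idx)
        test_set = [data[i] for i in range(d) if i not in chosen]  # never-drawn examples (~36.8%)
        model = train_fn(train_set)
        acc += 0.632 * eval_fn(model, test_set) + 0.368 * eval_fn(model, train_set)
    return acc / k
```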

29 More About Bootstrap
The bootstrap method attempts to determine the probability distribution from the data itself, without recourse to the CLT (Central Limit Theorem).
The bootstrap method is not a way of reducing the error! It only tries to estimate it.
Basic idea of the bootstrap: originally, from some list of data, one computes an object (a statistic). Create an artificial list by randomly drawing elements from that list; some elements will be picked more than once. Compute a new object. Repeat many times and look at the distribution of these objects.
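
A minimal sketch of this basic idea, using the sample mean as the "object" (an illustrative choice; the data are made up):

```python
import random
import statistics

data = [4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 5.8, 3.9, 5.2]
rng = random.Random(0)

# Resample the list with replacement many times and recompute the statistic each time
boot_means = [statistics.mean(rng.choices(data, k=len(data))) for _ in range(1000)]

# The spread of the bootstrap distribution estimates the standard error of the sample mean
print(statistics.mean(boot_means), statistics.stdev(boot_means))
```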

30 More About Bootstrap
How many bootstraps? There is no clear answer to this: there are lots of theorems on asymptotic convergence, but no real estimates. Rule of thumb: try it 100 times, then 1000 times, and see whether your answers have changed by much. In any case there are N^N possible resamples.
Is it reliable? A very good question! The jury is still out on how far it can be applied, but for now nobody is going to shoot you down for using it. Agreement is good for normal (Gaussian) distributions; skewed distributions tend to be more problematic, particularly in the tails (the bootstrap underestimates the errors).

31 Sampling Sampling is the main technique employed for data selection. It is often used for both the preliminary investigation of the data and the final data analysis. Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming. Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.

32 Sampling …
The key principle for effective sampling is the following: using a sample will work almost as well as using the entire data set, if the sample is representative. A sample is representative if it has approximately the same property (of interest) as the original set of data.

33 Sample Size
The figure shows the same data set plotted with 8000 points, 2000 points, and 500 points.

34 Types of Sampling
Simple random sampling: there is an equal probability of selecting any particular item.
Sampling without replacement: as each item is selected, it is removed from the population.
Sampling with replacement: objects are not removed from the population as they are selected for the sample, so the same object can be picked more than once.
Stratified sampling: split the data into several partitions, then draw random samples from each partition.
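
A minimal sketch of these sampling schemes using Python's standard library (the population and the stratification label are illustrative, not from the slide):

```python
import random
from collections import defaultdict

rng = random.Random(0)
population = list(range(100))

# Simple random sampling without replacement: selected items leave the population
without_replacement = rng.sample(population, k=10)

# Sampling with replacement: the same object can be picked more than once
with_replacement = rng.choices(population, k=10)

# Stratified sampling: partition the data by a label, then sample from each partition
strata = defaultdict(list)
for item in population:
    strata["even" if item % 2 == 0 else "odd"].append(item)
stratified = [x for group in strata.values() for x in rng.sample(group, k=5)]

print(without_replacement, with_replacement, stratified, sep="\n")
```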