Lecture 22: Evaluation (April 24, 2010)

Last Time: Spectral Clustering

Today: Evaluation Measures, Accuracy, Significance Testing, F-Measure, Error Types, ROC Curves, Equal Error Rate, AIC/BIC

How do you know that you have a good classifier? Is a feature contributing to overall performance? Is classifier A better than classifier B? Internal Evaluation: measure the performance of the classifier itself. External Evaluation: measure the performance on a downstream task.

Accuracy: easily the most common and intuitive measure of classification performance; the fraction of examples the classifier labels correctly.
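In symbols (the standard formulation, not written out on the slide), with TP, TN, FP, FN the counts of true/false positives and negatives:

```latex
\[
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]
```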

Significance Testing: Say I have two classifiers: A = 50% accuracy, B = 75% accuracy. B is better, right?

Significance Testing: Say I have another two classifiers: A = 50% accuracy, B = 50.5% accuracy. Is B better?

Basic Evaluation: Training data is used to identify model parameters; testing data is used for evaluation; optionally, development / tuning data is used to identify model hyperparameters. With a single train/test split it is difficult to get significance or confidence values.
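As a minimal sketch of such a split, assuming scikit-learn is available (the toy data and the 60/20/20 proportions are illustrative, not from the lecture):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for real features and labels.
X = np.random.randn(100, 5)
y = np.random.randint(0, 2, size=100)

# Hold out 20% as the test set, then split the rest 75/25 into train/dev.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
# Result: 60% train (parameters), 20% dev (hyperparameters), 20% test (final evaluation).
```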

Cross Validation: Identify n “folds” of the available data. Train on n-1 folds; test on the remaining fold. In the extreme (n = N, one fold per data point) this is known as “leave-one-out” cross validation. n-fold cross validation (xval) gives n samples of the performance of the classifier.
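A minimal sketch of n-fold cross validation, assuming scikit-learn and using logistic regression as a stand-in for whatever classifier is being evaluated:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

# Toy data; in practice X and y come from the task at hand.
X = np.random.randn(200, 5)
y = np.random.randint(0, 2, size=200)

fold_accuracies = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    fold_accuracies.append(clf.score(X[test_idx], y[test_idx]))

# Ten samples of classifier accuracy, one per fold.
print(np.mean(fold_accuracies), np.std(fold_accuracies, ddof=1))
```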

Significance Testing: Is the performance of two classifiers different with statistical significance? Means testing: if we have two samples of classifier performance (accuracy), we want to determine if they are drawn from the same distribution (no difference) or from two different distributions.

T-test: One-sample t-test; independent two-sample t-test (which can handle unequal variances and sample sizes, e.g. Welch's t-test). Once you have a t-value, look up the significance level in a table keyed on the t-value and the degrees of freedom.
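The statistics behind the table lookup (standard formulas, implied but not reproduced on the slide): for a one-sample test against a known mean and for Welch's two-sample test with unequal variances,

```latex
\[
t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}},
\qquad
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}
\]
```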

Significance Testing: Run cross-validation to get n samples of classifier performance. Use this distribution to compare against either a known (published) level of performance (one-sample t-test) or another distribution of performance (two-sample t-test). If at all possible, results should include information about the variance of classifier performance.
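A sketch of both comparisons using SciPy (the per-fold accuracies below are made-up illustrative values):

```python
import numpy as np
from scipy import stats

# Per-fold accuracies from 10-fold cross validation (illustrative values only).
acc_a = np.array([0.72, 0.75, 0.70, 0.74, 0.73, 0.71, 0.76, 0.72, 0.74, 0.73])
acc_b = np.array([0.78, 0.74, 0.77, 0.79, 0.76, 0.75, 0.80, 0.77, 0.78, 0.76])

# One-sample t-test: does classifier A differ from a published accuracy of 0.70?
t1, p1 = stats.ttest_1samp(acc_a, popmean=0.70)

# Two-sample t-test (Welch's, no equal-variance assumption): does B differ from A?
t2, p2 = stats.ttest_ind(acc_a, acc_b, equal_var=False)

print(f"one-sample: t={t1:.2f}, p={p1:.3f}   two-sample: t={t2:.2f}, p={p2:.3f}")
```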

Significance Testing Caveat: including more samples of classifier performance can artificially inflate the significance measure. If the sample mean x̄ and standard deviation s stay constant (the sample already reflects the population mean and variance), then raising n will increase t. If these samples are genuinely independent, this is fine. But cross-validation fold assignment is often not truly random, so repeated xval runs only resample the same information.

Confidence Bars: Variance information can be included in plots of classifier performance to ease visualization. Should you plot the standard deviation, the standard error, or a confidence interval?

Confidence Bars: It is most important to be clear about what is plotted. The 95% confidence interval has the clearest interpretation.
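For n cross-validation samples with mean x̄ and sample standard deviation s, the usual 95% confidence interval (a standard result, not spelled out on the slide) is

```latex
\[
\bar{x} \pm t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}}
\]
```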

Baseline Classifiers: Majority class baseline: every data point is classified as the class that is most frequently represented in the training data. Random baseline: randomly assign one of the classes to each data point, either with an even distribution or with the training class distribution.
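A sketch of both baselines; scikit-learn's DummyClassifier is one convenient implementation (the lecture does not prescribe a particular library, and the toy data is illustrative):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Imbalanced toy labels to make the baselines interesting.
X = np.random.randn(100, 5)
y = np.random.choice([0, 1], size=100, p=[0.8, 0.2])

majority   = DummyClassifier(strategy="most_frequent").fit(X, y)               # majority-class baseline
uniform    = DummyClassifier(strategy="uniform", random_state=0).fit(X, y)     # random, even distribution
stratified = DummyClassifier(strategy="stratified", random_state=0).fit(X, y)  # random, training class distribution

for name, clf in [("majority", majority), ("uniform", uniform), ("stratified", stratified)]:
    print(name, clf.score(X, y))
```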

Problems with Accuracy: the contingency table.

                              True Values
                              Positive           Negative
  Hyp Values    Positive      True Positive      False Positive
                Negative      False Negative     True Negative

Problems with Accuracy: an information retrieval example. Find the 10 documents related to a query in a set of 110 documents. [Contingency table on slide; counts shown: 10, 100.]

Problems with Accuracy: Precision: how many of the hypothesized events were true events. Recall: how many of the true events were identified. F-Measure: the harmonic mean of precision and recall. [Contingency table on slide; counts shown: 10, 100.]
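In terms of the contingency table counts (standard definitions, matching the slide's descriptions):

```latex
\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R}
\]
```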

F-Measure: The F-measure can be weighted to favor precision or recall: beta > 1 favors recall, beta < 1 favors precision. [Contingency table on slide; counts shown: 10, 100.]
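The weighted form (the standard F-beta definition, which the slide describes but does not write out):

```latex
\[
F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R}
\]
```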

F-Measure worked example. [Contingency table on slide; counts shown: 1, 9, 100.]

F-Measure worked example. [Contingency table on slide; counts shown: 10, 50.]

F-Measure worked example. [Contingency table on slide; counts shown: 9, 1, 99.]

F-Measure: Accuracy is weighted towards majority-class performance; the F-measure is useful for measuring performance on minority classes.

Types of Errors: False Positives: the system predicted TRUE but the value was FALSE, aka “false alarms” or Type I errors. False Negatives: the system predicted FALSE but the value was TRUE, aka “misses” or Type II errors.

ROC Curves: It is common to plot classifier performance at a variety of settings or thresholds. Receiver Operating Characteristic (ROC) curves plot the true positive rate against the false positive rate. Overall performance is summarized by the Area Under the Curve (AUC).
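A minimal sketch using scikit-learn (the labels and scores are toy values; the lecture does not mandate a particular toolkit):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy ground truth and classifier scores (illustrative values only).
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9, 0.6, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_score)               # area under the ROC curve
print(auc)
```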

ROC Curves: The Equal Error Rate (EER) is commonly reported: the error rate at the operating point where the false positive rate equals the false negative rate. Curves provide more detail about performance than any single summary number. [Example curves on slide from Gauvain et al. 1995.]
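One common way to estimate the EER from a discrete ROC curve (a sketch reusing the toy data above; real systems typically interpolate between operating points):

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9, 0.6, 0.3])

fpr, tpr, _ = roc_curve(y_true, y_score)
fnr = 1.0 - tpr
# EER is where the false positive rate equals the false negative rate;
# with a finite set of thresholds, take the point where |FPR - FNR| is smallest.
idx = np.argmin(np.abs(fpr - fnr))
eer = (fpr[idx] + fnr[idx]) / 2.0
print(eer)
```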

Goodness of Fit: Another view of model performance: measure the likelihood the model assigns to unseen data. However, we’ve seen that model likelihood tends to improve simply by adding parameters, so two information criteria add a cost term for the number of parameters in the model.

Akaike Information Criterion: The Akaike Information Criterion (AIC) is based on entropy: it weighs the information captured by the parameters against the information lost by the modeling. The best model has the lowest AIC: the greatest model likelihood with the fewest free parameters.
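With L-hat the maximized model likelihood and k the number of free parameters (the standard definition, not written out on the slide):

```latex
\[
\mathrm{AIC} = 2k - 2\ln\hat{L}
\]
```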

Bayesian Information Criterion: Another penalization term, based on Bayesian arguments: select the model that is a posteriori most probable, with a constant penalty term for wrong models. A simplified form holds if errors are normally distributed. Note that BIC only compares models estimated on the same data x.
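With n data points, k parameters, and L-hat the maximized likelihood (standard definitions); for normally distributed errors the likelihood term reduces, up to an additive constant, to a function of the residual sum of squares RSS:

```latex
\[
\mathrm{BIC} = k\ln n - 2\ln\hat{L},
\qquad
\mathrm{BIC} \approx n\ln\!\left(\frac{\mathrm{RSS}}{n}\right) + k\ln n
\]
```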

Today: Accuracy, Significance Testing, F-Measure, AIC/BIC

Next Time: Regression Evaluation, Cluster Evaluation