1 Model Evaluation
Instructor: Qiang Yang, Hong Kong University of Science and Technology
Thanks: Eibe Frank and Jiawei Han
Lecture 1
2 Constructing a Classifier Model
- The goal is to maximize accuracy on new cases that have a similar class distribution.
- Since new cases are not available at construction time, the given examples are divided into a training set and a testing set. The classifier is built from the training set and evaluated on the testing set.
- The goal is to be accurate on the testing set, so it is essential to capture the "structure" shared by both sets.
- We must prune overfitting rules that work well on the training set but poorly on the testing set.
3 Example Classification Algorithms
- Training Data -> Classifier (Model):
  IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
4 Example (Cont'd)
- Testing Data -> Classifier -> Tenured?
- Unseen Data: (Jeff, Professor, 4) -> Tenured?
5 Evaluation Criteria
- Accuracy on test set: the rate of correct classification on the testing set. E.g., if 90 out of 100 testing cases are classified correctly, accuracy is 90%.
- Error rate on test set: the percentage of wrong predictions on the test set.
- Confusion matrix: for binary class values "yes" and "no", a matrix showing the true positive, true negative, false positive and false negative rates:

                   Predicted: Yes    Predicted: No
  Actual: Yes      true positive     false negative
  Actual: No       false positive    true negative

- Speed and scalability: the time to build the classifier and to classify new cases, and the scalability with respect to data size.
- Robustness: handling noise and missing values.
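A minimal sketch of these criteria in code, assuming only two parallel lists of actual and predicted "yes"/"no" labels (the function name and label strings are illustrative, not from the slides):

```python
# Accuracy, error rate, and a binary confusion matrix from
# parallel lists of actual and predicted labels ("yes"/"no").
def evaluate(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == "yes" and p == "yes")
    fn = sum(1 for a, p in zip(actual, predicted) if a == "yes" and p == "no")
    fp = sum(1 for a, p in zip(actual, predicted) if a == "no" and p == "yes")
    tn = sum(1 for a, p in zip(actual, predicted) if a == "no" and p == "no")
    n = len(actual)
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn,
            "accuracy": (tp + tn) / n,     # e.g. 90 correct out of 100 -> 0.90
            "error_rate": (fp + fn) / n}   # percentage of wrong predictions
```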
6 Evaluation Techniques
- Holdout: split the data into a training set and a testing set. Good for a large data set.
- k-fold cross-validation: divide the data set into k sub-samples. In each run, use one distinct sub-sample as the testing set and the remaining k-1 sub-samples as the training set. Evaluate the method using the average of the k runs.
- This method reduces the randomness of the training set / testing set split.
7 Cross Validation: Holdout Method
- Break the data up into groups of the same size.
- Hold aside one group for testing and use the rest to build the model.
- Repeat, holding out each group in turn.
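A sketch of k-fold cross-validation as described above. `build_classifier` and `error_rate` are hypothetical placeholders for whatever learner and metric are being evaluated:

```python
import random

# k-fold cross-validation: each fold is held out once for testing while the
# remaining k-1 folds form the training set; the k error rates are averaged.
def k_fold_cv(data, k, build_classifier, error_rate):
    data = list(data)
    random.shuffle(data)                       # randomize before splitting
    folds = [data[i::k] for i in range(k)]     # k roughly equal-sized sub-samples
    errors = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = build_classifier(train)
        errors.append(error_rate(model, test))
    return sum(errors) / k                     # average of the k runs
```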
8 Cross validation
- Natural performance measure for classification problems: the error rate.
- Success: the instance's class is predicted correctly.
- Error: the instance's class is predicted incorrectly.
- Error rate: proportion of errors made over the whole set of instances.
- Confidence: 2% error estimated from 100 tests vs. 2% error estimated from a much larger test set. Which one do you trust more?
9 Confidence intervals
- Assume the estimated error rate f is 25%. How close is this to the true error rate p?
- It depends on the amount of test data. We can say: p lies within a certain specified interval with a certain specified confidence.
- Example: S = 750 successes in N = 1000 trials. Estimated success rate: f = 75%. How close is this to the true success rate p? Answer: with 80% confidence, p is in [73.2%, 76.7%].
- Another example: S = 75 and N = 100. Estimated success rate: 75%. With 80% confidence, p is in [69.1%, 80.1%].
10 Mean and variance
- The estimated success rate f = S/N is the average of N Bernoulli trials, so it has mean p and variance p(1-p)/N.
- For large enough N, f follows a normal distribution.
- The c% confidence interval [-z <= X <= z] for a random variable X with 0 mean is given by: Pr[-z <= X <= z] = c.
11 Confidence limits
- Confidence limits for the normal distribution with 0 mean and a variance of 1:

  Pr[X >= z]    z
  0.1%          3.09
  0.5%          2.58
  1%            2.33
  5%            1.65
  10%           1.28
  20%           0.84
  40%           0.25

- Thus, for example: Pr[-1.65 <= X <= 1.65] = 90%.
- To use this, we have to reduce our random variable f to have 0 mean and unit variance.
12 Transforming f
- Transformed value for f: (f - p) / sqrt(p(1-p)/N)   (i.e. subtract the mean and divide by the standard deviation).
- Resulting equation: Pr[ -z <= (f - p) / sqrt(p(1-p)/N) <= z ] = c.
- Solving for p:

  p = ( f + z^2/(2N) ± z * sqrt( f/N - f^2/N + z^2/(4N^2) ) ) / ( 1 + z^2/N )
13 Examples
- f = 75%, N = 1000, c = 80% (so that z = 1.28): p is in [0.732, 0.767].
- f = 75%, N = 10, c = 80% (so that z = 1.28): p is in [0.549, 0.881].
- Note that the normal distribution assumption is only valid for large N (i.e. N > 100).
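A small sketch that implements the interval formula from slide 12 and reproduces the first example (the function name is illustrative):

```python
import math

# Confidence interval for the true success rate p, from the normal
# approximation solved for p: centre f + z^2/(2N), spread z*sqrt(...),
# both divided by 1 + z^2/N.
def confidence_interval(f, n, z):
    centre = f + z * z / (2 * n)
    spread = z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

# f = 75% success in N = 1000 trials, 80% confidence (z = 1.28):
lo, hi = confidence_interval(0.75, 1000, 1.28)
print(round(lo, 3), round(hi, 3))   # roughly 0.732, 0.767
```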
14 More on cross-validation
- Standard method for evaluation: stratified ten-fold cross-validation.
- If M is the model-building method, D is the training data and N is the number of folds, then each run builds a model from D minus one fold, i.e. from (N-1)/N of the data. Does the averaged error of these models estimate the error of M(D), the model built from all of D?
- Why set N = 10? Extensive experiments have shown that this is the best choice for getting an accurate estimate, and there is also some theoretical evidence for it.
- Stratification (making each fold close to the real class distribution) reduces the estimate's variance.
15 Leave-one-out cross-validation
- Leave-one-out cross-validation is a particular form of cross-validation: the number of folds is set to the number of training instances.
- I.e., a classifier has to be built n times, where n is the number of training instances.
- Makes maximum use of the data.
- No random sampling involved.
- Very computationally expensive.
16 LOO-CV and stratification
- Another disadvantage of LOO-CV: stratification is not possible. It guarantees a non-stratified sample because there is only one instance in the test set!
- Extreme example: a completely random dataset with two classes in equal proportions.
- The best a classifier can do is predict the majority class, which yields 50% accuracy on fresh data from this domain.
- But the LOO-CV error estimate for this classifier will be 100%: leaving one instance out always makes the other class the majority in the training data, so the prediction is always wrong.
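A tiny simulation of this extreme example, assuming a balanced two-class label list and a majority-class predictor (all names are illustrative):

```python
from collections import Counter

labels = ["yes"] * 50 + ["no"] * 50          # completely balanced random classes

errors = 0
for i in range(len(labels)):
    train = labels[:i] + labels[i + 1:]       # leave instance i out
    majority = Counter(train).most_common(1)[0][0]
    if majority != labels[i]:                 # removing labels[i] makes the other
        errors += 1                           # class the majority, so this always fails
print(errors / len(labels))                   # 1.0, i.e. a 100% estimated error rate
```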
17 The bootstrap
- CV uses sampling without replacement: the same instance, once selected, cannot be selected again for a particular training/test set.
- The bootstrap is an estimation method that uses sampling with replacement to form the training set.
- A dataset of n instances is sampled n times with replacement to form a new dataset of n instances. This data is used as the training set.
- The instances from the original dataset that don't occur in the new training set are used for testing.
18 The 0.632 bootstrap
- This method is also called the 0.632 bootstrap.
- A particular instance has a probability of 1 - 1/n of not being picked in a single draw.
- Thus its probability of ending up in the test data (never selected in any of the n draws) is: (1 - 1/n)^n ≈ e^(-1) ≈ 0.368.
- This means the training data will contain approximately 63.2% of the distinct instances.
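A sketch of one bootstrap split, assuming the data is just a list of instances; the ~0.632 / ~0.368 proportions can be checked empirically:

```python
import random

# One bootstrap sample: draw n instances with replacement for training;
# instances never drawn form the out-of-bag test set.
def bootstrap_split(data):
    n = len(data)
    idx = [random.randrange(n) for _ in range(n)]    # sample indices with replacement
    train = [data[i] for i in idx]
    picked = set(idx)                                # distinct instances that were drawn
    test = [data[i] for i in range(n) if i not in picked]
    return train, test

data = list(range(10000))
train, test = bootstrap_split(data)
print(len(set(train)) / len(data))   # about 0.632 of the distinct instances
print(len(test) / len(data))         # about 0.368 end up in the test set
```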
19 Comparing data mining methods
- Frequent situation: we want to know which of two learning schemes performs better.
- Obvious way: compare the 10-fold CV estimates.
- Problem: variance in the estimate. Variance can be reduced using repeated CV, but we still don't know whether the results are reliable.
- Solution: include a confidence interval.
20 Taking costs into account
- The confusion matrix:

                   Predicted: Yes    Predicted: No
  Actual: Yes      true positive     false negative
  Actual: No       false positive    true negative

- There are many other types of costs! E.g.: the cost of collecting training data and test data.
21 Lift charts
- In practice, ranking may be important: decisions are usually made by comparing possible scenarios.
- Sort instances by the likelihood of x being in the +ve class, from high to low.
- Question: how do we know if one ranking is better than another?
22 Example
- Example: promotional mailout.
- Situation 1: classifier A predicts that 0.1% of all one million households will respond.
- Situation 2: classifier B predicts that 0.4% of the 10,000 most promising households will respond.
- Which one is better? Suppose mailing out a package costs 1 dollar, but each response brings in 1000 dollars.
- A lift chart allows for a visual comparison.
23 Generating a lift chart
- Instances are sorted according to their predicted probability of being a true positive:

  Rank    Predicted probability    Actual class
  1       0.95                     Yes
  2       0.93                     ...
  3       ...                      No
  4       0.88                     ...
  ...

- In a lift chart, the x axis is the sample size and the y axis is the number of true positives.
24 Steps in Building a Lift Chart
1. First, produce a ranking of the data using your learned model (classifier, etc.): rank 1 means most likely in the + class, rank n means least likely in the + class.
2. Label each ranked data instance with its ground-truth label. This gives a list like: Rank 1, +; Rank 2, -; Rank 3, +; etc.
3. Count the number of true positives (TP) from rank 1 onwards: Rank 1, +, TP = 1; Rank 2, -, TP = 1; Rank 3, +, TP = 2; ...
4. Plot the number of TPs against the % of data in ranked order (if you have 10 data instances, then each instance is 10% of the data): 10%, TP = 1; 20%, TP = 1; 30%, TP = 2; ...
This gives a lift chart; a small code sketch of these steps follows below.
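A sketch of the four steps above, assuming the scored data is a list of (predicted probability, actual label) pairs with "+" marking the positive class:

```python
# Rank by predicted probability (step 1), walk down the ranking with the
# ground-truth labels (step 2), count true positives cumulatively (step 3),
# and pair each count with the percentage of data used so far (step 4).
def lift_chart_points(scored):                 # scored: list of (probability, label)
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    points, tp = [], 0
    for i, (_, label) in enumerate(ranked, start=1):
        if label == "+":
            tp += 1
        points.append((100.0 * i / len(ranked), tp))
    return points

# Each point is (% of data in ranked order, cumulative number of TPs):
print(lift_chart_points([(0.95, "+"), (0.9, "-"), (0.8, "+"), (0.6, "-"), (0.3, "+")]))
# [(20.0, 1), (40.0, 1), (60.0, 2), (80.0, 2), (100.0, 3)]
```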
26 ROC curves
- ROC curves are similar to lift charts. "ROC" stands for "receiver operating characteristic".
- Used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel.
- Differences from a lift chart: the y axis shows the percentage of true positives in the sample (rather than the absolute number); the x axis shows the percentage of false positives in the sample (rather than the sample size).
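A companion sketch to the lift-chart code: the same ranked list, but with both axes as percentages, assuming both classes occur at least once in the data:

```python
# ROC points: true-positive rate (y, % of all positives found) against
# false-positive rate (x, % of all negatives wrongly included), obtained
# by lowering the threshold one ranked instance at a time.
def roc_points(scored):                        # scored: list of (probability, label)
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    pos = sum(1 for _, label in ranked if label == "+")
    neg = len(ranked) - pos                    # assumes pos > 0 and neg > 0
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, label in ranked:
        if label == "+":
            tp += 1
        else:
            fp += 1
        points.append((100.0 * fp / neg, 100.0 * tp / pos))
    return points
```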
28 Cost-sensitive learning
- Most learning schemes do not perform cost-sensitive learning: they generate the same classifier no matter what costs are assigned to the different classes. Example: the standard decision tree learner.
- Simple methods for cost-sensitive learning: resampling of instances according to costs; weighting of instances according to costs.
- Some schemes are inherently cost-sensitive, e.g. naïve Bayes.
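A rough sketch of the resampling idea only (the cost mapping, labels, and function are illustrative assumptions, not the slide's method): duplicate instances of the class that is expensive to misclassify, then hand the result to any cost-blind learner.

```python
import random

# Cost-sensitive learning by resampling: each instance is copied roughly in
# proportion to the cost of misclassifying its class.
def resample_by_cost(data, cost):              # cost: dict mapping class label -> cost
    resampled = []
    for x, label in data:
        copies = max(1, round(cost[label]))    # crude: one copy per unit of cost
        resampled.extend([(x, label)] * copies)
    random.shuffle(resampled)
    return resampled

# Example: misclassifying a "yes" is five times as costly as misclassifying a "no":
# balanced = resample_by_cost(training_data, {"yes": 5, "no": 1})
```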
29 Measures in information retrieval
- Percentage of retrieved documents that are relevant: precision = TP / (TP + FP).
- Percentage of relevant documents that are returned: recall = TP / (TP + FN).
- Precision/recall curves have a hyperbolic shape.
- Summary measures: average precision at 20%, 50% and 80% recall (three-point average recall).
- F-measure = (2 × recall × precision) / (recall + precision).
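A minimal sketch of these three measures, assuming the raw counts are already available:

```python
# Precision, recall, and F-measure from counts of true positives,
# false positives, and false negatives.
def ir_measures(tp, fp, fn):
    precision = tp / (tp + fp)       # fraction of retrieved documents that are relevant
    recall = tp / (tp + fn)          # fraction of relevant documents that are retrieved
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure

print(ir_measures(tp=40, fp=10, fn=20))   # (0.8, 0.666..., 0.727...)
```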
31 Evaluating numeric prediction
- Same strategies: independent test set, cross-validation, significance tests, etc.
- Difference: the error measures.
- Actual target values: a1, a2, ..., an. Predicted target values: p1, p2, ..., pn.
- Most popular measure: the mean-squared error, ((p1 - a1)^2 + ... + (pn - an)^2) / n. It is easy to manipulate mathematically.
32 Other measures
- The root mean-squared error: sqrt( ((p1 - a1)^2 + ... + (pn - an)^2) / n ).
- The mean absolute error, (|p1 - a1| + ... + |pn - an|) / n, is less sensitive to outliers than the mean-squared error.
- Sometimes relative error values are more appropriate (e.g. 10% for an error of 50 when predicting 500).
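A short sketch of these error measures, assuming parallel lists of predicted values p and actual values a:

```python
import math

# Error measures for numeric prediction.
def mse(p, a):                              # mean-squared error
    return sum((pi - ai) ** 2 for pi, ai in zip(p, a)) / len(a)

def rmse(p, a):                             # root mean-squared error
    return math.sqrt(mse(p, a))

def mae(p, a):                              # mean absolute error, less sensitive to outliers
    return sum(abs(pi - ai) for pi, ai in zip(p, a)) / len(a)

# Relative error example from the slide: an error of 50 when predicting 500.
print(abs(550 - 500) / 500)                 # 0.1, i.e. 10%
```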
33 The MDL principle
- MDL stands for minimum description length.
- The description length is defined as: the space required to describe a theory, plus the space required to describe the theory's mistakes.
- In our case the theory is the classifier and the mistakes are the errors on the training data.
- Aim: we want a classifier with minimal description length.
- The MDL principle is a model selection criterion.
34 Model selection criteria
- Model selection criteria attempt to find a good compromise between the complexity of a model and its prediction accuracy on the training data.
- Reasoning: a good model is a simple model that achieves high accuracy on the given data.
- Also known as Occam's Razor: the best theory is the smallest one that describes all the facts.
35 Elegance vs. errors
- Theory 1: a very simple, elegant theory that explains the data almost perfectly.
- Theory 2: a significantly more complex theory that reproduces the data without mistakes.
- Theory 1 is probably preferable.
- Classical example: Kepler's three laws of planetary motion were less accurate on the data than Copernicus's latest refinement of the Ptolemaic theory of epicycles.
36 MDL and compression
- The MDL principle is closely related to data compression: it postulates that the best theory is the one that compresses the data the most.
- I.e., to compress a dataset we generate a model and then store the model and its mistakes.
- We need to compute (a) the size of the model and (b) the space needed for encoding the errors.
- (b) is easy: we can use the informational loss function.
- For (a) we need a method to encode the model.
37 Discussion of the MDL principle
- Advantage: it makes full use of the training data when selecting a model.
- Disadvantage 1: an appropriate coding scheme / prior probabilities for theories are crucial.
- Disadvantage 2: there is no guarantee that the MDL theory is the one which minimizes the expected error.
- Note: Occam's Razor is an axiom! Occam's razor is the principle that you should not make more assumptions than are necessary when explaining something.