1 Evaluating Predictive Models
Niels Peek, Department of Medical Informatics, Academic Medical Center, University of Amsterdam

2 Outline
1. Model evaluation basics
2. Performance measures
3. Evaluation tasks
   - Model selection
   - Performance assessment
   - Model comparison
4. Summary

3 Basic evaluation procedure
1. Choose a performance measure
2. Choose an evaluation design
3. Build the model
4. Estimate performance
5. Quantify uncertainty

4 Basic evaluation procedure
1. Choose a performance measure (e.g. error rate)
2. Choose an evaluation design (e.g. split sample)
3. Build the model (e.g. decision tree)
4. Estimate performance (e.g. compute the test sample error rate)
5. Quantify uncertainty (e.g. estimate a confidence interval)

5 Notation and terminology
x ∈ R^m            feature vector (stat: covariate pattern)
y ∈ {0,1}          class (med: outcome)
p(x)               density (probability mass) of x
P(Y=1 | x)         class-conditional probability
h : R^m → {0,1}    classifier (stat: discriminant model; Mitchell: hypothesis)
f : R^m → [0,1]    probabilistic classifier (stat: binary regression model)
f(Y=1 | x)         estimated class-conditional probability

6 Error rate
The error rate (misclassification rate, inaccuracy) of a given classifier h is the probability that h will misclassify an arbitrary instance x:

    error(h) = Pr_{x ~ p} [ h(x) ≠ y ]

The event inside the brackets is that a given x is misclassified by h.

7 Error rate
The error rate (misclassification rate, inaccuracy) of a given classifier h is the probability that h will misclassify an arbitrary instance x:

    error(h) = Pr_{x ~ p} [ h(x) ≠ y ]

The probability is taken over instances x randomly drawn from R^m according to p.

8 Sample error rate
Let S = { (x_i, y_i) | i = 1, ..., n } be a sample of independent and identically distributed (i.i.d.) instances, randomly drawn from R^m. The sample error rate of classifier h in sample S is the proportion of instances in S misclassified by h:

    error_S(h) = (1/n) Σ_{i=1..n} 1[ h(x_i) ≠ y_i ]
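A minimal sketch of this definition in Python, assuming hypothetical 0/1 arrays `y_true` (outcomes) and `y_pred` (the classifier's predictions h(x_i)):

```python
import numpy as np

def sample_error_rate(y_true, y_pred):
    """Proportion of instances in the sample misclassified by h."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.mean(y_true != y_pred)

# 1 of 6 instances misclassified -> error_S(h) = 0.1667
print(sample_error_rate([0, 1, 1, 1, 0, 0], [0, 0, 1, 1, 0, 0]))
```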

9 The estimation problem
How well does error_S(h) estimate error(h)? To answer this question, we must look at some basic concepts of statistical estimation theory. Generally speaking, a statistic is a particular calculation made from a data sample. It describes a certain aspect of the distribution of the data in the sample.

10 Understanding randomness

11 What can go wrong?

12 Sources of bias
- Dependence: using data for both training/optimization and testing purposes
- Population drift: the underlying densities have changed (e.g. ageing)
- Concept drift: the class-conditional distributions have changed (e.g. reduced mortality due to better treatments)

13 Sources of variation
- Sampling of test data: a "bad day" (more probable with small samples)
- Sampling of training data: instability of the learning method, e.g. trees
- Internal randomness of the learning algorithm: stochastic optimization, e.g. neural networks
- Class inseparability: 0 « P(Y=1|x) « 1 for many x ∈ R^m

14 Solutions
Bias
- is usually avoided through proper sampling, i.e. by taking an independent sample
- can sometimes be estimated and then used to correct a biased error_S(h)
Variance
- can be reduced by increasing the sample size (if we have enough data...)
- is usually estimated and then used to quantify the uncertainty of error_S(h)

15 Uncertainty = spread We investigate the spread of a distribution by looking at the average distance to the (estimated) mean.

16 Quantifying uncertainty (1)
Let e_1, ..., e_n be a sequence of observations, with average ē = (1/n) Σ_i e_i.

The variance of e_1, ..., e_n is defined as

    Var(e) = (1/n) Σ_i (e_i − ē)²

When e_1, ..., e_n are binary, then

    Var(e) = ē (1 − ē)

17 Quantifying uncertainty (2)
The standard deviation of e_1, ..., e_n is defined as

    s = sqrt( Var(e) )

When the distribution of e_1, ..., e_n is approximately Normal, a 95% confidence interval for the mean is obtained by

    ē ± 1.96 · s / √n

Under the same assumption, we can also compute the probability (p-value) that the true mean equals a particular value (e.g., 0).

18 Example
We split our dataset into a training sample (n_train = 80) and a test sample (n_test = 40). The classifier h is induced from the training sample, and evaluated on the independent test sample. The estimated error rate is then unbiased.

19 Example (cont'd)
Suppose that h misclassifies 12 of the 40 examples in the test sample, so error_S(h) = 12/40 = 0.30. Now, with approximately 95% probability, error(h) lies in the interval

    0.30 ± 1.96 · sqrt( 0.30 · (1 − 0.30) / 40 )

In this case, the interval ranges from .16 to .44.
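A minimal sketch of this interval, assuming the Normal approximation used above; the function name `error_rate_ci` is illustrative:

```python
import numpy as np

def error_rate_ci(errors, n, z=1.96):
    """Approximate 95% confidence interval for the true error rate."""
    e = errors / n                      # sample error rate
    se = np.sqrt(e * (1 - e) / n)       # estimated standard error
    return e - z * se, e + z * se

print(error_rate_ci(12, 40))            # approximately (0.16, 0.44)
```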

20 Basic evaluation procedure
1. Choose a performance measure (e.g. error rate)
2. Choose an evaluation design (e.g. split sample)
3. Build the model (e.g. decision tree)
4. Estimate performance (e.g. compute the test sample error rate)
5. Quantify uncertainty (e.g. estimate a confidence interval)

21 Outline
1. Model evaluation basics
2. Performance measures
3. Evaluation tasks
   - Model selection
   - Performance assessment
   - Model comparison
4. Summary

22 Confusion matrix
A common way to refine the notion of prediction error is to construct a confusion matrix (rows: prediction, columns: outcome):

              Y=1               Y=0
  h(x)=1      true positives    false positives
  h(x)=0      false negatives   true negatives

23 Example
A test sample of six instances, with outcomes Y and predictions h(x), yields the confusion matrix

              Y=1   Y=0
  h(x)=1       1     0
  h(x)=0       2     3

24 Sensitivity
"hit rate": correctness among positive instances

  TP / (TP + FN) = 1 / (1 + 2) = 1/3

Terminology:
- sensitivity (medical diagnostics)
- recall (information retrieval)

25 Specificity
correctness among negative instances

  TN / (TN + FP) = 3 / (3 + 0) = 1

Terminology:
- specificity (medical diagnostics)
(Note: precision in information retrieval is TP / (TP + FP), a different measure.)
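A minimal sketch reproducing the counts of the example slide (TP=1, FP=0, FN=2, TN=3); the particular ordering of the six instances is illustrative:

```python
import numpy as np

y_true = np.array([0, 1, 1, 1, 0, 0])   # outcomes Y
y_pred = np.array([0, 0, 1, 0, 0, 0])   # predictions h(x)

tp = np.sum((y_pred == 1) & (y_true == 1))   # 1
fp = np.sum((y_pred == 1) & (y_true == 0))   # 0
fn = np.sum((y_pred == 0) & (y_true == 1))   # 2
tn = np.sum((y_pred == 0) & (y_true == 0))   # 3

sensitivity = tp / (tp + fn)   # 1/3
specificity = tn / (tn + fp)   # 1.0
print(sensitivity, specificity)
```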

26 ROC analysis
When a model yields probabilistic predictions, e.g. f(Y=1|x) = 0.55, we can evaluate its performance for different classification thresholds τ ∈ [0,1]. This corresponds to assigning different (relative) weights to the two types of classification error. The ROC curve is a plot of sensitivity versus 1 − specificity for all 0 ≤ τ ≤ 1.
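A minimal sketch of how the curve is traced: sweep the threshold τ over the predicted probabilities and record (1 − specificity, sensitivity) at each value. The arrays `y_true` and `probs` are hypothetical example data.

```python
import numpy as np

def roc_points(y_true, probs):
    """Return the (1 - specificity, sensitivity) points of the ROC curve."""
    y_true = np.asarray(y_true)
    probs = np.asarray(probs)
    points = []
    for tau in np.unique(np.concatenate(([0.0, 1.0], probs))):
        y_pred = (probs >= tau).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        points.append((1 - spec, sens))
    return sorted(points)

print(roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```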

27 ROC curve
[Figure: ROC curve, sensitivity plotted against 1 − specificity. Each point corresponds to a threshold value τ; the curve runs from (0,0) at τ=1 to (1,1) at τ=0, and the point (0,1) corresponds to a perfect model.]

28 Area under ROC curve (AUC)
The area under the ROC curve is a good measure of discrimination.

29 Area under ROC curve (AUC)
When AUC = 0.5, the model does not predict better than chance.

30 Area under ROC curve (AUC)
When AUC = 1.0, the model discriminates perfectly between Y=0 and Y=1.

31 Discrimination vs. accuracy
- The AUC value only depends on the ordering of instances by the model
- The AUC value is insensitive to order-preserving transformations of the predictions f(Y=1|x), e.g. f'(Y=1|x) = f(Y=1|x) · 10^-4711
- In addition to discrimination, we must therefore investigate the accuracy of probabilistic predictions
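A minimal sketch of the invariance claim above: rescaling the predictions by a positive constant does not change the AUC, because only the ordering matters. The example data are hypothetical; scikit-learn's roc_auc_score is used for convenience.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y = np.array([0, 0, 1, 1, 0, 1])
f = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90])

print(roc_auc_score(y, f))            # AUC of the original predictions
print(roc_auc_score(y, f * 1e-4))     # identical AUC after rescaling
```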

32 Probabilistic accuracy

   x     Y    P(Y=1|x)   f(Y=1|x)
   10    0    0.10       0.15
   17    0    0.25       0.20
   32    1    0.30       0.25
   ...   ...  ...        ...
   100   1    0.90       0.75

33 Quantifying probabilistic error
Let (x_i, y_i) be an observation, and let f(Y | x_i) be the estimated class-conditional distribution.

Option 1: ε_i = | y_i − f(Y=1|x_i) |
  Not good: does not lead to the correct mean
Option 2: ε_i = (y_i − f(Y=1|x_i))²  (variance-based)
  Correct, but mild on severe errors
Option 3: ε_i = −ln f(Y=y_i|x_i)  (entropy-based)
  Better from a probabilistic viewpoint
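A minimal sketch of options 2 and 3: the squared (Brier-type) error and the negative log-likelihood per instance. The arrays `y` and `p` are hypothetical example data.

```python
import numpy as np

y = np.array([0, 1, 1, 0])              # observed outcomes
p = np.array([0.2, 0.7, 0.9, 0.4])      # predicted f(Y=1|x)

squared_error = (y - p) ** 2                           # option 2, per instance
log_loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))  # option 3, per instance

print(squared_error.mean())   # mean squared error (Brier score)
print(log_loss.mean())        # mean negative log-likelihood
```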

34 Outline
1. Model evaluation basics
2. Performance measures
3. Evaluation tasks
   - Model selection
   - Performance assessment
   - Model comparison
4. Summary

35 Evaluation tasks
Model selection
  Select the appropriate size (complexity) of a model
Performance assessment
  Quantify the performance of a given model for documentation purposes
Method comparison
  Compare the performance of different learning methods

36 How far should we grow a tree?
[Figure: a clinical decision tree grown on n = 4843 patients, with splits on creatinine level, elective vs. emergency surgery, mitral valve surgery, prior cardiac surgery, LVEF, COPD, BMI, and age; the leaf nodes give estimated risks ranging from 0.006 to 0.200.]

37 The model selection problem
When we build a model, we must decide upon its size (complexity).
- Simple models are robust but not flexible: they may neglect important features of the problem
- Complex models are flexible but not robust: they tend to overfit the data set
Model induction is a statistical estimation problem!

38 How can we minimize the true error rate?
[Figure: the training sample error rate underestimates the true error rate; the difference between the two is the optimistic bias.]

39 The split-sample procedure
1. The data set is randomly split into a training set and a test set (usually 2/3 vs. 1/3)
2. Models are built on the training set; error rates are measured on the test set
Drawbacks:
- data loss
- results are sensitive to the split
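A minimal sketch of the split-sample design, using a decision tree as the example learner; the data arrays `X` and `y` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(120, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=120) > 0).astype(int)

# 2/3 training, 1/3 test, as suggested on the slide
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0)

h = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
test_error = np.mean(h.predict(X_test) != y_test)
print(test_error)
```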

40 Cross-validation
1. Split the data set randomly into k subsets ("folds")
2. Build the model on k−1 folds
3. Compute the error on the remaining fold
4. Repeat k times
The average error on the k test folds approximates the true error on independent data. This requires an automated model building procedure.
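A minimal sketch of k-fold cross-validation for the same hypothetical data and decision-tree learner as above:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(120, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=120) > 0).astype(int)

def cv_error(X, y, k=10):
    errors = []
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(X):
        h = DecisionTreeClassifier(max_depth=3).fit(X[train_idx], y[train_idx])
        errors.append(np.mean(h.predict(X[test_idx]) != y[test_idx]))
    return np.mean(errors)   # average error over the k held-out folds

print(cv_error(X, y))
```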

41 Estimating the optimistic bias
We can also estimate the error on the training set and subtract an estimated bias afterwards. Roughly, there exist two methods to estimate the optimistic bias:
a) Look at the model's complexity, e.g. the number of parameters in a generalized linear model (AIC, BIC)
b) Take bootstrap samples to simulate the sampling distribution (computationally intensive)
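A minimal sketch of method (b), following the common optimism-bootstrap idea (details vary by author, and this is not necessarily the procedure the slides have in mind): estimate how much better a model looks on its own bootstrap sample than on the original data, and subtract that optimism from the apparent (training) error.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def optimism_corrected_error(X, y, n_boot=200, seed=0):
    """Apparent (training) error minus the bootstrap-estimated optimism."""
    rng = np.random.RandomState(seed)
    n = len(y)

    def fit(X_, y_):
        return DecisionTreeClassifier(max_depth=3).fit(X_, y_)

    def err(h, X_, y_):
        return np.mean(h.predict(X_) != y_)

    apparent = err(fit(X, y), X, y)      # error on the training data itself
    optimism = []
    for _ in range(n_boot):
        idx = rng.randint(0, n, n)       # bootstrap sample (with replacement)
        h_b = fit(X[idx], y[idx])
        # how much better the model looks on its own bootstrap sample
        optimism.append(err(h_b, X[idx], y[idx]) - err(h_b, X, y))
    return apparent - np.mean(optimism)

rng = np.random.RandomState(1)
X = rng.normal(size=(120, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=120) > 0).astype(int)
print(optimism_corrected_error(X, y))
```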

42 Summary: model selection
- In model selection, we trade off flexibility in the representation for statistical robustness
- The problem is to minimize the true error without suffering from a data loss
- We are not interested in the true error (or its uncertainty) itself – we just want to minimize it
- Methods: use independent observations, or estimate the optimistic bias

43 Performance assessment
In a performance assessment, we estimate how well a given model would perform on new data. The estimated performance should be unbiased and its uncertainty must be quantified. Preferably, the performance measure used should be easy to interpret (e.g. AUC).

44 Types of performance
Internal performance
  Performance on patients from the same population and in the same setting
Prospective performance
  Performance for future patients from the same population and in the same setting
External performance
  Performance for patients from another population or another setting

45 Internal performance
Both the split-sample and cross-validation procedures can be used to assess a model's internal performance, but not with the same data that was used in model selection. A commonly applied procedure looks as follows:
[Figure: folds 1 ... k are used for model selection, while a separate held-out part of the data is reserved for validation.]
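A minimal sketch of such a procedure, under the assumption that model selection (here, the tree depth) is done by cross-validation on a development part of the data only, and the selected model is then assessed on a held-out validation part that played no role in the selection. Data, candidate depths, and names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(150, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=150) > 0).astype(int)

# held-out validation part, not touched during model selection
X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=1 / 3,
                                              random_state=0)

def cv_error(X, y, depth, k=5):
    errs = []
    for tr, te in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        h = DecisionTreeClassifier(max_depth=depth).fit(X[tr], y[tr])
        errs.append(np.mean(h.predict(X[te]) != y[te]))
    return np.mean(errs)

# model selection on the development data only
best_depth = min([1, 2, 3, 5, None], key=lambda d: cv_error(X_dev, y_dev, d))

# internal performance assessed on data not used for selection
h = DecisionTreeClassifier(max_depth=best_depth).fit(X_dev, y_dev)
print(best_depth, np.mean(h.predict(X_val) != y_val))
```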

46 Mistakes are frequently made
Schwarzer et al. (2000) reviewed 43 applications of artificial neural networks in oncology. Most applications used a split-sample or cross-validation procedure to estimate performance. In 19 articles, an incorrect (optimistic) performance estimate was presented, e.g. model selection and validation on a single set. In 6 articles, the test set contained fewer than 20 observations.
Schwarzer G, et al. Stat Med 2000; 19:541–61.

47 Outline
1. Model evaluation basics
2. Performance measures
3. Evaluation tasks
   - Model selection
   - Performance assessment
   - Model comparison
4. Summary

48 Summary
- Both model induction and evaluation are statistical estimation problems
- In model induction we increase bias to reduce variation (and avoid overfitting)
- In model evaluation we must avoid bias or correct for it
- In model selection, we trade off flexibility for robustness by optimizing the true performance
- A common pitfall is to use data twice without correcting for the resulting bias

