
1 Model Evaluation

2 CRISP-DM

3 CRISP-DM Phases
Business Understanding
– Initial phase
– Focuses on:
  - Understanding the project objectives and requirements from a business perspective
  - Converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives
Data Understanding
– Starts with an initial data collection
– Proceeds with activities aimed at:
  - Getting familiar with the data
  - Identifying data quality problems
  - Discovering first insights into the data
  - Detecting interesting subsets to form hypotheses for hidden information

4 CRISP-DM Phases
Data Preparation
– Covers all activities to construct the final dataset (the data that will be fed into the modeling tool(s)) from the initial raw data
– Data preparation tasks are likely to be performed multiple times, and not in any prescribed order
– Tasks include table, record, and attribute selection, as well as transformation and cleaning of data for modeling tools
Modeling
– Various modeling techniques are selected and applied, and their parameters are calibrated to optimal values
– Typically, there are several techniques for the same data mining problem type
– Some techniques have specific requirements on the form of the data; therefore, stepping back to the data preparation phase is often needed

5 CRISP-DM Phases
Evaluation
– At this stage, a model (or models) that appears to have high quality, from a data analysis perspective, has been built
– Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives
– A key objective is to determine if there is some important business issue that has not been sufficiently considered
– At the end of this phase, a decision on the use of the data mining results should be reached

6 CRISP-DM Phases
Deployment
– Creation of the model is generally not the end of the project
– Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use
– Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process
– In many cases it will be the customer, not the data analyst, who carries out the deployment steps
– However, even if the analyst will not carry out the deployment effort, it is important for the customer to understand up front what actions will need to be carried out in order to actually make use of the created models

7 Evaluating Classification Systems
Two issues:
– What evaluation measure should we use?
– How do we ensure reliability of our model?

8 EVALUATION How do we ensure reliability of our model?

9 How do we ensure reliability? Reliability depends heavily on how we train and test the model.

10 Data Partitioning
Randomly partition the data into a training set and a test set.
Training set – the data used to train/build the model
– Estimate parameters (e.g., for a linear regression), build a decision tree, build an artificial neural network, etc.
Test set – a set of examples not used for model induction; the model's performance is evaluated on unseen data (aka out-of-sample data).
Generalization error: the model's error on the test data.
[Figure: the data split into a set of training examples and a set of test examples.]
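A minimal sketch of the random train/test partition described above, in Python with NumPy. The synthetic data, the array names X and y, and the two-thirds/one-third split ratio are illustrative assumptions, not part of the slides.

```python
# Random train/test partition sketch (synthetic data; names are illustrative).
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))                       # 300 examples, 5 features
y = (X[:, 0] + rng.normal(size=300) > 0).astype(int)

idx = rng.permutation(len(X))                       # shuffle the row indices
n_test = len(X) // 3                                # hold out one third for testing
test_idx, train_idx = idx[:n_test], idx[n_test:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
# Fit the model on (X_train, y_train) only; report the generalization error on (X_test, y_test).
```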

11 Complexity and Generalization
[Figure: the score function (e.g., squared error) evaluated on the training data, S_train(θ), and on the test data, S_test(θ), plotted against model complexity, where complexity = degrees of freedom in the model (e.g., number of variables); the optimal model complexity is marked where the test-set score is lowest.]

12 Holding out data
The holdout method reserves a certain amount of the data for testing and uses the remainder for training
– Usually: one third for testing, the rest for training
For "unbalanced" datasets, random samples might not be representative
– Few or no instances of some classes
Stratified sample:
– Make sure that each class is represented with approximately equal proportions in both subsets
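One way to get the stratified holdout split described above is scikit-learn's train_test_split with its stratify argument. A hedged sketch, assuming scikit-learn is available; the unbalanced synthetic data is an illustrative assumption.

```python
# Stratified holdout sketch with scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (rng.random(300) < 0.1).astype(int)     # unbalanced: roughly 10% positives

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)
# stratify=y keeps the ~10%/90% class ratio approximately equal in both subsets.
print(y_train.mean(), y_test.mean())
```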

13 Repeated holdout method
The holdout estimate can be made more reliable by repeating the process with different subsamples
– In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
– The error rates on the different iterations are averaged to yield an overall error rate
This is called the repeated holdout method

14 Cross-validation
The most popular and effective type of repeated holdout is cross-validation
Cross-validation avoids overlapping test sets
– First step: the data is split into k subsets of equal size
– Second step: each subset in turn is used for testing and the remainder for training
This is called k-fold cross-validation
Often the subsets are stratified before the cross-validation is performed
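A minimal sketch of the k-fold split-and-rotate logic written out by hand. The majority-class "model" and the synthetic data are placeholders (assumptions); any learner could be plugged in instead.

```python
# Hand-rolled k-fold cross-validation sketch.
import numpy as np

def k_fold_error(X, y, k=10, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)             # k roughly equal-sized subsets
    errors = []
    for i in range(k):
        test_idx = folds[i]                    # fold i is used for testing...
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        majority = int(np.bincount(y[train_idx]).argmax())    # "train" the dummy model
        errors.append(np.mean(y[test_idx] != majority))       # error on the held-out fold
    return float(np.mean(errors))              # average error over the k folds

# Example usage on a small synthetic, unbalanced dataset:
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (rng.random(200) < 0.3).astype(int)
print(k_fold_error(X_demo, y_demo, k=10))
```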

15 Cross-validation example:
[Figure illustrating k-fold cross-validation: the data is divided into k folds, and each fold in turn serves as the test set while the remaining folds form the training set.]

16 More on cross-validation
The standard data-mining method for evaluation: stratified ten-fold cross-validation
Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate
Stratification reduces the estimate's variance
Even better: repeated stratified cross-validation
– E.g., ten-fold cross-validation is repeated ten times and the results are averaged (reduces the sampling variance)
The error estimate is the mean across all repetitions
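A hedged sketch of repeated stratified ten-fold cross-validation using scikit-learn. The logistic-regression estimator and the synthetic data are illustrative assumptions.

```python
# Repeated stratified 10-fold CV sketch with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)  # accuracy per fold

print(f"mean accuracy = {scores.mean():.3f}  (error estimate = {1 - scores.mean():.3f})")
```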

17 Leave-One-Out cross-validation
Leave-One-Out: a particular form of cross-validation:
– Set the number of folds to the number of training instances
– I.e., for n training instances, build the classifier n times
Makes best use of the data
Involves no random subsampling
Computationally expensive, but good performance

18 Leave-One-Out-CV and stratification
Disadvantage of Leave-One-Out-CV: stratification is not possible
– It guarantees a non-stratified sample because there is only one instance in the test set!
Extreme example: a random dataset split equally into two classes
– The best model predicts the majority class
– 50% accuracy on fresh data
– The Leave-One-Out-CV estimate is 100% error!

19 Three-way data splits
One problem with CV is that, because the data is used jointly to fit the model and to estimate its error, the error estimate can be biased downward.
If the goal is a realistic estimate of error (as opposed to deciding which model is best), you may want a three-way split:
– Training set: examples used for learning
– Validation set: used to tune parameters
– Test set: never used in the model-fitting process; used at the end for an unbiased estimate of the hold-out error
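A sketch of the three-way split in index form. The 60/20/20 proportions and the dataset size are illustrative assumptions.

```python
# Train/validation/test split sketch (60/20/20 is an assumed ratio).
import numpy as np

rng = np.random.default_rng(1)
n = 1000
idx = rng.permutation(n)
n_train, n_val = int(0.6 * n), int(0.2 * n)

train_idx = idx[:n_train]                  # fit candidate models here
val_idx = idx[n_train:n_train + n_val]     # tune parameters / pick the best model here
test_idx = idx[n_train + n_val:]           # touched only once, for the final unbiased error estimate
```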

20 The Bootstrap
The statistician Bradley Efron proposed a very simple and clever idea for mechanically estimating confidence intervals: the bootstrap.
The idea is to take multiple resamples of your original dataset.
Compute the statistic of interest on each resample; you thereby estimate the distribution of this statistic!
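A minimal bootstrap sketch: resample the data with replacement many times, compute the statistic on each resample, and read a confidence interval off the resulting distribution. The exponential data, the sample size, and the number of resamples are illustrative assumptions.

```python
# Percentile-bootstrap confidence interval for the mean.
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=200)          # the "original" sample

boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()   # one resample's statistic
    for _ in range(5000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])      # 95% percentile interval
print(f"mean = {data.mean():.3f}, 95% CI ~ ({lo:.3f}, {hi:.3f})")
```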

21 Sampling with Replacement
Draw a data point at random from the data set. Then throw it back in.
Draw a second data point. Then throw it back in…
Keep going until we've got 1000 data points. You might call this a "pseudo" data set.
This is not merely re-sorting the data. Some of the original data points will appear more than once; others won't appear at all.

22 Sampling with Replacement
In fact, there is a chance of (1 − 1/1000)^1000 ≈ 1/e ≈ 0.368 that any one of the original data points won't appear at all if we sample with replacement 1000 times.
⇒ any given data point is included with probability ≈ 0.632
Intuitively, we treat the original sample as the "true population in the sky". Each resample simulates the process of taking a sample from the "true" distribution.
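A quick numeric check of the inclusion probability quoted on the slide.

```python
# Probability that a given point is never drawn in n draws with replacement.
n = 1000
p_never = (1 - 1 / n) ** n
print(p_never, 1 - p_never)   # ~0.368 and ~0.632
```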

23 Bootstrapping & Validation
This is interesting in its own right, but bootstrapping also relates back to model validation, along the lines of cross-validation.
You can fit models on bootstrap resamples of your data.
For each resample, test the model on the ≈ 36.8% of the data not in your resample.
The estimate will be biased, but corrections are available.
You get a spectrum of ROC curves.
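A sketch of this "fit on the resample, test on the left-out points" idea (often called out-of-bag evaluation). The nearest-centroid classifier, the synthetic data, and the 200 resamples are stand-in assumptions; any model could be used, and no bias correction is applied here.

```python
# Bootstrap validation sketch: evaluate on out-of-bag points.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestCentroid

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
rng = np.random.default_rng(0)
oob_scores = []
for _ in range(200):
    boot = rng.integers(0, len(X), size=len(X))       # indices drawn with replacement
    oob = np.setdiff1d(np.arange(len(X)), boot)       # points not in the resample (~36.8%)
    model = NearestCentroid().fit(X[boot], y[boot])
    oob_scores.append(model.score(X[oob], y[oob]))    # accuracy on out-of-bag data

print(f"mean out-of-bag accuracy ~ {np.mean(oob_scores):.3f}")
```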

24 Closing Thoughts
The "cross-validation" approach has several nice features:
– Relies on the data, not likelihood theory, etc.
– Comports nicely with the lift curve concept.
– Allows model validation that has both business and statistical meaning.
– Is generic ⇒ can be used to compare models generated from competing techniques…
– … or even pre-existing models.
– Can be performed on different sub-segments of the data.
– Is very intuitive, easily grasped.

25 Closing Thoughts
Bootstrapping has a family resemblance to cross-validation:
– Use the data to estimate features of a statistic or a model that we previously relied on statistical theory to give us.
– Classic examples of the "data mining" (in the non-pejorative sense of the term!) mindset: leverage modern computers to "do it yourself" rather than look up a formula in a book! Generic tools that can be used creatively.
– Can be used to estimate model bias and variance.
– Can be used to estimate (simulate) distributional characteristics of very difficult statistics.
– Ideal for many actuarial applications.

26 METRICS What evaluation measure should we use?

27 Evaluation of Classification
Accuracy = (a + d) / (a + b + c + d)
– Not always the best choice
Confusion matrix used on these slides (columns = actual outcome, rows = predicted outcome):

              actual 1   actual 0
predicted 1       a          b
predicted 0       c          d

Example: assume 1% fraud and a model that always predicts "no fraud":

                     Actual Fraud   Actual No Fraud
Predicted Fraud            0                0
Predicted No Fraud        10              990

What is the accuracy? (0 + 990) / 1000 = 99%, even though not a single fraud case is caught.
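The slide's fraud example worked out in code, showing why accuracy alone can mislead (counts taken from the table above).

```python
# 1% fraud; the model always predicts "no fraud".
tp, fp = 0, 0        # predicted fraud: actual fraud / actual no fraud
fn, tn = 10, 990     # predicted no fraud: actual fraud / actual no fraud

accuracy = (tp + tn) / (tp + fp + fn + tn)   # 0.99 -- looks excellent
recall = tp / (tp + fn)                      # 0.0  -- no fraud case is ever caught
print(accuracy, recall)
```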

28 Evaluation of Classification
Other options:
– Recall, or sensitivity (how many of those that are really positive did you predict as positive?): a / (a + c)
– Precision (how many of those predicted positive really are positive?): a / (a + b)
Precision and recall are always in tension
– Increasing one tends to decrease the other
(Same a/b/c/d confusion matrix as on the previous slide.)

29 Evaluation of Classification
Yet another option:
– Recall, or sensitivity (how many of the positives did you get right?): a / (a + c)
– Specificity (how many of the negatives did you get right?): d / (b + d)
Sensitivity and specificity have the same tension
Different fields use different metrics
(Same a/b/c/d confusion matrix as before.)
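A small helper sketch collecting the metrics defined above from the a/b/c/d confusion-matrix cells (a = true positives, b = false positives, c = false negatives, d = true negatives). The example numbers are the 8/3/0/9 case used on a later slide.

```python
# Metrics from the a/b/c/d confusion matrix used on these slides.
def recall(a, b, c, d):        # aka sensitivity
    return a / (a + c)

def precision(a, b, c, d):
    return a / (a + b)

def specificity(a, b, c, d):
    return d / (b + d)

# a=8, b=3, c=0, d=9 gives recall 1.0, precision ~0.73, specificity 0.75.
print(recall(8, 3, 0, 9), precision(8, 3, 0, 9), specificity(8, 3, 0, 9))
```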

30 Evaluation for a Thresholded Response Many classification models output probabilities These probabilities get thresholded to make a prediction. Classification accuracy depends on the threshold – good models give low probabilities to Y=0 and high probabilities to Y=1.

31 Test Data
[Table of predicted probabilities for the individual test cases omitted in the transcript.]
Suppose we use a cutoff of 0.5… The resulting confusion matrix (columns = actual outcome, rows = predicted outcome):

              actual 1   actual 0
predicted 1       8          3
predicted 0       0          9

32 Suppose we use a cutoff of 0.5…

              actual 1   actual 0
predicted 1       8          3
predicted 0       0          9

sensitivity = 8 / (8 + 0) = 100%
specificity = 9 / (9 + 3) = 75%
We want both of these to be high.

33 Suppose we use a cutoff of 0.8…

              actual 1   actual 0
predicted 1       6          2
predicted 0       2         10

sensitivity = 6 / (6 + 2) = 75%
specificity = 10 / (10 + 2) ≈ 83%

34
Note there are 20 possible thresholds.
Plotting all values of sensitivity vs. specificity gives a sense of model performance by showing the tradeoff at different thresholds.
Note: if the threshold = minimum, then c = d = 0, so sensitivity = 1 and specificity = 0.
If the threshold = maximum, then a = b = 0, so sensitivity = 0 and specificity = 1.
If the model is perfect, sensitivity = 1 and specificity = 1.
(a, b, c, d refer to the same confusion-matrix cells as before.)

35 ROC curve plots sensitivity vs. (1-specificity) – also known as false positive rate Always goes from (0,0) to (1,1) The more area in the upper left, the better Random model is on the diagonal “Area under the curve” (AUC) is a common measure of predictive performance
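A hedged sketch of producing an ROC curve and its AUC with scikit-learn; the logistic-regression classifier and the synthetic data are illustrative assumptions.

```python
# ROC curve and AUC sketch with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, probs)   # fpr = 1 - specificity, tpr = sensitivity
print("AUC =", roc_auc_score(y_te, probs))
```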

36 ROC CURVES

37 Receiver Operating Characteristic curve
ROC curves were developed in the 1950s as a by-product of research into making sense of radar signals contaminated by noise.
More recently it has become clear that they are remarkably useful in decision-making.
They are a performance graphing method: the true positive and false positive fractions are plotted as we move the dividing threshold. They look like:
[ROC curve figure omitted in the transcript.]

38 ROC Space
ROC graphs are two-dimensional graphs in which the TP rate is plotted on the Y axis and the FP rate is plotted on the X axis. An ROC graph depicts the relative trade-offs between benefits (true positives) and costs (false positives).
The figure shows an ROC graph with five classifiers labeled A through E.
A discrete classifier is one that outputs only a class label. Each discrete classifier produces an (FP rate, TP rate) pair corresponding to a single point in ROC space. The classifiers in the figure are all discrete classifiers.

39 Several Points in ROC Space
The lower left point (0, 0) represents the strategy of never issuing a positive classification;
– such a classifier commits no false positive errors but also gains no true positives.
The upper right corner (1, 1) represents the opposite strategy, of unconditionally issuing positive classifications.
Point (0, 1) represents perfect classification.
– D's performance is perfect as shown.
Informally, one point in ROC space is better than another if it is to the northwest of the first
– the TP rate is higher, the FP rate is lower, or both.

40 Specific Example
[Figure: distributions of the test result for patients with the disease and patients without the disease.]

41 [Figure: a threshold on the test result; patients below the threshold are called "negative", patients above it are called "positive".]

42 Some definitions…
[Figure: patients with the disease whose test result falls above the threshold are the True Positives.]

43 [Figure: patients without the disease whose test result falls above the threshold are the False Positives.]

44 [Figure: patients without the disease whose test result falls below the threshold are the True Negatives.]

45 [Figure: patients with the disease whose test result falls below the threshold are the False Negatives.]

46 Moving the Threshold: right
[Figure: the "−"/"+" threshold shifted to the right over the two distributions.]

47 Moving the Threshold: left
[Figure: the "−"/"+" threshold shifted to the left over the two distributions.]

48 ROC curve
[Figure: ROC curve with True Positive Rate (sensitivity), 0% to 100%, on the Y axis and False Positive Rate (1 − specificity), 0% to 100%, on the X axis.]

49 ROC curve comparison
[Figure: two ROC plots of True Positive Rate vs. False Positive Rate (each from 0% to 100%): a good test (curve pulled toward the upper left) and a poor test (curve close to the diagonal).]

50 ROC curve extremes
[Figure: True Positive Rate vs. False Positive Rate (each from 0% to 100%). Best test: the two class distributions don't overlap at all. Worst test: the distributions overlap completely.]

51 How to Construct an ROC Curve for one Classifier
– Sort the instances according to their P_pos.
– Move a threshold over the sorted instances.
– For each threshold, define a classifier with its confusion matrix.
– Plot the TP and FP rates of these classifiers.

Example (five instances):

P_pos   True Class
0.99    pos
0.98    pos
0.7     neg
0.6     pos
0.43    neg

Confusion matrix at one threshold:

            Predicted pos   Predicted neg
True pos          2               1
True neg          1               1
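The recipe above written out as a short sketch: sort by P_pos, sweep the threshold down through the scores, and record an (FPr, TPr) point at each step. It uses the five-instance table from this slide; ties in P_pos are not handled (an assumption, and there are none here).

```python
# Construct ROC points from the slide's five-instance example.
p_pos = [0.99, 0.98, 0.7, 0.6, 0.43]
true = ["pos", "pos", "neg", "pos", "neg"]

n_pos = true.count("pos")
n_neg = true.count("neg")

points = [(0.0, 0.0)]            # threshold above every score: nothing predicted positive
tp = fp = 0
for p, label in sorted(zip(p_pos, true), reverse=True):   # descending P_pos
    if label == "pos":
        tp += 1
    else:
        fp += 1
    points.append((fp / n_neg, tp / n_pos))   # (FPr, TPr) for a threshold just below p

print(points)   # ~[(0,0), (0,0.33), (0,0.67), (0.5,0.67), (0.5,1.0), (1.0,1.0)]
```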

52 Creating an ROC Curve
A classifier produces a single ROC point.
If the classifier has a "sensitivity" parameter, varying it produces a series of ROC points (confusion matrices).
Alternatively, if the classifier is produced by a learning algorithm, a series of ROC points can be generated by varying the class ratio in the training set.

53 ROC for one Classifier Good separation between the classes, convex curve.

54 ROC for one Classifier Reasonable separation between the classes, mostly convex.

55 ROC for one Classifier Fairly poor separation between the classes, mostly convex.

56 ROC for one Classifier Poor separation between the classes, large and small concavities.

57 ROC for one Classifier Random performance.

58 The AUC Metric
The area under the ROC curve (AUC) assesses the ranking in terms of the separation of the classes.
AUC estimates the probability that a randomly chosen positive instance will be ranked above a randomly chosen negative instance.
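The ranking interpretation above, computed directly as the fraction of (positive, negative) pairs in which the positive instance receives the higher score. It reuses the five-instance example from the earlier slide; counting ties as 0.5 is a common convention assumed here.

```python
# AUC as a pairwise ranking probability (five-instance example).
p_pos = [0.99, 0.98, 0.7, 0.6, 0.43]
true = ["pos", "pos", "neg", "pos", "neg"]

pos_scores = [p for p, t in zip(p_pos, true) if t == "pos"]
neg_scores = [p for p, t in zip(p_pos, true) if t == "neg"]

pairs = [(p, n) for p in pos_scores for n in neg_scores]
auc = sum((p > n) + 0.5 * (p == n) for p, n in pairs) / len(pairs)
print(auc)   # 5 of 6 pairs ranked correctly -> AUC ~ 0.833
```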

59 Comparing Models Highest AUC wins But pay attention to ‘Occam’s Razor’ – ‘the best theory is the smallest one that describes all the facts’ – Also known as the ‘parsimony principle’ – If two models are similar, pick the simpler one

