
1 Announcements
- HW1 assigned, due Monday, Oct 9
- Reading assignment: paper on ROC curves; Chapter 4 of the textbook
- Midterm exam: in class on Oct 25

2 Last Time
- Naïve Bayes: a simple generative classifier
- Train / tune / test methodology

3 Today's Topics
- A lot of (very important) statistics: error bars on error rates, t-tests, ROC and recall-precision curves
- Next time, back to algorithms: logistic regression, discriminative vs. generative, perceptrons and neural networks

4 Why not learn after each test example?
- In "production mode", this would make sense (assuming one received the correct label and that this could be done efficiently).
- In "experiments", we wish to estimate the probability that we'll label the next example correctly, and we need several samples to estimate it accurately.

5 Choosing a Good N for CV (from the Weiss & Kulikowski textbook)

# of examples     Method
< 50              Instead, use bootstrapping (Efron); see the Weiss & Kulikowski ML text
50 to 100         Leave-one-out ("jackknife"): N = size of the data set (leave out one example each time)
> 100             10-fold cross validation (CV); useful for t-tests
> 1000            2-fold CV; however, for statistically significant comparisons between ML algorithms, you might want more folds to get reliable estimates of test-set accuracy

6 How should we empirically compare the accuracy of two machine learning algorithms?

7 Performance Scatterplots
If you have multiple data sets, you can use scatterplots. Which algorithm is better?

8 Confusion Matrices
Confusion matrices are useful on multi-class problems. Example: a speech-recognition system where the classes correspond to words.

9 Accuracy
Accuracy of the model from Algorithm A: 87%. Accuracy of the model from Algorithm B: 75%.
How confident are we that Algorithm A has a higher accuracy than Algorithm B on unseen examples? What information would help us answer this question?
Given: the model (hypothesis) learned by an ML algorithm, and a test set of N examples.
Determine: a range of values in which the true error rate likely falls.

10 Statistical Analysis of Sampling Effects
Roadmap:
- Assume test-set examples are independently and identically drawn from the same distribution as the unseen examples (the i.i.d. assumption); this allows us to use the Central Limit Theorem
- Model the probability density function for the error rate on N examples
- Use the PDF to get a confidence interval

11 Whiteboard examples, plus the Gaussian (Normal) distribution (Table 5.4)

12 Central Limit Theorem
Roughly: for large enough N, all distributions look Gaussian when summing/averaging N values. Surprisingly, N = 30 is large enough in most cases (see pg. 132 of the textbook).

13 Confidence Intervals
1) Estimate the mean μ and standard deviation σ to determine the Gaussian (in a few slides)
2) Use the Gaussian PDF to obtain confidence bounds on future measurements
English version of a confidence interval: we are M% certain that the true error rate of the hypothesis is within Δ of the measured error rate.

14 Solving for Δ
Solving the integral on the previous slide gives Δ = Z_M · σ, where Z_M is a constant that depends on the confidence level M. Tables of Z_M exist (see Table 5.1 for a small version).
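
Where a Z table isn't handy, Z_M can be computed from the inverse CDF of the standard normal. A minimal sketch, assuming SciPy is available:

```python
from scipy.stats import norm

def z_for_confidence(M):
    """Two-sided Z_M: the number of standard deviations that covers
    M% of a standard normal's probability mass."""
    alpha = 1.0 - M / 100.0
    return norm.ppf(1.0 - alpha / 2.0)

print(z_for_confidence(95))  # ~1.96, matching the usual table entry
```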

15 Random Variables, Expected Values, and Variances
A random variable is one whose value is the outcome of a probabilistic experiment.
- E.g., ask an oracle for an example (a point in feature space) and see if the ML algorithm gets it wrong.

16 Rewriting the Variance

17 Measuring Errors on a Dataset
(Note: the slide's formula contains a typo; it should be E[Y_i^2].)

18 Calculating Variance
Some useful properties of variance.

19 Putting It All Together
We evaluated our learned hypothesis N independent times. Assuming N > 30, the measured error rate will be approximately Normal(p, p(1 - p)/N).

20 Estimating p with a confidence interval: [the formula was an image on the slide; see the sketch below]
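
The standard normal-approximation interval is p_hat ± Z_M · sqrt(p_hat(1 - p_hat)/N). A minimal sketch, assuming SciPy; the 13-errors-in-100 example is hypothetical:

```python
import math
from scipy.stats import norm

def error_rate_ci(errors, N, M=95):
    """M% confidence interval for the true error rate p, using
    p_hat +/- Z_M * sqrt(p_hat * (1 - p_hat) / N)."""
    p_hat = errors / N
    z = norm.ppf(1.0 - (1.0 - M / 100.0) / 2.0)
    delta = z * math.sqrt(p_hat * (1.0 - p_hat) / N)
    return p_hat - delta, p_hat + delta

print(error_rate_ci(13, 100))  # e.g., 13 errors on a 100-example test set
```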

21 Next Topic: t-tests
What we've done so far involves a single train/test split. Now we'll see how we can use N-fold cross-validation to get better estimates.

22 Variance Goes Down as Sample Size Increases
If we independently measure a binary random variable N times, the distribution of the mean narrows as N grows. [Figure: probability vs. value, comparing a single measurement to the mean of N measurements. See the simulation sketch below.]
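
A small simulation illustrating the effect: the empirical standard deviation of the mean tracks the theoretical sqrt(p(1 - p)/N). A sketch, assuming NumPy; p = 0.3 is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3  # probability that the binary variable comes up 1
for N in (1, 10, 100, 1000):
    # Empirical std. dev. of the mean of N draws, over many repeats,
    # next to the theoretical sqrt(p * (1 - p) / N).
    means = rng.binomial(N, p, size=100_000) / N
    print(N, means.std(), np.sqrt(p * (1 - p) / N))
```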

23 Computing Confidence Intervals with 10-fold Cross Validation
Approach #1: pool all 10 test sets and use the single-test-set formula.
- Assumes that the same model was learned on each train set (a reasonable approximation)
- Use this in HW1
- With CV, N is the total number of labeled examples
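
A sketch of the pooled computation; the per-fold error counts and fold sizes are hypothetical, and SciPy is assumed:

```python
import math
from scipy.stats import norm

# Hypothetical per-fold error counts and fold sizes for 10-fold CV.
fold_errors = [3, 5, 4, 2, 6, 3, 4, 5, 2, 4]
fold_sizes = [50] * 10

# Pool: N is the total number of labeled examples across all test sets.
N = sum(fold_sizes)
p_hat = sum(fold_errors) / N
delta = norm.ppf(0.975) * math.sqrt(p_hat * (1 - p_hat) / N)  # 95% level
print(p_hat - delta, p_hat + delta)
```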

24 Computing Confidence Intervals with 10-fold Cross Validation
Approach #2: compute the mean and standard deviation of the 10 test-set accuracies.
- Can't approximate with a Gaussian, since we have fewer than 30 samples (so we don't know how much probability mass is covered)
- The 10 samples are not independent: the training sets overlap by 90%

25 Paired Student t-tests
Given: 10 training/test sets, 2 ML algorithms, and the error rates of the 2 algorithms on the 10 test sets.
Determine: Which algorithm is better on this problem? Is the difference statistically significant?

26 Paired Student t-Tests
Example:
  Algorithm A: 80  50  75  …  99
  Algorithm B: 79  49  74  …  98
  δ:           +1  +1  +1  …  +1
Algorithm A's mean is better, but the two standard deviations clearly overlap. Yet A is always better than B. (See the sketch below.)
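
A sketch of running a paired test on accuracies in the spirit of this example; the full vectors are hypothetical (the slide elides most of them), and scipy.stats.ttest_rel is assumed:

```python
from scipy.stats import ttest_rel

# Hypothetical per-fold accuracies: A beats B by a point or two on every
# fold, even though the two accuracy ranges overlap heavily.
acc_A = [80, 50, 75, 62, 91, 88, 70, 55, 84, 99]
acc_B = [79, 49, 74, 60, 90, 86, 69, 54, 83, 98]
t_stat, p_value = ttest_rel(acc_A, acc_B)
print(t_stat, p_value)  # large t, tiny p: the difference is significant
```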

27 The Random Variable in the t-test
Consider the random variable δ = (Algorithm A's test-set error) - (Algorithm B's test-set error).
Notice we're "factoring out" test-set difficulty by looking at relative performance. In general, one tries to explain the variance in results across experiments. Here we're saying that Variance = f(problem difficulty) + g(algorithm strength).

28 More on the Paired t-Test
Our NULL HYPOTHESIS is that the two ML algorithms have equivalent average accuracies, i.e., differences in the scores are due to "random fluctuations" about a mean of zero.
We compute the probability that the observed δ arose from the null hypothesis. If this probability is low, we reject the null hypothesis and say that the two algorithms appear different. "Low" is usually taken as 0.05.

29 Two Equivalent Views of the Null Hypothesis: 1
View 1: assume a zero mean and use the sample's variance (i.e., from the experiments). [Figure: P(δ) vs. δ, with 0.05 total probability mass in the two tails.]
Does our measured δ lie in the tail regions? If so, reject the null hypothesis. Equivalently: if we sampled from this distribution, would we be "surprised" to get δ?

30 Two Equivalent Views of the Null Hypothesis: 2
View 2: use the sample's mean and variance. [Figure: P(δ) vs. δ, with M% probability mass around the mean and 0.05 outside.]
Are we M% confident that δ = 0 isn't likely? If so, reject the null hypothesis.

31 The t-test Confidence Interval
Given: δ_1, …, δ_N, where each δ_i is measured on a test set of at least 30 examples (so the Central Limit Theorem applies, and the δ_i's are samples from Gaussians).
Compute: the confidence interval, at the M% level, for the mean difference.
See if the interval contains ZERO. If not, we can reject the NULL HYPOTHESIS that algorithms A and B perform equivalently.
Hence, if N is the typical 10, our dataset must have at least 300 examples.

32 The t-Test Calculation
- Compute the mean of the δ_i
- Compute the sample variance
- Look up the t value for N folds and confidence level M
  - "N - 1" is called the degrees of freedom
  - As N → ∞, the t-distribution approaches the normal
See Table 5.6 in Mitchell.

33 The t-test Calculation (continued)
Calculate the interval mean(δ) ± t_{M,N-1} · sqrt(sample variance / N). If the interval contains 0, we cannot reject the null hypothesis. [Figure: PDF over δ.] (A sketch of the whole calculation follows.)
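
A minimal sketch of slides 32-33, computing the interval by hand; SciPy is assumed for the t-table lookup, and the δ values are the +1/+2 differences from the earlier example:

```python
import math
from scipy.stats import t

def paired_t_interval(deltas, M=95):
    """M% confidence interval for the mean of the per-fold differences.
    If the interval contains 0, we cannot reject the null hypothesis."""
    N = len(deltas)
    mean = sum(deltas) / N
    sample_var = sum((d - mean) ** 2 for d in deltas) / (N - 1)
    t_val = t.ppf(1.0 - (1.0 - M / 100.0) / 2.0, df=N - 1)  # N-1 dof
    half = t_val * math.sqrt(sample_var / N)
    return mean - half, mean + half

lo, hi = paired_t_interval([1, 1, 1, 2, 1, 2, 1, 1, 1, 1])
print("reject null" if not (lo <= 0 <= hi) else "cannot reject null")
```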

34 Some Jargon: p-values
p-value = the probability of getting one's results, or something less likely, given that the NULL HYPOTHESIS holds. (Typically p ≤ 0.05 is taken to mean that a difference is statistically significant.) [Figure: null-hypothesis distribution with the tail probability P shaded.]

35 More on the t-Distribution
- We typically don't have enough folds to assume the Central Limit Theorem (i.e., N < 30), so we need to use the t-distribution
- It's wider (and hence shorter) than the Gaussian (Z) distribution, since both integrate to 1; hence, our confidence intervals will be wider
- Fortunately, t-tables exist
[Figure: Gaussian vs. t_N; there is a different t curve for each N.]
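
The widening is easy to see by comparing critical values; a sketch assuming SciPy:

```python
from scipy.stats import norm, t

# Two-sided 95% critical values: the t value shrinks toward the
# Gaussian's ~1.96 as the degrees of freedom grow.
print("z:", norm.ppf(0.975))
for df in (5, 9, 30, 100):
    print("t, df =", df, ":", t.ppf(0.975, df))
```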

36 Comments on the Sample Variance

37 Some Assumptions
General: the Central Limit Theorem applies (≥ 30 measurements averaged).
ML-specific: #errors / #tests accurately estimates p, the probability of error on a single example; p is used in the formula for σ, which characterizes expected future deviations about the mean μ = p.
Assumptions on examples: representative of future examples; independent of the training sets; individual examples independently and identically ("iid") drawn.
Paired t-tests: overlap in the training sets is ignored; assumes the difference in error rates is iid.

38 Stability
Stability = the change in the learned model due to minor perturbations of the training set. The paired t-test's assumptions are a better match to stable algorithms. For k-NN, the higher the k, the more stable.

39 More on the Paired t-test Assumptions
Ideally, we would train on one data set and then do a 10-fold paired t-test:
- What we should do: train | test1 … test10
- What we usually do: train1/test1 … train10/test10
However, there usually isn't enough data. If we treat the train data as part of each paired experiment, we violate the independence assumptions: each train set overlaps 90% with every other train set. (In the ideal setup, the learned result doesn't vary while we're measuring its performance.)

40 A Great Debate
Should you use a "one-tailed" or a "two-tailed" t-test?
- A two-tailed test asks: are algorithms A and B statistically different? (Null hypothesis: A and B have the same error rate.)
- A one-tailed test asks: is algorithm A statistically better than algorithm B? (Null hypothesis: A does not have a better error rate than B.)

41 One vs Two Tailed Graphic
[Figure: P(x) vs. x, showing the rejection regions: 2.5% in each tail for a two-tailed test vs. a single tail for a one-tailed test.]

42 A Great Debate (More)
Which of these tests should you use when comparing your new algorithm to a state-of-the-art algorithm? You should use two-tailed: by using two-tailed, you are saying "there is a chance I am better and a chance I am worse." One-tailed says "I know my algorithm is better," and therefore allows a larger margin of error.
See http://www.psychstat.smsu.edu/introbook/sbk25m.htm

43 Two-Sided vs One-Sided
You need to think very carefully about the question you are asking. Are we within x of the correct value? [Figure: measured mean with the interval from (mean - x) to (mean + x).]

44 Two-Sided vs One-Sided
How confident are we that ML System A's accuracy is at least 85%? [Figure: one-sided bound at 85%.]

45 Two-Sided vs One-Sided
Is ML algorithm A no more accurate than algorithm B? [Figure: distribution of A - B.]

46 Two-Sided vs One-Sided
Are algorithms A and B equivalently accurate? [Figure: distribution of A - B.]

47 Next Topics
- ROC curves
- Recall-precision

48 Contingency Tables (counts of occurrences)

                      Actual: TRUE        Actual: FALSE
Predicted: TRUE       TRUE positives      FALSE positives
Predicted: FALSE      FALSE negatives     TRUE negatives

TP rate = TP / (TP + FN)   (also called recall and sensitivity)
FP rate = FP / (FP + TN)   (equal to 1 - specificity)
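
A small sketch computing both rates from a table of counts; the counts in the example call are hypothetical:

```python
def rates(tp, fp, fn, tn):
    """TP and FP rates from a 2x2 contingency table of counts."""
    tp_rate = tp / (tp + fn)  # recall / sensitivity
    fp_rate = fp / (fp + tn)  # 1 - specificity
    return tp_rate, fp_rate

print(rates(tp=40, fp=5, fn=10, tn=45))  # -> (0.8, 0.1)
```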

49 ROC Curves
- ROC: Receiver Operating Characteristic curves
- Started during radar research in WWII
- Judging algorithms on accuracy alone may not be good enough when false negatives and false positives have different costs

50 ROC Curve Graphic
[Figure: TP rate vs. FP rate, each from 0 to 1.0, showing ROC curves for Alg 1 and Alg 2, the ideal spot in the upper-left corner, and the random-guess diagonal.]
Different algorithms can work better in different parts of ROC space; which is better depends on the cost of false positives vs. false negatives.

51 Creating an ROC Curve
You need an ML algorithm that outputs NUMERIC results, such as prob(example is +). You can use ensembles to get this from a Boolean model (covered later).

52 Create ROC Curve: the Algorithm
- Step 1: sort the predictions on the test set
- Step 2: locate a threshold between examples with opposite outputs
- Step 3: compute the TP rate and FP rate for each threshold from Step 2
- Step 4: connect the dots
(A code sketch of these steps appears below.)
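
One way to implement the four steps; a minimal sketch in plain Python:

```python
def roc_points(scored):
    """scored: list of (score, is_positive) pairs.
    Returns (FPR, TPR) points, one per threshold placed between
    neighbors with opposite labels in the sorted order."""
    scored = sorted(scored, key=lambda s: -s[0])  # Step 1: sort by score
    P = sum(1 for _, y in scored if y)
    N = len(scored) - P
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i, (_, y) in enumerate(scored):
        tp, fp = tp + y, fp + (not y)
        # Steps 2-3: emit a point where the label changes (and at the end)
        if i == len(scored) - 1 or y != scored[i + 1][1]:
            points.append((fp / N, tp / P))
    return points  # Step 4: connect these dots to draw the curve
```

Running this on the ten sorted examples on the next slide reproduces the six (TPR, FPR) points listed there.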

53 Plot ROC Curve Example
ML algorithm output (sorted), with the correct category:
  Ex 9: .99 +    Ex 7: .98 +    Ex 1: .72 -    Ex 2: .70 +    Ex 6: .65 +
  Ex 10: .51 -   Ex 3: .39 -    Ex 5: .24 +    Ex 4: .11 -    Ex 3: .01 -
Axes: TP rate = Prob(alg outputs + | + is correct); FP rate = Prob(alg outputs + | - is correct).
The thresholds give the points (TPR, FPR) = (2/5, 0/5), (2/5, 1/5), (4/5, 1/5), (4/5, 3/5), (5/5, 3/5), (5/5, 5/5).

54 Area Under the ROC Curve
A common metric for experiments is to numerically integrate the ROC curve. [Figure: TP rate vs. FP rate with the area under the curve shaded.]
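
A trapezoidal-integration sketch, assuming NumPy; the points are the ones derived on the previous slide, written as fractions with the origin prepended:

```python
import numpy as np

def auc(points):
    """Trapezoidal area under an ROC curve given (FPR, TPR) points."""
    fpr, tpr = zip(*sorted(points))
    return np.trapz(tpr, fpr)  # integrate TPR with respect to FPR

pts = [(0, 0), (0, .4), (.2, .4), (.2, .8), (.6, .8), (.6, 1), (1, 1)]
print(auc(pts))  # 0.8 for the previous slide's example curve
```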

55 ROC's and Many Models
It is not necessary to learn one model and then threshold its output to produce an ROC curve; you could learn different models for different regions of ROC space. See Goadrich, Oliphant, & Shavlik, ILP '04.

56 Asymmetric Error Costs
Assume that cost(FP) ≠ cost(FN). You would like to pick a threshold that minimizes
  E(total cost) = cost(FP) × prob(FP) × (# of -) + cost(FN) × prob(FN) × (# of +)
You could also have (negative) costs for TP and TN.
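
A sketch of picking the cheapest threshold under this formula; the (threshold, FPR, TPR) candidates and the costs are hypothetical:

```python
def best_threshold(candidates, cost_fp, cost_fn, n_neg, n_pos):
    """Pick the (threshold, FPR, TPR) triple minimizing expected cost:
    cost(FP) * prob(FP) * (# of -) + cost(FN) * prob(FN) * (# of +)."""
    def expected_cost(fpr, tpr):
        # prob(FP) over negatives is the FPR; prob(FN) over positives is 1 - TPR
        return cost_fp * fpr * n_neg + cost_fn * (1 - tpr) * n_pos
    return min(candidates, key=lambda c: expected_cost(c[1], c[2]))

# Hypothetical candidates; false negatives cost 5x false positives.
pts = [(0.9, 0.0, 0.4), (0.6, 0.2, 0.8), (0.3, 0.6, 1.0)]
print(best_threshold(pts, cost_fp=1, cost_fn=5, n_neg=5, n_pos=5))
```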

57 ROC's & Skewed Data
One strength of ROC curves is that they are a good way to deal with skewed data (e.g., |-| ≫ |+|), since the axes are fractions (rates) independent of the number of examples.
You must be careful, though: a low FP rate times many negative examples is still a sizable number of FPs, possibly more than the number of TPs.

58 Precision vs Recall
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Notice that TN is not used in either formula.

59 ROC vs Recall-Precision
You can get very different visual results on the same data. [Figures: an ROC curve, P(+ | +) vs. P(+ | -), side by side with a precision vs. recall curve.] The reason is that there may be lots of - examples.

60 Recall-Precision Curve
You cannot simply connect the dots in recall-precision curves. See Goadrich, Oliphant, & Shavlik, ILP '04. [Figure: precision vs. recall, with a straight-line interpolation marked as invalid.]

61 The Permutation Test
Another way to judge the significance of an empirical result. This is just starting to appear in a few ML papers, but it comes from an old idea in the statistics community.
Method: multiple times, permute the class labels of the train and prune sets, then train and evaluate on the test sets. See how likely it is that you get results as good on random labels. (A sketch follows.)
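
A sketch of the idea; score_fn is a hypothetical routine that trains on the given labels and returns test-set performance:

```python
import random

def permutation_test(labels, score_fn, n_perms=1000, seed=0):
    """Empirical p-value: the fraction of label permutations whose score
    matches or beats the real one. score_fn is a hypothetical routine
    that trains on the given labels and returns test-set performance."""
    rng = random.Random(seed)
    real_score = score_fn(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perms):
        rng.shuffle(shuffled)  # permute the class labels
        hits += score_fn(shuffled) >= real_score
    return hits / n_perms  # small value => result unlikely by chance
```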

62 Experimental Methodology Wrapup
- Never train on test sets (use tune sets)
- Use the Central Limit Theorem to place confidence intervals on measurements
- Paired t-tests provide a sensitive way to judge whether two algorithms perform differently
- The t-test is a useful heuristic for guiding research
- Use a two-tailed test
- ROC curves are better than accuracy

