1
Data Mining CSCI 307, Spring 2019 Lecture 31
Comparing Data Mining Schemes
2
5.6 Comparing Data Mining Schemes
Question: Suppose we have two classifiers, M1 and M2. Which one is better?
Obvious way: use 10-fold cross-validation to obtain the mean error rates $\overline{err}(M_1)$ and $\overline{err}(M_2)$.
These mean error rates are only estimates of the error on the true population of future data cases.
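As a rough illustration of the "obvious way", the sketch below computes per-fold error rates with k-fold cross-validation. The names `k_fold_indices`, `cv_error_rates`, and `majority_class` are hypothetical helpers invented for this example, and the learner is a toy stand-in for a scheme such as M1 or M2. Passing the same `seed` for both schemes gives both the same partitioning.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_error_rates(data, labels, fit, k=10, seed=0):
    """Return the k per-fold error rates of the learner `fit`.

    `fit(train_x, train_y)` is a hypothetical callable that returns a
    prediction function; it stands in for a scheme such as M1 or M2.
    """
    errors = []
    for test_idx in k_fold_indices(len(data), k, seed):
        held_out = set(test_idx)
        train = [i for i in range(len(data)) if i not in held_out]
        predict = fit([data[i] for i in train], [labels[i] for i in train])
        wrong = sum(predict(data[i]) != labels[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return errors


def majority_class(train_x, train_y):
    """Toy learner: always predict the most frequent training label."""
    most_common = max(set(train_y), key=train_y.count)
    return lambda x: most_common


# Toy data: 70 instances of class 0 and 30 of class 1.
data = list(range(100))
labels = [0] * 70 + [1] * 30
fold_errors = cv_error_rates(data, labels, majority_class, k=10, seed=0)
mean_error = sum(fold_errors) / len(fold_errors)
print(f"mean error over {len(fold_errors)} folds: {mean_error:.3f}")
```

The mean of the per-fold errors is the cross-validation estimate; the individual fold errors are what a significance test would later compare.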
3
Comparing Schemes continued
Want to show that scheme M1 is better than scheme M2 in a particular domain:
  For a given amount of training data
  On average, across all possible training sets
Assume we have an infinite amount of data from the domain:
  Obtain a cross-validation estimate on each dataset for each scheme
  Check if the mean accuracy for scheme M1 is better than the mean accuracy for scheme M2
We probably don't have an infinite amount of data.
4
What about ML research?
What if the difference between the two error rates is just attributable to chance?
Use a test of statistical significance.
Obtain confidence limits for our error estimates.
5
Overview Estimating Confidence Intervals: Null Hypothesis
Perform 10-fold cross-validation.
Assume the samples follow a t distribution with k–1 degrees of freedom (here, k = 10).
Use the t-test (or Student's t-test).
If we fail to reject the Null Hypothesis: M1 & M2 are the "same".
If we can reject the Null Hypothesis, then:
  Conclude that the difference between M1 & M2 is statistically significant
  Choose the model with the lower error rate
In statistics, the number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. More generally, the number of independent ways in which a dynamic system can move, without violating any constraint imposed on it, is called its number of degrees of freedom.
6
Sidebar: the Null Hypothesis
Computer scientists like to prove a hypothesis true or false, but we don't do that here. We might be less sure and say we accept the hypothesis or reject it, but we shouldn't do that either. Statisticians either reject the hypothesis or fail to reject it.
If we can reject the Null Hypothesis, then we conclude that the difference between the two machine learning methods is statistically significant.
If we fail to reject the Null Hypothesis, then we conclude that the difference between the two machine learning methods could be just chance.
7
Paired t-test
In practice we have limited data and a limited number of estimates for computing the mean.
Student's t-test tells whether the means of two samples are significantly different.
In our case, the samples are cross-validation estimates for datasets from the domain.
Use a paired t-test because the individual samples are paired: the same cross-validation is applied twice.
William Gosset. Born: 1876 in Canterbury; died: 1937 in Beaconsfield, England. Obtained a post as a chemist in the Guinness brewery in Dublin. Invented the t-test to handle small samples for quality control in brewing. Wrote under the name "Student."
8
Estimating Confidence Intervals: t-test
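If only one test set is available: pairwise comparison.
For the ith round of 10-fold cross-validation, the same partitioning is used to obtain $err(M_1)_i$ and $err(M_2)_i$.
Average over the 10 rounds to get the mean error rates $\overline{err}(M_1)$ and $\overline{err}(M_2)$.
The t-test computes the t-statistic with k−1 degrees of freedom:
$t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{\sqrt{var(M_1 - M_2)/k}}$
where
$var(M_1 - M_2) = \frac{1}{k}\sum_{i=1}^{k}\left[err(M_1)_i - err(M_2)_i - \left(\overline{err}(M_1) - \overline{err}(M_2)\right)\right]^2$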
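A minimal sketch of the paired t-statistic follows. It assumes the variance of the per-round differences is averaged over k (not k−1), which is the convention that reproduces the worked numbers later in these slides (std dev 8.25, t = 2.47). The function name `paired_t_statistic` is made up for this example.

```python
import math


def paired_t_statistic(err_m1, err_m2):
    """Paired t-statistic for two equally long lists of per-round error rates.

    Assumption: the variance of the differences divides by k, not k - 1.
    """
    k = len(err_m1)
    if k != len(err_m2) or k < 2:
        raise ValueError("need two equally long samples with at least 2 rounds")
    # Pair the rounds: round i of M1 is compared with round i of M2.
    diffs = [a - b for a, b in zip(err_m1, err_m2)]
    mean_diff = sum(diffs) / k
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / k
    if var_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_diff / math.sqrt(var_diff / k)


# Small made-up sample: differences are 2, 3, 1.
t_small = paired_t_statistic([3.0, 5.0, 7.0], [1.0, 2.0, 6.0])
print(f"t = {t_small:.4f} with {3 - 1} degrees of freedom")
```

Library routines for the paired t-test typically divide by k−1 instead, so their t value will be slightly smaller in magnitude than the one computed here.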
9
Table for t-distribution
Significance level: e.g., sig = 0.05 or 5%.
Confidence limit: z = value(sig/2).
The distribution is symmetric, so the lower limit is −z = −value(sig/2).
Rejection region: t > z or t < −z.
10
Statistical Significance
Are M1 & M2 significantly different?
Compute t. Select a significance level (e.g. sig = 5%).
Consult the table for the t-distribution: find the t value corresponding to k−1 degrees of freedom (here, 9).
The t-distribution is symmetric: typically the upper percentage points of the distribution are shown, so look up the confidence limit z at sig/2 (here, 0.025).
If t > z or t < −z, then the t value lies in the rejection region:
  Reject the null hypothesis that the mean error rates of M1 & M2 are the same
  Conclude: there is a statistically significant difference between M1 & M2
Otherwise, if −z ≤ t ≤ z, fail to reject the null hypothesis that the mean error rates of M1 & M2 are the same:
  Conclude: any difference could be due to chance
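The decision rule can be sketched as a small lookup. The dictionary below holds two-tailed critical values for 9 degrees of freedom as found in a standard t-table; the names `T_CRITICAL_9_DF` and `is_significant` are hypothetical and only cover the k = 10 case.

```python
# Two-tailed critical values z for 9 degrees of freedom (k = 10 rounds),
# keyed by significance level. Each z is the table entry for sig/2.
T_CRITICAL_9_DF = {0.10: 1.833, 0.05: 2.262, 0.01: 3.250}


def is_significant(t, sig=0.05, table=T_CRITICAL_9_DF):
    """Return True if t lies in the rejection region, i.e. t > z or t < -z."""
    z = table[sig]
    return t > z or t < -z


# The same t value can be significant at 5% but not at 1%.
for level in (0.05, 0.01):
    verdict = "reject" if is_significant(2.47, level) else "fail to reject"
    print(f"sig = {level}: {verdict} the null hypothesis")
```

For other values of k the table would need the row for k−1 degrees of freedom.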
11
Recap: Performing the Test
Fix a significance level.
  If a difference is significant at the a% level, there is a (100−a)% chance that the true means differ.
Divide the significance level by two because the test is two-tailed.
Look up the value for z that corresponds to a/2.
If t ≤ −z or t ≥ z, then the difference is significant, i.e. the null hypothesis (that the difference is zero) can be rejected.
12
EXAMPLE
We have two prediction models, M1 and M2. We have performed 10 rounds of 10-fold cross-validation on each model, where the same data partitioning in round i is used for both M1 and M2.
The error rates obtained for M1 are 30.5, 32.2, 20.7, 20.6, 31.0, 41.0, 27.7, 26.0, 21.5, 26.0.
The error rates for M2 are 22.4, 14.5, 22.4, 19.6, 20.7, 20.4, 22.1, 19.4, 16.2, 35.0.
Is one model significantly better than the other at a significance level of 1%?
13
EXAMPLE continued
We use a hypothesis test to determine whether there is a significant difference in average error. We used the same test data for each observation, so we use the "paired observation" hypothesis test to compare two means:
H0: $\bar{x}_1 - \bar{x}_2 = 0$ (Null hypothesis, difference is chance)
H1: $\bar{x}_1 - \bar{x}_2 \neq 0$ (Statistical difference in the model errors)
where $\bar{x}_1$ is the mean error of model M1, and $\bar{x}_2$ is the mean error of model M2.
Compute the test statistic t using the formula:
$t = \frac{\text{mean of the differences in error}}{(\text{std dev of the differences in error}) / \sqrt{\text{number of observations}}}$
14
EXAMPLE (the Calculations)
$t = \frac{\text{mean of the differences in error}}{(\text{std dev of the differences in error}) / \sqrt{\text{number of observations}}}$

M1      M2      Difference   (Difference − 6.45)^2
30.5    22.4       8.1            2.7225
32.2    14.5      17.7          126.5625
20.7    22.4      −1.7           66.4225
20.6    19.6       1.0           29.7025
31.0    20.7      10.3           14.8225
41.0    20.4      20.6          200.2225
27.7    22.1       5.6            0.7225
26.0    19.4       6.6            0.0225
21.5    16.2       5.3            1.3225
26.0    35.0      −9.0          238.7025

Average of the differences = 6.45
Average the squared deviations (68.1225) and take the square root to get Std Dev = 8.25
t = 6.45 / (8.25 / sqrt(10)) = 2.47
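The calculation on this slide can be checked with a few lines of Python. This is only a sketch: the tenth pair (26.0, 35.0) is read off the table on this slide, and the standard deviation averages the squared deviations over all 10 observations, as the slide does.

```python
import math

# Error rates per round, as listed in the example table.
m1 = [30.5, 32.2, 20.7, 20.6, 31.0, 41.0, 27.7, 26.0, 21.5, 26.0]
m2 = [22.4, 14.5, 22.4, 19.6, 20.7, 20.4, 22.1, 19.4, 16.2, 35.0]

k = len(m1)
diffs = [a - b for a, b in zip(m1, m2)]  # paired differences, round by round
mean_diff = sum(diffs) / k
# Slide convention: average the squared deviations over k, then take the root.
std_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / k)
t = mean_diff / (std_diff / math.sqrt(k))

print(f"average = {mean_diff:.2f}, std dev = {std_diff:.2f}, t = {t:.2f}")
```

Dividing by k−1 instead of k would give a somewhat smaller t (about 2.34); either value falls short of the 1% critical value of 3.25 for 9 degrees of freedom.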
15
Example: Table Lookup
Significance level 1% (0.01), so look up the t value for probability sig/2 = 0.005 with 9 degrees of freedom: z = 3.25.
Since −z ≤ t ≤ z, i.e. −3.25 ≤ 2.47 ≤ 3.25, we fail to reject the null hypothesis, i.e., the two models are not significantly different at a significance level of 0.01.
16
Estimating Confidence Intervals: t-test
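RECALL: If only one test set is available: pairwise comparison. The t-test computes the t-statistic with k−1 degrees of freedom:
$t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{\sqrt{var(M_1 - M_2)/k}}$
where $\overline{err}(M_1)$ and $\overline{err}(M_2)$ are the mean error rates of M1 and M2.
If two test sets are available: use the non-paired t-test, where
$t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{denominator}$
$denominator = \sqrt{\frac{var(M_1)}{k_1} + \frac{var(M_2)}{k_2}}$
and $k_1$ and $k_2$ are the numbers of cross-validation samples used for M1 and M2, respectively.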
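A sketch of the non-paired statistic is given below, with the denominator taken as sqrt(var(M1)/k1 + var(M2)/k2). The helper names are hypothetical, and the choice to divide each variance by its sample size (rather than size minus one) is an assumption made here for consistency with the worked paired example; the slide does not say which convention it uses.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    """Variance dividing by len(xs); an assumption, see the note above."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def unpaired_t_statistic(err_m1, err_m2):
    """Non-paired t-statistic; the two samples may differ in length."""
    k1, k2 = len(err_m1), len(err_m2)
    denominator = math.sqrt(variance(err_m1) / k1 + variance(err_m2) / k2)
    return (mean(err_m1) - mean(err_m2)) / denominator


# Made-up samples of different sizes (k1 = 2, k2 = 3).
t_unpaired = unpaired_t_statistic([1.0, 3.0], [0.0, 0.0, 3.0])
print(f"t = {t_unpaired:.4f}")
```

Because the samples are not matched round by round, the per-scheme variances replace the variance of the differences.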
17
In Other Words: Unpaired Observations
If the CV estimates are from different datasets, they are no longer paired (or maybe we have k estimates for one scheme, and j estimates for the other one).
Then we have to use an unpaired t-test with min(k, j) − 1 degrees of freedom.
The estimate of the variance of the difference of the means becomes:
$\frac{\sigma_x^2}{k} + \frac{\sigma_y^2}{j}$
where $\sigma_x^2$ and $\sigma_y^2$ are the variances of the estimates for the two schemes.
18
Dependent Estimates
We assumed that we have enough data to create several datasets of the desired size.
Need to re-use data if that's not the case, e.g. by running cross-validations with different randomizations on the same data.
Samples become dependent, and then insignificant differences can appear significant.
A heuristic test is the corrected resampled t-test:
  Assume we use the repeated hold-out method with k runs, with n1 instances for training and n2 for testing
  New test statistic is:
  $t = \frac{\bar{d}}{\sqrt{\left(\frac{1}{k} + \frac{n_2}{n_1}\right)\sigma_d^2}}$
  where $\bar{d}$ and $\sigma_d^2$ are the mean and variance of the k differences between the two schemes.
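The corrected resampled t-test can be sketched as follows, using the commonly cited form of the statistic, t = mean(d) / sqrt((1/k + n2/n1) * var(d)). The function name is hypothetical, and using the k−1 divisor for the variance of the differences is an assumption of this sketch.

```python
import math


def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected resampled t-statistic for k repeated hold-out runs.

    diffs   : per-run differences in error between the two schemes
    n_train : number of training instances per run (n1)
    n_test  : number of test instances per run (n2)
    Assumption: the variance of the differences uses the k - 1 divisor.
    """
    k = len(diffs)
    mean_d = sum(diffs) / k
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (k - 1)
    # The extra n_test / n_train term inflates the variance estimate to
    # compensate for the overlap between the resampled datasets.
    return mean_d / math.sqrt((1 / k + n_test / n_train) * var_d)


# Made-up differences from 3 hold-out runs with a 90/10 split.
t_corrected = corrected_resampled_t([1.0, 2.0, 3.0], n_train=90, n_test=10)
t_uncorrected = corrected_resampled_t([1.0, 2.0, 3.0], n_train=90, n_test=0)
print(f"corrected t = {t_corrected:.3f}, uncorrected t = {t_uncorrected:.3f}")
```

Setting `n_test` to 0 removes the correction and recovers the ordinary statistic, which makes it easy to see how much the correction shrinks t.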