Presentation on theme: "On Comparing Classifiers: Pitfalls to Avoid and Recommended Approach" (presentation transcript)
1 On Comparing Classifiers: Pitfalls to Avoid and Recommended Approach. Published by Steven L. Salzberg, Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA. Presenter: Jiyeon Kim (April 14th, 2014)
2 Introduction. How does a researcher choose which classification algorithm to use for a new problem? Comparing the effectiveness of different algorithms on public databases: opportunity or danger? Are the many comparisons that rely on widely shared datasets statistically valid?
5 1/1 Paired T-Test. Used to determine whether two paired sets of measurements differ from each other in a significant way. It assumes that the paired differences are independent and identically normally distributed.
6 1/2 Hypothesis Testing. Null hypothesis (H0) vs. alternative hypothesis (H1). Reject the null hypothesis (H0) if the p-value is less than the significance level. e.g., in the case of the paired t-test, H0: there is no difference between the two populations; H1: there is a statistically significant difference.
7 1/3 Significance Level, α. The percentage of the time in which the experimenters make an error. Usually the significance level is chosen to be 0.05 (equivalently, 5%). It is a fixed probability of wrongly rejecting the null hypothesis H0 when it is in fact true (= P(type I error)).
8 1/4 P-Value. The probability of obtaining a test statistic at least as extreme as the one actually observed, assuming that the null hypothesis is true. Reject the null hypothesis (H0) when the p-value turns out to be less than a certain significance level, often 0.05.
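The following is a minimal Python sketch (not part of the original slides) of how slides 5 through 8 fit together: a paired t-test on hypothetical per-fold accuracies of two classifiers, with the decision made by comparing the p-value to α = 0.05. The accuracy values and the use of SciPy's ttest_rel are illustrative assumptions.

```python
# A minimal paired t-test sketch on hypothetical per-fold accuracies.
from scipy import stats

# Hypothetical accuracies of algorithms A and B on the same 10 cross-validation folds
acc_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.79]
acc_b = [0.78, 0.80, 0.81, 0.77, 0.82, 0.76, 0.80, 0.83, 0.78, 0.77]

alpha = 0.05                                     # significance level
t_stat, p_value = stats.ttest_rel(acc_a, acc_b)  # paired (related-samples) t-test

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the two classifiers differ significantly.")
else:
    print("Fail to reject H0: no significant difference detected.")
```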
9 2 Comparing Algorithms. Empirical validation of classification research has serious experimental deficiencies. Be careful when concluding that a new method is significantly better on well-studied datasets.
10 3 Statistical Validity < Multiplicity Effect >. e.g., assume that you do 154 experiments (two-tailed, paired t-tests) at significance level 0.05. You then have 154 chances to obtain a significant result, and the expected number of "significant" results arising purely by chance is 154 * 0.05 = 7.7. Nearly 8 of your "significant" findings are expected to be accidents of chance!
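As a rough illustration of the multiplicity effect (not in the original slides), the simulation below runs 154 paired t-tests on data where the null hypothesis is true by construction and counts how many come out "significant" at 0.05; on average about 7.7 of them do. The use of NumPy/SciPy and every number other than 154 and 0.05 are assumptions made for the example.

```python
# Simulate 154 paired t-tests under a true null hypothesis and count
# how many appear "significant" at alpha = 0.05 purely by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_experiments, n_pairs = 0.05, 154, 30

false_positives = 0
for _ in range(n_experiments):
    # Both "algorithms" draw from the same distribution, so H0 really holds
    a = rng.normal(0.80, 0.02, size=n_pairs)
    b = rng.normal(0.80, 0.02, size=n_pairs)
    if stats.ttest_rel(a, b).pvalue < alpha:
        false_positives += 1

print(f"{false_positives} spurious 'significant' results "
      f"(expected about {n_experiments * alpha:.1f})")
```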
11 3 Statistical Validity < Bonferroni Adjustment >. ① Let α* be the error rate of each individual experiment. ② Then (1 - α*) is the chance of reaching the right conclusion in that experiment. ③ If we conduct n independent experiments, the chance of getting them all right is (1 - α*)ⁿ. ④ So the chance that we will make at least one mistake is α = 1 - (1 - α*)ⁿ.
12 3 Statistical Validity < Bonferroni Adjustment >. e.g. (this is NOT correct usage!) Assume again that you do 154 experiments (two-tailed, paired t-tests) at significance level 0.05. The significance level for each experiment is α* = 0.05, so the right-conclusion rate is (1 - α*) = (1 - 0.05) = 0.95. The chance of getting them all right is (0.95)^154 ≈ 0.0004, so the significance level for the whole family of experiments is 1 - (1 - α*)ⁿ = 1 - (1 - 0.05)^154 ≈ 0.9996. Now you have a 99.96% error rate!
13 3 Statistical Validity < Bonferroni Adjustment >. ⇒ "Then, what should we do?" e.g. (this IS correct usage!) Require α = 1 - (1 - α*)^154 ≤ 0.05 in order to obtain results significant at the 0.05 level across 154 results. This gives α* ≤ 0.0003, which is far more stringent than the original significance level of 0.05! (A numerical sketch of both calculations follows below.)
14 3 Statistical Validity < Bonferroni Adjustment >. / CAVEAT / This argument is very rough because it assumes that all the experiments are independent of one another!
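The snippet below is a small sketch (not from the slides) of the two Bonferroni calculations above: the family-wise error rate obtained by reusing α* = 0.05 across 154 tests, and the stringent per-test level α* needed to keep the family-wise error at 0.05.

```python
# Bonferroni-style calculations for n independent tests.
n = 154
alpha_family = 0.05

# Incorrect usage: per-test level 0.05 reused for all 154 tests
family_error = 1 - (1 - 0.05) ** n
print(f"Family-wise error with alpha* = 0.05: {family_error:.4f}")  # ~0.9996

# Correct usage: solve 1 - (1 - alpha*)**n <= 0.05 for alpha*
alpha_star = 1 - (1 - alpha_family) ** (1 / n)
print(f"Required per-test level alpha*: {alpha_star:.6f}")          # ~0.000333
```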
15 3/1 Statistical Validity / Alternative Statistical Tests. Recommended tests: simple binomial test; ANOVA (analysis of variance) with Bonferroni adjustment.
16 3/1 Statistical Validity / Alternative Statistical Tests. To compare two algorithms (A & B), a comparison must consider four numbers: ① the number of examples that A got right and B got wrong (A > B); ② the number of examples that B got right and A got wrong (B > A); ③ the number that both algorithms got right; ④ the number that both algorithms got wrong.
17 3/1 Statistical Validity / Alternative Statistical Tests. Of these, only ① (A right, B wrong) and ② (B right, A wrong) distinguish the two algorithms. Comparing these two counts directly, throwing out the ties, is the simple but much improved way: the binomial test! (See the sketch below.)
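Below is a minimal sketch (an assumption, not from the slides) of the simple binomial test: only the examples on which the algorithms disagree are kept, and we test whether A wins more often than chance would predict. The counts are made up, and scipy.stats.binomtest (SciPy ≥ 1.7) is assumed to be available.

```python
# Simple binomial (sign) test on the examples where the two algorithms disagree.
from scipy import stats

n_a_wins = 30   # hypothetical: examples A got right and B got wrong
n_b_wins = 15   # hypothetical: examples B got right and A got wrong
# ties (both right or both wrong) are thrown out

result = stats.binomtest(n_a_wins, n=n_a_wins + n_b_wins, p=0.5)
print(f"p = {result.pvalue:.4f}")
if result.pvalue < 0.05:
    print("Reject H0: A and B perform differently on the examples they disagree on.")
else:
    print("Fail to reject H0.")
```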
18 3/2 Statistical Validity / Community Experiments. Even when strict significance criteria and the appropriate significance tests are used, some reported results will still be mere "accidents of chance". The most helpful way to deal with this phenomenon is replication, i.e., duplication of the experiments!
19 3/3 Statistical Validity / Repeated Tuning. Algorithms are often tuned repeatedly on the same datasets. Whenever such tuning takes place, every adjustment should be considered a separate experiment. e.g., if 10 "tuning" experiments were attempted, the significance level for each should be about 0.005 instead of 0.05.
20 3/3 Statistical Validity / Repeated Tuning < Recommended Approach >. To establish the new algorithm's comparative merits: ① choose another algorithm, the one most similar to the new one, to include in the comparison; ② choose a benchmark data set that illustrates the strengths of the new algorithm; ③ divide the data set into k subsets for cross-validation; ④ run the cross-validation (next slide); ⑤ to compare the algorithms, use the appropriate statistical test.
21 3/3 Statistical Validity / Repeated Tuning < Cross-Validation >. (A) For each of the k subsets of the data set D, create a training set T = D minus that subset. (B) Divide each training set into two smaller subsets, T1 and T2; T1 will be used for training and T2 for tuning. (C) Once the parameters are optimized, re-run training on the larger set T. (D) Finally, measure accuracy on the held-out subset. (E) Overall accuracy is averaged across all k partitions; these k values also give an estimate of the variance of the algorithms. (A sketch of this procedure follows below.)
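The following rough sketch (not from the slides) follows steps (A) through (E) above. The dataset, the classifier (a decision tree), and the single tuned parameter (max_depth) are placeholder assumptions; scikit-learn is assumed to be available.

```python
# Cross-validation with a separate tuning subset inside each training fold.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
accuracies = []

for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    # (A) training set T = D minus the held-out fold
    X_T, y_T = X[train_idx], y[train_idx]
    # (B) split T into T1 (training) and T2 (tuning)
    X_T1, X_T2, y_T1, y_T2 = train_test_split(X_T, y_T, test_size=0.25, random_state=0)
    # tune a single parameter by evaluating candidate values on T2
    best_depth, best_acc = None, -1.0
    for depth in (2, 4, 8, None):
        clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_T1, y_T1)
        acc = clf.score(X_T2, y_T2)
        if acc > best_acc:
            best_depth, best_acc = depth, acc
    # (C) re-run training on the larger set T with the tuned parameter
    clf = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_T, y_T)
    # (D) measure accuracy on the held-out fold
    accuracies.append(clf.score(X[test_idx], y[test_idx]))

# (E) average accuracy across the k folds, plus an estimate of its variance
print(f"mean accuracy = {np.mean(accuracies):.3f}, variance = {np.var(accuracies):.5f}")
```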
22 4 Conclusions. No single technique is likely to work best on all databases. Empirical comparisons should be done to validate algorithms, but such studies must be conducted very carefully! Comparative work should be done in a statistically acceptable framework. The points above are intended to help experimental researchers steer clear of problems when designing a comparative study.
23 Exam Questions. Q1) Why should we apply the Bonferroni adjustment when comparing classifiers?
24 Exam Questions. A1) With multiple tests, the multiplicity effect occurs if we use the same significance level for each individual test as for the whole family of tests. So we need a more stringent level for each experiment, obtained via the Bonferroni adjustment.
25 Exam Questions. Q2) Assume that you will do 10 experiments comparing two classification algorithms. Using the Bonferroni adjustment, determine the criterion for α* (the significance level for each experiment) in order to get results that are truly significant at the 0.01 level across the 10 tests.
27 Exam Questions. Q3) Specify the difference between the paired t-test and the simple binomial test when comparing two algorithms.
28 Exam Questions. A3) Paired t-test: determines whether a significant difference between the two algorithms' paired results exists or not. Binomial test: compares the number of times algorithm A beats algorithm B against the number of times B beats A, throwing out the ties.