A Comparison of Statistical Significance Tests for Information Retrieval Evaluation. CIKM '07, November 2007.

Summary
- Motivation
- Significance Testing
- General Approach
- Significance Tests
  - Randomization test, Wilcoxon test, Sign test, Bootstrap test, Student’s t test
- Results
- Discussion
- Conclusions

Motivation
- Goal: promote retrieval methods that truly are better, rather than methods that perform better by chance given the set of topics, judgments, and documents used in the evaluation.
- Given two information retrieval (IR) systems, how can we determine which one is better?
- Common evaluations such as TREC use the difference in Mean Average Precision (MAP). Problems? How can they be solved? Use significance tests!
- Which significance test should IR researchers use?
  - Student’s paired t-test? Wilcoxon signed-rank test? Sign test? Bootstrap? Fisher’s randomization test?

Significance Testing
1. A test statistic or criterion by which to judge the two systems. IR researchers commonly use the difference in mean average precision (MAP) or the difference in the mean of another IR metric.
2. A distribution of the test statistic given a null hypothesis. A typical null hypothesis is that there is no difference between the two systems.
3. A significance level (p-value), computed by taking the value of the test statistic for the experimental systems and determining how likely a value that large or larger could have occurred under the null hypothesis.
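
As an illustration of the first ingredient, the test statistic, here is a minimal sketch that computes the difference in MAP from per-topic average precision scores. The function name `map_difference` and the score values are hypothetical and are not taken from the slides.

```python
from statistics import mean


def map_difference(ap_a, ap_b):
    """Difference in mean average precision (MAP) between systems A and B.

    ap_a and ap_b hold per-topic average precision scores, aligned by topic.
    """
    if len(ap_a) != len(ap_b):
        raise ValueError("both systems must be scored on the same topics")
    return mean(ap_a) - mean(ap_b)


# Hypothetical per-topic average precision scores for five topics.
ap_a = [0.30, 0.50, 0.20, 0.70, 0.40]
ap_b = [0.25, 0.40, 0.30, 0.60, 0.20]
observed = map_difference(ap_a, ap_b)
print(f"MAP difference: {observed:.2f}")
```

Each significance test then asks how likely a difference at least this large would be if the two systems did not actually differ.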

General Approach

Randomization test p-value =
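
A minimal sketch of how a paired randomization test might be computed for two systems scored on the same topics. It assumes the test statistic is the difference in mean scores, approximates the permutation distribution by Monte Carlo sampling, and reports a two-sided p-value. The function name `randomization_test`, the default of 10,000 samples, and the fixed seed are illustrative assumptions, not the authors' implementation.

```python
import random
from statistics import mean


def randomization_test(scores_a, scores_b, n_samples=10000, seed=0):
    """Two-sided paired randomization test on the difference in means."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    extreme = 0
    for _ in range(n_samples):
        # Under the null hypothesis the system labels are interchangeable on
        # every topic; swapping them flips the sign of that topic's difference.
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(permuted)) >= observed:
            extreme += 1
    # Fraction of relabelings at least as extreme as the observed difference.
    return extreme / n_samples
```

Because the statistic is computed inside the loop, the same skeleton works for other criteria, such as a difference in medians, by replacing `mean`.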

Wilcoxon Test p-value =
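
A minimal sketch of a Wilcoxon signed-rank test on paired per-topic scores. It assumes a two-sided test and uses the large-sample normal approximation without tie or continuity corrections, so it is only a rough guide for small topic sets. The function name `wilcoxon_signed_rank` is hypothetical.

```python
from statistics import NormalDist


def wilcoxon_signed_rank(scores_a, scores_b):
    """Two-sided Wilcoxon signed-rank p-value (normal approximation)."""
    # Topics where the two systems score identically are dropped.
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank the absolute differences; tied values share the average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    # Sum of the ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4
    sd_w = (n * (n + 1) * (2 * n + 1) / 24) ** 0.5
    z = (w_plus - mean_w) / sd_w
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

Note that the statistic depends only on the ranks of the differences, not on their magnitudes, so it is not a test of the difference in means.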

Sign Test p-value = p-value =
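
A minimal sketch of an exact two-sided sign test on paired per-topic scores. It assumes that topics with identical scores are discarded and that, under the null hypothesis, each system is equally likely to win a topic. The function name `sign_test` is hypothetical.

```python
from math import comb


def sign_test(scores_a, scores_b):
    """Exact two-sided sign test p-value for paired scores."""
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b  # ties are ignored
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # Probability of k or fewer wins out of n under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the one-sided tail, capped at 1.
    return min(1.0, 2 * tail)


# System A wins all five topics: p = 2 * (1/32) = 0.0625.
print(sign_test([1, 2, 3, 4, 5], [0, 0, 0, 0, 0]))
```

Only the direction of each per-topic difference is used, so a tiny win counts the same as a large one.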

Bootstrap Test p-value =
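
A minimal sketch of a paired bootstrap test using the shift method. It assumes that topics are resampled with replacement, that the bootstrap distribution of the mean difference is shifted to have mean zero to represent the null hypothesis, and that the p-value is two-sided. The function name `bootstrap_shift_test`, the default of 10,000 samples, and the fixed seed are illustrative assumptions.

```python
import random
from statistics import mean


def bootstrap_shift_test(scores_a, scores_b, n_samples=10000, seed=0):
    """Two-sided paired bootstrap (shift method) p-value for the mean difference."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = mean(diffs)
    # Resample topics with replacement and record each sample's mean difference.
    boot_means = [
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_samples)
    ]
    # Shift the bootstrap distribution so that it is centered on zero.
    center = mean(boot_means)
    extreme = sum(abs(m - center) >= abs(observed) for m in boot_means)
    return extreme / n_samples
```

As with the randomization test, the statistic can be swapped for another criterion by replacing `mean` on the resampled differences.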

Student’s Paired t-test p-value =
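
A minimal sketch of a two-sided paired t-test using only the Python standard library. Since the standard library has no Student's t distribution, the tail area is obtained here by numerically integrating the t density with Simpson's rule; that numerical shortcut, the function names `t_pdf` and `paired_t_test`, and the step count are assumptions made for illustration.

```python
from math import exp, lgamma, pi, sqrt
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def paired_t_test(scores_a, scores_b, steps=10000):
    """Return (t statistic, two-sided p-value) for paired scores.

    steps must be even for Simpson's rule.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (sd / sqrt(n))
    df = n - 1
    # Simpson's rule for the area under the density between 0 and |t|.
    upper = abs(t)
    h = upper / steps
    area = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    # Both tails beyond |t|: 1 minus twice the central half-area.
    return t, max(0.0, 1 - 2 * area)
```

The statistic is tied to the mean of the per-topic differences, so this test cannot be repurposed for a median or another criterion.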

Results

Discussion
- Sign and Wilcoxon tests:
  - These tests should not be used because they test criteria that do not match the criterion of interest.
- Randomization and Bootstrap tests:
  - These tests can use whatever criterion we specify, while the other tests are fixed in their test statistics.
- Bootstrap test and Student’s t test:
  - Both assume that the scores from the two IR systems are random samples from a single population. However, test topics are not random samples from the population of topics; they are hand-selected to meet various criteria.
- Student’s t test:
  - This test can only be used for the difference between means, not for the median or other test statistics.
  - At smaller sample sizes, violations of normality may result in errors in the t-test.

Conclusion
- The Randomization test is the recommended test for comparing two IR systems.
- The Wilcoxon Signed-Rank test and the Sign test should no longer be used in this context.
- The Randomization test, the Bootstrap shift method test, and Student’s t test all produced comparable significance values => there is no practical difference between them!
- The Wilcoxon Signed-Rank test and the Sign test both produced very different p-values => they can incorrectly predict significance and can fail to detect significant results.