Effect size reporting reveals the weakness Fisher believed inherent in the Neyman-Pearson approach to statistical analysis

Michael T. Bradley & A. Luke MacNeill, Department of Psychology, University of New Brunswick
Andrew Brand, Institute of Psychiatry, King's College London

Abstract

Effect sizes provide concrete support for Fisher's arguments against Neyman and Pearson's approach to statistical analysis. Neyman and Pearson approached statistics from an applied perspective and sampled from production lines. They could specify a probability that some batch was abnormal and then estimate the power of detecting that abnormality. They accepted p < .05 as the threshold for a Type I error, established the .10 level as the willingness to make a Type II error, and suggested that these two specifications belonged in scientific research. Fisher felt both specifications implied a precision defensible in routine production but not in scientific investigation. He worried that estimates based on a single experiment or a small set of experiments would be unstable. For Fisher, accepting a value at .05 only means that new or manipulated data do not fit a "model" of a null hypothesis with a specified mean and variance. The question for Fisher is "What is the correct model for such a result?" Neyman and Pearson, on slim data, act as if the model was specified before testing began. The recommended reporting of effect sizes compounds problems with the Neyman and Pearson approach, since the model is accepted as correct and effect size estimates are then treated as accurate without considering distributional factors, the number of potential attempts to test a hypothesis, or "file drawer" effects.

Conclusion

The APA compounds data analysis problems by endorsing the N-P approach and recommending the reporting of null hypothesis significance tests (NHST) in conjunction with effect size estimates, confidence intervals (CIs), and a clear description of the problem area. One cannot go wrong with a clear description of a problem area. As we have seen, however, the other recommendations are incompatible with each other. Fisher did not live to see the ultimate missteps in testing based upon the N-P approach, but if he could have predicted how influential N-P were to become, he might well have moved from the mild fury that characterized his attacks to apoplectic fits.

Effect Sizes and NHST

The use of effect sizes is incompatible with the Neyman and Pearson approach. NHST is not a precise measurement technique, whereas effect size estimates are meant to be point estimates. Effect sizes are based on the standard deviation, the typical deviation of scores from the mean. Theoretically, adding more participants to a study will not change the standard deviation, and so it will not have an impact on the effect size. NHST, in contrast, is based on the standard error. Adding more participants to a study decreases the standard error, which increases power and the chances of significance.

Confidence Intervals

Confidence intervals are based on an inferential approach and clearly belong to the NHST family. Like NHST, confidence intervals are calculated with the standard error and are subject to N. Since Ns can vary, this is analogous to using an elastic band as a ruler. Significance comes and goes, and worse, the only calculations of effect sizes come from misestimates of mean differences, variability, or a combination of both. An additional problem is that confidence intervals are often huge and exceed the magnitude of the effect size they are meant to bracket. If replication involves achieving a similar effect size, then it is difficult to achieve confirmatory results. At the same time, if replication involves fitting into the confidence interval, it is difficult to fail to replicate. This contradiction illustrates further issues with the N-P approach.
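The contrast drawn in the two sections above can be made concrete with a small simulation. The sketch below is not part of the original poster; the assumed population effect (d = 0.3), the sample sizes, and the variable names are arbitrary choices for illustration only. It computes Cohen's d, the standard error, a p value, and a 95% CI for two-group samples of increasing size: the effect size estimate hovers around the same value, while the standard error, the p value, and the CI width all shrink as N grows.

```python
# Minimal sketch (assumed values): SD-scaled effect size vs. SE-driven test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d = 0.3  # hypothetical population effect size

for n in (20, 80, 320, 1280):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(true_d, 1.0, n)
    diff = treated.mean() - control.mean()
    sd_pooled = np.sqrt((control.var(ddof=1) + treated.var(ddof=1)) / 2)
    d = diff / sd_pooled                          # effect size: scaled by the SD
    se = np.sqrt(control.var(ddof=1) / n + treated.var(ddof=1) / n)
    _, p = stats.ttest_ind(treated, control)      # NHST: driven by the SE
    ci_width = 2 * 1.96 * se                      # 95% CI width, also SE-based
    print(f"n per group={n:4d}  d={d:5.2f}  SE={se:5.3f}  p={p:7.4f}  CI width={ci_width:5.3f}")
```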
Background

Fisher felt inferential tests were the most primitive forms of measurement. In his view, p is an imprecise estimate that indicates, from an individual study, whether anything worth pursuing is present. If so, then superior design, measurement, and data analysis strategies can be pursued. Neyman and Pearson (N-P) approached statistics from an applied perspective. They sampled from production lines and could specify (1) the probability that some batch was abnormal, and (2) the power of detecting that abnormality. N-P suggested that this measure of precision could be obtained with inferential tests. Researchers either accept a null hypothesis (H0) or reject it in favor of an alternative hypothesis (HA). A Type I error (α) is the false rejection of H0, and a Type II error (β) is the false acceptance of H0. N-P accepted .05 as the threshold for a Type I error and established the .10 level as the willingness to make a Type II error. They suggested that these two specifications belonged in scientific research. Fisher felt both specifications implied a precision defensible in routine production but not in scientific investigation. He worried that estimates based on a single experiment or a small set of experiments would be unstable. For Fisher, accepting a value at .05 only means that the data do not fit a "model" of a null hypothesis with a specified mean and variance. The question for Fisher is "What is the correct model for such a result?" Neyman and Pearson, on slim data, act as if the model was specified before testing began.

Calculating an effect size only after a significant NHST can result in an overestimation of the effect size in a given research area. If studies are underpowered (i.e., there is a greater chance that a researcher will miss a real effect), then only effect sizes from studies that improbably achieve significance will be considered and published. Even when studies are appropriately powerful (90% in N-P terms), significance tests exclude 10% of effect size estimates from availability. In many areas of research, significance would appear to truncate the potential family of effect size estimates. If a hallmark of science is accuracy of measurement, then it is precluded in the N-P model.

Figure 1. Overlapping H0 and HA distributions, marking α, β, and power on either side of the reject / do-not-reject criterion. Even when a series of studies is appropriately powerful (90%), significance tests exclude 10% of effect size estimates from availability (β).

Figure 2. Confidence intervals for a control group, a significant manipulation, and a failed replication. If replication involves achieving a significant difference between groups, then it is difficult to achieve confirmatory results. If replication involves fitting the new value into a confidence interval, it is difficult to fail to replicate.
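The truncation argument above can likewise be illustrated with a simulation. The sketch below is not from the poster; the true effect (d = 0.3), the 30 participants per group, and the 10,000 simulated studies are assumed values chosen to give low power. Keeping only the studies that reach p < .05 and averaging their effect sizes shows the overestimation the poster describes for underpowered research.

```python
# Minimal sketch (assumed values): the significance filter inflates published effect sizes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_d, n_per_group, n_studies = 0.3, 30, 10_000

all_d, published = [], []
for _ in range(n_studies):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(true_d, 1.0, n_per_group)
    sd_pooled = np.sqrt((control.var(ddof=1) + treated.var(ddof=1)) / 2)
    d = (treated.mean() - control.mean()) / sd_pooled
    all_d.append(d)
    _, p = stats.ttest_ind(treated, control)
    if p < 0.05:                     # the significance filter / file drawer
        published.append(d)

print(f"true d                        = {true_d}")
print(f"mean d over all studies       = {np.mean(all_d):.2f}")
print(f"mean d over 'published' only  = {np.mean(published):.2f}")
print(f"proportion reaching p < .05   = {len(published) / n_studies:.2f}")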