The t Test for a Single Group Mean (Part 3): Effect Size


Psych 5500/6500
The t Test for a Single Group Mean (Part 3): Effect Size
Fall, 2008

Effect Size In the t test for a single group, the ‘effect size’ is the difference between the actual value of the population mean (the population from which the sample was drawn) and the value of the population mean proposed by H0. For example, suppose math scores in some class have traditionally had a mean of 50, and a new teaching program is tested to see if it changes the math scores. H0 can be written as μH0 = 50 (no effect due to the new program). But say that the new program actually leads to an improvement in math scores such that now μY = 55. The effect of the new program was to raise the scores by 5: μY − μH0 = 5

p Values Until fairly recently, the report of the statistical analysis of experiments in psychology focused primarily on the ‘p values’ that were obtained. In the example of the math scores and the effect of the new teaching program, the results of the analysis would be presented as t(23) = 2.45, p = .02; from this it can be concluded that the effect of the new teaching program was statistically significant (H0 is rejected). With p values the focus is on whether or not the effect is statistically significant, with a p value of .051 (don’t reject H0) being treated as fundamentally different from a p value of .049 (reject H0).

p Values and Effect Size p values simply tell us whether the effect was statistically significant (i.e., unlikely to have occurred due to chance alone). p values are a poor indication of the size of the effect in an experiment, as the value of p is influenced by a variety of things, including the effect size, the size of the sample, and the variance of the populations. The trend in psychology is to report both the size of the effect and its p value.

“Until quite recently in the history of experimental psychology, when researchers spoke of ‘the results of a study,’ they almost invariably were referring to whether they had been able to ‘reject the null hypothesis,’ that is, to whether the p values of their tests of significance were .05 or less. Spurred on by a spirited debate over the failings and limitations of the ‘accept/reject’ rhetoric of the paradigm of null hypothesis testing, the American Psychological Association (APA) created a task force to review journal practices and to propose guidelines for the reporting of statistical results. Among the ensuing recommendations were that effect sizes...be reported.” (Rosnow & Rosenthal, 2003, p. 221)

“It is no longer considered sufficient to ask of an effect or relationship: ‘Is it there?’ It is increasingly considered essential to also ask ‘How much is there?’ and sometimes even ‘Is it enough to care?’” (McGrath & Meyer, 2006, p. 386)

Measures of Effect Size There are many ways of measuring and reporting effect size, and various authors provide various ways of clumping these approaches into categories. We will consider three categories:
1) Simply reporting ‘raw’ effect size
2) Standardized effect size
3) Strength of association

1) Simply Reporting ‘Raw’ Effect Size If the measures are easily comprehensible then you can simply state the effect size. In our math example you can report that the expected value of μ for math scores given H0 was 50 while the estimate of μ given your sample was 55, or you could simply state that the mean math score in the sample was 5 points greater than what was predicted by H0. Belying the notion that if it isn’t complicated it can’t be good, this is actually the approach favored by the APA.

2) Standardized Effect Size If the measure is something that is hard to grasp (e.g., inverse reaction times, where an effect of ‘0.2’ would be hard to intuitively understand), or if you want to do a meta-analysis (comparing results across several studies), then a standardized effect size may be more useful. In a standardized effect size you are turning the effect size into something that is similar to a standard score.

Cohen’s d (population) This formula is for computing the actual effect size as it occurs in the population from which we sampled: d = (μY − μH0) / σY. The difference between the mean proposed by H0 and the actual mean of the population is divided by the standard deviation of the population from which the sample was drawn.

Example Say that in our example the actual mean score of the population of students taught using the new method was 56 (slightly higher than the sample mean we happened to get) with a population standard deviation of 7.3. The formula turns the difference of 6 between H0 and the actual population mean into a standard score of 0.82. Is that a big effect or a small effect? We will cover that in a minute.
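The arithmetic on this slide can be checked directly, using the values given in the example:

```python
# Cohen's d for the population: the true effect divided by the
# population standard deviation (values from the slide's example).
mu_Y = 56      # actual mean of the population taught with the new method
mu_H0 = 50     # mean proposed by H0
sigma_Y = 7.3  # population standard deviation

d = (mu_Y - mu_H0) / sigma_Y
print(round(d, 2))  # → 0.82
```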

Cohen’s d (sample) This formula is for computing the effect size as it occurs in the sample: d = (Ȳ − μH0) / SY. The difference between the mean proposed by H0 and the mean of the sample is divided by the standard deviation of the sample.

Cohen’s d (sample) Alternative Formula This is an easy way to compute d if you have the tobtained value and the df from the t test for a single group mean: d = tobtained / √df. It has the disadvantage of not making it clear what d is actually measuring (i.e., the standardized effect size).
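A quick sketch confirming that the two routes to Cohen’s d (sample) agree. The scores below are hypothetical, invented for illustration; note that Python’s statistics.pstdev uses n in the denominator (the SY of this course) while statistics.stdev uses n − 1 (est.σY):

```python
import math
import statistics

# A hypothetical sample of math scores (illustration only; not from the lecture).
scores = [52, 57, 61, 48, 55, 59, 54, 58]
mu_H0 = 50
n = len(scores)
mean = statistics.fmean(scores)

# Direct route: divide by S, the standard deviation of the sample
# (n in the denominator -- statistics.pstdev).
S = statistics.pstdev(scores)
d_direct = (mean - mu_H0) / S

# Alternative route: d = t_obtained / sqrt(df), where t uses est. sigma
# (n - 1 in the denominator -- statistics.stdev).
est_sigma = statistics.stdev(scores)
t_obt = (mean - mu_H0) / (est_sigma / math.sqrt(n))
d_alt = t_obt / math.sqrt(n - 1)

print(abs(d_direct - d_alt) < 1e-9)  # → True: the two routes agree
```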

Hedges’s g (estimate of the effect size in the population) This formula uses the data from the sample to estimate the effect size in the population: g = (Ȳ − μH0) / est.σY.

Hedges’s g and Cohen’s d For large samples the difference between est.σY and SY (the estimate of the population std dev and the std dev of the sample) will be quite small, and thus the values of Hedges’s g and Cohen’s d will be quite close.
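How quickly do d and g converge? Because d and g share the same numerator and SY = est.σY · √((n − 1)/n), their ratio is √(n/(n − 1)), which heads to 1 as n grows. A minimal sketch:

```python
import math

# d and g differ only in the denominator (S vs. est. sigma), so their
# ratio is sqrt(n / (n - 1)), which approaches 1 as n grows.
for n in (10, 100, 1000):
    ratio = math.sqrt(n / (n - 1))
    print(n, round(ratio, 4))
```

Even at n = 10 the two measures differ by only about 5%; by n = 1000 the difference is negligible.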

Interpreting ‘d’ Cohen proposed a simple way to evaluate the size of an effect based upon the value of ‘d’ (and as there is only a small difference between ‘d’ and ‘g’ it can apply to Hedges’s g as well). Note: take the absolute value of d; whether it is negative or positive is irrelevant to the strength of the effect.
|d| = .2 a ‘small’ effect
|d| = .5 a ‘medium’ effect
|d| = .8 a ‘large’ effect

Interpreting ‘d’ (cont.) Where did this come from? According to Cohen an effect size of d=.5 (a ‘medium’ effect size) is usually noticeable to someone looking at graphs of the data. Subsequent surveys of the literature have found that the average size of effects reported in various fields is approximately equal to a d of .5. A small effect (.2) is smaller than that but still not too trivial, and a large effect (.8) is the same distance above a medium effect as a small effect is below it.

Interpreting ‘d’ (cont.) Cohen offered these criteria with some misgivings. His goal was to make the value of ‘d’ more meaningful, but he was worried that people would take them too seriously (he was right). These criteria are fairly arbitrary and are based upon the size of the effect viewed purely through the lens of statistics. A ‘small’ effect might still be of great theoretical interest, and a ‘small’ effect in the field of medicine might lead to saving tens of thousands of lives (giving it great social or pragmatic interest). A ‘large’ effect might be of little theoretical or practical significance.

Interpreting ‘d’ (cont.) The real value of Cohen’s effect size values (small, medium, and large) will be seen when we discuss ‘power’. When computing the possible power of an experiment that you are designing, you need to guess what the effect size will be. Cohen’s criteria provide one way to help you guess. If you anticipate that the effect you will be looking at will be small, then plug in a value of .2 for d, etc. We will take a look at this later.

More on Standardized Effect Sizes 1) These formulas are for the context of testing a single mean versus what is predicted by H0; different forms of the formula are necessary for other experimental designs (we will cover these later).

More on Standardized Effect Sizes 2) Beware that there is a bewildering lack of consistency in the literature on how to compute Cohen’s d. Often the formula for finding the effect size in the population will be given, followed by an example where the mean and standard deviation of the sample are plugged into the formula (under the assumption, I assume, that by not generalizing to a larger population we are treating the sample as our population of interest). One of the reasons I like the way I use the symbols in this class is the way in which it makes it easy to discriminate exactly what is being accomplished by the various formulas.

More on Standardized Effect Sizes 3) What does SPSS provide? None of these. SPSS will provide the mean of Y and the ‘standard deviation’ of Y (which is actually est.σY), making it a simple process to calculate Hedges’s g. If you want to calculate Cohen’s d (for the sample) you can either translate est.σY into SY using the formula given earlier this semester (provided again below) or you can use the formula for computing ‘d’ from ‘t’ (as SPSS will give you both tobt and df): SY = est.σY · √((n − 1)/n)
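Putting the slide’s recipe together: starting from the kind of summary numbers SPSS reports (mean and est.σY), compute Hedges’s g directly, then convert est.σY to SY for Cohen’s d. The summary values below are hypothetical, invented for illustration:

```python
import math

# Hypothetical summary output of the kind SPSS reports; the
# "Std. Deviation" SPSS prints is est. sigma (n - 1 denominator).
mean_Y = 55.0
est_sigma = 7.5
n = 24
mu_H0 = 50

# Hedges's g comes straight from the reported numbers.
g = (mean_Y - mu_H0) / est_sigma

# For Cohen's d (sample), first convert est. sigma to S.
S = est_sigma * math.sqrt((n - 1) / n)
d = (mean_Y - mu_H0) / S

print(round(g, 3), round(d, 3))  # → 0.667 0.681
```

As the earlier slide noted, with a reasonably large n the two values are already close.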

More on Standardized Effect Sizes 4) Advantages of using standardized effect sizes: If the effect size involves some metric that is hard to conceptualize (e.g., an effect size of −0.2 in a measure of inverse reaction times), then turning it into a standard score will help. Cohen’s criteria for what constitutes a small, medium, and large effect size can give the standardized effect size some level of meaning.

More on Standardized Effect Sizes Advantages of using standardized effect sizes: Standardizing the effect size makes it easier to do meta-analysis (where you compare the effect size of several different studies) particularly when the studies are examining the same topic but with different measures. By translating the effect sizes found in all of the studies into standardized effect sizes you turn them into essentially the same metric so that they can be directly compared.

Example Say one study used inches to measure the variable ‘length’ and found an effect size of 24 inches. Say another study measured exactly the same subjects but used feet to measure length and found an effect size of 2 feet. The standard deviation of scores in the first study was 15; that would make the standard deviation of the second study 1.25 (i.e., 15/12). While the first study had an effect size of 24 (inches) and the second study had an effect size of 2 (feet), we can see by computing the d’s that they found the same effect.
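Computing the two d’s from the slide’s numbers shows the units cancel out:

```python
# Two studies of the same subjects, different units (from the slide).
raw_effect_in, sd_in = 24, 15      # study 1: measured in inches
raw_effect_ft, sd_ft = 2, 15 / 12  # study 2: measured in feet (SD = 1.25)

d1 = raw_effect_in / sd_in
d2 = raw_effect_ft / sd_ft
print(d1, d2)  # → 1.6 1.6 -- the same standardized effect size
```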

Example (cont.) While it is obvious in the example that the effect size should be the same, let’s apply the idea to a more realistic scenario. In this scenario ‘Study A’ measures intelligence using one IQ test and finds an effect size of 4. ‘Study B’ measures intelligence using a different IQ test and finds an effect size of 6. The two IQ tests have different means and different variances, and it is hard to know how the two effect sizes really compare, but if we change each effect size to a standardized effect size we can compare them directly.

More on Standardized Effect Sizes 5) Disadvantages of using standardized effect sizes: The problem with standard scores is that they take you away from the units of measure that you used in the study. It might be more useful to know that the fertilizer increased growth rate by 24 inches a year than to know that d=0.3.

More on Standardized Effect Sizes 5) Disadvantages of using standardized effect sizes: Standardized effect sizes bring the standard deviation of the scores into the expression of effect size, which in some cases can hide the pure understanding of the effect. Say that the math teaching method raised the scores of students on the average by 5, and these students were a pretty varied lot (differed a lot in terms of math ability). Now say that in another study the teaching method raised scores again by 5, but in a class where the students were similar in math ability. Even though the teaching method had the same effect in both classes the values of d would differ in the two studies (as the denominator would differ in the d formulas).
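The disadvantage above is easy to see with numbers; the standard deviations here are invented for illustration (the lecture gives only the 5-point raw effect):

```python
# Same 5-point raw improvement, different spread of math ability
# (the standard deviations are illustrative, not from the lecture).
raw_effect = 5
sd_varied = 10     # heterogeneous class: students differ a lot
sd_similar = 2     # homogeneous class: students are similar

d_varied = raw_effect / sd_varied    # a "medium" effect by Cohen's criteria
d_similar = raw_effect / sd_similar  # a very "large" effect
print(d_varied, d_similar)  # → 0.5 2.5
```

Same raw effect, very different d’s, purely because the denominators differ.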

More on Standardized Effect Sizes Which to use, ‘g’ or ‘d’? ‘d’ gets more press, ‘g’ seems to be of more interest (to me), you can use either, and with any kind of large N they will be very close in value. If you want to compare your study to other similar studies, see which one most of them use so you can more easily compare.

3) Strength of Association This category of effect size measures is also called ‘correlation’ or ‘amount of variance accounted for’. Everything we do next semester will automatically crank these out and in that context they will be quite understandable. In the context of what we are doing this semester (ANOVA) standardized measures (such as ‘d’) are often used, consequently we will hold off discussion of ‘Strength of Association’ measures of effect size until next semester.