# Effect Sizes and Power Review

Statistical Power
Statistical power is the probability of detecting an effect of a particular size. Specifically, it is 1 minus the type II error rate, that is, the probability of rejecting the null hypothesis when it is false. It is a function of the type I error rate, the sample size, and the effect size. Its utility lies in helping us determine the sample size needed to detect an effect of a given magnitude.
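To make the dependence on type I error rate, sample size, and effect size concrete, here is a minimal sketch. It assumes a two-sided one-sample test and uses a normal approximation rather than the exact t-based calculation; the function name `approx_power` is only illustrative.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n, alpha=0.05):
    """Approximate two-sided power for a one-sample test (normal approximation).

    d is the standardized effect size, n the sample size, alpha the type I error rate.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Where the test statistic is centered if the null hypothesis is false
    shift = d * sqrt(n)
    # Probability that the statistic falls in either rejection region
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# Power rises with n for a fixed effect size and alpha
for n in (20, 50, 100):
    print(n, round(approx_power(0.5, n), 3))
```

Raising the effect size or alpha raises power in the same way; with d = 0 the function returns alpha itself, as it should.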

Two kinds of power analysis
- A priori: used when planning your study. What sample size is needed to obtain a certain level of power?
- Post hoc: used when evaluating a study. What chance did you have of obtaining significant results? This is not really useful. If you do the power analysis beforehand and conduct your study accordingly, then you did what you could. To say afterward, “I would have found a difference but didn’t have enough power” isn’t going to impress anyone.

A priori power
We can use the relationship among n, d, and δ (the noncentrality parameter, i.e. what the sampling distribution is centered on if H0 is false), plus our specified α, to calculate how many subjects we need to run:
1. Decide on your α level.
2. Decide on an acceptable level of power (equivalently, type II error rate).
3. Figure out the effect size you are looking for.
4. Calculate n.

A priori Effect Size?
Figure out an effect size before I run my experiment? There are several ways to do this:
- Base it on substantive knowledge: what you know about the situation and the scale of measurement.
- Base it on previous research.
- Use conventions.

An acceptable level of power?
Why not set power at .99? Practicalities. Howell shows that for a one-sample t test and an effect size d of 0.33:
- Power = .80 requires n = 72
- Power = .95 requires n = 119
- Power = .99 requires n = 162

The cost of increasing power (usually done by increasing n) can be high.
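The growth in n can be sketched with a normal approximation. This is not Howell's table lookup: the helper below (`approx_n_one_sample`, a hypothetical name) takes the required shift as the sum of two normal quantiles and solves for n, so its results land close to the figures above for power of .80 and .95 and somewhat higher for .99, possibly because tabled values are rounded.

```python
from statistics import NormalDist


def approx_n_one_sample(d, power, alpha=0.05):
    """Approximate n for a two-sided one-sample test (normal approximation).

    Returns an unrounded value; in practice, round up to the next whole subject.
    """
    z = NormalDist()
    # Shift needed to reach the target power at the given alpha
    required_shift = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # For a one-sample test the shift is d * sqrt(n), so n = (shift / d) ** 2
    return (required_shift / d) ** 2


for target in (0.80, 0.95, 0.99):
    print(target, round(approx_n_one_sample(0.33, target), 1))
```

Moving from .80 to .99 more than doubles the required sample under this approximation, which is the practical cost in question.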

Howell’s general rule: look for big effects or use big samples
You may now start to understand how little power many studies in psychology have, considering they are often looking for small effects. Many seem to think that if they use the central limit theorem rule of thumb (n = 30), which doesn’t even hold that often, the power problem is solved too. This is clearly not the case.

Post hoc power: the power of the actual study
If you fail to reject the null hypothesis, you might want to know what chance you had of finding a significant result (defending the failure). As many point out, this is a little dubious. One thing we can understand about the power of a particular study is that it can be affected by a number of issues, such as:
- Reliability of measurement. An increase in reliability can actually result in power increasing or decreasing, as we will see later, though here I stress the decrease due to unreliable measures.
- Outliers
- Skewness
- Unequal N for group comparisons
- The analysis chosen

Something to consider
Doing a sample size calculation is nice in that it gives a sense of what to shoot for, but rarely if ever do the data or circumstances bear out such that it provides a perfect estimate for our needs. Mike’s sample size calculation for all studies: the sample size needed is the largest N you can obtain based on practical considerations (e.g. time, money). Also, even the useful form of power analysis (for sample size calculation) has statistical significance as its focus. While it gives you something to shoot for, our real interest is the effect size itself and how comfortable we are with its estimation. Emphasizing effect size over statistical significance in a sense de-emphasizes the power problem.

Always a relationship
We commonly define the null hypothesis as ‘no difference’ or ‘no relationship’. There is always a non-zero relationship (to some decimal place) in sample data. As such, obtaining statistical significance can be seen as just a matter of sample size. Furthermore, the importance and magnitude of an effect are not reflected in the p value, because of the role sample size plays in the probability value attained.
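A small sketch of the "just a matter of sample size" point: hold a tiny correlation fixed and let n grow. The helper `approx_p_for_r` is hypothetical and uses a normal approximation to the t distribution, which is reasonable at these sample sizes but is not an exact test.

```python
from math import sqrt
from statistics import NormalDist


def approx_p_for_r(r, n):
    """Approximate two-sided p value for a sample correlation r based on n cases."""
    # Usual t statistic for testing a correlation against zero
    t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
    # Normal approximation to the t distribution (adequate for large n)
    return 2 * (1 - NormalDist().cdf(abs(t)))


# The same r = .05 moves from clearly nonsignificant to highly significant
for n in (100, 1_000, 10_000):
    print(n, approx_p_for_r(0.05, n))
```

The effect size never changes across the three runs; only the p value does.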

What should we be doing?
- Make sure we have looked hard enough for the difference: power analysis.
- Figure out how big the thing we are looking for is: effect size.

Effect Size
There are different ways to speak about the relationship between variables, but in general effect size refers to practical, rather than statistical, significance. This is what we are really interested in. No one cares about the statistical particulars if the effect is real and will change the way we think about things and how we act. However, the effect size, like our other measures, varies from sample to sample; i.e. if we ran a study 5 times, we would get 5 different effect sizes. So while we are primarily interested in effect size, we need to be cautious in our interpretation there too, and use other available evidence to come to our final conclusions.

Calculating effect size
Different statistical tests have different effect sizes developed for them. However, the general principle is the same: effect size refers to the magnitude of the impact of the independent variable (factor) on the outcome variable.

d family and r family
- d family: focused on standardized mean differences. Allows comparison across samples and variables with differing variance. Equivalent to z scores. Note that sometimes there is no need to standardize (the units of the scale have inherent meaning).
- r family: variance accounted for. The amount of variance explained relative to the total.

Example: Cohen’s d – Differences Between Means
Used with the independent samples t test. Cohen initially suggested that one could use either sample’s standard deviation, since both should be equal in the population according to our assumptions. In practice, people now use the pooled variance. Variations exist for control group settings, dependent samples, more than two groups, and so on, but the notion of a standardized mean difference is the same.
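A minimal sketch of the pooled-variance version for two independent samples follows. The function name and the toy scores are illustrative only, not data from the presentation.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group1, group2):
    """Standardized mean difference using the pooled within-group variance."""
    n1, n2 = len(group1), len(group2)
    # Weight each sample variance by its degrees of freedom
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_var)


treatment = [2, 4, 6]  # made-up scores
control = [1, 3, 5]    # made-up scores
print(cohens_d(treatment, control))  # 0.5: means differ by 1, pooled SD is 2
```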

Cohen’s d – Differences Between Means
Relationship to t. Relationship to r_pb (the point-biserial correlation): p and q are the proportions of the total sample that each group makes up. With equal groups, p = .5 and q = .5.
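The formulas for this slide are not in the transcript. As an assumption about what was intended, the commonly cited forms for two independent groups of sizes n_1 and n_2 are given below; the second is a large-sample approximation that ignores the difference between N and N − 2.

$$
d = t\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, \qquad
r_{pb} = \frac{d}{\sqrt{d^2 + \dfrac{1}{pq}}}
$$

With equal groups, 1/(pq) = 4, so the second expression reduces to r_pb = d / √(d² + 4).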

Characterizing effect size
Cohen emphasized that interpreting effects requires the researcher to consider them narrowly, in terms of the specific area of inquiry. Evaluating effect sizes inherently requires a personal value judgment about the practical or clinical importance of the effects. Even though rules of thumb exist, use them only as a last resort, and be wary of “mindlessly invoking” these criteria.

Association
A measure of association describes the amount of covariation between the independent and dependent variables. It is expressed in an unsquared metric or a squared metric: the former is usually a correlation, the latter a variance-accounted-for effect size. We can apply the measure to continuous data (r and R²), categorical predictors with a continuous DV (η²), and strictly categorical settings (e.g. phi). Again the notion is the same: a measure of linear association which, if squared, provides a measure of the variance in the DV that can be accounted for by the predictor.
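For the categorical-predictor case, the variance-accounted-for idea can be sketched as the between-group sum of squares over the total sum of squares. The function name and the toy scores are illustrative only.

```python
from statistics import mean


def eta_squared(groups):
    """Proportion of total variance in the DV accounted for by group membership."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    # Each group contributes its size times its squared distance from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    return ss_between / ss_total


print(round(eta_squared([[1, 2, 3], [4, 5, 6]]), 3))  # 0.771 (13.5 / 17.5)
```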

Case-level effect sizes for group differences
Indexes such as Cohen’s d and η² estimate effect size at the group or variable level only. However, it is often of interest to estimate differences at the case level. Case-level indexes of group distinctiveness are proportions of scores from one group versus another that fall above or below a reference point. Examples: Cohen’s Us, the common language effect size, tail ratios. Reference points can be relative (e.g., a certain number of standard deviations above or below the mean in the combined frequency distribution) or more absolute (e.g., the cutting score on an admissions test). Note that all three types of effect size applicable to the group-difference setting can be converted into one another; which one we use for communication is just a matter of preference.
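Two of these case-level indexes can be sketched directly from d, under the assumption that both groups are normally distributed with equal variances. The sketch shows U3, the proportion of one group falling above the other group's mean, and the common language effect size, the probability that a randomly chosen case from one group outscores a randomly chosen case from the other. Function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def cohens_u3(d):
    """Proportion of the higher group scoring above the lower group's mean."""
    return NormalDist().cdf(d)


def common_language_es(d):
    """Probability that a random case from the higher group beats one from the lower group."""
    # The difference of two independent unit-variance normals has SD sqrt(2)
    return NormalDist().cdf(d / sqrt(2))


for d in (0.2, 0.5, 0.8):
    print(d, round(cohens_u3(d), 3), round(common_language_es(d), 3))
```

When the normality or equal-variance assumptions are doubtful, the same proportions can instead be counted directly from the raw scores.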

Confidence Intervals for Effect Size
Effect size statistics such as Cohen’s d and η² have complex distributions. The general form is the same as for any CI.

Confidence Intervals for Effect Size
Traditional methods of interval estimation rely on approximate standard errors that assume large sample sizes. We need a computer program to help us find the correct noncentrality parameters to use in calculating exact confidence intervals for effect sizes. Both standalone programs (Steiger) and statistical packages (R) can do this for us, and thus provide a measure of effect while noting the uncertainty in that estimate.
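As a sketch of the traditional approach only, the interval below uses a widely used large-sample standard error for d from two independent groups and a normal critical value. It is not the exact noncentral-t interval described above, and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def approx_ci_for_d(d, n1, n2, level=0.95):
    """Large-sample confidence interval for Cohen's d: estimate +/- critical value * SE."""
    # Common large-sample approximation to the standard error of d
    se = sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return d - z_crit * se, d + z_crit * se


lower, upper = approx_ci_for_d(0.5, 50, 50)
print(round(lower, 3), round(upper, 3))
```

Even with 50 cases per group, the interval around a "medium" d of 0.5 runs from roughly 0.1 to 0.9, which shows how much uncertainty a single estimate carries.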

Limitations of effect size measures
- Variability across samples: no more a limitation than for other statistics, but one needs to be fully aware of it. Just because you found a moderate effect doesn’t mean that there is one.
- Standardized mean differences: heterogeneity of within-conditions variances across studies can limit their usefulness; the unstandardized contrast may be better in this case.
- Measures of association: correlations can be affected by sample variances and by whether the samples are independent or not, the design is balanced or not, or the factors are fixed or not. They are also affected by artifacts such as missing observations, range restriction, categorization of continuous variables, and measurement error (see Hunter & Schmidt, 1994, for various corrections). Variance-accounted-for indexes can make some effects look smaller than they really are in terms of their substantive significance.

Limitations of effect size measures
How to fool yourself with effect size estimation:
1. Measure effect size only at the group level.
2. Apply generic definitions of effect size magnitude without first looking to the literature in your area.
3. Believe that an effect size judged as “large” according to generic definitions must be an important result and that a “small” effect is unimportant.
4. Ignore the question of how theoretical or practical significance should be gauged in your research area.
5. Estimate effect size only for statistically significant results.
6. Believe that finding large effects somehow lessens the need for replication.
7. Forget that effect sizes are subject to sampling error.
8. Forget that effect sizes for fixed factors are specific to the particular levels selected for study.
9. Forget that standardized effect sizes encapsulate other quantities such as the unstandardized effect size, error variance, and experimental design.
10. As a journal editor or reviewer, substitute effect size magnitude for statistical significance as a criterion for whether a work is published.

Recommendations
- Report effect sizes along with statistical significance.
- Report confidence intervals.
- Use graphics.
- Use common sense combined with theoretical considerations.
- Do not rely on any one result to support your conclusions.