Statistical Rules of Thumb


1 Statistical Rules of Thumb
(a.k.a. crimes against statistics that you should avoid)
PART II
Ben Meuleman, Peers Meeting, Swiss Center for Affective Sciences, October 26, 2016, Geneva

2 Contents
PART I
1. Quiz I
2. Overview
3. Design and planning
4. Data processing
5. Data analysis
PART II
6. Quiz II
7. Data analysis (continued)
8. Results interpretation
9. Results reporting

3 Overview
Last time I discussed issues related to designing an experiment, processing data, and analyzing data. Today I continue where I left off with data analysis. After that, I discuss issues of interpretation and reporting. But first, let's have some fun!
[Pipeline: Design → Data processing → Data analysis → Results interpretation → Results reporting]

4 6. Quiz II

5 Quiz
Q1. Which of the following statistics is not a pure effect size?
a) Pearson correlation coefficient  b) F-value  c) Odds ratio  d) Eta-squared

6 Quiz
Q1. Which of the following statistics is not a pure effect size?
Answer: the F-value. Test statistics reflect both effect size and sample size, so they are not pure effect sizes.

7 Quiz
Q2. When follow-up tests for an ANOVA are planned, I do not need to correct for multiple testing.
a) TRUE  b) FALSE

8 Quiz
Q2. When follow-up tests for an ANOVA are planned, I do not need to correct for multiple testing.
Answer: FALSE. Planned comparisons are less likely to yield false positives, but they still form a family of tests that should be corrected (see slide 26).

9 Quiz
Q3. When the overall ANOVA F-test is significant, I do not need to correct the follow-up tests for multiple testing.
a) TRUE  b) FALSE  c) FALSE-ish

10 Quiz
Q3. When the overall ANOVA F-test is significant, I do not need to correct the follow-up tests for multiple testing.
Answer: FALSE-ish. A significant omnibus test acts only as a rough filter; follow-up tests generally still need correction (see Fisher's LSD, slides 24 and 25).

11 Quiz
Q4. It is possible to infer causality from observational data (i.e., non-experimental data).
a) TRUE  b) TRUE-ish  c) FALSE

12 Quiz
Q4. It is possible to infer causality from observational data (i.e., non-experimental data).
Answer: TRUE-ish. Methods for causal inference on observational data exist, but they rest on strong assumptions (see slides 45 to 47).

13 Quiz
Q5. Which of the following models is appropriate for analyzing this table of (independent) frequencies?
a) Logistic regression  b) Log-linear analysis  c) Chi-square test  d) All of the above
Survival on the Titanic (British Board of Trade, 1990)
                    Survived: No    Yes
Child   Male                  35     29
Child   Female                17     28
Adult   Male                1329    338
Adult   Female               109    316

14 Quiz
Q5. Which of the following models is appropriate for analyzing this table of (independent) frequencies?
Answer: All of the above. Many of these models turn out to be special cases of one another (see slide 31).

15 7. Data analysis (continued)

16 Recap of regression assumptions
In the first part I talked about the importance of checking the assumptions of your analysis. I cited the following order of importance:
1. Constant (co)variance (structure) of residuals
2. Multicollinearity among the independent variables
3. Independence of the residuals
4. Outliers and influential cases
5. Linearity of associations
6. Normality of residuals
When problems are diagnosed, researchers sometimes switch to non-parametric analyses…

17 Non-parametric tests: When?
Non-parametric tests can be applied for the following problems: non-normal residuals, small samples, outliers, and mild non-linearity.
Non-parametric tests cannot solve everything! These tests are not robust against unequal variances, serial correlation, or multicollinearity. They are "distribution-free", not "assumption-free", as is sometimes claimed.
[Table: rank tests, permutation tests, and bootstrap tests, matched against the problems listed above]

18 Small sample problems
Permutation tests are among the oldest statistical tests in existence. One of the earliest was Fisher's exact test for contingency tables.
Non-parametric tests do not "solve" small samples! They improve the generalizability of your results within your particular sample (by simulating alternative scenarios with the same data). They do not improve generalizability to your population of interest.
According to legend, Fisher's exact test was invented for a problem called the Lady Tasting Tea. A friend of Fisher, Muriel Bristol, claimed that she could tell whether the milk or the tea had been poured into a cup first (simply by tasting it). Fisher tested her skill by presenting her with 8 tea cups, four of each type. Allegedly, Muriel got all 8 cups right. Out of all possible orderings, getting all 8 cups right had a 1 in 70 chance of occurring!
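To make the combinatorics of the legend concrete, here is a minimal sketch of the calculation; it assumes nothing beyond the Python standard library.

```python
# Lady Tasting Tea: 8 cups, 4 of each type; the taster must pick which 4 were
# prepared "milk first". Only one of the C(8, 4) possible selections is fully correct.
from math import comb

n_ways = comb(8, 4)          # 70 ways to choose 4 cups out of 8
p_all_correct = 1 / n_ways   # about 0.014, the chance of getting all 8 right by guessing
print(n_ways, round(p_all_correct, 3))
```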

19 Non-parametric: How?
The term non-parametric is often used in two different ways, which are actually not comparable:
1. Non-parametric as not assuming a shape for the distribution of a test statistic (e.g., normally distributed).
2. Non-parametric as not assuming a shape for an association. Lowess smoothers are often called non-parametric models for association.
The latter usage is actually misleading. A standard regression line is quantified by only two parameters, an intercept and a slope. A lowess smoother, on the other hand, will estimate a new slope at every point along the data cloud. All these slopes could be said to be parameters of the model, so very parametric!
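To illustrate the contrast between the two usages, here is a hedged sketch that fits both a two-parameter regression line and a lowess smoother to the same simulated data; it assumes numpy and statsmodels are available.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# "Parametric" in the usual sense: the whole association is two numbers.
slope, intercept = np.polyfit(x, y, 1)

# Lowess: effectively a new local fit at every point along the data cloud,
# so the "non-parametric" smoother actually carries many more parameters.
smoothed = sm.nonparametric.lowess(y, x, frac=0.2)   # columns: x, fitted y
```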

20 No free lunch in statistics!
In 2011, Science published a paper by Reshef et al. that introduced the maximal information coefficient (MIC). The authors claimed that MIC was capable of detecting linear, nonlinear, and even nonfunctional relations. An editorial in the same issue of Science touted the MIC measure as a "Pearson correlation for the 21st century". However, in a critical review, Simon and Tibshirani (2011) presented simulations showing that the MIC statistic performed poorly in a variety of settings which the original authors had not considered. Tibshirani noted that any test claiming to offer superior power over existing alternatives usually does so by introducing additional assumptions, or by sacrificing generalizability in important settings.

21 Multiple testing nightmares
Two extremes…
"I never correct for multiple testing."
"When I began research, I projected that I would conduct 743,229 statistical tests throughout my entire career. As such, I have consistently multiplied all my p-values by this number."
[Artist's impression of statisticians]

22 Multiple testing nightmares
Standard practice in statistics is to conduct hypothesis tests at the 5% significance level.
Per contrast error (PCE) rate: the probability for a single comparison to be a false positive. By convention set at 5%.
Expected number of errors per experiment (ENEPE): the expected number of false positives in a single experiment for a given PCE. Can be calculated as PCE × C, where C is the number of tests.
Family-wise error (FWE) rate: the probability of finding at least one false positive in a family of comparisons (e.g., for a single experiment). For a set of independent tests, this can be calculated as 1 − (1 − PCE)^C. For non-independent tests, the probability lies between this number and PCE.
Most statisticians agree that FWE, rather than PCE, should be controlled at 5% significance. For 10 independent tests, the probability of finding at least one false positive is already about 0.40. For 100 independent tests, that probability is virtually 1, so the experiment is practically guaranteed to turn up at least one false positive!
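The arithmetic behind these numbers is easy to reproduce; a small standard-library sketch of the FWE formula for independent tests:

```python
# Family-wise error rate 1 - (1 - PCE)^C for C independent tests at PCE = 0.05.
pce = 0.05
for c in (1, 10, 100):
    fwe = 1 - (1 - pce) ** c   # P(at least one false positive among c tests)
    print(f"{c:>3} tests: FWE = {fwe:.3f}")
# 1 test: 0.050, 10 tests: 0.401, 100 tests: 0.994
```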

23 Multiple testing nightmares
Maintaining FWE at 5% means that PCE has to be reduced. Opinions differ about which correction should be applied. A non-exhaustive overview:
Bonferroni: Divides the significance level by the number of tests. Assumes independent tests (which is rarely true in practice). Very conservative for large families of comparisons.
Step-down methods: Incremental adjustment to Bonferroni. The most significant test in the family (i.e., the lowest p-value) must still survive the strictest Bonferroni adjustment!
Fisher's least significant difference (LSD): Implies that follow-up comparisons should not be corrected after a significant overall/omnibus test.
Tukey's honest significant difference (HSD): Developed specifically for pairwise contrasts following a between-ANOVA. Optimal in this setting.
False discovery rate (FDR): Developed for (extremely) large families of comparisons (e.g., > 1000). Optimal for these problems.
Simulation-based: Simulates the correct adjustment factor for a given family of comparisons. Most likely optimal for all problems, but can be difficult to execute in practice.
An excellent discussion of different methods (with strengths and weaknesses) can be found in Maxwell & Delaney (2004)!
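As a practical illustration of several corrections in this overview, here is a hedged sketch using the multipletests function from statsmodels (assumed to be installed); the family of p-values is invented for the example.

```python
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.008, 0.020, 0.041, 0.049, 0.120, 0.430]   # hypothetical family

# Bonferroni, a step-down method (Holm), and the Benjamini-Hochberg FDR procedure.
for method in ("bonferroni", "holm", "fdr_bh"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(f"{method:>10}: adjusted p = {p_adj.round(3)}, reject = {reject}")
```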

24 Fisher’s LSD It is sometimes claimed that individual comparisons following an overall ANOVA do not have to be corrected for multiple testing. This approach is known formally as Fisher’s LSD. No explicit correction factor is applied. Instead, testing proceeds in two stages: (I) an omnibus test for the family of comparisons. If significant, we proceed with (II) individual comparisons. The idea is that the omnibus test will act as an initial filter, making it less likely that false positives show up for the individual comparisons. This is sound intuition but in practice this approach can easily falter…

25 Fisher’s LSD As an example, consider a multiple regression model with 100 predictors. 99 predictors are random noise and are uncorrelated with the dependent. 1 predictor has a very strong correlation with the dependent. Most likely, the overall F-test for this regression model would turn out significant, driven by the presence of the 1 correlated predictor. At the second stage, however, we are suddenly testing 100 coefficients without correction. Many false positives will likely turn up among the 99 random predictors! In practice, Fisher’s LSD works better in the other direction of reasoning: if an overall test is insignificant, you should not proceed with individual comparisons!
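The 100-predictor scenario is straightforward to simulate. The sketch below uses invented data and assumes numpy and statsmodels are available; it illustrates the argument rather than reproducing anything from the original slides.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, k = 500, 100
X = rng.normal(size=(n, k))
y = 2.0 * X[:, 0] + rng.normal(size=n)      # only predictor 0 is real; 99 are noise

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(f"overall F-test p-value: {fit.f_pvalue:.1e}")   # stage I: highly significant

noise_p = fit.pvalues[2:]                    # stage II: the 99 noise coefficients
print("uncorrected 'significant' noise predictors:", int((noise_p < 0.05).sum()), "of 99")
# At alpha = .05 we expect roughly 5 false positives to slip through at stage II.
```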

26 Planned versus unplanned
It is sometimes claimed that when individual comparisons are planned a priori, they do not need to be corrected for multiple testing. This is obviously not true! What is true is that planned comparisons are less likely to turn up false positives, because they are not guided by post-hoc data exploration. Otherwise, a family of planned comparisons should not be treated much differently from any other! Besides, if not correcting planned comparisons were acceptable, one could "plan" dozens of comparisons and get away with no corrections. For a completely exploratory family of comparisons, it is recommended to use very conservative corrections, such as Bonferroni or (better) Scheffé (see Maxwell & Delaney, 2004).

27 Avoid visual multiple testing

28 Avoid visual multiple testing

29 Further example α = 0.05

30 Further example: α = 0.05 / 435, Bonferroni corrected!

31 Categorical data analysis
Analysis of categorical data is a very interesting area of statistics that includes the analysis of frequency tables (e.g., chi-square test, log-linear analysis), discriminant analysis, and GLMs such as binary logistic regression (two-class dependent), multinomial logistic regression (multi-class dependent), and Poisson regression.*
Although I cannot discuss the many issues related to categorical data analysis, I do want to provide you with two general cautions:
1. Exercise care with the interpretation of parameter estimates in GLMs. Typically, parameters have to be interpreted on a log or log-odds scale, which means that effects are multiplicative rather than additive. For this reason, GLM coefficients are often exponentiated prior to interpretation.
2. Avoid repeated-measures designs with categorical dependents. Models for repeated categorical responses can be very complicated (e.g., random-effects GLMs) and (in my experience) are not well understood by most users!
* Many of these models turn out to be special cases of each other.
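On the first caution, here is a hedged sketch of moving from the log-odds scale to odds ratios in a binary logistic regression; the data are simulated and statsmodels is assumed to be available.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.normal(size=300)
p = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))       # true model on the log-odds scale
y = rng.binomial(1, p)

fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Binomial()).fit()
print(fit.params)            # additive effects on the log-odds scale
print(np.exp(fit.params))    # multiplicative effects: odds ratios
```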

32 Data analysis (continued) recap
Use non-parametric tests for dealing with the right problems.
Always control the family-wise error rate of tests for an experiment at 5% significance!
Do not proceed with testing after a non-significant overall/omnibus test.
Exercise caution with GLMs for categorical data.

33 8. Results interpretation

34 P-values
Conventionally, a p-value quantifies the probability of observing a value of the statistic at least as extreme as the one estimated, under the assumption that the null hypothesis is true. These values are obtained directly from the null distribution of the statistic (e.g., a t-distribution). When the p-value drops below a fixed threshold (i.e., the significance level), we reject the null hypothesis. Otherwise, we conclude that there is insufficient evidence to reject the null hypothesis.*
P-values do not quantify the probability that the null hypothesis is true. Nor do they capture the probability that any given alternative hypothesis is true. Often an infinite number of alternative hypotheses is compatible with a rejection of the null hypothesis.
For symmetric test distributions such as the t- and z-distributions, we typically calculate two-sided p-values. These reflect our ignorance about the direction the effect will take. Rarely is there a reason to report one-sided p-values! Do not switch arbitrarily between the two types!
* This is not the same as accepting the null hypothesis!

35 Trends to nowhere
P-values in the range of 0.05 to 0.10 are sometimes interpreted as marginally significant, or as reflecting trend effects. These are dubious claims for a number of reasons:
The point of having a fixed significance level is to maintain an objective threshold for statistical inference. Stretching the level introduces subjective bias into these decisions.
The interpretation of the trend's direction is basically arbitrary. A trend to significance might just as well reflect a trend to insignificance. In fact, if we allow p = 0.06 to be marginally significant, we should also allow p = 0.04 to be marginally insignificant!
If the null hypothesis is actually true, p-values have a uniform distribution, meaning every value between 0 and 1 is equally likely to be observed. This makes non-significant p-values not actually comparable to significant p-values!
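The uniformity claim in the last point is easy to check by simulation; a rough sketch assuming numpy and scipy are available.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# 10,000 two-sample t-tests where the null hypothesis is exactly true.
pvals = np.array([
    stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
    for _ in range(10_000)
])
# Under H0 the p-values are roughly uniform: each decile holds about 10% of them.
print(np.histogram(pvals, bins=10, range=(0, 1))[0])
```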

36 Alternatives to p-values
Null-hypothesis significance testing remains the primary method in research for evaluating the significance of an effect. However, there are others!
Information criteria: Measures such as AIC and BIC capture the relative increase/decrease in information of one model over another. They can be used, for example, to select an optimal set of predictors in stepwise regression.
Bayesian inference: Closely related to information criteria, Bayesian statistics allow a comparison of models through the use of Bayes factors, which quantify the relative evidence for a model considering both the data and prior beliefs on the parameter distribution.
Machine learning: In the context of machine learning, several models have been developed that allow the user to make binary decisions on variable relevance (e.g., sparse regression models with automatic variable selection), some of which are related to Bayesian principles.
Cross-validation: A general framework that can be used in conjunction with any of the methods above. Cross-validation is usually aimed at evaluating the predictive strength of a model, by testing how well it fits independent data.
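As one concrete example of the information-criterion route, here is a hedged sketch comparing the AIC/BIC of two nested regression models in statsmodels; the data and the extra junk predictor are invented for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x1, x2 = rng.normal(size=(2, 200))
y = 1.0 + 0.8 * x1 + rng.normal(size=200)     # x2 is irrelevant to y

m1 = sm.OLS(y, sm.add_constant(x1)).fit()                             # x1 only
m2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()      # x1 + junk x2
print(m1.aic, m2.aic)   # the simpler model usually wins on AIC here
print(m1.bic, m2.bic)   # and on BIC, which penalizes extra parameters more heavily
```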

37 Causality and confounding
One of the primary applications of statistics is to infer causality. Typically, we will run an experiment with several conditions (reflecting the systematic manipulation of some property), and randomly assign participants to those conditions (i.e., randomized condition assignment; RCA). For example:
[Diagram: RCA → Y]
In this case it would be correct to state that there is a causal effect of RCA on the dependent Y. Stating that there is a causal effect of the manipulation on Y does not necessarily follow, however!

38 Causality and confounding
Consider a manipulation of word concreteness (abstract vs. concrete words) for investigating differences in reaction times (RT). However, it is known that word abstractness is correlated with word length (abstract words tend to be longer). The word concreteness effect might disappear once word length is controlled for!
This example illustrates (a) that causal conclusions beyond a simple RCA statement can be complicated, and (b) that RCA is primarily aimed at excluding confounding by subject-level characteristics (e.g., age, gender), not confounding by stimulus-level characteristics!
[Diagram: RCA → word concreteness → Y, with word length as a stimulus-level confounder]

39 Confounding versus multicollinearity
Both confounding and multicollinearity (to some extent) refer to a correlation among independent variables. In the case of multicollinearity, the correlation is considered to be problematically high (e.g., .95), such that standard errors on the coefficients become severely inflated. In this situation it is almost always advised to remove the multicollinear variables. In the case of confounding, on the other hand, keeping two correlated variables in your model is often a good idea! Doing this will actually unconfound the relationship of these two variables with the dependent. For example: controlling for age allows the researcher to isolate the unique effect of number of health complaints on life satisfaction (LS).
[Diagram: age and health complaints → LS]

40 Confounding
Confounding can be a reason to keep independent variables in a model, even when those variables themselves have no significant relationship with the dependent. Some methods, such as stepwise regression, perform a best-subset selection of variables in the model. Typically, these methods keep only variables that have a significant relationship with the dependent. Such a selection could miss important confounders, however, and, potentially, more serious problems such as suppression effects. Sometimes an effect reverses when a controlling variable is added to a model. This situation is known as Simpson's paradox (see Hernán, Clayton & Keiding, 2011, for an in-depth treatment). The most famous example is the birth weight paradox.

41 Simpson’s paradox

42 Simpson’s paradox Negative association between X1 and Y

43 Simpson’s paradox Now we consider the presence of an additional covariate X2

44 Simpson’s paradox When controlling for X2, the association between X1 and Y becomes strongly positive!
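The reversal shown in these figures can be reproduced with a small simulation; the sketch below uses made-up parameter values and assumes numpy and statsmodels are available.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x2 = rng.normal(size=1000)                      # the lurking covariate
x1 = x2 + rng.normal(scale=0.5, size=1000)      # X1 is strongly correlated with X2
y = 1.0 * x1 - 3.0 * x2 + rng.normal(scale=0.5, size=1000)

marginal = sm.OLS(y, sm.add_constant(x1)).fit()
adjusted = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(marginal.params[1])    # negative marginal association between X1 and Y
print(adjusted.params[1])    # turns strongly positive once X2 is controlled for
```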

45 Causal inference for observational data
It is sometimes assumed that causal inference can only be conducted on experimental data (e.g., data resulting from a randomized experiment). This is actually not true. Many methods and models have been developed for this purpose (e.g., the Rubin causal model). These methods become necessary in situations where running an experiment would be ethically or practically unfeasible (e.g., smoking and health). One of the most widely used methods for causal inference in the social sciences is structural equation modelling (SEM), which includes path analysis and confirmatory factor analysis as special cases. In SEM, relations between variables are spelled out completely by means of structural equations or, graphically, by a directed acyclic graph (DAG).

46 Structural equation modelling
(Silveira et al., 2004)

47 Causal inference for observational data
Methods for causal inference in observational data (e.g., SEM, mediation analysis) typically share the same central assumption: in order for the causal conclusion to be valid, the model must include all relevant confounders and mediators for the effect in question. This assumption makes causal inference for observational data slightly absurd in practice. The best we can hope for is to make an educated guess within the limitations of the data we collected!
Researchers often take too much comfort in their diagram. Encoding relations between variables in such an explicit manner makes for extremely strong assumptions. Multiple alternative models could be compatible with the same data!
SEM is one of the least exploratory methods in statistics! You should always apply it in a confirmatory manner, preferably by validating your SEM on external/independent data.

48 Interactions
As a general rule, you should not interpret/report the main effect of a variable when that variable is also involved in a (significant) interaction in your model! Interactions by definition contradict the notion of main effects: a main effect states that the effect of a variable remains the same across levels of other variables, while an interaction states the reverse! For this reason, when reporting results of an analysis with interactions, you should always start by reporting the highest-order interactions, and work your way down from there (see Maxwell & Delaney, 2004, for a structured approach).
However, there are also statistical reasons for not interpreting "main effects" in interaction models. For models that use dummy coding (0/1) for categorical predictors and type III ANOVA for effects testing (the default in most standard software packages), main-effect parameters do not exist!

49 Interactions We consider a regression of Y on X with a grouping factor G (dummy coded, A=0, B=1)

50 Interactions A main effects model, a.k.a. parallel slopes E(Y) = β0 + β1X + β2G

51 Interactions Effect of G remains consistent across levels of X A main effects model, a.k.a. parallel slopes E(Y) = β0 + β1X + β2G

52 Interactions An interaction model, a.k.a. intersecting slopes E(Y) = β0 + β1X + β2G + β3(X×G)

53 Interactions Main effect of G disappears at the intersection! An interaction model, a.k.a. intersecting slopes E(Y) = β0 + β1X + β2G + β3(X×G)

54 Interactions
E(Y) = β0 + β1X + β2G + β3(X×G)
These parameters no longer reflect main effects! For type III ANOVA and dummy-coded factors, the so-called "main effects" parameters actually reflect conditional contrasts. In the above model, β1 encodes the slope of X when G = 0, and β2 encodes the effect of G when X = 0.
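A hedged numerical check of this point, fitting the model above with a dummy-coded G on simulated data; the statsmodels formula interface is assumed, and all coefficient values are invented.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 400
df = pd.DataFrame({"X": rng.normal(size=n), "G": rng.integers(0, 2, size=n)})  # G dummy-coded 0/1
df["Y"] = 1 + 0.5 * df.X + 1.0 * df.G + 1.5 * df.X * df.G + rng.normal(size=n)

fit = smf.ols("Y ~ X * G", data=df).fit()
print(fit.params)
# The "X" coefficient estimates the slope of X when G = 0 (about 0.5), and the
# "G" coefficient estimates the effect of G when X = 0 (about 1.0); neither is
# a main effect averaged over the other variable.
```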

55 Interactions
However, there are two nuances that must be added to this:
1. Main effects can be tested in models that use orthogonal contrasts for factors (e.g., -1/1 coding) and/or type II ANOVA. The latter is a default in many software packages for within-subjects ANOVA without between-subjects variables. Whether testing these main effects is meaningful in the presence of a significant interaction is another matter, however (see earlier)!
2. Sometimes main effects can still be interpreted in the presence of interactions, when the point of intersection (i.e., where the effect reverses) occurs outside of the observed or even the observable data space. This occurs often in factorial ANOVA, where the data range is restricted to a handful of levels.

56 Interactions What if we had observed these data…?

57 Interactions
If the point of intersection occurs outside of the observed data space but inside the observable data space, I would still be very cautious about interpreting main effects (for all the reasons stated earlier). If the point of intersection occurs outside of the observable data space (e.g., at an IQ score of 0, or beyond the categorical levels of a factor), then it may be acceptable to interpret a main effect on top of an interaction. The latter requires visual inspection of the interaction pattern, however, which may become very complicated for higher-order interactions.
All of the aforementioned pitfalls are especially dangerous in mixed ANOVA (i.e., models that include within- and between-subjects variables simultaneously). SPSS twists the knife deeper with its inconsistent handling of categorical versus continuous covariates. Inform yourself well on these analyses!

58 Types of interactions

59 Types of interactions Interaction between two continuous variables (reinforcement effect).

60 Retrospective power Sometimes reviewers/journals will request a power analysis for a submitted study—especially when an effect is insignificant/weak—to rule out the possibility that the study was underpowered. Do not conduct a retrospective power analysis on your own data! This reasoning is circular. If your effect is not significant, a retrospective power analysis will almost always conclude that the study was underpowered (and vice-versa). The problem with retrospective power analysis is plugging your observed effect size into the power calculation. A correct power analysis should use the effect size that was projected when the study was planned! Speaking of planning…

61 3. Design and planning

62 Planning power
Many software packages are available that will help you calculate power and/or sample size, including online applications. A popular tool is G*Power, which covers a large variety of tests (e.g., t-test, chi-square test) and research designs (e.g., ANOVA, ANCOVA, repeated measures).
One reason why psychology researchers often have a hard time calculating power and/or sample size is the requirement of knowing the effect size in advance. Several strategies can be employed to determine a suitable effect size (see also the sketch below):
1. Base your effect size on past literature (if available)
2. Run a simulation for the given design/analysis for a range of plausible effect sizes
3. Pick a very conservative effect size (likely necessitating large samples!)
4. Run a pilot to get an initial estimate of the effect size
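For simple designs, calculations like those in G*Power can also be scripted. A hedged sketch using statsmodels' power module (assumed to be installed); the effect size, alpha, and power values are placeholders, not recommendations.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Sample size per group for a two-sample t-test with d = 0.5, alpha = .05, power = .80;
# solve_power returns whichever quantity is left unspecified (here, nobs1).
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(round(n_per_group))    # roughly 64 per group under these assumptions
```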

63 Sample size
Typically, setting the sample size will take priority over setting the power level during the planning of a study. Just as the significance level is often fixed at 5%, the power level is often fixed at 80%. There are three main reasons for having a reasonably large sample size:
1. Power: for detecting the effect of interest at a certain significance level
2. Reliability: to obtain reliable/precise parameter estimates (e.g., averages, coefficients)
3. Generalizability: to be able to generalize the results to the population of interest
For a multiple regression model, there exists a rule of thumb for a minimal sample size to ensure reliable estimation: Nmin = × Q, with Q the number of predictors in the model

64 Power surplus
Big data have presented unexpected new challenges to classical hypothesis testing. Sometimes the sample size is so large that all tests turn out to be significant! While obviously a luxury problem, there are two ways of dealing with it:
1. Subsample your data at a desired power level and run your intended analysis as before. For large datasets, you could even repeat the subsampling process (in a meta-analytic fashion) or do a full cross-validation.
2. Switch to machine learning models, many of which were designed specifically to handle big data.

65 8. Results interpretation

66 Results interpretation recap
Avoid reporting marginally significant results or trend effects.
Exercise caution with causal models for observational data, such as SEM!
Do not interpret main effects in the presence of significant interactions!
Set the power/sample size of your study in advance.

67 9. Results reporting

68 General advice
Always provide complete information on the data analysis in your paper. In the method section, devote a paragraph to your analysis, specifying (a) the main analysis, (b) follow-up tests, and (c) corrections for multiple testing.
Be honest about any failings in the data/study, such as missing values, problematic outliers, unusable subjects, etc.
If you use a novel/non-standard method of analysis, motivate it. Normally reviewers will not complain if you motivate it well. Non-standard analyses can be a plus for a paper!
Mention the software that you used for the data analysis, including the version number. Different software packages have different defaults for analyses, so this is not trivial!

69 Effect sizes
Almost all journals now insist on the reporting of effect sizes, in addition to the usual inferential statistics. Standard inferential statistics (e.g., t-values, F-values, chi-square values) are not pure effect sizes. These statistics reflect both effect size and sample size! For small values of these statistics (e.g., F < 1), effect sizes tend to be small too. The reverse is generally not true, however!
A good effect size should satisfy the comparability criterion: it should not depend on the particular study design that was used. In practice this requirement often turns out to be complicated (e.g., within- versus between-designs, manipulated versus observed variables). Once again, an excellent reference on suitable effect sizes for different analyses is Maxwell and Delaney (2004).

70 Types of effect sizes
Several types of effect sizes exist:
Correlational: Pearson correlation, Spearman correlation, Kendall's tau, standardized regression coefficients
Proportion-of-variance-explained: R², adjusted R², η², ε², ω², and their partial and/or generalized versions (see Olejnik & Algina, 2000, 2003)
Contingencies: Odds, odds ratio, relative risk
Signal detection: Precision, recall, sensitivity (d'), specificity, F1 measure, AUC
Difference-of-means: Cohen's d, Hedges' g, Glass' Δ
Other: Confidence intervals, Bayes factors

71 Rules of thumb
Pick an effect size that is appropriate for quantifying your effect and that you know how to interpret. Rules of thumb exist for interpreting selected effect sizes (Cohen, 1988):
          Pearson correlation   Cohen's d
Small     0.10                  0.20
Medium    0.30                  0.50
Large     0.50                  0.80
However, these rules should be treated with caution. They depend strongly on the research design (e.g., experimental versus observational) and the constructs that are being measured. For example, in physics, where high-precision instrumentation and measurement is the norm, a correlation of r = 0.50 would be considered quite small!
[Image: grains of salt]
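As a small worked illustration (not from the slides), here is one common way to compute Cohen's d for two-group data, assuming only numpy; the pooled-SD formula shown applies to equal group sizes.

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.normal(loc=0.0, size=50)
b = rng.normal(loc=0.5, size=50)      # true standardized mean difference: 0.5

# Cohen's d with a pooled standard deviation (equal group sizes assumed).
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (b.mean() - a.mean()) / pooled_sd
print(round(d, 2))                    # lands somewhere around the "medium" benchmark
```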

72 Proportion-of-variance-explained
Measures for the proportion-of-variance-explained are frequently a source of confusion. Many researchers (including me) often struggle to interpret (partial) eta-squared correctly. The picture is further complicated by an intrinsic weakness in unadjusted proportion-of-variance-explained measures (e.g., R2): they tend to increase even when junk variables are added to the model! For this reason, adjusted, partial, and generalized versions of these measures have been developed (see papers by Olejnik and Algina). Typically, these measures penalize a model for increasing the number of parameters. This approach is similar to information criteria such as AIC and BIC.

73 Deviations Researchers do not always understand the difference between standard deviations (SD) and standard errors (SE), and which to report in a paper. A standard deviation (SD) generally refers to a measure of spread around an average. Conventionally, this term is reserved for denoting the spread of some sample data around its average. A standard error (SE) is the standard deviation of an estimated statistic (e.g., an average, a variance, or a regression coefficient). SEs are theoretical quantities, derived from distributional assumptions on the statistic in question. SEs are directly used for hypothesis testing. Almost all test statistics are constructed by dividing the estimated statistic by its standard error.
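The relationship between the two is easy to verify by simulation for the sample mean; a sketch assuming only numpy, with arbitrary population values.

```python
import numpy as np

rng = np.random.default_rng(9)
sample = rng.normal(loc=100, scale=15, size=40)

sd = sample.std(ddof=1)               # spread of the data around their average
se = sd / np.sqrt(sample.size)        # estimated standard error of the mean

# Check against the spread of many simulated sample means of the same size.
means = rng.normal(loc=100, scale=15, size=(10_000, 40)).mean(axis=1)
print(round(sd, 2), round(se, 2), round(means.std(ddof=1), 2))
# The SE (about 15/sqrt(40), i.e. 2.4) matches the SD of the sampling distribution of the mean.
```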

74 Types of deviations
[Figure: sample data points around the sample average, annotated with the sample SD, the SE of the mean, and the 95% CI of the mean]

75 Error bars
In a paper, it is rarely of interest to report sample SDs. Normally we are interested in making inferences about typical effects (e.g., an average response). Therefore we are also interested in the variation of that typical response, which is quantified by an SE.
Adding error bars to graphs is recommended, but in practice this turns out to be quite complicated. It is widely misunderstood that overlapping error bars between groups indicate non-significant group differences. This is not true (see Belia et al., 2005)!* For within-designs, error bars should be adjusted for the repeated-measures correlation. However, even for between-designs, overlapping error bars do not necessarily indicate non-significant group differences!
* Non-overlapping error bars do reflect significant group differences!

76 Error bars
Error bars and confidence intervals are primarily interesting for quantifying the uncertainty/reliability of an estimate within a particular condition. As a tool for between-condition comparison, their utility may be limited. Error bands can provide an informative visual around regression slopes or time-series plots.
I myself never add error bars to effect plots. Instead I favor model-based effect plots. Such plots, when the underlying model has been stripped of all redundant parameters, can provide very clear information on significant differences between conditions (free of the noise that comes with data-based effect plots).

77 Results reporting recap
Provide honest and detailed information on the data analysis process for your study.
Select your effect size appropriately.
For quantifying uncertainty in estimates, reporting SEs can be informative.

78 Thank you for your attention!

