# Carles Falcon & Suz Prejawa

t-tests, ANOVAs & Regression and their application to the statistical analysis of neuroimaging
Carles Falcon & Suz Prejawa

OVERVIEW
Part 1: Basics, populations and samples; t-tests; ANOVA; Beware!; Summary
Part 2: Correlation & Regression

Basics
Hypotheses: H0 = null hypothesis; H1 = experimental/research hypothesis
Descriptive vs inferential statistics
(Gaussian) distributions
p-value & alpha-level (probability and significance)

The experimental hypothesis is essentially a prediction that you have about a set of data/an event/a group/a topic, and this prediction must be falsifiable. Usually, an experiment is aimed at disproving that the null hypothesis is true. The null hypothesis is the opposite of the experimental condition; usually you expect a difference/an effect in your experiment, so the null hypothesis claims that this effect does not exist. With regard to fMRI, this would relate to activation being expected and observed or, if the null hypothesis is true, not. Example: activation in the left occipitotemporal regions, especially the visual word form area (VWFA), is greatest for written words (visual word forms). See Cohen and Dehaene (2004), NeuroImage.

Descriptive stats: allow one to summarise data (which is essentially a huge amount of numbers) and to grasp the essential features of the data quickly and easily; often in image form; mean, median, mode, SDs, histograms, etc.

Inferential* stats: go beyond the pure data; inform about the likelihood of the findings, i.e. about the probability that the findings would turn out the way they actually have, i.e. whether effects are genuine and due to the experimental manipulation or occur simply by chance. Inferential stats are possible because research data are rarely random (there is usually a similar pattern in the distribution). 2 types: distribution tests (t-tests and ANOVAs) and correlation tests.
* inference: the act or process of deriving a conclusion based on what one already knows

VWFA example: activation in the VWFA is present when
- reading is compared to a rest condition or false fonts (unknown script), OR
- pictures of objects are named (e.g., tiger) relative to a resting condition,
- picture naming is compared to reading aloud those exact object labels (e.g., naming the picture of a tiger versus reading the word "tiger"),
- colours are named,
- an action associated with an object (shown as a picture) is carried out (e.g., moving the fingers quickly along an imaginary board to illustrate touch typing when presented with a picture of a touch-typing machine),
- "reading" Braille with abstract meaning,
- seeing novel objects (previously unknown and thus without any kind of word label attached to them).

BOLD signal intensity can be measured for these conditions and the values can be listed (the number crunching = descriptive stats). But the question really is: if there is a numerical difference in these values, is this difference meaningful? Inferential stats can tell you!

Probability: relates to the probability of the null hypothesis being true. Expressed as a positive decimal fraction, e.g. .1 (1/10), .05 (5/100), .01 (1/100). Probability can never be higher than 1, because a probability of 1 means that something happened every single time. Expressed as a p-value (a simple number).

Significance (alpha-level): closely related to probability; significance levels inform whether differences or correlations found in the data are actually important or not. Even though a probability may be small, the effect may not necessarily be important, whereas very small effects can turn out to be statistically significant (the latter is often true for huge sample sizes, especially in medical research). Expressed as a p-value; often set at p < .05 (even though that may depend; in fMRI it is often lower). Attaching statistical meaning to a number: significance levels have LOW probabilities expressed in p-values. If p < α then we reject the null hypothesis and accept the experimental hypothesis, concluding that we are 95% certain that our experimental effect is genuine. If, however, p > α then we reject the experimental hypothesis and accept the null hypothesis as true: there was no significant difference in brain activation levels between the two conditions that you are comparing.

Activation in the left occipitotemporal regions, especially the visual word form area, is greatest for written words.

Populations and samples
z-tests and t-tests: populations require z-tests, samples require t-tests.
General: hypothesis testing relates to POPULATIONS, not samples. Because we usually only test/study samples, we need to use sample means in order to make inferences about population means. The t-distribution has to be used for samples (it is similar to the z-distribution in that it is symmetrical, but it is flatter and changes with sample size).
Z-tests and tables: used with normally distributed population data (see above).
T-tests and tables: used with normally distributed sample data (a sample being part of a population).
NOTE: a sample can be 2 sets of scores, e.g. fMRI data from 2 conditions.
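The point about the t-distribution being symmetrical but flatter than z, and converging on it as the sample grows, can be checked numerically. This is our own illustration (not from the slides), using scipy's distribution objects:

```python
# Illustration (ours, not from the slides): the t-distribution is
# symmetrical like z but flatter with heavier tails, and it approaches
# the normal (z) distribution as degrees of freedom grow.
import numpy as np
from scipy import stats

xs = np.linspace(-4, 4, 9)
z_pdf = stats.norm.pdf(xs)
t_pdf_small = stats.t.pdf(xs, df=4)     # small sample: flatter peak, fatter tails
t_pdf_large = stats.t.pdf(xs, df=400)   # large sample: nearly identical to z

peak_gap_small = z_pdf.max() - t_pdf_small.max()
peak_gap_large = z_pdf.max() - t_pdf_large.max()
```

With df=4 the peak of the t-density sits visibly below the normal's while its tails sit above it; with df=400 the two curves are almost indistinguishable.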

Comparison between Samples
Usually you only have access to samples, which means you never capture a population as a whole. You need to be careful that your sample is representative of your population. Are these groups different?

Comparison between Conditions (fMRI)
Reading aloud (script) vs "reading" finger spelling (sign). Is the activation different when you compare 2 different conditions? Exp. 1: reading script is compared to "reading" finger spelling (sign); or Exp. 2: picture naming is compared to reading aloud those exact object labels (e.g., naming the picture of a tiger versus reading the word "tiger"). Reading aloud vs picture naming.

t-tests Compare the mean between 2 samples/ conditions
[Figure: bar chart with 95% CIs comparing conditions across left- and right-hemisphere lesion sites]
If 2 samples are taken from the same population, then they should have fairly similar means; if 2 means are statistically different, then the samples are likely to be drawn from 2 different populations, i.e. they really are different. A t-test assesses whether the means of two samples are statistically different from each other; this analysis is appropriate whenever you want to compare the means of two samples/conditions.
Mean: the arithmetic average; a hypothetical value that can be calculated for a data set (it doesn't have to be a value that is actually observed in the data set); calculated by adding up all the scores and dividing by the number of scores.
Assumptions of a t-test: drawn from a parametric population; not (seriously) skewed; no outliers; independent samples.

Reporting convention: t= 11.456, df= 9, p< 0.001
Formula: the difference between the means divided by the pooled standard error of the mean.
t-value: the end product of the calculation.
df = degrees of freedom (the number of individual scores that can vary without changing the sample mean).
Standard error: the standard deviation of sample means; a measure of how representative a sample is likely to be of the population. A large SE (relative to the sample mean) means lots of variability between the means of different samples, so the sample used may not be representative of the population. A small SE means most sample means are similar to the population mean, so the sample is likely to be an accurate reflection of the population.

Formula cont. (Cond. 1, Cond. 2) I admit I stole this from last year's presentation: you may read this at your own leisure. To calculate the t-score: take the mean from condition 1 (x-bar), then the mean from condition 2, find the difference between these two, and divide this by their shared standard error. Calculate the standard error by taking the variance for group 1 and dividing it by the sample size for that group; do the same for group 2 and add the two together. Then take the square root of this and put the resulting value back into the original calculation to get your t-value.
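The recipe above can be followed step by step in code. This is a sketch with invented condition scores; scipy's Welch t-test uses exactly this pooled standard error, so it serves as a check:

```python
# Sketch of the t formula from the slide (difference between means over
# pooled standard error), checked against scipy. The two condition score
# arrays are invented example data.
import numpy as np
from scipy import stats

cond1 = np.array([12.1, 11.4, 13.0, 12.7, 11.9])
cond2 = np.array([10.2, 10.8,  9.9, 10.5, 10.1])

# difference between the two condition means ...
mean_diff = cond1.mean() - cond2.mean()
# ... divided by the shared standard error: sqrt(s1^2/n1 + s2^2/n2)
se = np.sqrt(cond1.var(ddof=1) / len(cond1) + cond2.var(ddof=1) / len(cond2))
t_manual = mean_diff / se

# scipy's unequal-variance (Welch) t-test applies the same formula
t_scipy, p = stats.ttest_ind(cond1, cond2, equal_var=False)
```

The hand-computed `t_manual` matches `t_scipy` exactly; `p` is the associated two-tailed probability.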

Types of t-tests
The paired samples t-test is also called the dependent means test.

| | Independent samples | Related samples |
| --- | --- | --- |
| Interval measures/ parametric | Independent samples t-test* | Paired samples t-test** |
| Ordinal/ non-parametric | Mann-Whitney U-test | Wilcoxon test |

There are lots of different types of t-tests, which need to be used depending on the type of data you have.
(Equal) interval measures: scales in which the difference between consecutive measuring points is of equal value throughout; an arbitrary zero, i.e. positive and negative measures are possible, e.g. temperature.
Ordinal measures: scales on which the items can be ranked in order. There is an order of magnitude but the intervals may vary, i.e. one item on the scale is more or less than another but it is not clear by how much, as this cannot be measured. Often statements/feelings are attached to numbers which can then be used for rating (1 = very good, 2 = good, 3 = neutral, 4 = bad, 5 = very bad); in fact, such data can only be ranked (from highest to lowest), e.g. which score had the highest turn-out.
Different types of measurement have a direct influence on the way the analysis is conducted, because some of them are more amenable to mathematical operations than others.
* 2 experimental conditions and different participants were assigned to each condition
** 2 experimental conditions and the same participants took part in both conditions of the experiment
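All four cells of the table correspond to functions in scipy.stats. A minimal sketch with invented scores (the group/pre/post numbers are made up for illustration):

```python
# The four tests from the table, applied to invented samples.
import numpy as np
from scipy import stats

group_a = np.array([5.1, 4.8, 6.0, 5.5, 5.9, 4.7])   # different participants
group_b = np.array([6.2, 6.8, 5.9, 7.1, 6.5, 6.9])   # in each condition
pre  = np.array([3.1, 2.8, 3.5, 3.0, 2.9])           # same participants,
post = np.array([3.9, 3.4, 4.1, 3.8, 3.6])           # measured twice

t_ind, p_ind = stats.ttest_ind(group_a, group_b)     # independent samples t-test
t_rel, p_rel = stats.ttest_rel(pre, post)            # paired samples t-test
u_stat, p_mwu = stats.mannwhitneyu(group_a, group_b) # ordinal counterpart (independent)
w_stat, p_wil = stats.wilcoxon(pre, post)            # ordinal counterpart (paired)
```

Note how the non-parametric tests, which only use ranks, are less powerful on the same data: with only 5 pairs the Wilcoxon test cannot reach p < .05 even when every participant improves.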

Types of t-tests cont. 2-tailed tests vs one-tailed tests
2-tailed tests vs one-tailed tests: if you set the alpha-level at 0.05, there is a 5% "area" into which a score may fall (outside where it is expected). 2-tailed tests allow this area to occur at both extreme ends, i.e. improvement may be either positive or negative (2.5% at either end). One-tailed tests assume instead that the 5% can only occur at one extreme end. Both tails are meaningful in behavioural studies, so two-tailed tests should be favoured.
2-sample t-tests vs 1-sample t-tests: 2-sample t-tests compare the means of 2 samples (both means are measured within the experiment). 1-sample t-tests compare the mean of one sample to a known value, often a (previously known) population mean. In fMRI: compare either the activation means of 2 different groups to each other, or see whether your one sample reaches a level of activation that you are expecting given some prior knowledge.

Comparison of more than 2 samples
Tell me the difference between these groups… Thank God I have ANOVA! In an experiment with more than 2 samples or more than 2 tasks (or 2 samples and 2 tasks), one could run lots of t-tests and compare all the different groups with each other this way, but doing so tremendously increases the possibility of falsely rejecting the null hypothesis (this is referred to as the familywise/experimentwise error rate). It is much better to use ANOVA.

Reporting convention: F= 65.58, df= 4,45, p< .001
ANOVA: ANalysis Of VAriance. It still compares the differences in means between groups, but it uses the variance of the data to "decide" whether the means are different.
Terminology (factors and levels). F-statistic: the magnitude of the difference between the different conditions. The p-value associated with F is the probability that the differences between groups could occur by chance if the null hypothesis is correct. Need for post-hoc testing (ANOVA can tell you if there is an effect but not where).
ANOVA is concerned with differences between the means of groups, not differences between variances. The name "analysis of variance" comes from the way the analysis uses variances to decide whether the means are different. A better acronym for this model would be ANOVASMAD (analysis of variance to see if means are different)! The way it works is simple: the program looks at the variation (variance) within the groups, then works out how that variation would translate into variation (i.e. differences) between the groups, taking into account how many subjects there are in the groups. If the observed differences are a lot bigger than what you'd expect by chance, you have statistical significance. (So if the patterns of data spread are similar in your different samples, the means won't be much different, i.e. the samples are probably from the same population; conversely, if the pattern of variance differs between groups, so will the means, so the samples are likely to be drawn from different populations.)
Terminology: Factors: the overall "things" being compared (e.g., age vs task). Levels: the different elements of a factor (young vs old AND naming vs reading aloud).
ANOVA tests for one overall effect only (this makes it an omnibus test), so it can tell us if the experimental manipulation was generally successful, but it doesn't provide specific information about which specific groups were affected: hence the need for post-hoc testing!
ANOVA produces an F-statistic or F-ratio, which is similar to the t-score in that it compares the amount of systematic variance in the data to the amount of unsystematic variance. As such, it is the ratio of the experimental effect to the individual differences in performance. If the F-ratio's value is less than 1, it must represent a non-significant effect (so you always want an F-ratio greater than 1, indicating that the experimental manipulation had some effect above and beyond the effect of individual differences in performance). To test for significance, compare the obtained F-ratio against the maximum value one would expect to get by chance alone in an F-distribution with the same degrees of freedom. The p-value associated with F is the probability that the differences between groups could occur by chance if the null hypothesis is correct.
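The "systematic over unsystematic variance" ratio can be computed by hand and checked against scipy. A sketch with three invented groups:

```python
# One-way ANOVA on invented groups: F is the between-group (systematic)
# mean square over the within-group (unsystematic) mean square, checked
# against scipy.stats.f_oneway.
import numpy as np
from scipy import stats

groups = [np.array([4.0, 4.5, 3.8, 4.2]),
          np.array([5.1, 5.6, 5.0, 5.3]),
          np.array([6.2, 5.9, 6.5, 6.0])]

grand_mean = np.mean(np.concatenate(groups))
n = sum(len(g) for g in groups)   # total observations
k = len(groups)                   # number of groups

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

f_scipy, p = stats.f_oneway(*groups)
```

A significant F here only says the three means are not all equal (the omnibus result); pinning down which groups differ still needs post-hoc tests, as the slide notes.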

Types of ANOVAs
Type: 2-way ANOVA for independent groups (between-subject design); repeated measures ANOVA (within-subject design); mixed ANOVA (both).

2-way ANOVA for independent groups:

| | Condition I | Condition II |
| --- | --- | --- |
| Task I | Participant group A | Participant group B |
| Task II | Participant group C | Participant group D |

Repeated measures ANOVA:

| | Condition I | Condition II |
| --- | --- | --- |
| Task I | Participant group A | Participant group A |
| Task II | Participant group A | Participant group A |

Mixed ANOVA:

| | Condition I | Condition II |
| --- | --- | --- |
| Task I | Participant group A | Participant group B |
| Task II | Participant group A | Participant group B |

The reasons for choosing one over another relate to feasibility and ease of recruiting (difficulties matching), wanting to investigate change in performance over time (developmental trajectories/recovery over time), and others.
NOTE: you may have more than 2 levels in each condition/task (the above are 2x2 ANOVAs). So you could compare across 4 different ages and compare action naming vs action reading vs object naming vs object reading (this would be a 4x4 ANOVA). There are also one-way ANOVAs, which test only one factor over various levels (e.g., naming activation at different ages).

BEWARE! Errors and the multiple comparison problem.
Type I errors: false positives, in other words identifying effects when in fact they don't exist.
Type II errors: false negatives, in other words not identifying effects when in fact they do exist.
These depend on the conservatism of the significance levels (see also the multiple comparison problem).
The multiple comparison problem is especially prominent in fMRI: given the very high number of statistical tests employed during voxel-wise analysis, one must control for multiple comparisons to reduce the likelihood of introducing false positives (detecting a difference/an effect when in fact it does not exist). One solution would be to apply Bonferroni correction, which adjusts the statistical threshold to a much lower p-value (to the scale of p< or similarly conservative). Whilst this indeed controls the occurrence of false positives, it also leads to very low statistical power, in other words it reduces the ability of a statistical test to actually detect an effect if it exists, due to the very conservative significance levels (Kimberg et al., 2007; Rorden et al., 2007; Rorden et al., 2009).
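The familywise error problem is easy to demonstrate by simulation. This sketch treats 1000 t-tests as a toy stand-in for voxel-wise analysis (the numbers are arbitrary), running every test on pure noise where the null hypothesis is true by construction:

```python
# Simulation of the familywise error rate: many tests on pure-noise data,
# counting false positives with and without Bonferroni correction.
# "1000 voxels" and the sample sizes are invented illustration values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, alpha = 1000, 0.05

# both "conditions" are pure noise: the null hypothesis is true everywhere
p_values = np.array([stats.ttest_ind(rng.normal(size=20),
                                     rng.normal(size=20)).pvalue
                     for _ in range(n_tests)])

false_pos_uncorrected = (p_values < alpha).sum()            # roughly 5% of tests
false_pos_bonferroni = (p_values < alpha / n_tests).sum()   # usually zero
```

Uncorrected thresholding "finds" dozens of effects that do not exist; the Bonferroni threshold eliminates them, at the cost of the reduced power the slide describes.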

SUMMARY
t-tests compare means between 2 samples and identify whether they are significantly/statistically different; they may compare two samples to each other OR one sample to a predefined value.
ANOVAs compare more than two samples, over various conditions (2x2, 2x3 or more); they investigate variances to establish whether means are significantly different.
Common statistical problems (errors, the multiple comparison problem).

PART 2
Correlation: how linear is the relationship between two variables? (descriptive)
Regression: how good is a linear model at explaining my data? (inferential)

Correlation: how much does the value of one variable depend on the value of the other?
[Scatterplots of Y against X: high positive correlation, poor negative correlation, no correlation]

How to describe correlation (1): covariance
The covariance is a statistic representing the degree to which 2 variables vary together (note that Sx² = cov(x,x)).

cov(x,y) = the mean of the products of each point's deviation from the mean values: cov(x,y) = (1/n) Σi (xi − x̄)(yi − ȳ)
Geometrical interpretation: the mean of the 'signed' areas of the rectangles defined by each point and the mean-value lines.
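The "mean of products of deviations" definition takes one line of numpy, and can be checked against numpy's own covariance function (which divides by n − 1 unless told otherwise). The data points are invented:

```python
# Covariance as the mean of products of deviations from the means,
# checked against numpy (ddof=0 matches the "mean of products" form).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov_manual = np.mean((x - x.mean()) * (y - y.mean()))
cov_numpy = np.cov(x, y, ddof=0)[0, 1]

# and cov(x, x) is just the variance of x, as the slide notes: Sx^2 = cov(x,x)
var_x = np.cov(x, x, ddof=0)[0, 0]
```

Here both x and y increase together, so every product of deviations is positive and the covariance comes out positive, matching the geometric "signed rectangles" picture.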

Sign of covariance = sign of correlation.
Positive correlation: cov > 0. Negative correlation: cov < 0. No correlation: cov ≈ 0.

Pearson correlation coefficient (r)
How to describe correlation (2): the Pearson correlation coefficient (r). r is a kind of 'normalised' (dimensionless) covariance: r = cov(x,y)/(SxSy), where S = the standard deviation of the sample. r takes values from −1 (perfect negative correlation) to 1 (perfect positive correlation); r = 0 means no correlation.
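The normalisation can be verified directly: dividing the covariance by the two standard deviations reproduces what scipy reports. A sketch with invented, strongly linear data:

```python
# Pearson's r as covariance normalised by the two standard deviations,
# checked against scipy.stats.pearsonr. Data points are invented.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.9])

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
r_manual = cov_xy / (x.std() * y.std())   # ddof=0 throughout, so the n's cancel

r_scipy, p = stats.pearsonr(x, y)
```

Because numerator and denominator use the same ddof, the normalising factors cancel and `r_manual` equals scipy's `r_scipy`; being dimensionless, r is unchanged if x or y is rescaled.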

Pearson correlation coefficient (r)
Problems: r is sensitive to outliers. Also, r is an estimate from the sample; does it represent the population parameter?
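The outlier sensitivity is dramatic enough to show in a few lines. The points are invented: a near-perfect line, then the same line with one aberrant point appended:

```python
# One outlier is enough to wreck r: a near-perfect line with and without
# a single aberrant point. All values are invented for illustration.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.0, 2.9, 4.1, 5.0])
r_clean, _ = stats.pearsonr(x, y)        # essentially perfect positive r

x_out = np.append(x, 6.0)
y_out = np.append(y, -20.0)              # one extreme outlier
r_outlier, _ = stats.pearsonr(x_out, y_out)
```

One point flips a correlation of nearly +1 to a negative value, which is why outlier checks (or rank-based alternatives such as Spearman's rho) matter before interpreting r.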

Linear regression
Regression: prediction of one variable from knowledge of one or more other variables. How good is a linear model (y = ax + b) at explaining the relationship between two variables? If there is such a relationship, we can 'predict' the value of y for a given x. But what error might we be making?

Preliminaries: linear dependence between 2 variables
Two variables are linearly dependent when the increase of one variable is proportional to the increase of the other.
Examples: the energy needed to boil water; the money needed to buy coffeepots.

The equation y = mx + n that connects both variables has two parameters:
'm' is the unit increase/decrease of y (how much y increases or decreases when x increases by one unit);
'n' is the value of y when x is zero (often zero).
Examples: 'm' = the energy needed to boil one litre of water, 'n' = 0; 'm' = the price of one coffeepot, 'n' = a fixed tax/commission to add.

Fitting data to a straight line (or vice versa):
Here, ŷ = ax + b, where ŷ is the predicted value of y, a is the slope of the regression line and b is the intercept.
Residual error (εi): the difference between the obtained and predicted values of y, i.e. εi = yi (observed) − ŷi (predicted).
The best-fit line (the values of a and b) is the one that minimises the sum of squared errors: SSerror = Σ(yi − ŷi)².

Adjusting the straight line to the data:
Minimise Σ(yi − ŷi)², which is Σ(yi − axi − b)². The minimum of SSerror is at the bottom of the curve, where the gradient is zero, and this can be found with calculus: take the partial derivatives of Σ(yi − axi − b)² with respect to the parameters a and b and solve for 0 as simultaneous equations, giving a = cov(x,y)/Sx² and b = ȳ − ax̄. This calculation can always be done, whatever the data!!

How good is the model? We can calculate the regression line for any data, but how well does it fit the data? Total variance = predicted variance + error variance: Sy² = Sŷ² + Ser². Also, it can be shown that r² is the proportion of the variance in y that is explained by our regression model: r² = Sŷ²/Sy². Inserting Sŷ² = r²Sy² into Sy² = Sŷ² + Ser² and rearranging gives Ser² = Sy²(1 − r²). From this we can see that the greater the correlation, the smaller the error variance, and so the better our prediction.
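All three identities hold exactly for any fitted line, and can be verified numerically on invented data:

```python
# Numerical check of the variance decomposition Sy^2 = Syhat^2 + Serr^2,
# of r^2 = Syhat^2/Sy^2, and of Serr^2 = Sy^2 (1 - r^2). Data invented.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([1.5, 2.1, 3.8, 3.9, 5.6, 6.1, 6.8])

a, b = np.polyfit(x, y, 1)
y_hat = a * x + b              # predicted values
resid = y - y_hat              # residual errors

s2_y, s2_yhat, s2_err = y.var(), y_hat.var(), resid.var()
r = np.corrcoef(x, y)[0, 1]
```

With an intercept in the model the residuals have mean zero, which is why the decomposition is exact rather than approximate.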

Is the model significant?
I.e., do we get a significantly better prediction of y from our regression equation than by just predicting the mean?
F-statistic: F(dfŷ, dfer) = Sŷ²/Ser², and after some complicated rearranging this becomes F(1, n − 2) = r²(n − 2)/(1 − r²), so that t(n − 2) = r√(n − 2)/√(1 − r²). So all we need to know are r and n!!!
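The claim that r and n are all we need can be checked: the t-statistic built from r and n alone reproduces the p-value scipy reports for the correlation itself. The data are invented:

```python
# "All we need to know are r and n": the two-tailed p from
# t(n-2) = r*sqrt(n-2)/sqrt(1-r^2) matches scipy's pearsonr p-value.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 2.9, 4.1, 4.4, 6.2, 6.0, 7.9, 8.4])
n = len(x)

r, p_scipy = stats.pearsonr(x, y)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)   # t with n-2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t), df=n - 2)  # two-tailed p from that t
```

Testing the slope of the simple regression and testing the correlation coefficient are therefore the same test.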

Generalization to multiple variables
Multiple regression is used to determine the effect of a number of independent variables, x1, x2, x3, etc., on a single dependent variable, y. The different x variables are combined in a linear way and each has its own regression coefficient: y = b0 + b1x1 + b2x2 + … + bnxn + ε. The b parameters reflect the independent contribution of each independent variable x to the value of the dependent variable y, i.e. the amount of variance in y that is accounted for by each x variable after all the other x variables have been accounted for.
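A minimal sketch of fitting such a model: simulated data with known coefficients (b0 = 1, b1 = 2, b2 = −3 are our invented ground truth), recovered by least squares on a design matrix with an intercept column:

```python
# Multiple regression via least squares: two invented regressors plus an
# intercept column of ones; the fitted b's recover the known coefficients.
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.1, size=n)  # known truth + noise

# design matrix: one column per regressor, plus the intercept (b0)
X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)   # b = [b0, b1, b2]
```

Each fitted coefficient reflects the contribution of its regressor with the others held fixed, which is the "after all the other x variables have been accounted for" reading above.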

Geometric view, 2 variables:
'Plane' of regression: the plane nearest to all the sample points distributed over a 3D space: y = b0 + b1x1 + b2x2 + ε, with predicted values ŷ = b0 + b1x1 + b2x2.

Multiple regression in SPM:
y: the voxel value. x1, x2, …: parameters that are supposed to explain the variation in y (regressors).
GLM: given a set of values yi (the voxel value at a given position for a sample of images) and a set of explanatory variables xi (group, factors, age, TIV, … for VBM, or condition, movement parameters, … for fMRI), find the (hyper)plane nearest to all the points. The coefficients defining the plane are named b1, b2, …, bn.
Equation: y = b0 + b1x1 + b2x2 + … + bnxn + ε

Matrix representation and results:

Last remarks: correlated doesn't mean related.
E.g., any two variables increasing or decreasing over time would show a nice correlation: the CO2 concentration in the air over Antarctica and lodging rental costs in London. Beware in longitudinal studies!!!
A relationship between two variables doesn't imply causality (e.g. leaves on the forest floor and hours of sun).
cov(x,y) = 0 doesn't mean that x and y are independent (it does for a linear relationship, but the relationship could be quadratic, …).
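The last point is easy to make concrete with the classic quadratic case (our own example values):

```python
# cov(x,y) = 0 does not imply independence: y = x^2 on a symmetric range
# is completely determined by x, yet its covariance with x is exactly zero.
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2                                  # perfectly dependent on x

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
```

The positive products on the right half of the range cancel the negative products on the left half exactly, so the linear measure sees nothing even though y is a deterministic function of x.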