Presentation on theme: "Carles Falcon & Suz Prejawa"— Presentation transcript:
1 t-tests, ANOVAs & Regression and their application to the statistical analysis of neuroimaging. Carles Falcon & Suz Prejawa
2 OVERVIEW
Part 1: Basics, populations and samples; t-tests; ANOVA; Beware!; Summary
Part 2: Correlation and regression
3 Basics
Hypotheses: H0 = null hypothesis; H1 = experimental/research hypothesis
Descriptive vs inferential statistics
(Gaussian) distributions
p-value & alpha-level (probability and significance)

Hypotheses
The experimental hypothesis is essentially a prediction that you have about a set of data, an event, a group or a topic, and this prediction must be falsifiable.
Usually, an experiment is aimed at rejecting the null hypothesis.
The null hypothesis is the opposite of the experimental hypothesis: usually you expect a difference or an effect in your experiment, so the null hypothesis claims that this effect does not exist.
With regard to fMRI, this relates to activation being expected and observed or, if the null hypothesis is true, not observed.
Example: activation in the left occipitotemporal regions, esp the visual word form area (VWFA), is greatest for written words (visual word forms). See Cohen and Dehaene (2004), NeuroImage.

Descriptive stats:
- Allow one to summarise data (which are essentially huge amounts of numbers)
- Allow one to grasp the essential features of the data quickly and easily
- Often in image form
- Mean, median, mode, SDs, histograms, etc

Inferential* stats:
- Go beyond the pure data
- Inform about the likelihood of the findings, ie whether the effects are genuine and due to the experimental manipulation or occurred simply by chance
- Are possible because research data are rarely random (ie there usually is a similar pattern in the distribution)
- 2 types: distribution tests (t-tests and ANOVAs) and correlation tests
* inference: the act or process of deriving a conclusion based on what one already knows

VWFA example: activation in the VWFA is present
- when reading is compared to a rest condition or false fonts (unknown script),
- when pictures of objects are named (eg, tiger) relative to a resting condition,
- when picture naming is compared to reading aloud those exact object labels (eg, naming the picture of a tiger versus reading the word “tiger”),
- when colours are named,
- when an action associated with an object (shown as a picture) is carried out (eg, moving fingers quickly along an imaginary board to illustrate touch typing when presented with a picture of a touch typing machine),
- when “reading” Braille with abstract meaning,
- when seeing novel objects (previously unknown and thus without any kind of word label attached to them)
BOLD signal intensity can be measured for these conditions and the values can be listed (the number crunching = descriptive stats).
But the question really is: if there is a numerical difference in these values, is this difference meaningful? Inferential stats can tell you!

Probability
- The p-value is the probability of obtaining data at least as extreme as those observed if the null hypothesis were true (it is not the probability that the null hypothesis itself is true)
- Expressed as a positive decimal fraction, eg .1 (1/10), .05 (5/100), .01 (1/100)
- Probability can never be higher than 1 because a probability of 1 means that something happens every single time
- Expressed as a p-value (a single number)

Significance (alpha-level)
- Closely related to probability; the significance level is the threshold used to decide whether differences or correlations found in the data are unlikely to be due to chance
- Statistical significance is not the same as importance: an effect with a small p-value is not necessarily important, and very small effects can turn out to be statistically significant (the latter is often true for huge sample sizes, esp in medical research)
- Expressed as a probability threshold; often set at p < .05 (though this may vary; in fMRI it is often lower)
- Attaches statistical meaning to a number; significance levels are LOW probabilities
- If p < α, we reject the null hypothesis in favour of the experimental hypothesis: with α = .05, a result at least this extreme would be expected less than 5% of the time if the null hypothesis were true
- If p ≥ α, we fail to reject the null hypothesis: there was no significant difference in brain activation levels between the two conditions being compared (this does not prove that the null hypothesis is true)
4 Populations and samples
z-tests and distributions: populations require z-tests
t-tests and distributions: samples (of a population) require t-tests
NOTE: a sample can be 2 sets of scores, eg fMRI data from 2 conditions
General: hypothesis testing relates to POPULATIONS, not samples. Because we usually only test/study samples, we need to use sample means to make inferences about population means. The t-distribution has to be used for samples (it is similar to the z-distribution in that it is symmetrical, but it is flatter and changes with sample size).
Z-tests and tables: used with normally distributed population data (see above)
T-tests and tables: used with normally distributed sample data
5 Comparison between Samples
Usually, you only have access to samples, which means you never capture a population as a whole.
You need to be careful that your sample is representative of your population.
Are these groups different?
6 Comparison between Conditions (fMRI)
Is the activation different when you compare 2 different conditions?
Exp. 1: reading aloud (script) vs “reading” finger spelling (sign)
Exp. 2: picture naming vs reading aloud those exact object labels (eg, naming the picture of a tiger versus reading the word “tiger”)
7 t-tests
Compare the mean between 2 samples/conditions
[Figure: mean activation with 95% CI for Exp. 1 and Exp. 2; labels: left hemisphere, right hemisphere, lesion site]
- if 2 samples are taken from the same population, they should have fairly similar means
- if 2 means are statistically different, the samples are likely to be drawn from 2 different populations, ie they really are different
A t-test assesses whether the means of two samples are statistically different from each other. This analysis is appropriate whenever you want to compare the means of two samples/conditions.
Mean: arithmetic average
- a hypothetical value that can be calculated for a data set; it doesn’t have to be a value that is actually observed in the data set
- calculated by adding up all scores and dividing by the number of scores
Assumptions of a t-test:
- from a parametric population
- not (seriously) skewed
- no outliers
- independent samples
8 t-test in VWFA
Exp. 1: reading script (blue) is compared to “reading” finger spelling (sign) (green). Activation patterns are similar, not significantly different: both tasks are essentially the same, they are “reading” a word (but use different modalities) and recruit the VWFA in a similar way.
Exp. 2: picture naming (green) is compared to reading aloud (blue) those exact object labels (eg, naming the picture of a tiger versus reading the word “tiger”). Activation patterns are very (and significantly) different: reading aloud causes significantly stronger activation in the VWFA than naming. They are different tasks and the VWFA is more strongly involved in reading (specialised?).
9 Formula
t = difference between the means divided by the pooled standard error of the mean
t-value: the end product of the calculation
df = degrees of freedom (the number of individual scores that can vary without changing the sample mean)
Standard error
- Is the standard deviation of sample means
- Is a measure of how representative a sample is likely to be of the population
- Large SE (relative to the sample mean): lots of variability between the means of different samples, so the sample used may not be representative of the population
- Small SE: most sample means are similar to the population mean, so the sample is likely to be an accurate reflection of the population
Reporting convention: t = 11.456, df = 9, p < 0.001
10 Formula cont.
I admit I stole this from last year’s presentation: you may read this at your own leisure.
To calculate the t score:
- Take the mean from condition 1 (x bar)
- Then the mean from condition 2
- Find the difference between these two
- And divide this by their shared standard error
To calculate the shared standard error:
- Take the variance for group 1 and divide it by the sample size for this group
- Do the same for group 2 and add the two together
- Finally, take the square root of this and put the resulting value back into the original calculation to get your t value
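The steps above can be sketched in a few lines of Python. This is only an illustration: the function name `t_score` and the two lists of scores are made up, and the sketch assumes sample variances (n - 1 denominator), which the slide does not specify.

```python
from math import sqrt
from statistics import mean, variance

def t_score(cond1, cond2):
    """t = (mean1 - mean2) / sqrt(var1/n1 + var2/n2)."""
    # Shared standard error: variance / sample size for each group, summed, then square-rooted
    shared_se = sqrt(variance(cond1) / len(cond1) + variance(cond2) / len(cond2))
    return (mean(cond1) - mean(cond2)) / shared_se

# Hypothetical activation values, for illustration only
reading = [5.0, 6.0, 7.0, 8.0, 9.0]
naming = [2.0, 3.0, 4.0, 5.0, 6.0]
t = t_score(reading, naming)
print(t)
```

The resulting t value would then be compared against the t-distribution with the appropriate degrees of freedom to obtain a p-value.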
11 Types of t-tests
                               Independent samples           Related samples (also called dependent means test)
Interval measures/parametric   Independent samples t-test*   Paired samples t-test**
Ordinal/non-parametric         Mann-Whitney U-test           Wilcoxon test
* 2 experimental conditions and different participants were assigned to each condition
** 2 experimental conditions and the same participants took part in both conditions of the experiment
There are lots of different types of t-tests, which need to be used depending on the type of data you have.
(Equal) interval measures
- Scales in which the difference between consecutive measuring points on the scale is of equal value throughout
- No true zero (the zero point is arbitrary), so positive and negative values are possible, eg temperature
Ordinal measures
- Scales on which the items can be ranked in order
- There is an order of magnitude but intervals may vary, ie one item on the scale is more or less than another but it is not clear by how much, as this cannot be measured
- Often statements/feelings are attached to numbers which can then be used for rating; in fact, these data can only be ranked (from highest to lowest), ie which score had the highest turn-out
- Eg: 1 = very good, 2 = good, 3 = neutral, 4 = bad, 5 = very bad
Different measurements have a direct influence on the way analysis is conducted because some of them are more amenable to mathematical operations than others.
12 Types of t-tests cont.
2-tailed tests vs one-tailed tests
- If you set the alpha-level at 0.05, there is a 5% “area” in which a score would fall (outside where it is expected)
- 2-tailed tests leave room for this area to occur at both extreme ends, ie the effect may be either positive or negative (2.5% at either end)
- One-tailed tests deny this and assume that the 5% can only occur at one extreme end
- Both tails are meaningful in behavioural studies, so 2-tailed tests should be favoured
[Figure: distributions around the mean with 2.5% in each tail (2-tailed) and 5% in one tail (one-tailed)]
2 sample t-tests vs 1 sample t-tests
- 2 sample t-tests compare the means of 2 samples (both means are measured within the experiment)
- 1 sample t-tests compare the mean of one sample to a known value, often a (previously known) population mean
- In fMRI: either compare the activation means of 2 different groups to each other, or see if your one sample reaches a level of activation that you are expecting given some prior knowledge
[Figure: a sample mean compared with a known value]
13 Comparison of more than 2 samples
[Cartoon: “Tell me the difference between these groups…” “Thank God I have ANOVA”]
In an experiment with more than 2 samples or more than 2 tasks (or 2 samples and 2 tasks), one could run lots of t-tests and compare all the different groups with each other, but this greatly increases the probability of falsely rejecting the null hypothesis at least once (this is referred to as the familywise/experimentwise error rate). It is much better to use ANOVA.
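To get a feel for how quickly the familywise error rate grows, here is a minimal sketch. It assumes the individual t-tests are independent, which is a simplification (pairwise comparisons on the same groups are not), and the function name is made up for illustration.

```python
from math import comb

def familywise_error_rate(n_groups, alpha=0.05):
    """Number of pairwise tests and P(at least one false positive), assuming independent tests."""
    n_tests = comb(n_groups, 2)           # every group compared with every other group
    fwer = 1 - (1 - alpha) ** n_tests     # 1 - P(no false positive in any test)
    return n_tests, fwer

n_tests, fwer = familywise_error_rate(4)
print(n_tests, round(fwer, 3))
```

Under these assumptions, 4 groups already need 6 pairwise t-tests, and the chance of at least one false positive rises well above the nominal 5%.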
14 ANOVA in VWFA (2x2)
Is activation in the VWFA a) different for naming and reading, b) influenced by age and, if so (a + b), how?
Design: TASK (naming vs reading aloud) x AGE (young vs old)
Activation in the VWFA in picture naming (green) is compared to activation in the VWFA in reading aloud (blue) those exact object labels (eg, naming the picture of a tiger versus reading the word “tiger”), and activations in both tasks are compared at different ages (10 years vs 50 years).
H1 & H0: Activation is significantly different for reading vs activation is not significantly different for reading.
H2 & H0: Activation is significantly different in older people vs activation is not significantly different in older people.
H3 & H0: Activation is significantly different in older people in reading vs activation is not significantly different in older people in reading.
Result: reading causes significantly stronger activation in the VWFA but only in the older group, so the VWFA is more strongly activated during reading but this seems to be affected by age (related to reading skill?)
15 ANOVA
ANalysis Of VAriance (ANOVA)
- Still compares the differences in means between groups but it uses the variance of the data to “decide” if the means are different
- Terminology (factors and levels)
- F-statistic: magnitude of the difference between the different conditions
- The p-value associated with F is the probability that differences between groups at least this large could occur by chance if the null hypothesis is correct
- Need for post-hoc testing (ANOVA can tell you if there is an effect but not where)
- Reporting convention: F = 65.58, df = 4,45, p < .001
ANOVA is concerned with differences between means of groups, not differences between variances. The name analysis of variance comes from the way the analysis uses variances to decide whether the means are different. A better acronym for this model would be ANOVASMAD (analysis of variance to see if means are different)! The way it works is simple: the program looks to see what the variation (variance) is within the groups, then works out how that variation would translate into variation (ie differences) between the groups, taking into account how many subjects there are in the groups. If the observed differences are a lot bigger than what you'd expect by chance, you have statistical significance. (So if the group means differ by about as much as the within-group variation would predict, the samples are probably from the same population; conversely, if the group means differ by much more than that, the samples are likely to be drawn from different populations.)
Terminology:
- Factors: the overall “things” being compared (eg, age and task)
- Levels: the different elements of a factor (young vs old AND naming vs reading aloud)
ANOVA tests for one overall effect only (this makes it an omnibus test), so it can tell us if the experimental manipulation was generally successful but it doesn’t provide specific information about which specific groups were affected. Hence the need for post-hoc testing!
ANOVA produces an F-statistic or F-ratio, which is similar to the t-score as it compares the amount of systematic variance in the data to the amount of unsystematic variance. As such, it is the ratio of the experimental effect to the individual differences in performance. If the F-ratio’s value is less than 1, it must represent a non-significant effect (so you always want an F-ratio greater than 1, indicating that the experimental manipulation had some effect above and beyond the effect of individual differences in performance). To test for significance, compare the obtained F-ratio against the maximum value one would expect to get by chance alone in an F-distribution with the same degrees of freedom.
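The F-ratio described above (systematic, between-group variance divided by unsystematic, within-group variance) can be sketched for the simplest case, a one-way ANOVA. The function name and the three small groups are made up for illustration; a real analysis would use a statistics package and also report the p-value.

```python
from statistics import mean

def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    grand_mean = mean(x for g in groups for x in g)
    k = len(groups)                        # number of groups (levels of the factor)
    n = sum(len(g) for g in groups)        # total number of scores
    # Systematic variation: how far each group mean lies from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Unsystematic variation: spread of the scores around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical scores for three groups
f, df_b, df_w = one_way_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f, df_b, df_w)
```

An F well above 1, as in this toy example, indicates that the group means differ by more than the within-group spread alone would predict.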
16 Types of ANOVAs
2-way ANOVA for independent groups (between-subject design):
          Condition I           Condition II
Task I    Participant group A   Participant group B
Task II   Participant group C   Participant group D
Repeated measures ANOVA (within-subject design):
          Condition I           Condition II
Task I    Participant group A   Participant group A
Task II   Participant group A   Participant group A
Mixed ANOVA (both between- and within-subject):
          Condition I           Condition II
Task I    Participant group A   Participant group B
Task II   Participant group A   Participant group B
The reasons for choosing one over another relate to feasibility and ease of recruiting (difficulties matching), wanting to investigate change in performance over time (developmental trajectories/recovery over time) and others.
NOTE: You may have more than 2 levels in each condition/task (the above are 2x2 ANOVAs). So you could compare across 4 different ages and compare action naming vs action reading vs object naming vs object reading (this would be a 4x4 ANOVA).
There are also one-way ANOVAs which only test one factor over various levels (eg, naming activation at different ages).
17 BEWARE!
Errors
- Type I errors: false positives, in other words identifying effects when in fact they don’t exist
- Type II errors: false negatives, in other words not identifying effects when in fact they exist
- These depend on the conservatism of the significance levels (see also the multiple comparison problem)
Multiple comparison problem
- Esp prominent in fMRI: given the very high number of statistical tests which are employed during voxel-wise analysis, one must control for multiple comparisons to reduce the likelihood of introducing false positives (detecting a difference/an effect when in fact it does not exist).
- One solution would be to apply Bonferroni-correction, which adjusts the statistical threshold to a much lower p-value (to the scale of p< or similarly conservative).
- Whilst this indeed controls the occurrence of false positives, it also leads to very low statistical power; in other words, it reduces the ability of a statistical test to actually detect an effect if it exists, due to the very conservative significance levels (Kimberg et al, 2007; Rorden et al, 2007; Rorden et al, 2009).
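As a rough illustration of why the Bonferroni threshold becomes so conservative in voxel-wise analysis, the sketch below divides the desired familywise alpha by the number of tests. The voxel count is a made-up round number, not a figure from the presentation.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the familywise error rate at or below alpha."""
    return alpha / n_tests

n_voxels = 100_000  # hypothetical number of voxels tested
threshold = bonferroni_threshold(0.05, n_voxels)
print(threshold)
```

Each individual voxel would then have to reach this much smaller p-value to be declared significant, which is where the loss of statistical power comes from.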
18 SUMMARY
- t-tests compare means between 2 samples and identify if they are significantly/statistically different
- They may compare two samples to each other OR one sample to a predefined value
- ANOVAs compare more than two samples, over various conditions (2x2, 2x3 or more)
- They investigate variances to establish if means are significantly different
- Common statistical problems (errors, multiple comparison problem)
19 PART 2
Correlation: how linear is the relationship between two variables? (descriptive)
Regression: how well does a linear model explain my data? (inferential)
20 Correlation: how much does the value of one variable depend on the value of the other?
[Figure: scatter plots of Y against X showing high positive correlation, poor negative correlation and no correlation]
21 How to describe correlation (1): Covariance
The covariance is a statistic representing the degree to which 2 variables vary together (note that $S_x^2 = \mathrm{cov}(x,x)$)
22 cov(x,y) = mean of the products of each point's deviations from the mean values
Geometrical interpretation: mean of the ‘signed’ areas of the rectangles defined by the points and the mean-value lines
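The definition above translates almost word for word into code. This sketch divides by n (a plain mean of the products), which is an assumption; many textbooks and packages divide by n - 1 for a sample estimate. The data are invented.

```python
from statistics import mean

def covariance(x, y):
    """Mean of the products of each point's deviations from the two means (divides by n)."""
    mx, my = mean(x), mean(y)
    return mean((xi - mx) * (yi - my) for xi, yi in zip(x, y))

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
print(covariance(x, y))        # positive: y rises with x
print(covariance(x, y[::-1]))  # negative: y falls as x rises
print(covariance(x, x))        # covariance of x with itself is its variance
```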
23 Sign of covariance = sign of correlation
Positive correlation: cov > 0
Negative correlation: cov < 0
No correlation: cov ≈ 0
[Figure: scatter plots of Y against X for each case]
24 How to describe correlation (2): Pearson correlation coefficient (r)
r is a kind of ‘normalised’ (dimensionless) covariance: $r = \mathrm{cov}(x,y) / (S_x S_y)$ (S = st dev of sample)
r takes values from -1 (perfect negative correlation) to 1 (perfect positive correlation). r = 0 means no correlation
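A minimal sketch of r as a normalised covariance follows. The function name and data are illustrative; the normalising constants (n or n - 1) cancel between numerator and denominator, so only sums of deviations are needed.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data lying exactly on a line
x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])   # perfect positive correlation
r_neg = pearson_r(x, [10, 8, 6, 4, 2])   # perfect negative correlation
print(r_pos, r_neg)
```

Because r is dimensionless, rescaling either variable (eg changing units) leaves it unchanged, unlike the covariance.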
25 Pearson correlation coefficient (r)
Problems:
- It is sensitive to outliers
- r is an estimate from the sample, but does it represent the population parameter?
26 Linear regression
Regression: prediction of one variable from knowledge of one or more other variables
How good is a linear model (y = ax + b) at explaining the relationship between two variables?
If there is such a relationship, we can ‘predict’ the value of y for a given x. But how large an error might we be making?
[Figure: regression line with the example point (25, 7.498)]
27 Preliminaries: linear dependence between 2 variables
Two variables are linearly dependent when the increase of one variable is proportional to the increase of the other one
Examples:
- Energy needed to boil water
- Money needed to buy coffeepots
[Figure: straight line, y against x]
28 The equation y = mx + n that connects both variables has two parameters:
- ‘m’ is the unit increase/decrease of y (how much y increases or decreases when x increases by one unit)
- ‘n’ is the value of y when x is zero (often zero)
Examples:
- ‘m’ = energy needed to boil one litre of water, ‘n’ = 0
- ‘m’ = price of one coffeepot, ‘n’ = fixed tax/commission to add
[Figure: line with slope m and intercept n]
29 Fitting data to a straight line (or vice versa)
Here, $\hat{y} = ax + b$
- $\hat{y}$: predicted value of y
- a: slope of the regression line
- b: intercept
- $y_i$: observed value; $\hat{y}_i$: predicted value; $\varepsilon_i$: residual
Residual error ($\varepsilon_i$): difference between obtained and predicted values of y (ie $y_i - \hat{y}_i$)
The best-fit line (values of a and b) is the one that minimises the sum of squared errors (SSerror): $\sum_i (y_i - \hat{y}_i)^2$
30 Adjusting the straight line to data
Minimise $\sum_i (y_i - \hat{y}_i)^2$, which is $\sum_i (y_i - a x_i - b)^2$
The minimum SSerror is at the bottom of the curve, where the gradient is zero, and this can be found with calculus
Take the partial derivatives of $\sum_i (y_i - a x_i - b)^2$ with respect to the parameters a and b, set them to zero and solve the resulting simultaneous equations for a and b
This calculation can always be done, whatever the data!
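For reference, the standard result of this minimisation is written out below. It is stated here as the textbook least-squares solution, on the assumption that it matches what the slide showed; the symbols $\bar{x}$, $\bar{y}$ (sample means) and $S_x$, $S_y$ (sample standard deviations) follow the notation used on the neighbouring slides.

```latex
\begin{align}
\frac{\partial}{\partial a}\sum_i (y_i - a x_i - b)^2 &= -2\sum_i x_i\,(y_i - a x_i - b) = 0 \\
\frac{\partial}{\partial b}\sum_i (y_i - a x_i - b)^2 &= -2\sum_i (y_i - a x_i - b) = 0 \\
a &= \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
   = \frac{\mathrm{cov}(x,y)}{S_x^2} = r\,\frac{S_y}{S_x},
\qquad b = \bar{y} - a\,\bar{x}
\end{align}
```

The second condition says the residuals sum to zero, which is why the fitted line always passes through the point $(\bar{x}, \bar{y})$.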
31 How good is the model?
We can calculate the regression line for any data, but how well does it fit the data?
Total variance = predicted variance + error variance: $S_y^2 = S_{\hat{y}}^2 + S_{er}^2$
Also, it can be shown that $r^2$ is the proportion of the variance in y that is explained by our regression model: $r^2 = S_{\hat{y}}^2 / S_y^2$
Substitute $S_{\hat{y}}^2 = r^2 S_y^2$ into $S_y^2 = S_{\hat{y}}^2 + S_{er}^2$ and rearrange to get: $S_{er}^2 = S_y^2 (1 - r^2)$
From this we can see that the greater the correlation, the smaller the error variance, so the better our prediction
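The variance decomposition can be checked numerically on a toy data set. This is a sketch under assumptions: the data are invented, the helper `fit_line` is illustrative, and population variances (`pvariance`) are used throughout so that the three terms share the same denominator.

```python
from statistics import mean, pvariance

def fit_line(x, y):
    """Least-squares slope a and intercept b for y ≈ a*x + b."""
    mx, my = mean(x), mean(y)
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    return a, my - a * mx

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = fit_line(x, y)
predicted = [a * xi + b for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

var_y = pvariance(y)              # total variance
var_pred = pvariance(predicted)   # variance explained by the line
var_err = pvariance(residuals)    # error variance
r_squared = var_pred / var_y
print(a, b, var_y, var_pred + var_err, r_squared)
```

The printed total variance should equal the sum of the predicted and error variances (up to rounding), and the error variance should equal `var_y * (1 - r_squared)`.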
32 Is the model significant?
Ie do we get a significantly better prediction of y from our regression equation than by just predicting the mean?
F-statistic: $F(df_{\hat{y}}, df_{er}) = \dfrac{s_{\hat{y}}^2}{s_{er}^2} = \dots \text{(complicated rearranging)} \dots = \dfrac{r^2 (n - 2)}{1 - r^2}$
And it follows that: $t(n-2) = \dfrac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}$
So all we need to know are r and n!
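Since only r and n are needed, both statistics are easy to compute. The sketch below uses the standard forms F = r²(n - 2)/(1 - r²) and t = r√(n - 2)/√(1 - r²); the values of r and n are made up, and turning F or t into a p-value would additionally need the F- or t-distribution (not shown).

```python
from math import sqrt

def f_from_r(r, n):
    """F statistic with (1, n - 2) degrees of freedom for a simple linear regression."""
    return r ** 2 * (n - 2) / (1 - r ** 2)

def t_from_r(r, n):
    """t statistic with n - 2 degrees of freedom; its square equals F."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

r, n = 0.8, 27  # hypothetical correlation and sample size
f_value = f_from_r(r, n)
t_value = t_from_r(r, n)
print(f_value, t_value, t_value ** 2)
```

That t squared equals F is a useful consistency check between the two formulas.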
33 Generalization to multiple variables
Multiple regression is used to determine the effect of a number of independent variables, x1, x2, x3 etc., on a single dependent variable, y.
The different x variables are combined in a linear way and each has its own regression coefficient:
$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n + \varepsilon$
The b parameters reflect the independent contribution of each independent variable, x, to the value of the dependent variable, y, ie the amount of variance in y that is accounted for by each x variable after all the other x variables have been accounted for.
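34 Geometric view, 2 variables
‘Plane’ of regression: the plane nearest all the sample points distributed over a 3D space (axes x1, x2, y)
Model: $y = b_0 + b_1 x_1 + b_2 x_2 + \varepsilon$
Fitted plane: $\hat{y} = b_0 + b_1 x_1 + b_2 x_2$, with $\varepsilon$ the residual between each point and the plane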
35 Multiple regression in SPM
y: voxel value
x1, x2, …: parameters that are supposed to explain the variation in y (regressors)
GLM: given a set of values yi (the voxel value at a given position for a sample of images) and a set of explanatory variables xi (group, factors, age, TIV, … for VBM, or condition, movement parameters, … for fMRI), find the (hyper)plane nearest all the points. The coefficients defining the plane are named b1, b2, …, bn.
Equation: $y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n + \varepsilon$
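To make "find the (hyper)plane nearest all the points" concrete, here is a small pure-Python sketch that solves the least-squares normal equations for one voxel. It is not SPM's implementation: the function name `fit_plane`, the regressor values and the voxel values are all invented for illustration, and a real analysis would rely on a numerical library.

```python
def fit_plane(rows, y):
    """Least-squares coefficients [b0, b1, ...] for y ≈ b0 + b1*x1 + ..., via (X'X) b = X'y."""
    X = [[1.0] + list(r) for r in rows]   # prepend the constant regressor for b0
    p = len(X[0])
    n = len(X)
    # Augmented matrix [X'X | X'y]
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
         + [sum(X[k][i] * y[k] for k in range(n))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [v - factor * w for v, w in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

# Hypothetical regressors (x1, x2) and noise-free voxel values generated as y = 1 + 2*x1 + 3*x2
regressors = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
voxel_values = [1, 3, 4, 6, 8]
coefficients = fit_plane(regressors, voxel_values)
print(coefficients)
```

Because the toy data contain no noise, the recovered coefficients should match the generating values; with real data the residual ε would be non-zero.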
37 Last remarks
- Correlated doesn’t mean related. Eg, any two variables that increase or decrease over time will show a nice correlation: CO2 air concentration in Antarctica and lodging rental cost in London. Beware in longitudinal studies!
- A relationship between two variables doesn’t mean causality (eg, leaves on the forest floor and hours of sun)
- cov(x,y) = 0 doesn’t mean that x and y are independent (it rules out a linear relationship, but the relationship could be quadratic, …)
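The last point can be illustrated with a tiny made-up example in which y is completely determined by x (y = x²) and yet the covariance is exactly zero. The helper divides by n, as an assumption.

```python
from statistics import mean

def covariance(x, y):
    """Mean of the products of deviations from the means (divides by n)."""
    mx, my = mean(x), mean(y)
    return mean((xi - mx) * (yi - my) for xi, yi in zip(x, y))

x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]   # y depends entirely on x, but not linearly
cov_xy = covariance(x, y)
print(cov_xy)
```

The positive and negative products cancel because the relationship is symmetric around the mean of x, so a zero covariance here hides a perfect quadratic dependence.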