Presentation on theme: "Statistics – Modelling Your Data" — Presentation transcript:
1. Statistics – Modelling Your Data (Chris Rorden)
Modelling data: Signal, Error and Covariates
Parametric Statistics
Thresholding Results: statistical power and statistical errors
The multiple comparison problem
Familywise error and Bonferroni thresholding
Permutation thresholding
False Discovery Rate thresholding
Implications: null results uninterpretable
2. The fMRI signal
Last lecture: we predict that areas involved with a task will become brighter (after a delay). Therefore, we expect that if someone repeatedly does a task for 12 seconds and rests for 12 seconds, our signal should look like this:
3. Calculating statistics
Does this brain area change brightness when we do the task?
Top panel: very good predictor (very little error)
Lower panel: somewhat less good predictor
4. General Linear Model
The observed data are composed of a signal that is predicted by our model plus unexplained noise (Boynton et al., 1996):
Measured Data = Design Model × Amplitude (solve for) + Noise
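The GLM above can be sketched numerically. A minimal example with numpy, assuming a simple on/off boxcar regressor; all numbers (block length, amplitude, baseline, noise level) are hypothetical, not from the lecture:

```python
import numpy as np

# Toy GLM: measured data = design model * amplitude + noise.
rng = np.random.default_rng(0)
n_vols = 24                                      # 24 volumes of a block design
box = np.tile([1.0] * 6 + [0.0] * 6, 2)          # task on for 6 volumes, off for 6
X = np.column_stack([box, np.ones(n_vols)])      # task regressor + constant baseline
true_beta = np.array([2.0, 100.0])               # amplitude 2 above a baseline of 100
y = X @ true_beta + rng.normal(0, 0.5, n_vols)   # add unexplained noise

# Solve for the amplitudes (betas) by least squares.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ beta                          # the noise the model cannot explain
print(beta)                                      # close to [2, 100]
```

The recovered betas approach the true amplitudes as noise shrinks or the number of volumes grows.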
5. What is your model?
The model is the predicted effect. Consider a block design experiment with three conditions, each lasting 11.2 s:
Press left index finger when you see ←
Press right index finger when you see →
Do nothing when you see ↑
(Figure: predicted intensity over time.)
6. FSL/SPM display of model
Analysis programs display the model as a grid:
Each column is a regressor, e.g. left / right arrows.
Each row is a volume of data (for within-subject fMRI, rows are ordered by time).
The brightness of a row is the model's predicted intensity.
7. Statistical Contrasts
fMRI inference is based on contrasts. Consider a study with left arrow and right arrow as regressors:
[1 0] identifies activation correlated with left arrows: we could expect visual and motor effects.
[1 –1] identifies regions that show more response to left arrows than right arrows. Visual effects should be similar, so this should select contralateral motor regions.
The choice of contrasts is crucial to inference.
8. Statistical Contrasts
The t-test is one-tailed; the F-test is two-tailed.
t-test: [1 –1] is mutually exclusive of [–1 1]: left>right vs. right>left.
F-test: [1 –1] = [–1 1]: any difference between left and right.
The choice of test is crucial to inference.
9. How many regressors?
We collected data during a block design where the participant completed three tasks:
Left hand movement
Right hand movement
Rest
We are only interested in the brain areas involved with left hand movement. Should we include the uninteresting right hand movement as a regressor in our statistical model? I.e., is a [1 0] analysis (left hand, with right hand as a covariate) identical, better, worse or different from an analysis that models left hand movement alone?
10. Meaningful regressors decrease noise
Meaningful regressors can explain some of the variability. Adding a meaningful regressor can reduce the unexplained noise from our contrast.
11. Correlated regressors decrease signal
If a regressor is strongly correlated with our effect, it can reduce the residual signal: our signal is excluded because the regressor explains this variability. Example: responses highly correlated with visual stimuli.
12. Single factor
Consider a test to see how well height predicts weight:
t = Explained Variance / Unexplained Variance
Small t-score: height only weakly predicts weight.
High t-score: height strongly predicts weight.
13. Adding a second factor
How does an additional factor influence our test? E.g. we can add waist diameter as a regressor. Does this regressor influence the t-test of how well height predicts weight? Consider the ratio of explained to unexplained variance:
Increased t: waist explains a portion of weight not predicted by height.
Decreased t: waist explains a portion of weight predicted by height.
14. Regressors and statistics
Our analysis identifies three classes of variability:
Signal: the predicted effect of interest
Noise (aka Error): unexplained variance
Covariates: predicted effects that are not relevant
Statistical significance is the ratio t = Signal / Noise. Covariates will:
Improve sensitivity if they reduce error (explain otherwise unexplained variance).
Reduce sensitivity if they reduce signal (explain variance that is also predicted by our effect of interest).
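The effect of a covariate on the t-score (slides 12–14) can be sketched numerically. In this toy example the height/weight/waist numbers are hypothetical, and waist is generated nearly independent of height, so adding it should mainly soak up unexplained variance and raise t:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
height = rng.normal(170, 10, n)
waist = rng.normal(90, 8, n)                     # nearly independent of height here
weight = 0.5 * height + 0.4 * waist + rng.normal(0, 3, n)

def t_for_first_regressor(X, y):
    """t = signal / noise for the first column of design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    sigma2 = rss / (len(y) - X.shape[1])         # unexplained variance
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[0] / np.sqrt(cov[0, 0])

ones = np.ones(n)
t_alone = t_for_first_regressor(np.column_stack([height, ones]), weight)
t_with_waist = t_for_first_regressor(np.column_stack([height, waist, ones]), weight)
# Waist explains weight variance not predicted by height, so t should rise.
print(t_alone, t_with_waist)
```

Had waist been strongly correlated with height instead, it would have stolen shared variance and t would have dropped, as slide 13 describes.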
15. Summary: regressors should be orthogonal
Each regressor describes independent variance; variance should not be explained by more than one regressor. E.g. we will see that including temporal derivatives as regressors tends to help event-related designs (temporal processing lecture).
16. Group Analysis
We typically want to make inferences about the general population:
Conduct time course analysis on many people.
Identify which patterns are consistent across the group.
17. Parametric Statistics
SPM and FSL conduct parametric statistics (t-test, F-test, correlation). These make assumptions about the data, and we typically will not check whether those assumptions are valid.
18. Parametric Statistics
Parameters = assumptions. Parametric statistics assume that data can be accurately described by two values:
Mean: a measure of central tendency
Variance: a measure of noise
(Figure: distributions whose means differ, and distributions whose variabilities differ.)
19. Parametric Statistics
Parametric statistics are popular:
Simple: complex data described by two numbers (mean and variability)
Flexible: can look at how multiple factors interact
Powerful: very sensitive at detecting real effects
Robust: usually work even if assumptions are violated, and tend to fail gracefully by becoming more conservative
20. Parametric Statistics assume bell-shaped data: the normal distribution
Often, this is wrong, and the mean may not be a good measure:
Positive skew: response times, a hard exam
Negative skew: an easy exam
Bimodal: some students got it (and some did not)
21. Rank-Order Statistics
Rank-order statistics make fewer assumptions, but they:
Have less power (if data are normal)
Require more measurements
May fail to detect real results
Are computationally slow
Classic examples: the Wilcoxon Mann-Whitney test; Fligner and Policello's robust rank-order test.
22. Problem with rank-order statistics
While rank-order statistics are often referred to as non-parametric, most make assumptions:
WMW: assumes both distributions have the same shape.
FP: assumes both distributions are symmetrical.
Both of these tests become liberal if their assumptions are not met: they fail catastrophically.
23. What to do?
In general, use parametric tests; in the face of violations, you will simply lose power. One alternative is permutation testing, e.g. SnPM. Permutation testing is only as powerful as the test statistic it uses: SnPM uses the t-test, which is sensitive to changes in mean (so it can be blind to changes in median). A recent alternative is the truly non-parametric test of Brunner and Munzel, which can offer slightly better power than the t-test if the data are skewed (Rorden et al.).
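To make the rank-order idea concrete, here is a minimal sketch of the Mann-Whitney U statistic computed from ranks rather than means (toy data, no tie handling; not how SnPM or the Brunner-Munzel test is implemented):

```python
import numpy as np

def mann_whitney_u(x, y):
    """U statistic for group x: counts how often an x value beats a y value."""
    pooled = np.concatenate([x, y])
    ranks = pooled.argsort().argsort() + 1.0     # rank 1 = smallest (assumes no ties)
    r1 = ranks[: len(x)].sum()                   # rank sum of group 1
    return r1 - len(x) * (len(x) + 1) / 2.0

u = mann_whitney_u(np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0]))
print(u)  # 0.0: every x value is below every y value
```

In practice one would use a library routine (SciPy provides `scipy.stats.mannwhitneyu`, and `scipy.stats.brunnermunzel` for the Brunner-Munzel test) rather than hand-rolling the statistic.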
25. Statistics: e.g. erythropoietin (EPO) doping in athletes
In endurance athletes, EPO improves performance by ~10%, yet races are often won by less than 1%. Without testing, athletes are forced to dope to be competitive. Dangers: EPO is carcinogenic and can cause heart attacks. Therefore: measure haematocrit level to identify drug users. If there were no noise in our measure, it would be easy to identify EPO doping (e.g. haematocrit of 30% vs. 50%).
26. The problem of noise
Science tests hypotheses based on observations. We need statistics because our data are noisy: in the real world, haematocrit levels vary between people. This unrelated noise in our measure is called 'error'. How do we identify dopers?
27. Statistical Threshold
If we set the threshold too low, we will accuse innocent people (high rate of false alarms).
If we set the threshold too high, we will fail to detect dopers (high rate of misses).
28. Possible outcomes of drug test
Decision vs. reality (unknown):
Accuse and expel a non-doper: innocent accused (false alarm), Type I error
Accuse and expel an EPO doper: doper expelled (hit)
Allow a non-doper to compete: innocent competes (correct rejection)
Allow an EPO doper to compete: doper sneaks through (miss), Type II error
29. Errors
With noisy data, we will make mistakes. Statistics allows us to:
Estimate our confidence.
Bias the type of mistake we make (e.g. we can decide whether we will tend to make false alarms or misses): we can be liberal (avoiding misses) or conservative (avoiding false alarms).
We want liberal tests for airport weapons detection (the X-ray often leads to innocent cases being opened). Our society wants conservative tests for criminal conviction: avoid sending innocent people to jail.
30. Liberal vs. Conservative Thresholds
LIBERAL: with a low threshold, we will accuse innocent people (high rate of false alarms, Type I).
CONSERVATIVE: with a high threshold, we will fail to detect dopers (high rate of misses, Type II).
31. Statistical Power
Power is our probability of making a hit: it reflects our ability to detect real effects. To make new discoveries, we need to optimize power. There are four ways to increase power.
Decision vs. reality: reject Ho when Ho is false (hit); reject Ho when Ho is true (Type I error); accept Ho when Ho is false (Type II error); accept Ho when Ho is true (correct rejection).
32. 1.) Alpha and Power
By making alpha less strict, we can increase power (e.g. p < 0.05 instead of p < 0.01). However, we increase the chance of a Type I error!
33. 2.) Effect Size and Power
Power will increase if the effect size increases (e.g. a higher dose of drug, 7T MRI instead of 1.5T). Unfortunately, effect sizes are often small and fixed.
34. 3.) Variability and Power
Reducing variability increases the relative effect size. Most measures of brain activity are noisy.
35. 4.) Sample Size
A final way to increase our power is to collect more data:
We can sample a person's brain activity on many similar trials.
We can test more people.
The disadvantage is time and money, but increasing the sample size is often our only option for increasing statistical power.
36. Reflection
Statistically, relative 'effect size' and 'variability' are equivalent: our confidence is the ratio of effect size to variability (signal versus noise). In the graphs below, the same ratio is used.
37. Alpha level
Statistics allow us to estimate our confidence. Alpha (α) is our statistical threshold: it measures our chance of a Type I error.
An alpha level of 5% means only a 1/20 chance of a false alarm (we will only accept p < 0.05).
An alpha level of 1% means only a 1/100 chance of a false alarm (p < 0.01).
Therefore, a 1% alpha is more conservative than a 5% alpha.
38. Multiple Comparison Problem
Assume a 1% alpha for drug testing: an innocent athlete has only a 1% chance of being accused. Problem: there are 10,500 athletes in the Olympics. If all are innocent and α = 1%, we will wrongly accuse 105 athletes (0.01 × 10,500)! This is the multiple comparison problem.
39. Multiple Comparisons
Gray matter volume is ~900 cc (900,000 mm³). A typical fMRI voxel is 3×3×3 mm (27 mm³). Therefore, we will conduct >30,000 tests. With a 5% alpha, we will make >1,500 false alarms!
40. Multiple Comparison Problem
If we conduct 20 tests with α = 5%, we will on average make one false alarm (20 × 0.05). Across twenty comparisons we may actually make 0, 1, 2, or in rare cases even more errors. The chance that we make at least one error is given by the formula 1 − (1 − α)^C: if we make twenty comparisons at p < .05, we have a 1 − (0.95)^20 ≈ 64% chance of reporting at least one erroneous finding. This is our familywise error (FWE) rate.
41. Bonferroni Correction
The Bonferroni correction controls FWE. For example: if we conduct 10 tests and want a 5% chance of any errors, we adjust our threshold to p < (0.05/10).
Benefit: controls FWE.
Problem: very conservative = very little chance of detecting real effects = low power.
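The FWE formula and the Bonferroni adjustment can be checked with a few lines of arithmetic, using the numbers from the slides (α = 0.05, C = 20 tests):

```python
# Familywise error: chance of at least one false alarm in C independent tests.
alpha = 0.05
C = 20
fwe = 1 - (1 - alpha) ** C            # ~0.64: a 64% chance of >= 1 false alarm

# Bonferroni: divide alpha by the number of tests to control FWE.
bonferroni_alpha = alpha / C          # 0.0025 per-test threshold
fwe_corrected = 1 - (1 - bonferroni_alpha) ** C   # back just under 0.05
print(fwe, bonferroni_alpha, fwe_corrected)
```

The corrected FWE lands slightly below 0.05, which is why Bonferroni is (mildly) conservative even when the tests are independent.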
42. Random Field Theory
We spatially smooth our data, so peaks due to noise should be attenuated by their neighbors (Worsley et al., HBM 4:58-73, 1995). RFT uses resolution elements (resels) instead of voxels: if we smooth our data with an 8 mm FWHM, the resel size is 8 mm. SPM uses RFT for FWE correction: it only requires the statistical map, its smoothness and a cluster-size threshold. Euler characteristic: unsmoothed noise will have high peaks but few clusters; smoothed data will have lower peaks but show clustering. RFT has many unchecked assumptions (Nichols) and works best for heavily smoothed data (≥3× voxel size). (Image from Nichols: panels at 5 mm, 10 mm and 15 mm smoothing.)
43. Permutation Thresholding
Prediction: the labels 'Group 1' and 'Group 2' mean something.
Null hypothesis (Ho): the labels are meaningless.
If Ho is true, we should get similar t-scores if we randomly scramble the group order.
44. Permutation Thresholding
Observed, max T = 4.1
Permutation 1, max T = 3.2
Permutation 2, max T = 2.9
Permutation 3, max T = 3.3
Permutation 4, max T = 2.8
Permutation 5, max T = 3.5
…
Permutation 1000, max T = 3.1
45. Permutation Thresholding
Compute the maximum T-score for each of 1000 permutations. Find the max T that only the top 5% of permutations exceed (e.g. T = 3.9). Any voxel in our observed dataset that exceeds this threshold has only a 5% probability of being noise.
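The max-T procedure above can be sketched end to end. This is simulated data, not real fMRI: 20 "voxels", two groups of 12, and a real effect planted in voxel 0 only; the labels are scrambled 1000 times and the 95th percentile of the null max-T distribution becomes the threshold:

```python
import numpy as np

rng = np.random.default_rng(2)
n_per_group, n_vox = 12, 20
g1 = rng.normal(0, 1, (n_per_group, n_vox))
g2 = rng.normal(0, 1, (n_per_group, n_vox))
g1[:, 0] += 2.5                                  # real group difference in voxel 0 only

def tmap(a, b):
    """Two-sample t-score per voxel (equal-variance form)."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * a.var(0, ddof=1) + (nb - 1) * b.var(0, ddof=1)) / (na + nb - 2)
    return (a.mean(0) - b.mean(0)) / np.sqrt(sp2 * (1 / na + 1 / nb))

observed = tmap(g1, g2)
pooled = np.vstack([g1, g2])
max_ts = []
for _ in range(1000):                            # scramble the group labels
    perm = rng.permutation(len(pooled))
    max_ts.append(tmap(pooled[perm[:n_per_group]], pooled[perm[n_per_group:]]).max())

threshold = np.percentile(max_ts, 95)            # only 5% of null max-Ts exceed this
print(threshold, (observed > threshold).nonzero()[0])
```

Because the threshold comes from the maximum over all voxels, any voxel exceeding it is FWE-controlled at 5%, which is why the method gives Bonferroni-style protection without Bonferroni's fixed penalty.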
46. Permutation Thresholding
Permutation thresholding offers the same protection against false alarms as Bonferroni, but is typically much more powerful. Implementations include SnPM, FSL's randomise, and my own NPM. Disadvantage: computing 1000 permutations takes ~1000 times longer than a typical analysis! Simulation data from Nichols et al.: permutation is always optimal; Bonferroni is typically conservative; Random Fields are only accurate with high DF and heavily smoothed data.
47. False Discovery Rate
Traditional statistics attempt to control the false alarm rate. The False Discovery Rate instead controls the ratio of false alarms to hits. It often provides much more power than Bonferroni correction.
48. FDR
Assume an Olympics where no athletes took EPO; now assume an Olympics where some cheat. When we conduct many tests, we can estimate the amount of real signal.
49. FDR vs. FWE
Bonferroni FWE applies the same threshold to every dataset; FDR is dynamic, with a threshold based on the signal detected.
5% Bonferroni: only a 5% chance that any innocent athlete will be accused.
5% FDR: only 5% of expelled athletes are innocent.
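One common way to control FDR is the Benjamini-Hochberg step-up procedure; a minimal sketch with toy p-values (hypothetical numbers, not from the lecture) showing how the threshold adapts to the signal present:

```python
import numpy as np

def fdr_bh(pvals, q=0.05):
    """Benjamini-Hochberg: boolean mask of p-values surviving FDR at level q."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    m = len(p)
    # Find the largest k with p_(k) <= (k/m)*q, then reject everything up to it.
    below = p[order] <= (np.arange(1, m + 1) / m) * q
    survive = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        survive[order[: k + 1]] = True
    return survive

p = [0.001, 0.009, 0.039, 0.041, 0.6, 0.9]
print(fdr_bh(p))   # first two survive
```

Note the dynamic threshold at work: with six tests, Bonferroni would demand p < 0.05/6 ≈ 0.0083 and keep only 0.001, while BH also keeps 0.009 because one very small p-value is already in hand.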
50. Controlling for multiple comparisons
Bonferroni correction: we will often fail to find real results.
RFT correction: typically less conservative than Bonferroni, but requires large DF and broad smoothing.
Permutation thresholding: offers the same inference as Bonferroni, typically much less conservative, but computationally very slow.
FDR correction: at an FDR of .05, about 5% of 'activated' voxels will be false alarms. If signal is only a tiny proportion of the data, FDR will be similar to Bonferroni.
51. Alternatives to voxelwise analysis
Conventional fMRI statistics compute one statistical comparison per voxel.
Advantage: can discover effects anywhere in the brain.
Disadvantage: low statistical power due to multiple comparisons.
Small Volume Correction: only test a small proportion of voxels (still have to adjust, e.g. with RFT).
Region of Interest: pool data across an anatomical region for a single statistical test.
Example: how many comparisons on this slice? SPM: 1600; SVC: 57; ROI: 1.
52. ROI analysis
In a voxelwise analysis, we conduct an independent test for every voxel: each voxel is noisy, and the huge number of tests means a severe penalty for multiple comparisons. Alternative: pool data from a region of interest. Averaging across a meaningful region should reduce noise, and with one test per region the FWE adjustment is less severe. The region must be selected independently of the statistical contrast:
Anatomically predefined (e.g. M1 for movement, S1 for sensation)
Defined from a previous localizer session
Selected based on a combination of the conditions you will contrast
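The noise-reduction from pooling can be sketched with simulated data. Assume (hypothetically) that every voxel in a 100-voxel region carries the same weak effect: averaging the region to one value per subject yields a far larger t than any single noisy voxel:

```python
import numpy as np

rng = np.random.default_rng(3)
n_sub, n_vox = 16, 100
data = rng.normal(0.3, 1.0, (n_sub, n_vox))      # weak common effect, lots of noise

def one_sample_t(x):
    """One-sample t-score per column: mean / standard error."""
    return x.mean(0) / (x.std(0, ddof=1) / np.sqrt(len(x)))

# Voxelwise: 100 noisy tests, each paying the multiple-comparison penalty.
t_vox = one_sample_t(data)

# ROI: average the region to one value per subject, then a single test.
roi_mean = data.mean(axis=1)
t_roi = one_sample_t(roi_mean[:, None])[0]
print(t_vox.max(), t_roi)
```

Averaging 100 voxels shrinks the per-subject noise by roughly a factor of 10 (√100), so the single ROI test is both more sensitive and free of the voxelwise correction.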
53. Inference from fMRI statistics
fMRI studies have very low power:
Correction for multiple comparisons
Poor signal to noise
Variability in functional anatomy between people
Null results are impossible to interpret (it is hard to say an area is not involved with a task).
54. Between and Within Subject Variance
Consider an experiment to see if music influences typing speed. The possible effect will be small, and there is large variability between people: some people are much better typists than others. Solution: a repeated-measures design to separate between- and within-subject variability. (Figure: typing speed in words per minute for Alice, Bob, Donna, Nick and Sam in Bach, Rock and Silent conditions.)
55. Multiple Subject Analysis: Mixed Model
Model all of the data at once: between- and within-subject variation are both accounted for. We can't apply a mixed model directly to fMRI data because there is so much data! (Diagram: Sub 1–4 → Group → Z stats.)
56. Multiple Subject Analysis: SPM2
First estimate each subject's contrast effect sizes (copes), then run a t-test on the copes. Holmes and Friston assume within-subject variation is the same for all subjects, which allows them to ignore it at the group level. This is not equivalent to a mixed model. (Diagram: Sub 1–4 copes → group t-test → results.)
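The second level of this summary-statistics approach is just a one-sample t-test on the per-subject copes; a minimal sketch with hypothetical cope values (the first level, estimating each subject's cope from their time series, is omitted):

```python
import numpy as np

# One cope (contrast effect size) per subject, from a hypothetical first level.
copes = np.array([1.8, 2.4, 0.9, 3.1, 2.2, 1.5, 2.7, 1.1])
n = len(copes)

# Group-level one-sample t-test: is the mean cope reliably above zero?
t = copes.mean() / (copes.std(ddof=1) / np.sqrt(n))   # df = n - 1
print(round(t, 2))
```

Note that only between-subject variability of the copes enters this test; per-subject cope uncertainty (the varcopes) is ignored, which is exactly the simplification the FSL approach on the next slide removes.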
57. Multiple Subject Analysis: FSL
First estimate each subject's copes and cope variability (varcopes), then enter the copes and varcopes into the group model. The varcopes supply within-subject variation; between-subject variation and group-level means are then estimated. This is equivalent to a mixed model, but much slower than SPM. (Diagram: Sub 1–4 copes and varcopes → group → Z stats.)