Statistics – Modelling Your Data


Statistics – Modelling Your Data Chris Rorden Modelling data: Signal, Error and Covariates Parametric Statistics Thresholding Results: Statistical power and statistical errors The multiple comparison problem Familywise error and Bonferroni Thresholding Permutation Thresholding False Discovery Rate Thresholding Implications: null results uninterpretable

The fMRI signal Last lecture: we predicted that areas involved with a task become brighter (after a delay). Therefore, if someone repeatedly does a task for 12 seconds and rests for 12 seconds, we expect the signal to look like a delayed, smoothed version of that on/off pattern.
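
As an illustration, here is a minimal Python sketch of that predicted time course, assuming a TR of 2 s, a 120-volume scan, and a simple gamma-shaped HRF (all illustrative choices, not taken from the slides):

```python
import numpy as np
from scipy.stats import gamma

TR = 2.0                 # assumed repetition time (seconds); not specified in the slides
n_vols = 120             # assumed scan length in volumes
t = np.arange(n_vols) * TR

# Boxcar: 12 s of task followed by 12 s of rest, repeated
boxcar = ((t % 24.0) < 12.0).astype(float)

# A simple gamma-shaped HRF peaking around 5 s (an illustrative stand-in
# for the canonical haemodynamic response function)
hrf_t = np.arange(0, 24, TR)
hrf = gamma.pdf(hrf_t, a=6, scale=1.0)
hrf /= hrf.sum()

# The expected BOLD signal: the boxcar convolved with the HRF,
# i.e. a delayed, smoothed version of the on/off pattern
predicted = np.convolve(boxcar, hrf)[:n_vols]
```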

Calculating statistics Does this brain area change brightness when we do the task? Top panel: a very good predictor (very little error). Bottom panel: a somewhat less good predictor.

General Linear Model The observed data is composed of a signal that is predicted by our model plus unexplained noise (Boynton et al., 1996). In matrix form: Y = Xβ + ε, where Y is the measured data, X is the design model, β is the amplitude we solve for, and ε is the noise.
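
A minimal sketch of solving Y = Xβ + ε by ordinary least squares with NumPy; the simulated amplitudes and noise level are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 120

# Design matrix X: one task regressor (e.g. the convolved boxcar above)
# plus a constant column for the baseline
task = rng.random(n)          # placeholder; in practice, the predicted BOLD signal
X = np.column_stack([task, np.ones(n)])

# Simulate measured data Y = X @ beta + noise, with amplitude 2.0 and baseline 100
beta_true = np.array([2.0, 100.0])
Y = X @ beta_true + rng.normal(0.0, 1.0, n)

# Solve for the amplitudes by ordinary least squares
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
residuals = Y - X @ beta_hat  # the unexplained noise, epsilon
```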

What is your model? The model is the predicted effect. Consider a block design experiment with three conditions, each lasting 11.2 sec: press the left index finger when you see ←, press the right index finger when you see →, do nothing when you see ↑. (Figure: predicted intensity over time.)

FSL/SPM display of model Analysis programs display the model as a grid. Each column is a regressor (e.g. left / right arrows). Each row is a volume of data; for within-subject fMRI, rows correspond to time. The brightness of each cell is the model's predicted intensity.

Statistical Contrasts fMRI inference is based on contrasts. Consider a study with left arrow and right arrow as regressors. [1 0] identifies activation correlated with left arrows: we would expect visual and motor effects. [1 –1] identifies regions that show more response to left arrows than right arrows: visual effects should be similar, so this should select contralateral motor regions. Choice of contrasts is crucial to inference.
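
A sketch of how a contrast vector turns the GLM estimates into a t-score, using the standard formula t = c'β / sqrt(σ² c'(X'X)⁻¹c); the helper name is ours:

```python
import numpy as np

def contrast_t(X, Y, c):
    """t-score for contrast vector c on an ordinary least-squares GLM fit."""
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta
    dof = X.shape[0] - np.linalg.matrix_rank(X)
    sigma2 = resid @ resid / dof                        # residual variance
    var_con = sigma2 * c @ np.linalg.pinv(X.T @ X) @ c  # variance of the contrast
    return (c @ beta) / np.sqrt(var_con)

# With left-arrow and right-arrow regressors (plus a constant) in X:
# contrast_t(X, Y, np.array([1.0, 0.0, 0.0]))   -> activation to left arrows
# contrast_t(X, Y, np.array([1.0, -1.0, 0.0]))  -> left arrows > right arrows
```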

Statistical Contrasts The t-test is one-tailed, the F-test is two-tailed. t-test: [1 –1] is mutually exclusive of [-1 1]: left>right vs right>left. F-test: [1 –1] = [-1 1]: any difference between left and right. Choice of test is crucial to inference.

How many regressors? We collected data during a block design where the participant completed 3 tasks: left hand movement, right hand movement, and rest. We are only interested in the brain areas involved with left hand movement. Should we include the uninteresting right hand movement as a regressor in our statistical model? That is, is a [1 0] analysis identical, better, worse, or merely different from a [1] analysis?

Meaningful regressors decrease noise Meaningful regressors can explain some of the variability. Adding a meaningful regressor can reduce the unexplained noise from our contrast.

Correlated regressors decrease signal If a regressor is strongly correlated with our effect, it can reduce the residual signal: part of our signal is excluded because the regressor explains that variability. Example: motor responses highly correlated with the visual stimuli that cue them.

Single factor… Consider a test to see how well height predicts weight. The t-score is the ratio t = explained variance / unexplained variance. Small t-score: height only weakly predicts weight. High t-score: height strongly predicts weight.

Adding a second factor… How does an additional factor influence our test? E.g. we can add waist diameter as a regressor. Does this regressor influence the t-test of how well height predicts weight? Consider the ratio of explained to unexplained variance. Increased t: waist explains a portion of weight not predicted by height. Decreased t: waist explains a portion of weight that is also predicted by height.

Regressors and statistics Our analysis identifies three classes of variability: Signal (the predicted effect of interest), Noise (aka Error: unexplained variance), and Covariates (predicted effects that are not relevant). Statistical significance is the ratio t = Signal / Noise. Covariates will improve sensitivity if they reduce error (explain otherwise unexplained variance), and reduce sensitivity if they reduce signal (explain variance that is also predicted by our effect of interest), as the sketch below demonstrates.
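
A small simulation of the height/weight example above, showing both directions (all numbers invented): a covariate unrelated to height but predictive of weight raises the t-score for height, while a covariate correlated with height lowers it:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
height = rng.normal(170.0, 10.0, n)
waist = 0.5 * height + rng.normal(0.0, 5.0, n)   # correlated with height
shoe = rng.normal(42.0, 2.0, n)                   # unrelated to height
weight = 0.8 * height + 2.0 * shoe + rng.normal(0.0, 4.0, n)

def t_first(cols, y):
    """t-score of the first regressor, with a constant column appended."""
    X = np.column_stack(cols + [np.ones(len(y))])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    c = np.zeros(X.shape[1])
    c[0] = 1.0
    return beta[0] / np.sqrt(sigma2 * c @ np.linalg.pinv(X.T @ X) @ c)

print(t_first([height], weight))         # baseline t for height alone
print(t_first([height, shoe], weight))   # noise-reducing covariate: t increases
print(t_first([height, waist], weight))  # correlated covariate: t decreases
```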

Summary: regressors should be orthogonal. Each regressor should describe independent variance; variance should not be explained by more than one regressor. E.g. we will see that including temporal derivatives as regressors tends to help event-related designs (temporal processing lecture).

Group Analysis We typically want to make inferences about the general population: conduct the time course analysis on many people, then identify which patterns are consistent across the group.

Parametric Statistics SPM and FSL conduct parametric statistics: t-test, F-test, correlation. These make assumptions about the data, and the software will not check whether those assumptions are valid.

Parametric Statistics Parameters = assumptions. Parametric statistics assume that data can be accurately described by two values: the mean (a measure of central tendency) and the variance (a measure of noise). Two distributions can differ in their means, in their variabilities, or both.

Parametric Statistics Parametric statistics are popular: Simple (complex data described by two numbers: mean and variability). Flexible (can look at how multiple factors interact). Powerful: very sensitive at detecting real effects. Robust: usually work even if assumptions are violated, and tend to fail gracefully, by becoming more conservative.

Parametric Statistics Parametric statistics assume bell-shaped (normally distributed) data. Often, this is wrong, and the mean may not be a good measure: Positive skew: response times, a hard exam. Negative skew: an easy exam. Bimodal: some students got it, some did not.

Rank-Order Statistics Rank-order statistics make fewer assumptions, but they have less power (if data is normal), require more measurements, may fail to detect real results, and are computationally slow. Classic examples: the Wilcoxon-Mann-Whitney test and Fligner and Policello's robust rank-order test.

Problem with rank order statistics While rank-order statistics are often referred to as non-parametric, most make assumptions: WMW assumes both distributions have the same shape; FP assumes both distributions are symmetrical. Both tests become liberal if their assumptions are not met: they fail catastrophically.

What to do? In general, use parametric tests: in the face of violations, you will simply lose power. One alternative is permutation testing, e.g. SnPM. Permutation testing is only as powerful as the test statistic it uses: SnPM uses the t-test, which is sensitive to changes in mean (so it can be blind to changes in median). A more recent alternative is the truly non-parametric test of Brunner and Munzel, which can offer slightly better power than the t-test if data is skewed (Rorden et al., 2007). The sketch below compares these tests.
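
A quick sketch comparing these tests on skewed (log-normal) data with SciPy, which implements both the Wilcoxon-Mann-Whitney and Brunner-Munzel tests; the sample sizes and shift are invented:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Skewed, reaction-time-like (log-normal) samples; group b is shifted upward
a = rng.lognormal(mean=0.0, sigma=0.8, size=30)
b = rng.lognormal(mean=0.4, sigma=0.8, size=30)

print(stats.ttest_ind(a, b))       # parametric: tests means
print(stats.mannwhitneyu(a, b))    # WMW: assumes both distributions share a shape
print(stats.brunnermunzel(a, b))   # Brunner-Munzel: relaxes that assumption
```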

Statistical Thresholding Type I/II Errors Power Multiple Comparison Problem Bonferroni Correction Permutation Thresholding False Discovery Rate ROI Analysis

Statistics E.g. erythropoietin (EPO) doping in athletes. In endurance athletes, EPO improves performance by ~10%, and races are often won by less than 1%, so without testing, athletes are forced to dope to be competitive. Dangers: EPO is carcinogenic and can cause heart attacks. Therefore: measure haematocrit level to identify drug users. If there were no noise in our measure, it would be easy to identify EPO doping.

The problem of noise Science tests hypotheses based on observations, and we need statistics because our data is noisy. In the real world, haematocrit levels vary between people; this unrelated noise in our measure is called 'error'. How do we identify dopers?

Statistical Threshold If we set the threshold too low, we will accuse innocent people (high rate of false alarms). If we set the threshold too high, we will fail to detect dopers (high rate of misses).

Possible outcomes of drug test (decision × reality, where reality is unknown):
Accuse and expel a non-doper: innocent accused (false alarm, Type I error).
Accuse and expel an EPO doper: doper expelled (hit).
Allow a non-doper to compete: innocent competes (correct rejection).
Allow an EPO doper to compete: doper sneaks through (miss, Type II error).

Errors With noisy data, we will make mistakes. Statistics allows us to estimate our confidence and to bias the type of mistake we make (i.e. we can decide whether we will tend to make false alarms or misses). We can be liberal (avoiding misses) or conservative (avoiding false alarms). We want liberal tests for airport weapons detection (the X-ray often leads to innocent cases being opened). Our society wants conservative tests for criminal conviction: avoid sending innocent people to jail.

Liberal vs Conservative Thresholds LIBERAL: with a low threshold, we will accuse innocent people (high rate of false alarms, Type I errors). CONSERVATIVE: with a high threshold, we will fail to detect dopers (high rate of misses, Type II errors).

Statistical Power Power is our probability of making a hit: it reflects our ability to detect real effects. To make new discoveries, we need to optimize power. There are 4 ways to increase power. (Decision × reality: reject Ho when Ho is true: Type I error; reject Ho when Ho is false: hit; accept Ho when Ho is true: correct rejection; accept Ho when Ho is false: Type II error.)

1.) Alpha and Power By making alpha less strict, we can increase power. (e.g. p < 0.05 instead of 0.01) However, we increase the chance of a Type I error!

2.) Effect Size and Power Power will increase if the effect size increases. (e.g. higher dose of drug, 7T MRI instead of 1.5T). Unfortunately, effect sizes are often small and fixed.

3.) Variability and Power Reducing variability increases the relative effect size. Most measures of brain activity are noisy.

4.) Sample Size A final way to increase our power is to collect more data: we can sample a person's brain activity on many similar trials, or we can test more people. The disadvantage is time and money. Increasing the sample size is often our only option for increasing statistical power.
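
A sketch of this trade-off using statsmodels' power calculator (the tool choice and the d = 0.5 effect size are our assumptions): holding effect size and alpha fixed, power rises with n:

```python
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower()

# Power of a two-sample t-test for a medium effect (d = 0.5) at alpha = .05
for n in (10, 20, 40, 80):
    print(n, round(power.power(effect_size=0.5, nobs1=n, alpha=0.05), 2))

# Or solve for the per-group n needed to reach 80% power
print(power.solve_power(effect_size=0.5, alpha=0.05, power=0.8))
```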

Reflection Statistically, relative 'effect size' and 'variability' are equivalent: our confidence is the ratio of effect size to variability (signal versus noise).

Alpha level (α) Statistics allow us to estimate our confidence. α is our statistical threshold: it measures our chance of a Type I error. An alpha level of 5% means only a 1/20 chance of a false alarm (we will only accept p < 0.05). An alpha level of 1% means only a 1/100 chance of a false alarm (p < 0.01). Therefore, a 1% alpha is more conservative than a 5% alpha.

Multiple Comparison Problem Assume a 1% alpha for drug testing: an innocent athlete has only a 1% chance of being accused. Problem: there are 10,500 athletes in the Olympics. If all are innocent and α = 1%, we will wrongly accuse 105 athletes (0.01 × 10,500)! This is the multiple comparison problem.

Multiple Comparisons Gray matter volume is ~900 cc (900,000 mm³). A typical fMRI voxel is 3×3×3 mm (27 mm³). Therefore, we will conduct >30,000 tests. With a 5% alpha, we will make >1,500 false alarms!

Multiple Comparison Problem If we conduct 20 tests with α = 5%, we will on average make one false alarm (20 × 0.05). Across twenty comparisons we may make 0, 1, 2, or in rare cases even more errors. The chance of making at least one error is given by 1 − (1 − α)^C: with twenty comparisons at p < .05, there is a 1 − 0.95^20 ≈ 64% chance that we are reporting at least one erroneous finding. This is our familywise error (FWE) rate.

Bonferroni Correction The Bonferroni correction controls FWE. For example: if we conduct 10 tests and want a 5% chance of any errors, we adjust our threshold to p < 0.005 (0.05/10). Benefit: controls FWE. Problem: very conservative, so very little chance of detecting real effects (low power). Both calculations appear in the sketch below.
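
The FWE formula from above and the Bonferroni adjustment are a few lines of arithmetic:

```python
alpha, n_tests = 0.05, 20

# Chance of at least one false alarm over C independent tests: 1 - (1 - alpha)^C
fwe = 1 - (1 - alpha) ** n_tests
print(f"FWE over {n_tests} tests at alpha={alpha}: {fwe:.2f}")  # ~0.64

# Bonferroni: per-test threshold that keeps the familywise rate at alpha
print(f"Bonferroni threshold: {alpha / n_tests:.4f}")           # 0.0025
```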

Random Field Theory We spatially smooth our data, so peaks due to noise should be attenuated by their neighbors (Worsley et al., HBM 4:58-73, 1995). RFT uses resolution elements (resels) instead of voxels: if we smooth our data with an 8 mm FWHM kernel, the resel size is 8 mm. SPM uses RFT for FWE correction; it only requires the statistical map, its smoothness, and a cluster size threshold. Euler characteristic: unsmoothed noise will have high peaks but few clusters; smoothed data will have lower peaks but show clustering. RFT has many unchecked assumptions (Nichols) and works best for heavily smoothed data (FWHM ≥ 3× voxel size). (Image from Nichols: noise smoothed at 5, 10 and 15 mm.)

Permutation Thresholding Prediction: the labels 'Group 1' and 'Group 2' mean something. Null hypothesis (Ho): the labels are meaningless. If Ho is true, we should get similar t-scores if we randomly scramble the group order.

Permutation Thresholding Observed, max T = 4.1. Permutation 1, max T = 3.2; Permutation 2, max T = 2.9; Permutation 3, max T = 3.3; Permutation 4, max T = 2.8; Permutation 5, max T = 3.5; … Permutation 1000, max T = 3.1.

Permutation Thresholding Compute the maximum t-score for each of 1000 permutations. Find the max T exceeded by only 5% of the permutations (here, T = 3.9). Any voxel in our observed dataset that exceeds this threshold has only a 5% probability of being noise. A minimal sketch follows below.
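
A minimal sketch of max-statistic permutation thresholding on simulated data (group sizes, effect size and voxel count are invented; tools like SnPM and randomise do the same over whole images):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_vox = 500
data = rng.normal(0.0, 1.0, (24, n_vox))   # 24 subjects, 500 "voxels"
labels = np.array([0] * 12 + [1] * 12)
data[labels == 1, :5] += 1.5               # a real effect in the first 5 voxels

def max_t(d, lab):
    t, _ = stats.ttest_ind(d[lab == 1], d[lab == 0], axis=0)
    return np.abs(t).max()

observed = max_t(data, labels)

# Null distribution of the maximum t: scramble the group labels 1000 times
null_max = np.array([max_t(data, rng.permutation(labels)) for _ in range(1000)])
threshold = np.percentile(null_max, 95)    # exceeded by only 5% of permutations
print(observed, threshold)                 # voxels with |t| above threshold survive
```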

Permutation Thresholding Permutation thresholding offers the same protection against false alarms as Bonferroni, but is typically much more powerful. Implementations include SnPM, FSL's randomise, and my own NPM. Disadvantage: computing 1000 permutations takes ~1000 times longer than a typical analysis! Simulation data from Nichols et al.: permutation is always optimal; Bonferroni is typically conservative; Random Fields is only accurate with high DF and heavily smoothed data.

False Discovery Rate Traditional statistics attempt to control the false alarm rate. The 'False Discovery Rate' instead controls the ratio of false alarms to hits. It often provides much more power than Bonferroni correction.

FDR Assume an Olympics where no athletes took EPO; now assume an Olympics where some cheat. When we conduct many tests, we can estimate the amount of real signal from the distribution of the results.

FDR vs FWE Bonferroni FWE applies the same threshold to each data set. FDR is dynamic: the threshold depends on the signal detected. 5% Bonferroni: only a 5% chance that any innocent athlete will be accused. 5% FDR: only 5% of the expelled athletes are innocent. The sketch below shows the standard step-up procedure.
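
A sketch of the Benjamini-Hochberg step-up procedure, the standard way to control FDR (the slides do not name a specific procedure, so this choice is our assumption):

```python
import numpy as np

def bh_fdr_threshold(pvals, q=0.05):
    """Largest p-value passing the Benjamini-Hochberg step-up rule."""
    p = np.sort(np.asarray(pvals))
    m = len(p)
    passed = p <= q * np.arange(1, m + 1) / m   # p_(k) <= q * k / m
    return p[passed].max() if passed.any() else 0.0

# Voxels with p <= bh_fdr_threshold(all_p) are declared active; on average,
# only a fraction q of those discoveries will be false alarms.
# (all_p is a hypothetical array of one p-value per voxel.)
```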

Controlling for multiple comparisons Bonferroni correction: we will often fail to find real results. RFT correction: typically less conservative than Bonferroni, but requires large DF and broad smoothing. Permutation thresholding: offers the same inference as Bonferroni, typically much less conservative, but computationally very slow. FDR correction: at an FDR of .05, about 5% of 'activated' voxels will be false alarms; if signal is only a tiny proportion of the data, FDR will be similar to Bonferroni.

Alternatives to voxelwise analysis Conventional fMRI statistics compute one statistical comparison per voxel. Advantage: can discover effects anywhere in the brain. Disadvantage: low statistical power due to multiple comparisons. Small Volume Comparison: only test a small proportion of voxels (still have to adjust, e.g. with RFT). Region of Interest: pool data across an anatomical region for a single statistical test. Example: how many comparisons on this slice? SPM: 1600; SVC: 57; ROI: 1.

ROI analysis In voxelwise analysis, we conduct an independent test for every voxel: each voxel is noisy, and the huge number of tests brings a severe penalty for multiple comparisons. Alternative: pool data from a region of interest. Averaging across a meaningful region should reduce noise, and with one test per region the FWE adjustment is far less severe. The region must be selected independently of the statistical contrast: anatomically predefined (e.g. M1 for movement, S1 for sensation), defined from a previous localizer session, or selected based on a combination of the conditions you will contrast. The pooling step is sketched below.
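
A sketch of the pooling step, with a hypothetical 50-voxel 'M1' mask and simulated per-subject effect estimates:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_sub, n_vox = 20, 1000
copes = rng.normal(0.2, 1.0, (n_sub, n_vox))   # per-subject effects per voxel

m1_mask = np.zeros(n_vox, dtype=bool)
m1_mask[:50] = True                             # hypothetical 50-voxel M1 region

# One number per subject: the mean effect across the ROI (averaging cuts noise)
roi_means = copes[:, m1_mask].mean(axis=1)

# A single one-sample t-test for the whole region: no voxelwise FWE penalty
print(stats.ttest_1samp(roi_means, 0.0))
```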

Inference from fMRI statistics fMRI studies have very low power: correction for multiple comparisons, poor signal to noise, and variability in functional anatomy between people. Null results are impossible to interpret (it is hard to say an area is not involved with a task).

Between and Within Subject Variance Consider an experiment to see if music influences typing speed. The possible effect will be small, and there is large variability between people: some people are much better typists than others. Solution: a repeated measures design to separate between- and within-subject variability. (Figure: typing speed in words per minute for Alice, Bob, Donna, Nick and Sam under Bach, Rock and Silent conditions.)

Multiple Subject Analysis: Mixed Model Model all of the data at once: between- and within-subject variation are both accounted for. But we can't apply a mixed model directly to fMRI data because there is so much data!

Multiple Subject Analysis: SPM2 First estimate each subject's contrast effect sizes (copes), then run a t-test on the copes. Holmes and Friston assume within-subject variation is the same for all subjects, which allows them to ignore it at the group level. This is not equivalent to a mixed model.

Multiple Subject Analysis: FSL First estimate each subject's copes and their variability (varcopes), then enter the copes and varcopes into the group model. The varcopes supply the within-subject variation; between-subject variation and the group-level means are then estimated. This is equivalent to a mixed model, but much slower than SPM. Both flavours are sketched below.
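
Both group-level flavours in miniature, on invented per-subject copes and varcopes; the precision weighting is a simplification of FSL's FLAME, which also estimates between-subject variance:

```python
import numpy as np
from scipy import stats

# Per-subject contrast estimates (copes) and within-subject variances (varcopes)
copes = np.array([1.2, 0.8, 1.5, 0.4, 1.1, 0.9])
varcopes = np.array([0.2, 0.1, 0.5, 0.3, 0.1, 0.2])

# SPM-style summary statistics: ignore the varcopes and run a plain
# one-sample t-test on the copes
print(stats.ttest_1samp(copes, 0.0))

# FSL-style flavour (simplified): weight each subject by the precision
# of their estimate, so noisy subjects count for less
w = 1.0 / varcopes
group_mean = (w * copes).sum() / w.sum()
group_se = 1.0 / np.sqrt(w.sum())   # naive fixed-effects standard error
print(group_mean, group_se)
```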