T tests, ANOVAs and regression

Presentation transcript:

T tests, ANOVAs and regression Tom Jenkins Ellen Meierotto SPM Methods for Dummies 2007

Why do we need t tests?

Objectives: types of error, the probability distribution, Z scores, T tests, ANOVAs.

Errors: with respect to the null hypothesis, a Type 1 error (α) is a false positive and a Type 2 error (β) is a false negative.

Normal distribution

Z scores: the standardised normal distribution has µ = 0, σ = 1. You need to know the population standard deviation. Z = (x − μ)/σ compares one data point to the population mean, in units of population standard deviation.
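A minimal sketch of the z-score formula in Python, with a made-up population mean and standard deviation (the numbers are purely illustrative):

# Z score: how many population standard deviations a value lies from the mean.
mu, sigma = 100.0, 15.0   # hypothetical population mean and SD
x = 130.0                 # a single observation

z = (x - mu) / sigma
print(z)                  # 2.0 -> x is two SDs above the population mean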

T tests Comparing means 1 sample t 2 sample t Paired t

Different sample variances

2 sample t tests Pooled standard error of the mean
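The pooled standard error formula was shown as a figure on the original slide; below is a hedged Python sketch of the usual equal-variance pooling, checked against scipy.stats.ttest_ind. The two samples are invented.

import numpy as np
from scipy import stats

a = np.array([5.1, 4.9, 6.2, 5.8, 5.5])
b = np.array([4.2, 4.8, 4.4, 5.0, 4.1])

na, nb = len(a), len(b)
# Pooled variance: weighted average of the two sample variances.
sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
se = np.sqrt(sp2 * (1 / na + 1 / nb))   # pooled standard error of the difference in means
t = (a.mean() - b.mean()) / se

print(t)
print(stats.ttest_ind(a, b))            # same t statistic, plus its p value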

1 sample t test

The effect of degrees of freedom on the t distribution

Paired t tests

T tests in SPM: did the observed signal change occur by chance, or is it statistically significant? Recall the GLM: Y = Xβ + ε. β1 is an estimate of the signal change over time attributable to the condition of interest. Set up a contrast cT = [1 0 … 0] for β1: the test divides 1×β1 + 0×β2 + … + 0×βn by its standard deviation. Null hypothesis: cTβ = 0, i.e. no significant effect at each voxel for the condition modelled by β1. A contrast of [1 −1] asks: is the difference between two conditions significantly non-zero? t = cTβ / sd[cTβ], a one-sided test.
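A toy numerical illustration (not SPM code) of the contrast t statistic t = cTβ / sd[cTβ]: a simulated single-voxel time course, an assumed two-column design matrix, and ordinary least squares for the GLM fit.

import numpy as np

rng = np.random.default_rng(0)
n = 40
X = np.column_stack([rng.integers(0, 2, n),      # condition regressor (beta1)
                     np.ones(n)])                # constant term (beta2)
y = 0.8 * X[:, 0] + 2.0 + rng.normal(0, 1, n)    # simulated voxel time course

beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # GLM estimate: Y = X beta + error
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])        # error variance estimate

c = np.array([1.0, 0.0])                         # contrast picking out beta1
var_cb = sigma2 * c @ np.linalg.inv(X.T @ X) @ c
t = (c @ beta) / np.sqrt(var_cb)                 # one-sided t statistic for beta1
print(t)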

ANOVA: variances, not means. Total variance = model variance + error variance. The result is an F score, corresponding to a p value. F = model variance / error variance.
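A short sketch of the F ratio as model (between-group) variance over error (within-group) variance, checked against scipy.stats.f_oneway; the three groups are invented for illustration.

import numpy as np
from scipy import stats

groups = [np.array([4.0, 5.0, 6.0, 5.5]),
          np.array([6.5, 7.0, 8.0, 7.5]),
          np.array([5.0, 5.5, 4.5, 6.0])]

allvals = np.concatenate(groups)
grand_mean = allvals.mean()
k, n = len(groups), len(allvals)

ss_model = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # between groups
ss_error = sum(((g - g.mean()) ** 2).sum() for g in groups)            # within groups

F = (ss_model / (k - 1)) / (ss_error / (n - k))  # model variance / error variance
print(F)
print(stats.f_oneway(*groups))                   # same F statistic and its p value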

Partitioning the variance (illustrated for two groups): Total = Model (between groups) + Error (within groups).

T vs F tests. F tests detect any differences between multiple groups, including interactions; where the differences lie has to be determined post hoc. SPM t contrasts are one-tailed (con); SPM F contrasts are two-tailed (ess).

Conclusions. T tests describe how unlikely it is that experimental differences are due to chance: the higher the t score, the smaller the p value and the less likely the result is to be due to chance. They can compare a sample with a population, or two samples, paired or unpaired. ANOVA/F tests are similar but use variances instead of means, and can be applied to more than two groups and to other, more complex designs.

Acknowledgements MfD slides 2004-2006 Van Belle, Biostatistics Human Brain Function Wikipedia

Correlation and Regression

Topics Covered: Is there a relationship between x and y? What is the strength of this relationship? (Pearson's r) Can we describe this relationship and use it to predict y from x? (Regression) Is the relationship we have described statistically significant? (F- and t-tests) Relevance to SPM (GLM). We often want to know whether various variables are 'linked', i.e. correlated. This can be interesting in itself, but is also important if we want to predict one variable's value given a value of the other.

Relationship between x and y. Correlation describes the strength and direction of a linear relationship between two variables. Regression tells you how well a certain independent variable predicts a dependent variable. CORRELATION ≠ CAUSATION: to infer causality, manipulate the independent variable and observe the effect on the dependent variable. This means that correlation cannot validly be used to infer a causal relationship between the variables, but it should not be taken to mean that correlations cannot indicate causal relations. The causes underlying a correlation, if any, may be indirect and unknown. A correlation between age and height in children is fairly causally transparent, but a correlation between mood and health in people is less so. Does improved mood lead to improved health? Or does good health lead to good mood? Or does some other factor underlie both? Or is it pure coincidence? In other words, a correlation can be taken as evidence for a possible causal relationship, but cannot indicate what the causal relationship, if any, might be.

Scattergrams: scatter plots of y against x illustrating a positive correlation, a negative correlation, and no correlation.

Variance vs. Covariance: do two variables change together? Variance is the spread around a mean: each deviation dx is multiplied by itself, var(x) = Σ(xi − x̄)²/(n − 1). We square the deviation so that the term is positive whether dx is negative or positive, and positives and negatives do not cancel out when we sum them. Covariance measures how much x and y change together and is very similar: we multiply the two deviations rather than squaring one, cov(x, y) = Σ(xi − x̄)(yi − ȳ)/(n − 1).
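A small numeric sketch of the two definitions above, multiplying deviations with themselves (variance) versus with each other (covariance); the data are invented.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.5, 3.5, 3.0, 5.0])

dx, dy = x - x.mean(), y - y.mean()
var_x = (dx * dx).sum() / (len(x) - 1)   # variance: squared deviations, always >= 0
cov_xy = (dx * dy).sum() / (len(x) - 1)  # covariance: do x and y deviate together?

print(var_x, np.var(x, ddof=1))          # matches numpy's sample variance
print(cov_xy, np.cov(x, y)[0, 1])        # matches numpy's sample covariance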

Covariance: when X and Y increase together, cov(x, y) is positive; when one increases as the other decreases, cov(x, y) is negative; when there is no consistent relationship, cov(x, y) ≈ 0.

Example covariance: on the slide, the deviations of each x and y value from their means are multiplied together and summed over the sample, giving the covariance for the example data. What does this number tell us?

Example of how the covariance value relies on variance. Two datasets with the same relationship between x and y, one with high variance and one with low variance: for the high-variance data the sum of (x error × y error) over the seven subjects is 7000, giving a covariance of 1166.67; for the low-variance data the sum is 28, giving a covariance of 4.67. Problem with covariance: the value obtained depends on the size of the data's standard deviations: if they are large, the value will be greater than if they are small, even if the relationship between x and y is exactly the same in the large and small standard deviation datasets.

Pearson's R. On its own, covariance does not really tell us much: we can only compare covariances between different pairs of variables to see which is greater. Solution: standardise the measure. Pearson's r standardises the covariance by bringing the standard deviations into the equation, dividing by their product: r = cov(x, y) / (sx sy).
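A sketch of the standardisation step: dividing the covariance by the two standard deviations gives Pearson's r, which stays the same when the same relationship is rescaled to have much larger spread (invented data; the scaled copies stand in for the slide's high- and low-variance datasets).

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.5, 3.5, 3.0, 5.0])
x_big, y_big = 20 * x, 20 * y            # same relationship, much larger spread

def pearson_r(a, b):
    cov = np.cov(a, b, ddof=1)[0, 1]
    return cov / (a.std(ddof=1) * b.std(ddof=1))

print(np.cov(x, y)[0, 1], np.cov(x_big, y_big)[0, 1])  # covariances differ hugely
print(pearson_r(x, y), pearson_r(x_big, y_big))        # r is identical for both
print(np.corrcoef(x, y)[0, 1])                         # matches numpy's built-in r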

Basic assumptions Normal distributions Variances are constant and not zero Independent sampling – no autocorrelations No errors in the values of the independent variable All causation in the model is one-way (not necessary mathematically, but essential for prediction)

Pearson's R: degree of linear dependence. The distance of r from 0 indicates the strength of correlation. r = 1 or r = −1 means that we can predict y from x (and vice versa) with certainty: all data points lie on a straight line y = ax + b. The correlation is 1 in the case of an increasing linear relationship, −1 in the case of a decreasing linear relationship, and some value in between in all other cases, indicating the degree of linear dependence between the variables. The closer the coefficient is to either −1 or 1, the stronger the correlation between the variables. Note that the correlation coefficient detects only linear dependencies between two variables.

Limitations of r. The r we compute is only an estimate: the true r describes the whole population, while ours is an estimate based on the sample data. r is also very sensitive to extreme values. For example, with X = 1, 2, 3, 4, 5 and Y = 1, 2, 3, 4, 0 (so x̄ = 3, ȳ = 2), the single extreme final y value drastically changes the correlation compared with the same data without it.
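The slide's own numbers make the point; a quick check with numpy, assuming the intended comparison is with the unperturbed sequence 1 to 5:

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y_clean = np.array([1, 2, 3, 4, 5], dtype=float)
y_extreme = np.array([1, 2, 3, 4, 0], dtype=float)   # one extreme value, as on the slide

print(np.corrcoef(x, y_clean)[0, 1])     # 1.0 -> perfect correlation
print(np.corrcoef(x, y_extreme)[0, 1])   # 0.0 -> one extreme point wipes out the correlation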

In the real world, r is never exactly 1 or -1. Interpretations for correlations in psychological research (Cohen):

Correlation   Negative          Positive
Small         -0.29 to -0.10    0.10 to 0.29
Medium        -0.49 to -0.30    0.30 to 0.49
Large         -1.00 to -0.50    0.50 to 1.00

The interpretation of a correlation coefficient depends on the context and purposes. A correlation of 0.9 may be very low if one is verifying a physical law using high-quality instruments, but may be regarded as very high in the social sciences, where there may be a greater contribution from complicating factors.

Regression Correlation tells you if there is an association between x and y but it doesn’t describe the relationship or allow you to predict one variable from the other. To do this we need REGRESSION!

Best-fit Line. The aim of linear regression is to fit a straight line, ŷ = ax + b (a = slope, b = intercept), to the data that gives the best prediction of y for any value of x. Here ŷ is the predicted value, yi is the true value, and the residual error is ε = yi − ŷ. The best-fit line is the one that minimises the distance between the data and the fitted line, i.e. the residuals ε. So to understand the relationship between two variables, we want to draw the 'best' line through the cloud of points, i.e. find the best fit. This is done using the principle of least squares.

Least Squares Regression. To find the best line we must minimise the sum of the squares of the residuals (the vertical distances from the data points to our line). Model line: ŷ = ax + b, where a = slope and b = intercept. Residual: ε = y − ŷ. Sum of squares of residuals = Σ(y − ŷ)². We must find the values of a and b that minimise Σ(y − ŷ)².
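A small sketch of the quantity being minimised: the residual sum of squares Σ(y − ŷ)² for a candidate slope a and intercept b, evaluated for two candidate lines on invented data.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

def sum_sq_residuals(a, b):
    y_hat = a * x + b                 # model line
    return ((y - y_hat) ** 2).sum()   # vertical distances, squared and summed

print(sum_sq_residuals(1.0, 1.0))     # one candidate line
print(sum_sq_residuals(0.5, 2.0))     # a worse fit gives a larger sum of squares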

Finding b. First we find the value of b that gives the minimum sum of squares. Trying different values of b is equivalent to shifting the line up and down the scatter plot.

Finding a. Now we find the value of a that gives the minimum sum of squares. Trying out different values of a is equivalent to changing the slope of the line, while b stays constant.

Minimising sums of squares. We need to minimise Σ(y − ŷ)². Since ŷ = ax + b, we need to minimise Σ(y − ax − b)². If we plot the sum of squares S against the different values of a and b we get a parabola, because it is a squared term. So the minimum sum of squares is at the bottom of the curve, where the gradient is zero.

The maths bit. We can find the a and b that give the minimum sum of squares by taking partial derivatives of Σ(y − ax − b)² with respect to a and b separately, then setting each derivative to zero and solving for the values of a and b that give the minimum sum of squares.

The solution. Doing this gives the following equation for a: a = r·(sy/sx), where r = correlation coefficient of x and y, sy = standard deviation of y, sx = standard deviation of x. You can see that: a low correlation coefficient gives a flatter slope (small value of a); a large spread of y, i.e. a high standard deviation, results in a steeper slope (large value of a); a large spread of x, i.e. a high standard deviation, results in a flatter slope (small value of a).

The solution cont. Our model equation is ŷ = ax + b. This line must pass through the mean, so ȳ = ax̄ + b, i.e. b = ȳ − ax̄. Substituting a from above gives: b = ȳ − r·(sy/sx)·x̄, where r = correlation coefficient of x and y, sy = standard deviation of y, sx = standard deviation of x. The smaller the correlation, the closer the intercept is to the mean of y.
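A quick check of the closed-form solution: compute a = r·(sy/sx) and b = ȳ − ax̄ directly and compare with numpy's least-squares fit (same invented data as in the earlier sketch).

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

r = np.corrcoef(x, y)[0, 1]
a = r * y.std(ddof=1) / x.std(ddof=1)   # slope from the correlation and the two SDs
b = y.mean() - a * x.mean()             # the line passes through the point of means

print(a, b)
print(np.polyfit(x, y, 1))              # numpy's least-squares line: [slope, intercept]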

Back to the model We can calculate the regression line for any data, but the important question is: How well does this line fit the data, or how good is it at predicting y from x?

How good is our model? Total variance of y: sy² = Σ(y − ȳ)²/(n − 1) = SSy/dfy. Variance of the predicted y values (ŷ), which is the variance explained by our regression model: sŷ² = Σ(ŷ − ȳ)²/(n − 1) = SSpred/dfŷ. Error variance: the variance of the error between our predicted y values and the actual y values, and thus the variance in y that is NOT explained by the regression model: serror² = Σ(y − ŷ)²/(n − 2) = SSer/dfer.

How good is our model cont. Total variance = predicted variance + error variance: sy² = sŷ² + ser². Conveniently, via some complicated rearranging, sŷ² = r²·sy², so r² = sŷ²/sy². So r² is the proportion of the variance in y that is explained by our regression model.

How good is our model cont. Insert sŷ² = r²·sy² into sy² = sŷ² + ser² and rearrange to get: ser² = sy² − r²·sy² = sy²(1 − r²). From this we can see that the greater the correlation, the smaller the error variance, so the better our prediction.
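A numeric check that r² equals the proportion of variance in y explained by the regression; the sums of squares are used directly, so the degrees-of-freedom constants cancel (invented data).

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

a, b = np.polyfit(x, y, 1)
y_hat = a * x + b

ss_total = ((y - y.mean()) ** 2).sum()       # total variation in y
ss_pred = ((y_hat - y.mean()) ** 2).sum()    # variation explained by the model
r = np.corrcoef(x, y)[0, 1]

print(ss_pred / ss_total)   # proportion of variance explained
print(r ** 2)               # equals r squared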

Is the model significant? That is, do we get a significantly better prediction of y from our regression equation than by just predicting the mean? F-statistic: after some complicated rearranging, F(dfŷ, dfer) = sŷ²/ser² = … = r²(n − 2)/(1 − r²). And it follows (because F = t²) that t(n − 2) = r·√(n − 2)/√(1 − r²). So all we need to know are r and n!
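A sketch verifying the shortcut: the regression F and t can be computed from r and n alone, and the resulting p value matches scipy.stats.linregress (invented data).

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 5.9, 7.4])
n = len(x)

r = np.corrcoef(x, y)[0, 1]
F = r ** 2 * (n - 2) / (1 - r ** 2)            # F(1, n-2) for the regression
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)   # t(n-2); note F = t**2

res = stats.linregress(x, y)                   # slope, intercept, r, p value, ...
print(F, t ** 2)                               # these agree
print(2 * stats.t.sf(abs(t), df=n - 2), res.pvalue)  # two-sided p values agree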

General Linear Model. Linear regression is actually a form of the General Linear Model, where the parameters are a, the slope of the line, and b, the intercept: y = ax + b + ε. A General Linear Model is just any model that describes the data in terms of a straight line.

Multiple regression. Multiple regression is used to determine the effect of a number of independent variables, x1, x2, x3 etc., on a single dependent variable, y. The different x variables are combined in a linear way and each has its own regression coefficient: y = a1x1 + a2x2 + … + anxn + b + ε. The a parameters reflect the independent contribution of each independent variable x to the value of the dependent variable y, i.e. the amount of variance in y that is accounted for by each x variable after all the other x variables have been accounted for.
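A minimal multiple-regression sketch: two hypothetical predictors x1 and x2 combined linearly, with the coefficients and intercept estimated by ordinary least squares via numpy.linalg.lstsq.

import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.5 * x1 - 0.7 * x2 + 2.0 + rng.normal(scale=0.5, size=n)  # simulated dependent variable

# Design matrix: one column per predictor plus a constant column for the intercept b.
X = np.column_stack([x1, x2, np.ones(n)])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

print(coefs)   # approximately [1.5, -0.7, 2.0]: a1, a2 and the intercept b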

SPM Linear regression is a GLM that models the effect of one independent variable, x, on ONE dependent variable, y Multiple Regression models the effect of several independent variables, x1, x2 etc, on ONE dependent variable, y Both are types of General Linear Model GLM can also allow you to analyse the effects of several independent x variables on several dependent variables, y1, y2, y3 etc, in a linear combination This is what SPM does and will be explained soon…