
1 Statistics 1: tests and linear models

2 How to get started?
Exploring data graphically: scatterplot, histogram, boxplot
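The three plot types can be sketched in base R as follows. The built-in datasets `cars` and `ToothGrowth` are chosen here only for illustration; they are not part of the original slides.

```r
# Built-in 'cars' data: speed and stopping distance of 50 cars
# (illustrative choice, not from the slides)

# Scatterplot: relationship between two numeric variables
plot(cars$speed, cars$dist, xlab = "Speed", ylab = "Stopping distance")

# Histogram: distribution of a single numeric variable
hist(cars$dist, xlab = "Stopping distance", main = "")

# Boxplot: a numeric variable compared between groups
# (built-in 'ToothGrowth' data: tooth length by supplement type)
boxplot(len ~ supp, data = ToothGrowth,
        xlab = "Supplement", ylab = "Tooth length")
```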

3 Important things to check
Are all the variables in the correct format?
Do there seem to be outliers?
– Mistake in data coding?
Initial structure of the analyses
– What is the response variable?
– What are the explanatory variables?
Explore patterns visually
– Correlations?
– Differences between groups?
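A minimal sketch of such checks, assuming the data are in a data frame. The built-in `ToothGrowth` data and the names `tg` and `dose_f` are illustrative assumptions.

```r
tg <- ToothGrowth   # work on a copy of the built-in data

# Variable formats: is each column numeric, factor, character, ...?
str(tg)

# Ranges and counts often reveal coding mistakes (e.g. impossible values)
summary(tg)

# 'dose' is stored as a number; convert it if it should be treated as a group
tg$dose_f <- factor(tg$dose)
levels(tg$dose_f)

# Candidate outliers: points beyond the boxplot whiskers
boxplot.stats(tg$len)$out
```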

4 Summary statistics
summary(data), summary(x)
mean(x), median(x)
range(x)
var(x), sd(x)
min(x), max(x)
quantile(x, p)
tapply(), table()
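The functions listed above could be tried out like this. The datasets and the probability `p = 0.9` are illustrative assumptions.

```r
x <- cars$dist            # a numeric vector (built-in 'cars' data)

summary(cars)             # summary of every column in a data frame
summary(x)                # min, quartiles, mean, max of one variable
mean(x); median(x)
range(x)                  # c(min, max)
var(x); sd(x)             # sd(x) is sqrt(var(x))
min(x); max(x)
quantile(x, 0.9)          # 90% quantile (p = 0.9)

# tapply(): apply a function to a variable within each group
tapply(ToothGrowth$len, ToothGrowth$supp, mean)

# table(): counts of observations per combination of categories
table(ToothGrowth$supp, ToothGrowth$dose)
```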

5 Tests
Test for normality
– Shapiro-Wilk test: shapiro.test()
– QQ plot: qqnorm(), qqline()
Homogeneity of variance
– var.test() (for two groups)
– bartlett.test() (for several groups)
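A short sketch of these checks, using the built-in `ToothGrowth` data as an assumed example (tooth length, two supplement types, three dose levels):

```r
x <- ToothGrowth$len

# Normality: Shapiro-Wilk test and QQ plot
sw <- shapiro.test(x)
sw
qqnorm(x)
qqline(x)

# Homogeneity of variance between two groups (F test)
vt <- var.test(len ~ supp, data = ToothGrowth)
vt

# Homogeneity of variance between several groups (Bartlett's test)
bt <- bartlett.test(ToothGrowth$len, factor(ToothGrowth$dose))
bt
```

In each case a small p-value suggests that the assumption (normality or equal variances) does not hold.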

6 Tests for differences in means
Student’s t-test: t.test()
– One- or two-sample test
  » Testing if a sample mean differs e.g. from 0
  » Testing if the sample means of two groups differ
– Paired/unpaired
  » Are pairs of measurements associated?
– Variance homogeneous/non-homogeneous
– Assumes normally distributed data
Wilcoxon’s test: wilcox.test()
– Normality not required
– Paired/unpaired
DEMO 1
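The variants above map onto arguments of `t.test()` and `wilcox.test()`. This sketch is not the DEMO 1 from the course; it assumes the built-in `sleep` data, in which the same 10 subjects were measured under two drugs.

```r
g1 <- sleep$extra[sleep$group == 1]
g2 <- sleep$extra[sleep$group == 2]

# One-sample test: does the mean of g1 differ from 0?
t.test(g1, mu = 0)

# Two-sample test, variances not assumed equal (Welch test, the default)
welch <- t.test(g1, g2)

# Two-sample test assuming homogeneous variances
t.test(g1, g2, var.equal = TRUE)

# Paired test: measurements in g1 and g2 come from the same subjects
paired <- t.test(g1, g2, paired = TRUE)

# Wilcoxon tests: normality not required
# (exact = FALSE avoids warnings caused by tied values)
wilcox.test(g1, g2, exact = FALSE)
wilcox.test(g1, g2, paired = TRUE, exact = FALSE)
```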

7 Correlation
cor(x, y) calculates the correlation coefficient between two numeric variables
– close to 0: no (linear) correlation
– close to 1 or -1: strong positive or negative correlation
Is the correlation significant?
– cor.test(y, x)
– Note: check also graphically!
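As a sketch, with the built-in `cars` data as an assumed example:

```r
# Correlation coefficient between two numeric variables
r <- cor(cars$speed, cars$dist)
r

# Is the correlation significantly different from 0?
ct <- cor.test(cars$dist, cars$speed)
ct

# Always check graphically as well: a single outlier or a curved
# relationship can make the coefficient misleading
plot(cars$speed, cars$dist, xlab = "Speed", ylab = "Stopping distance")
```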

8 Confidence intervals and standard errors
Typical ways of describing uncertainty in a parameter value (e.g. mean)
– Standard error (SE of the mean is sqrt(var(xx)/n))
– Confidence interval (95%)
  » The range within which the value lies with a probability of 95%
  » Normal approximation: 1.96*SE, so that the 95% CI for mean(xx) is [mean(xx) - 1.96*SE(xx), mean(xx) + 1.96*SE(xx)]
If the data are not normally distributed, bootstrapping can be helpful
– Let’s assume we have measured age at death for 100 rats. The 95% CI for mean age at death can be derived as follows:
  » 1. Take a sample of 100 rats with replacement from the original data
  » 2. Calculate the mean
  » 3. Repeat 1 & 2 e.g. 1000 times and always record the mean
  » 4. Now the 2.5% and 97.5% quantiles of the means give the 95% CI for the mean
EXERCISE TOMORROW!
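The standard error, the normal-approximation interval and the four bootstrap steps can be sketched as below. The rat data are not available here, so the vector `age` is simulated; its distribution and scale are purely hypothetical.

```r
set.seed(1)
# Hypothetical data: simulated ages at death (days) for 100 rats
age <- rexp(100, rate = 1 / 700)
n <- length(age)

# Standard error of the mean and normal-approximation 95% CI
se <- sqrt(var(age) / n)
ci_normal <- c(mean(age) - 1.96 * se, mean(age) + 1.96 * se)

# Bootstrap: resample n values with replacement, record the mean,
# and repeat 1000 times (steps 1 to 3)
boot_means <- replicate(1000, mean(sample(age, size = n, replace = TRUE)))

# Step 4: the 2.5% and 97.5% quantiles give the 95% CI for the mean
ci_boot <- quantile(boot_means, c(0.025, 0.975))

ci_normal
ci_boot
```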

9 Linear model and regression
Models the response variable through additive effects of explanatory variables
– E.g. how does the stopping distance of a car depend on speed?
– Or how does the weight of an animal depend on its length?

10 The formula
$Y = a + b_1 x_1 + \dots + b_n x_n + \varepsilon$
$Y$: response variable
$a$: intercept
$x_1, \dots, x_n$: explanatory variables
$\varepsilon$: normally distributed error term, i.e. ‘random noise’
Regression, ANOVA or ANCOVA?

11 How to interpret…
Intercept:
– Baseline value for Y
– The value that Y is expected to take if all the predictors are 0
– If one or more of the predictors are factors, then this is the value predicted for the reference levels of the factors
Coefficients $b_n$
– If $x_n$ is a numeric variable, then increasing $x_n$ by one unit changes the value of Y by $b_n$
– If $x_n$ is a factor, then the parameter $b_n$ takes a different value for each factor level, so that Y changes by the value of $b_n$ corresponding to the level of $x_n$
  » Note: the reference level of $x_n$ is included in the intercept
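To make the interpretation concrete, here is a small sketch with one numeric and one factor predictor. The datasets are built-in examples chosen for illustration; in `ToothGrowth` the reference level of `supp` is "OJ" because it is the first factor level.

```r
# Numeric predictor: the slope is the change in Y per unit of x
m_num <- lm(dist ~ speed, data = cars)
coef(m_num)   # (Intercept): expected dist at speed 0; speed: change per unit

# Factor predictor: the intercept is the prediction for the reference level,
# the coefficient is the difference of the other level from it
m_fac <- lm(len ~ supp, data = ToothGrowth)
coef(m_fac)   # (Intercept): mean len for "OJ"; suppVC: mean(VC) - mean(OJ)

# Compare with the group means
group_means <- tapply(ToothGrowth$len, ToothGrowth$supp, mean)
group_means
```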

12 Fitting the model in R
lm(y ~ x, data = <name of your dataset>), where the dataset is given as an unquoted data frame name
Formula:
y ~ x      intercept + the effect of x
y ~ x - 1  no intercept
y ~ x + z  multiple regression with main effects
y ~ x * z  multiple regression with main effects and interactions
Exploring the model: summary(model), anova(model), plot(model), where model is the fitted lm object
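The four formula variants could be fitted like this, assuming the built-in `ToothGrowth` data with `len` as the response, `dose` as x and `supp` as z (an illustrative choice):

```r
m1 <- lm(len ~ dose, data = ToothGrowth)         # intercept + effect of dose
m0 <- lm(len ~ dose - 1, data = ToothGrowth)     # no intercept
m2 <- lm(len ~ dose + supp, data = ToothGrowth)  # main effects only
m3 <- lm(len ~ dose * supp, data = ToothGrowth)  # main effects + interaction

# Exploring a fitted model
summary(m3)   # coefficients, standard errors, R-squared
anova(m3)     # sequential F tests for each term
```

Note that `dose * supp` is shorthand for `dose + supp + dose:supp`.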

13 plot() command in lm
Produces four figures
1. Residuals against fitted values
2. QQ plot for residuals
3. Standardized residuals (square root of their absolute values) against fitted values
4. Standardized residuals against leverage (‘influence’): identifies outliers and influential observations
Residuals should be normally distributed and not show any systematic trends. If not OK, then:
-> transformation of the response: sqrt(), log() (natural logarithm), …
-> transformations of the explanatory variables
-> should a generalized linear model be used?
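A minimal sketch of producing the diagnostic figures and trying a transformed response. The `cars` model is an assumed example.

```r
model <- lm(dist ~ speed, data = cars)

# Show all four diagnostic plots on one page
op <- par(mfrow = c(2, 2))
plot(model)
par(op)   # restore the previous plot layout

# If the residuals show a trend or unequal spread, a transformed
# response is one option to try; then check the diagnostics again
model_sqrt <- lm(sqrt(dist) ~ speed, data = cars)
model_log  <- lm(log(dist) ~ speed, data = cars)
```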

14 How to predict?
$Y = a + b_1 x_1 + \dots + b_n x_n$
$Y$: expected value of Y
$x_1, \dots, x_n$: values of the predictors
$a, b_1, \dots, b_n$: estimated model parameters
In R: the predict() function.
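As a sketch, prediction with `predict()` gives the same result as plugging the estimated parameters into the formula by hand. The model and the new speed values are illustrative assumptions.

```r
model <- lm(dist ~ speed, data = cars)

# New predictor values must be in a data frame with the same column names
new_data <- data.frame(speed = c(10, 20))

# Expected stopping distance at the new speeds
pred <- predict(model, newdata = new_data)
pred

# The same by hand: a + b * x
by_hand <- coef(model)[1] + coef(model)[2] * new_data$speed

# predict() can also return confidence intervals for the expected value
predict(model, newdata = new_data, interval = "confidence")
```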

15 Briefly about model selection
The aim: simplest adequate model
– Few parameters preferred over many
– Main effects preferred over interactions
– Untransformed variables preferred over transformed
– Model should still not be oversimplified
Simplifying a model
– Are the effects of the explanatory variables significant?
– Does deletion of a term increase residual variation significantly?
Model selection tools:
– anova(): tests the difference in residual variation between alternative models
– step(): stepwise model selection based on AIC values
DEMO 2
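The two tools can be sketched as follows. This is not the DEMO 2 from the course; the models and the `ToothGrowth` data are assumptions for illustration.

```r
full    <- lm(len ~ dose * supp, data = ToothGrowth)  # with interaction
reduced <- lm(len ~ dose + supp, data = ToothGrowth)  # main effects only

# F test: does dropping the interaction increase residual variation
# significantly? A small p-value favours keeping the term.
cmp <- anova(reduced, full)
cmp

# Stepwise selection based on AIC, starting from the full model
# (trace = 0 suppresses the step-by-step printout)
selected <- step(full, trace = 0)
formula(selected)
```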

