Statistical Modelling Harry R. Erwin, PhD School of Computing and Technology University of Sunderland

Resources Crawley, M. J. (2005) Statistics: An Introduction Using R. Wiley. Gonick, L., and Smith, W. (1993) The Cartoon Guide to Statistics. HarperResource (for fun).

Statistical Modelling Suppose you have a response variable, y, that varies when independent factors or measurements, x1, x2, …, xN, vary. (A single covariate can be written x.) What you want is a model that predicts the value of y as a function of the xi. Statistical models are written: –g( h(y) ~ f(xi) ) –Where g() describes the model—lm() or aov() is usual. –Where h() describes how to transform the response variable. –Where f(xi) describes which covariates you want in (and out of) the model. To add a covariate, use +; to remove it, -.
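As a minimal sketch of this notation in R (assuming a hypothetical data frame d with columns y, x1, and x2—names not from the slides):
model <- lm(y ~ x1 + x2, data = d)   # g() is lm(); f() is x1 + x2
model2 <- update(model, ~ . - x2)    # - removes the covariate x2 from the model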

Fitting Models to Data is what R is designed to do There are five kinds of models –The saturated model—one parameter per data point—you’re drawing the response line through all the data points. This tells you nothing. –The maximal model—containing all factors, interactions, and covariates of interest that can be fit given the available data. (You need at least three data points for fitting each covariate in your model.) –The current model, usually smaller than the maximal model. –The minimum adequate model—smaller than the maximal model, but explaining not significantly less variation. –The null model—one parameter, the overall mean response. (y ~ 1)
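For instance, the two extremes might look like this (a sketch, again using the hypothetical data frame d):
null.model <- lm(y ~ 1, data = d)           # one parameter: the overall mean of y
maximal.model <- lm(y ~ x1 * x2, data = d)  # all covariates of interest plus their interaction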

Definitions Covariate—an explanatory variable that is possibly predictive of the outcome under study. Factor—a covariate that takes a finite number of values, in no specific order. A Boolean value is a factor. Continuous explanatory variable—a numerical covariate; it may be restricted to integer values to represent an ordering. Interaction—a covariate that involves more than one explanatory variable. Power—a covariate that involves a polynomial of degree 2 or greater in one or more continuous explanatory variables.
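A short illustration of these terms in an R formula (hypothetical column names, not from the slides):
d$site <- factor(d$site)                        # a factor: finite, unordered levels
m <- lm(y ~ site * x1 + I(x1^2), data = d)      # x1: continuous explanatory variable
# site:x1 (implied by *) is an interaction; I(x1^2) is a power of degree 2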

General Process of Fitting 1. Fit the maximal model. Note down the residual deviance. Possibly check for overdispersion (advanced topic). 2. Begin model simplification: use update with - to remove the least significant terms first, starting with the highest-order interactions and powers (see the sketch below). 3. If the resulting increase in deviance is not significant, adopt the reduced model and continue. 4. If the increase in deviance is significant, go back to the unreduced model and look for other terms to remove. 5. Repeat steps 2–4 until only significant terms remain.
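One round of this process might look like the following sketch (hypothetical model, same data frame d as above):
m1 <- lm(y ~ x1 * x2, data = d)   # step 1: maximal model; note the residual deviance
m2 <- update(m1, ~ . - x1:x2)     # step 2: drop the highest-order interaction
anova(m1, m2)                     # steps 3-4: is the increase in deviance significant?
# If not, continue simplifying from m2; if it is, return to m1 and look elsewhere.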

Parsimony Suggests Fewer parameters Fewer explanatory variables A linear model A model without a hump (no powers greater than 1) A model without interactions A model with easily measured variables A model that reflects how the process operates.

Actions you can take Remove non-significant interactions, higher-order terms, and explanatory variables Group together factor levels (advanced topic) In ANCOVA (mixed models with continuous variables and factors), set slopes to zero if possible Rescale (advanced) if necessary to give –constancy of variance –approximately normal errors –additivity Don’t go to extremes, though: “If you torture the data, it will confess.”

Model Formulae response ~ explanatory variables –The right hand side describes the variables, their interactions, and their non-linearity—it isn’t arithmetic! –To include something in the model: + something –To remove something from the model: - something –Interactions are written A:B; the * operator expands them: y ~ A*B is the same as y ~ A + B + A:B –Nesting: / (A/B is A + A:B, or A + B%in%A), as in y ~ A/B –Conditioning is written |, as in y ~ x | z

Expanded Forms A*B*C is all the interactions up to A:B:C A/B/C is A + B%in%A + C%in%B%in%A (A+B+C)^3 is A*B*C (A+B+C)^2 is A*B*C - A:B:C poly(x,n) is a polynomial regression of degree n I(expression) means the arithmetic expression ‘as written’ in R 1 labels the intercept Error(A/B/C) specifies the error structure
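These equivalences can be checked directly; the pairs below fit identical models (a sketch, assuming a response y, factors A and B, and a numeric vector x already in the workspace):
lm(y ~ A * B)        # same model as y ~ A + B + A:B
lm(y ~ (A + B)^2)    # also expands to A + B + A:B
lm(y ~ x + I(x^2))   # I() protects the arithmetic: a raw quadratic in x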

Use of update model <- lm(y ~ A*B) model2 <- update(model, ~ . - A:B) gets rid of the interaction. –model2 is now equivalent to lm(y ~ A + B)

Transforms (Advanced) Sometimes you need to directly transform the left hand side to get constant variance. This looks like: –lm(log(y) ~ I(1/x) + sqrt(z)) Alternatively, instead of transforming the left hand side directly, you can use a ‘link’, defined by family=something in the model call, and a generalised linear model (glm) Families available: –gaussian (normal errors—the default) –poisson (count data) –binomial (proportions or binary data) –Gamma (special)
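A hedged example of the link approach (hypothetical count response named count, not from the slides):
pois.model <- glm(count ~ x, family = poisson, data = d)  # log link suits count data
summary(pois.model)                                       # inspect coefficients and deviance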

Factor/Parameter Transforms Used to make the distribution of variables closer to normal, with constant variance. A more advanced topic that will be mentioned in the summary lecture. If it turns out you need to do this, I’ll provide consulting support (X3227).

Types of Models lm(y~x)—linear model (x is a continuous variable) aov(y~x)—analysis of variance (x is a factor) aov(y~x+z)—analysis of covariance if x and z include a factor and a continuous explanatory variable. glm(y~x)—generalised linear model (like lm but advanced) –Non-constant variance –Non-normal errors –Options include family=poisson, binomial, Gamma, gaussian gam(y~x)—generalised additive model (complex, advanced) also lme(y~x), nls(y~x), nlme(y~x), loess(y~x), tree(y~x)
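A few of these side by side (sketches only, with hypothetical columns: g a factor, y01 a 0/1 response):
lm(y ~ x, data = d)                        # x continuous: linear model
aov(y ~ g, data = d)                       # g a factor: analysis of variance
aov(y ~ x + g, data = d)                   # factor plus continuous: analysis of covariance
glm(y01 ~ x, family = binomial, data = d)  # binary response via a link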

Modelling Demonstration A demonstration will be given in the analysis of covariance slides.

Model Checking (demoed later) Plot the residuals against –fitted values –explanatory variables –the sequence of data collection –standard normal deviates A demo of mcheck follows.

mcheck
mcheck <- function (obj, ...) {
  rs <- obj$resid                  # residuals from the fitted model object
  fv <- obj$fitted                 # fitted values
  par(mfrow = c(1, 2))             # two plots side by side
  plot(fv, rs, xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)           # dashed reference line at zero
  qqnorm(rs, xlab = "Normal scores", ylab = "Ordered residuals")
  qqline(rs, lty = 2)              # dashed line through the quartiles
  par(mfrow = c(1, 1))             # restore the single-plot layout
  invisible(NULL)
}
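A usage sketch (any lm or aov fit will do; the model here is hypothetical):
model <- lm(y ~ x, data = d)
mcheck(model)   # left panel: residuals vs fitted values; right panel: normal Q-Q plot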

Observations The order of term deletion matters –Delete the high-order interactions (A:B) and high-order terms (I(x^2)) first. –Then delete the remaining terms in decreasing order of importance. The test you use is anova(), because you’re comparing two models. Be pragmatic.