Workshop in R & GLMs: #2 Diane Srivastava University of British Columbia

Start by loading your Lakedata_06 dataset: diane<-read.table(file.choose(),sep=";",header=TRUE)

Dataframes Two ways to identify a column (called "treatment") in your dataframe (called "diane"): diane$treatment OR attach(diane); treatment At end of session, remember to: detach(diane)

Summary statistics
length(x) - number of elements
mean(x) - mean
var(x) - variance
cor(x,y) - correlation
sum(x) - sum
summary(x) - minimum, maximum, mean, median, quartiles
What is the correlation between two variables in your dataset?
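The calls above can be tried on toy vectors (hypothetical values standing in for two columns of your dataframe, just to show the syntax):

```r
# Hypothetical stand-ins for two columns of your dataset
x <- c(2.1, 3.4, 5.0, 6.2, 7.9)
y <- c(1.0, 2.2, 2.9, 4.5, 5.1)

length(x)    # number of observations: 5
mean(x)      # arithmetic mean
var(x)       # sample variance
sum(x)       # total
cor(x, y)    # Pearson correlation of the two columns
summary(x)   # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
```

To answer the slide's question on your own data, pass any two columns to cor(), e.g. cor(diane$Species, diane$Elevation).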

Factors A factor has several discrete levels (e.g. control, herbicide). If a vector contains text, R automatically treats it as a factor (in R versions before 4.0; later versions need stringsAsFactors=TRUE when reading data). To manually convert a numeric vector to a factor: x <- as.factor(x) To check whether your vector is a factor, and what the levels are: is.factor(x); levels(x)

Exercise Make lake area into a factor called AreaFactor: Area 0 to 5 ha: small Area 5.1 to 10: medium Area > 10 ha: large You will need to: 1. Tell R how long AreaFactor will be. AreaFactor<-Area; AreaFactor[1:length(Area)]<-"medium" 2. Assign cells in AreaFactor to each of the 3 levels 3. Make AreaFactor into a factor, then check that it is a factor

Exercise Make lake area into a factor called AreaFactor: Area 0 to 5 ha: small; Area 5.1 to 10 ha: medium; Area > 10 ha: large. You will need to: 1. Tell R how long AreaFactor will be. AreaFactor<-Area; AreaFactor[1:length(Area)]<-"medium" 2. Assign cells in AreaFactor to each of the 3 levels. AreaFactor[Area <= 5]<-"small"; AreaFactor[Area > 10]<-"large" 3. Make AreaFactor into a factor, then check that it is a factor. AreaFactor<-as.factor(AreaFactor); is.factor(AreaFactor)
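The same three levels can also be built in one step with cut(), which bins a numeric vector directly into a factor (sketch with hypothetical Area values in hectares, standing in for the lake dataset):

```r
# Hypothetical lake areas in hectares
Area <- c(2, 4.8, 6, 9.9, 12, 15)

# Intervals (0,5], (5,10], (10,Inf] map onto the three levels
AreaFactor <- cut(Area, breaks = c(0, 5, 10, Inf),
                  labels = c("small", "medium", "large"))
is.factor(AreaFactor)   # TRUE
levels(AreaFactor)      # "small" "medium" "large"
```

This avoids the pre-allocation step, at the cost of being less explicit about each assignment.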

Linear regression model <- lm (y ~ x, data = diane) Here "model" is a name you invent; lm stands for linear model; insert your y vector name, your x vector name, and your dataframe name. (The tilde ~ can be typed with ALT+126 on keyboards that lack the key.)

Linear regression
model <- lm (Species ~ Elevation, data = diane)
summary (model)
Call: lm(formula = Species ~ Elevation, data = diane)
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) ***
Elevation
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: on 29 degrees of freedom
Multiple R-Squared: , Adjusted R-squared:
F-statistic: on 1 and 29 DF, p-value:

Linear regression model2 <- lm (Species ~ AreaFactor, data = diane) summary (model2) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) e-09 *** AreaFactormedium AreaFactorsmall e-05 *** Large has a mean species richness of 11.8 (the intercept). Medium has a mean species richness of 11.8 - 2.4 = 9.4. Small has a mean species richness of 11.8 - 8.3 = 3.5. mean(Species[AreaFactor=="medium"])

ANOVA
model2 <- lm (Species ~ AreaFactor, data = diane)
anova (model2)
Analysis of Variance Table
Response: Species
Df Sum Sq Mean Sq F value Pr(>F)
AreaFactor e-05 ***
Residuals

F tests in regression
model3 <- lm (Species ~ Elevation + pH, data = diane)
anova (model, model3)
Model 1: Species ~ Elevation
Model 2: Species ~ Elevation + pH
Res.Df RSS Df Sum of Sq F Pr(>F) **
F(1, 28) = 9.43

Exercise Fit the model: Species~pH. Fit the model: Species~pH+pH2 ("pH2" is just pH^2, i.e. pH squared). Use the anova command to decide whether species richness is a linear or quadratic function of pH.
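A minimal sketch of the comparison, using simulated data (pH and Species here are hypothetical stand-ins for the workshop dataset, built with a deliberate quadratic signal):

```r
# Simulated stand-ins for the workshop variables
set.seed(1)
pH <- runif(31, 4, 9)
Species <- rpois(31, lambda = exp(3 - 0.3 * (pH - 6.5)^2))  # humped relationship

pH2 <- pH^2                       # the quadratic term as its own vector
m1 <- lm(Species ~ pH)            # linear model
m2 <- lm(Species ~ pH + pH2)      # quadratic model; or lm(Species ~ pH + I(pH^2))
anova(m1, m2)  # a significant F means the quadratic term improves the fit
```

With a real dataset the conclusion of course depends on the data; here the F-test flags the quadratic term because the simulation built it in.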

Distributions: not so normal! Review assumptions for parametric stats (ANOVA, regressions) Why don’t transformations always work? Introduce non-normal distributions

Tests before ANOVA and t-tests: Normality; Constant variances

Tests for normality: exercise data<-c(rep(0:6,c(42,30,10,5,5,4,4)));data How many datapoints are there?
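One way to check, re-creating the vector from the slide (rep() repeats each value 0..6 the given number of times):

```r
# Re-create the slide's vector: 42 zeros, 30 ones, 10 twos, ...
data <- rep(0:6, c(42, 30, 10, 5, 5, 4, 4))
length(data)                      # 100 datapoints
sum(c(42, 30, 10, 5, 5, 4, 4))   # the repeat counts also sum to 100
```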

Tests for normality: exercise Shapiro-Wilk (if significant, not normal) shapiro.test (data) If you get an error message, make sure the stats package is loaded, then try again: library(stats); shapiro.test (data)

Tests for normality: exercise Kolmogorov-Smirnov (if significant, not normal) ks.test(data,"pnorm",mean(data),sd=sqrt(var(data)))

Tests for normality: exercise Quantile-quantile plot (if it wavers substantially off the 1:1 line, not normal) par(pty="s") makes the plotting region square; then: qqnorm(data); qqline(data)

Tests for normality: exercise

If the distribution isn't normal, what is it? freq.data<-table(data); freq.data barplot(freq.data)

Non-normal distributions Poisson (count data, right-skewed, variance = mean) Negative binomial (count data, right-skewed, variance >> mean) Binomial (binary or proportion data, variance constrained by the 0-1 bounds) Gamma (variance increases as the square of the mean; often used for survival data)
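The variance-to-mean relationships can be seen directly by simulating draws from the first two distributions (illustrative parameters, not tied to any dataset):

```r
set.seed(7)
pois <- rpois(10000, lambda = 4)            # Poisson: variance approx. = mean
nb   <- rnbinom(10000, mu = 4, size = 0.5)  # negative binomial: variance >> mean

c(mean(pois), var(pois))  # both close to 4
c(mean(nb),   var(nb))    # variance far exceeds the mean
                          # (theory: mu + mu^2/size = 4 + 16/0.5 = 36)
```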

Exercise model2 <- lm (Species ~ AreaFactor, data = diane) anova (model2) 1. Test for normality of residuals resid2<- resid (model2)...you do the rest! 2. Test for homogeneity of variances summary (lm (abs (resid2) ~ AreaFactor))

Regression diagnostics 1. Residuals are normally distributed. 2. Absolute values of the residuals do not change with the predicted value (homoscedasticity). 3. Residuals show no pattern with predicted values (i.e. the function "fits"). 4. No datapoint has undue leverage on the model.

Regression diagnostics model3 <- lm (Species ~ Elevation + pH, data = diane) par(mfrow=c(2,2)); plot(model3)

1. Residuals are normally distributed Points fall on a straight line in the "Normal Q-Q plot" (standardized residuals vs. theoretical quantiles).

2. Absolute residuals do not change with predicted values No fan shape in the Residuals vs Fitted plot; no upward (or downward) trend in the Scale-Location plot (sqrt of absolute standardized residuals vs. fitted values). The slide contrasts a bad and a good example.

Examples of neutral and fan-shapes

3. Residuals show no pattern Curved residual plots result from fitting a straight line to non-linear data (e.g. quadratic)

4. No unusual leverage Cook’s distance > 1 indicates a point with undue leverage (large change in model fit when removed)
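The Cook's distance rule of thumb can be demonstrated on a toy regression with one deliberately extreme point (hypothetical data, for illustration only):

```r
set.seed(1)
x <- c(1:10, 30)                               # 11th point has extreme x
y <- c(2 * (1:10) + rnorm(10, sd = 0.5), 10)   # ...and sits far below the trend

m <- lm(y ~ x)
cd <- cooks.distance(m)
which(cd > 1)   # flags the influential 11th point; all others stay below 1
```

Refitting without the flagged point (lm(y[-11] ~ x[-11])) shows the large change in slope that the diagnostic warns about.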

Transformations Try transforming your y-variable to improve the regression diagnostic plots: replace Species with log(Species), or replace Species with sqrt(Species)
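In code, that means refitting with the transformed response and re-checking the plots (sketch with simulated right-skewed counts; the variable names mimic the workshop dataset but the values are made up):

```r
# Simulated stand-ins for the workshop variables
set.seed(3)
Elevation <- runif(31, 100, 1000)
Species <- rpois(31, lambda = exp(3 - 0.002 * Elevation))

model_log  <- lm(log(Species + 1) ~ Elevation)  # +1 because log(0) is undefined
model_sqrt <- lm(sqrt(Species) ~ Elevation)
par(mfrow = c(2, 2)); plot(model_log)           # re-check the four diagnostics
```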

Poisson distribution Frequency data Lots of zero (or minimum value) data Variance increases with the mean

What do you do with Poisson data? 1. Correct for the correlation between mean and variance by log-transforming y (but log(0) is undefined!) 2. Use non-parametric statistics (but low power) 3. Use a "generalized linear model" specifying a Poisson distribution
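Option 3 in practice is a one-line change from lm() to glm() (sketch with simulated count data; the variable names are illustrative, not from the workshop dataset):

```r
# Simulated counts whose variance tracks the mean, as in a Poisson process
set.seed(2)
x <- runif(40, 0, 10)
y <- rpois(40, lambda = exp(0.3 + 0.2 * x))

m <- glm(y ~ x, family = poisson)  # log link by default: no manual transform,
                                   # and zeros in y are handled naturally
summary(m)
```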

The problem: it is hard to transform data to satisfy all the requirements at once! Homework: the Janka example. Janka dataset: are Janka hardness values a good estimate of timber density? N = 36