Correlation and Regression

Correlation and Regression Used when we are interested in the relationship between two variables, NOT the differences between means or medians of different groups. The reverse is also true… so in your paper, you should not have written: “There was a correlation between number of pupae and presence of an interspecific competitor.” Rather, the correct way would be: “There was a difference in the mean number of pupae produced between treatments with and without an interspecific competitor.”

Correlation This is used to: – describe the strength of a relationship between two variables. This is the “r value”, and it can vary from -1.0 to 1.0. – determine the probability that two UNRELATED variables would produce a relationship this strong just by chance. This is the “p value”.

If N = 62, then r_crit = … for p = 0.05, and r_crit = … for p = 0.01.

(Figure: an example of a “spurious” correlation.)

Correlation Important Note: – Correlation does not imply causation: the variables are related, but one does not cause the second. – Often, both variables are dependent variables in the experiment, such as mean mass of flies and number of offspring, so it is incorrect to think of one variable as ‘causing’ the other. As number increases, the amount of food per individual declines, and flies grow to a smaller size. Or, as flies grow, small ones need less food, so more small ones can survive together than large ones can.

Correlation Parametric test – If the data are normally distributed, then you can use a parametric test to determine the correlation coefficient: the Pearson correlation coefficient.

(Figure: scatterplots of positive and negative correlations. NOTE: no lines drawn through the points!)

Pearson’s Correlation Assumptions of the Test – Random sample from the populations – Both variables are approximately normally distributed – Measurement of both variables is on an interval or ratio scale – The relationship between the 2 variables, if it exists, is linear. Thus, before doing any correlation, plot the relationship to see if it’s linear!

Pearson’s Correlation How to calculate Pearson’s correlation coefficient: r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² ], where n = sample size.

Testing r Calculate t using the formula t = r · √[(n − 2) / (1 − r²)] Compare to the tabled t-value with n − 2 df. Reject the null if the calculated value > the table value. But SPSS will do all this for you, so you don’t need to!
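For readers following along outside SPSS, here is a minimal Python sketch of the two formulas above; the x/y values are made up for illustration, not taken from any slide.

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements (illustrative values only)
x = np.array([161, 165, 170, 172, 175, 178, 180, 183, 185, 188], dtype=float)
y = np.array([158, 163, 171, 170, 174, 180, 178, 185, 187, 192], dtype=float)

# Pearson's r from the definition: covariance scaled by the two spreads
xd, yd = x - x.mean(), y - y.mean()
r = np.sum(xd * yd) / np.sqrt(np.sum(xd**2) * np.sum(yd**2))

# Significance test: t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 df
n = len(x)
t = r * np.sqrt((n - 2) / (1 - r**2))
p = 2 * stats.t.sf(abs(t), df=n - 2)        # two-tailed p value

# scipy returns the same r and two-tailed p in a single call
r_check, p_check = stats.pearsonr(x, y)
print(f"r = {r:.3f}, t = {t:.2f}, p = {p:.4g}")
```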

Example The heights and arm spans of 10 adult males were measured in cm. Is there a correlation between these two measurements?

Example (Data table: Height (cm) and Arm Span (cm) for the 10 males.)

Step 1 – plot the data

Example Step 2 – Calculate the correlation coefficient: r = … Step 3 – Test the significance of the relationship: p = …

Nonparametric correlation Spearman’s test This is the most commonly used test when one of the assumptions of the parametric test cannot be met – usually because the data are non-normal, the relationship is non-linear, or the data are ordinal. The only assumptions of Spearman’s test are that the data are randomly collected and that the scale of measurement is at least ordinal.

Spearman’s test Like most non-parametric tests, the data are first ranked from smallest to largest – in this case, each column is ranked independently of the other. Then (1) subtract each pair of ranks, (2) square the differences, (3) sum the squared values, and (4) plug the sum into the following formula to calculate the Spearman correlation coefficient.

Spearman’s test Calculating Spearman’s correlation coefficient: r_s = 1 − (6 Σd²) / (n(n² − 1))
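A sketch of the same recipe in Python, with hypothetical mass/territory numbers (not the slide data); scipy's spearmanr serves as the library check, and it also handles tied ranks, which the simple formula does not.

```python
import numpy as np
from scipy import stats

# Hypothetical mass / territory data (illustrative values only)
mass      = np.array([210, 220, 245, 250, 265, 280, 290, 300, 310, 320, 340, 355, 360], dtype=float)
territory = np.array([1.1, 1.4, 1.2, 2.0, 2.6, 2.5, 3.8, 3.6, 4.1, 4.9, 5.2, 6.8, 6.5])

# Steps (1)-(3): rank each column independently, difference the ranks, square, sum
d = stats.rankdata(mass) - stats.rankdata(territory)
sum_d2 = np.sum(d**2)

# Step (4): Spearman's formula (exact only when there are no ties)
n = len(d)
r_s = 1 - 6 * sum_d2 / (n * (n**2 - 1))

rho, p = stats.spearmanr(mass, territory)    # library version, with p value
print(f"r_s = {r_s:.3f}, scipy rho = {rho:.3f}, p = {p:.4g}")
```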

Testing r The null hypothesis for a Spearman’s correlation test is also that ρ = 0; i.e., H₀: ρ = 0; Hₐ: ρ ≠ 0. When we reject the null hypothesis we can accept the alternative hypothesis that there is a correlation, or relationship, between the two variables.

Testing r Calculate t using the formula t = r_s · √[(n − 2) / (1 − r_s²)] Compare to the tabled t-value with n − 2 df. Reject the null if the calculated value > the table value. But SPSS will do all this for you, so you don’t need to!

Example The mass (in grams) of 13 adult male tuataras and the size of their territories (in square meters) were measured. Are territory size and the size of the adult male tuatara related?

Example (Data table: observation number, mass, and territory size for the 13 males.)

Step 1 – plot the data. Note – not very linear!

(Worked table: for each observation – number, mass, mass rank, territory size, territory rank, d = difference between ranks, and d². Σd² = 60.)
r_s = 1 − (6 × 60) / (13 × (13² − 1)) = 1 − 360 / 2184 = 0.835

Example Step 2 – Calculate the correlation coefficient: r_s = 0.835 Step 3 – Test the significance of the relationship: t = 5.03, p = …
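As a quick check, plugging the slide's worked values (Σd² = 60, n = 13) into the two formulas reproduces the reported statistics.

```python
import numpy as np

n, sum_d2 = 13, 60                            # from the worked table above
r_s = 1 - (6 * sum_d2) / (n * (n**2 - 1))     # 1 - 360/2184 = 0.835
t = r_s * np.sqrt((n - 2) / (1 - r_s**2))     # ≈ 5.03 with n - 2 = 11 df (up to rounding of r_s)
print(f"r_s = {r_s:.3f}, t = {t:.2f}")
```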

Linear Regression Here we are testing a causal relationship between the two variables. We are hypothesizing a functional relationship between the two variables that allows us to predict a value of the dependent variable, y, corresponding to a given value of the independent variable, x.

Regression Unlike correlation, regression does imply causality. An independent and a dependent variable can be identified in this situation. – This is most often seen in experiments, where you experimentally assign the independent variable and measure the response as the dependent variable. Thus, the independent variable is not normally distributed (indeed, it has no variance associated with it!), as it is usually selected by the investigator.

Linear Regression For a linear regression, this can be written as: – μ_y = α + βx (or y = mx + b) – where μ_y = the population mean value of y at any value of x, – α = the population (y) intercept, and – β = the population slope. You can use this equation to make predictions – although of course α and β are usually estimated by sample statistics rather than population parameters.

Linear Regression Assumptions – 1. The independent variable (X) is fixed and measured without error – no variance. – 2. For any value of the independent variable (X), the dependent variable (Y) is normally distributed, and the population mean of these values of y, μ_y, is: μ_y = α + βx

Linear Regression Assumptions – 3. For any value of x, any particular value of y is: yᵢ = α + βx + εᵢ where εᵢ, the residual, is the amount by which any observed value of y differs from the mean value of y (analogous to “random error”). The residuals follow a normal distribution with a mean of 0.

Linear Regression Assumptions – 4. The variances of the y variable for all values of x are equal – 5. Observations are independent – each individual is measured only once.

(Figure: Y vs. X scatterplot with equal scatter at every value of X – OK.)

(Figure: Y vs. X scatterplot with scatter that changes across X – not OK.)

Estimating the Regression Function and Line A regression line always passes through the point (x̄, ȳ): “mean x, mean y”.

Example – Juniper pythons: heart rate was measured in single, randomly selected snakes at different temperatures (one snake per temperature). (Data table: Temperature (˚C) and Heart Rate for each snake.) Mean (x) = 10; Mean (y) = 19.88

Example (Scatterplot of heart rate against temperature.)

Mean x = 10; mean y = 19.88. How much does each value of y (yᵢ) deviate from the mean of y: yᵢ − ȳ. The horizontal line represents a regression line for y when x (temperature) is not considered. The residuals are very large!

Estimating the Regression Function and Line To measure total error, you want to sum the residuals… but they will cancel out… so you must square the differences, then sum. The sum of squares of these deviations is thus: SS_T = Σ(yᵢ − ȳ)² Now we have the TOTAL SUM OF SQUARES (SS_T). Thus, you see a lot of variance in y when x is not taken into account. How much of the variance in y can be attributed to the relationship with x?

Example Mean x = 10; mean y = 19.88. The “line of best fit” minimizes the residual sum of squares. The best-fit line represents a regression line for y when x (temperature) is considered. Now the residuals are very small – in fact, the smallest sum possible.

Estimating the Regression Function and Line This “line of best fit” minimizes the y sum of squares and accounts for how x, the independent variable, influences y, the dependent variable. The differences between the observed values and this “line of best fit” are the residuals – the “error” left over when the relationship is included.

Estimating the Regression Function and Line The sum of squares of these regression residuals is now: SS_e = Σ(yᵢ − ŷᵢ)² This is equivalent to the ERROR SS (SS_e); it is the variance “left over” after the relationship with x has been included.

Estimating the Regression Function and Line How do we get this best fit line? Based on the principles we just went over, you can calculate the slope and the intercept of the best fit line.

Estimating the Regression Function and Line b = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² (the sample slope) a = ȳ − b·x̄ (the sample intercept)
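These two formulas are short enough to sketch directly in Python and check against scipy.stats.linregress; the temperature/heart-rate numbers below are made up, since the slide's raw data were not preserved.

```python
import numpy as np
from scipy import stats

# Hypothetical snake data (one snake per temperature; illustrative values only)
temp = np.array([2, 4, 6, 8, 10, 12, 14, 16, 18], dtype=float)
hr   = np.array([5, 9, 12, 16, 19, 23, 27, 31, 36], dtype=float)

# Slope and intercept from the least-squares formulas above
xd = temp - temp.mean()
b = np.sum(xd * (hr - hr.mean())) / np.sum(xd**2)
a = hr.mean() - b * temp.mean()                 # the line passes through (x̄, ȳ)

res = stats.linregress(temp, hr)                # library check
print(f"b = {b:.3f} (scipy: {res.slope:.3f}), a = {a:.3f} (scipy: {res.intercept:.3f})")
```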

Testing the Significance of the Regression Line In a regression, you test the null hypothesis – H₀: β = 0; Hₐ: β ≠ 0 This is done using an ANOVA procedure. To do this, you calculate sums of squares, their corresponding degrees of freedom, mean squares, and finally an F value (just like an ANOVA!)

Sums of Squares SS_T – the sum of squares for y when x is not considered (the total sum of squares). SS_e – the sum of squares of the residuals; in other words, it represents the variance in y that is still present even when x is considered (the error sum of squares). SS_R – the variation in y accounted for by the relationship with x. It can be calculated two ways: – by subtraction (SS_T − SS_e) – directly, using the formula on the next slide.

Sums of Squares SS_T = Σ(yᵢ − ȳ)² SS_e = Σ(yᵢ − ŷᵢ)² SS_R = SS_T − SS_e = b · Σ(xᵢ − x̄)(yᵢ − ȳ)

Regression ANOVA Table (see p. 120)
Source of Variation | Sum of Squares | df    | MS            | F
Regression          | SS_R           | 1     | SS_R / 1      | MS_R / MS_E
Error               | SS_E           | n − 2 | SS_E / (n − 2)|
Total               | SS_T           | n − 1 | SS_T / (n − 1)|
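The table can be filled in numerically as a sketch, reusing the hypothetical temperature/heart-rate data from the earlier example; it also yields r², which the coefficient-of-determination slide below defines.

```python
import numpy as np
from scipy import stats

temp = np.array([2, 4, 6, 8, 10, 12, 14, 16, 18], dtype=float)   # hypothetical data
hr   = np.array([5, 9, 12, 16, 19, 23, 27, 31, 36], dtype=float)
n = len(temp)

res = stats.linregress(temp, hr)
y_hat = res.intercept + res.slope * temp        # fitted values on the best-fit line

ss_t = np.sum((hr - hr.mean())**2)              # total SS: x ignored
ss_e = np.sum((hr - y_hat)**2)                  # error SS: residuals around the line
ss_r = ss_t - ss_e                              # SS explained by the regression

ms_r, ms_e = ss_r / 1, ss_e / (n - 2)
F = ms_r / ms_e
p = stats.f.sf(F, 1, n - 2)                     # upper-tail F probability
r2 = ss_r / ss_t                                # coefficient of determination
print(f"F = {F:.1f}, p = {p:.3g}, r^2 = {r2:.3f}")
```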

Testing the Significance of the Regression Line Interpret exactly as for an ANOVA

Coefficient of determination The coefficient of determination, or r², tells you what proportion of the variance in y is explained by its dependence on x. r² = SS_R / SS_T e.g., if r² = 0.98, then 98% of the variance in y is explained by x – and 2% of the variance is unexplained.

Example Suppose you want to describe the effects of temperature on development time in Drosophila. You let flies lay eggs (on mushrooms in 30 vials) for one day. You select 3 temperature treatments (20, 25, 30 °C) and randomly assign 10 vials to each treatment. You count the number of flies that emerge each day. From these data, you compute two variables: number of flies and mean number of days to develop. Number of flies is not a dependent variable, because it did not vary as a consequence of temperature – eggs were laid before vials were placed in the temperature treatments. But you know that the number of flies – and the competitive stress it creates – might cause a change in developmental rate. So it is a potential correlate.

OUTPUT – Linear Regression (SPSS output table for the simple regression.)

OUTPUT: Multiple Regression – Abundance and Temp (SPSS output table.)

Multiple regression – Stepwise (ANOVA table with columns Source | SS | df | MS | F | P and rows Total, Abundance, Temp, Regression, Residual.)
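The transcript lost the output tables themselves, but a multiple regression of the kind described can be sketched with statsmodels; the variable names and simulated numbers below are stand-ins for the slides' Abundance and Temp, not the original data.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Simulated stand-ins for the fly experiment's variables (hypothetical)
abundance = rng.uniform(10, 50, size=30)              # flies per vial
temp      = rng.choice([20.0, 25.0, 30.0], size=30)   # temperature treatment
days      = 25 - 0.5 * temp + 0.05 * abundance + rng.normal(0, 0.5, size=30)

# Two predictors plus an intercept column
X = sm.add_constant(np.column_stack([abundance, temp]))
model = sm.OLS(days, X).fit()
print(model.summary())    # coefficient table, F test, R^2 - the kind of OUTPUT shown above
```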

ANCOVA: Comparing means between treatments (NOT looking for a linear relationship), while accounting for variability due to correlated variables. ANOVA ALONE: (Figure: the unadjusted treatment means.)

ANCOVA: Analysis of Covariance Comparing means between treatments, while accounting for variability due to correlated variables.
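A minimal ANCOVA sketch in the same spirit, using statsmodels' formula interface; the columns "mass", "treatment", and the covariate "abundance" are hypothetical names chosen for illustration.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical data: male mass per treatment, plus a correlated covariate
df = pd.DataFrame({
    "mass":      [1.2, 1.4, 1.1, 1.6, 1.8, 1.7, 2.1, 2.2, 1.9],
    "treatment": ["1", "1", "1", "2", "2", "2", "3", "3", "3"],
    "abundance": [30, 25, 35, 28, 22, 26, 20, 18, 24],
})

# ANCOVA: treatment means compared after adjusting for the covariate
model = smf.ols("mass ~ C(treatment) + abundance", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```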

(Figure: differences in mean male mass between treatments 1 vs. 3.)