Correlation Coefficients
Pearson’s product-moment correlation coefficient: interval or ratio data only.
What about ordinal data?
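Spearman’s Rank Correlation Coefficient
$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n^3 - n}$$
where $d_i$ is the difference between the two ranks of observation $i$ and $n$ is the number of observations.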
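The formula can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `spearman_from_ranks` and the five pairs of ranks are hypothetical, and the sketch assumes the ranks contain no ties.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman's r_s from two equal-length lists of ranks (assumes no ties)."""
    if len(rank_x) != len(rank_y):
        raise ValueError("rank lists must have the same length")
    n = len(rank_x)
    if n < 2:
        raise ValueError("need at least two observations")
    # Sum of squared rank differences d_i^2
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


# Hypothetical ranks for five observations (not the lecture's data)
rank_x = [1, 2, 3, 4, 5]
rank_y = [2, 1, 4, 3, 5]
print(spearman_from_ranks(rank_x, rank_y))
```

For these hypothetical ranks the squared differences sum to 4, so $r_s = 1 - 24/120 = 0.8$. Identical rankings give $r_s = 1$ and exactly reversed rankings give $r_s = -1$.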
Spearman’s Rank Correlation Coefficient: Example
A Significance Test for $r_s$
$$SE_{r_s} = \frac{1}{\sqrt{n - 1}}$$
$$t_{test} = \frac{r_s}{SE_{r_s}} = r_s \sqrt{n - 1}$$
$$df = n - 1$$
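As a rough sketch, the test statistic above can be computed as follows. The function name `spearman_t` and the input values are hypothetical. The code only evaluates the three expressions on the slide and leaves the comparison against a critical value to a t table.

```python
import math


def spearman_t(r_s, n):
    """Return (standard error, t statistic, degrees of freedom) for r_s."""
    if n < 2:
        raise ValueError("need at least two observations")
    se = 1 / math.sqrt(n - 1)  # SE of r_s
    t = r_s / se               # equivalently r_s * sqrt(n - 1)
    df = n - 1
    return se, t, df


# Hypothetical inputs: r_s = 0.8 from n = 5 observations
se, t, df = spearman_t(0.8, 5)
print(se, t, df)
```

With these hypothetical inputs the standard error is 0.5, the test statistic is 1.6, and there are 4 degrees of freedom.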
Spearman’s Rank Correlation Coefficient: Example
Pearson’s r: Assumptions
1. Interval or ratio scale data
2. Selected randomly
3. Linear relationship
4. Joint bivariate normal distribution (S-Plus: qqnorm)
Spearman’s Rank Correlation Coefficient
Ordinal data: already in a ranked form.
Interval or ratio data: convert it to rankings.
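Converting interval or ratio data to rankings might look like the sketch below. Two choices here are assumptions rather than anything stated on the slide: the smallest value receives rank 1, and tied values share the average of the ranks they would otherwise occupy. The function name `to_ranks` and the sample values are hypothetical.

```python
def to_ranks(values):
    """Rank values from 1 (smallest) to n; tied values share the average rank."""
    # Indices of the values in ascending order of value
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend 'end' over the run of values tied with the one at 'pos'
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return ranks


# Hypothetical measurements (not the lecture's data)
print(to_ranks([0.3, 0.1, 0.2]))  # no ties
print(to_ranks([10, 20, 10]))     # the two 10s share a rank
```

Note that the shortcut formula for $r_s$ based on $d_i^2$ is exact only when there are no ties, so heavily tied data may call for a different treatment.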
Spearman’s Rank Correlation Coefficient
TVDI (x) | Rank (x) | Theta (y) | Rank (y) | Difference ($d_i$)
A Significance Test for $r_s$
S-Plus
[Output for TVDI (x) and Theta (y)]
Correlation and Regression
Correlation: direction and strength.
We might wish to go a little further: rate of change and predictability (regression).
Two Sorts of Bivariate Relationships
Deterministic: perfect knowledge.
Probabilistic: an estimate, but not with absolute accuracy (or certainty).
A Deterministic Relationship
Travel at a constant speed: time spent driving vs. distance traveled is deterministic.
$$s = s_0 + vt$$
$s$: distance traveled; $s_0$: initial distance; $v$: speed; $t$: time traveled.
[Figure: distance (s) against time (t), with slope $v$ and intercept $s_0$]
Truly deterministic relationships are rare.
A Probabilistic Relationship
More often the relationship is probabilistic, e.g., ages vs. heights (2–20 yrs).
[Figure: height (meters) against age (years)]
A good relationship, but with some unpredictability or error.
Sampling and Regression
Our expectation of the relationship is less than perfect:
Collecting data introduces measurement errors (e.g., height).
Other factors are not accounted for in the model (e.g., plant growth vs. T).
Simple vs. Multiple Regression
Simple linear regression: y as a function of x.
Multiple linear regression: y as a function of $x_1, x_2, \ldots, x_n$.
Simple Linear Regression
Model: $y = a + bx + e$
x: independent variable; y: dependent variable; b: slope; a: intercept; e: error term.
[Figure: y (dependent) against x (independent), showing slope b, intercept a, and the error]
Fitting a Line to a Set of Points
Scatterplot: fitting a line.
[Figure: scatterplot of y (dependent) against x (independent)]
Least squares method: minimize the error term e.
Sampling and Regression
Sampled data give the model $y = a + bx + e$.
This is an attempt to estimate a “true” regression line $y = \alpha + \beta x + \varepsilon$.
Multiple samples give several similar regression lines, each an estimate of the population regression line.
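A small simulation can make the idea of "several similar regression lines" concrete. The sketch below rests on assumptions that are not in the lecture: a hypothetical population line with $\alpha = 1$ and $\beta = 2$, normally distributed errors with standard deviation 1, and fixed x values 0 to 9. The names `fit_line` and `simulate_slopes` are likewise hypothetical.

```python
import random


def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


def simulate_slopes(alpha, beta, sigma, x, n_samples, seed=0):
    """Draw n_samples samples from y = alpha + beta*x + error; fit each one."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    slopes = []
    for _ in range(n_samples):
        y = [alpha + beta * xi + rng.gauss(0, sigma) for xi in x]
        _, b = fit_line(x, y)
        slopes.append(b)
    return slopes


x = list(range(10))
slopes = simulate_slopes(alpha=1.0, beta=2.0, sigma=1.0, x=x, n_samples=200)
mean_slope = sum(slopes) / len(slopes)
print(min(slopes), mean_slope, max(slopes))
```

Each sample yields a slightly different slope b, and the slopes cluster around the population value $\beta$.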
Least Squares Method
Minimize the error term e.
The line of best fit: $\hat{y} = a + bx$
[Figure: observed y, the fitted line $\hat{y} = a + bx$, and the vertical distance $(y - \hat{y})$]
Estimates and Residuals
Errors: $e = y - \hat{y}$, also called residuals.
Underestimate: $y > \hat{y}$ (positive residual). Overestimate: $y < \hat{y}$ (negative residual).
Minimizing the Error Term
Errors (residuals): $e = (y - \hat{y})$
Overall error: simply summing these error terms gives 0.
Instead, square the differences and then sum them up to create a useful estimate:
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
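The point about summing versus squaring can be checked numerically. In this sketch the four data points are hypothetical, and the line $\hat{y} = 0 + 1.9x$ is used because it happens to be the least-squares line for them. The function names `residuals` and `sse` are illustrative only.

```python
def residuals(x, y, a, b):
    """Residuals e_i = y_i - (a + b*x_i) for a candidate line."""
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def sse(x, y, a, b):
    """Sum of squared errors for a candidate line."""
    return sum(e ** 2 for e in residuals(x, y, a, b))


# Hypothetical data (not from the lecture)
x = [1, 2, 3, 4]
y = [2, 4, 5, 8]

e = residuals(x, y, a=0.0, b=1.9)
print(sum(e))                    # cancels to (numerically) zero
print(sse(x, y, a=0.0, b=1.9))   # about 0.70
print(sse(x, y, a=0.0, b=2.0))   # a different line gives a larger SSE
```

The positive and negative residuals cancel in the plain sum, so it says nothing about fit quality, while the SSE separates the better line (about 0.70) from the worse one (1.0).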
Minimizing the SSE
$$\min_{a,b} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \min_{a,b} \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
Finding Regression Coefficients
Least squares method:
$$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
$$a = \bar{y} - b\bar{x}$$
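The two coefficient formulas translate directly into code. The following is a minimal sketch using only the standard library. The function name `fit_line` and the data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for the line y-hat = a + b*x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data (not from the lecture)
a, b = fit_line([1, 2, 3, 4], [2, 4, 5, 8])
print(a, b)
```

For these hypothetical points, $\bar{x} = 2.5$ and $\bar{y} = 4.75$, the numerator is 9.5 and the denominator is 5, so $b = 1.9$ and $a = 4.75 - 1.9 \times 2.5 = 0$.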
Interpreting Slope (b)
Slope of the line (b): the change in y due to a unit change in x.
b > 0: y increases as x increases. b < 0: y decreases as x increases.
Regression Slope and Correlation
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_X s_Y}$$
$$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
$$b = r \frac{s_Y}{s_X}$$
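The identity $b = r \, s_Y / s_X$ follows because $\sum (x_i - \bar{x})^2 = (n - 1) s_X^2$. It can also be checked numerically. The sketch below assumes $s_X$ and $s_Y$ are sample standard deviations (divisor $n - 1$), matching the $(n - 1)$ in the formula for r. All function names and data are hypothetical.

```python
import math


def sample_sd(values):
    """Sample standard deviation (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def cross_product_sum(x, y):
    """Sum of (x_i - x_bar) * (y_i - y_bar)."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))


def pearson_r(x, y):
    """Pearson's r from the cross-product sum and sample standard deviations."""
    return cross_product_sum(x, y) / ((len(x) - 1) * sample_sd(x) * sample_sd(y))


def slope_direct(x, y):
    """Least-squares slope from the ratio of sums."""
    return cross_product_sum(x, y) / cross_product_sum(x, x)


def slope_from_r(x, y):
    """Least-squares slope via b = r * s_Y / s_X."""
    return pearson_r(x, y) * sample_sd(y) / sample_sd(x)


# Hypothetical data (not from the lecture)
x = [1, 2, 3, 4]
y = [2, 4, 5, 8]
print(slope_direct(x, y), slope_from_r(x, y))
```

Both routes give a slope of 1.9 for these hypothetical points, up to floating-point rounding.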