Linear correlation and linear regression + summary of tests Dr. Omar Al Jadaan Assistant Professor – Computer Science & Mathematics.


1 Linear correlation and linear regression + summary of tests Dr. Omar Al Jadaan Assistant Professor – Computer Science & Mathematics

2 Recall: Covariance

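3 Interpreting Covariance cov(X,Y) > 0: X and Y are positively correlated. cov(X,Y) < 0: X and Y are inversely correlated. cov(X,Y) = 0: X and Y are uncorrelated (no linear relationship). Independence implies zero covariance, but zero covariance does not by itself imply independence.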
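
To make the zero-covariance case concrete, here is a minimal sketch using made-up data (the numbers are illustrative and not from the slides). It computes the sample covariance with the usual n - 1 denominator and shows that a variable fully determined by X can still have zero covariance with X when the dependence is not linear.

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length lists (n - 1 denominator)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Illustrative data only.
xs = [-2, -1, 0, 1, 2]
ys_linear = [2 * x + 1 for x in xs]  # rises with x, so covariance is positive
ys_square = [x ** 2 for x in xs]     # determined by x, but symmetric around 0

cov_linear = covariance(xs, ys_linear)
cov_square = covariance(xs, ys_square)

print("cov(x, 2x + 1) =", cov_linear)
print("cov(x, x^2)    =", cov_square)
```

The second covariance comes out as zero even though ys_square is a function of xs, which is why zero covariance is read as "no linear relationship" rather than "independent".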

4 Correlation coefficient Pearson’s Correlation Coefficient is standardized covariance (unitless):
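
The formula on this slide was an image and did not survive in the transcript. The standard textbook definition of Pearson's sample correlation coefficient, which is presumably what the slide showed, is the covariance divided by the product of the two standard deviations. The symbols $s_x$, $s_y$, $\bar{x}$, and $\bar{y}$ are notation chosen here, not taken from the slides.

```latex
r = \frac{\operatorname{cov}(X, Y)}{s_x \, s_y}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The n - 1 factors in the covariance and in the two standard deviations cancel, which is why they do not appear in the right-hand form.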

5 Correlation Measures the relative strength of the linear relationship between two variables. Unit-less. Ranges between –1 and 1. The closer to –1, the stronger the negative linear relationship. The closer to 1, the stronger the positive linear relationship. The closer to 0, the weaker any linear relationship.
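
The properties listed on this slide can be checked numerically. The sketch below uses the standard definition of Pearson's r and small made-up data sets (none of the numbers come from the slides). It shows the two extremes, an intermediate case, and the unit-less property: rescaling X leaves r unchanged.

```python
import math


def pearson_r(xs, ys):
    """Pearson's sample correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return ss_xy / math.sqrt(ss_x * ss_y)


# Illustrative data only.
xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [10, 20, 30, 40, 50])     # perfect positive line
r_down = pearson_r(xs, [50, 40, 30, 20, 10])   # perfect negative line
r_noisy = pearson_r(xs, [2, 1, 4, 3, 5])       # positive but imperfect
r_rescaled = pearson_r([100 * x for x in xs], [2, 1, 4, 3, 5])  # change X units

print(r_up, r_down, r_noisy, r_rescaled)
```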

6 Scatter Plots of Data with Various Correlation Coefficients [Figure: six scatter plots of Y against X, with r = -1, r = -.6, r = 0, r = +.3, r = +1, and a second panel with r = 0]

7 Linear Correlation [Figure: scatter plots of Y against X contrasting linear relationships with curvilinear relationships]

8 Linear Correlation [Figure: scatter plots of Y against X contrasting strong relationships with weak relationships]

9 No relationship [Figure: two scatter plots of Y against X]

10 Some calculation formulas… Note: Easier computation formulas:
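
The formulas on this slide were images and are missing from the transcript. One widely used shortcut, which may or may not be the exact form shown on the slide, expresses r through sums of squares and cross-products that can be accumulated in a single pass over the data. The names $SS_{xy}$, $SS_x$, and $SS_y$ are notation assumed here.

```latex
SS_{xy} = \sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}, \qquad
SS_x = \sum_{i=1}^{n} x_i^2 - n \bar{x}^2, \qquad
SS_y = \sum_{i=1}^{n} y_i^2 - n \bar{y}^2

r = \frac{SS_{xy}}{\sqrt{SS_x \, SS_y}}
```

Each shortcut is algebraically equal to the corresponding sum of centered products, for example $SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$.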

11 Sampling distribution of correlation coefficient: Note: like a proportion, the variance of the correlation coefficient depends on the correlation coefficient itself, so substitute in the estimated r. Under the null hypothesis of no correlation, the sample correlation coefficient divided by its standard error follows a t-distribution with n-2 degrees of freedom (since you have to estimate the standard error).
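
The standard-error formula on this slide was an image and is not in the transcript. A minimal sketch of the usual test, assuming the common textbook form SE(r) = sqrt((1 - r^2) / (n - 2)), is given below. The function name and the example values r = 0.5 and n = 27 are hypothetical.

```python
import math


def correlation_t_statistic(r, n):
    """Return (t, standard_error, degrees_of_freedom) for testing rho = 0.

    Assumes the textbook standard error sqrt((1 - r^2) / (n - 2)),
    with the estimated r substituted in.
    """
    if n <= 2:
        raise ValueError("need more than 2 observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must be strictly between -1 and 1")
    standard_error = math.sqrt((1.0 - r ** 2) / (n - 2))
    return r / standard_error, standard_error, n - 2


# Hypothetical example: r = 0.5 observed in a sample of 27 pairs.
t_stat, se, df = correlation_t_statistic(0.5, 27)
print(f"t = {t_stat:.3f}, SE = {se:.4f}, df = {df}")
```

The resulting t would then be compared against a t-distribution with the returned degrees of freedom, using a table or a statistics library.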

12 What is “Linear”? Remember this: Y = mX + B? [Figure: a straight line with its intercept labelled B and its slope labelled m]

13 What’s Slope? A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.

14 Simple linear regression The linear regression model: Love of Math = 5 + .01 * math SAT score, where 5 is the intercept and .01 is the slope. P = .22; not significant.

15 Prediction If you know something about X, this knowledge helps you predict something about Y. (Sound familiar?…sound like conditional probabilities?)

16 EXAMPLE The distribution of baby weights at Stanford ~ N(3400, 360000), that is, mean 3400 grams and variance 360000 (standard deviation 600 grams). Your “best guess” at a random baby’s weight, given no information about the baby, is what? 3400 grams. But what if you have relevant information? Can you make a better guess?

17 Predictor variable X=gestation time Assume that babies that gestate for longer are born heavier, all other things being equal. Pretend (at least for the purposes of this example) that this relationship is linear. Example: suppose a one-week increase in gestation, on average, leads to a 100-gram increase in birth-weight

18 Y depends on X [Figure: scatter plot of Y = birth-weight (g) against X = gestation time (weeks) with a best-fit line] The best-fit line is chosen such that the sum of the squared (why squared?) vertical distances of the points (the $Y_i$'s) from the line is minimized. Or mathematically (remember max and mins from calculus): set the derivatives of $\sum_i (Y_i - (mX_i + b))^2$ with respect to $m$ and $b$ equal to 0.
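
Setting those derivatives to zero gives a closed-form solution for simple linear regression: the slope is the sum of centered cross-products divided by the sum of squared X deviations, and the intercept makes the line pass through the point of means. The sketch below applies it to hypothetical gestation and birth-weight data invented for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for the line y = m * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    m = ss_xy / ss_x
    b = mean_y - m * mean_x  # forces the line through (mean_x, mean_y)
    return m, b


def sum_squared_errors(xs, ys, m, b):
    """The quantity being minimized: sum of squared vertical distances."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: gestation time in weeks, birth-weight in grams.
weeks = [20, 25, 30, 35, 40]
grams = [2100, 2400, 3000, 3600, 3900]

slope, intercept = fit_line(weeks, grams)
best_sse = sum_squared_errors(weeks, grams, slope, intercept)
print(f"slope = {slope:.1f} g/week, intercept = {intercept:.1f} g")
```

Nudging the slope or intercept away from the fitted values and recomputing sum_squared_errors gives a larger total, which is what "best fit" means here.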

19 Prediction A new baby is born that had gestated for just 30 weeks. What’s your best guess at the birth-weight? Are you still best off guessing 3400? NO!

20 At 30 weeks… [Figure: scatter plot of Y = birth-weight (g) against X = gestation time (weeks), with X = 30 and Y = 3000 marked on the axes]

21 At 30 weeks… (x, y) = (30, 3000) [Figure: scatter plot of Y = birth-weight (g) against X = gestation time (weeks), with the point (30, 3000) marked on the line]

22 At 30 weeks… The babies that gestate for 30 weeks appear to center around a weight of 3000 grams. In Math-Speak… $E(Y \mid X = 30 \text{ weeks}) = 3000$ grams. Note the conditional expectation.

23 But… Note that not every Y-value ($Y_i$) sits on the line. There’s variability. $Y_i = 3000 + \text{random error}_i$. In fact, babies that gestate for 30 weeks have birth-weights that center at 3000 grams, but vary around 3000 with some variance $\sigma^2$. Approximately what distribution do birth-weights follow? Normal. $Y \mid X = 30 \text{ weeks} \sim N(3000, \sigma^2)$

24 And, if X = 20, 30, or 40… [Figure: scatter plot of Y = birth-weight (g) against X = gestation time (weeks), with X = 20, 30, and 40 marked]

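25 If X = 20, 30, or 40… [Figure: Y = baby weights (g) against X = gestation times (weeks), with a normal curve drawn at X = 20, 30, and 40] $Y \mid X = 40 \text{ weeks} \sim N(4000, \sigma^2)$; $Y \mid X = 30 \text{ weeks} \sim N(3000, \sigma^2)$; $Y \mid X = 20 \text{ weeks} \sim N(2000, \sigma^2)$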

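26 Mean values fall on the line. $E(Y \mid X = 40 \text{ weeks}) = 4000$; $E(Y \mid X = 30 \text{ weeks}) = 3000$; $E(Y \mid X = 20 \text{ weeks}) = 2000$. In general, $E(Y \mid X) = \mu_{Y \mid X} = 100 \text{ grams/week} \times X \text{ weeks}$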

27 Linear Regression Model Y’s are modeled… $Y_i = 100 \times X_i + \text{random error}_i$. The $100 \times X_i$ term is fixed (exactly on the line); the random error term follows a normal distribution.
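
The model can be simulated to see both parts at work: a fixed component on the line and a normally distributed error around it. This is a sketch under stated assumptions. The slides never give a value for the error standard deviation, so sigma = 300 grams is a hypothetical choice, as are the function names, the sample size, and the seed.

```python
import random


def conditional_mean(weeks, slope=100.0):
    """Fixed part of the model: E(Y | X) = slope * X, exactly on the line."""
    return slope * weeks


def simulate_weights(weeks, n, sigma=300.0, seed=0):
    """Draw n birth-weights at one gestation time: line value + normal error.

    sigma is a hypothetical error standard deviation, not from the slides.
    """
    rng = random.Random(seed)
    return [conditional_mean(weeks) + rng.gauss(0.0, sigma) for _ in range(n)]


sample_means = {}
for gestation in (20, 30, 40):
    draws = simulate_weights(gestation, n=2000, seed=gestation)
    sample_means[gestation] = sum(draws) / len(draws)
    print(gestation, "weeks: line value", conditional_mean(gestation),
          "simulated mean", round(sample_means[gestation], 1))
```

Individual simulated babies scatter around the line, but the average at each gestation time lands close to 2000, 3000, and 4000 grams, matching the conditional means on the previous slide.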

