
Statistics for clinicians Biostatistics course by Kevin E. Kip, Ph.D., FAHA Professor and Executive Director, Research Center University of South Florida,


1 Statistics for clinicians Biostatistics course by Kevin E. Kip, Ph.D., FAHA Professor and Executive Director, Research Center University of South Florida, College of Nursing Professor, College of Public Health Department of Epidemiology and Biostatistics Associate Member, Byrd Alzheimer’s Institute Morsani College of Medicine Tampa, FL, USA

2 SECTION 6.1 Correlation versus linear regression

3 Learning Outcome: Distinguish the relationship between correlation and linear regression

4 Correlation and Regression are both measures of association
Some terms for “association” variables:
Variable 1: “x” variable, independent variable, predictor variable, exposure variable
Variable 2: “y” variable, dependent variable, outcome variable

5 Correlation Coefficient
Computation form: Pearson correlation (“r”)

r = cov(x, y) / (s_x · s_y), where cov(x, y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

where x̄ and ȳ are the sample means of X and Y, and s_x and s_y are the sample standard deviations of X and Y. The numerator measures the co-variation of X and Y.
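The computation form above can be written out directly; a minimal Python sketch (the function name and structure are illustrative, not from the slides):

```python
import math

def pearson_r(x, y):
    """Pearson correlation r: co-variation of x and y, scaled by the sample SDs."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    # sample covariance (the co-variation term)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    # sample standard deviations
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return cov / (s_x * s_y)
```

A perfectly linear increasing relationship gives r = 1, a perfectly linear decreasing one gives r = −1.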

6 Introduction to Linear Regression
Like correlation, the data are pairs of independent (e.g., “X”) and dependent (e.g., “Y”) variables {(xᵢ, yᵢ): i = 1, …, n}. However, here we seek to predict values of Y from X.
The fitted equation is written: ŷ = b₀ + b₁x, where ŷ is the predicted value of the response (e.g., blood pressure) obtained by using the equation.
This equation of the line best represents the association between the independent variable and the dependent variable.
The residuals are the differences between the observed and the predicted values: {(yᵢ − ŷᵢ): i = 1, …, n}.

7 Introduction to Linear Regression
[Scatterplot: best-fitting line, r = 0.76; the line minimizes the distance between predicted and actual values]

8 Introduction to Linear Regression
ŷ = b₀ + b₁x
ŷ = predicted value of response (outcome) variable
b₀ = constant: the intercept (the value of ŷ when x = 0)
b₁ = constant: coefficient for slope of regression line, i.e., the expected change in y for a one-unit change in x
Note: unlike the correlation coefficient, b₁ is unbounded.
xᵢ = value of independent (predictor) variable for subject i

9 SECTION 6.2 Least squares regression and predicted values

10 Learning Outcomes: Describe the theoretical basis of least squares regression Calculate and interpret predicted values from a linear regression model

11 Introduction to Linear Regression
ŷ = b₀ + b₁x
In the above equation, the values of the slope (b₁) and intercept (b₀) represent the line that best predicts Y from X.
More precisely, the goal of regression is to minimize the sum of the squares of the vertical distances of the points from the line, i.e., minimize ∑(yᵢ − ŷᵢ)².
This is frequently done by the method of “least squares” regression.

12 Least squares estimates:
b₁ = r(s_y / s_x)
b₀ = Ȳ − b₁X̄
Example: We wish to estimate total cholesterol level (y) from BMI (x).
Assume r_xy = 0.78; Ȳ = 205.9, s_y = 30.8; X̄ = 27.4, s_x = 3.7
b₁ = r(s_y / s_x) = 0.78(30.8 / 3.7) = 6.49
b₀ = Ȳ − b₁X̄ = 205.9 − 6.49(27.4) = 28.07
The equation of the regression line is: ŷ = 28.07 + 6.49(BMI)
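The hand calculation above can be checked in a few lines of Python, rounding the slope to two decimals before computing the intercept, as the slide does:

```python
# Summary statistics from the slide: cholesterol (y) regressed on BMI (x)
r, y_bar, s_y, x_bar, s_x = 0.78, 205.9, 30.8, 27.4, 3.7

b1 = round(r * s_y / s_x, 2)       # slope, rounded first as in the hand calculation
b0 = round(y_bar - b1 * x_bar, 2)  # intercept, computed from the rounded slope
# b1 = 6.49, b0 = 28.07  ->  regression line: y-hat = 28.07 + 6.49(BMI)
```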

13 Least squares estimates: (Practice)
b₁ = r(s_y / s_x)
b₀ = Ȳ − b₁X̄
Example: We wish to estimate systolic blood pressure (y) from BMI (x).
Assume r_xy = 0.46; Ȳ = 133.8, s_y = 18.4; X̄ = 26.6, s_x = 3.5
b₁ = r(s_y / s_x) = ________
b₀ = Ȳ − b₁X̄ = ________
The equation of the regression line is: ŷ = ________

14 Least squares estimates: (Practice)
b₁ = r(s_y / s_x)
b₀ = Ȳ − b₁X̄
Example: We wish to estimate systolic blood pressure (y) from BMI (x).
Assume r_xy = 0.46; Ȳ = 133.8, s_y = 18.4; X̄ = 26.6, s_x = 3.5
b₁ = r(s_y / s_x) = 0.46(18.4 / 3.5) = 2.42
b₀ = Ȳ − b₁X̄ = 133.8 − 2.42(26.6) = 69.43
The equation of the regression line is: ŷ = 69.43 + 2.42(BMI)

15 Least squares estimates: (Practice)
The equation of the regression line is: ŷ = 69.43 + 2.42(BMI)
Predict systolic blood pressure for the following 3 individuals:
Person 1 has BMI of 26.4
Person 2 has BMI of 28.9
Person 3 has BMI of 34.8
ŷ₁ = ________
ŷ₂ = ________
ŷ₃ = ________

16 Least squares estimates: (Practice)
The equation of the regression line is: ŷ = 69.43 + 2.42(BMI)
Predict systolic blood pressure for the following 3 individuals:
Person 1 has BMI of 26.4
Person 2 has BMI of 28.9
Person 3 has BMI of 34.8
ŷ₁ = 69.43 + 2.42(26.4) = 133.3
ŷ₂ = 69.43 + 2.42(28.9) = 139.4
ŷ₃ = 69.43 + 2.42(34.8) = 153.6
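The fitted line can be wrapped in a small helper to generate these predictions; a minimal sketch (the function name is ours, not from the slides):

```python
def predict_sbp(bmi):
    """Predicted systolic blood pressure from the fitted line y = 69.43 + 2.42(BMI)."""
    return 69.43 + 2.42 * bmi

# Predictions for the three individuals above
for bmi in (26.4, 28.9, 34.8):
    print(round(predict_sbp(bmi), 1))
```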

17 SECTION 6.3 Assumptions and sources of variation in linear regression

18 Learning Outcomes: Describe the assumptions required for valid use of the linear regression model Describe the partitioning of sum of squares in the linear regression model

19 Introduction to Linear Regression
Some assumptions for linear regression:
 The dependent variable Y has a linear relationship to the independent variable X.
 The errors (residuals) are approximately normally distributed with constant variance; in practice, this is often screened by checking whether the dependent variable is approximately normally distributed.
 Independence of the errors (no serial correlation).

20 [Scatterplot with fitted regression line: Y = 90.681 + 0.945(age), R = 0.597]

21 Example data (note the outlier, ID 9):

ID    X     Y
 1   16    24
 2   12     8
 3   14    19
 4   11    14
 5   24    28
 6   18    22
 7   13     7
 8   29     8
 9   42   100
10    7    12
11         21
12   14    17
13         26
14   19    21
15   24     8
16   17    21
17   35     8
18   22    21
19   18     9
20         11

R = 0.573

22 (same data table as slide 21; R = 0.573)

23 Same data with log-transformed outcome:

ID    X     Y    LOG_Y
 1   16    24    3.178
 2   12     8    2.079
 3   14    19    2.944
 4   11    14    2.639
 5   24    28    3.332
 6   18    22    3.091
 7   13     7    1.946
 8   29     8    2.079
 9   42   100    4.605
10    7    12    2.485
11         21    3.045
12   14    17    2.833
13         26    3.258
14   19    21    3.045
15   24     8    2.079
16   17    21    3.045
17   35     8    2.079
18   22    21    3.045
19   18     9    2.197
20         11    2.398

24 (same log-transformed data table as slide 23)

25 Fundamental Equations for Regression
Coefficient of determination (r²):
Proportion of variation in Y “explained” by the regression on X:

R² = explained variation / total variation = SSR / SST = 1 − SSE / SST

26 Example: Fundamental Equations for Regression

ID    Y    X
 1   20   12
 2   12    7
 3   14   13
 4   17    6
 5    8    9
 6   16   12
 7   15   10
 8   13    7
 9   18   10
10   14   14
11   11    8
12   10    4

N = 12; Mean: Y = 14.00, X = 9.33; SD: Y = 3.46, X = 3.06; Sum: Y = 168, X = 112

ŷ = b₀ + b₁x
ŷ = 9.545 + 0.477(x), r = 0.42

27 Example: Fundamental Equations for Regression

ID    Y    X     Ŷ     (Yᵢ−Ȳ)² (T)  (Ŷᵢ−Ȳ)² (R)  (Yᵢ−Ŷᵢ)² (E)
 1   20   12   15.27      36.00        1.62        22.35
 2   12    7   12.89       4.00        1.24         0.79
 3   14   13   15.75       0.00        3.06         3.06
 4   17    6   12.41       9.00        2.53        21.08
 5    8    9   13.84      36.00        0.03        34.12
 6   16   12   15.27       4.00        1.62         0.53
 7   15   10   14.32       1.00        0.10         0.46
 8   13    7   12.89       1.00        1.24         0.01
 9   18   10   14.32      16.00        0.10        13.56
10   14   14   16.23       0.00        4.96         4.96
11   11    8   13.36       9.00        0.40         5.59
12   10    4   11.45      16.00        6.48         2.12

N = 12; Mean: Y 14.00, X 9.33, Ŷ 14.00; T 11.00, R 1.95, E 9.05
SD: Y 3.46, X 3.06, Ŷ 1.46; T 12.96, R 2.03, E 11.20
Sum: Y 168, X 112, Ŷ 168; SST = 132, SSR = 23, SSE = 109

R = 0.42, R² = 0.18

ŷ = 9.545 + 0.477(x)
SST = 132, df_T = 11
SSR = 23, df_R = 1
SSE = 109, df_E = 10
R² = SSR / SST = 23 / 132 = 0.18
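The sum-of-squares partition SST = SSR + SSE can be verified by fitting the least-squares line to the slide-26 data and accumulating the three sums (small differences from the hand table come from its rounded Ŷ values):

```python
# Data from slide 26
x = [12, 7, 13, 6, 9, 12, 10, 7, 10, 14, 8, 4]      # predictor (X)
y = [20, 12, 14, 17, 8, 16, 15, 13, 18, 14, 11, 10]  # outcome (Y)
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares slope and intercept
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx            # about 0.477
b0 = y_bar - b1 * x_bar   # about 9.545

y_hat = [b0 + b1 * xi for xi in x]                     # predicted values
sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained by regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # residual (error)
r2 = ssr / sst
```

The partition holds exactly: SST (132) splits into SSR (about 23) plus SSE (about 109), giving R² of about 0.18.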

28 Practice: Fundamental Equations for Regression
Complete the entries in the table below to determine SST, SSR, SSE, and R².

ID    Y    X     Ŷ     (Yᵢ−Ȳ)² (T)  (Ŷᵢ−Ȳ)² (R)  (Yᵢ−Ŷᵢ)² (E)
 1   18    8
 2   13    6
 3   11   14
 4   16    4
 5    9    8
 6   12    9
 7   12    9
 8    8   10
 9   14    8
10   12   11

N = 10; Mean: Y = 12.50, X = 8.70; Sum: Y = 125, X = 87
SST = _____, SSR = _____, SSE = _____

ŷ = 17.17247 − 0.53707(x)
SST = _____, df_T = ____
SSR = _____, df_R = ____
SSE = _____, df_E = ____
R² = SSR / SST = _______

29 Practice: Fundamental Equations for Regression

ID    Y    X     Ŷ     (Yᵢ−Ȳ)² (T)  (Ŷᵢ−Ȳ)² (R)  (Yᵢ−Ŷᵢ)² (E)
 1   18    8   12.88      30.25        0.14        26.26
 2   13    6   13.95       0.25        2.10         0.90
 3   11   14    9.65       2.25        8.10         1.81
 4   16    4   15.02      12.25        6.37         0.95
 5    9    8   12.88      12.25        0.14        15.02
 6   12    9   12.34       0.25        0.03         0.11
 7   12    9   12.34       0.25        0.03         0.11
 8    8   10   11.80      20.25        0.49        14.45
 9   14    8   12.88       2.25        0.14         1.26
10   12   11   11.26       0.25        1.53         0.54

N = 10; Mean: Y 12.50, X 8.70, Ŷ 12.50; T 8.05, R 1.91, E 6.14
Sum: Y 125, X 87, Ŷ 125; SST = 80.5, SSR = 19.1, SSE = 61.4

ŷ = 17.17247 − 0.53707(x)
SST = 80.5, df_T = 9
SSR = 19.1, df_R = 1
SSE = 61.4, df_E = 8
R² = SSR / SST = 0.24

30 SECTION 6.4 Multiple linear regression model

31 Learning Outcome: Calculate and interpret predicted values from the multiple regression model

32 Multiple Linear Regression
 Extension of simple linear regression to assess the association between 2 or more independent variables and a single continuous dependent variable.
 The multiple linear regression equation is: ŷ = b₀ + b₁x₁ + b₂x₂ + … + b_p x_p
 Each regression coefficient represents the change in y relative to a one-unit change in the respective independent variable, holding the remaining independent variables constant.
 The R² from the multiple linear regression model represents the percentage of variation in the dependent variable “explained” by the set of predictors.

33 Multiple Linear Regression
Example: Predictors of systolic blood pressure:

Independent Variable          Regression Coefficient      t      p-value
Intercept                            68.15             26.33     0.0001
BMI (per 1 unit)                      0.58             10.30     0.0001
Age (in years)                        0.65             20.22     0.0001
Male gender                           0.94              1.58     0.1133
Treatment for hypertension            6.44              9.74     0.0001

y = 68.15 + 0.58(BMI) + 0.65(age) + 0.94(male) + 6.44(tx-hypertension)

34 Practice: Estimate systolic blood pressure for the following persons:

Independent Variable                  Regression Coefficient      t      p-value
Intercept                                    68.15             26.33     0.0001
BMI (per 1 unit)                              0.58             10.30     0.0001
Age (in years)                                0.65             20.22     0.0001
Male gender (1=yes)                           0.94              1.58     0.1133
Treatment for hypertension (1=yes)            6.44              9.74     0.0001

Person 1: BMI=27.9; age=54; female; on treatment for hypertension
Person 2: BMI=34.9; age=66; male; on treatment for hypertension
Person 3: BMI=24.8; age=47; female; not on treatment for hypertension
ŷ₁ = ________
ŷ₂ = ________
ŷ₃ = ________

35 Practice: Estimate systolic blood pressure for the following persons:

Independent Variable                  Regression Coefficient      t      p-value
Intercept                                    68.15             26.33     0.0001
BMI (per 1 unit)                              0.58             10.30     0.0001
Age (in years)                                0.65             20.22     0.0001
Male gender (1=yes)                           0.94              1.58     0.1133
Treatment for hypertension (1=yes)            6.44              9.74     0.0001

Person 1: BMI=27.9; age=54; female; on treatment for hypertension
Person 2: BMI=34.9; age=66; male; on treatment for hypertension
Person 3: BMI=24.8; age=47; female; not on treatment for hypertension
ŷ₁ = 68.15 + 0.58(27.9) + 0.65(54) + 0.94(0) + 6.44(1) = 125.9
ŷ₂ = 68.15 + 0.58(34.9) + 0.65(66) + 0.94(1) + 6.44(1) = 138.7
ŷ₃ = 68.15 + 0.58(24.8) + 0.65(47) + 0.94(0) + 6.44(0) = 113.1
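The multiple-regression prediction can be written as a small function; a sketch (the function name is ours; male and tx_htn are the 0/1 indicators defined on the slide):

```python
def predict_sbp(bmi, age, male, tx_htn):
    """Predicted systolic BP from the fitted multiple regression model
    (male and tx_htn are 0/1 indicators, 1 = yes)."""
    return 68.15 + 0.58 * bmi + 0.65 * age + 0.94 * male + 6.44 * tx_htn

# The three persons above
p1 = predict_sbp(27.9, 54, male=0, tx_htn=1)
p2 = predict_sbp(34.9, 66, male=1, tx_htn=1)
p3 = predict_sbp(24.8, 47, male=0, tx_htn=0)
```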

36 Framingham Risk Calculation (10-Year Risk):
Dependent variable: 10-year risk of CVD
Independent variables: age, gender, total cholesterol, HDL cholesterol, smoker, systolic BP, on medication for BP
http://hp2010.nhlbihin.net/atpiii/calculator.asp

37 SECTION 6.5 SPSS for linear regression analysis

38 Learning Outcome: Analyze and interpret linear regression models using SPSS

39 SPSS
Analyze → Regression → Linear
Dependent Variable; Independent Variable(s)
Statistics: Estimates, Confidence intervals, Model fit, Partial correlations, Descriptives
Example:
Dependent variable: HDL Cholesterol
Independent variable: BMI

40 y = 70.141 – 0.442(BMI)

41 SPSS
Analyze → Regression → Linear
Dependent Variable; Independent Variable(s)
Statistics: Estimates, Confidence intervals, Model fit, Partial correlations, Descriptives
Example:
Dependent variable: HDL Cholesterol
Independent variable(s): BMI, gender (1=male, 2=female)

42 y = 53.494 – 0.481(BMI) + 10.663(female)

43 SPSS
Analyze → Regression → Linear
Dependent Variable; Independent Variable(s)
Statistics: Estimates, Confidence intervals, Model fit, Partial correlations, Descriptives
Example:
Dependent variable: HDL Cholesterol
Independent variable(s): BMI, gender, age

44 y = 43.026 – 0.464(BMI) + 10.735(female) + 0.166(age)

45 Practice: Estimate HDL cholesterol levels for the following persons: Person 1: BMI=25.7; female; age=60 Person 2: BMI=36.9; male; age=66 Person 3: BMI=31.8; female; age=51 y 1 = y 2 = y 3 =

46 Practice: Estimate HDL cholesterol levels for the following persons:
Person 1: BMI=25.7; female; age=60
Person 2: BMI=36.9; male; age=66
Person 3: BMI=31.8; female; age=51
ŷ₁ = 43.026 − 0.464(25.7) + 10.735(1) + 0.166(60) = 51.8
ŷ₂ = 43.026 − 0.464(36.9) + 10.735(0) + 0.166(66) = 36.9
ŷ₃ = 43.026 − 0.464(31.8) + 10.735(1) + 0.166(51) = 47.5
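These calculations can be scripted, including the recoding step: SPSS stores gender as 1=male, 2=female, while the equation uses a 0/1 “female” indicator (as in the hand calculations above, female=1, male=0). A sketch (the function name and the gender-to-indicator conversion are our illustration):

```python
def predict_hdl(bmi, gender, age):
    """Predicted HDL cholesterol from the fitted model on slide 44.
    gender uses the SPSS coding 1=male, 2=female; the 0/1 'female'
    indicator in the equation is therefore gender - 1."""
    female = gender - 1
    return 43.026 - 0.464 * bmi + 10.735 * female + 0.166 * age

# The three persons above
p1 = predict_hdl(25.7, gender=2, age=60)
p2 = predict_hdl(36.9, gender=1, age=66)
p3 = predict_hdl(31.8, gender=2, age=51)
```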

