4/9/2005 11:38 AM Department of Epidemiology and Health Statistics,Tongji Medical College (Dr. Chuanhua Yu)http://statdtedm.6to23.


1 Principles of Biostatistics, Chapter 18: Simple Linear Regression. 宇传华 (Chuanhua Yu)

2 Terminology
Linear regression (线性回归)
Response (dependent) variable (反应(应)变量)
Explanatory (independent) variable (解释(自)变量)
Linear regression model (线性回归模型)
Regression coefficient (回归系数)
Slope (斜率)
Intercept (截距)
Method of least squares (最小二乘法)
Error sum of squares, or residual sum of squares (残差(剩余)平方和)
Coefficient of determination (决定系数)
Outlier (异常点(值))
Homoscedasticity (方差齐同)
Heteroscedasticity (方差非齐同)

3 Contents
18.1 An example
18.2 The Simple Linear Regression Model
18.3 Estimation: The Method of Least Squares
18.4 Error Variance and the Standard Errors of Regression Estimators
18.5 Confidence Intervals for the Regression Parameters
18.6 Hypothesis Tests about the Regression Relationship
18.7 How Good is the Regression?
18.8 Analysis of Variance Table and an F Test of the Regression Model
18.9 Residual Analysis
18.10 Prediction Interval and Confidence Interval

4 An example
Table 18.1 IL-6 levels in brain and serum (pg/ml) of 10 patients with subarachnoid hemorrhage
Columns: Patient i | Serum IL-6 (pg/ml), x | Brain IL-6 (pg/ml), y

5 Scatterplot
This scatterplot locates pairs of observations of serum IL-6 on the x-axis and brain IL-6 on the y-axis. We notice that:
- Larger (smaller) values of brain IL-6 tend to be associated with larger (smaller) values of serum IL-6.
- The scatter of points tends to be distributed around a positively sloped straight line.
- The pairs of values of serum IL-6 and brain IL-6 are not located exactly on a straight line.
- The scatterplot reveals a more or less strong tendency rather than a precise linear relationship. The line represents the nature of the relationship on average.

6 Examples of Other Scatterplots
[Figure: six scatterplot panels, each with axes X and Y]

7 Model Building
The inexact nature of the relationship between serum and brain IL-6 suggests that a statistical model might be useful in analyzing the relationship. A statistical model separates the systematic component of a relationship from the random component.
Data → Statistical model: Systematic component + Random errors
In ANOVA, the systematic component is the variation of means between samples or treatments (SSTR) and the random component is the unexplained variation (SSE). In regression, the systematic component is the overall linear relationship, and the random component is the variation around the line.

8 The Simple Linear Regression Model
The population simple linear regression model:
$y = \alpha + \beta x + \varepsilon$, or equivalently $\mu_{y|x} = \alpha + \beta x$
Here $\alpha + \beta x$ is the nonrandom (systematic) component and $\varepsilon$ is the random component, where:
- y is the dependent (response) variable, the variable we wish to explain or predict;
- x is the independent (explanatory) variable, also called the predictor variable;
- $\varepsilon$ is the error term, the only random component in the model and thus the only source of randomness in y;
- $\mu_{y|x}$ is the mean of y when x is specified, also called the conditional mean of Y;
- $\alpha$ is the intercept of the systematic component of the regression relationship;
- $\beta$ is the slope of the systematic component.
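To make the split between the systematic and random components concrete, the model can be simulated. This is only a sketch: the function name `simulate` and the parameter values (intercept 2.0, slope 0.5, error standard deviation 1.0) are hypothetical and are not taken from the slides.

```python
import random

def simulate(alpha, beta, sigma, xs, seed=0):
    """Draw y = alpha + beta*x + eps, with eps ~ N(0, sigma^2), for fixed xs."""
    rng = random.Random(seed)
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]

# Hypothetical values chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
conditional_means = [2.0 + 0.5 * x for x in xs]   # systematic component
ys = simulate(2.0, 0.5, 1.0, xs)                   # systematic + random error
noiseless = simulate(2.0, 0.5, 0.0, xs)            # sigma = 0 removes the error term
```

With `sigma` set to 0 the simulated values coincide with the conditional means, which reflects the statement that the error term is the only source of randomness in y.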

9 Picturing the Simple Linear Regression Model
The simple linear regression model posits an exact linear relationship between the expected or average value of Y, the dependent variable, and X, the independent or predictor variable:
$\mu_{y|x} = \alpha + \beta x$
Actual observed values of Y (y) differ from the expected value ($\mu_{y|x}$) by an unexplained or random error ($\varepsilon$):
$y = \mu_{y|x} + \varepsilon = \alpha + \beta x + \varepsilon$
[Figure: regression plot of Y against X showing the line $\mu_{y|x} = \alpha + \beta x$, with $\alpha$ marked as the intercept, $\beta$ as the slope (rise per unit of x), and $\varepsilon$ as the error of an observed y]

10 Assumptions of the Simple Linear Regression Model
The LINE assumptions:
- Linear: the relationship between X and Y is a straight-line relationship.
- Independent: the errors $\varepsilon$ are uncorrelated (i.e. independent) in successive observations.
- Normal: the errors $\varepsilon$ are normally distributed with mean 0 and variance $\sigma^2$.
- Equal variance: the variance $\sigma^2$ is the same at every x. That is, $\varepsilon \sim N(0, \sigma^2)$.
The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term $\varepsilon$.
[Figure: identical normal distributions of errors, $N(\mu_{y|x}, \sigma_{y|x}^2)$, all centered on the regression line $\mu_{y|x} = \alpha + \beta x$]

11 Estimation: The Method of Least Squares
Estimation of a simple linear regression relationship involves finding estimated or predicted values of the intercept and slope of the linear regression line.
The estimated regression equation:
$y = a + bx + e$
where a estimates the intercept of the population regression line, $\alpha$; b estimates the slope of the population regression line, $\beta$; and e stands for the observed errors, that is, the residuals from fitting the estimated regression line $a + bx$ to a set of n points.
The estimated regression line:
$\hat{y} = a + bx$
where $\hat{y}$ (y-hat) is the value of Y lying on the fitted regression line for a given value of X.

12 Fitting a Regression Line
[Figure: four panels with axes X and Y: Data; Three errors from a fitted line; Three errors from the least squares regression line; Errors from the least squares regression line are minimized]

13 Errors in Regression
[Figure: an observed point $(x_i, y_i)$ and its vertical distance from the fitted line, axes X and Y]

14 Least Squares Regression
The sum of squared errors in regression (SSE, the residual sum of squares) is:
$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
The least squares regression line is that which minimizes the SSE with respect to the estimates a and b.
[Figure: SSE plotted as a surface over a and b, with its minimum at the least squares a and least squares b]
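The minimizing property can be checked numerically. The sketch below uses a small hypothetical data set (not the IL-6 data of Table 18.1) whose least squares line, from the standard closed-form estimators, is a = 2.2 and b = 0.6. The function name `sse` is illustrative.

```python
def sse(a, b, xs, ys):
    """Sum of squared errors of the line y_hat = a + b*x over the data."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, chosen so the arithmetic is easy to follow by hand.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

sse_least_squares = sse(2.2, 0.6, xs, ys)   # the least squares line for these data
sse_other_line = sse(2.0, 0.7, xs, ys)      # an arbitrary competing line
```

For these data the least squares line gives an SSE of about 2.4 and the competing line about 2.55. Any other choice of a and b also gives a larger SSE.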

15 Sums of Squares, Cross Products, and Least Squares Estimators
The least squares regression line always passes through the point of means $(\bar{x}, \bar{y})$.
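The formulas on this slide did not survive in the transcript. For reference, the standard textbook definitions are given below. The symbols $S_{xx}$, $S_{yy}$ and $S_{xy}$ are an assumed notation and may differ from the one used on the original slide.

```latex
\begin{aligned}
S_{xx} &= \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \\
b &= \frac{S_{xy}}{S_{xx}}, \qquad a = \bar{y} - b\,\bar{x} \\
\hat{y}\,\big|_{x=\bar{x}} &= a + b\,\bar{x} = \bar{y}
\end{aligned}
```

The last line shows why the fitted line passes through the point of means: substituting $a = \bar{y} - b\bar{x}$ into $\hat{y} = a + bx$ at $x = \bar{x}$ gives $\bar{y}$.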

16 Example 18-1

17 Example 18-1: Using Computer-Excel
The results at the bottom are the output created by selecting the REGRESSION option from the DATA ANALYSIS toolkit.
After a full installation of Office, the "Data Analysis" add-in can be installed by clicking "Tools" → "Add-Ins" in the menu.
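A fit of this kind can also be sketched outside Excel. The code below applies the standard closed-form least squares estimators to a small hypothetical data set. The values of Table 18.1 are not preserved in this transcript, so the numbers here are not those of Example 18-1, and the function name `fit_line` is illustrative.

```python
def fit_line(xs, ys):
    """Return the least squares intercept a and slope b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)                        # sum of squares of x
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))   # cross products
    b = s_xy / s_xx            # slope
    a = y_bar - b * x_bar      # intercept: forces the line through (x_bar, y_bar)
    return a, b

# Hypothetical data, not the IL-6 measurements.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)
y_hat_at_mean = a + b * (sum(xs) / len(xs))
```

Here the fitted line is approximately y-hat = 2.2 + 0.6x, and its value at the mean of x equals the mean of y.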

18 Total Variance and Error Variance
[Figure, left panel: what you see when looking at the total variation of Y. Right panel: what you see when looking along the regression line at the error variance of Y.]

19 Error Variance and the Standard Errors of Regression Estimators
[Figure: fitted line with residuals, axes X and Y. Square and sum all regression errors to find SSE.]

20 Standard Errors of Estimates in Regression
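The slide's formulas are not preserved here. As a sketch, the usual textbook estimators are: error variance s² = SSE / (n − 2), standard error of the slope s_b = s / sqrt(S_xx), and standard error of the intercept s_a = s · sqrt(1/n + x̄² / S_xx). The function name and data below are hypothetical.

```python
import math

def regression_standard_errors(xs, ys):
    """Return (s2, s_a, s_b) for the least squares line through the data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s2 = sse / (n - 2)                                  # error variance estimate (MSE)
    s_a = math.sqrt(s2 * (1 / n + x_bar ** 2 / s_xx))   # standard error of intercept
    s_b = math.sqrt(s2 / s_xx)                          # standard error of slope
    return s2, s_a, s_b

# Hypothetical data, not the IL-6 measurements.
s2, s_a, s_b = regression_standard_errors([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```

The divisor n − 2 reflects the two parameters (a and b) estimated from the data.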

21 Confidence Intervals for the Regression Parameters
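A confidence interval for the slope usually takes the form b ± t · s_b, where t is the critical value of the t distribution with n − 2 degrees of freedom. The sketch below assumes that form. The critical value is passed in by the caller, and 3.182 is the two-sided 5% value for 3 degrees of freedom from standard t tables. The function name and data are hypothetical.

```python
import math

def slope_confidence_interval(xs, ys, t_crit):
    """Return (lower, upper) limits of b +/- t_crit * s_b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s_b = math.sqrt(sse / (n - 2) / s_xx)   # standard error of the slope
    half_width = t_crit * s_b
    return b - half_width, b + half_width

# Hypothetical data with n = 5, so n - 2 = 3 degrees of freedom.
lower, upper = slope_confidence_interval([1, 2, 3, 4, 5], [2, 4, 5, 4, 5], t_crit=3.182)
```

For these data the 95% interval is roughly (−0.30, 1.50). It contains 0, so the data are compatible with a zero slope at the 5% level.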

22 18.6 Hypothesis Tests about the Regression Relationship
[Figure: three panels with axes X and Y, titled Constant Y, Unsystematic Variation, and Nonlinear Relationship]
A hypothesis test for the existence of a linear relationship between X and Y:
$H_0: \beta = 0$
$H_1: \beta \neq 0$
Test statistic for the existence of a linear relationship between X and Y:
$t = \dfrac{b}{s_b}$
where b is the least-squares estimate of the regression slope and $s_b$ is the standard error of b. When the null hypothesis is true, the statistic has a t distribution with $n - 2$ degrees of freedom.

23 Hypothesis Tests for the Regression Slope
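The slope test can be sketched as follows. The data are hypothetical, the function name is illustrative, and the critical value 3.182 (two-sided 5%, 3 degrees of freedom) is taken from standard t tables rather than computed.

```python
import math

def slope_t_statistic(xs, ys):
    """Return t = b / s_b for testing H0: beta = 0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s_b = math.sqrt(sse / (n - 2) / s_xx)
    return b / s_b

# Hypothetical data with n = 5, so the reference distribution is t with 3 df.
t_stat = slope_t_statistic([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
reject_h0 = abs(t_stat) > 3.182   # two-sided test at the 5% level
```

Here t is about 2.12, which does not exceed 3.182, so the null hypothesis of a zero slope is not rejected for this small sample.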

24 How Good is the Regression?
The coefficient of determination, $R^2$, is a descriptive measure of the strength of the regression relationship, a measure of how well the regression line fits the data.
$R^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST}$: the percentage of total variation explained by the regression.
[Figure: for one observation, the total deviation split into explained deviation and unexplained deviation, axes X and Y]

25 The Coefficient of Determination
[Figure: three panels with axes X and Y. $R^2 = 0$: SST consists entirely of SSE. $R^2 = 0.90$: SST split into a large SSR and a small SSE. $R^2 = 0.50$: SST split equally into SSR and SSE.]
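The decomposition behind R² can be sketched numerically: the total sum of squares (SST) splits into a regression part (SSR) and an error part (SSE). The data and the function name below are hypothetical.

```python
def sums_of_squares(xs, ys):
    """Return (sst, ssr, sse, r2) for the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    fitted = [a + b * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)                  # total deviation
    ssr = sum((f - y_bar) ** 2 for f in fitted)              # explained deviation
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))      # unexplained deviation
    return sst, ssr, sse, ssr / sst

# Hypothetical data, not the IL-6 measurements.
sst, ssr, sse, r2 = sums_of_squares([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```

For these data SST = 6, SSR is about 3.6 and SSE about 2.4, so R² is about 0.6: roughly 60% of the variation in y is explained by the line.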

26 Analysis of Variance Table and an F Test of the Regression Model
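The table itself is not preserved in this transcript. In the usual layout for simple linear regression, the regression row has 1 degree of freedom, the error row has n − 2, and F = MSR / MSE. The sketch below assumes that layout and uses hypothetical data. It also computes the slope t statistic, since in simple linear regression F equals t².

```python
import math

def anova_f(xs, ys):
    """Return (msr, mse, f, t) where f = MSR / MSE and t = b / s_b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    fitted = [a + b * x for x in xs]
    ssr = sum((f - y_bar) ** 2 for f in fitted)
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    msr = ssr / 1            # regression mean square, 1 degree of freedom
    mse = sse / (n - 2)      # error mean square, n - 2 degrees of freedom
    t = b / math.sqrt(mse / s_xx)
    return msr, mse, msr / mse, t

# Hypothetical data, not the IL-6 measurements.
msr, mse, f_stat, t_stat = anova_f([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```

For these data F is about 4.5, which matches the square of the slope t statistic (about 2.12), so the F test and the two-sided t test of the slope agree.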

27 Residual Analysis

28 Example 18-1: Using Computer-Excel
Residual analysis. The plot shows a curved relationship between the residuals and the X-values (serum IL-6).
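Residuals for such a plot can be computed as in the sketch below, which uses hypothetical data rather than the IL-6 values. The function name is illustrative.

```python
def residuals(xs, ys):
    """Return the residuals e_i = y_i - (a + b*x_i) of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical data, not the IL-6 measurements.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
res = residuals(xs, ys)
residual_sum = sum(res)                                   # zero for a least squares fit
residual_x_sum = sum(e * x for e, x in zip(res, xs))      # also zero
```

Least squares residuals always sum to zero and are uncorrelated with x, so a residual plot cannot show a linear trend. What it can reveal is structure the line missed, such as the curved pattern noted on the slide or a spread that changes with x.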

29 Prediction Interval and Confidence Interval
Point prediction: a single-valued estimate of Y for a given value of X, obtained by inserting the value of X in the estimated regression equation.
Prediction interval, for a value of Y given a value of X, reflects:
- variation in the regression line estimate;
- variation of points around the regression line.
Confidence interval, for an average value of Y given a value of X, reflects:
- variation in the regression line estimate.

30 Prediction Interval for a Value of Y

31 Confidence Interval for the Average Value of Y

32 Confidence Interval for the Average Value of Y and Prediction Interval for the Individual Value of Y
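The two intervals can be compared numerically. The sketch assumes the usual textbook half-widths at a point x0: t · s · sqrt(1/n + (x0 − x̄)² / S_xx) for the mean of Y, and t · s · sqrt(1 + 1/n + (x0 − x̄)² / S_xx) for an individual Y. The data, the function name and the tabulated critical value 3.182 (two-sided 5%, 3 degrees of freedom) are illustrative.

```python
import math

def interval_half_widths(xs, ys, x0, t_crit):
    """Return (y_hat, ci_half, pi_half) at x0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    line_term = 1 / n + (x0 - x_bar) ** 2 / s_xx      # variation in the line estimate
    ci_half = t_crit * s * math.sqrt(line_term)       # for the average value of Y
    pi_half = t_crit * s * math.sqrt(1 + line_term)   # adds scatter around the line
    return a + b * x0, ci_half, pi_half

# Hypothetical data, not the IL-6 measurements.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
y_hat, ci_half, pi_half = interval_half_widths(xs, ys, x0=3, t_crit=3.182)
_, ci_half_edge, _ = interval_half_widths(xs, ys, x0=5, t_crit=3.182)
```

The prediction interval is always wider than the confidence interval at the same x0, and both widen as x0 moves away from the mean of x.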

33 Summary
1. Regression analysis is used for prediction and for controlling the effect of the independent variable X.
2. The principle of least squares for solving the regression parameters is to minimize the residual sum of squares.
3. The coefficient of determination, $R^2$, is a descriptive measure of the strength of the regression relationship.
4. There are two confidence bands: one for mean predictions and the other for individual prediction values.
5. Residual analysis is used to check the goodness of fit of models.

34 Assignments
1. What are the main distinctions and associations between correlation analysis and simple linear regression?
2. What is the least squares method for estimating a regression line?
3. Please describe the main steps for fitting a simple linear regression model to data.

35 Main distinctions
1. Data source: correlation analysis requires that both x and y follow a normal distribution; simple linear regression requires only that y follow a normal distribution.
2. Application: correlation analysis is employed to measure the association between two random variables (x and y are treated symmetrically); simple linear regression is employed to measure the change in y for a given change in x (x is the independent variable, y is the dependent variable).
3. Units: r is a dimensionless number with no unit of measurement; b has units, namely units of y per unit of x.

36 Main associations
1. $t_r = t_b$: the t statistic for testing the correlation coefficient equals the t statistic for testing the slope.
2. r and b have the same sign.
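Both associations can be verified numerically. The sketch uses the standard formula t_r = r · sqrt(n − 2) / sqrt(1 − r²) for the correlation test and t_b = b / s_b for the slope test. The data and the function name are hypothetical.

```python
import math

def correlation_and_slope_tests(xs, ys):
    """Return (r, b, t_r, t_b) for the same data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    r = s_xy / math.sqrt(s_xx * s_yy)                  # correlation coefficient
    b = s_xy / s_xx                                    # regression slope
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    t_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    t_b = b / math.sqrt(sse / (n - 2) / s_xx)
    return r, b, t_r, t_b

# Hypothetical data, not the IL-6 measurements.
r, b, t_r, t_b = correlation_and_slope_tests([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```

For these data r is about 0.77 and b is about 0.6, both positive, and the two t statistics are both about 2.12. The shared sign follows from r and b having the same numerator, S_xy.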

