CSC323 – Week 3 Regression line


CSC323 – Week 3: Regression line. Residual analysis and diagnostics for linear regression.

Regression line – Fitting a line to data. If the scatter plot shows a clear linear pattern, a straight line through the points can describe the overall pattern. Fitting a line means drawing a line that is as close as possible to the points: the “best” straight line is the regression line. [Scatter plot: birth rate (per 1,000 population) against log G.N.P.]

Prediction errors. For a given x, use the regression line to predict the response y. The accuracy of the prediction depends on how spread out the observations are around the line. [Diagram: at a given x, the observed value y, the predicted value on the line, and the error between them.]

Simple example: productivity level. To see how productivity was related to the level of maintenance, a firm randomly selected 5 of its high-speed machines for an experiment. Each machine was randomly assigned a different level of maintenance X (hours per week) and then had its average number of stoppages (interruptions) Y recorded. These are the data:

Hours X   Average interr. Y
4         1.6
6         1.2
8         1.1
10        0.5
12        0.6

Ave(x) = 8, s(x) = 3.16; Ave(y) = 1, s(y) = 0.45; correlation coefficient r = –0.94. [Scatter plot: number of interruptions against hours of maintenance.]

Least squares regression line. Definition: the regression line of y on x is the line that makes the sum of the squares of the vertical distances (deviations) of the data points from the line as small as possible. It is defined as

ŷ = a + b·x,  where b = r·s.d.(y)/s.d.(x) and a = ave(y) – b·ave(x).

Note: b has the same sign as r. We use ŷ to distinguish the values predicted from the regression line from the observed values y.

Example (cont.). The regression line of the number of interruptions on the hours of maintenance per week is calculated as follows. The descriptive statistics for x and y are: Ave(x) = 8, s(x) = 3.16; Ave(y) = 1, s(y) = 0.45; r = –0.94.

Slope: b = r·s(y)/s(x) = –0.94 × 0.45/3.16 ≈ –0.135
Intercept: a = ave(y) – b·ave(x) = 1 – (–0.135) × 8 = 2.08

Regression line: ŷ = 2.08 – 0.135·x = 2.08 – 0.135·hours
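
The same calculation can be checked in SAS (a minimal sketch; the dataset and variable names maint, hours and interr are ours, not part of the course material):

* enter the five machines and fit the least squares line;
data maint;
   input hours interr @@;
   datalines;
4 1.6  6 1.2  8 1.1  10 0.5  12 0.6
;

proc reg data=maint;
   model interr = hours;   * output shows intercept 2.08 and slope -0.135;
run;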

Regression line: ŷ = 2.08 – 0.135·hours. To draw the line, find two points that satisfy the regression equation and connect them: the point of averages (8, 1), and a point on the line such as (6, 1.27), found by plugging x = 6 into the regression equation: ŷ = 2.08 – 0.135·6 = 1.27. [Scatter plot: number of interruptions against hours of maintenance, with the regression line, the point of averages, and a residual marked; r = –0.94.]

Example: CPU usage. A study was conducted to examine what factors affect CPU usage. A set of 38 processes written in a programming language was considered. For each program, data were collected on the CPU usage (time), in seconds, and the number of lines (in thousands) generated by the program execution. The scatter plot of CPU usage against number of lines shows a clear positive association. We’ll fit a regression line to model the association.

The summary statistics are:

Variable   N    Mean      Std Dev   Sum         Minimum   Maximum
Y time     38   0.15710   0.13129   5.96980     0.01960   0.46780
X linet    38   3.16195   3.96094   120.15400   0.10200   14.87200

Pearson correlation coefficient = 0.89802

The regression line is ŷ ≈ 0.063 + 0.0298·lines.
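
As a check, the slope and intercept follow from the formulas b = r·s.d.(y)/s.d.(x) and a = ave(y) – b·ave(x) applied to the statistics above (a minimal SAS sketch; the dataset name cpu_line is ours):

* slope and intercept of the CPU-usage line from the summary statistics;
data cpu_line;
   r    = 0.89802;             * correlation between time and linet;
   sx   = 3.96094;  sy   = 0.13129;
   xbar = 3.16195;  ybar = 0.15710;
   b = r * sy / sx;            * slope, about 0.0298;
   a = ybar - b * xbar;        * intercept, about 0.063;
   put b= a=;
run;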

Goodness of fit measures: the coefficient of determination. R² = (correlation coefficient)² describes how good the regression line is at explaining the response y: it is the fraction of the variation in the values of y that is explained by the regression line of y on x. It varies between 0 and 1. If R² is close to 1, the regression line provides a good explanation of the data; if it is close to zero, the regression line is not able to capture the variability in the data. EXAMPLE (cont.): the correlation coefficient is r = –0.94, so R² = (–0.94)² ≈ 0.88; the regression line is able to capture about 88% of the variability in the data.

2. Residuals. A prediction error in statistics is called a residual. The vertical distances between the observed points and the regression line can be regarded as the “left-over” variation in the response after fitting the regression line. A residual is the difference between an observed value of the response variable y and the value predicted by the regression line:

residual e = observed y – predicted y = y – ŷ

A special property: the average of the residuals is always zero.

EXAMPLE: residuals for the regression line ŷ = 2.08 – 0.135·x for the number of interruptions Y on the hours of maintenance X.

Hours X   Average interr. Y   Predicted ŷ               Residual y – ŷ
4         1.6                 2.08 – 0.135·4  = 1.54    1.6 – 1.54 = 0.06
6         1.2                 2.08 – 0.135·6  = 1.27    1.2 – 1.27 = –0.07
8         1.1                 2.08 – 0.135·8  = 1.00    1.1 – 1.00 = 0.10
10        0.5                 2.08 – 0.135·10 = 0.73    0.5 – 0.73 = –0.23
12        0.6                 2.08 – 0.135·12 = 0.46    0.6 – 0.46 = 0.14
                                                        Average = 0
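
In SAS, the predicted values and residuals can be saved with an OUTPUT statement (a sketch continuing the hypothetical maint dataset above; fit, pred and resid are our names):

proc reg data=maint;
   model interr = hours;
   output out=fit p=pred r=resid;   * save predicted values and residuals;
run;

proc print data=fit;                * reproduces the table above;
run;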

3. Accuracy of the predictions. If the cloud of points is football-shaped, the prediction errors are similar along the regression line. One possible measure of the accuracy of the regression predictions is the root mean square error (r.m.s. error), the square root of the average squared residual (with the sum of squared residuals divided by n – 1, as in the example below):

r.m.s. error = sqrt( (e1² + e2² + … + en²) / (n – 1) )

This is an estimate of the variation of y about the regression line.

Roughly 68% of the points fall within 1 r.m.s. error of the regression line, and roughly 95% fall within 2 r.m.s. errors.

Computing the r.m.s. error:

Hours X   Average interr. Y   Predicted ŷ   Residual   Squared residual
4         1.6                 1.54          0.06       0.0036
6         1.2                 1.27          –0.07      0.0049
8         1.1                 1.00          0.10       0.0100
10        0.5                 0.73          –0.23      0.0530
12        0.6                 0.46          0.14       0.0196
                                            Total      0.0911

The r.m.s. error is sqrt(0.0911/4) = 0.151. If the company schedules 7 hours of maintenance per week, the predicted weekly number of interruptions of the machine will be ŷ = 2.08 – 0.135·7 = 1.135 on average. Using the r.m.s. error, the number of interruptions will most likely be between 1.135 – 2·0.151 = 0.833 and 1.135 + 2·0.151 = 1.437.
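
A sketch of the same computation in SAS, using the residuals saved in the hypothetical fit dataset above. As in the table, the sum of squares is divided by n – 1 = 4 here; note that SAS’s own Root MSE in PROC REG divides by n – 2 instead, so it would differ slightly:

* accumulate the squared residuals and take the root of their average;
data rms;
   set fit end=last;
   sse + resid**2;                    * running sum of squared residuals;
   if last then do;
      rmse = sqrt(sse / (_n_ - 1));   * sqrt(0.0911/4) = 0.151;
      put rmse=;
   end;
run;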

Looking at vertical strips. When all the vertical strips in a scatter plot show a similar amount of spread, the diagram is said to be homoscedastic. A football-shaped cloud of points is homoscedastic! Consider the data on the birth rate and the GNP index in 97 countries. [Scatter plot: birth rate (per 1,000 population) against log G.N.P., with the predicted points marked in the corresponding vertical strips.]

In a football-shaped scatter diagram, consider the points in a vertical strip. The value ŷ predicted by the regression line can be regarded as the average of their y-values, and their standard deviation is about equal to the r.m.s. error of the regression line. [Diagram: a vertical strip in the birth rate vs. log G.N.P. plot; ŷ is the average of the y-values in the strip, whose s.d. is roughly the r.m.s. error.]

Computing the r.m.s. error. In large data sets, the r.m.s. error is approximately equal to

r.m.s. error ≈ sqrt(1 – r²) · s.d.(y)

Consider the example on birth rate & GNP index:

               Average   St. dev.
Birth rate Y   29.33     13.55
Log G.N.P. X    7.51      1.65

r = –0.74. The regression line is ŷ = 74.97 – 6.077·x. For x = 8 the predicted birth rate is ŷ = 74.97 – 6.077·8 ≈ 26.35. How accurate is this prediction?

The r.m.s. error is sqrt(1 – 0.74²)·13.55 = 9.11. Thus about 68% of the countries with log GNP = 8 (about 3,000$ per capita) have a birth rate between 26.35 – 9.11 = 17.24 and 26.35 + 9.11 = 35.46. Most likely, the countries with log GNP = 8 have a birth rate between 8.13 and 44.57, since 26.35 – 2·9.11 = 8.13 and 26.35 + 2·9.11 = 44.57. [Plot: the vertical strip at log G.N.P. = 8, with 95% of its points between birth rates 8.13 and 44.57.]
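
The whole prediction can be reproduced from the summary statistics of the 97 countries (a minimal SAS sketch; the dataset name birth is ours):

* predicted birth rate at log GNP = 8, with 68% and 95% bands;
data birth;
   r = -0.74;  sx = 1.65;  sy = 13.55;
   xbar = 7.51;  ybar = 29.33;
   b = r * sy / sx;                  * slope, about -6.08;
   a = ybar - b * xbar;              * intercept, about 74.97;
   pred = a + b * 8;                 * about 26.35;
   rmse = sqrt(1 - r**2) * sy;       * about 9.11;
   lo68 = pred - rmse;    hi68 = pred + rmse;
   lo95 = pred - 2*rmse;  hi95 = pred + 2*rmse;
   put pred= rmse= lo68= hi68= lo95= hi95=;
run;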

Detecting problems in the regression analysis: residual plots. The analysis of the residuals is useful to detect possible problems and anomalies in the regression. A residual plot is a scatter plot of the regression residuals against the explanatory variable. Points should be randomly scattered inside a band centered around the horizontal line at zero (the mean of the residuals).

[Residual plots: a “good case”, with residuals randomly scattered around zero, and two “bad cases”: a curved pattern (non-linear relationship) and a fan-shaped pattern (variation of y changing with x).]

Anomalies in the regression analysis. If the residual plot displays a curve, the straight line is not a good description of the association between x and y. If the residual plot is fan-shaped, the variation of y is not constant. In the figure above, predictions of y will be less precise as x increases, since y shows higher variability for higher values of x. Be careful if you use the r.m.s. error in that case!

Example: residual plot for the CPU usage data. Do you see any striking pattern?

Example: 100-meter dash. At the 1987 World Championship in Rome, Ben Johnson set a new world record in the 100-meter dash. The data: Y = the elapsed time from the start of the race for Ben Johnson, recorded in 10-meter increments; X = meters.

           Meters   Johnson’s time
Average    55       5.83
St. dev.   30.27    2.52

Correlation = 0.999. [Scatter plot: elapsed time against meters for Johnson’s times.]

Regression line. The fitted regression line is ŷ = 1.11 + 0.09·meters. The value of R² is 0.999, therefore 99.9% of the variability in the data is explained by the regression line. [Scatter plot: elapsed time against meters, with the fitted line.]

Residual plot: residuals against meters. Does the graph show any anomaly?

Confounding factor. A confounding factor is a variable that has an important effect on the relationship among the variables in a study but is not included in the study. Example: the mathematics department of a large university must plan the timetable for the following year. Data are collected on the enrollment year, the number x of first-year students, and the number y of students enrolled in elementary math courses. The fitted regression line has equation ŷ = 2491.69 + 1.0663·x, with R² = 0.694.

Residual analysis. Do the residuals have a random pattern?

Scatter plot of residuals vs. year (enrollment years 1991–1997). The plot of the residuals against the year suggests that a change took place between 1994 and 1995 that caused a higher number of students to take math courses (one school changed its curriculum).

Outliers and influential points. An outlier is an observation that lies outside the overall pattern of the other observations; it produces a large residual. [Scatter plot: an outlier far from the regression line.]

Influential point. An observation is influential for the regression line if removing it would change the fitted line considerably. An influential point pulls the regression line towards itself. [Scatter plot: an influential point, together with the regression line that would result if it were omitted.]

Example: house prices in Albuquerque. Scatter plot of annual tax against selling price; the coefficient of determination is R² = 0.4274. What does the value of R² say?

Diagnostic plots: residual plot (residuals against price), studentized residuals against price, and DFFITS plot (DFFITS against price). Are there any possible outliers and/or influential points?

New analysis: omitting the influential points. The regression line is ŷ = –55.364 + 0.8483·price, and the coefficient of determination is R² = 0.8273: the new regression line explains about 83% of the variation in y. [Scatter plot: annual tax against selling price, with the new and the previous regression lines.]

Extrapolation. Extrapolation is the use of a regression equation to predict values outside the range of the observed data. This is dangerous and often inappropriate, and may produce unreasonable answers. Example: a linear model that relates weight gain to age for young children; applying such a model to adults, or even teenagers, would be absurd. In the example on the selling price of houses, the regression line should not be used to predict the annual taxes for expensive houses that cost over 500,000 dollars.

Summary – Warnings. Correlation measures linear association; the regression line should be used only when the association is linear. Extrapolation: do not use the regression line to predict values outside the observed range, since such predictions are not reliable. Correlation and the regression line are sensitive to influential / extreme points. Check residual plots to detect anomalies and “hidden” patterns that are not captured by the regression line.

Example of regression analysis: the Leaning Tower of Pisa. Response variable: the lean (Y), the distance between where a point at the top of the tower is and where it would be if the tower were straight; the units for the lean are tenths of a millimeter above 2.9 meters. Explanatory variable: time (X), the years 1975–1987. Steps of our analysis: plot the data, fit a line, predict the future lean.

Regression line. The equation of the regression line is ŷ = –61.12 + 9.32·year.

[Diagnostic plots: residual plot and normal probability plot.]

Obs   Year   Lean    Predicted   Residual
1     75     642.0   637.8        4.2198
2     76     644.0   647.1       -3.0989
3     77     656.0   656.4       -0.4176
4     78     667.0   665.7        1.2637
5     79     673.0   675.1       -2.0549
6     80     688.0   684.4        3.6264
7     81     696.0   693.7        2.3077
8     82     698.0   703.0       -5.0110
9     83     713.0   712.3        0.6703
10    84     717.0   721.6       -4.6484
11    85     725.0   731.0       -5.9670
12    86     742.0   740.3        1.7143
13    87     757.0   749.6        7.3956

Prediction in 2002:
14    102    .       889.4
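
One way to obtain that 2002 value in SAS (a sketch; pisa2, pred02 and plean are our names): append a row with year = 102 (2002 in the deck’s coding) and a missing lean. PROC REG leaves observations with a missing response out of the fit but still computes their predicted value.

* add the year 2002 (coded 102) with an unknown lean;
data pisa2;
   set pisa end=last;
   output;
   if last then do;
      year = 102;  lean = .;
      output;
   end;
run;

proc reg data=pisa2;
   model lean = year;
   output out=pred02 p=plean;   * plean is 889.4 for year=102;
run;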

PROC REG in SAS

PROC REG;
   MODEL yvar = xvar1;
   PLOT yvar*xvar1 / nostat;                    /* scatter plot and regression line */
   PLOT residual.*xvar1 residual.*predicted.;   /* residual plots */
   PLOT npp.*residual.;                         /* normal probability plot for the residuals */
   PLOT yvar*xvar1 / PRED;                      /* scatter plot with upper and lower prediction bounds */
RUN;

The option nostat in the PLOT statement eliminates the equation of the regression line that is displayed in the regression plot. The option lineprinter produces line-printer plots (low-level graphics).

SAS data step and PROC GPLOT & REG for the Pisa data:

data pisa;
   input year lean @@;
   datalines;
75 642  76 644  77 656  78 667  79 673  80 688  81 696
82 698  83 713  84 717  85 725  86 742  87 757
;

proc gplot;
   plot lean*year;

proc reg;
   model lean=year;
   plot lean*year/pred;
   plot residual.*year;
   plot npp.*residual.;
run;

The REG Procedure
Dependent Variable: lean (lean in tenths of a millimeter)

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1   15804            15804         904.12    <.0001
Error             11   192.28571        17.48052
Corrected Total   12   15997

Root MSE          4.18097    R-Square   0.9880
Dependent Mean  693.69231    Adj R-Sq   0.9869
Coeff Var         0.60271

Parameter Estimates

Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept    1   -61.12088            25.12982         -2.43     0.0333
year                     1     9.31868             0.30991         30.07     <.0001